# EDS 232 Discussion Week 1

Date: 2025/01/09
Jordan Sibley 

### Data Loading 

In [10]:
# Import packages 
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact, FloatSlider
from IPython.display import display, clear_output

# Load the data 
file_path = 'data/Hurricane_Irene.xlsx'
do_data = pd.read_excel(file_path, sheet_name=5).drop(['Piermont D.O. (ppm)'], axis = 1)
rainfall_data = pd.read_excel(file_path, sheet_name='Rainfall').drop(['Piermont  Rainfall Daily Accumulation (Inches)'], axis=1)
turbidity_data = pd.read_excel(file_path, sheet_name='Turbidity').drop(['Piermont Turbidity in NTU'], axis=1)

### Data Cleaning 

Perform the following data wrangling steps to get our data ready for our model.

1. Merge the three dataframes together. While merging, or after, drop all columns for the Piermont location.
2. Update the column names to be shorter and not have spaces. Use snake case.
3. Make your date column a datetime object.
4. Set the date as the index for the merged dataframe.


In [2]:
# Merge the three datasets 

data = do_data.merge(rainfall_data, on = 'Date Time (ET)')
data = data.merge(turbidity_data, on = 'Date Time (ET)')
data.head()

| | Date Time (ET) | Port of Albany D.O. (ppm) | Norrie Point D.O. (ppm) | Port of Albany Rainfall Daily Accumulation (Inches) | Norrie Point Rainfall Daily Accumulation (Inches) | Port of Albany Turbidity in NTU | Norrie Point Turbidity in NTU |
|---|---|---|---|---|---|---|---|
| 0 | 2011-08-25 00:00:00 | 7.68 | 7.81 | 0.0 | 0.0 | 4.0 | 9.3 |
| 1 | 2011-08-25 00:15:00 | 7.6 | 7.73 | 0.0 | 0.0 | 3.9 | 8.4 |
| 2 | 2011-08-25 00:30:00 | 7.57 | 7.63 | 0.0 | 0.0 | 4.3 | 7.9 |
| 3 | 2011-08-25 00:45:00 | 7.72 | 7.67 | 0.0 | 0.0 | 4.7 | 8.1 |
| 4 | 2011-08-25 01:00:00 | 7.74 | 7.63 | 0.0 | 0.0 | 4.4 | 8.4 |
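The merge step can be made a little more defensive. The following is a minimal sketch on made-up toy frames (the names `do_toy`, `rain_toy`, `turb_toy` and their short column names are hypothetical, not part of the notebook). It assumes each sheet has exactly one row per timestamp, and uses the `validate` argument of `DataFrame.merge` so that pandas raises an error if that assumption does not hold.

```python
import pandas as pd

# Toy frames standing in for the three sheets (values are illustrative only)
times = pd.to_datetime(["2011-08-25 00:00", "2011-08-25 00:15"])
do_toy = pd.DataFrame({"Date Time (ET)": times, "do": [7.68, 7.60]})
rain_toy = pd.DataFrame({"Date Time (ET)": times, "rain": [0.0, 0.0]})
turb_toy = pd.DataFrame({"Date Time (ET)": times, "turb": [4.0, 3.9]})

# Inner-join on the shared timestamp column; validate="one_to_one" makes
# pandas raise a MergeError if a timestamp is duplicated in either frame
merged = (
    do_toy
    .merge(rain_toy, on="Date Time (ET)", how="inner", validate="one_to_one")
    .merge(turb_toy, on="Date Time (ET)", how="inner", validate="one_to_one")
)
```

With the default inner join, timestamps missing from any one sheet are silently dropped, so comparing `len(merged)` to the length of each input is a quick sanity check.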


In [3]:
# Update the column names
data.columns = ['date', 'albany_do', 'norrie_do', 'albany_rainfall', 'norrie_rainfall', 'albany_turbidity', 'norrie_turbidity']

# Convert the date column to datetime format
data['date'] = pd.to_datetime(data['date'])

# Set the date as the index 
data.set_index('date', inplace = True)

In [4]:
# Check our work 
data.head()

| date | albany_do | norrie_do | albany_rainfall | norrie_rainfall | albany_turbidity | norrie_turbidity |
|---|---|---|---|---|---|---|
| 2011-08-25 00:00:00 | 7.68 | 7.81 | 0.0 | 0.0 | 4.0 | 9.3 |
| 2011-08-25 00:15:00 | 7.6 | 7.73 | 0.0 | 0.0 | 3.9 | 8.4 |
| 2011-08-25 00:30:00 | 7.57 | 7.63 | 0.0 | 0.0 | 4.3 | 7.9 |
| 2011-08-25 00:45:00 | 7.72 | 7.67 | 0.0 | 0.0 | 4.7 | 8.1 |
| 2011-08-25 01:00:00 | 7.74 | 7.63 | 0.0 | 0.0 | 4.4 | 8.4 |


## Multiple regression 

Now that our data is cleaned, let’s do the following to carry out a multiple linear regression.

1. Define your predictors and target variables.
2. Split the data into training and testing sets.
3. Create and fit the model.
4. Predict and evaluate your model.

In [5]:
# Define predictors and the target variable 
X = data[['albany_do', 'albany_rainfall']]
y = data[['albany_turbidity']]

# Split the data into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)


In [6]:
# Create and fit the model 
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate 
y_pred = model.predict(X_test)

print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred))}")
print(f"R-squared: {r2_score(y_test, y_pred)}")

RMSE: 221.9143474905527
R-squared: 0.4907389518457509
R-squared: 0.4907389518457509


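RMSE is expressed in the units of the target variable, so its size only means something relative to those units. Here the predicted turbidity is typically off from the observed value by about 221.9 NTU.

An R-squared of 0.49 means the model explains only about half of the variance in turbidity in the test set, so it is not a great model.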
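To make the two metrics concrete, here is a small sketch that computes them by hand with NumPy. The observed values are the first five `albany_turbidity` readings shown earlier. The predicted values in `y_hat` are hypothetical numbers chosen for illustration, not output from the fitted model.

```python
import numpy as np

# Observed turbidity (NTU) and hypothetical predictions, for illustration only
y_true = np.array([4.0, 3.9, 4.3, 4.7, 4.4])
y_hat = np.array([4.2, 4.0, 4.1, 4.5, 4.6])

# RMSE: square root of the mean squared residual, in the units of y
residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}")
print(f"R-squared: {r2:.4f}")
```

These are the same quantities that `np.sqrt(mean_squared_error(...))` and `r2_score(...)` return. Because RMSE carries the units of the target, it is best judged against the typical range of the target itself.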

### Create a Widget for updating the predictor and target variables.

1. Create the four different pieces to the widget: the predictor selector, the target selector, the evaluate button, and the output
2. Wrap our workflow into a function called evaluate_model(). This function will run a linear regression model based on what the user selects as predictors and the outcome variable. It will print the R squared, MSE, and a scatterplot of the actual versus predicted target variable.
3. Create a warning for your widget to ensure that the user does not select the same variable as both a predictor variable and a target variable.
4. Play around with your widget and see how your R squared changes based on your selected variables!

In [9]:
# Create a widget for selecting predictors
predictor_selector = widgets.SelectMultiple(
    options = data.columns,
    value = (data.columns[0],),
    description = 'Predictors'
)

# Create a dropdown for selecting the target variable
target_selector = widgets.Dropdown(
    options = data.columns,
    value = data.columns[1],
    description = 'Target'
)

# Button to evaluate the model
evaluate_button = widgets.Button(description = 'Evaluate Model')

# Output widget to display results
output = widgets.Output()

# Define the function to handle button clicks
def evaluate_model(b):
    with output:
        clear_output(wait=True)
        
        # Make sure target is not in the predictors
        selected_predictors = list(predictor_selector.value)
        if target_selector.value in selected_predictors:
            print("Target variable must not be in the predictors.")
            return 
        
        # Prepare the data
        X = data[selected_predictors]
        y = data[target_selector.value]
        
        # Split data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
        
        # Create and fit model
        model = LinearRegression()
        model.fit(X_train, y_train)
        
        # Predict and calculate R2 and MSE
        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        
        # Display the R2 score and MSE
        print(f"R2: {r2:.4f}")
        print(f"MSE: {mse:.4f}")
        
# Display the widget
display(predictor_selector, target_selector, evaluate_button, output)
evaluate_button.on_click(evaluate_model)

SelectMultiple(description='Predictors', index=(0,), options=('albany_do', 'norrie_do', 'albany_rainfall', 'no…

Dropdown(description='Target', index=1, options=('albany_do', 'norrie_do', 'albany_rainfall', 'norrie_rainfall…

Button(description='Evaluate Model', style=ButtonStyle())

Output()
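
The instructions above also call for a scatterplot of actual versus predicted values, which the cell does not draw. One way to add it is sketched below. The helper name `plot_actual_vs_predicted` is hypothetical, and the example call uses made-up numbers rather than model output.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_actual, y_predicted, ax=None):
    """Scatter actual against predicted values with a dashed 1:1 reference line."""
    # Flatten so the helper accepts Series, 1-D arrays, or (n, 1) arrays
    y_actual = np.asarray(y_actual).ravel()
    y_predicted = np.asarray(y_predicted).ravel()

    if ax is None:
        _, ax = plt.subplots()

    ax.scatter(y_actual, y_predicted, alpha=0.6)

    # Points on the dashed line are perfect predictions
    lo = min(y_actual.min(), y_predicted.min())
    hi = max(y_actual.max(), y_predicted.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")

    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    ax.set_title("Actual vs. predicted")
    return ax

# Illustrative call with made-up values
ax = plot_actual_vs_predicted([4.0, 3.9, 4.3], [4.2, 4.0, 4.1])
```

Inside `evaluate_model()`, this could be called after the metrics are printed, passing the test-set target and `y_pred`, followed by `plt.show()` so the figure renders within the `output` widget.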