# Creating Widgets for Multiple Linear Regression

Thursday, January 9th, 2025

[Link to Discussion](https://maro406.github.io/eds-232-machine-learning/discussion/week1.html)

## Set Up

### About the data:

### Purpose:

### Import packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact, FloatSlider
from IPython.display import display, clear_output

### Load data

In [2]:
# Create filepath to data
file_path = 'data/hurricane.xlsx'

# Load in DO, rainfall, and turbidity data
do_data = pd.read_excel(file_path, sheet_name=5).drop(['Piermont D.O. (ppm)'], axis = 1)
rainfall_data = pd.read_excel(file_path, sheet_name='Rainfall').drop(['Piermont  Rainfall Daily Accumulation (Inches)'], axis = 1)
turbidity_data = pd.read_excel(file_path, sheet_name='Turbidity').drop(['Piermont Turbidity in NTU'], axis = 1)

## Data Wrangling
Perform the following data wrangling steps to get our data ready for our model.

1. Merge the three dataframes together. While merging, or after, drop all columns for the Piermont location.
2. Update the column names to be shorter and not have spaces. Use snake case.
3. Make your date column a datetime object.
4. Set the date as the index for the merged dataframe.

In [3]:
# Merge DO and rainfall data
df = pd.merge(do_data, rainfall_data, how='left', on='Date Time (ET)')

# Add on turbidity data
df = pd.merge(df, turbidity_data, how='left', on='Date Time (ET)')

# View changes 
df

Unnamed: 0,Date Time (ET),Port of Albany D.O. (ppm),Norrie Point D.O. (ppm),Port of Albany Rainfall Daily Accumulation (Inches),Norrie Point Rainfall Daily Accumulation (Inches),Port of Albany Turbidity in NTU,Norrie Point Turbidity in NTU
0,2011-08-25 00:00:00,7.68,7.81,0.000000,0.000000,4.0,9.3
1,2011-08-25 00:15:00,7.60,7.73,0.000000,0.000000,3.9,8.4
2,2011-08-25 00:30:00,7.57,7.63,0.000000,0.000000,4.3,7.9
3,2011-08-25 00:45:00,7.72,7.67,0.000000,0.000000,4.7,8.1
4,2011-08-25 01:00:00,7.74,7.63,0.000000,0.000000,4.4,8.4
...,...,...,...,...,...,...,...
1147,2011-09-05 22:45:00,8.73,6.84,0.629999,1.219998,47.2,144.1
1148,2011-09-05 23:00:00,8.76,6.78,0.639999,1.239998,56.7,139.7
1149,2011-09-05 23:15:00,8.66,6.83,0.649999,1.259997,47.0,141.2
1150,2011-09-05 23:30:00,8.75,6.79,0.679999,1.269997,48.7,127.9


In [4]:
# Update column names
df = df.rename(columns={"Date Time (ET)": "date", 
                   " Port of Albany D.O. (ppm)": "albany_do", 
                   "Norrie Point D.O. (ppm)": "norrie_do",
                        " Port of Albany Rainfall Daily Accumulation (Inches)" : "albany_rainfall",
                        "Norrie Point  Rainfall Daily Accumulation (Inches)": "norrie_rainfall",
                   " Port of Albany Turbidity in NTU": "albany_turbidity",
                   "Norrie Point Turbidity in NTU": "norrie_turbidity"})
# Can also assign all seven new names at once with df.columns = [...], listed in column order

# View changes
df

Unnamed: 0,date,albany_do,norrie_do,albany_rainfall,norrie_rainfall,albany_turbidity,norrie_turbidity
0,2011-08-25 00:00:00,7.68,7.81,0.000000,0.000000,4.0,9.3
1,2011-08-25 00:15:00,7.60,7.73,0.000000,0.000000,3.9,8.4
2,2011-08-25 00:30:00,7.57,7.63,0.000000,0.000000,4.3,7.9
3,2011-08-25 00:45:00,7.72,7.67,0.000000,0.000000,4.7,8.1
4,2011-08-25 01:00:00,7.74,7.63,0.000000,0.000000,4.4,8.4
...,...,...,...,...,...,...,...
1147,2011-09-05 22:45:00,8.73,6.84,0.629999,1.219998,47.2,144.1
1148,2011-09-05 23:00:00,8.76,6.78,0.639999,1.239998,56.7,139.7
1149,2011-09-05 23:15:00,8.66,6.83,0.649999,1.259997,47.0,141.2
1150,2011-09-05 23:30:00,8.75,6.79,0.679999,1.269997,48.7,127.9


In [5]:
# Change to datetime format
df['date'] = pd.to_datetime(df['date'])

# Set date as the index
df.set_index('date', inplace=True)

# View changes
df.head()

date,albany_do,norrie_do,albany_rainfall,norrie_rainfall,albany_turbidity,norrie_turbidity
2011-08-25 00:00:00,7.68,7.81,0.0,0.0,4.0,9.3
2011-08-25 00:15:00,7.6,7.73,0.0,0.0,3.9,8.4
2011-08-25 00:30:00,7.57,7.63,0.0,0.0,4.3,7.9
2011-08-25 00:45:00,7.72,7.67,0.0,0.0,4.7,8.1
2011-08-25 01:00:00,7.74,7.63,0.0,0.0,4.4,8.4


## Multiple Linear Regression
Now that our data is cleaned, let’s do the following to carry out a multiple linear regression.

1. Define your predictor and target variables.
2. Split the data into training and testing sets.
3. Create and fit the model.
4. Predict and evaluate your model.
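For intuition about step 3, `LinearRegression` fits an ordinary least squares model: it picks the intercept and slopes that minimize the sum of squared residuals. The sketch below reproduces that fit with NumPy alone on synthetic data (not the hurricane data), so the names `X_design`, `intercept`, and `slopes` are illustrative assumptions rather than anything from the notebook.

```python
import numpy as np

# Synthetic data for illustration only: y = 1 + 2*x1 + 3*x2, with no noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize the sum of squared residuals
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slopes = coefs[0], coefs[1:]

print("intercept:", round(float(intercept), 3))
print("slopes:", np.round(slopes, 3))
```

Because the synthetic target has no noise, the recovered coefficients should match the ones used to generate it. With real data the fit will not be exact, which is why the next cell evaluates it on a held-out test set.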

In [17]:
# Define predictors and target
X = df[['albany_do', 'albany_rainfall']] # Double brackets return a DataFrame
y = df[['albany_turbidity']]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
       
# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
    
# Predict and evaluate
y_pred = model.predict(X_test) # Returns an array of predictions

# Compute RMSE and R-squared
# RMSE is in the units of the target: the typical turbidity prediction error is about 221 NTU
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse}")
print(f"R-squared: {r2}")


RMSE: 221.9143474905527
R-squared: 0.4907389518457509


## Create a Widget for Updating the Predictor and Target Variables
1. Create the four different pieces of the widget: the predictor selector, the target selector, the evaluate button, and the output.
2. Wrap our workflow into a function called evaluate_model(). This function will run a linear regression model based on what the user selects as predictors and the outcome variable. It will print the R^2 and MSE and display a scatterplot of the actual versus predicted target variable.
3. Create a warning for your widget to ensure that the user does not select the same variable as both a predictor variable and a target variable.
4. Play around with your widget and see how your R^2 changes based on your selected variables!
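The check described in step 3 can be kept separate from the widget code so that it is easy to test on its own. The sketch below assumes a hypothetical helper, `selection_error`, which is not part of ipywidgets: it takes the selected predictor names and the target name and returns a warning message, or `None` when the selection is valid. The extra check for an empty selection is an added assumption, not something the steps above require.

```python
def selection_error(predictors, target):
    """Return a warning string for an invalid selection, or None if it is valid."""
    predictors = list(predictors)  # SelectMultiple.value is a tuple
    if not predictors:
        return "Select at least one predictor"
    if target in predictors:
        return "Target variable must not be in the predictors"
    return None

# Valid selection: the target is not among the predictors
print(selection_error(("albany_do", "albany_rainfall"), "albany_turbidity"))

# Invalid selection: the target is also chosen as a predictor
print(selection_error(("albany_do", "albany_turbidity"), "albany_turbidity"))
```

Inside a button callback, one option is to call this helper first, print the message if it is not `None`, and return before fitting the model.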

In [None]:
# Create widget for selecting predictors
predictor_selector = widgets.SelectMultiple(
    options = df.columns,
    value = [df.columns[0]], # Where to start
    description = 'Predictors' # Name   
)

# Create a dropdown for selecting the target variable
target_selector = widgets.Dropdown(
    options = df.columns,
    value = df.columns[1],
    description = 'Target'
) 

# Button to evaluate model
evaluate_button = widgets.Button(description = 'Evaluate Model')

# Output widget to display results
output = widgets.Output()

# Define the function to handle button clicks
def evaluate_model(b):
    with output:
        clear_output(wait = True) # Clear output of display area
        
        # Make sure target is not in predictors
        selected_predictors = list(predictor_selector.value)
        if target_selector.value in selected_predictors:
            print("Target variable must not be in the predictors")
            return

        # Prepare data
        X = df[selected_predictors] # Indexing with a list of names returns a DataFrame
        y = df[target_selector.value]
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        
        # Create and fit model
        model = LinearRegression()
        model.fit(X_train, y_train)
        
        # Predict and calculate R^2 and MSE
        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)

        # Print the metrics
        print(f"R^2: {r2:.3f}")
        print(f"MSE: {mse:.3f}")

        # Plot actual versus predicted values of the target
        plt.scatter(y_test, y_pred)
        plt.xlabel(f"Actual {target_selector.value}")
        plt.ylabel(f"Predicted {target_selector.value}")
        plt.title("Actual vs. Predicted")
        plt.show()
        
        
# Display the widgets and connect the button to function
display(predictor_selector, target_selector, evaluate_button, output)
evaluate_button.on_click(evaluate_model)