# DTSC-670 Foundations of Machine Learning
## Assignment: Ridge Regression
### Name: Kurt Brown

## Copyright & Academic Integrity Notice
<span style="color:red">This material is for enrolled students' academic use only and protected under U.S. Copyright Laws. This content must not be shared outside the confines of this course, in line with Eastern University's academic integrity policies. Unauthorized reproduction, distribution, or transmission of this material, including but not limited to posting on third-party platforms like GitHub, is strictly prohibited and may lead to disciplinary action. You may not alter or remove any copyright or other notice from copies of any content taken from BrightSpace or Eastern University’s website.</span>
 
<span style="color:red">© Copyright Notice 2024, Eastern University - All Rights Reserved.</span> 

## Student Learning Objectives

- Evaluate different regression models and assess their performance relative to ridge regression regularization.
- Explore the utility of ridge regression in mitigating multicollinearity concerns in the data.
- Gain experience in employing grid search to identify optimal hyperparameters for a given model.

## CodeGrade
This assignment will be automatically graded through CodeGrade, and you will have unlimited submission attempts. To ensure successful grading, please follow these instructions carefully: Name your notebook as `ridge_regression_assignment.ipynb` before submission, as CodeGrade requires this specific filename for grading purposes. Additionally, make sure there are no errors in your notebook, as CodeGrade will not be able to grade it if errors are present. Before submitting, we highly recommend restarting your kernel and running all cells again to ensure that there will be no errors when CodeGrade runs your script.

## Assignment Overview
The objective of this assignment is to demonstrate that sometimes regularization techniques can help address multicollinearity problems in your dataset. You will gain experience in building diverse regression models and evaluating their performance in comparison to a ridge regression regularized model. Additionally, you will practice utilizing grid search to obtain the optimal hyperparameters for your models.

### Data
The data for this assignment is synthetic and was intentionally created so that the predictor variables are highly correlated. When predictor variables are strongly correlated with one another, ridge regression can sometimes alleviate multicollinearity by adding a penalty on the size of the regression coefficients to the least squares loss. This penalty shrinks the coefficients, which makes them more stable.
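To make the stability claim concrete, here is a minimal sketch that refits an unpenalized model and a ridge model on bootstrap resamples of two nearly identical predictors and compares how much the coefficients move. It uses NumPy only. The data (`X_demo`, `y_demo`), the helper `fit_ridge`, and the choice of `alpha=1.0` are all assumptions made for illustration and are not part of the assignment or of `ridge_reg_data.csv`.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit on centered data; alpha=0 reduces to ordinary least squares."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    n_features = X.shape[1]
    # Solve (X^T X + alpha * I) w = X^T y for the coefficient vector w
    gram = X_centered.T @ X_centered + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X_centered.T @ y_centered)

rng = np.random.default_rng(42)
n_samples = 200

# Two nearly identical predictors (hypothetical data, not ridge_reg_data.csv)
x1 = rng.normal(size=n_samples)
x2 = x1 + 0.01 * rng.normal(size=n_samples)
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + 2 * x2 + rng.normal(size=n_samples)

# Refit on bootstrap resamples and record the coefficients each time
ols_coefs, ridge_coefs = [], []
for _ in range(200):
    idx = rng.integers(0, n_samples, size=n_samples)
    ols_coefs.append(fit_ridge(X_demo[idx], y_demo[idx], alpha=0.0))
    ridge_coefs.append(fit_ridge(X_demo[idx], y_demo[idx], alpha=1.0))

# Spread of each coefficient across the resamples
ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print("Std. dev. of OLS coefficients:  ", ols_std)
print("Std. dev. of ridge coefficients:", ridge_std)
```

With predictors this strongly correlated, the unpenalized coefficients should swing widely from one resample to the next, while the ridge coefficients should stay close together. The price is some bias: ridge tends to split the combined effect between the two predictors rather than recover the individual values of 3 and 2.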

Please download the `ridge_reg_data.csv` file from Brightspace and put it in the same folder as this notebook.

### Assignment Instructions
Walk through the assignment and follow the directions as requested.  Once you have completed all the tasks, you are ready to submit your assignment to CodeGrade for testing. Please restart your notebook's kernel and run your code from the beginning to ensure there are no error messages. Once you have verified that the code runs without any issues, submit your .ipynb notebook file to CodeGrade for evaluation. Your notebook should be called `ridge_regression_assignment.ipynb`. You have unlimited attempts for this assignment. 

## Standard Imports<a name="import"></a>
Run the code block below to load the standard imports and set up the notebook for CodeGrade grading.

In [1]:
# standard imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)
import warnings
warnings.filterwarnings("ignore") 

## Get the Data
**Exercise 1:** Load the file named `ridge_reg_data.csv` and store its contents in a DataFrame named `ridge_data`.

In [2]:
### ENTER CODE HERE ###
ridge_data = pd.read_csv("ridge_reg_data.csv")

In [3]:
# take a look at the data
ridge_data

| Unnamed: 0 | col_1 | col_2 | col_3 | col_4 | col_5 | target |
|---|---|---|---|---|---|---|
| 0 | 0.548814 | 0.715189 | 0.602763 | 0.544883 | 1.317953 | 4.232345 |
| 1 | 0.645894 | 0.437587 | 0.891773 | 0.963663 | 1.329360 | 5.235632 |
| 2 | 0.791725 | 0.528895 | 0.568045 | 0.925597 | 1.096939 | 5.245756 |
| 3 | 0.087129 | 0.020218 | 0.832620 | 0.778157 | 0.852838 | 2.990405 |
| 4 | 0.978618 | 0.799159 | 0.461479 | 0.780529 | 1.260638 | 6.951112 |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 0.377716 | 0.918172 | 0.579258 | 0.369722 | 1.497429 | 6.335631 |
| 996 | 0.924504 | 0.673493 | 0.230904 | 0.819261 | 0.904397 | 5.958141 |
| 997 | 0.698047 | 0.486354 | 0.940865 | 0.068375 | 1.427219 | 7.331734 |
| 998 | 0.194568 | 0.221001 | 0.235428 | 0.152850 | 0.456428 | 3.313825 |


**Exercise 2:** Let's create a training and a test set.
1) Drop the `target` column from the `ridge_data` DataFrame and store the remaining features in a DataFrame named `X` (*uppercase X*).
2) Save the `target` column as a Series named `y` (*lowercase y*).
3) Use Scikit-Learn's `train_test_split` function with `X` and `y` to create a training set and a test set. Allocate 80% of the instances for training and 20% for testing. Set `random_state` to 42 to ensure reproducibility of our results. Assign the resulting objects the following names: `X_train`, `X_test`, `y_train`, and `y_test`.

In [4]:
### ENTER CODE HERE ###
from sklearn.model_selection import train_test_split
X = ridge_data.drop(columns=['target'])
y = ridge_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Ordinary Least Squares Linear Regression
**Exercise 3:** 
1) Create an instance of Scikit-Learn's `LinearRegression()` class and name this object `ols_model`.
2) Train your `ols_model` by fitting it with the `X_train` and `y_train` data.

In [6]:
### ENTER CODE HERE ###
from sklearn.linear_model import LinearRegression
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)

## Stochastic Gradient Descent
**Exercise 4:** 
1) Create an instance of Scikit-Learn's `SGDRegressor()` class and name this object `sgd_model`.  Include the following hyperparameters:
    - max_iter = 10000
    - penalty = None
    - learning_rate = 'constant'
    - n_iter_no_change = 100
    - random_state = 42
    - Take some time to look at [Scikit-Learn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) to understand the definitions of these hyperparameters.  You can also play around with different hyperparameter values on your own (just please do not include your work in this notebook when you submit it) to see how the model changes.
2) Note: Do not train your model yet.  We will do that in the next section.

In [7]:
### ENTER CODE HERE ###
from sklearn.linear_model import SGDRegressor
sgd_model = SGDRegressor(
    max_iter = 10000,
    penalty=None, 
    learning_rate='constant',
    n_iter_no_change=100,
    random_state=42
)

### Grid Search for SGDRegressor Model
In the module videos, we discussed the importance of the initial learning rate (called `eta0` in sklearn) and stopping criterion (called `tol` in sklearn).  We've created a simple grid search below that:
1) Creates a dictionary of options for `eta0` and `tol` 
2) Instantiates a `GridSearchCV()` object, passing it the `sgd_model` that you created earlier along with the `sgd_param_grid` dictionary.  Take a look at [Scikit-Learn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to understand the `verbose` and `cv` parameters.
3) Fits your grid search using your `X_train` and `y_train` data and prints out the best parameters found during the grid search.
4) Saves the best model from the grid search as `sgd_model`.

Run the code block below to run through the steps outlined above and make sure that you understand what is going on in the code.

In [8]:
from sklearn.model_selection import GridSearchCV

# create a dictionary of hyperparameter values
sgd_param_grid = {'eta0': [.1, .01, .001],
                  'tol': [1e-1, 1e-3, 1e-5, 1e-7]}

# instantiate GridSearchCV()
grid_search_cv_sgd = GridSearchCV(sgd_model,
                              sgd_param_grid, verbose=1, cv=10)

# fits the grid search model and prints the best parameters
grid_search_cv_sgd.fit(X_train, y_train)
print("The best parameters are: ", grid_search_cv_sgd.best_params_)

# saves the best model using the hyperparameters from the grid search
sgd_model = grid_search_cv_sgd.best_estimator_

Fitting 10 folds for each of 12 candidates, totalling 120 fits
The best parameters are:  {'eta0': 0.001, 'tol': 0.001}


It can be beneficial to conduct additional rounds of grid searches to explore non-searched areas of the hyperparameter space. For instance, if the optimal `eta0` parameter was found to be 0.001, it might be worthwhile to explore values both below and above that threshold to determine if further model enhancements can be achieved. However, we will not perform further grid searches for this assignment as they are unlikely to yield significant benefits.
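As a rough illustration of what such a follow-up round could look like, the sketch below searches a finer set of `eta0` values on either side of 0.001. The data (`X_demo`, `y_demo`), the names `demo_sgd`, `refined_param_grid`, and `refined_search`, the specific grid values, and `cv=5` are all assumptions chosen for this example; only the `SGDRegressor` hyperparameters are carried over from Exercise 4.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data; in the notebook you would use X_train and y_train
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(300, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=300)

# Same hyperparameters as the sgd_model from Exercise 4
demo_sgd = SGDRegressor(max_iter=10000, penalty=None, learning_rate='constant',
                        n_iter_no_change=100, random_state=42)

# Finer grid bracketing the previous best eta0 of 0.001 (values are assumptions)
refined_param_grid = {'eta0': [0.0002, 0.0005, 0.001, 0.002, 0.005],
                      'tol': [1e-3]}

refined_search = GridSearchCV(demo_sgd, refined_param_grid, cv=5)
refined_search.fit(X_demo, y_demo)
print("Best refined parameters:", refined_search.best_params_)
```

If the best value lands on the edge of the refined grid, that is usually a hint to extend the grid further in that direction before settling on a value.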

## Ridge Regression
As we mentioned in the data description, this dataset was specifically created with high correlation between predictor variables.  When working with data that contains multicollinearity issues, ridge regression can sometimes help to mitigate this problem.

**Exercise 5:** 
1) Create an instance of Scikit-Learn's `Ridge()` class and name this object `ridge_model`.  Set your `random_state` to 42.
2) Note: Do not train your model yet.  We will do that in the next section.

In [9]:
### ENTER CODE HERE ###
from sklearn.linear_model import Ridge
ridge_model = Ridge(random_state=42)

### Grid Search for Ridge Regression Model
In this grid search, we will search for the best `alpha` value to use in our `ridge_model`.  Remember from the module that alpha (sometimes called lambda) controls the strength of the penalty that ridge regression adds to the ordinary least squares loss: larger values shrink the coefficients more.
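The effect of `alpha` can be seen in a small sketch that fits Scikit-Learn's `Ridge` at several penalty strengths and records the size of the coefficient vector each time. The correlated data (`X_demo`, `y_demo`) and the four `alpha` values shown are assumptions for illustration, not the assignment data or the grid you are asked to search.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: three noisy copies of one underlying signal
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base + 0.05 * rng.normal(size=(200, 1)) for _ in range(3)])
y_demo = X_demo.sum(axis=1) + rng.normal(size=200)

# Fit ridge at increasing penalty strengths and track the coefficient norm
coef_norms = []
for alpha in [0.25, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>6}: coefficient norm = {coef_norms[-1]:.4f}")
```

The norm should drop as `alpha` grows. A grid search is useful precisely because too little shrinkage leaves the coefficients unstable while too much shrinkage underfits.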
    
**Exercise 6:** Similar to how the grid search was performed with the SGDRegressor model:
1) Create a dictionary of values for `alpha` and call this dictionary `ridge_param_grid`.  The values of alpha that you will search are: 0.25, 0.50, 0.75, 1, 2, 3, 5, 10, 100
2) Instantiate the `GridSearchCV()` class and pass it your `ridge_model` and your `ridge_param_grid` dictionary.  Use `verbose=1` and `cv=10`. Save this as `grid_search_cv_ridge`.
3) Fit your grid search model using the `X_train` and `y_train`.
4) Save the best model from the grid search as `ridge_model`.

In [10]:
### ENTER CODE HERE ###
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

ridge_param_grid = {'alpha': [0.25, 0.50, 0.75, 1, 2, 3, 5, 10, 100]}

grid_search_cv_ridge = GridSearchCV(ridge_model, ridge_param_grid, verbose=1, cv=10)

grid_search_cv_ridge.fit(X_train, y_train)

ridge_model = grid_search_cv_ridge.best_estimator_

Fitting 10 folds for each of 9 candidates, totalling 90 fits


## Calculate Metrics
Let's now see how the three models compare using the root mean squared error (RMSE) and R-squared metrics.
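As a reminder of what these two metrics measure, here is a small worked example that computes both by hand with NumPy. The arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration only.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the average squared residual
rmse_demo = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.4f}")      # RMSE: 0.7071
print(f"R-squared: {r2_demo:.4f}")   # R-squared: 0.9000
```

RMSE is expressed in the units of the target, so lower is better. R-squared is unitless, and values closer to 1 indicate that the model explains more of the variance in the target.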

**Exercise 7:** 
1) Using your `ols_model`, `sgd_model`, and `ridge_model`, make predictions on your `X_test` data and save these predictions as `ols_pred`, `sgd_pred`, and `ridge_pred` respectively.
2) Find the RMSE for each model using Scikit-Learn's `root_mean_squared_error` function passing it your `y_test` and predictions from step 1 above.  Round these RMSE scores to 4 decimal places.  Save these scores as `rmse_ols`, `rmse_sgd`, and `rmse_ridge` respectively.  
3) Find the R-squared values for each model using Scikit-Learn's `r2_score` function passing it your `y_test` and predictions from step 1 above.  Round these R-squared scores to 4 decimal places.  Save these scores as `r2_ols`, `r2_sgd`, and `r2_ridge` respectively.
4) Print the RMSE and R-squared scores for each model.

In [12]:
### ENTER CODE HERE ###
from sklearn.metrics import mean_squared_error, r2_score

ols_pred = ols_model.predict(X_test)
sgd_pred = sgd_model.predict(X_test)
ridge_pred = ridge_model.predict(X_test)

In [13]:
### ENTER CODE HERE ###
rmse_ols = np.round(np.sqrt(mean_squared_error(y_test, ols_pred)), 4)
rmse_sgd = np.round(np.sqrt(mean_squared_error(y_test, sgd_pred)), 4)
rmse_ridge = np.round(np.sqrt(mean_squared_error(y_test, ridge_pred)), 4)

In [14]:
### ENTER CODE HERE ###
r2_ols = np.round(r2_score(y_test, ols_pred), 4)
r2_sgd = np.round(r2_score(y_test, sgd_pred), 4)
r2_ridge = np.round(r2_score(y_test, ridge_pred), 4)

In [15]:
print(f"RMSE (OLS): {rmse_ols}")
print(f"RMSE (SGD): {rmse_sgd}")
print(f"RMSE (Ridge): {rmse_ridge}")
print("----------")
print(f"R-squared (OLS): {r2_ols}")
print(f"R-squared (SGD): {r2_sgd}")
print(f"R-squared (Ridge): {r2_ridge}")

RMSE (OLS): 0.987
RMSE (SGD): 0.9853
RMSE (Ridge): 0.9841
----------
R-squared (OLS): 0.6741
R-squared (SGD): 0.6752
R-squared (Ridge): 0.676


The ridge regression model shows a small improvement over both the ordinary least squares and stochastic gradient descent models: its RMSE is 0.9841 compared with 0.9870 and 0.9853, and its R-squared is 0.6760 compared with 0.6741 and 0.6752. Differences this small may not be statistically significant, so it is important to test them before drawing conclusions. The choice of hyperparameter values can also noticeably influence the final results.
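One possible way to check whether a gap between two models is more than noise is a paired bootstrap on the test set. The sketch below is a minimal version under stated assumptions: the helper `bootstrap_rmse_diff`, the number of resamples, the 95% percentile interval, and the demo arrays are all choices made for this example rather than a method prescribed by the course.

```python
import numpy as np

def bootstrap_rmse_diff(y_true, pred_a, pred_b, n_boot=2000, seed=42):
    """Percentile 95% interval for RMSE(pred_a) - RMSE(pred_b) under paired resampling."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    sq_err_a = (y_true - pred_a) ** 2
    sq_err_b = (y_true - pred_b) ** 2
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the same test rows for both models so the comparison stays paired
        idx = rng.integers(0, n, size=n)
        diffs[i] = np.sqrt(sq_err_a[idx].mean()) - np.sqrt(sq_err_b[idx].mean())
    return np.percentile(diffs, [2.5, 97.5])

# Hypothetical test targets and predictions from two models with similar error
rng = np.random.default_rng(0)
y_demo = rng.normal(size=200)
pred_a = y_demo + rng.normal(scale=1.0, size=200)
pred_b = y_demo + rng.normal(scale=1.0, size=200)

lower, upper = bootstrap_rmse_diff(y_demo, pred_a, pred_b)
print(f"95% bootstrap interval for the RMSE difference: [{lower:.4f}, {upper:.4f}]")
```

In this notebook, the analogous call would be `bootstrap_rmse_diff(y_test, ols_pred, ridge_pred)`. If the resulting interval contains zero, the test set does not give clear evidence that one model predicts better than the other.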

While ridge regression may not consistently help with multicollinearity, it is typically worth exploring to see whether it improves the model.