# **Lab 4A: Linear regression and regularization**

**WHAT** This nonmandatory lab consists of several programming and insight exercises/questions.

**WHY** The exercises are meant to familiarize you with regularization methods for linear regression.

**HOW** Follow the exercises in this notebook either on your own or with a fellow student. Work through these exercises at your own pace and be sure to ask the TAs when you don't understand something.

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1.}}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1.}}$

**Goal of the Lab:**    
In this lab, we aim to fit a ridge regression model and lasso model to the `Hitters` data, as well as an ElasticNet model. We wish to predict a baseball player's `Salary` based on the statistics associated with performance in the previous year.  From several models we wish to select the best one and determine its RMSE.

For more information about the data set, including a description of the features/variables, click [here](https://www.rdocumentation.org/packages/ISLR/versions/1.2/topics/Hitters).

**Remark:**  
This lab is based on a previous version of what is now Section 6.5, Labs 1 and 2: Linear Models and Regularization Methods, of _An Introduction to Statistical Learning_.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# To produce static images embedded in the notebook
%matplotlib inline 
# To produce interactive images embedded in the notebook
# %matplotlib notebook

# scikit-learn specifics
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")


# make Pandas display 2 digits after decimal point; some output then fits in window
# change this if you like
pd.set_option('display.precision', 2)  

## 1. Loading and viewing the `Hitters` data set

We import the `Hitters` data set as a pandas dataframe.

In [None]:
hitters = pd.read_csv("./Hitters.csv", index_col = 0)

# Display information about the data set
hitters.info()

<div style="background-color:#c2eafa">
$\ex{1}$ Look at the first rows of the dataset, print some summary statistics, and find out which features contain `NA` values.

In [None]:
# Look at first rows of the data set
hitters.head()

In [None]:
# Return summary statistics for each column

# START ANSWER
# END ANSWER

In [None]:
# Check for NA values

# START ANSWER 
# END ANSWER

<div style="background-color:#c2eafa">
$\ex{2}$ Which features of the dataset are categorical and what are their (unique) values? 

In [None]:
# Categorical variables:

# START ANSWER 
# END ANSWER

## 2. Preprocessing the data
In this part, we preprocess the `Hitters` data so that it is ready for the data fitting. 

<div style="background-color:#c2eafa">
$\ex{3}$ Remove all `NA` values.

In [None]:
hitters_clean = None

# START ANSWER 
# END ANSWER

# Display information about the data set
hitters_clean.info()

We obtain the "cleaned" data with 263 rows and 20 columns. This is in agreement with the result in Section 6.5 of the book and also with our earlier observation when calling `hitters.info()`. 

We continue with transforming the *categorical* variables `League`, `Division`, and `NewLeague` to *indicator* variables. Let's see what methods `Pandas` has to help us:

In [None]:
# Create dummy (indicator) variables for the categorical columns
dummies = pd.get_dummies(hitters_clean[["League", "Division", "NewLeague"]])

# First 5 rows of the dummy variables
dummies.head()

In [None]:
# Return summary statistics for each column
dummies.describe()

<div style="background-color:#c2eafa">
    
$\ex{4}$ Note that for each categorical variable we should only use *one* of the generated dummies. Replace the categorical columns of `hitters_ind` with the corresponding binary values, as described in the comments below. 
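
The following sketch illustrates why one dummy per categorical variable is enough. It uses a made-up column `Hand` (an assumption for illustration only; it is not part of `Hitters`), so it does not replace the exercise.

```python
import pandas as pd

# Toy column (hypothetical, not part of Hitters) with two categories
toy = pd.DataFrame({"Hand": ["L", "R", "R", "L", "R"]})

# get_dummies creates one indicator column per category
both = pd.get_dummies(toy[["Hand"]])
print(both)

# The two indicator columns always sum to 1, so one determines the other
print((both.sum(axis=1) == 1).all())

# drop_first=True keeps a single indicator per categorical variable
one = pd.get_dummies(toy[["Hand"]], drop_first=True)
print(list(one.columns))
```

Keeping both indicators together with an intercept would make the design matrix columns linearly dependent, which is why only one of them is used.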

In [None]:
# Make a copy of hitters_clean data set
hitters_ind = hitters_clean.copy()

# Replace the columns with their 0/1 values such that
# League = 'N' is assigned the value 1 and 'A' is assigned the value 0
# Division = 'W' is assigned the value 1 and 'E' is assigned the value 0
# NewLeague = 'N' is assigned the value 1 and 'A' is assigned the value 0 

# START ANSWER 
# END ANSWER 

In [None]:
# Note that hitters_clean is not changed, but hitters_ind is.
# display() is used because a notebook cell only shows its last expression.
display(hitters_clean.head())
display(hitters_ind.head())

### 2A. Splitting and scaling the dataset

Below we split the data 60/20/20 into training, validation, and test data.

<div style="background-color:#c2eafa">
$\ex{5}$ After the splitting, scale the predictor variables. All three data sets must be scaled with mean and standard deviation of the training set.
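
As a rough illustration of the mechanics, the sketch below fits a `StandardScaler` on a toy training set and reuses it on a toy validation set. The arrays `toy_train` and `toy_val` are made up for this example; apply the same pattern to the lab's own matrices yourself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data (hypothetical): 100 training rows and 30 validation rows, 3 features
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_val = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

# Fit on the training data only, so the scaler stores training means and stds
scaler = StandardScaler().fit(toy_train)

# Transform both sets with the same (training) statistics
toy_train_s = scaler.transform(toy_train)
toy_val_s = scaler.transform(toy_val)

print(toy_train_s.mean(axis=0))  # zero up to rounding error
print(toy_val_s.mean(axis=0))    # generally not exactly zero
```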

In [None]:
# The design matrix X containing the predictors
X = hitters_ind.drop("Salary", axis = 1)
colnames = list(X.columns)
# The y variable containing the response
y = hitters_ind.Salary

# Splitting (don't change this)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size = 0.6, random_state = 1267)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size = 0.5, random_state = 9001)

# Standardize the design matrices (don't change their names)
# START ANSWER
# END ANSWER 

X = X_val
# Check if X has mean zero and variance 1 for each column
print("\nMeans of the standardized X:")
print(X.mean(axis = 0))
print("\nVariances of the standardized X:")
print(X.var(axis = 0))

<div style="background-color:#c2eafa">
    
$\q{1}$ The resulting means of the validation and test sets are not all close to zero, and the variances are not all close to 1. However, we _must_ do the scaling this way. Why is that?

<div style="background-color:#ffa500">
    
Write your answer in this colored box:

[//]: # (START ANSWER)
[//]: # (END ANSWER)

## 3. Ridge Regression

We perform ridge regression in order to predict the baseball players' salaries based on their performance statistics. 

In [None]:
# Range of values for lambda, the tuning parameter
lambdas = 10**np.linspace(-2, 10, 25)

This range for `lambdas` covers the full range of scenarios, from essentially the least squares fit ($\lambda = 10^{-2}$) to essentially the null model containing only the intercept ($\lambda = 10^{10}$).
For each particular value in `lambdas`, we store a vector of ridge regression coefficients plus an intercept. 
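
To make the two extremes concrete, here is a small sketch on synthetic data (the arrays `toy_X` and `toy_y` are assumptions made up for this illustration, not the `Hitters` data). It compares ridge fits at both ends of the range with an ordinary least squares fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
# Synthetic data (hypothetical): 200 rows, 4 features, known coefficients
toy_X = rng.normal(size=(200, 4))
toy_y = 3.0 + toy_X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(toy_X, toy_y)
ridge_small = Ridge(alpha=1e-2).fit(toy_X, toy_y)   # very weak penalty
ridge_large = Ridge(alpha=1e10).fit(toy_X, toy_y)   # very strong penalty

# Largest difference between weak-penalty ridge and least squares coefficients
print(np.abs(ridge_small.coef_ - ols.coef_).max())
# Largest coefficient (in absolute value) under the strong penalty
print(np.abs(ridge_large.coef_).max())
```

On this toy problem the first number should be tiny and the second practically zero, which is the behaviour the range of `lambdas` is meant to span.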

In [None]:
RidgeRegr = Ridge(fit_intercept = True)

# Create a pandas dataframe to store the coefficients
coefsRR = pd.DataFrame(columns = colnames)
coefsRR["Intercept"] = ""
# Loop through lambdas

for index,l in enumerate(lambdas):
    # set the ridge model with corresponding lambda value 
    RidgeRegr.set_params(alpha = l)
    # fit the model 
    RidgeRegr.fit(X_train, y_train)
    # Add the coefficients and intercept to the dataframe 
    coeff_intercept = np.append(RidgeRegr.coef_, RidgeRegr.intercept_)
    coefsRR.loc[index, :] = coeff_intercept

coefsRR


In [None]:
# Plot the coefficients
plt.figure(figsize = (6, 6))
ax = plt.axes()
for i in range(len(coefsRR.columns)):
    ax.plot(lambdas, coefsRR.iloc[:, i], label=coefsRR.columns[i])
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel(r'$\lambda$')  # raw string avoids the invalid escape sequence "\l"
plt.ylabel('estimated coefficients')
ax.legend(loc='upper left', bbox_to_anchor=(0.75, 1), ncol=1, fontsize='small')  # Adjust bbox_to_anchor for positioning
plt.show()

<div style="background-color:#c2eafa">
    
$\q{2}$ Some coefficients go to zero monotonically, others do not, first growing in size or changing sign before going to zero. How can this happen?

<div style="background-color:#ffa500">
    
Write your answer in this colored box:

[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
$\q{3}$ Why is the intercept not going to zero as $\lambda$ gets large?

<div style="background-color:#ffa500">
    
Write your answer in this colored box:
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
$\ex{6}$ (Skip this on your first pass; see Section 6 for an explanation.) Add code to the loop below that computes the training and validation MSEs. Consider printing them alongside the L2 norms, or plotting both against the lambdas.

In [None]:
# Testing for specific lambdas;
lambdas_sel = lambdas # [705, 11000, 10**7] 
l2_lambdas_sel = []

coefsRR_sel = pd.DataFrame(columns = colnames)
coefsRR_sel["Intercept"] = ""

# Loop through lambdas
for index,l in enumerate(lambdas_sel):
    RidgeRegr.set_params(alpha = l)
    RidgeRegr.fit(X_train, y_train)
    coeff_intercept = np.append(RidgeRegr.coef_, RidgeRegr.intercept_)
    coefsRR_sel.loc[index, :] = coeff_intercept
    l2_lambdas_sel.append(np.linalg.norm(RidgeRegr.coef_))

# Print the L2 norm of the coefficients for each lambda
for index, l in enumerate(lambdas_sel):
    print(f"For lambda = {l:10.2}, the L2-norm = {l2_lambdas_sel[index]:7.2f}")
    
print("")

## 4. The Lasso

We have seen that ridge regression with a wise choice of $ \lambda $ can outperform least squares as well as the null model (having only the Intercept) on the `Hitters` data set. 
We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. 
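
One reason the lasso can be easier to interpret is that it may set some coefficients to exactly zero. The sketch below shows this on synthetic data; `toy_X`, `toy_y`, and the penalty value 0.5 are assumptions chosen for illustration and say nothing about the `Hitters` results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
toy_X = rng.normal(size=(200, 6))
# Only the first two toy features influence the response
toy_y = 4.0 * toy_X[:, 0] - 3.0 * toy_X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(toy_X, toy_y)
ridge = Ridge(alpha=0.5).fit(toy_X, toy_y)

print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
# Count coefficients that are exactly zero in each model
print("exact zeros (lasso):", int(np.sum(lasso.coef_ == 0.0)))
print("exact zeros (ridge):", int(np.sum(ridge.coef_ == 0.0)))
```

Note that scikit-learn scales the `alpha` penalty differently for `Lasso` and `Ridge`, so equal `alpha` values do not imply equally strong regularization.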

In [None]:
# Range of values for lambda, the tuning parameter
lambdas = 10**np.linspace(-2, 3, 25)

In [None]:
lassoEST_lambdas = Lasso(fit_intercept = True, max_iter=10000)

# Create a pandas dataframe to store the coefficients
coefsLassoEST_lambdas = pd.DataFrame(columns = colnames)
coefsLassoEST_lambdas["Intercept"] = ""

for index,l in enumerate(lambdas):
    lassoEST_lambdas.set_params(alpha = l)
    lassoEST_lambdas.fit(X_train, y_train)
    coeff_intercept = np.append(lassoEST_lambdas.coef_, lassoEST_lambdas.intercept_)
    coefsLassoEST_lambdas.loc[index, :] = coeff_intercept

coefsLassoEST_lambdas

In [None]:
# Plot the coefficients
plt.figure(figsize = (6, 6))
ax = plt.axes()
for i in range(len(coefsLassoEST_lambdas.columns)):
    ax.plot(lambdas, coefsLassoEST_lambdas.iloc[:, i], label=coefsLassoEST_lambdas.columns[i])
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel(r'$\lambda$')  # raw string avoids the invalid escape sequence "\l"
plt.ylabel('estimated coefficients')
ax.legend(loc='upper left', bbox_to_anchor=(0.75, 1), ncol=1, fontsize='small')  # Adjust bbox_to_anchor for positioning
plt.show()

<div style="background-color:#c2eafa">
$\q{4}$ Notice the difference with the Ridge Regression plot. Can you explain the "kinks"?

<div style="background-color:#ffa500">
    
Write your answer in this colored box:
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

## 5. Elastic Net Regression

In [None]:
# Range of values for lambda, the tuning parameter
lambdas = 10**np.linspace(-2, 4, 25)

elasticEST_lambdas = ElasticNet(fit_intercept = True, max_iter=10000, l1_ratio=0.5)

# Create a pandas dataframe to store the coefficients
coefsElasticNetEST_lambdas = pd.DataFrame(columns = colnames)
coefsElasticNetEST_lambdas["Intercept"] = ""


for index,l in enumerate(lambdas):
    elasticEST_lambdas.set_params(alpha = l)
    elasticEST_lambdas.fit(X_train, y_train)
    coeff_intercept = np.append(elasticEST_lambdas.coef_, elasticEST_lambdas.intercept_)
    coefsElasticNetEST_lambdas.loc[index, :] = coeff_intercept

coefsElasticNetEST_lambdas

In [None]:
# Plot the coefficients
plt.figure(figsize = (6, 6))
ax = plt.axes()
for i in range(len(coefsElasticNetEST_lambdas.columns)):
    ax.plot(lambdas, coefsElasticNetEST_lambdas.iloc[:, i], label=coefsElasticNetEST_lambdas.columns[i])
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel(r'$\lambda$')
plt.ylabel('estimated coefficients')
ax.legend(loc='upper left', bbox_to_anchor=(0.75, 1), ncol=1, fontsize='small')  # Adjust bbox_to_anchor for positioning
plt.show()

## 6. A model selection task

<div style="background-color:#c2eafa">

Please carry out the following:
1. Also carry out the ElasticNet fit with `l1_ratio` 0.25 and 0.75.
2. For each of the five models use the validation set MSEs to determine the regularization parameter that yields the best model.
3. Construct a table whose rows are the five best models: Ridge in the first row, Lasso in the last, and the ElasticNet models ordered by `l1_ratio` in between. For each, display the RMSEs for training and validation.

**Notes**
1. It is a good idea to add some code in the "regularization loops" to compute/store MSEs for the training set and for the validation set. See the suggestions in Exercise 6.
2. You might be tempted to use a finer grid for the $\lambda$'s; that's not necessary.
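
The selection step in item 2 could look roughly like the sketch below. It is shown for a single ridge model on synthetic data; all names starting with `toy_` are hypothetical, and the same loop would have to be repeated for the other models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
# Synthetic data (hypothetical), split into toy training and validation parts
toy_X = rng.normal(size=(150, 5))
toy_y = toy_X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=150)
toy_X_train, toy_X_val = toy_X[:100], toy_X[100:]
toy_y_train, toy_y_val = toy_y[:100], toy_y[100:]

toy_lambdas = 10 ** np.linspace(-2, 4, 13)
val_mse = []
for lam in toy_lambdas:
    # Fit on the training part, evaluate on the validation part
    model = Ridge(alpha=lam).fit(toy_X_train, toy_y_train)
    val_mse.append(mean_squared_error(toy_y_val, model.predict(toy_X_val)))

# The best lambda is the one with the smallest validation MSE
best = int(np.argmin(val_mse))
best_lambda = toy_lambdas[best]
best_rmse = np.sqrt(val_mse[best])  # RMSE is the square root of the MSE
print(f"best lambda = {best_lambda:.3g}, validation RMSE = {best_rmse:.3f}")
```

The test set is deliberately left out of this loop; it is only used once, after the final model has been chosen.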

In [None]:
# START ANSWER
# END ANSWER

<div style="background-color:#c2eafa">
$\q{5}$ Which model would you choose? Explain your choice.

<div style="background-color:#ffa500">
    
Write your answer in this colored box:
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
$\ex{7}$ Determine the RMSE for your best model, using the test set.