# Regularization and Model Selection

## Remember the 3 integral steps!


1.   Model construction
2.   Model usage
3.   Model selection



## We've talked about step 1 (Model construction)...

We use our training data to find our estimates

## And we've talked about step 2 (Model usage)...

We evaluate our fitted model against test data

## ...So what about model selection?

How do we know which model will be the best fit for our data? How do we prevent overfitting?

## Variable selection

We can use stepwise variable selection together with cross validation to select the best predictors!

In stepwise variable selection we iteratively find the optimal set of predictors by slowly building up how many predictors we are using.

1. Start with no predictors
2. Choose the predictor with the highest R^2 value (or other metric)
3. Keep adding the predictor that most increases the R^2 value, and stop when no remaining candidate increases the R^2 (again, or whatever other metric we are using to evaluate our model's accuracy)
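The three steps above can be sketched by hand as a greedy loop. This is only an illustrative sketch, not scikit-learn's internal implementation: the helper name `forward_select`, the synthetic data (`X_demo`, `y_demo`), and the choice of 5-fold cross-validated R^2 as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: only columns 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = 3.0 * X_demo[:, 0] + 1.5 * X_demo[:, 2] + rng.normal(scale=0.5, size=60)


def forward_select(X, y, cv=5):
    """Greedy forward selection using mean cross-validated R^2 (illustrative helper)."""
    remaining = list(range(X.shape[1]))  # step 1: start with no predictors
    selected = []
    best_score = -np.inf
    while remaining:
        # Score every candidate when added to the predictors chosen so far.
        scores = {
            j: cross_val_score(
                LinearRegression(), X[:, selected + [j]], y, cv=cv, scoring="r2"
            ).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        # Stop when no remaining candidate improves the metric.
        if scores[best_j] <= best_score:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score


selected, score = forward_select(X_demo, y_demo)
print("Selected columns (in order):", selected)
print("Cross-validated R^2:", round(score, 3))
```

On this synthetic data the two informative columns should be picked first; whether a noise column sneaks in afterwards depends on the random draw.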



# Variable (forward) selection example!

Download the .csv file from this link here: https://drive.google.com/file/d/10UeKpTSuqPnTeydHmqrSzYGbvVoCX4zF/view?usp=sharing

Import your libraries...

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Before going forward, let's upgrade scikit-learn.
Colab's default sklearn was version 0.22.2, which does not have sequential feature selection! `SequentialFeatureSelector` was added in 0.24, and the `n_features_to_select='auto'` option used below needs 1.1 or newer.
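Before running the selector, it can help to check which scikit-learn version is actually installed. A minimal sketch, assuming the version string starts with numeric `major.minor` parts; the variable names are just for illustration:

```python
import sklearn

# Parse "major.minor" from the installed version string (e.g. "1.3.2" -> (1, 3)).
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])

print("scikit-learn version:", sklearn.__version__)
print("Has SequentialFeatureSelector:", (major, minor) >= (0, 24))
print("Supports n_features_to_select='auto':", (major, minor) >= (1, 1))
```

If either check prints `False`, upgrade the package and restart the runtime before continuing.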

Load our dataset...

In [3]:
dataset = pd.read_csv('student_scores_extended.csv')

Explore our dataset...

In [4]:
dataset.head(10)

Unnamed: 0,Previous score,Prereqs taken,ID number,Hours,Scores
0,33,0,8751,2.5,21
1,45,0,4186,5.1,47
2,28,0,1096,3.2,27
3,80,1,5943,8.5,75
4,50,0,2723,3.5,30
5,18,0,7472,1.5,20
6,95,2,7122,9.2,88
7,75,1,7776,5.5,60
8,100,1,4072,8.3,81
9,10,0,6087,2.7,25


In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Previous score  25 non-null     int64  
 1   Prereqs taken   25 non-null     int64  
 2   ID number       25 non-null     int64  
 3   Hours           25 non-null     float64
 4   Scores          25 non-null     int64  
dtypes: float64(1), int64(4)
memory usage: 1.1 KB


Let's plot each attribute and see what it looks like!

In [6]:
# Let's try using matplotlib 
# and then seaborn

**What do you notice about these plots already?**

Now let's do forward selection! First, get our training and testing data

In [7]:
from sklearn.model_selection import train_test_split # remember this from yesterday?

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Next, let's decide what model we are using (let's go with linear regression!)

In [8]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

Now, let's implement our step forward feature selector

In [10]:
from sklearn.feature_selection import SequentialFeatureSelector

selector = SequentialFeatureSelector(
    lr,
    n_features_to_select='auto',
    direction='forward',
    scoring='r2',
    cv=5
)

selector.fit(X_train, y_train)

Now, let's print out which columns (features) were chosen as the optimal features!

In [11]:
selector.get_support()

array([False,  True, False,  True])

This means it chose the columns at index 1 and 3 (`Prereqs taken` and `Hours`) as our optimal features! With `n_features_to_select='auto'` and no `tol` set, the selector keeps half of the features.

**Do those results make sense to you?**

You can also call `selector.transform(X)` to directly get a version of `X` that keeps only the selected features

In [12]:
selector.transform(X_train).shape

(20, 2)
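Because `X` was built with `.values`, the boolean mask has lost the column names. A small sketch of mapping the mask back to names, using the column order from `dataset.info()` and the mask printed above (copied by hand here so the snippet runs on its own):

```python
import numpy as np

# Feature columns in the order they appear in X (everything except 'Scores').
feature_names = np.array(["Previous score", "Prereqs taken", "ID number", "Hours"])

# The mask printed by selector.get_support() above, copied by hand.
support_mask = np.array([False, True, False, True])

selected_names = feature_names[support_mask].tolist()
selected_indices = np.flatnonzero(support_mask).tolist()

print("Selected feature names:", selected_names)
print("Selected column indices:", selected_indices)
```

In the notebook itself you would pass `selector.get_support()` instead of the hand-copied mask; `selector.get_support(indices=True)` returns the integer positions directly.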

# Cross Validation

We want to prevent overfitting! So we can cross validate our models: split the data into several train/test folds, fit and score the model on each fold, and average the performance. There are various methods, explained more in lecture:

*   K-fold
*   Leave-One-Out

Let's take a look at a dummy example...




In [13]:
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
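The dummy example only shows how the indices are split. To actually average performance across folds, one option is `cross_val_score`. The sketch below is an assumption-laden illustration: the data (`X_cv`, `y_cv`) is synthetic, and leave-one-out is scored with mean squared error because R^2 is not defined on a single test sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Hypothetical synthetic regression data for illustration.
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(30, 2))
y_cv = 2.0 * X_cv[:, 0] - 1.0 * X_cv[:, 1] + rng.normal(scale=0.3, size=30)

model = LinearRegression()

# K-fold: 5 train/test splits, one R^2 score per fold, then average.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X_cv, y_cv, cv=kfold, scoring="r2")
print("K-fold R^2 per fold:", np.round(kfold_scores, 3))
print("K-fold mean R^2:", round(kfold_scores.mean(), 3))

# Leave-one-out: each sample is the test set once, so there are 30 scores.
# scikit-learn reports negative MSE so that "higher is better" still holds.
loo_scores = cross_val_score(
    model, X_cv, y_cv, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
print("Leave-one-out mean MSE:", round(-loo_scores.mean(), 3))
```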


# Regularization

We don't want a model whose coefficients grow extreme just to fit noise in the training data. Regularization modifies our loss function (how much our predictions differ from the actual values) by adding a penalty on the size of the coefficients, which helps with overfitting! We can use two types of regularization methods:

1.   L1 - Lasso Regularization
2.   L2 - Ridge Regularization

Let's use MSE (Mean Squared Error) as our loss function
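To make the idea concrete, here is a small numeric sketch of MSE plus each penalty. The function names, the example predictions, the coefficients, and `lam = 0.1` are all made up for illustration. scikit-learn's `Lasso` and `Ridge` scale the squared-error term a little differently, but the idea is the same.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)


def l1_penalty(coefs, lam):
    """Lasso penalty: lambda times the sum of absolute coefficient values."""
    return lam * np.sum(np.abs(coefs))


def l2_penalty(coefs, lam):
    """Ridge penalty: lambda times the sum of squared coefficient values."""
    return lam * np.sum(coefs ** 2)


# Hypothetical values, chosen only to keep the arithmetic easy to follow.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
coefs = np.array([0.5, -2.0])
lam = 0.1

base = mse(y_true, y_pred)                    # (0.25 + 0 + 0.25 + 0) / 4
lasso_loss = base + l1_penalty(coefs, lam)    # adds 0.1 * (0.5 + 2.0)
ridge_loss = base + l2_penalty(coefs, lam)    # adds 0.1 * (0.25 + 4.0)

print("MSE:", base)
print("MSE + L1 penalty:", lasso_loss)
print("MSE + L2 penalty:", ridge_loss)
```

Notice that the large coefficient (-2.0) costs far more under the squared penalty than under the absolute-value penalty.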



## Lasso Regularization (L1)

AKA Least Absolute Shrinkage and Selection Operator

We want to reduce overfitting, and the regularization parameter lambda controls how strongly we do so (scikit-learn calls it `alpha`). Lasso regularization adds a penalty to our loss function equal to lambda times the sum of the absolute values of the coefficients. This shrinks the coefficients toward zero, and some of them land exactly on zero. Beware it can lead to a sparse model (small number of nonzero coefficients)!

Let's look at an example in code!

In [14]:
from sklearn.linear_model import Lasso
import numpy as np

In [15]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

from sklearn.model_selection import train_test_split
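X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False) # unshuffled: first two rows train, last two test, which reproduces the outputs below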

In [16]:
l1 = Lasso(alpha=0.1)
l1.fit(X_train, y_train)
l1.score(X_test, y_test)

0.3200000000000002

In [17]:
print(l1.coef_) # prints the fitted coefficient for each of our features (found by minimizing the penalized loss)
print(l1.intercept_) # intercept of the fitted model

[0.4 0. ]
0.7


Note: Lasso regularization tends to make coefficients zero which ends up reducing features. So it can help us with model/feature selection like we just explored!
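A quick way to see this effect is to refit Lasso with larger and larger `alpha` and count the nonzero coefficients. This is a sketch on synthetic data (an assumption for illustration): only the first two of five columns influence `y_sparse`, and the `alpha` grid is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: columns 0 and 1 matter, columns 2-4 are noise.
rng = np.random.default_rng(0)
X_sparse = rng.normal(size=(100, 5))
y_sparse = (
    3.0 * X_sparse[:, 0]
    - 2.0 * X_sparse[:, 1]
    + rng.normal(scale=0.1, size=100)
)

nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_sparse, y_sparse)
    # Count how many features survive with a nonzero coefficient.
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: coef={np.round(model.coef_, 2)}")

print("Nonzero coefficients per alpha:", nonzero_counts)
```

With a strong enough penalty every coefficient is driven to zero, which is the extreme end of the feature-reducing behavior described in the note.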

## Ridge Regularization (L2)

Similar to lasso, except lasso tends to make some coefficients exactly zero whereas ridge shrinks them toward zero without usually making them exactly zero. In ridge, our penalty is lambda times the sum of the squared coefficients.

Let's look at an example in code!

In [18]:
from sklearn.linear_model import Ridge

In [19]:
l2 = Ridge(alpha=0.1)
l2.fit(X_train, y_train)
l2.score(X_test, y_test)

0.989886972040452

In [20]:
print(l2.coef_)
print(l2.intercept_)

[0.24390244 0.24390244]
0.28048780487804814


**Let's compare the results from Ridge to our results from Lasso!**
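One way to line the two up is to fit both models on the same fixed split. The sketch below assumes the split put the first two rows in training and the last two in testing. That is an assumption, but it is the split that reproduces the coefficients and scores printed above. The `*_fixed` names and the `results` dictionary are just for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Assumed split: first two rows for training, last two for testing.
X_train_fixed = np.array([[1, 2], [3, 4]])
y_train_fixed = np.array([1, 2])
X_test_fixed = np.array([[5, 6], [7, 8]])
y_test_fixed = np.array([3, 4])

results = {}
for name, model in [("Lasso", Lasso(alpha=0.1)), ("Ridge", Ridge(alpha=0.1))]:
    model.fit(X_train_fixed, y_train_fixed)
    results[name] = {
        "coef": model.coef_,
        "intercept": model.intercept_,
        "r2": model.score(X_test_fixed, y_test_fixed),
    }
    print(name, "coef:", np.round(model.coef_, 4),
          "intercept:", round(float(model.intercept_), 4),
          "test R^2:", round(results[name]["r2"], 4))
```

The two features here are perfectly correlated (the second is always the first plus one). Lasso puts all the weight on one feature and zeroes the other, while ridge splits the weight evenly between them.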