# Regularization and Model Selection

## Remember the 3 integral steps!


1.   Model construction
2.   Model usage
3.   Model selection



## We've talked about step 1 (Model construction)...

We use our training data to estimate our model's parameters

## And we've talked about step 2 (Model usage)...

We use our fitted model to make predictions on test data

## ...So what about model selection?

How do we know which model will be the best fit for our data? How do we prevent overfitting?

## Variable selection

We can use stepwise variable selection, combined with cross validation, to select the best predictors!

In stepwise variable selection we iteratively find the optimal set of predictors by slowly building up how many predictors we are using.

1. Start with no predictors
2. Choose the predictor with the highest $R^2$ value (or other metric)
3. Keep adding the predictor that most increases the $R^2$ value, until no remaining candidate increases it (again, or whichever other metric we are using to evaluate our model's accuracy)
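The steps above can be sketched by hand as follows. This is a minimal sketch, not how scikit-learn implements it: the helper name `forward_select`, the toy data, and the choice of mean cross-validated $R^2$ as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, cv=5):
    """Greedy forward selection using mean cross-validated R^2 as the metric."""
    remaining = list(range(X.shape[1]))  # candidate column indices
    selected = []                        # step 1: start with no predictors
    best_score = -np.inf
    while remaining:
        # Score every remaining candidate when added to the current selection
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        if scores[best_j] <= best_score:
            break  # no candidate improves the metric, so stop
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score


# Hypothetical toy data: only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + 0.1 * rng.standard_normal(200)

chosen, score = forward_select(X_demo, y_demo)
print(chosen, round(score, 3))
```

The two informative columns should be picked first; whether any noise column sneaks in afterwards depends on the random draw.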



# Variable (forward) selection example!

Download the .csv file from this link here: https://drive.google.com/file/d/10UeKpTSuqPnTeydHmqrSzYGbvVoCX4zF/view?usp=sharing

Import your libraries...

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Before going forward, let's uninstall and reinstall scikit-learn. This is because Colab's default sklearn is version 0.22.2, and we want 0.24.2, since 0.22.2 does not have sequential feature selection!
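Before reinstalling anything, a quick check like the one below can tell you whether your environment already has what we need. It is only a sketch: the flag name `has_sfs` is made up here, and the `pip` line in the comment is one possible way to upgrade, not necessarily the exact command used in class.

```python
import sklearn

# Try the import we need later; it only exists in scikit-learn 0.24 and newer
try:
    from sklearn.feature_selection import SequentialFeatureSelector
    has_sfs = True
except ImportError:
    has_sfs = False

print(sklearn.__version__, "-> SequentialFeatureSelector available:", has_sfs)

# In a Colab cell, an upgrade could look something like this (then restart the runtime):
#   !pip install -U "scikit-learn>=0.24.2"
```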

Load our dataset...

In [2]:
dataset = pd.read_csv('student_scores_extended.csv')

Explore our dataset...

In [3]:
dataset.head(10)

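|   | Previous score | Prereqs taken | ID number | Hours | Scores |
|---|---|---|---|---|---|
| 0 | 33 | 0 | 8751 | 2.5 | 21 |
| 1 | 45 | 0 | 4186 | 5.1 | 47 |
| 2 | 28 | 0 | 1096 | 3.2 | 27 |
| 3 | 80 | 1 | 5943 | 8.5 | 75 |
| 4 | 50 | 0 | 2723 | 3.5 | 30 |
| 5 | 18 | 0 | 7472 | 1.5 | 20 |
| 6 | 95 | 2 | 7122 | 9.2 | 88 |
| 7 | 75 | 1 | 7776 | 5.5 | 60 |
| 8 | 100 | 1 | 4072 | 8.3 | 81 |
| 9 | 10 | 0 | 6087 | 2.7 | 25 |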


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Previous score  25 non-null     int64  
 1   Prereqs taken   25 non-null     int64  
 2   ID number       25 non-null     int64  
 3   Hours           25 non-null     float64
 4   Scores          25 non-null     int64  
dtypes: float64(1), int64(4)
memory usage: 1.1 KB


Let's plot each attribute and see what it looks like!

In [5]:
# Let's try using matplotlib 
# and then seaborn
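One possible way to fill in this cell is sketched below. To keep it self-contained it rebuilds a small stand-in frame (`demo`, a made-up name) from the ten rows shown by `dataset.head(10)`; with the real file loaded you would plot `dataset` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for `dataset`: the ten rows printed by dataset.head(10) above
demo = pd.DataFrame({
    "Previous score": [33, 45, 28, 80, 50, 18, 95, 75, 100, 10],
    "Prereqs taken": [0, 0, 0, 1, 0, 0, 2, 1, 1, 0],
    "ID number": [8751, 4186, 1096, 5943, 2723, 7472, 7122, 7776, 4072, 6087],
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7],
    "Scores": [21, 47, 27, 75, 30, 20, 88, 60, 81, 25],
})

# One scatter plot per feature, each against the target "Scores"
features = [c for c in demo.columns if c != "Scores"]
fig, axes = plt.subplots(1, len(features), figsize=(16, 3.5), sharey=True)
for ax, col in zip(axes, features):
    ax.scatter(demo[col], demo["Scores"])
    ax.set_xlabel(col)
axes[0].set_ylabel("Scores")
plt.tight_layout()
plt.show()

# Correlation of each feature with the target, as a numeric companion to the plots
corr_with_score = demo.corr()["Scores"].drop("Scores")
print(corr_with_score)

# With seaborn installed, a one-line alternative would be:
#   import seaborn as sns; sns.pairplot(demo)
```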

**What do you notice about these plots already?**

Now let's do forward selection! First, get our training and testing data

In [5]:
from sklearn.model_selection import train_test_split # remember this from yesterday?

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Next, let's decide what model we are using (let's go with linear regression!)

In [6]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

Now, let's implement our step forward feature selector

In [7]:
from sklearn.feature_selection import SequentialFeatureSelector

selector = SequentialFeatureSelector(
    lr,
    # n_features_to_select='auto',
    # direction='forward',
    # scoring='r2',
    # cv=5,
)

selector.fit(X_train,y_train)

SequentialFeatureSelector(estimator=LinearRegression())

Now, let's print out which columns (features) were chosen as the optimal features!

In [8]:
selector.get_support()

array([False,  True, False,  True])

This means it chose the columns at index 1 and 3 (`Prereqs taken` and `Hours`) as our optimal features! The default `n_features_to_select=None` selects half of the features.
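To read the mask as column names rather than positions, you can index an array of names with it. The sketch below hard-codes the mask printed above and the feature column order from `dataset.info()`; the variable names are made up for illustration.

```python
import numpy as np

# Feature columns in the order they appear in X (every column except the target "Scores")
feature_names = np.array(["Previous score", "Prereqs taken", "ID number", "Hours"])

# Boolean mask as printed by selector.get_support() above
support_mask = np.array([False, True, False, True])

# Boolean indexing keeps only the names where the mask is True
selected_names = feature_names[support_mask]
print(selected_names.tolist())  # ['Prereqs taken', 'Hours']
```

In the live notebook, `selector.get_support(indices=True)` returns the positions directly instead of a boolean mask.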

**Do those results make sense to you?**

You can also call `selector.transform(X)` to directly get a version of `X` that contains only the selected features

In [9]:
selector.transform(X_train).shape

(20, 2)

## Train, Validation, Test

In a standard machine learning task:
- We split the dataset into train, validation, and test, where:
  - Training dataset is used to __train__ the model, that is, optimize the model parameters (such as $\beta$ in Linear Regression model).
  - Validation dataset is used to perform __model selection__, i.e., pick the hyperparameters (such as $\lambda$ in a Lasso or Ridge Regression model).
  - Test dataset is used to perform __model evaluation__; this is the result we report, and it should reflect the model's true performance on unseen data.


In some cases (such as when the model is simple enough and without any hyperparameters) we can also just do a train-test split. In this case: __one doesn't touch the test dataset until the final evaluation!__

A common trick for getting a three-way split is to use the `sklearn.model_selection.train_test_split` function twice

In [10]:
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 20)
Y = np.random.randn(1000)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(800, 20)
(800,)
(200, 20)
(200,)


In [11]:
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 20)
Y = np.random.randn(1000)

# What can I do if I want to split into 7:2:1?
# First hold out 10% for test, then take 2/9 of the remaining 90% for validation (0.9 * 2/9 = 0.2)
X_train_and_val, X_test, Y_train_and_val, Y_test = train_test_split(X, Y, test_size=0.1)
X_train, X_val, Y_train, Y_val = train_test_split(X_train_and_val, Y_train_and_val, test_size=2/9)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(700, 20)
(200, 20)
(100, 20)


# Cross Validation

We want to prevent overfitting! So we can cross validate our models: train and evaluate them on several different splits of the data and average the performance. There are various methods, explained more in lecture:

*   K-fold
*   Leave-One-Out
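As a rough sketch of how the two methods differ in practice, the snippet below runs both on made-up toy data (`X_demo`, `y_demo` are hypothetical). It scores with negative MSE rather than the default $R^2$, because $R^2$ is not defined on a test fold that holds a single sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Hypothetical toy data: 12 samples, 2 features, a little noise
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((12, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + 0.1 * rng.standard_normal(12)

# K-fold: 4 folds of 3 samples each, so 4 fits
kfold_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=KFold(n_splits=4), scoring="neg_mean_squared_error",
)

# Leave-One-Out: one fold per sample, so 12 fits
loo_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)

# Flip the sign to report plain MSE
print(len(kfold_scores), -kfold_scores.mean())
print(len(loo_scores), -loo_scores.mean())
```

Leave-One-Out uses almost all the data for every fit, but it needs as many fits as there are samples, which is why it is usually reserved for small datasets.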

Let's take a look at a dummy example...




Here we show three different ways to do it

In [12]:
from sklearn.model_selection import KFold
import numpy as np

X = np.random.randn(10, 2)
y = np.random.randn(10, 1)
print(X)
print(y)

kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

[[ 0.71751226  0.07990351]
 [-0.32354951  0.78929814]
 [ 0.66311697  0.49846643]
 [ 1.35059174 -1.71411862]
 [-1.11691643 -1.12927171]
 [ 0.6897902   0.44821461]
 [ 0.16038214 -1.16062144]
 [ 1.18361594 -0.02903249]
 [ 0.43686078 -0.39934185]
 [ 0.51059285 -0.74870723]]
[[ 0.1463936 ]
 [-1.10139765]
 [ 0.08066114]
 [-0.12159536]
 [-0.53065554]
 [ 0.01220013]
 [-2.33123525]
 [-0.80534093]
 [-2.00930942]
 [ 0.91650753]]
TRAIN: [5 6 7 8 9] TEST: [0 1 2 3 4]
TRAIN: [0 1 2 3 4] TEST: [5 6 7 8 9]
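The loop above only prints the splits. A sketch of taking it one step further, fitting and scoring a model on each fold, might look like this. The toy data and the names `fold_scores`, `manual_mean`, and `helper_mean` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy data with a real linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((50, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 0.5 * rng.standard_normal(50)

kf = KFold(n_splits=5)
fold_scores = []
for train_index, test_index in kf.split(X_demo):
    model = LinearRegression()
    model.fit(X_demo[train_index], y_demo[train_index])
    # .score() returns R^2 on the held-out fold
    fold_scores.append(model.score(X_demo[test_index], y_demo[test_index]))

manual_mean = np.mean(fold_scores)

# The helper function does the same fit/score loop for us
helper_mean = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf).mean()
print(manual_mean, helper_mean)
```

Both numbers should agree, since `cross_val_score` runs the same splits and the same default scorer under the hood.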


In [13]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Lasso

lr = Lasso()
cv_results = cross_validate(lr, X, y, cv=5, return_estimator=True)
sorted(cv_results.keys())

['estimator', 'fit_time', 'score_time', 'test_score']

In [14]:
cv_results

{'fit_time': array([0.00901103, 0.00133395, 0.00101209, 0.00099993, 0.0010252 ]),
 'score_time': array([0.        , 0.00099921, 0.00100636, 0.        , 0.        ]),
 'estimator': [Lasso(), Lasso(), Lasso(), Lasso(), Lasso()],
 'test_score': array([-3.76721979e-02, -4.68763248e+01, -2.10641728e+00, -2.65171000e+00,
        -5.71431324e-04])}

In [15]:
cv_results["test_score"]

array([-3.76721979e-02, -4.68763248e+01, -2.10641728e+00, -2.65171000e+00,
       -5.71431324e-04])

In [16]:
cv_results["test_score"].mean()

-10.33453913476973

In [19]:
cv_results["estimator"]

[Lasso(), Lasso(), Lasso(), Lasso(), Lasso()]

In [17]:
# easiest one
from sklearn.model_selection import cross_val_score

print(cross_val_score(lr, X,y, cv=5).mean())

-10.33453913476973


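__Q: Why do we want to use cross validation?__

- It helps prevent overfitting: we select the model by its performance on held-out data rather than on the data it was trained on.
- It gives a more stable, lower-variance, and more thorough indication of model performance than a single validation split.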


__Q: When may we avoid using cross validation?__
- when the model is too expensive to train.
- when the data is too big

# Regularization

We don't want our model to chase extremes/outliers or noise in the training data. Regularization modifies our loss function (how much our predictions differ from the actual values) by adding a penalty on the size of the coefficients, which helps with overfitting! We can use two types of regularization methods.

1.   L1 - Lasso Regularization
2.   L2 - Ridge Regularization

Let's use MSE (Mean Squared Error) as our loss function



## Lasso Regularization (L1)

AKA Least Absolute Shrinkage and Selection Operator

We want to reduce overfitting, and we control how strongly we regularize with the parameter lambda ($\lambda$). Lasso regularization adds a penalty to our loss function equal to $\lambda$ times the sum of the absolute values of the coefficients. This shrinks our coefficients toward zero. Beware: it can lead to a sparse model (a small number of nonzero coefficients)!
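Written out with MSE as the loss, the lasso objective looks roughly like this, where $n$ is the number of samples, $p$ the number of features, and $\hat{y}_i$ the prediction for sample $i$ (this notation is assumed here, not taken from lecture):

$$
L_{\text{lasso}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|
$$

Note that scikit-learn's `Lasso` calls $\lambda$ `alpha` and scales the squared-error term by $\frac{1}{2n}$ instead of $\frac{1}{n}$, so the same numeric value does not mean exactly the same strength in both forms.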

Let's look at an example in code!

In [18]:
from sklearn.linear_model import Lasso
import numpy as np

In [19]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [20]:
l1 = Lasso(alpha=0.1)
l1.fit(X_train, y_train)
l1.score(X_test, y_test)

0.9995061728395062

In [21]:
print(l1.coef_) # fitted coefficient for each of our features
print(l1.intercept_) # intercept of the fitted model

[0.48888889 0.        ]
0.5444444444444443


Note: Lasso regularization tends to set some coefficients exactly to zero, which effectively removes those features. So it can help us with model/feature selection like we just explored!
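To see the zeroing effect on something bigger than four data points, here is a small sketch on made-up data where only the first two of ten features matter. It also fits `Ridge` with the same `alpha` for comparison. All names and numbers below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 10 features, but only the first two influence y
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 10))
true_beta = np.array([3.0, -2.0] + [0.0] * 8)
y_demo = X_demo @ true_beta + 0.1 * rng.standard_normal(200)

for alpha in [0.01, 0.1, 0.5]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Count coefficients that are exactly zero in each fitted model
    n_zero_lasso = int(np.sum(lasso.coef_ == 0))
    n_zero_ridge = int(np.sum(ridge.coef_ == 0))
    print(f"alpha={alpha}: lasso zeros={n_zero_lasso}, ridge zeros={n_zero_ridge}")
```

As `alpha` grows, lasso should zero out more of the uninformative features, while the ridge coefficients get smaller without landing exactly on zero.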

## Ridge Regularization (L2)

Similar to lasso, except lasso tends to set coefficients exactly to zero, whereas ridge shrinks them toward zero but generally does not make them exactly zero. In ridge, our penalty is the sum of the squares of our coefficients.
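With MSE as the loss, the ridge objective can be written roughly as follows, where $n$ is the number of samples, $p$ the number of features, and $\hat{y}_i$ the prediction for sample $i$ (notation assumed here for illustration):

$$
L_{\text{ridge}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

scikit-learn's `Ridge` calls $\lambda$ `alpha` and uses the unscaled sum of squared errors rather than the mean, so its `alpha` is on a different scale from the $\lambda$ in this form.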

Let's look at an example in code!

In [18]:
from sklearn.linear_model import Ridge

In [19]:
l2 = Ridge(alpha=0.1)
l2.fit(X_train, y_train)
l2.score(X_test, y_test)

0.989886972040452

In [20]:
print(l2.coef_) # fitted coefficient for each of our features
print(l2.intercept_) # intercept of the fitted model


[0.24390244 0.24390244]
0.28048780487804814


## A Complete Pipeline

In this example we demonstrate a standard way to do model selection and evaluation
- Step 1: Split the dataset into train-val and test 
- Step 2: Define the base estimator (in this case, Lasso)
- Step 3: Train and perform model selection (choose $\lambda$, called `alpha` in scikit-learn) using cross-validation
- Step 4: Model Evaluation

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

In [23]:
X = np.random.randn(1000, 20)
beta = np.random.randn(20)
Y = X @ beta

In [24]:
# step1 
X_train_val, X_test, Y_train_val, Y_test = train_test_split(X,Y, test_size=0.1)

In [29]:
alphas = [0.01, 0.1, 1.0, 10, 20]

In [30]:
for alpha in alphas:
    lr = Lasso(alpha=alpha)
    val_score = cross_val_score(lr, X_train_val, Y_train_val, cv=5).mean()
    print(f"alpha = {alpha}, val_score = {val_score}")

alpha = 0.01, val_score = 0.9999393095696005
alpha = 0.1, val_score = 0.994648878736583
alpha = 1.0, val_score = 0.6987415636801432
alpha = 10, val_score = -0.0016915166883781473
alpha = 20, val_score = -0.0016915166883781473


In [35]:
best_model_config = None 
best_score = -np.inf 

for alpha in alphas:
    lr = Lasso(alpha=alpha)
    val_score = cross_val_score(lr, X_train_val, Y_train_val, cv=5).mean()
    print(f"alpha = {alpha}, val_score = {val_score}")
    if val_score >= best_score:
        # print("better")
        best_model_config = alpha
        best_score = val_score

        
print(best_model_config)

alpha = 0.01, val_score = 0.9999393095696005
alpha = 0.1, val_score = 0.994648878736583
alpha = 1.0, val_score = 0.6987415636801432
alpha = 10, val_score = -0.0016915166883781473
alpha = 20, val_score = -0.0016915166883781473
0.01
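The manual search above can also be handed to scikit-learn's `GridSearchCV`. The sketch below regenerates similar synthetic data under its own made-up names (`X_demo`, `X_tv`, and so on) so it can run on its own; it is an alternative to the loop, not the method used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical noiseless linear data, similar in shape to the example above
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((1000, 20))
y_demo = X_demo @ rng.standard_normal(20)

# Step 1: hold out a test set that the search never sees
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# Steps 2 and 3: base estimator plus a cross-validated search over alpha
search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10, 20]},
    cv=5,          # 5-fold cross-validation on the train+val portion
    scoring="r2",  # the same metric that Lasso.score() reports
)
search.fit(X_tv, y_tv)

# Step 4: the best model is refit on all of X_tv, then scored once on the test set
print(search.best_params_)
print(search.score(X_te, y_te))
```

By default `GridSearchCV` refits the best configuration on the full train+val data, which is the same thing the final cell of this notebook does by hand.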


In [37]:
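# Step 4: refit on the train+val split from step 1
# (X_train_val, not the X_train_and_val arrays left over from the earlier 7:2:1 example)
lr = Lasso(alpha=best_model_config)
lr.fit(X_train_val, Y_train_val)
y_pred = lr.predict(X_test)
print(mean_squared_error(Y_test, y_pred))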
