# Regularization

## Why Regularize?

When we try to fit a model closely to our training data, we often end up overfitting. Regularization discourages overly complex models by adding a penalty term to the loss function.

### The Bias-Variance Tradeoff

When we did Linear Regression, we briefly talked about the Bias-Variance Tradeoff.

![](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)

![](https://miro.medium.com/max/544/1*Y-yJiR0FzMgchPA-Fm5c1Q.jpeg)

Overfitting means the model learns the noise in the training data, so it generalizes poorly to the testing data.

**High bias** 

 - Systematic error in predictions (i.e. predictions are off on average)
 - Bias is about the strength of assumptions the model makes
 - Underfit models tend to have high bias


**High variance**

 - The model is highly sensitive to changes in the data
 - Overfit models tend to have low bias and high variance
    
    
![](https://gblobscdn.gitbook.com/assets%2F-LvBP1svpACTB1R1x_U4%2F-LvNWUoWieQqaGmU_gl9%2F-LvNoby-llz4QzAK15nL%2Fimage.png?alt=media&token=41720ce9-bb66-4419-9bd8-640abf1fc415)

 - Underfit Models fail to capture all of the information in the data
 - Overfit models fit to the noise in the data and fail to generalize


**How would we know if our model is overfit or underfit?**
 - Use a train-test split and compare the training error with the testing error
 - As model complexity increases, so does the risk of overfitting
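
To make the train/test comparison concrete, here is a minimal sketch on synthetic data (the noisy sine curve and the polynomial degrees are assumptions chosen for illustration, not part of the Ames example). It fits increasingly flexible polynomial models and records both errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=60).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.3, size=60)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

errors = {}
for degree in [1, 4, 15]:
    # Higher degree = more complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr))
    test_mse = mean_squared_error(y_te, model.predict(x_te))
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only go down as the model gets more flexible, so on its own it tells us nothing about overfitting. The gap between training and testing error is the thing to watch.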

## Ridge and Lasso

Ridge and Lasso regression are two examples of penalized estimation. Penalized estimation makes some or all of the coefficients smaller in magnitude (closer to zero). Some of the penalties have the property of performing both variable selection (setting some coefficients exactly equal to zero) and shrinking the other coefficients. 

In Ridge regression, the cost function is changed by adding a penalty term proportional to the sum of the squared coefficients. The hyperparameter $\lambda$ controls the strength of the penalty (scikit-learn calls it `alpha`).

$$ \text{cost_function_ridge}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p m_j^2 = \sum_{i=1}^n(y_i - \sum_{j=1}^p(m_jx_{ij})-b)^2 + \lambda \sum_{j=1}^p m_j^2$$

Lasso regression (Least Absolute Shrinkage and Selection Operator) is very similar to Ridge regression, except that the penalty term uses the absolute values of the coefficients rather than their squares.

$$ \text{cost_function_lasso}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \mid m_j \mid = \sum_{i=1}^n(y_i - \sum_{j=1}^p(m_jx_{ij})-b)^2 + \lambda \sum_{j=1}^p \mid m_j \mid$$
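
The two cost functions can be written out directly in NumPy. This is a sketch of the formulas only: the function names `ridge_cost` and `lasso_cost` and the toy numbers are made up for illustration.

```python
import numpy as np

def ridge_cost(X, y, m, b, lam):
    """Sum of squared residuals plus lam * (sum of squared coefficients)."""
    residuals = y - (X @ m + b)
    return np.sum(residuals ** 2) + lam * np.sum(m ** 2)

def lasso_cost(X, y, m, b, lam):
    """Sum of squared residuals plus lam * (sum of absolute coefficients)."""
    residuals = y - (X @ m + b)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(m))

# Toy data (assumption for illustration)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
m = np.array([0.5, -2.0])   # slopes
b = 0.1                     # intercept (not penalized)

print("ridge:", ridge_cost(X, y, m, b, lam=2.0))
print("lasso:", lasso_cost(X, y, m, b, lam=2.0))
```

Note that only the slopes `m` are penalized, not the intercept `b`. scikit-learn's `Lasso` also scales the squared-error term by `1 / (2 * n_samples)`, so its `alpha` is not numerically interchangeable with the $\lambda$ in the formula above.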

So we're penalizing large coefficients -- what are the effects/implications of that?

### Standardization before Regularization

An important step before using either Lasso or Ridge regularization is to standardize your data so that every feature is on the same scale. Regularization penalizes larger coefficients, and the size of a coefficient depends on the units of its feature, so **if you have features that are on different scales, some will get unfairly penalized**. A downside of standardization is that the values of the coefficients become less interpretable and must be transformed back to their original scale if you want to interpret how a one-unit change in a feature impacts the target variable.
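
As a sketch of the "transform back" step, the snippet below fits a Ridge model on standardized features and converts its coefficients back to the original units. The synthetic `sqft` and `rooms` features are assumptions for illustration; the conversion relies only on the fitted `scale_` and `mean_` attributes of `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption for illustration)
rng = np.random.default_rng(1)
sqft = rng.uniform(500, 3000, size=200)
rooms = rng.integers(1, 8, size=200).astype(float)
X = np.column_stack([sqft, rooms])
y = 100 * sqft + 5000 * rooms + rng.normal(scale=1000, size=200)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = Ridge(alpha=1.0).fit(X_scaled, y)

# Coefficients per one standard deviation -> coefficients per original unit
coef_original = model.coef_ / scaler.scale_
intercept_original = model.intercept_ - np.sum(coef_original * scaler.mean_)

print("per-std coefficients: ", model.coef_)
print("per-unit coefficients:", coef_original)
```

The back-transformed coefficients give exactly the same predictions when applied to the raw features; they are just expressed per original unit instead of per standard deviation.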

**Scaler documentation:**

* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

## Let's Code! 

Start with a regular Linear Regression.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('data/ames_train.csv') # Ames housing data

# Drop sale detail columns 
df = df.drop(columns = ['Id', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

# Create X and y
y = df['SalePrice']
X = df.drop(columns=['SalePrice'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Time to Clean/Process

In [3]:
# Explore X_train
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1023 to 1126
Data columns (total 75 columns):
MSSubClass       1095 non-null int64
MSZoning         1095 non-null object
LotFrontage      895 non-null float64
LotArea          1095 non-null int64
Street           1095 non-null object
Alley            70 non-null object
LotShape         1095 non-null object
LandContour      1095 non-null object
Utilities        1095 non-null object
LotConfig        1095 non-null object
LandSlope        1095 non-null object
Neighborhood     1095 non-null object
Condition1       1095 non-null object
Condition2       1095 non-null object
BldgType         1095 non-null object
HouseStyle       1095 non-null object
OverallQual      1095 non-null int64
OverallCond      1095 non-null int64
YearBuilt        1095 non-null int64
YearRemodAdd     1095 non-null int64
RoofStyle        1095 non-null object
RoofMatl         1095 non-null object
Exterior1st      1095 non-null object
Exterior2nd      1095 no

In [4]:
# Let's check the percentage of our training data that's null per column
null_perc = X_train.isna().sum() / len(X_train)
null_perc.sort_values(ascending=False).head(10)

PoolQC          0.994521
MiscFeature     0.960731
Alley           0.936073
Fence           0.800913
FireplaceQu     0.467580
LotFrontage     0.182648
GarageQual      0.052968
GarageType      0.052968
GarageYrBlt     0.052968
GarageFinish    0.052968
dtype: float64

In [5]:
# Drop where nulls are more than 10% of column
null_cols_to_drop = list(null_perc.loc[null_perc > 0.1].index)
print(null_cols_to_drop)

X_train = X_train.drop(null_cols_to_drop, axis=1)
X_test = X_test.drop(null_cols_to_drop, axis=1)

['LotFrontage', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']


In [6]:
# Start with the continuous variables
print(X_train['Fireplaces'].dtype)
print(X_train['GarageYrBlt'].dtype)

num_types = ['int64', 'float64']

# Grab only numeric features
num_cols = []
for col in X_train.columns:
    if X_train[col].dtype in num_types:
        num_cols.append(col)

# list comprehension
# num_cols = [c for c in X_train.columns if X_train[c].dtype in num_types]

int64
float64
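
An alternative to looping over dtypes is pandas' `select_dtypes`. The tiny `toy` frame below is an assumption for illustration; the same two calls would apply to `X_train`.

```python
import pandas as pd

# Toy frame standing in for X_train (assumption for illustration)
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "MasVnrArea": [196.0, None],
    "Street": ["Pave", "Grvl"],
})

# Numeric columns (ints and floats) vs. everything else
toy_num_cols = toy.select_dtypes(include="number").columns.tolist()
toy_cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_num_cols)
print(toy_cat_cols)
```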


In [7]:
X_train[num_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1023 to 1126
Data columns (total 33 columns):
MSSubClass       1095 non-null int64
LotArea          1095 non-null int64
OverallQual      1095 non-null int64
OverallCond      1095 non-null int64
YearBuilt        1095 non-null int64
YearRemodAdd     1095 non-null int64
MasVnrArea       1091 non-null float64
BsmtFinSF1       1095 non-null int64
BsmtFinSF2       1095 non-null int64
BsmtUnfSF        1095 non-null int64
TotalBsmtSF      1095 non-null int64
1stFlrSF         1095 non-null int64
2ndFlrSF         1095 non-null int64
LowQualFinSF     1095 non-null int64
GrLivArea        1095 non-null int64
BsmtFullBath     1095 non-null int64
BsmtHalfBath     1095 non-null int64
FullBath         1095 non-null int64
HalfBath         1095 non-null int64
BedroomAbvGr     1095 non-null int64
KitchenAbvGr     1095 non-null int64
TotRmsAbvGrd     1095 non-null int64
Fireplaces       1095 non-null int64
GarageYrBlt      1037 non-null float6

In [8]:
X_train_cont = X_train[num_cols]
X_test_cont = X_test[num_cols]

In [9]:
# Impute missing values with 0 using SimpleImputer
# (most columns look like they just don't have details)
imputer = SimpleImputer(strategy='constant', fill_value=0)

X_train_impute = imputer.fit_transform(X_train_cont)
X_test_impute = imputer.transform(X_test_cont)

In [10]:
# Scale the train and test data
scaler = MinMaxScaler()

X_train_imsc = scaler.fit_transform(X_train_impute)
X_test_imsc = scaler.transform(X_test_impute)
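
The impute-then-scale steps can also be chained in a scikit-learn `Pipeline`, which keeps the "fit on train, transform test" discipline in one object. This is a sketch on toy arrays (the numbers and the name `num_pipeline` are assumptions), not a replacement for the cells above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("scale", MinMaxScaler()),
])

# Toy arrays (assumption for illustration)
train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
test = np.array([[2.0, np.nan], [6.0, 40.0]])

train_out = num_pipeline.fit_transform(train)  # learns fill value and min/max from train
test_out = num_pipeline.transform(test)        # reuses the train statistics

print(train_out)
print(test_out)
```

Because the scaler only knows the training min and max, transformed test values can fall outside the [0, 1] range. That is expected and not a sign of leakage.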

In [11]:
X_train['MSZoning']

1023    RL
810     RL
1384    RL
626     RL
813     RL
        ..
1095    RL
1130    RL
1294    RL
860     RL
1126    RL
Name: MSZoning, Length: 1095, dtype: object

In [12]:
# Now time for the categorical columns

# Create X_cat which contains only the categorical variables
cat_cols = [c for c in X_train.columns if X_train[c].dtype in ['object']]

X_train_cat = X_train[cat_cols]
X_test_cat = X_test[cat_cols]

# Fill missing values with the string 'missing'
X_train_cat = X_train_cat.fillna(value='missing')
X_test_cat = X_test_cat.fillna(value='missing')

# Same as:
# imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')
# imputer_cat.fit_transform(X_train_cat)
# imputer_cat.transform(X_test_cat)

In [13]:
# Exploring column percentages

# Let's remove any column where the most common value is more than 90% of that col
col_to_drop = []
for col in X_train_cat.columns:
    col_series = X_train_cat[col].value_counts()
    display(col_series/len(X_train_cat))
    
    if col_series.iloc[0]/len(X_train_cat) > .9:
        col_to_drop.append(col)

RL         0.790868
RM         0.149772
FV         0.042922
RH         0.012785
C (all)    0.003653
Name: MSZoning, dtype: float64

Pave    0.996347
Grvl    0.003653
Name: Street, dtype: float64

Reg    0.621918
IR1    0.338813
IR2    0.031963
IR3    0.007306
Name: LotShape, dtype: float64

Lvl    0.905936
Bnk    0.041096
HLS    0.029224
Low    0.023744
Name: LandContour, dtype: float64

AllPub    0.999087
NoSeWa    0.000913
Name: Utilities, dtype: float64

Inside     0.699543
Corner     0.190868
CulDSac    0.073059
FR2        0.033790
FR3        0.002740
Name: LotConfig, dtype: float64

Gtl    0.946119
Mod    0.045662
Sev    0.008219
Name: LandSlope, dtype: float64

NAmes      0.152511
CollgCr    0.102283
OldTown    0.079452
Edwards    0.075799
Somerst    0.056621
NWAmes     0.054795
Gilbert    0.053881
NridgHt    0.052968
Sawyer     0.046575
BrkSide    0.040183
SawyerW    0.036530
Crawfor    0.035616
Mitchel    0.034703
NoRidge    0.028311
Timber     0.024658
IDOTRR     0.022831
StoneBr    0.018265
ClearCr    0.017352
SWISU      0.016438
Blmngtn    0.013699
BrDale     0.011872
MeadowV    0.009132
Veenker    0.008219
NPkVill    0.006393
Blueste    0.000913
Name: Neighborhood, dtype: float64

Norm      0.863927
Feedr     0.053881
Artery    0.033790
RRAn      0.015525
PosN      0.012785
RRAe      0.009132
PosA      0.005479
RRNn      0.004566
RRNe      0.000913
Name: Condition1, dtype: float64

Norm      0.991781
Feedr     0.002740
PosN      0.001826
Artery    0.001826
RRAe      0.000913
RRAn      0.000913
Name: Condition2, dtype: float64

1Fam      0.835616
TwnhsE    0.076712
Duplex    0.033790
Twnhs     0.029224
2fmCon    0.024658
Name: BldgType, dtype: float64

1Story    0.494064
2Story    0.309589
1.5Fin    0.104110
SLvl      0.045662
SFoyer    0.021005
1.5Unf    0.010046
2.5Unf    0.009132
2.5Fin    0.006393
Name: HouseStyle, dtype: float64

Gable      0.769863
Hip        0.205479
Flat       0.010046
Gambrel    0.008219
Mansard    0.004566
Shed       0.001826
Name: RoofStyle, dtype: float64

CompShg    0.982648
Tar&Grv    0.008219
WdShngl    0.003653
WdShake    0.002740
Metal      0.000913
ClyTile    0.000913
Roll       0.000913
Name: RoofMatl, dtype: float64

VinylSd    0.359817
HdBoard    0.152511
MetalSd    0.146119
Wd Sdng    0.145205
Plywood    0.068493
CemntBd    0.038356
BrkFace    0.034703
Stucco     0.017352
WdShing    0.017352
AsbShng    0.014612
BrkComm    0.001826
CBlock     0.000913
Stone      0.000913
ImStucc    0.000913
AsphShn    0.000913
Name: Exterior1st, dtype: float64

VinylSd    0.351598
HdBoard    0.140639
Wd Sdng    0.140639
MetalSd    0.139726
Plywood    0.094064
CmentBd    0.037443
Wd Shng    0.029224
Stucco     0.019178
AsbShng    0.015525
BrkFace    0.013699
Brk Cmn    0.005479
ImStucc    0.005479
Stone      0.002740
AsphShn    0.002740
CBlock     0.000913
Other      0.000913
Name: Exterior2nd, dtype: float64

None       0.579909
BrkFace    0.315068
Stone      0.090411
BrkCmn     0.010959
missing    0.003653
Name: MasVnrType, dtype: float64

TA    0.622831
Gd    0.331507
Ex    0.035616
Fa    0.010046
Name: ExterQual, dtype: float64

TA    0.876712
Gd    0.098630
Fa    0.021918
Ex    0.001826
Po    0.000913
Name: ExterCond, dtype: float64

PConc     0.449315
CBlock    0.427397
BrkTil    0.098630
Slab      0.017352
Stone     0.004566
Wood      0.002740
Name: Foundation, dtype: float64

TA         0.442922
Gd         0.421005
Ex         0.085845
Fa         0.025571
missing    0.024658
Name: BsmtQual, dtype: float64

TA         0.892237
Gd         0.047489
Fa         0.034703
missing    0.024658
Po         0.000913
Name: BsmtCond, dtype: float64

No         0.656621
Av         0.148858
Gd         0.089498
Mn         0.080365
missing    0.024658
Name: BsmtExposure, dtype: float64

Unf        0.287671
GLQ        0.285845
ALQ        0.155251
BLQ        0.102283
Rec        0.092237
LwQ        0.052055
missing    0.024658
Name: BsmtFinType1, dtype: float64

Unf        0.863927
Rec        0.038356
LwQ        0.030137
missing    0.024658
BLQ        0.018265
ALQ        0.015525
GLQ        0.009132
Name: BsmtFinType2, dtype: float64

GasA     0.977169
GasW     0.013699
Grav     0.003653
Wall     0.002740
OthW     0.001826
Floor    0.000913
Name: Heating, dtype: float64

Ex    0.506849
TA    0.291324
Gd    0.165297
Fa    0.035616
Po    0.000913
Name: HeatingQC, dtype: float64

Y    0.928767
N    0.071233
Name: CentralAir, dtype: float64

SBrkr      0.915068
FuseA      0.060274
FuseF      0.021005
FuseP      0.002740
missing    0.000913
Name: Electrical, dtype: float64

TA    0.502283
Gd    0.401826
Ex    0.066667
Fa    0.029224
Name: KitchenQual, dtype: float64

Typ     0.925114
Min2    0.025571
Min1    0.024658
Mod     0.011872
Maj1    0.008219
Maj2    0.003653
Sev     0.000913
Name: Functional, dtype: float64

Attchd     0.594521
Detchd     0.263927
BuiltIn    0.063014
missing    0.052968
Basment    0.013699
CarPort    0.006393
2Types     0.005479
Name: GarageType, dtype: float64

Unf        0.410046
RFn        0.293151
Fin        0.243836
missing    0.052968
Name: GarageFinish, dtype: float64

TA         0.900457
missing    0.052968
Fa         0.031963
Gd         0.010959
Ex         0.002740
Po         0.000913
Name: GarageQual, dtype: float64

TA         0.908676
missing    0.052968
Fa         0.024658
Gd         0.008219
Po         0.003653
Ex         0.001826
Name: GarageCond, dtype: float64

Y    0.916895
N    0.061187
P    0.021918
Name: PavedDrive, dtype: float64

In [14]:
col_to_drop

['Street',
 'LandContour',
 'Utilities',
 'LandSlope',
 'Condition2',
 'RoofMatl',
 'Heating',
 'CentralAir',
 'Electrical',
 'Functional',
 'GarageQual',
 'GarageCond',
 'PavedDrive']

In [15]:
# Now drop those
X_train_cat = X_train_cat.drop(col_to_drop, axis=1)
X_test_cat = X_test_cat.drop(col_to_drop, axis=1)
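
The same "most common value covers more than 90%" check can be sketched without an explicit loop, using `value_counts(normalize=True)`. The `toy_cat` frame is an assumption for illustration.

```python
import pandas as pd

# Toy categorical frame (assumption for illustration)
toy_cat = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],        # 95% one value
    "LotShape": ["Reg"] * 12 + ["IR1"] * 8,    # 60% one value
})

# Share of the most common value in each column
# (value_counts sorts in descending order, so the first entry is the largest)
top_share = toy_cat.apply(lambda s: s.value_counts(normalize=True).iloc[0])
dominated = top_share[top_share > 0.9].index.tolist()

print(top_share)
print(dominated)
```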

In [16]:
# OneHotEncode categorical variables
ohe = OneHotEncoder(handle_unknown='ignore')

X_train_ohe = ohe.fit_transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)

# Convert these columns into a DataFrame
# (get_feature_names was removed in scikit-learn 1.2; newer versions use get_feature_names_out)
ohe_col_names = ohe.get_feature_names(input_features=X_train_cat.columns)
 
cat_train_df = pd.DataFrame(X_train_ohe.todense(), columns=ohe_col_names)
cat_test_df = pd.DataFrame(X_test_ohe.todense(), columns=ohe_col_names)

Note: `.todense()` returns a `numpy.matrix`; `.toarray()` would return a regular NumPy array instead.

In [17]:
cat_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Columns: 167 entries, MSZoning_C (all) to GarageFinish_missing
dtypes: float64(167)
memory usage: 1.4 MB
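
To see what `handle_unknown='ignore'` does, here is a small sketch with a category that shows up only in the test data. The toy frames are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cat = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})
test_cat = pd.DataFrame({"MSZoning": ["RM", "RH"]})  # "RH" never appears in train

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_cat)

encoded_test = enc.transform(test_cat).toarray()
print(enc.categories_[0])  # categories learned from the training data
print(encoded_test)        # the unseen "RH" row is encoded as all zeros
```

Without `handle_unknown='ignore'`, transforming a row with an unseen category raises an error, which is why it matters after a train-test split on rare categories.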


In [18]:
# Put it all back together
X_train_all = pd.concat([pd.DataFrame(X_train_imsc), cat_train_df], axis=1)
X_test_all = pd.concat([pd.DataFrame(X_test_imsc), cat_test_df], axis=1)

# Fit the model
linreg = LinearRegression()
linreg.fit(X_train_all, y_train)

LinearRegression()

In [19]:
# Write a quick evaluation function
def evaluate(train_actual, train_predicted, test_actual, test_predicted):
    '''
    Takes in both actual and predicted values, for the train and test set
    Then prints the scores based on those values
    
    Inputs:
    -------
    train_actual - actual target values for the train set
    train_predicted - predicted target values for the train set
    test_actual - actual target values for the test set
    test_predicted - predicted target values for the test set
    '''
    print('Train R2:', r2_score(train_actual, train_predicted))
    print('Test R2:', r2_score(test_actual, test_predicted))
    print("*****")
    print('Train MSE:', mean_squared_error(train_actual, train_predicted))
    print('Test MSE:', mean_squared_error(test_actual, test_predicted))
    print("*****")
    print('Train RMSE:', mean_squared_error(train_actual, train_predicted, squared=False))
    print('Test RMSE:', mean_squared_error(test_actual, test_predicted, squared=False))

In [20]:
# Grab predictions and evaluate
train_preds = linreg.predict(X_train_all)
test_preds = linreg.predict(X_test_all)

evaluate(y_train, train_preds, y_test, test_preds)

Train R2: 0.8902014242430607
Test R2: -2.1663313578071472e+17
*****
Train MSE: 666636028.1309931
Test MSE: 1.517582067121294e+27
*****
Train RMSE: 25819.29565520704
Test RMSE: 38956155702549.68
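
A side note on the RMSE lines in `evaluate`: the `squared=False` argument is deprecated in recent scikit-learn releases, so a version-independent sketch is to take the square root of the MSE yourself. The toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (assumption for illustration)
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(actual, predicted)  # mean of squared errors: (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                          # back in the units of the target

print("MSE:", mse)
print("RMSE:", rmse)
```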


In [21]:
# Let's wrap up that coefficient exploration in a function
def eval_coefficients(model, column_names):
    '''
    Prints an exploration of the coefficients
    
    Inputs:
    model - a fit linear model (sklearn)
    column_names - a list of feature names that matches the order passed into the model
    
    Outputs:
    coefs - a Series, sorted by coefficient value
    '''

    print("Total number of coefficients: ", len(model.coef_))
    print("Coefficients close to zero: ", sum(abs(model.coef_) < 10**(-10)))
    print(f"Intercept: {model.intercept_}")
    
    coefs = pd.Series(model.coef_, index= column_names)
    display(coefs.sort_values(ascending=False))
    return coefs.sort_values(ascending=False)

In [22]:
X_train_all.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_missing,GarageFinish_Fin,GarageFinish_RFn,GarageFinish_Unf,GarageFinish_missing
0,0.588235,0.008797,0.666667,0.5,0.963768,0.933333,0.01016,0.002835,0.0,0.569349,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.041319,0.555556,0.625,0.73913,0.816667,0.071843,0.11747,0.334516,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.176471,0.036271,0.555556,0.5,0.485507,0.0,0.0,0.036145,0.0,0.152397,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.051611,0.444444,0.5,0.637681,0.466667,0.0,0.0,0.0,0.418664,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.039496,0.555556,0.625,0.623188,0.133333,0.176343,0.107725,0.0,0.357021,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [23]:
model_cols = [*X_train_cont.columns, *ohe_col_names]

In [24]:
len(model_cols)

200

In [25]:
# Explore coefficients
linreg_coefs = eval_coefficients(linreg, model_cols)

Total number of coefficients:  200
Coefficients close to zero:  0
Intercept: -1256971920367655.8


BsmtFinSF1            2.126604e+16
HeatingQC_Ex          1.288165e+16
HeatingQC_Fa          1.288165e+16
HeatingQC_TA          1.288165e+16
HeatingQC_Gd          1.288165e+16
                          ...     
GarageType_Attchd    -1.353207e+16
GarageType_BuiltIn   -1.353207e+16
GarageType_CarPort   -1.353207e+16
GarageType_2Types    -1.353207e+16
TotalBsmtSF          -2.302188e+16
Length: 200, dtype: float64

**Evaluate**

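- Train R2 (about 0.89) is far above Test R2 (hugely negative): severe overfitting
- Too many dimensions, so the model learns too much of the noise in the training data
- Some of the coefficients are huge (on the order of 1e16). This is a symptom of perfectly collinear columns: each one-hot encoded feature has one column per category, so every group of dummy columns sums to 1 and duplicates the intercept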
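
The collinearity behind those huge coefficients can be sketched on a toy column (the `quality` values are assumptions for illustration). With one dummy per category plus an intercept, the design matrix has more columns than its rank, so the least-squares solution is not unique.

```python
import numpy as np
import pandas as pd

# Toy categorical feature (assumption for illustration)
quality = pd.Series(["Ex", "TA", "Gd", "TA", "Fa", "Ex"], name="HeatingQC")

# One dummy column per category, plus an explicit intercept column
dummies = pd.get_dummies(quality).to_numpy(dtype=float)
design = np.column_stack([np.ones(len(quality)), dummies])
rank_full = np.linalg.matrix_rank(design)

# Dropping one category removes the redundancy
dummies_dropped = pd.get_dummies(quality, drop_first=True).to_numpy(dtype=float)
design_dropped = np.column_stack([np.ones(len(quality)), dummies_dropped])
rank_dropped = np.linalg.matrix_rank(design_dropped)

print(design.shape[1], "columns, rank", rank_full)
print(design_dropped.shape[1], "columns, rank", rank_dropped)
```

Regularization is one way to deal with this: the penalty picks a single well-behaved solution among the many that fit the training data equally well.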

## Fitting Ridge and Lasso

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

### LASSO - L1 norm
- The penalty uses the absolute values of the coefficients
- Coefficients can be penalized to the point where they are driven exactly to zero
- This works as a form of feature selection

In [26]:
from sklearn.linear_model import Lasso

lasso = Lasso() # Lasso uses an L1 penalty (default alpha=1.0)

# Fit
lasso.fit(X_train_all, y_train)

# Predict
l_train_preds = lasso.predict(X_train_all)
l_test_preds = lasso.predict(X_test_all)

# Evaluate
evaluate(y_train, l_train_preds, y_test, l_test_preds)

Train R2: 0.8901917658981585
Test R2: 0.8615905414873006
*****
Train MSE: 666694668.2421192
Test MSE: 969601032.6483967
*****
Train RMSE: 25820.431217199282
Test RMSE: 31138.417311231424



In [27]:
# Adjust HYPERPARAMETERS -- check documentation!
lasso_v2 = Lasso(alpha=10)

lasso_v2.fit(X_train_all, y_train)

l_train_preds_v2 = lasso_v2.predict(X_train_all)
l_test_preds_v2 = lasso_v2.predict(X_test_all)

evaluate(y_train, l_train_preds_v2, y_test, l_test_preds_v2)

Train R2: 0.8896209715692706
Test R2: 0.8670106565368062
*****
Train MSE: 670160214.6908449
Test MSE: 931631451.6273745
*****
Train RMSE: 25887.452842851206
Test RMSE: 30522.638346436805


In [28]:
# Check Lasso Coefficients
lasso_coefs = eval_coefficients(lasso_v2, model_cols)

Total number of coefficients:  200
Coefficients close to zero:  42
Intercept: -32987.261732734885


GrLivArea               144199.948743
OverallQual              76181.228725
LotArea                  65937.181741
2ndFlrSF                 60865.832960
Neighborhood_StoneBr     57520.664444
                            ...      
BldgType_TwnhsE         -21610.237813
BldgType_Twnhs          -24887.607300
LotShape_IR3            -31765.339691
KitchenAbvGr            -44399.066444
Exterior1st_ImStucc     -54895.225279
Length: 200, dtype: float64

- The training score gets slightly worse, but the testing score is now close to it, so the model is no longer learning the noise
- The coefficients are more constrained and easier to read
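
To see the feature-selection effect in isolation, here is a sketch on synthetic data from `make_regression` (the dataset and the alpha grid are assumptions, not the Ames data). It counts how many coefficients Lasso sets exactly to zero as `alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 30 features, only 5 of which actually drive y
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

n_zero = {}
for alpha in [0.01, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_zero[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha:>6}: {n_zero[alpha]} of {len(model.coef_)} coefficients are exactly 0")
```

A stronger penalty zeroes out more coefficients. Push `alpha` too far and the informative features get dropped as well, which is underfitting.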

### Ridge
- Deals better with multicollinearity

In [29]:
from sklearn.linear_model import Ridge

ridge = Ridge() # Ridge uses an L2 penalty (default alpha=1.0)

# Fit
ridge.fit(X_train_all, y_train)

# Predict
r_train_preds = ridge.predict(X_train_all)
r_test_preds = ridge.predict(X_test_all)


# Evaluate
evaluate(y_train, r_train_preds, y_test, r_test_preds)

Train R2: 0.8887013079926787
Test R2: 0.8668141030170738
*****
Train MSE: 675743901.6347734
Test MSE: 933008369.778425
*****
Train RMSE: 25995.074564901202
Test RMSE: 30545.18570541723


In [30]:
# Adjust HYPERPARAMETERS
ridge = Ridge(alpha=5)

ridge.fit(X_train_all, y_train)

r_train_preds = ridge.predict(X_train_all)
r_test_preds = ridge.predict(X_test_all)

evaluate(y_train, r_train_preds, y_test, r_test_preds)

Train R2: 0.8816228437958735
Test R2: 0.8658942423728896
*****
Train MSE: 718720408.6148962
Test MSE: 939452277.8760194
*****
Train RMSE: 26808.961349050736
Test RMSE: 30650.485769005674
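
Rather than trying `alpha` values by hand, one option is to let cross-validation pick it. The sketch below uses `RidgeCV` on synthetic data; the dataset and the alpha grid are assumptions for illustration. `LassoCV` plays the same role for Lasso.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration)
X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate penalty strengths on a log scale
alphas = np.logspace(-3, 3, 13)

# 5-fold cross-validation on the training set only
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
best_alpha = ridge_cv.alpha_
test_r2 = ridge_cv.score(X_te, y_te)

print("best alpha:", best_alpha)
print("test R2:", test_r2)
```

The test set is only touched once, at the end, so the reported test score is not biased by the hyperparameter search.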


In [31]:
# Check Ridge Coefficients
ridge_coefs = eval_coefficients(ridge, model_cols)

Total number of coefficients:  200
Coefficients close to zero:  0
Intercept: 19532.59058687676


OverallQual             56677.063670
2ndFlrSF                50632.302560
GrLivArea               43975.219016
Neighborhood_StoneBr    39737.296291
Neighborhood_NoRidge    37688.530485
                            ...     
Neighborhood_OldTown   -13479.540465
KitchenAbvGr           -13974.639202
Neighborhood_Mitchel   -15485.495773
LotShape_IR3           -15668.899146
Neighborhood_Edwards   -20190.545360
Length: 200, dtype: float64

In [32]:
X_train_all.corr()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_missing,GarageFinish_Fin,GarageFinish_RFn,GarageFinish_Unf,GarageFinish_missing
0,1.000000,-0.114885,0.033643,-0.051452,0.000084,0.029745,-0.013149,-0.074653,-0.065171,-0.134089,...,-0.167395,0.068825,0.079729,0.032320,0.083048,0.056668,-0.005776,-0.039089,0.015414,0.056668
1,-0.114885,1.000000,0.099322,0.003762,0.009906,0.019718,0.127069,0.229307,0.126493,-0.016504,...,0.117571,-0.007379,0.027619,0.007353,-0.117884,-0.068886,0.104782,0.014525,-0.073553,-0.068886
2,0.033643,0.099322,1.000000,-0.083437,0.562225,0.544446,0.424824,0.210886,-0.046931,0.328416,...,0.342532,-0.022689,0.200976,-0.099880,-0.325294,-0.261229,0.383563,0.213486,-0.413493,-0.261229
3,-0.051452,0.003762,-0.083437,1.000000,-0.380957,0.046364,-0.138211,-0.043705,0.055532,-0.148196,...,-0.125329,-0.010919,-0.055347,-0.020579,0.196266,-0.018464,-0.128902,-0.084744,0.199375,-0.018464
4,0.000084,0.009906,0.562225,-0.380957,1.000000,0.597193,0.320719,0.225073,-0.052704,0.181246,...,0.462264,-0.039453,0.198728,-0.068080,-0.485023,-0.215428,0.382807,0.322288,-0.534387,-0.215428
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GarageType_missing,0.056668,-0.068886,-0.261229,-0.018464,-0.215428,-0.092199,-0.118816,-0.095614,-0.062527,-0.064328,...,-0.286367,-0.027871,-0.061330,-0.018970,-0.141614,1.000000,-0.134297,-0.152302,-0.197166,1.000000
GarageFinish_Fin,-0.005776,0.104782,0.383563,-0.128902,0.382807,0.328394,0.181602,0.197135,-0.032383,0.063264,...,0.200399,0.006266,0.246612,-0.045549,-0.282130,-0.134297,1.000000,-0.365698,-0.473421,-0.134297
GarageFinish_RFn,-0.039089,0.014525,0.213486,-0.084744,0.322288,0.202104,0.145028,0.048799,0.016474,0.107660,...,0.376571,0.010403,-0.001877,-0.051656,-0.335555,-0.152302,-0.365698,1.000000,-0.536894,-0.152302
GarageFinish_Unf,0.015414,-0.073553,-0.413493,0.199375,-0.534387,-0.431765,-0.238665,-0.173731,0.041498,-0.125579,...,-0.393074,-0.002407,-0.185635,0.096212,0.621357,-0.197166,-0.473421,-0.536894,1.000000,-0.197166


### Let's Discuss

- For Ridge, none of the coefficients are constrained to exactly 0
- If there is multicollinearity in the data, Ridge constrains certain coefficients
- (If two features drive the same trend in y, one of the features is constrained, which allows the other one to catch up)
- This helps differentiate which feature is actually driving the change in y
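
Here is a small sketch of how Ridge behaves with two almost identical features. The synthetic `x1` and `x2` are assumptions for illustration: `x2` is `x1` plus a tiny amount of noise, and only `x1` drives `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # individual values are poorly determined
print("Ridge coefficients:", ridge.coef_)  # effect is shared between the two features
```

In this setup the unpenalized model can only pin down the sum of the two coefficients, so the individual values are unstable. Ridge shares the effect roughly evenly between the correlated pair.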

## Ridge & Lasso: Other benefits

### Ridge:
* We can "shrink down" the effects of predictor variables instead of deleting/zeroing them
* When you have features with high multicollinearity, the coefficients are automatically spread across them (you won't have redundancy)
* Since it keeps all of the features, it can be computationally expensive when there are many variables

### Lasso:
* When you have a lot of variables, it performs feature selection for you!
* Multicollinearity is also dealt with: Lasso tends to keep one feature from a correlated group and zero out the others


### Why not both??

Enter ElasticNet, which combines the L1 and L2 penalties: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
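
A minimal sketch of ElasticNet on synthetic data (the dataset and the hyperparameter values are assumptions for illustration). `l1_ratio` sets the mix between the two penalties, and `l1_ratio=1` reduces to Lasso.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=40, n_informative=8,
                       noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Half L1, half L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10_000).fit(X_tr, y_tr)
print("zeroed coefficients:", int(np.sum(enet.coef_ == 0)), "of", len(enet.coef_))
print("test R2:", enet.score(X_te, y_te))

# Pure L1: same fit as Lasso with the same alpha
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0, max_iter=10_000).fit(X_tr, y_tr)
lasso_ref = Lasso(alpha=1.0, max_iter=10_000).fit(X_tr, y_tr)
print("matches Lasso:", np.allclose(enet_l1.coef_, lasso_ref.coef_))
```

In practice both `alpha` and `l1_ratio` are hyperparameters to tune, for example with cross-validation.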