# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

* Use Lasso and Ridge regression in Python
* Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [21]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [22]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model contains only **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [23]:
df['SalePrice'].dtype

dtype('int64')

In [28]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype in ['float64', 'int64'] and col != 'SalePrice']
# Copy the selection so the imputation below does not write into a view of `df`
X = df[features].copy()

# Impute null values with the median of each feature
for col in X.columns:
    X[col] = X[col].fillna(X[col].median())

# Create y
y = df['SalePrice']
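
The selection and imputation steps can also be written without an explicit loop. The sketch below is one possible variant, not the lab's reference solution; it runs on a small made-up frame (`toy`, with invented values) so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors.
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# DataFrame.median() skips NaN by default, so each gap gets its column median.
X_toy = X_toy.fillna(X_toy.median())
y_toy = toy["SalePrice"]

print(X_toy)
```

Here the missing `LotFrontage` entry is filled with 65.0, the median of the three observed values.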

In [30]:
y

0       208500
1       181500
2       223500
3       140000
4       250000
5       143000
6       307000
7       200000
8       129900
9       118000
10      129500
11      345000
12      144000
13      279500
14      157000
15      132000
16      149000
17       90000
18      159000
19      139000
20      325300
21      139400
22      230000
23      129900
24      154000
25      256300
26      134800
27      306000
28      207500
29       68500
         ...  
1430    192140
1431    143750
1432     64500
1433    186500
1434    160000
1435    174000
1436    120500
1437    394617
1438    149700
1439    197000
1440    191000
1441    149300
1442    310000
1443    121000
1444    179600
1445    129000
1446    157900
1447    240000
1448    112000
1449     92000
1450    136000
1451    287090
1452    145000
1453     84500
1454    185000
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

Look at the information of `X` again

In [29]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [35]:
X_train.shape, X_test.shape, y_train.shape

((978, 37), (482, 37), (978,))

In [51]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the model and print R2 and MSE for train and test
lin = LinearRegression()

# fit() returns the estimator itself, so `lin` can be used directly afterwards
lin.fit(X_train, y_train)

y_hat_train = lin.predict(X_train)
y_hat_test = lin.predict(X_test)

print('Training r^2:', lin.score(X_train, y_train))
print('Testing r^2:', lin.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_hat_train))
print('Testing MSE:', mean_squared_error(y_test, y_hat_test))

Training r^2: 0.8118554543955109
Testing r^2: 0.7849696889294511
Training MSE: 1090644660.211904
Testing MSE: 1578621667.9588974
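
The same four metrics are printed for every model in this lab, so it can be convenient to wrap them in a small helper. The function below is a hypothetical convenience (`report_metrics` is not part of the lab or of scikit-learn), demonstrated on synthetic data so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Print and return R^2 and MSE of a fitted model on both splits."""
    results = {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }
    for name, value in results.items():
        print(f"{name}: {value:.4f}")
    return results


# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
demo_metrics = report_metrics(LinearRegression().fit(Xtr, ytr), Xtr, Xte, ytr, yte)
```

Any fitted scikit-learn regressor (`LinearRegression`, `Lasso`, `Ridge`) could be passed as `model`, since they all expose `score` and `predict`.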


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [56]:
from sklearn import preprocessing

# Scale the data and perform train test split
transformed = preprocessing.scale(X)
# X = pd.DataFrame(transformed, columns = X.columns)
X_train, X_test, y_train, y_test= train_test_split(transformed, y, test_size=0.33, random_state=42)
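
Note that `preprocessing.scale(X)` computes the means and standard deviations from all rows, including those that later end up in the test set. A commonly used alternative, sketched here on synthetic data and not what the lab itself does, is to put a `StandardScaler` in a pipeline so that the scaling statistics are learned from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target (assumed for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# fit() on the pipeline fits the scaler on Xtr only, then fits the regression
# on the scaled training data; score() reuses the training statistics on Xte.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xtr, ytr)

scaler = pipe.named_steps["standardscaler"]
print("Means learned from training data:", scaler.mean_)
print("Testing r^2:", pipe.score(Xte, yte))
```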

Perform the same linear regression on this data and print out R-squared and MSE.

In [57]:
# Your code here
# Fit the model and print R2 and MSE for train and test
lin = LinearRegression()

# fit() returns the estimator itself, so `lin` can be used directly afterwards
lin.fit(X_train, y_train)

y_hat_train = lin.predict(X_train)
y_hat_test = lin.predict(X_test)

print('Training r^2:', lin.score(X_train, y_train))
print('Testing r^2:', lin.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_hat_train))
print('Testing MSE:', mean_squared_error(y_test, y_hat_test))

Training r^2: 0.811868391461654
Testing r^2: 0.7850455638734782
Training MSE: 1090569666.052161
Testing MSE: 1578064640.3003318
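
These scores are almost identical to the unscaled ones, which is expected: ordinary least squares with an intercept produces the same fitted values when each predictor is shifted and rescaled, so scaling mainly matters for the penalized models that follow. The small differences in the later digits are plausibly numerical. A minimal check on synthetic data (made-up features on very different scales):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

# Three synthetic features whose scales differ by orders of magnitude.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(150, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_raw @ np.array([3.0, 0.02, 0.0005]) + rng.normal(size=150)

# Standardize every column to zero mean and unit variance.
X_std = preprocessing.scale(X_raw)

r2_raw = LinearRegression().fit(X_raw, y_demo).score(X_raw, y_demo)
r2_std = LinearRegression().fit(X_std, y_demo).score(X_std, y_demo)
print("r^2 on raw features:   ", r2_raw)
print("r^2 on scaled features:", r2_std)
```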


In [65]:
X_scaled = pd.DataFrame(transformed, columns = X.columns)
X_scaled

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,-1.730865,0.073375,-0.220875,-0.207142,0.651479,-0.517200,1.050994,0.878668,0.514104,0.575425,...,0.351000,-0.752176,0.216503,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,0.138777
1,-1.728492,-0.872563,0.460320,-0.091886,-0.071836,2.179628,0.156734,-0.429577,-0.570750,1.171992,...,-0.060731,1.626195,-0.704483,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-0.489110,-0.614439
2,-1.726120,0.073375,-0.084636,0.073480,0.651479,-0.517200,0.984752,0.830215,0.325915,0.092907,...,0.631726,-0.752176,-0.070361,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.990891,0.138777
3,-1.723747,0.309859,-0.447940,-0.096897,0.651479,-0.517200,-1.863632,-0.720298,-0.570750,-0.499274,...,0.790804,-0.752176,-0.176048,4.092524,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,-1.367655
4,-1.721374,0.073375,0.641972,0.375148,1.374795,-0.517200,0.951632,0.733308,1.366489,0.463568,...,1.698485,0.780197,0.563760,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,2.100892,0.138777
5,-1.719002,-0.163109,0.687385,0.360616,-0.795151,-0.517200,0.719786,0.491040,-0.570750,0.632450,...,0.032844,-0.432931,-0.251539,-0.359325,10.802446,-0.270208,-0.068692,1.323736,1.360892,0.891994
6,-1.716629,-0.872563,0.233255,-0.043379,1.374795,-0.517200,1.084115,0.975575,0.458754,2.029558,...,0.762732,1.283007,0.156111,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.620891,-0.614439
7,-1.714256,0.073375,-0.039223,-0.013513,0.651479,0.381743,0.057371,-0.574938,0.757643,0.910994,...,0.051559,1.123385,2.375537,3.372372,-0.116339,-0.270208,-0.068692,0.618024,1.730892,0.891994
8,-1.711883,-0.163109,-0.856657,-0.440659,0.651479,-0.517200,-1.333700,-1.689368,-0.570750,-0.973018,...,-0.023301,-0.033876,-0.704483,2.995929,-0.116339,-0.270208,-0.068692,-0.087688,-0.859110,0.138777
9,-1.709511,3.147673,-0.902070,-0.310370,-0.795151,0.381743,-1.068734,-1.689368,-0.570750,0.893448,...,-1.253816,-0.752176,-0.644091,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.969111,0.138777


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [58]:
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype in ['object']]
X_cat = df[features_cat]

np.shape(X_cat)

(1460, 43)

In [59]:
# Make dummies
X_cat = pd.get_dummies(X_cat, drop_first=True)
np.shape(X_cat)

(1460, 209)
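
To see what `pd.get_dummies(..., drop_first=True)` does to each categorical column, here is a tiny sketch on a made-up frame (the values are invented; only the column names echo the housing data). For every column, the first category in sorted order is dropped and the remaining categories each become one indicator column.

```python
import pandas as pd

# Made-up categorical data for illustration.
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# "Grvl" and "IR1" are the first categories alphabetically, so they are dropped.
dummies = pd.get_dummies(toy_cat, drop_first=True)
print(list(dummies.columns))
# → ['Street_Pave', 'LotShape_IR2', 'LotShape_Reg']
```

By default (`dummy_na=False`), missing values do not get their own indicator column, so a row with a missing category ends up with zeros in all of that feature's dummy columns.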

Merge `X_cat` with the scaled predictors in `X_scaled` so you have one big predictor DataFrame.

In [66]:
# Your code here

X_pred = pd.concat([X_scaled, X_cat], axis = 1)
X_pred

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,-1.730865,0.073375,-0.220875,-0.207142,0.651479,-0.517200,1.050994,0.878668,0.514104,0.575425,...,0,0,0,0,1,0,0,0,1,0
1,-1.728492,-0.872563,0.460320,-0.091886,-0.071836,2.179628,0.156734,-0.429577,-0.570750,1.171992,...,0,0,0,0,1,0,0,0,1,0
2,-1.726120,0.073375,-0.084636,0.073480,0.651479,-0.517200,0.984752,0.830215,0.325915,0.092907,...,0,0,0,0,1,0,0,0,1,0
3,-1.723747,0.309859,-0.447940,-0.096897,0.651479,-0.517200,-1.863632,-0.720298,-0.570750,-0.499274,...,0,0,0,0,1,0,0,0,0,0
4,-1.721374,0.073375,0.641972,0.375148,1.374795,-0.517200,0.951632,0.733308,1.366489,0.463568,...,0,0,0,0,1,0,0,0,1,0
5,-1.719002,-0.163109,0.687385,0.360616,-0.795151,-0.517200,0.719786,0.491040,-0.570750,0.632450,...,0,0,0,0,1,0,0,0,1,0
6,-1.716629,-0.872563,0.233255,-0.043379,1.374795,-0.517200,1.084115,0.975575,0.458754,2.029558,...,0,0,0,0,1,0,0,0,1,0
7,-1.714256,0.073375,-0.039223,-0.013513,0.651479,0.381743,0.057371,-0.574938,0.757643,0.910994,...,0,0,0,0,1,0,0,0,1,0
8,-1.711883,-0.163109,-0.856657,-0.440659,0.651479,-0.517200,-1.333700,-1.689368,-0.570750,-0.973018,...,0,0,0,0,1,0,0,0,0,0
9,-1.709511,3.147673,-0.902070,-0.310370,-0.795151,0.381743,-1.068734,-1.689368,-0.570750,0.893448,...,0,0,0,0,1,0,0,0,1,0


Perform the same linear regression on this data and print out R-squared and MSE.

In [67]:
# Your code here
X_train, X_test, y_train, y_test= train_test_split(X_pred, y, test_size=0.33, random_state=42)
lin = LinearRegression()

# fit() returns the estimator itself, so `lin` can be used directly afterwards
lin.fit(X_train, y_train)

y_hat_train = lin.predict(X_train)
y_hat_test = lin.predict(X_test)

print('Training r^2:', lin.score(X_train, y_train))
print('Testing r^2:', lin.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_hat_train))
print('Testing MSE:', mean_squared_error(y_test, y_hat_test))

Training r^2: 0.927027888285902
Testing r^2: -1.1735770978584158e+19
Training MSE: 423007978.9433374
Testing MSE: 8.615688767207276e+28


Notice the severe overfitting above: the training R-squared is quite high (about 0.93), but the testing R-squared is hugely negative, so the test predictions are far off. Likewise, the testing MSE is about 20 orders of magnitude larger than the training MSE.
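
A negative R-squared may look surprising. As a reminder, R-squared compares the model's squared error with that of always predicting the mean of `y`, so it drops below zero whenever the model does worse than that baseline. A tiny illustration with made-up numbers:

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3]

# Predictions close to the truth: R^2 near 1.
good = r2_score(y_true, [1.1, 1.9, 3.2])
# Always predicting the mean of y_true: R^2 is exactly 0.
mean_only = r2_score(y_true, [2, 2, 2])
# Predictions worse than the mean baseline: R^2 is negative.
bad = r2_score(y_true, [3, 2, 1])

print(good, mean_only, bad)
```

In the last case the residual sum of squares is 8 and the total sum of squares is 2, giving 1 - 8/2 = -3.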

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.
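
As a reminder of what `alpha` controls, these are the objectives that scikit-learn's `Ridge` and `Lasso` estimators minimize according to their documentation, with `w` the coefficient vector (the intercept is not penalized):

```latex
% Ridge (L2 penalty), as minimized by sklearn.linear_model.Ridge
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2

% Lasso (L1 penalty), as minimized by sklearn.linear_model.Lasso
\min_{w} \; \frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

Because the squared-error term is scaled differently in the two objectives, the same numeric `alpha` does not mean the same amount of regularization for Ridge and for Lasso.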

## Lasso

With default parameter (alpha = 1)

In [73]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso_1 = Lasso(alpha=1)
lasso_1.fit(X_train, y_train)
y_h_lasso_1_train = lasso_1.predict(X_train)
y_h_lasso_1_test = lasso_1.predict(X_test)

print('Training r^2:', lasso_1.score(X_train, y_train))
print('Testing r^2:', lasso_1.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso_1.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso_1.predict(X_test)))

Training r^2: 0.9379918083723702
Testing r^2: 0.7423726995758829
Training MSE: 359451839.9180062
Testing MSE: 1891342837.5864437


With a higher regularization parameter (alpha = 10)

In [74]:
# Your code here
lasso_10 = Lasso(alpha=10)
lasso_10.fit(X_train, y_train)

y_h_lasso_10_train = lasso_10.predict(X_train)
y_h_lasso_10_test = lasso_10.predict(X_test)

print('Training r^2:', lasso_10.score(X_train, y_train))
print('Testing r^2:', lasso_10.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso_10.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso_10.predict(X_test)))

Training r^2: 0.9327073254944938
Testing r^2: 0.8351077643904454
Training MSE: 390085165.0256775
Testing MSE: 1210538433.9328072


## Ridge

With default parameter (alpha = 1)

In [75]:
# Your code here
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
y_h_ridge_train = ridge.predict(X_train)
y_h_ridge_test = ridge.predict(X_test)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.9166444310491575
Testing r^2: 0.8544011261776161
Training MSE: 483199265.13454294
Testing MSE: 1068898314.3916821


With a higher regularization parameter (alpha = 10)

In [76]:
# Your code here
ridge_10 = Ridge(alpha=10)
ridge_10.fit(X_train, y_train)
y_h_ridge_10_train = ridge_10.predict(X_train)
y_h_ridge_10_test = ridge_10.predict(X_test)
print('Training r^2:', ridge_10.score(X_train, y_train))
print('Testing r^2:', ridge_10.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge_10.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge_10.predict(X_test)))

Training r^2: 0.8982005040558715
Testing r^2: 0.8457146588702551
Training MSE: 590115840.4938531
Testing MSE: 1132669070.436013
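
Rather than comparing only two values of `alpha` by hand, the same comparison can be run in a loop. The sketch below uses synthetic data from `make_regression` (an assumption, not the housing data) and an arbitrary grid of alphas.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic problem with many features relative to the number of rows.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=60, n_informative=10, noise=20.0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Map each alpha to its (training r^2, testing r^2) pair.
scores = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(Xtr, ytr)
    scores[alpha] = (model.score(Xtr, ytr), model.score(Xte, yte))
    print(f"alpha={alpha:>6}: train r^2={scores[alpha][0]:.3f}, test r^2={scores[alpha][1]:.3f}")
```

The training R-squared can only go down as `alpha` grows, while the testing R-squared typically peaks somewhere in between. In practice, `alpha` is usually chosen on a validation set or by cross-validation rather than on the test set.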


In [72]:
print('Train Error Ridge 1 Model', np.sum((y_train - y_h_ridge_train)**2))
print('Test Error Ridge 1 Model', np.sum((y_test - y_h_ridge_test)**2))
print('\n')

print('Train Error Ridge 10 Model', np.sum((y_train - y_h_ridge_10_train)**2))
print('Test Error Ridge 10 Model', np.sum((y_test - y_h_ridge_10_test)**2))
print('\n')

print('Train Error Lasso 1 Model', np.sum((y_train - y_h_lasso_1_train)**2))
print('Test Error Lasso 1 Model', np.sum((y_test - y_h_lasso_1_test)**2))
print('\n')

print('Train Error Lasso 10 Model', np.sum((y_train - y_h_lasso_10_train)**2))
print('Test Error Lasso 10 Model', np.sum((y_test - y_h_lasso_10_test)**2))
print('\n')

# print('Train Error Unpenalized Linear Model', np.sum((y_train - lin.predict(X_train))**2))
# print('Test Error Unpenalized Linear Model', np.sum((y_test - lin.predict(X_test))**2))

Train Error Ridge 1 Model 472568881301.583
Test Error Ridge 1 Model 515208987536.7908


Train Error Ridge 10 Model 577133292002.9883
Test Error Ridge 10 Model 545946491950.1582


Train Error Lasso 1 Model 351543899439.81006
Test Error Lasso 1 Model 911627247716.6659


Train Error Lasso 10 Model 381503291395.1126
Test Error Lasso 10 Model 583479525155.613




## Look at the metrics, what are your main conclusions?   

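The unpenalized model with all dummy variables overfits badly (testing R-squared around -1.2e19), and both penalized models fix this. Ridge with alpha = 1 has the best test performance here (testing R-squared of about 0.854, testing MSE of about 1.07e9); raising alpha to 10 lowers the training R-squared to about 0.898 and the testing R-squared slightly, to about 0.846. For Lasso, alpha = 1 still overfits (about 0.938 on the training set versus 0.742 on the test set), while alpha = 10 narrows the gap and raises the testing R-squared to about 0.835.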

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [91]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))
# ridge.coef_

5


In [92]:
# number of Ridge (alpha = 10) params almost zero
print(sum(abs(ridge_10.coef_) < 10**(-10)))
# ridge_10.coef_

5


In [94]:
# number of Lasso params almost zero
print(sum(abs(lasso_1.coef_) < 10**(-10)))
# lasso_1.coef_

8


In [90]:
# number of Lasso (alpha = 10) params almost zero
print(sum(abs(lasso_10.coef_) < 10**(-10)))
# lasso_10.coef_

46


Lasso was very effective at performing variable selection: with alpha = 10, it sets 46 of the 246 coefficients (about 19% of the variables) to essentially zero!

In [82]:
# your code here
len(lasso_1.coef_)

246

In [86]:
sum(abs(lasso_10.coef_) < 10**(-10))/246

0.18699186991869918
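
The contrast between the two penalties is easy to reproduce on synthetic data. In the sketch below (made-up data from `make_regression`, where only 5 of 30 features are informative, and an arbitrary `alpha = 5`), Lasso sets many coefficients to zero while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 30 features, of which only 5 actually influence the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

lasso_demo = Lasso(alpha=5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5).fit(X_demo, y_demo)

# Count coefficients that are (very close to) zero, as in the cells above.
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < 1e-10))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < 1e-10))

print("Lasso coefficients at zero:", n_zero_lasso, "of", lasso_demo.coef_.size)
print("Ridge coefficients at zero:", n_zero_ridge, "of", ridge_demo.coef_.size)
```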

## Summary

Great! You now know how to perform Lasso and Ridge regression.