# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge on Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at df.info

In [3]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing some of the data with `dtype = object`, this way our first model only contains **continuous features**

Make sure to remove the SalesPrice column from the predictors (which you store in `X`), then replace missing inputs by the median per feature.

Store the target in `y`.

In [4]:
# Load necessary packages

# remove "object"-type features and SalePrice from `X`
X_cv = df.select_dtypes(include = ["int64", "float64"])
X_cv.drop('SalePrice', axis=1, inplace=True)
X_cv.columns

# Impute null values
X_cv.isna().sum()
X_cv['LotFrontage'].fillna(X_cv['LotFrontage'].median(), inplace=True)
X_cv['MasVnrArea'].fillna(X_cv['MasVnrArea'].median(), inplace=True)
X_cv['GarageYrBlt'].fillna(X_cv['MasVnrArea'].median(), inplace=True)
X_cv.isna().sum()

# Create y
y = df[['SalePrice']]
y.head()


Unnamed: 0,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000


Look at the information of `X` again

In [5]:
X_cv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [6]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X_cv, y, test_size = 0.2)

# Fit the model and print R2 and MSE for train and test
linreg_train = LinearRegression()
linreg_train.fit(X_train, y_train)
y_hat_train = linreg_train.predict(X_train)
print("Training R**2 = ",linreg_train.score(X_train, y_hat_train))
print("Training MSE = ",mean_squared_error(y_train,y_hat_train))

linreg_test = LinearRegression()
linreg_test.fit(X_test, y_test)
y_hat_test = linreg_train.predict(X_test)
print("Test R**2 = ",linreg_train.score(X_test, y_hat_test))
print("Test MSE = ",mean_squared_error(y_test,y_hat_test))


Training R**2 =  1.0
Training MSE =  1203527289.991308
Test R**2 =  1.0
Test MSE =  1033367552.2770178


## Normalize your data

We haven't normalized our data, let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [7]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_cv_scaled = preprocessing.scale(X_cv)
X_train_sc, X_test_sc, y_train_sc, y_test_sc = train_test_split(X_cv_scaled, y, test_size = 0.2)
X_cv_scaled =pd.DataFrame(X_cv_scaled, columns=X_cv.columns) 
X_cv_scaled

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,-1.730865,0.073375,-0.220875,-0.207142,0.651479,-0.517200,1.050994,0.878668,0.514104,0.575425,...,0.351000,-0.752176,0.216503,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,0.138777
1,-1.728492,-0.872563,0.460320,-0.091886,-0.071836,2.179628,0.156734,-0.429577,-0.570750,1.171992,...,-0.060731,1.626195,-0.704483,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-0.489110,-0.614439
2,-1.726120,0.073375,-0.084636,0.073480,0.651479,-0.517200,0.984752,0.830215,0.325915,0.092907,...,0.631726,-0.752176,-0.070361,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.990891,0.138777
3,-1.723747,0.309859,-0.447940,-0.096897,0.651479,-0.517200,-1.863632,-0.720298,-0.570750,-0.499274,...,0.790804,-0.752176,-0.176048,4.092524,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,-1.367655
4,-1.721374,0.073375,0.641972,0.375148,1.374795,-0.517200,0.951632,0.733308,1.366489,0.463568,...,1.698485,0.780197,0.563760,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,2.100892,0.138777
5,-1.719002,-0.163109,0.687385,0.360616,-0.795151,-0.517200,0.719786,0.491040,-0.570750,0.632450,...,0.032844,-0.432931,-0.251539,-0.359325,10.802446,-0.270208,-0.068692,1.323736,1.360892,0.891994
6,-1.716629,-0.872563,0.233255,-0.043379,1.374795,-0.517200,1.084115,0.975575,0.458754,2.029558,...,0.762732,1.283007,0.156111,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.620891,-0.614439
7,-1.714256,0.073375,-0.039223,-0.013513,0.651479,0.381743,0.057371,-0.574938,0.757643,0.910994,...,0.051559,1.123385,2.375537,3.372372,-0.116339,-0.270208,-0.068692,0.618024,1.730892,0.891994
8,-1.711883,-0.163109,-0.856657,-0.440659,0.651479,-0.517200,-1.333700,-1.689368,-0.570750,-0.973018,...,-0.023301,-0.033876,-0.704483,2.995929,-0.116339,-0.270208,-0.068692,-0.087688,-0.859110,0.138777
9,-1.709511,3.147673,-0.902070,-0.310370,-0.795151,0.381743,-1.068734,-1.689368,-0.570750,0.893448,...,-1.253816,-0.752176,-0.644091,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.969111,0.138777


Perform the same linear regression on this data and print out R-squared and MSE.

In [8]:
# Your code here
linreg_sctrain = LinearRegression()
linreg_sctrain.fit(X_train_sc, y_train_sc)
y_hat_sctrain = linreg_train.predict(X_train_sc)
print("Training R**2 = ",linreg_sctrain.score(X_train_sc, y_hat_sctrain))
print("Training MSE = ",mean_squared_error(y_train_sc,y_hat_sctrain))

linreg_sctest = LinearRegression()
linreg_sctest.fit(X_test_sc, y_test_sc)
y_hat_sctest = linreg_train.predict(X_test_sc)
print("Test R**2 = ",linreg_train.score(X_test_sc, y_hat_sctest))
print("Test MSE = ",mean_squared_error(y_test_sc,y_hat_sctest))


Training R**2 =  -22.056999591615465
Training MSE =  48895787112.55124
Test R**2 =  1.0
Test MSE =  49424845119.57732


## Include dummy variables

We haven't included dummy variables so far: let's use our "object" variables again and create dummies

In [10]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(include = ["object"])
X_cat.isna().sum()


MSZoning            0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinType2       38
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
KitchenQual         0
Functional          0
FireplaceQu       690
GarageType         81
GarageFinish       81
GarageQual         81
GarageCond         81
PavedDrive          0
PoolQC           1453
Fence            1179
MiscFeature      1406
SaleType            0
SaleCondition       0
dtype: int64

In [15]:
# Make dummies
df_dummies = pd.get_dummies(X_cat)
df_dummies.isna().sum()

MSZoning_C (all)         0
MSZoning_FV              0
MSZoning_RH              0
MSZoning_RL              0
MSZoning_RM              0
Street_Grvl              0
Street_Pave              0
Alley_Grvl               0
Alley_Pave               0
LotShape_IR1             0
LotShape_IR2             0
LotShape_IR3             0
LotShape_Reg             0
LandContour_Bnk          0
LandContour_HLS          0
LandContour_Low          0
LandContour_Lvl          0
Utilities_AllPub         0
Utilities_NoSeWa         0
LotConfig_Corner         0
LotConfig_CulDSac        0
LotConfig_FR2            0
LotConfig_FR3            0
LotConfig_Inside         0
LandSlope_Gtl            0
LandSlope_Mod            0
LandSlope_Sev            0
Neighborhood_Blmngtn     0
Neighborhood_Blueste     0
Neighborhood_BrDale      0
                        ..
GarageCond_TA            0
PavedDrive_N             0
PavedDrive_P             0
PavedDrive_Y             0
PoolQC_Ex                0
PoolQC_Fa                0
P

Merge `x_cat` together with our scaled `X` so you have one big predictor dataframe.

In [16]:
# Your code here
df_all = pd.concat([X_cv_scaled, df_dummies], axis = 1)
df_all.head()
print(df_all.shape)
df_all.columns
df_all.isna().sum()

(1460, 289)


Id                       0
MSSubClass               0
LotFrontage              0
LotArea                  0
OverallQual              0
OverallCond              0
YearBuilt                0
YearRemodAdd             0
MasVnrArea               0
BsmtFinSF1               0
BsmtFinSF2               0
BsmtUnfSF                0
TotalBsmtSF              0
1stFlrSF                 0
2ndFlrSF                 0
LowQualFinSF             0
GrLivArea                0
BsmtFullBath             0
BsmtHalfBath             0
FullBath                 0
HalfBath                 0
BedroomAbvGr             0
KitchenAbvGr             0
TotRmsAbvGrd             0
Fireplaces               0
GarageYrBlt              0
GarageCars               0
GarageArea               0
WoodDeckSF               0
OpenPorchSF              0
                        ..
GarageCond_TA            0
PavedDrive_N             0
PavedDrive_P             0
PavedDrive_Y             0
PoolQC_Ex                0
PoolQC_Fa                0
P

Perform the same linear regression on this data and print out R-squared and MSE.

In [12]:
# Your code here
lin = LinearRegression()
lin.fit(df_all, y)
y_hats = lin.predict(df_all)
print("New R**2 (scaled & dummified) = ", lin.score(df_all, y))
print("New MSE (scaled & dummified) = ", mean_squared_error(y,y_hats))


New R**2 (scaled & dummified) =  0.91392687341558
New MSE (scaled & dummified) =  542845012.2479452


Notice the severe overfitting above; our training R squared is quite high, but the testing R squared is negative! Our predictions are far far off. Similarly, the scale of the Testing MSE is orders of magnitude higher then that of the training.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform Lasso and Ridge regression for both! Each time, look at R-squared and MSE.

## Lasso

With default parameter (alpha = 1)

In [61]:
# Your code here
from sklearn.linear_model import Lasso, Ridge, LinearRegression

X_train , X_test, y_train, y_test = train_test_split(df_all, y, test_size=0.2, random_state=12)

# print(X_train.shape)
# print(y_train.shape)
# print(X_test.shape)
# print(y_test.shape)

y_test = np.array(y_test)


lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)
lasso_coef_1 = lasso.coef_
print(lasso_coef_1)

y_h_lasso_train = lasso.predict(X_train)
y_h_lasso_train = pd.DataFrame(y_h_lasso_train, columns = y.columns)

y_h_lasso_test = lasso.predict(X_test)
y_h_lasso_test = pd.DataFrame(y_h_lasso_test, columns = y.columns)

print(y_h_lasso_train.shape)
print(y_h_lasso_test.shape)

print('Train Error Lasso Model', np.sum((y_train - y_h_lasso_train)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_lasso_test)**2))


[ 7.84350096e+02 -1.18713391e+03  9.16775416e+02  8.25322351e+03
  9.10062751e+03  5.86646003e+03  9.90650424e+03  1.97563744e+03
  2.66370976e+03  1.34779183e+04  3.48489318e+03  5.46779923e+03
  4.92724367e+03  1.64019516e+04  2.95996580e+04 -1.69992269e+02
  4.65029296e+03  1.10656594e+01 -5.62790227e+02  1.17900265e+03
  5.16578067e+02 -1.44466265e+03 -2.22977373e+03 -1.49977713e+03
  2.33976806e+03 -1.08383385e+03  2.89741441e+03  2.40132258e+03
  1.51486947e+03  8.31053233e+02 -7.65064427e+02  9.74831040e+02
  1.71341940e+03  2.84516045e+04  7.71443701e+02 -1.15446381e+03
 -1.48070708e+03 -2.43149550e+04  2.12664576e+04  0.00000000e+00
  1.06770764e+03 -4.62901112e+03 -3.18853149e+04  0.00000000e+00
  2.77882895e+03 -8.40646661e+02 -3.00999057e+02 -1.84363273e+02
  0.00000000e+00  9.35228143e+02 -6.82399962e+02  2.81055821e+03
 -1.64243114e+04  7.07727894e+01  0.00000000e+00  0.00000000e+00
  4.20653043e+02  7.47755737e+03 -7.81386815e+03 -1.69009510e+04
 -0.00000000e+00  0.00000

With a higher regularization parameter (alpha = 10)

In [52]:
# Your code here
X_train , X_test, y_train, y_test = train_test_split(df_all, y, test_size=0.2, random_state=12)

# print(X_train.shape)
# print(y_train.shape)
# print(X_test.shape)
# print(y_test.shape)

y_test = np.array(y_test)


lasso2 = Lasso(alpha=10)
lasso2.fit(X_train, y_train)
lasso2_coef_10 = lasso2.coef_

y_h_lasso_train2 = lasso2.predict(X_train)
y_h_lasso_train2 = pd.DataFrame(y_h_lasso_train2, columns = y.columns)

y_h_lasso_test2 = lasso2.predict(X_test)
y_h_lasso_test2 = pd.DataFrame(y_h_lasso_test2, columns = y.columns)

print(y_h_lasso_train2.shape)
print(y_h_lasso_test2.shape)

print('Train Error Lasso Model', np.sum((y_train - y_h_lasso_train2)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_lasso_test2)**2))


(1168, 1)
(292, 1)
Train Error Lasso Model SalePrice    1.115529e+13
dtype: float64
Test Error Lasso Model SalePrice    3.635065e+11
dtype: float64


## Ridge

With default parameter (alpha = 1)

In [48]:
# Your code here
X_train , X_test, y_train, y_test = train_test_split(df_all, y, test_size=0.2, random_state=12)

# print(X_train.shape)
# print(y_train.shape)
# print(X_test.shape)
# print(y_test.shape)

y_test = np.array(y_test)


ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
ridge_coef_1 = ridge.coef_
# print('Ridge coefficients (alpha = 1) = ', ridge.coef_)

y_h_ridge_train = ridge.predict(X_train)
y_h_ridge_train = pd.DataFrame(y_h_ridge_train, columns = y.columns)

y_h_ridge_test = ridge.predict(X_test)
y_h_ridge_test = pd.DataFrame(y_h_ridge_test, columns = y.columns)

print(y_h_ridge_train.shape)
print(y_h_ridge_test.shape)

print('Train Error Lasso Model', np.sum((y_train - y_h_ridge_train)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_ridge_test)**2))


(1168, 1)
(292, 1)
Train Error Lasso Model SalePrice    1.108524e+13
dtype: float64
Test Error Lasso Model SalePrice    3.396254e+11
dtype: float64


With default parameter (alpha = 10)

In [49]:
# Your code here
X_train , X_test, y_train, y_test = train_test_split(df_all, y, test_size=0.2, random_state=12)

# print(X_train.shape)
# print(y_train.shape)
# print(X_test.shape)
# print(y_test.shape)

y_test = np.array(y_test)


ridge2 = Ridge(alpha=10)
ridge2.fit(X_train, y_train)
ridge2_coef_10 = ridge2.coef_

y_h_ridge_train2 = ridge2.predict(X_train)
y_h_ridge_train2 = pd.DataFrame(y_h_ridge_train2, columns = y.columns)

y_h_ridge_test2 = ridge2.predict(X_test)
y_h_ridge_test2 = pd.DataFrame(y_h_ridge_test2, columns = y.columns)

print(y_h_ridge_train2.shape)
print(y_h_ridge_test2.shape)

print('Train Error Lasso Model', np.sum((y_train - y_h_ridge_train2)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_ridge_test2)**2))


(1168, 1)
(292, 1)
Train Error Lasso Model SalePrice    1.089224e+13
dtype: float64
Test Error Lasso Model SalePrice    3.087371e+11
dtype: float64


## Look at the metrics, what are your main conclusions?

Ridge test with alpha=10 seems to perform the best (i.e. had the lowest training and testing errors).

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [74]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [71]:
# number of Lasso params almost zero

print(sum(abs(lasso.coef_) < 10**(-10)))


34


Compare with the total length of the parameter space and draw conclusions!

In [45]:
# your code here
# df_all.shape -> there are 289 columns/parameters;

## Summary

Great! You now know how to perform Lasso and Ridge regression.