# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge on Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at df.info

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing some of the data with `dtype = object`, this way our first model only contains **continuous features**

Make sure to remove the SalesPrice column from the predictors (which you store in `X`), then replace missing inputs by the median per feature.

Store the target in `y`.

In [None]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# remove "object"-type features and SalesPrice from `X`
features = [col for col in df.columns if df[col].dtype != object and col != 'SalePrice']
X = df[features]

# Impute null values
for col in X:
    med = X[col].median()
    X[col].fillna(value = med, inplace = True)

# Create y
y = df.SalePrice

Look at the information of `X` again

In [29]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [30]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

print('Train R2: ', linreg.score(X_train, y_train))
print('Test R2: ', linreg.score(X_test, y_test))
print('Train MSE: ', mean_squared_error(y_train, linreg.predict(X_train)))
print('Test MSE: ', mean_squared_error(y_test, linreg.predict(X_test)))

Train R2:  0.8118730326467865
Test R2:  0.7818118600758989
Train MSE:  1171889159.3350675
Test MSE:  1425050189.6543164


## Normalize your data

We haven't normalized our data, let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [31]:
from sklearn import preprocessing

# scale the data and perform train test split
X_scaled = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [32]:
# Your code here
linreg_norm = LinearRegression()
linreg_norm.fit(X_train, y_train)

print('Train R2: ', linreg.score(X_train, y_train))
print('Test R2: ', linreg.score(X_test, y_test))
print('Train MSE: ', mean_squared_error(y_train, linreg.predict(X_train)))
print('Test MSE: ', mean_squared_error(y_test, linreg.predict(X_test)))

Train R2:  -41.266585466209854
Test R2:  -25.464888335976198
Train MSE:  232060790667.00577
Test MSE:  230021717359.75085


## Include dummy variables

We haven't included dummy variables so far: let's use our "object" variables again and create dummies

In [33]:
# Create X_cat which contains only the categorical variables
features_obj = [col for col in df.columns if df[col].dtypes == object]
X_cat = df[features_obj]

In [34]:
X_cat.shape

(1460, 43)

In [35]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
X_cat.shape

(1460, 252)

Merge `x_cat` together with our scaled `X` so you have one big predictor dataframe.

In [36]:
# Your code here
X_all = pd.concat([pd.DataFrame(X_scaled), X_cat], axis=1)

In [37]:
X_all.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,-1.730865,0.073375,-0.220875,-0.207142,0.651479,-0.5172,1.050994,0.878668,0.514104,0.575425,...,0,0,0,1,0,0,0,0,1,0
1,-1.728492,-0.872563,0.46032,-0.091886,-0.071836,2.179628,0.156734,-0.429577,-0.57075,1.171992,...,0,0,0,1,0,0,0,0,1,0
2,-1.72612,0.073375,-0.084636,0.07348,0.651479,-0.5172,0.984752,0.830215,0.325915,0.092907,...,0,0,0,1,0,0,0,0,1,0
3,-1.723747,0.309859,-0.44794,-0.096897,0.651479,-0.5172,-1.863632,-0.720298,-0.57075,-0.499274,...,0,0,0,1,1,0,0,0,0,0
4,-1.721374,0.073375,0.641972,0.375148,1.374795,-0.5172,0.951632,0.733308,1.366489,0.463568,...,0,0,0,1,0,0,0,0,1,0


Perform the same linear regression on this data and print out R-squared and MSE.

In [38]:
# Your code here
# Split in train and test
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(X_all, y)

# Fit the model and print R2 and MSE for train and test
linreg_all = LinearRegression()
linreg_all.fit(X_train_all, y_train_all)

print('Train R2: ', linreg_all.score(X_train_all, y_train_all))
print('Test R2: ', linreg_all.score(X_test_all, y_test_all))
print('Train MSE: ', mean_squared_error(y_train, linreg_all.predict(X_train_all)))
print('Test MSE: ', mean_squared_error(y_test, linreg_all.predict(X_test_all)))

Train R2:  0.9404202239130927
Test R2:  -1.5096321821221827e+20
Train MSE:  10541673389.714155
Test MSE:  1.0694805892542209e+30


Notice the severe overfitting above; our training R squared is quite high, but the testing R squared is negative! Our predictions are far far off. Similarly, the scale of the Testing MSE is orders of magnitude higher then that of the training.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform Lasso and Ridge regression for both! Each time, look at R-squared and MSE.

## Lasso

With default parameter (alpha = 1)

In [45]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso() 
lasso.fit(X_train_all, y_train_all)
print('Training r^2:', lasso.score(X_train_all, y_train_all))
print('Testing r^2:', lasso.score(X_test_all, y_test_all))
print('Training MSE:', mean_squared_error(y_train_all, lasso.predict(X_train_all)))
print('Testing MSE:', mean_squared_error(y_test_all, lasso.predict(X_test_all)))

Training r^2: 0.940384300763371
Testing r^2: 0.6114389666201261
Training MSE: 360501273.54450905
Testing MSE: 2752713461.392247


With a higher regularization parameter (alpha = 10)

In [46]:
# Your code here
lasso = Lasso(alpha=10) 
lasso.fit(X_train_all, y_train_all)
print('Training r^2:', lasso.score(X_train_all, y_train_all))
print('Testing r^2:', lasso.score(X_test_all, y_test_all))
print('Training MSE:', mean_squared_error(y_train_all, lasso.predict(X_train_all)))
print('Testing MSE:', mean_squared_error(y_test_all, lasso.predict(X_test_all)))

Training r^2: 0.9389658961794641
Testing r^2: 0.6442193110770569
Training MSE: 369078488.3628769
Testing MSE: 2520485091.3192997


## Ridge

With default parameter (alpha = 1)

In [41]:
# Your code here
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.8410641988379755
Testing r^2: 0.7096425242120267
Training MSE: 872622362.940566
Testing MSE: 2523665483.908353


With default parameter (alpha = 10)

In [42]:
# Your code here
ridge = Ridge(alpha=10) 
ridge.fit(X_train, y_train)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.8410261168923527
Testing r^2: 0.7103678742708416
Training MSE: 872831448.3519812
Testing MSE: 2517361045.208388


## Look at the metrics, what are your main conclusions?

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [47]:
# number of Ridge params almost zero
print('Ridge Coefficients: ', ridge.coef_)

Ridge Coefficients:  [-1.28609300e+03 -6.34232371e+03 -1.22208645e+03  4.86968891e+03
  2.22002484e+04  4.27778154e+03  8.47161592e+03  4.55650110e+03
  4.95300705e+03  3.63109844e+03  9.77654635e+02  3.27318592e+02
  4.46423319e+03  7.79679417e+03  7.22074144e+03  1.09743780e+03
  1.18359600e+04  5.45485492e+03  8.53427778e+02  2.44974516e+03
 -4.47831087e+02 -8.59115115e+03 -3.06587873e+03  1.03840215e+04
  2.88392414e+03  1.56419678e+03  7.63145825e+03  3.51939207e+02
  1.66692235e+03  1.11591591e+03  2.07090712e+03  2.90236847e+00
  3.73536823e+03 -9.84251217e+03 -3.83934794e+01  8.31289232e+02
 -1.08683119e+02]


In [48]:
# number of Lasso params almost zero
print('Lasso Coefficients: ', lasso.coef_)

Lasso Coefficients:  [ 7.62657683e+01 -2.57755442e+03  1.71763256e+03  7.61156199e+03
  8.26122256e+03  6.67460871e+03  1.15797714e+04  2.28186923e+03
  3.17383545e+03  6.54867118e+03  1.24200151e+03 -0.00000000e+00
  1.06396661e+04  1.11574352e+04  2.24955425e+04 -3.98355252e+02
  9.97436756e+03  1.18583766e+03  4.33757677e+02  2.24633638e+03
 -3.24964110e+01 -4.41693323e+03 -2.90966181e+03  1.66871669e+03
  1.92388864e+03  1.90618636e+02  5.43802099e+03  1.24601580e+03
  2.80579768e+03  1.23826136e+03  1.07280546e+03  1.25564808e+03
  2.03610760e+03 -1.01925019e+03 -1.90787877e+03 -1.17014776e+03
 -3.54671751e+02 -2.00783124e+04  4.84451434e+03 -0.00000000e+00
  9.19596236e+02 -5.28723271e+03 -3.21832182e+04  4.23461483e-09
  9.85267390e+02  1.99266510e+03  1.14118228e+01 -2.28605571e+03
 -6.18674070e+03  2.06572000e+03  7.10924377e+02  1.04998020e+02
 -1.03109725e+04 -4.63485375e+00  1.58217584e+04 -4.66086937e-10
  7.36154756e+02  1.00775401e+04 -5.59487808e+03 -4.03652455e+03
  0.

Compare with the total length of the parameter space and draw conclusions!

In [None]:
# your code here

## Summary

Great! You now know how to perform Lasso and Ridge regression.