# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

* Use Lasso and Ridge regression in Python
* Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

### Look at df.info

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.
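As a minimal sketch of the two steps described above (median imputation and separating the target), the toy frame below stands in for the housing data. The values and the names `toy`, `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 162.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Separate predictors and target.
X_toy = toy.drop("SalePrice", axis=1)
y_toy = toy["SalePrice"]

# DataFrame.median() returns one median per column (NaNs are skipped);
# passing that Series to fillna fills each column with its own median.
X_toy = X_toy.fillna(X_toy.median())

print(X_toy)
```

Filling the whole frame at once avoids modifying a column slice in place, which recent pandas versions may warn about or silently ignore.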

In [3]:
# Keep only the non-"object" (numeric) columns; SalePrice is dropped from `X` in a later cell
df2 = df.select_dtypes(exclude='object')
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1452 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

In [4]:
# Impute null values with the median of each column
df2 = df2.fillna(df2.median())
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

In [5]:
# Create X (predictors) and y (target)
X = df2.drop('SalePrice', axis=1)
y = df2['SalePrice']

Look at the information of `X` again

In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [7]:
from sklearn.metrics import mean_squared_error, r2_score

# Split in train and test
x_train, x_test, y_train, y_test = train_test_split(X, y)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(1095, 37)
(1095,)
(365, 37)
(365,)


In [8]:
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression().fit(x_train, y_train)
y_hat_train = linreg.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = linreg.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
linreg = LinearRegression().fit(x_test, y_test)
y_hat_test = linreg.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = linreg.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE:  ', test_mse, 'Test R2: ', test_r2)


Train MSE: 1241924531.6231084 Train R2: 0.801783453939509
Test MSE:   769571749.6528112 Test R2:  0.8803274409796734


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to standardize our predictors (zero mean and unit variance per column)!

In [9]:
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing

# Scale the data and perform train test split
# scale = MinMaxScaler()
# transformed = scale.fit_transform(X)
# X = pd.DataFrame(transformed, columns = X.columns)
# display(X.describe())

X_scaled = preprocessing.scale(X)
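`preprocessing.scale` standardizes each column using the mean and standard deviation of all rows, including those that later end up in the test set. A common alternative, sketched below on synthetic data, is to learn the scaling from the training rows only with `StandardScaler`. The variable names here are illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors with a non-zero mean and a large spread.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# preprocessing.scale: statistics are computed over ALL rows.
X_all_scaled = preprocessing.scale(X_demo)
print(X_all_scaled.mean(axis=0).round(6), X_all_scaled.std(axis=0).round(6))

# Alternative: learn mean and std from the training rows, then reuse them on the test rows.
x_tr, x_te = train_test_split(X_demo, random_state=0)
scaler = StandardScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)
print(x_tr_scaled.shape, x_te_scaled.shape)
```

With the second approach the test columns will generally not have exactly zero mean, which is expected: they are scaled with statistics they did not contribute to.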

In [10]:
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(X_scaled, y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [11]:
# Rerun previous linear regression
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression().fit(x_train, y_train)
y_hat_train = linreg.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = linreg.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
linreg = LinearRegression().fit(x_test, y_test)
y_hat_test = linreg.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = linreg.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE:', test_mse, 'Test R2: ', test_r2)

Train MSE: 1260698718.4140344 Train R2: 0.8038053293890659
Test MSE: 684449863.5096565 Test R2:  0.8849598692177485


## Include dummy variables

Your model hasn't included dummy variables so far: let's use the "object" variables again and create dummies

In [12]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(include='object')
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422

In [13]:
# Make dummies
X_cat = pd.get_dummies(X_cat, drop_first=True)
X_cat.shape

(1460, 209)

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [14]:
# Keep the original column names so that all feature names are strings
all_X = pd.concat([pd.DataFrame(X_scaled, columns=X.columns), X_cat], axis=1)
all_X.shape

(1460, 246)
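To make the dummy-coding and merging steps concrete, here is a small sketch on a toy frame. The column values are made up, and `numeric` stands in for the scaled numeric predictors.

```python
import pandas as pd

# Two categorical columns with 2 and 3 levels respectively.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# drop_first=True drops the first (alphabetically sorted) level of each column,
# so a column with k levels yields k - 1 dummy columns.
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())

# Stand-in for the scaled numeric predictors, with named columns.
numeric = pd.DataFrame({"LotArea": [0.1, -1.2, 1.1]})

# Column-wise concatenation aligns on the row index.
combined = pd.concat([numeric, dummies], axis=1)
print(combined.shape)
```

Keeping string column names on the numeric part means the merged frame has a uniform set of feature names, which recent scikit-learn versions expect when fitting on a DataFrame.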

Perform the same linear regression on this data and print out R-squared and MSE.

In [15]:
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(all_X, y)

# Rerun previous linear regression
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression().fit(x_train, y_train)
y_hat_train = linreg.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = linreg.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
linreg = LinearRegression().fit(x_test, y_test)
y_hat_test = linreg.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = linreg.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE: ', test_mse, 'Test R2: ', test_r2)

Train MSE: 324120061.99360454 Train R2: 0.9442706795152574
Test MSE:  182165377.37260273 Test R2:  0.9765480346728791


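The scores above look excellent on both splits, but they cannot reveal overfitting: the test metrics come from a second model that was refit on the test rows, so they measure in-sample fit rather than held-out performance. With 246 predictors and only 365 test rows, a very high in-sample R squared is expected. To check for overfitting, score the model fitted on the training data on `x_test` without refitting. With this many dummy variables, an unregularized linear regression can then show a much lower, even negative, testing R squared and a testing MSE orders of magnitude above the training MSE.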
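To see what overfitting looks like when the test set is truly held out, the sketch below fits an unregularized linear regression with as many predictors as training rows. The data are synthetic and only the first predictor carries signal; the sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_features = 80, 60
X_demo = rng.normal(size=(n_rows, n_features))
# Only the first column is related to the target; the other 59 are pure noise.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=n_rows)

# 60 training rows for 60 predictors: the model can fit the training data almost perfectly.
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=20, random_state=0)

model = LinearRegression().fit(x_tr, y_tr)
train_r2 = model.score(x_tr, y_tr)
test_r2 = model.score(x_te, y_te)  # scored on unseen rows, no refit

print("Train R2:", train_r2)
print("Test R2: ", test_r2)
```

A large gap between the two scores is the signature of overfitting that regularization is meant to reduce.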

## Perform Ridge and Lasso regression

Use all the data (scaled numeric features and dummy-coded categorical variables) to fit both a Lasso and a Ridge regression! Each time, look at R-squared and MSE.
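For reference, the objectives minimized by scikit-learn's `Lasso` and `Ridge` can be written as follows, where $n$ is the number of training rows, $X$ the predictor matrix, $w$ the coefficient vector and $\alpha$ the regularization strength (the `alpha` argument). The notation is introduced here for illustration.

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad\qquad
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

Because the Lasso objective divides the squared error by $2n$ while the Ridge objective does not, the same numeric value of `alpha` does not correspond to the same amount of regularization in the two models.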

## Lasso

With default parameter (alpha = 1)

In [24]:
# Lasso, alpha=1, train v. test
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(all_X, y)

# Fit the Lasso model and print R2 and MSE for train and test
lasso = Lasso(alpha=1).fit(x_train, y_train)
y_hat_train = lasso.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = lasso.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
lasso = Lasso(alpha=1).fit(x_test, y_test)
y_hat_test = lasso.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = lasso.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE: ', test_mse, 'Test R2: ', test_r2)

Train MSE: 409761434.06685066 Train R2: 0.9343319939724521
Test MSE:  192874152.8192764 Test R2:  0.9701578652416988


With a higher regularization parameter (alpha = 10)

In [25]:
# Lasso, alpha=10, train v. test
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(all_X, y)

# Fit the Lasso model and print R2 and MSE for train and test
lasso = Lasso(alpha=10).fit(x_train, y_train)
y_hat_train = lasso.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = lasso.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
lasso = Lasso(alpha=10).fit(x_test, y_test)
y_hat_test = lasso.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = lasso.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE: ', test_mse, 'Test R2: ', test_r2)

Train MSE: 446828844.32356447 Train R2: 0.9350113588596531
Test MSE:  163509651.7953301 Test R2:  0.9642433436370232


## Ridge

With default parameter (alpha = 1)

In [26]:
# Ridge, alpha=1, train v. test
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(all_X, y)

# Fit the Ridge model and print R2 and MSE for train and test
ridge = Ridge(alpha=1).fit(x_train, y_train)
y_hat_train = ridge.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = ridge.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
ridge = Ridge(alpha=1).fit(x_test, y_test)
y_hat_test = ridge.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = ridge.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE: ', test_mse, 'Test R2: ', test_r2)

Train MSE: 395054521.8232814 Train R2: 0.9354799168902227
Test MSE:  449745562.5164189 Test R2:  0.9344121989025642


With a higher regularization parameter (alpha = 10)

In [27]:
# Ridge, alpha=10, train v. test
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(all_X, y)

# Fit the Ridge model and print R2 and MSE for train and test
ridge = Ridge(alpha=10).fit(x_train, y_train)
y_hat_train = ridge.predict(x_train)

train_mse = mean_squared_error(y_train, y_hat_train)
train_r2 = ridge.score(x_train, y_train)

# Caution: refitting on the test split makes the "test" metrics in-sample, not held-out
ridge = Ridge(alpha=10).fit(x_test, y_test)
y_hat_test = ridge.predict(x_test)

test_mse = mean_squared_error(y_test, y_hat_test)
test_r2 = ridge.score(x_test, y_test)

print('Train MSE:', train_mse, 'Train R2:', train_r2)

print('Test MSE: ', test_mse, 'Test R2: ', test_r2)

Train MSE: 673409556.2095096 Train R2: 0.8925135866914606
Test MSE:  405923090.6788539 Test R2:  0.93680099373243


## Look at the metrics, what are your main conclusions?   

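In these runs, Lasso with alpha=1 reports the highest scores (test R squared of about 0.97, against about 0.96 for Lasso with alpha=10 and about 0.93 to 0.94 for Ridge). Treat this ranking with caution: every cell draws a new random train-test split, and the test metrics come from models refit on the test rows, so the differences may reflect the split as much as the model. What the L1 penalty of Lasso does offer is the ability to shrink the coefficients of weakly predictive features exactly to zero, which removes those features from the model.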
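A fairer comparison fits every model on the same training rows and scores it on the same held-out rows. The sketch below does this on synthetic data with a fixed `random_state`; the data, the `alpha` values and the dictionary names are assumptions chosen for illustration, not tuned settings for the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 40 predictors, of which only the first 5 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 40))
true_coef = np.zeros(40)
true_coef[:5] = [4.0, -3.0, 2.0, -1.0, 0.5]
y_demo = X_demo @ true_coef + rng.normal(size=300)

# One split, shared by all models, so the scores are directly comparable.
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso (alpha=0.1)": Lasso(alpha=0.1),
    "ridge (alpha=10)": Ridge(alpha=10),
}

results = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)  # fit on the training rows only
    results[name] = (model.score(x_tr, y_tr), model.score(x_te, y_te))
    print(f"{name:20s} train R2: {results[name][0]:.3f}  test R2: {results[name][1]:.3f}")
```

Repeating the comparison over several seeds, or using cross-validation, would show how much of any difference is due to the particular split.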

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total number of coefficients and draw conclusions!

In [28]:
# number of Ridge params almost zero
counter = 0

for coef in ridge.coef_:
    if abs(coef) < 1:
        counter += 1
print(len(ridge.coef_))
print(counter)

246
29


In [29]:
# number of Lasso params almost zero
counter = 0

for coef in lasso.coef_:
    if abs(coef) < 1:
        counter += 1
print(len(lasso.coef_))
print(counter)

246
57


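With this threshold, Lasso has 57 of its 246 coefficients (about 23%) close to zero, compared with 29 (about 12%) for Ridge. This is consistent with Lasso performing variable selection: its L1 penalty can set coefficients exactly to zero, whereas Ridge only shrinks them. Note that `ridge` and `lasso` here are the last models fitted above (alpha = 10).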
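The difference between the two penalties is easiest to see by counting coefficients that are exactly zero rather than merely small. The sketch below uses synthetic data in which only 3 of 30 predictors matter; the names `lasso_demo` and `ridge_demo` and the choice of `alpha=0.5` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 30 predictors, only the first 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))
y_demo = (5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 2.0 * X_demo[:, 2]
          + rng.normal(size=200))

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Vectorized count of coefficients that are exactly zero.
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))

print("Lasso coefficients equal to zero:", lasso_zeros, "of", lasso_demo.coef_.size)
print("Ridge coefficients equal to zero:", ridge_zeros, "of", ridge_demo.coef_.size)
```

Ridge shrinks the noise coefficients towards zero without reaching it, so a threshold such as `abs(coef) < 1` is needed to count its "almost zero" coefficients, and the result then depends on the scale of the target.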

## Summary

Great! You now know how to perform Lasso and Ridge regression.