# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`, so that our first model only contains **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col != 'SalePrice']
X = df[features].copy()  # work on a copy so that `df` itself is left untouched

# Impute null values with the median of each feature
X = X.fillna(X.median())

# Create y
y = df.SalePrice
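
The median imputation above can also be written with scikit-learn's `SimpleImputer`. The sketch below is only an illustration on a small made-up DataFrame (the values are hypothetical, not taken from the housing data); it shows that each missing entry is replaced by the median of its own column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'MasVnrArea': [0.0, 100.0, np.nan, 200.0],
})

# strategy='median' fills each column's NaNs with that column's median
imputer = SimpleImputer(strategy='median')
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

One reason to prefer a fitted imputer is that the medians learned on one data set can be reused unchanged on another with `imputer.transform`.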

Look at the information of `X` again

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [7]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print('Training R2 = ', linreg.score(X_train, y_train))
print('Testing R2 = ', linreg.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, linreg.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, linreg.predict(X_test)))

Training R2 =  0.8263424255293087
Testing R2 =  0.755677831544955
Training MSE =  1064028890.2885737
Testing MSE =  1671890629.2838702
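
To make the two metrics concrete, here is a small sketch that computes MSE and R squared by hand and compares them with scikit-learn's `mean_squared_error` and `r2_score`. The arrays are made-up numbers chosen for illustration, not model output from this lab.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)             # mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot                  # coefficient of determination

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because R squared is `1 - ss_res / ss_tot`, it turns negative whenever a model's squared error is larger than that of always predicting the mean of `y`.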


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to standardize our predictors (zero mean and unit variance per feature)!

In [8]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scale = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scale,y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [9]:
# Your code here
linreg_norm = LinearRegression()
linreg_norm.fit(X_train, y_train)
print('Training R2 = ', linreg_norm.score(X_train, y_train))
print('Testing R2 = ', linreg_norm.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, linreg_norm.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, linreg_norm.predict(X_test)))

Training R2 =  0.8086637169775084
Testing R2 =  0.8056307714161693
Training MSE =  1259808184.8739574
Testing MSE =  1057443804.4265274
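
Note that `preprocessing.scale` is applied to all of `X` before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to learn the scaling from the training set only. The following is a minimal sketch of that idea on synthetic data; the data set, the variable names, and the `random_state` values are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing predictors and target
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Fixing random_state makes the split (and therefore the scores) reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline learns the scaling statistics from the training data only,
# then applies the same transformation when predicting on the test data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

print('Training R2 =', model.score(X_tr, y_tr))
print('Testing R2 =', model.score(X_te, y_te))
```

For plain linear regression the scaling does not change the fitted predictions, but it matters for penalized models such as Ridge and Lasso, where the penalty depends on the size of the coefficients.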


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.

In [12]:
# Create X_cat which contains only the categorical variables
# (`np.object` was removed in NumPy 1.24; the builtin `object` is equivalent)
features_cat = [col for col in df.columns if df[col].dtype == object]
X_cat = df[features_cat]
print('Shape without dummies:')
np.shape(X_cat)

Shape without dummies:


(1460, 43)

In [13]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
print('Shape with dummies:')
np.shape(X_cat)

Shape with dummies:


(1460, 252)
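
To see why the column count grows from 43 to 252, here is a small sketch of what `pd.get_dummies` does. The toy DataFrame is hypothetical and only borrows two column names from the data set.

```python
import pandas as pd

# Hypothetical categorical data, for illustration only
toy_cat = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave'],
    'LotShape': ['Reg', 'IR1', 'IR2'],
})

# Each distinct value of each column becomes its own indicator column
toy_dummies = pd.get_dummies(toy_cat)

print(toy_cat.shape)       # (3, 2)
print(toy_dummies.shape)   # (3, 5)
print(list(toy_dummies.columns))
```

By default (`dummy_na=False`), missing values do not get an indicator column of their own; a row with a missing category simply has zeros in all of that variable's dummy columns.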

Merge `X_cat` together with our scaled `X` so you have one big predictor DataFrame.

In [14]:
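# Your code here
# Keep the original column names so all feature names are strings
# (recent scikit-learn versions reject mixed int/str column names)
X_merge = pd.concat([pd.DataFrame(X_scale, columns=X.columns), X_cat], axis=1)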

Perform the same linear regression on this data and print out R-squared and MSE.

In [29]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_merge, y)
linreg_merge = LinearRegression()
linreg_merge.fit(X_train, y_train)
print('Training R2 = ', linreg_merge.score(X_train, y_train))
print('Testing R2 = ', linreg_merge.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, linreg_merge.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, linreg_merge.predict(X_test)))

Training R2 =  0.94074241359729
Testing R2 =  -5.0441900304572774e+17
Training MSE =  369103563.91872144
Testing MSE =  3.297648324586507e+27


Notice the severe overfitting above: our training R squared is quite high, but the testing R squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
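
One plausible contributor to this blow-up (an assumption, not something verified on the housing data here) is that full sets of dummy columns are perfectly collinear: within each categorical variable the indicators sum to 1, so ordinary least squares has no unique solution. The sketch below checks this on a small hypothetical DataFrame by comparing the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data, for illustration only
toy_cat = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave', 'Grvl', 'Pave', 'Grvl'],
    'LotShape': ['Reg', 'IR1', 'IR2', 'Reg', 'IR1', 'IR2'],
})

full = pd.get_dummies(toy_cat).astype(float)
reduced = pd.get_dummies(toy_cat, drop_first=True).astype(float)

# Within each variable the indicator columns sum to 1, so the full set
# has linearly dependent columns (rank < number of columns)
print(full.shape[1], np.linalg.matrix_rank(full.to_numpy()))        # 5 4

# Dropping one level per variable removes that redundancy
print(reduced.shape[1], np.linalg.matrix_rank(reduced.to_numpy()))  # 3 3
```

Regularization is another way to handle this: a penalty on the coefficients keeps them from growing without bound even when the columns are redundant.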

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
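
Since the same four metrics are printed for every model, a small helper can cut down on repetition. The function name `report_metrics` and the synthetic data below are assumptions made for this sketch; in the lab you would pass your own `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return its train/test R2 and MSE in a dict."""
    model.fit(X_train, y_train)
    return {
        'train_r2': model.score(X_train, y_train),
        'test_r2': model.score(X_test, y_test),
        'train_mse': mean_squared_error(y_train, model.predict(X_train)),
        'test_mse': mean_squared_error(y_test, model.predict(X_test)),
    }


# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

for name, model in [('Lasso', Lasso(alpha=1)), ('Ridge', Ridge(alpha=1))]:
    print(name, report_metrics(model, X_tr, X_te, y_tr, y_te))
```

Evaluating every model on the same split, as this loop does, also makes the resulting numbers directly comparable.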

## Lasso

With default parameter (alpha = 1)

In [17]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso() 
lasso.fit(X_train, y_train)
print('Training R2 = ', lasso.score(X_train, y_train))
print('Testing R2 =', lasso.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, lasso.predict(X_test)))

Training R2 =  0.9360740134925611
Testing R2 = 0.8870946240413695
Training MSE =  396594383.52537334
Testing MSE =  741099013.0957825


With a higher regularization parameter (alpha = 10)

In [18]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print('Training R2 = ', lasso.score(X_train, y_train))
print('Testing R2 = ', lasso.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, lasso.predict(X_test)))

Training R2 =  0.9342824984286946
Testing R2 =  0.8923618830441477
Training MSE =  407708874.6916214
Testing MSE =  706525278.9796225


## Ridge

With default parameter (alpha = 1)

In [20]:
# Your code here
ridge = Ridge()
ridge.fit(X_train, y_train)
print('Training R2 =', ridge.score(X_train, y_train))
print('Testing R2 = ', ridge.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, ridge.predict(X_test)))

Training R2 = 0.9233007663063825
Testing R2 =  0.8728342514049975
Training MSE =  475839122.1705914
Testing MSE =  834702599.2621821


With a higher regularization parameter (alpha = 10)

In [21]:
# Your code here
ridge = Ridge(alpha = 10)
ridge.fit(X_train, y_train)
print('Training R2 = ', ridge.score(X_train, y_train))
print('Testing R2 = ', ridge.score(X_test, y_test))
print('Training MSE = ', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE = ', mean_squared_error(y_test, ridge.predict(X_test)))

Training R2 =  0.8998090139851723
Testing R2 =  0.87599624557936
Training MSE =  621581058.0995269
Testing MSE =  813947602.0608683


## Look at the metrics, what are your main conclusions?

Conclusions here

First Naive Linear Regression
Training R2 =  0.83
Testing R2 =  0.76
Training MSE =  1,064,028,890
Testing MSE =  1,671,890,629

Normalized Linear Regression
Training R2 =  0.81
Testing R2 =  0.81
Training MSE =  1,259,808,184
Testing MSE =  1,057,443,804

Normalized + Categorical Dummies Linear Regression
Training R2 =  0.94
Testing R2 =  -5.04e+17
Training MSE =  369,103,563
Testing MSE =  3.30e+27

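Of the three linear regressions, the Normalized Linear Regression has the highest Testing R2 and lowest Testing MSE. Adding the categorical dummies increases the Training R2 but is catastrophic for the Testing R2 and MSE. Keep in mind that each of these three models was fit on a fresh random train-test split, so part of the difference between them may come from the split itself.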

Lasso (alpha=1) Regression
Training R2 =  0.94
Testing R2 = 0.89
Training MSE =  396,594,383
Testing MSE =  741,099,013

Lasso (alpha=10) Regression
Training R2 =  0.93
Testing R2 =  0.89
Training MSE =  407,708,874
Testing MSE =  706,525,278

The two Lasso regressions are nearly identical, with the alpha = 10 version having slightly lower Testing MSE.

Ridge (alpha=1) Regression
Training R2 = 0.92
Testing R2 =  0.87
Training MSE =  475,839,122
Testing MSE =  834,702,599

Ridge (alpha=10) Regression
Training R2 =  0.90
Testing R2 =  0.88
Training MSE =  621,581,058
Testing MSE =  813,947,602

The two Ridge regressions are similar, with the alpha = 10 version having slightly lower Testing MSE.

Overall, Lasso (alpha=10) Regression has the lowest Testing MSE. The highest Testing R2 results from either Lasso version.
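
Rather than comparing only alpha = 1 and alpha = 10 by hand, the regularization strength can be chosen by cross-validation. The sketch below uses scikit-learn's `LassoCV` and `RidgeCV` on synthetic data; the alpha grid, the number of folds, and the data itself are assumptions for illustration, not settings used in this lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 features, 5 of which are informative
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Candidate regularization strengths on a log scale (an arbitrary grid)
alphas = np.logspace(-2, 2, 20)

# Both estimators pick the alpha with the best 5-fold cross-validated score
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=0).fit(X_tr, y_tr)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)

print('Lasso alpha:', lasso_cv.alpha_, 'Testing R2:', lasso_cv.score(X_te, y_te))
print('Ridge alpha:', ridge_cv.alpha_, 'Testing R2:', ridge_cv.score(X_te, y_te))
```

The test set is used only for the final score here; the choice of alpha relies on the training folds alone.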

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [22]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

11


In [23]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

72


Compare with the total length of the parameter space and draw conclusions!

In [24]:
# your code here
len(lasso.coef_)

289

In [27]:
sum(abs(ridge.coef_) < 10**(-10)) / len(ridge.coef_)

0.03806228373702422

In [26]:
sum(abs(lasso.coef_) < 10**(-10)) / len(lasso.coef_)

0.2491349480968858

In [None]:
# The number of parameter estimates that are very close to 0 is 11 (3.8%) for Ridge and 72 (24.9%) for Lasso.
# The total length of the parameter space is 289.
# Lasso sets far more coefficients to (nearly) zero, so it effectively performs feature selection, whereas Ridge mostly shrinks coefficients without eliminating them.
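
The same contrast can be reproduced on synthetic data where the truly relevant features are known. In the sketch below (the data set, `alpha=5`, and the variable names are assumptions for illustration), only 5 of 50 features influence the target. The L1 penalty of Lasso can set coefficients exactly to zero, while the L2 penalty of Ridge shrinks them toward zero without usually reaching it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of which actually influence y
X_demo, y_demo = make_regression(n_samples=200, n_features=50, n_informative=5,
                                 noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5).fit(X_demo, y_demo)

# Count coefficients whose absolute value is below a small tolerance
tol = 1e-10
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < tol))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < tol))

print('Lasso coefficients near zero:', n_zero_lasso, 'of', lasso_demo.coef_.size)
print('Ridge coefficients near zero:', n_zero_ridge, 'of', ridge_demo.coef_.size)
```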

## Summary

Great! You now know how to perform Lasso and Ridge regression.