# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [19]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing all columns with `dtype = object`, so that our first model contains only **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [4]:
# Load necessary packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
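# Keep only the non-"object" features and remove SalePrice from `X`.
# select_dtypes returns a new frame, so `df` keeps its categorical columns for later.
X = df.select_dtypes(exclude=['object']).drop('SalePrice', axis=1)

# Impute null values (done a few cells below, after inspecting the missing counts)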


# Create y
y = df['SalePrice']

Look at the information of `X` again

In [5]:
X.head()

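| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |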


In [6]:
X.isnull().sum()

Id                 0
MSSubClass         0
LotFrontage      259
LotArea            0
OverallQual        0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
MasVnrArea         8
BsmtFinSF1         0
BsmtFinSF2         0
BsmtUnfSF          0
TotalBsmtSF        0
1stFlrSF           0
2ndFlrSF           0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       0
BsmtHalfBath       0
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt       81
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
MiscVal            0
MoSold             0
YrSold             0
dtype: int64

In [8]:
# Replace missing values in each column with that column's median
X = X.fillna(X.median())
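
The selection and imputation steps can be checked on a tiny frame. This is a minimal sketch with made-up values; the frame `toy` and its contents are assumptions for illustration, and it selects columns with `include="number"` rather than by excluding `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing frame
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
toy_X = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Fill each column's missing values with that column's median
toy_X = toy_X.fillna(toy_X.median())

print(toy_X)
```

The missing `LotFrontage` entry is filled with the median of the three observed values, and the text column `Street` is gone from `toy_X`.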

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [10]:
from sklearn.metrics import mean_squared_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1)
# Fit the model and print R2 and MSE for train and test
lin = LinearRegression()
lin.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, lin.predict(X_train)))
print(" Train R^2 : ", lin.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, lin.predict(X_test)))
print(" Test R^2 : ", lin.score(X_test, y_test))

 Train Mean Squared Error:  1228802186.9140017
 Train R^2 :  0.7934326608365047
 Test Mean Squared Error:  1221377702.1495645
 Test R^2 :  0.8289459519460562
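
To make the two metrics concrete, here is a small sketch that computes MSE and R squared by hand on made-up numbers and compares them with scikit-learn's `mean_squared_error` and `r2_score`. The arrays are illustrative assumptions, not values from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Sum of squared residuals and total sum of squares around the mean
residual_ss = np.sum((y_true - y_pred) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)

mse_manual = residual_ss / len(y_true)   # average squared error
r2_manual = 1 - residual_ss / total_ss   # share of variance explained

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because R squared is `1 - residual_ss / total_ss`, it becomes negative whenever the model's squared error exceeds that of simply predicting the mean.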


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `MinMaxScaler` from `sklearn.preprocessing` to scale our predictors!

In [12]:
from sklearn.preprocessing import MinMaxScaler

# Scale every predictor to the [0, 1] range (the train-test split happens in the next cell)
scale = MinMaxScaler()
transformed = scale.fit_transform(X)
X = pd.DataFrame(transformed, columns = X.columns)
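
The cell above fits the scaler on all rows before splitting. A common alternative, sketched here on synthetic data (the arrays and variable names are assumptions for illustration), is to split first and learn each feature's minimum and maximum from the training rows only, so that no test information enters the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the housing set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse them; test values may fall outside [0, 1]

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```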


Perform the same linear regression on this data and print out R-squared and MSE.

In [13]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1)
# Fit the model and print R2 and MSE for train and test
lin = LinearRegression()
lin.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, lin.predict(X_train)))
print(" Train R^2 : ", lin.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, lin.predict(X_test)))
print(" Test R^2 : ", lin.score(X_test, y_test))

 Train Mean Squared Error:  1228802186.914
 Train R^2 :  0.7934326608365049
 Test Mean Squared Error:  1221377702.1495564
 Test R^2 :  0.8289459519460572
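
These scores match the unscaled model to many decimal places. That is expected: ordinary least squares with an intercept gives the same predictions after any invertible rescaling of the features, because the coefficients absorb the change of units. Scaling starts to matter once a penalty on coefficient size is added. A minimal sketch on synthetic data (all names and values below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on a large scale, with a known linear signal plus noise
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50, scale=10, size=(200, 4))
y_syn = X_raw @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=200)

X_scaled = MinMaxScaler().fit_transform(X_raw)

# Fit the same unpenalized model on raw and on scaled features
r2_raw = LinearRegression().fit(X_raw, y_syn).score(X_raw, y_syn)
r2_scaled = LinearRegression().fit(X_scaled, y_syn).score(X_scaled, y_syn)

print(r2_raw, r2_scaled)
```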


## Include dummy variables

We haven't included dummy variables so far: let's use our "object" variables again and create dummies.

In [24]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(['object'])

In [25]:
# Preview the categorical columns
X_cat.head()

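| | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | ... | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
| 1 | RL | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | ... | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
| 2 | RL | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | ... | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
| 3 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | ... | Detchd | Unf | TA | TA | Y | NaN | NaN | NaN | WD | Abnorml |
| 4 | RL | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | ... | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |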


In [26]:
# Make dummies
X_cat = pd.get_dummies(X_cat)

In [28]:
X_cat.shape

(1460, 252)
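
To see what `pd.get_dummies` does to categorical columns, here is a small sketch on a made-up frame (`toy_cat` and its values are assumptions for illustration). Each distinct category becomes its own indicator column, and a missing value gets no indicator at all by default.

```python
import pandas as pd

# Hypothetical categorical data with one missing value
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Alley": ["Grvl", None, "Pave"],
})

# One indicator column per (column, category) pair
toy_dummies = pd.get_dummies(toy_cat)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Two columns with two categories each expand to four indicator columns, which is why the 43 object columns in the housing data grow to 252.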

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [33]:
# Your code here
X_total = pd.concat([X_cat, X], axis = 1)

In [34]:
X_total.shape

(1460, 289)

Perform the same linear regression on this data and print out R-squared and MSE.

In [35]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_total, y, test_size = 0.3, random_state=1)
# Fit the model and print R2 and MSE for train and test
lin = LinearRegression()
lin.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, lin.predict(X_train)))
print(" Train R^2 : ", lin.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, lin.predict(X_test)))
print(" Test R^2 : ", lin.score(X_test, y_test))

 Train Mean Squared Error:  393370484.6849315
 Train R^2 :  0.9338725995183246
 Test Mean Squared Error:  4.048765859191095e+31
 Test R^2 :  -5.670299929484328e+21


Notice the severe overfitting above: our training R squared is quite high, but the testing R squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
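
One plausible contributor to this instability (an assumption, not something verified on the housing data here) is that a full set of dummy columns is perfectly collinear with the intercept, so the unpenalized least-squares solution is not unique. The sketch below uses a made-up single-column frame to show the rank deficiency and how `drop_first=True` removes it.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two levels
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave"]})

# Full dummy encoding plus an intercept column: the dummies sum to the intercept
full = pd.get_dummies(toy, dtype=float)
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(design)

# Dropping one level per feature restores full column rank
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

Regularization is another way to cope with such redundancy, since the penalty picks out one solution among the many that fit the training data equally well.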

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
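
For reference, the two penalized objectives can be written out. The form below follows scikit-learn's documented conventions for `Lasso` and `Ridge`, with $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization strength set by `alpha`; the intercept is omitted for brevity.

```latex
\begin{aligned}
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1 \\
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
\end{aligned}
```

Because the squared-error term is scaled differently in the two objectives, the same numeric `alpha` does not imply the same penalty strength for Lasso and Ridge.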

## Lasso

With default parameter (alpha = 1)

In [40]:
# Your code here
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, lasso.predict(X_train)))
print(" Train R^2 : ", lasso.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, lasso.predict(X_test)))
print(" Test R^2 : ", lasso.score(X_test, y_test))

 Train Mean Squared Error:  392179885.794983
 Train R^2 :  0.9340727447063188
 Test Mean Squared Error:  864976826.0455214
 Test R^2 :  0.8788599240779162


With a higher regularization parameter (alpha = 10)

In [41]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, lasso.predict(X_train)))
print(" Train R^2 : ", lasso.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, lasso.predict(X_test)))
print(" Test R^2 : ", lasso.score(X_test, y_test))

 Train Mean Squared Error:  401324904.9797467
 Train R^2 :  0.9325354246236293
 Test Mean Squared Error:  800277806.8367836
 Test R^2 :  0.8879210270612907


## Ridge

With default parameter (alpha = 1)

In [42]:
# Your code here
from sklearn.linear_model import Lasso, Ridge
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, ridge.predict(X_train)))
print(" Train R^2 : ", ridge.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, ridge.predict(X_test)))
print(" Test R^2 : ", ridge.score(X_test, y_test))

 Train Mean Squared Error:  491305352.9360016
 Train R^2 :  0.9174092945524075
 Test Mean Squared Error:  904413652.3114895
 Test R^2 :  0.8733367933024637


With a higher regularization parameter (alpha = 10)

In [43]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print(" Train Mean Squared Error: ", mean_squared_error(y_train, ridge.predict(X_train)))
print(" Train R^2 : ", ridge.score(X_train, y_train))
print(" Test Mean Squared Error: ", mean_squared_error(y_test, ridge.predict(X_test)))
print(" Test R^2 : ", ridge.score(X_test, y_test))

 Train Mean Squared Error:  681390750.5817802
 Train R^2 :  0.8854550587741216
 Test Mean Squared Error:  983521121.3448687
 Test R^2 :  0.8622577857312214


## Look at the metrics, what are your main conclusions?

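Both Ridge and Lasso greatly reduced the overfitting seen in the unregularized model: test R squared went from hugely negative to between 0.86 and 0.89. Lasso gave the highest test R squared (about 0.888 with alpha = 10), while for Ridge, increasing alpha from 1 to 10 slightly lowered it (from about 0.873 to 0.862).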
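
Only two values of alpha were tried above. A possible next step, sketched here on synthetic data, is to let cross-validation pick alpha from a grid using scikit-learn's `LassoCV` and `RidgeCV`. The grid and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic stand-in data (not the housing set)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10.0, random_state=0)

# Candidate regularization strengths on a log scale (hypothetical grid)
alphas = np.logspace(-2, 2, 9)

# Each estimator selects the alpha with the best cross-validated score
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_syn, y_syn)
ridge_cv = RidgeCV(alphas=alphas).fit(X_syn, y_syn)

print("Lasso alpha:", lasso_cv.alpha_)
print("Ridge alpha:", ridge_cv.alpha_)
```

In practice the search would be run on the training split only, keeping the test split for the final evaluation.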

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [44]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

6


In [45]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

73


Compare with the total length of the parameter space and draw conclusions!

In [46]:
# your code here
len(lin.coef_)

289

In [49]:
73/289

0.25259515570934254

In [None]:
# With alpha = 10, Lasso shrank 73 of the 289 coefficients (about 25%) to essentially zero, versus only 6 for Ridge. Dropping those features reduces model complexity, which helps with overfitting.
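
The difference in sparsity can be reproduced on synthetic data where most features are known to be irrelevant. This is a minimal sketch; the data from `make_regression` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 30 features, of which only 5 actually drive the target
X_syn, y_syn = make_regression(n_samples=200, n_features=30, n_informative=5,
                               noise=5.0, random_state=0)

lasso_demo = Lasso(alpha=10).fit(X_syn, y_syn)
ridge_demo = Ridge(alpha=10).fit(X_syn, y_syn)

# Count coefficients within 1e-10 of zero, as in the cells above
n_zero_lasso = int(np.sum(np.isclose(lasso_demo.coef_, 0.0, atol=1e-10)))
n_zero_ridge = int(np.sum(np.isclose(ridge_demo.coef_, 0.0, atol=1e-10)))

print("Lasso zeros:", n_zero_lasso, "of", lasso_demo.coef_.size)
print("Ridge zeros:", n_zero_ridge, "of", ridge_demo.coef_.size)
```

The L1 penalty can set coefficients exactly to zero, whereas the L2 penalty only shrinks them toward zero, so Ridge rarely produces exact zeros on continuous features.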

## Summary

Great! You now know how to perform Lasso and Ridge regression.