# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [16]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [17]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model contains only **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [18]:
# Load necessary packages
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn import linear_model



In [19]:
# Remove "object"-type features and SalePrice from `X`

X = df.select_dtypes(exclude=['object'])
X = X.drop(['SalePrice'], axis=1)








In [20]:
# Impute null values with each column's median.
# Assigning the result avoids chained in-place edits, which newer pandas versions warn about.
X = X.fillna(X.median())

In [23]:
# Create y
y = df[['SalePrice']]

Look at the information of `X` again.

In [24]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the training and test sets.

In [25]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

# Split in train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)




In [36]:
# Fit the model and print R2 and MSE for train and test
model = LinearRegression()

reg = model.fit(X_train, y_train)

y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test),2))

R2 Train: 0.8651115274692942
MSE Train:  822832852.45

R2 Test: 0.5427461687994772
MSE Test:  3257057125.12
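
The fit-and-score steps above recur for every model in this lab. As a minimal sketch, they could be wrapped in a small helper. The name `report_fit` and the synthetic data from `make_regression` are assumptions for illustration and are not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def report_fit(model, X_tr, X_te, y_tr, y_te):
    """Fit `model`, print R2 and MSE for both splits, and return them in a dict."""
    model.fit(X_tr, y_tr)
    metrics = {}
    for split, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        r2 = model.score(X_part, y_part)
        mse = mean_squared_error(y_part, model.predict(X_part))
        metrics[f"r2_{split}"] = r2
        metrics[f"mse_{split}"] = mse
        print(f"R2 {split}: {r2:.4f}   MSE {split}: {mse:.2f}")
    return metrics


# Synthetic stand-in for the housing data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=16)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=16)
demo_metrics = report_fit(LinearRegression(), X_tr, X_te, y_tr, y_te)
```

Returning the metrics as well as printing them makes it easier to compare models side by side later.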


## Normalize your data

We haven't normalized our data yet. Let's create a new model on scaled predictors! (You could use `preprocessing.scale`; the solution below uses `MinMaxScaler`, which rescales each feature to the range [0, 1].)

In [38]:
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

# Scale the data and perform train test split

scale = MinMaxScaler()
transformed = scale.fit_transform(X)
X = pd.DataFrame(transformed, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)
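
Note that the cell above fits the scaler on all rows before splitting, so the test rows influence the scaling. A common alternative, sketched here on random toy data (the arrays and the `_toy`/`_tr`/`_te` names are assumptions for illustration), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random toy data standing in for the housing predictors and target
rng = np.random.default_rng(16)
X_toy = rng.normal(size=(50, 3))
y_toy = rng.normal(size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=16)

scaler = MinMaxScaler().fit(X_tr)      # learn min/max from the training rows only
X_tr_scaled = scaler.transform(X_tr)   # training features span [0, 1]
X_te_scaled = scaler.transform(X_te)   # test features may fall slightly outside [0, 1]

print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```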


Perform the same linear regression on this data and print out R-squared and MSE.

In [39]:
# Your code here

model = LinearRegression()

reg = model.fit(X_train, y_train)

y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test),2))


R2 Train: 0.8651115274692943
MSE Train:  822832852.45

R2 Test: 0.5427461687994442
MSE Test:  3257057125.12
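
These scores match the unscaled model up to floating-point noise. That is expected for ordinary least squares: linearly rescaling each predictor changes the coefficients but not the fitted predictions. A small check on synthetic data (the `make_regression` data and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
X_raw = X_raw * np.array([1.0, 10.0, 100.0, 1000.0])  # put features on very different scales

X_scaled = MinMaxScaler().fit_transform(X_raw)

pred_raw = LinearRegression().fit(X_raw, y_raw).predict(X_raw)
pred_scaled = LinearRegression().fit(X_scaled, y_raw).predict(X_scaled)

# The two sets of predictions should agree up to floating-point error
print(np.allclose(pred_raw, pred_scaled))
```

Scaling starts to matter once a penalty on the coefficients is added, as in the Ridge and Lasso models later in this lab.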


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [40]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(exclude=['int64', 'float64'])
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422

In [41]:
# Make dummies

X_cat = pd.get_dummies(X_cat)


In [43]:
X_cat.head()

Unnamed: 0,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Grvl,Street_Pave,Alley_Grvl,Alley_Pave,LotShape_IR1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,0,0,0,1,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,0,0,0,1,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,0,0,0,1,0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,1,0
3,0,0,0,1,0,0,1,0,0,1,...,0,0,0,1,1,0,0,0,0,0
4,0,0,0,1,0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,1,0
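
To see what `pd.get_dummies` does on a small scale, here is a sketch on a toy frame. The toy rows are made up for illustration, and `drop_first=True` is an option the lab itself does not use.

```python
import pandas as pd

# Toy frame with two categorical columns (values chosen for illustration)
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                    "LotShape": ["Reg", "IR1", "Reg"]})

full = pd.get_dummies(toy)                      # one indicator column per level
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first level of each variable

print(list(full.columns))
print(list(reduced.columns))
```

By default, a missing value produces no indicator at all, so a row with a missing category gets zeros in every dummy column of that variable.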


Merge `X_cat` with our scaled `X` so you have one big predictor dataframe.

In [44]:
# Your code here
X = pd.concat([X,X_cat], axis=1)

In [45]:
X.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,0.0,0.235294,0.150685,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,...,0,0,0,1,0,0,0,0,1,0
1,0.000685,0.0,0.202055,0.038795,0.555556,0.875,0.753623,0.433333,0.0,0.173281,...,0,0,0,1,0,0,0,0,1,0
2,0.001371,0.235294,0.160959,0.046507,0.666667,0.5,0.934783,0.866667,0.10125,0.086109,...,0,0,0,1,0,0,0,0,1,0
3,0.002056,0.294118,0.133562,0.038561,0.666667,0.5,0.311594,0.333333,0.0,0.038271,...,0,0,0,1,1,0,0,0,0,0
4,0.002742,0.235294,0.215753,0.060576,0.777778,0.5,0.927536,0.833333,0.21875,0.116052,...,0,0,0,1,0,0,0,0,1,0


Perform the same linear regression on this data and print out R-squared and MSE.

In [46]:
# Your code here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)

model = LinearRegression()

reg = model.fit(X_train, y_train)

y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test),2))

R2 Train: 0.9290435927288627
MSE Train:  432841012.28

R2 Test: -9.667498644463323e+21
MSE Test:  6.8862398067527495e+31


Notice the severe overfitting above: the training R-squared is high, but the test R-squared is hugely negative, so the test predictions are far off. Likewise, the test MSE is many orders of magnitude larger than the training MSE.
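
One plausible contributor (an assumption, not something the output above confirms) is that the full set of dummy columns is linearly dependent: for a categorical variable without missing values, its dummies sum to a column of ones, which duplicates the intercept. Unregularized least squares then has no unique solution. The sketch below shows this rank deficiency on a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame with two categorical columns (values chosen for illustration)
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Grvl"],
                    "LotShape": ["Reg", "IR1", "IR1", "Reg"]})

dummies = pd.get_dummies(toy).to_numpy(dtype=float)
design = np.column_stack([np.ones(len(toy)), dummies])  # intercept + all dummy columns

n_columns = design.shape[1]
rank = np.linalg.matrix_rank(design)
print(n_columns, rank)  # the rank is smaller than the number of columns
```

A penalty on the coefficients, as in Ridge and Lasso, is one way to get stable estimates in this situation.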

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
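
For reference, these are the objectives that scikit-learn's `Lasso` and `Ridge` minimize, as stated in the scikit-learn documentation. Here $n$ is the number of samples, $w$ the coefficient vector, and $\alpha$ the regularization parameter; the intercept is not penalized.

```latex
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad\qquad
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
```

Because the loss terms are scaled differently and the penalties use different norms, the same value of $\alpha$ does not imply the same amount of shrinkage in the two models.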

## Lasso

With default parameter (alpha = 1)

In [53]:
# Your code here
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1)
reg = lasso.fit(X_train, y_train)

y_pred_train_lasso = reg.predict(X_train)
y_pred_test_lasso = reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train_lasso),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test_lasso),2))

R2 Train: 0.9396513617882085
MSE Train:  368132585.31

R2 Test: 0.5705461222447257
MSE Test:  3059035741.22


With a higher regularization parameter (alpha = 10)

In [52]:
# Your code here
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=10)
reg = lasso.fit(X_train, y_train)

y_pred_train_lasso = reg.predict(X_train)
y_pred_test_lasso = reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train_lasso),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test_lasso),2))

R2 Train: 0.9382778171767869
MSE Train:  376511341.55

R2 Test: 0.5962940887841155
MSE Test:  2875630830.97


## Ridge

With default parameter (alpha = 1)

In [54]:
# Your code here

ridge = Ridge(alpha=1)
reg = ridge.fit(X_train, y_train)

y_pred_train_ridge = reg.predict(X_train)
y_pred_test_ridge= reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train_ridge),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test_ridge),2))

R2 Train: 0.9355838945914466
MSE Train:  392944532.35

R2 Test: 0.6455423483196221
MSE Test:  2524831376.32


With a higher regularization parameter (alpha = 10)

In [55]:
# Your code here

ridge = Ridge(alpha=10)
reg = ridge.fit(X_train, y_train)

y_pred_train_ridge = reg.predict(X_train)
y_pred_test_ridge= reg.predict(X_test)

print('R2 Train:' , reg.score(X_train, y_train))
print('MSE Train: ', round(mean_squared_error(y_train, y_pred_train_ridge),2))
print("")
print('R2 Test:' , reg.score(X_test, y_test))
print('MSE Test: ', round(mean_squared_error(y_test, y_pred_test_ridge),2))

R2 Train: 0.9134793149547963
MSE Train:  527784626.34

R2 Test: 0.7381320661563114
MSE Test:  1865307104.21


## Look at the metrics, what are your main conclusions?   

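Conclusions here
- As alpha increases from 1 to 10, both models generalize better: test R2 rises and the gap between training and test R2 narrows
- Test MSE falls as well, while training MSE rises, which is the expected cost of regularization
- Ridge shows the more dramatic change. The same alpha does not mean the same penalty strength in both models: Ridge penalizes squared coefficients and Lasso absolute ones, and scikit-learn scales the two loss terms differently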
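
Two alpha values are a thin basis for a trend. As a sketch, a sweep over a few more values shows how training and test R2 move with alpha. It runs on synthetic data from `make_regression`; the data, the alpha grid, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data with many features relative to the number of rows
X_syn, y_syn = make_regression(n_samples=120, n_features=60, n_informative=10,
                               noise=20.0, random_state=16)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=16)

sweep = []  # (alpha, train R2, test R2)
for alpha in [0.1, 1, 10, 100]:
    ridge_demo = Ridge(alpha=alpha).fit(X_tr, y_tr)
    r2_train = ridge_demo.score(X_tr, y_tr)
    r2_test = ridge_demo.score(X_te, y_te)
    sweep.append((alpha, r2_train, r2_test))
    print(f"alpha={alpha:>5}: R2 train={r2_train:.3f}, R2 test={r2_test:.3f}")
```

Training R2 can only stay level or fall as alpha grows. Test R2 typically improves up to some point and then degrades once the model is over-regularized.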

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [58]:
# number of Ridge params almost zero
np.sum(np.abs(ridge.coef_) < 1e-10)

5

In [59]:
# number of Lasso params almost zero
np.sum(np.abs(lasso.coef_) < 1e-10)

31

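Lasso effectively performed variable selection: it shrank 31 coefficients to (almost exactly) zero, compared with 5 for Ridge. Compare these counts with the total number of coefficients to see what share of the variables was removed from your model.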
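
One way to make that comparison, sketched on synthetic data: count the near-zero coefficients and divide by the total number of coefficients. The `make_regression` data, the alpha value, and the `_demo` names are assumptions for illustration; on the lab data you would use the fitted `lasso` and `ridge` objects instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of the 30 features carry signal
X_syn, y_syn = make_regression(n_samples=200, n_features=30, n_informative=5,
                               noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=5.0).fit(X_syn, y_syn)
ridge_demo = Ridge(alpha=5.0).fit(X_syn, y_syn)

tol = 1e-10  # same "almost zero" threshold as in the cells above
n_total = lasso_demo.coef_.size
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < tol))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < tol))

print(f"Lasso: {n_zero_lasso} of {n_total} coefficients ~ 0 ({n_zero_lasso / n_total:.0%})")
print(f"Ridge: {n_zero_ridge} of {n_total} coefficients ~ 0 ({n_zero_ridge / n_total:.0%})")
```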

In [None]:
# your code here

## Summary

Great! You now know how to perform Lasso and Ridge regression.