# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard (unregularized) linear regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
# df.info()

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with a per-feature median. Note that the solution below uses the mean for `LotFrontage` and the median for `GarageYrBlt`.

Store the target in `y`.

In [3]:
# Remove "object"-type features (SalePrice is removed from `X` further down)
object_columns = list(df.select_dtypes(include='object').columns)
# MasVnrArea is also dropped here rather than imputed
object_columns.append('MasVnrArea')
df2 = df.drop(object_columns, axis=1)

# Impute null values (assign the result back instead of using inplace=True on a column)
df2['LotFrontage'] = df2['LotFrontage'].fillna(df2['LotFrontage'].mean())
df2['GarageYrBlt'] = df2['GarageYrBlt'].fillna(df2['GarageYrBlt'].median())

# Create X (predictors) and y (target)
X = df2.drop(['SalePrice'], axis=1)
y = df2['SalePrice']
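
The selection and imputation steps can also be written so that every numeric column is filled with its own median in one call. The sketch below is only an illustration on a made-up toy frame (`toy`, `X_toy`, and `y_toy` are hypothetical names, and the values are invented), not the lab's reference solution.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep only numeric columns, then separate predictors and target
numeric = toy.select_dtypes(include="number")
X_toy = numeric.drop(columns=["SalePrice"])
y_toy = numeric["SalePrice"]

# Replace each missing value with the median of its own column
X_toy = X_toy.fillna(X_toy.median())
print(X_toy)
```

Passing the Series returned by `median()` to `fillna` matches values by column name, so no per-column loop is needed.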


Look at the information of the cleaned data again. The cell below calls `df2.info()`, so the output still includes `SalePrice`.

In [4]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
Fireplaces       1460 non-null int64
Gar

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both the train and the test set.

In [5]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the model and print R2 and MSE (only the training-set metrics are printed here)
model = LinearRegression()
model.fit(X_train, y_train)

# Training-set predictions
y_hat = model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))

R2: 0.8093010423144277
MSE: 1105452189.6423388
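
The cell above prints training metrics only, while the task asks for both splits. A minimal sketch of one way to report both is shown below, assuming a fitted scikit-learn regressor. The helper `report_metrics` and the synthetic data from `make_regression` are illustrative assumptions, not part of the original lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return R2 and MSE on both splits for an already fitted model (hypothetical helper)."""
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        y_pred = model.predict(X_part)
        results[name] = {
            "r2": r2_score(y_part, y_pred),
            "mse": mean_squared_error(y_part, y_pred),
        }
    return results


# Synthetic stand-in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

demo_model = LinearRegression().fit(X_tr, y_tr)
metrics = report_metrics(demo_model, X_tr, X_te, y_tr, y_te)
for split, values in metrics.items():
    print(split, "R2:", values["r2"], "MSE:", values["mse"])
```

In the lab itself, the same helper could be called with `model`, `X_train`, `X_test`, `y_train`, and `y_test` from the cell above.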


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [6]:
from sklearn import preprocessing

# Scale the data (the train-test split happens in the next cell)
X_scale = preprocessing.scale(X)

Perform the same linear regression on this data and print out R-squared and MSE.

In [7]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.33, random_state=42)

# Fit the model and print R2 and MSE (only the training-set metrics are printed here)
model = LinearRegression()
model.fit(X_train, y_train)

# Training-set predictions
y_hat = model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))

R2: 0.8093009570358276
MSE: 1105452683.9891164
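
The R squared barely moves after scaling, which is expected: ordinary least squares with an intercept yields the same predictions whether or not the predictors are rescaled. Scaling starts to matter once a penalty on the coefficients is added, as in Ridge and Lasso. The sketch below illustrates this on synthetic data (an assumption, not the housing data) and uses a `Pipeline`, so that the scaler is fitted on the training split only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
# Give the columns very different scales, as is common in real data
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Plain linear regression on the raw features
raw_model = LinearRegression().fit(X_tr, y_tr)

# The pipeline learns the scaling parameters from the training split only
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

r2_raw = r2_score(y_te, raw_model.predict(X_te))
r2_scaled = r2_score(y_te, scaled_model.predict(X_te))
print("Test R2 without scaling:", r2_raw)
print("Test R2 with scaling:   ", r2_scaled)
```

Fitting the scaler inside a pipeline keeps information from the test split out of the preprocessing step, which scaling the full `X` before splitting does not.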


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [8]:
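# Create X_cat which contains only the categorical variables
# (MasVnrArea was appended to object_columns earlier, so drop it again;
# dropping on a new frame avoids modifying a slice of df in place)
X_cat = df[object_columns].drop(['MasVnrArea'], axis=1)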

In [9]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
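
To see what `pd.get_dummies` produces, here is a small sketch on a made-up two-column frame (`toy_cat` and its values are hypothetical). It also shows the `drop_first` option, which drops one level per feature.

```python
import pandas as pd

toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "N", "Y"],
})

# One indicator column per category level
dummies = pd.get_dummies(toy_cat)
print(list(dummies.columns))

# drop_first=True drops one level per feature, avoiding perfectly collinear columns
dummies_reduced = pd.get_dummies(toy_cat, drop_first=True)
print(list(dummies_reduced.columns))
```

Keeping all levels, as the lab does, creates redundant columns. That is one reason an unregularized linear model on this data can behave poorly.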

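Merge `X_cat` with the numeric data in `df2` so you have one big predictor dataframe. The solution below concatenates the unscaled `df2`, which still contains `SalePrice`; scaling is applied again later, before fitting Lasso and Ridge.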

In [13]:
# Your code here
df3 = pd.concat([df2, X_cat], axis=1)
df3['SalePrice']

0       208500
1       181500
2       223500
3       140000
4       250000
5       143000
6       307000
7       200000
8       129900
9       118000
10      129500
11      345000
12      144000
13      279500
14      157000
15      132000
16      149000
17       90000
18      159000
19      139000
20      325300
21      139400
22      230000
23      129900
24      154000
25      256300
26      134800
27      306000
28      207500
29       68500
         ...  
1430    192140
1431    143750
1432     64500
1433    186500
1434    160000
1435    174000
1436    120500
1437    394617
1438    149700
1439    197000
1440    191000
1441    149300
1442    310000
1443    121000
1444    179600
1445    129000
1446    157900
1447    240000
1448    112000
1449     92000
1450    136000
1451    287090
1452    145000
1453     84500
1454    185000
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

Perform the same linear regression on this data and print out R-squared and MSE.

In [14]:
X = df3.drop(['SalePrice'], axis=1)
y = df3['SalePrice']

# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the model and print R2 and MSE (only the training-set metrics are printed here)
model = LinearRegression()
model.fit(X_train, y_train)

# Training-set predictions
y_hat = model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))

R2: 0.9377679845011461
MSE: 360749312.0780114


Notice the severe overfitting: the training R squared is quite high, but the testing R squared (not printed in the cell above) is negative, so predictions on unseen data are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
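
A gap like this between training and testing performance can be reproduced on purpose. The sketch below uses random synthetic data (an assumption, unrelated to the housing data) with more predictors than training rows, so least squares can fit the training set almost perfectly while generalizing poorly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 60 rows, 50 predictors; only the first predictor actually drives the target
X_wide = rng.normal(size=(60, 50))
y_wide = 3.0 * X_wide[:, 0] + rng.normal(size=60)

# test_size=20 keeps 20 rows for testing and 40 for training
X_tr, X_te, y_tr, y_te = train_test_split(X_wide, y_wide, test_size=20, random_state=42)

# With 40 training rows and 50 predictors, least squares can fit the training data exactly
wide_model = LinearRegression().fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, wide_model.predict(X_tr))
test_r2 = r2_score(y_te, wide_model.predict(X_te))
print("Train R2:", train_r2)
print("Test R2: ", test_r2)
```

Comparing the two printed values on every model is the quickest check for overfitting.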

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to fit both Lasso and Ridge regression models. Each time, look at R-squared and MSE.

## Lasso

With default parameter (alpha = 1)

In [39]:
# Your code here

X = df3.drop(['SalePrice'], axis=1)
y = df3['SalePrice']

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
X_scale = pd.DataFrame(X_scale, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.33, random_state=42)

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1)
lasso_model = lasso.fit(X_train,y_train)

# Training-set predictions and metrics
y_hat = lasso_model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))


R2: 0.9377674304423264
MSE: 360752523.87079144


With a higher regularization parameter (alpha = 10)

In [41]:
# Your code here
X = df3.drop(['SalePrice'], axis=1)
y = df3['SalePrice']

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
X_scale = pd.DataFrame(X_scale, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.33, random_state=42)

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=10)
lasso_model = lasso.fit(X_train,y_train)

# Training-set predictions and metrics
y_hat = lasso_model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))

R2: 0.937718090928288
MSE: 361038537.3576586
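
Moving from alpha = 1 to alpha = 10 changes the training metrics only slightly here. In scikit-learn's `Lasso`, alpha multiplies the L1 penalty on the coefficients, so its effect depends on the scale of the target and of the coefficients. The sketch below, on synthetic standardized data (an assumption, not the housing data), shows how the number of non-zero coefficients reacts over a much wider range of alpha values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Count how many coefficients survive at each penalty strength
nonzero_counts = {}
for alpha in [0.1, 1, 10, 1000]:
    lasso_demo = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso_demo.coef_ != 0))

print(nonzero_counts)
```

A sufficiently large alpha sets every coefficient to exactly zero, leaving only the intercept.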


## Ridge

With default parameter (alpha = 1)

In [42]:
# Your code here
X = df3.drop(['SalePrice'], axis=1)
y = df3['SalePrice']

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
X_scale = pd.DataFrame(X_scale, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.33, random_state=42)

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1)
ridge_model = ridge.fit(X_train,y_train)

# Training-set predictions and metrics
y_hat = ridge_model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))

R2: 0.9377582360481548
MSE: 360805822.3433829


With a higher regularization parameter (alpha = 10)

In [53]:
# Your code here
X = df3.drop(['SalePrice'], axis=1)
y = df3['SalePrice']

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
X_scale = pd.DataFrame(X_scale, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.33, random_state=42)

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=10)
ridge_model = ridge.fit(X_train,y_train)

# Training-set predictions and metrics
y_hat = ridge_model.predict(X_train)
print("R2:", r2_score(y_train, y_hat))
print("MSE:", mean_squared_error(y_train, y_hat))

R2: 0.9374066363411243
MSE: 362843991.1447836
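
Ridge shrinks coefficients toward zero as alpha grows, but in general does not set them exactly to zero. The following sketch illustrates this on synthetic standardized data (an assumption, not the housing data) by tracking the L2 norm of the coefficient vector and the number of exact zeros.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = []
exact_zeros = []
for alpha in [1, 10, 100, 1000]:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Overall size of the coefficient vector at this penalty strength
    coef_norms.append(float(np.linalg.norm(ridge_demo.coef_)))
    # Number of coefficients that are exactly zero
    exact_zeros.append(int(np.sum(ridge_demo.coef_ == 0)))

print("L2 norm of coefficients:", coef_norms)
print("Coefficients exactly zero:", exact_zeros)
```

The norm decreases with every increase in alpha, which is the shrinkage effect the penalty is designed to produce.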


## Look at the metrics, what are your main conclusions?   

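On the training metrics printed above, the four regularized models are almost indistinguishable, with R2 between 0.9374 and 0.9378. Lasso with alpha=1 has the highest training R2 (0.93777) and Ridge with alpha=10 the lowest (0.93741). As expected, a larger alpha slightly reduces the training fit.

Because regularization aims to improve generalization, the models should be ranked on their test-set metrics, which are not printed above.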
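
One way to compare the models side by side is to collect training and testing R2 in a single table. The sketch below does this on synthetic data with pipelines, so scaling is learned from the training split only. The data, the names `candidates` and `comparison`, and the resulting numbers are illustrative assumptions and say nothing about the housing results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=150, n_features=60, n_informative=10, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

candidates = {
    "Linear": LinearRegression(),
    "Lasso, alpha=1": Lasso(alpha=1, max_iter=10000),
    "Lasso, alpha=10": Lasso(alpha=10, max_iter=10000),
    "Ridge, alpha=1": Ridge(alpha=1),
    "Ridge, alpha=10": Ridge(alpha=10),
}

rows = []
for name, estimator in candidates.items():
    # Scale inside the pipeline, then fit the estimator on the training split
    pipeline = make_pipeline(StandardScaler(), estimator).fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_r2": r2_score(y_tr, pipeline.predict(X_tr)),
        "test_r2": r2_score(y_te, pipeline.predict(X_te)),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.sort_values("test_r2", ascending=False))
```

Unregularized least squares always has the best training R2 on a given feature set, so the `test_r2` column is the one to rank by.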

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [77]:
# Number of Ridge params with |coefficient| >= 1 ("selected"); the rest count as almost zero
# ridge_model is the most recently fitted Ridge model (alpha=10)
ridge_params = pd.DataFrame(ridge_model.coef_, X_scale.columns, columns=['Coefficient'])
ridge_close_to_0 = ridge_params.loc[ridge_params.Coefficient.abs() < 1]
print(len(ridge_params) - len(ridge_close_to_0), "out of", len(ridge_params), "params selected")

283 out of 288 params selected


In [79]:
# Number of Lasso params with |coefficient| >= 1 ("selected"); the rest count as almost zero
# lasso_model is the most recently fitted Lasso model (alpha=10)
lasso_params = pd.DataFrame(lasso_model.coef_, X_scale.columns, columns=['Coefficient'])
lasso_close_to_0 = lasso_params.loc[lasso_params.Coefficient.abs() < 1]
print(len(lasso_params) - len(lasso_close_to_0), "out of", len(lasso_params), "params selected")

249 out of 288 params selected
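
The cutoff of 1 used in the two cells above is a choice that depends on the scale of the target. A small helper makes the tolerance explicit, so a stricter cutoff can be tried as well. The function `count_near_zero` and the example coefficients below are hypothetical, for illustration only.

```python
import numpy as np


def count_near_zero(coefficients, tol=1.0):
    """Count coefficients whose absolute value is below tol (hypothetical helper)."""
    coefficients = np.asarray(coefficients, dtype=float)
    return int(np.sum(np.abs(coefficients) < tol))


example_coefs = [0.0, 0.4, -0.9, 2.5, -1200.0]

# Cutoff of 1, as in the cells above
loose_count = count_near_zero(example_coefs)
# Much stricter cutoff: only coefficients that are essentially zero
strict_count = count_near_zero(example_coefs, tol=1e-10)

print(loose_count, strict_count)
```

In the lab, the helper could be applied to `ridge_model.coef_` and `lasso_model.coef_`. Since Lasso can set coefficients exactly to zero, a strict tolerance typically separates it from Ridge more sharply.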


Lasso effectively performed variable selection: with the threshold above, it removed 39 of the 288 variables (about 14%) from your model, compared with only 5 for Ridge.

In [81]:
# Fraction of the parameters that Lasso kept
249/288

0.8645833333333334
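
The value above is the share of parameters that Lasso kept. The share it removed follows from the counts printed earlier, as this short calculation shows (the variable names are illustrative).

```python
total_params = 288
lasso_selected = 249
ridge_selected = 283

# Parameters treated as (almost) zero by each model
lasso_removed = total_params - lasso_selected
ridge_removed = total_params - ridge_selected

print("Lasso kept:   ", lasso_selected / total_params)
print("Lasso removed:", lasso_removed / total_params)
print("Ridge removed:", ridge_removed / total_params)
```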

## Summary

Great! You now know how to perform Lasso and Ridge regression.