# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice what you've learned about Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')
backup = df.copy()

Look at `df.info()`.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, make a selection of the data by removing the columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
# Load necessary packages
import numpy as np

# remove "object"-type features and SalePrice from `X`
df1 = df.select_dtypes(exclude='object')
X = df1.drop('SalePrice', axis=1)

# Impute null values
for col in X.columns:
    X[col] = X[col].fillna(X[col].median())

# Create y
y = df.SalePrice
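
The column-by-column loop above can also be written as a single vectorized call. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not part of the housing data); it relies on `DataFrame.median()` skipping missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the numeric housing columns
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 100.0, np.nan, 200.0],
})

# toy.median() returns one median per column (NaN values are skipped);
# fillna matches those medians to the columns by name
toy_imputed = toy.fillna(toy.median())
print(toy_imputed)
```

Applied to the lab data, the same idea would read `X = X.fillna(X.median())`.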

Look at the information of `X` again

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both the train and test sets.

In [5]:
from sklearn.metrics import mean_squared_error
# Import only what is used instead of a wildcard import
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(f'Linear Train R^2 :\t{linreg.score(X_train, y_train)}')
print(f'Linear Test R^2 :\t{linreg.score(X_test, y_test)}\n')
print(f'Linear Train MSE :\t{mean_squared_error(y_train, linreg.predict(X_train))}')
print(f'Linear Test MSE :\t{mean_squared_error(y_test, linreg.predict(X_test))}')

Linear Train R^2 :	0.8046400702344446
Linear Test R^2 :	0.8240865405545486

Linear Train MSE :	1186117094.297817
Linear Test MSE :	1232328149.8815475
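
An MSE in the billions is hard to read because it is expressed in squared dollars. The sketch below, on hypothetical prices (`y_true` and `y_pred` are made-up values), computes MSE and R squared by hand, checks them against scikit-learn, and converts the test MSE printed above into an RMSE in dollars.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical sale prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 260000.0])

residuals = y_true - y_pred

# MSE: mean of the squared residuals
mse = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))

# RMSE puts the error back on the scale of the target (dollars);
# the value below is the test MSE reported by the naive model above
test_rmse = np.sqrt(1232328149.8815475)
print(test_rmse)
```

The resulting test RMSE is roughly 35,100, so the naive model's typical error is on the order of $35,000 per house.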


## Standardize your data

We haven't standardized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [6]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
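
Note that `preprocessing.scale(X)` computes the mean and standard deviation from all rows, including the ones that end up in the test set. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical and not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric predictors and target
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

# Split first, so the test rows play no part in the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows
X_te_scaled = scaler.transform(X_te)      # apply the same mean/std to the test rows

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

The training columns end up with mean 0 and standard deviation 1; the test columns will be close to that, but not exactly, which is expected.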

Perform the same linear regression on this data and print out R-squared and MSE.

In [7]:
linreg_stand = LinearRegression()
linreg_stand.fit(X_train, y_train)
print(f'Scaled Linear Train R^2 :\t{linreg_stand.score(X_train, y_train)}')
print(f'Scaled Linear Test R^2 :\t{linreg_stand.score(X_test, y_test)}\n')
print(f'Scaled Linear Train MSE :\t{mean_squared_error(y_train, linreg_stand.predict(X_train))}')
print(f'Scaled Linear Test MSE :\t{mean_squared_error(y_test, linreg_stand.predict(X_test))}')

Scaled Linear Train R^2 :	0.8047331105936881
Scaled Linear Test R^2 :	0.8242471379110129

Scaled Linear Train MSE :	1185552204.8617349
Scaled Linear Test MSE :	1231203115.7665255


## Include dummy variables

Your model hasn't included dummy variables so far. Let's go back to the "object" variables and create dummies for them.

In [8]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(include='object')

In [9]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
np.shape(X_cat)

(1460, 252)
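
To see why 43 categorical columns turn into 252 dummy columns, it can help to look at what `pd.get_dummies` does on a tiny example. The frame below is hypothetical (only the column names are borrowed from the housing data). It also illustrates that, with the default `dummy_na=False`, a missing value simply gets zeros in every dummy column of that variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names borrowed from the housing data
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Alley": ["Grvl", np.nan, np.nan],
})

# One new column per (variable, category) pair
dummies = pd.get_dummies(toy_cat)
print(list(dummies.columns))
print(dummies)
```

Each category becomes its own indicator column, so variables with many levels (such as `Neighborhood`) account for most of the growth in column count.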

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [10]:
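# Keep the original column names: recent scikit-learn versions reject
# DataFrames whose column names mix integers and strings
X_all = pd.concat([pd.DataFrame(X_scaled, columns=X.columns), X_cat], axis=1)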

Perform the same linear regression on this data and print out R-squared and MSE.

In [11]:
# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X_all, y, random_state=42)

# Fit the model and print R2 and MSE for train and test
linreg_scaled_cats = LinearRegression()
linreg_scaled_cats.fit(X_train, y_train)
print(f'Scaled (w/cats) Linear Train R^2 :\t{linreg_scaled_cats.score(X_train, y_train)}')
print(f'Scaled (w/cats) Linear Test R^2 :\t{linreg_scaled_cats.score(X_test, y_test)}\n')
print(f'Scaled (w/cats) Linear Train MSE :\t{mean_squared_error(y_train, linreg_scaled_cats.predict(X_train))}')
print(f'Scaled (w/cats) Linear Test MSE :\t{mean_squared_error(y_test, linreg_scaled_cats.predict(X_test))}')

Scaled (w/cats) Linear Train R^2 :	0.9394011383437874
Scaled (w/cats) Linear Test R^2 :	-1.1751732418154632e+18

Scaled (w/cats) Linear Train MSE :	367922663.5251142
Scaled (w/cats) Linear Test MSE :	8.232451748956811e+27


Notice the severe overfitting above; our training R squared is quite high, but the testing R squared is negative! Our predictions are far off. Similarly, the scale of the Testing MSE is orders of magnitude higher than that of the training.
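
One likely contributor to this blow-up is that a full set of dummy columns is perfectly collinear: the dummies of each categorical variable sum to 1, which duplicates the intercept. The sketch below demonstrates this on a hypothetical toy frame (the values are made up); it is an illustration of the mechanism, not a diagnosis of the exact failure above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two categorical variables
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2", "Reg", "Reg", "IR1"],
})

# Full dummy encoding: 2 columns for Street, 3 for LotShape
D = pd.get_dummies(toy).to_numpy(dtype=float)

# Add the intercept column that LinearRegression fits implicitly
design = np.column_stack([np.ones(len(D)), D])

rank = np.linalg.matrix_rank(design)
print(design.shape[1], rank)
```

The design matrix has 6 columns but rank 4, so ordinary least squares has no unique solution and can return very large coefficients that generalize poorly. Regularization, as in the next section, makes the problem well-posed again.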

## Perform Ridge and Lasso regression

Use all the data (standardized features and dummy categorical variables) and fit both a Lasso and a Ridge regression model. Each time, look at R-squared and MSE.
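
For reference, these are the objectives that scikit-learn's `Ridge` and `Lasso` minimize, written in notation that is an assumption of this note rather than part of the lab: $X$ is the predictor matrix, $y$ the target, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the `alpha` parameter. The intercept is not penalized.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the squared-error term is scaled by $1/(2n)$ for Lasso but not for Ridge, the same numeric value of `alpha` does not mean the same strength of regularization in the two models.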

## Lasso

With default parameter (alpha = 1)

In [19]:
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)
print('Training r^2:\t', lasso.score(X_train, y_train))
print('Testing r^2:\t', lasso.score(X_test, y_test))
print('Training MSE:\t', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:\t', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2:	 0.9367137158658911
Testing r^2:	 0.8966464099462678
Training MSE:	 384239201.64251727
Testing MSE:	 724023839.9951768


With a higher regularization parameter (alpha = 10)

In [20]:
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print('Training r^2:\t', lasso.score(X_train, y_train))
print('Testing r^2:\t', lasso.score(X_test, y_test))
print('Training MSE:\t', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:\t', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2:	 0.9345920827385311
Testing r^2:	 0.9021283694468796
Training MSE:	 397120580.7626387
Testing MSE:	 685621019.4809843


## Ridge

With default parameter (alpha = 1)

In [21]:
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
print('Training r^2:\t', ridge.score(X_train, y_train))
print('Testing r^2:\t', ridge.score(X_test, y_test))
print('Training MSE:\t', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:\t', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2:	 0.9229902012717633
Testing r^2:	 0.8863756622794857
Training MSE:	 467560767.5003401
Testing MSE:	 795973601.5995784


With a higher regularization parameter (alpha = 10)

In [22]:
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print('Training r^2:\t', ridge.score(X_train, y_train))
print('Testing r^2:\t', ridge.score(X_test, y_test))
print('Training MSE:\t', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:\t', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2:	 0.899324376597647
Testing r^2:	 0.8793879667615905
Training MSE:	 611246523.4806529
Testing MSE:	 844924568.2660812


## Look at the metrics, what are your main conclusions?   

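Both regularized models remove the severe overfitting of the unregularized model with dummies: the test R^2 goes from a huge negative value to roughly 0.88-0.90. Lasso is slightly more accurate than Ridge on the test set for both values of alpha (test R^2 of about 0.897 and 0.902 versus 0.886 and 0.879), and its test MSE is lower as well. Increasing alpha from 1 to 10 slightly improved the Lasso test scores and slightly worsened the Ridge test scores.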
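
Only two values of alpha were tried above. One way to choose alpha more systematically is cross-validation on the training data, for example with scikit-learn's `LassoCV` and `RidgeCV`. The sketch below uses synthetic data from `make_regression`; the grid `alphas` and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=300, n_features=50,
                                 n_informative=10, noise=10.0,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Hypothetical grid of candidate alphas, evenly spaced on a log scale
alphas = np.logspace(-2, 2, 9)

# LassoCV: 5-fold cross-validation over the grid
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_tr, y_tr)

# RidgeCV: by default uses an efficient form of leave-one-out cross-validation
ridge_cv = RidgeCV(alphas=alphas).fit(X_tr, y_tr)

print('Chosen Lasso alpha:', lasso_cv.alpha_)
print('Chosen Ridge alpha:', ridge_cv.alpha_)
print('Lasso test R^2:', lasso_cv.score(X_te, y_te))
print('Ridge test R^2:', ridge_cv.score(X_te, y_te))
```

The test set is only used once at the end, to score the models with their cross-validated alpha.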

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [24]:
print('low Ridge parameter coefficients:', sum(abs(ridge.coef_) < 10**(-10)))
print('low Lasso parameter coefficients:', sum(abs(lasso.coef_) < 10**(-10)))

low Ridge parameter coefficients: 4
low Lasso parameter coefficients: 56


Lasso was very effective at performing variable selection: it set 56 of the 289 coefficients (37 numeric features plus 252 dummies), or about 19%, to essentially zero, while Ridge shrank only 4 coefficients that far.
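
The same comparison can be reproduced on synthetic data where only a few features truly matter. In this sketch (all names and settings are hypothetical), the number of near-zero coefficients is counted with the same threshold as above and compared with the total number of coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 30 features, only 5 of which carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=5.0,
                                 random_state=0)

lasso_demo = Lasso(alpha=10).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=10).fit(X_demo, y_demo)

n_features = X_demo.shape[1]

# Count coefficients that are (very close to) zero
lasso_zeros = int(np.sum(np.abs(lasso_demo.coef_) < 1e-10))
ridge_zeros = int(np.sum(np.abs(ridge_demo.coef_) < 1e-10))

print(f'Lasso: {lasso_zeros} of {n_features} coefficients are ~0')
print(f'Ridge: {ridge_zeros} of {n_features} coefficients are ~0')
```

The L1 penalty can set coefficients exactly to zero, whereas the L2 penalty only shrinks them toward zero, which is why Lasso doubles as a feature selector.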


## Summary

Great! You now know how to perform Lasso and Ridge regression.