# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

* Use Lasso and ridge regression in Python
* Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at df.info

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, make a selection of the data by removing some of the data with `dtype = object`, this way our first model only contains **continuous features**

Make sure to remove the SalesPrice column from the predictors (which you store in `X`), then replace missing inputs by the median per feature.

Store the target in `y`.

In [5]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# remove "object"-type features and SalesPrice from `X`
features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col!='SalePrice']
X = df[features]

# Impute null values
for col in X:
    med = X[col].median()
    X[col].fillna(value = med, inplace = True)

# Create y
y = df.SalePrice



Look at the information of `X` again

In [6]:
X.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,548,0,61,0,0,0,0,0,2,2008
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,460,298,0,0,0,0,0,0,5,2007
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,608,0,42,0,0,0,0,0,9,2008
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,642,0,35,272,0,0,0,0,2,2006
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,836,192,84,0,0,0,0,0,12,2008


## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [14]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test

X_train,X_test,y_train,y_test = train_test_split(X,y)
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(x_train,y_train)
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)
print('Training r^2:', linreg.score(X_train, y_train))
print('Testing r^2:', linreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg.predict(X_test)))

Training r^2: 0.06820478513696138
Testing r^2: 0.08517951472667928
Training MSE: 5212481364.45187
Testing MSE: 7722620714.421291


## Normalize your data

We haven't normalized our data, let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [19]:
from sklearn import preprocessing

# Scale the data and perform train test split

X = preprocessing.scale(X)
X_train,X_test,y_train,y_test = train_test_split(X,y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [20]:
# Your code here
linreg = LinearRegression()
linreg.fit(x_train,y_train)
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)
print('Training r^2:', linreg.score(X_train, y_train))
print('Testing r^2:', linreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg.predict(X_test)))

Training r^2: 0.09469094514220455
Testing r^2: 0.09670936971259836
Training MSE: 5822881861.038396
Testing MSE: 5353962961.963963


## Include dummy variables

Your model hasn't included dummy variables so far: let's use the "object" variables again and create dummies

In [28]:
# Create X_cat which contains only the categorical variables
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype in [np.object]]
X_cat = df[features_cat]



numpy.ndarray

In [26]:
# Make dummies
lst = []
for col in features_cat:
    dummy = pd.get_dummies(X_cat,drop_first=True)
    lst.append(dummy)
cats = pd.concat(lst,axis=1)


In [25]:
cats.head()

Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Pave,Alley_Pave,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_HLS,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,0,0,1,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,1,0
1,0,0,1,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,1,0
2,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


Merge `x_cat` together with our scaled `X` so you have one big predictor dataframe.

In [29]:
# Your code here
x_cat = pd.concat([pd.DataFrame(X),cats],axis=1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [30]:
# Your code here
linreg.fit(x_cat,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [31]:
linreg.predict(x_cat)

array([206699.19589041, 205850.57089041, 204509.00839041, ...,
       280491.25057791, 140978.32089041, 149193.59432791])

In [34]:
linreg.score(x_cat,y) , mean_squared_error(y,linreg.predict(x_cat))

(0.9323523663274966, 426639323.8716349)

Notice the severe overfitting above; our training R squared is quite high, but the testing R squared is negative! Our predictions are far off. Similarly, the scale of the Testing MSE is orders of magnitude higher than that of the training.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform Lasso and Ridge regression for both! Each time, look at R-squared and MSE.

## Lasso

With default parameter (alpha = 1)

In [39]:
# Your code here
from sklearn.linear_model import Lasso,Ridge

lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)
print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2: 0.8043179169280736
Testing r^2: 0.8319885166160732
Training MSE: 1258612896.8175504
Testing MSE: 995833709.6178709


With a higher regularization parameter (alpha = 10)

In [40]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2: 0.8043168941454191
Testing r^2: 0.8321120726161572
Training MSE: 1258619475.281043
Testing MSE: 995101371.4024645


## Ridge

With default parameter (alpha = 1)

In [46]:
# Your code here
ridge = Ridge(alpha=.1)
ridge.fit(X_train,y_train)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.8043179228901093
Testing r^2: 0.831981236326196
Training MSE: 1258612858.4701715
Testing MSE: 995876861.1806659


With default parameter (alpha = 10)

In [45]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train,y_train)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.8042763626709853
Testing r^2: 0.8327131390226046
Training MSE: 1258880170.7706273
Testing MSE: 991538744.7462175


## Look at the metrics, what are your main conclusions?   

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [47]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

0


In [None]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

Lasso was very effective to essentially perform variable selection and remove about 25% of the variables from your model!

In [None]:
# your code here

## Summary

Great! You now know how to perform Lasso and Ridge regression.