# Ames Housing - LASSO (Least Absolute Shrinkage and Selection Operator)
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages. Set up workspace, e.g., set theme for plotting and initialize the random number generator.

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## Load data

Load training data from CSV file.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/ames_housing/data/train.csv')

In [None]:
data.head()

## Prepare data

First, we will remove some columns that are not useful for our task.

In [None]:
data = data.drop(['house_id', 'YrSold', 'MoSold', 'SaleCondition', 'SaleType'], axis=1)

Next, we will split the data into features (*X*) and labels (*y*) and into training (*X_train, y_train*) and test (*X_test, y_test*) sets.

In [None]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Finally, we will do some feature engineering. It is important to use only information from the training set when fitting the feature engineering steps, and then to repeat exactly the same steps mechanically on the test set.

Typically, feature engineering depends strongly on the data type of the variables. Hence, we will first determine which variables are categorical and which are numerical. Subsequently, we will transform these two groups separately.

In [None]:
categorical_features = X_train.select_dtypes(include='object').columns
numerical_features = X_train.select_dtypes(exclude='object').columns

The categorical variables must be transformed into numerical representations, e.g., by one-hot encoding them.

In [None]:
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(X_train[categorical_features])

X_train_cat = enc.transform(X_train[categorical_features])
X_test_cat = enc.transform(X_test[categorical_features])

X_train_cat = pd.DataFrame(X_train_cat, columns=enc.get_feature_names_out(categorical_features))
X_test_cat = pd.DataFrame(X_test_cat, columns=enc.get_feature_names_out(categorical_features))

In [None]:
X_train_cat.head()

The numerical variables will be standardized, that is, we will subtract the mean and divide by the standard deviation. This is especially important for LASSO: the penalty is applied to the size of every coefficient alike, so the coefficients need to be comparable in terms of units and magnitudes.

In [None]:
scaler = StandardScaler()
scaler.fit(X_train[numerical_features]) 

X_train_num = scaler.transform(X_train[numerical_features])
X_test_num = scaler.transform(X_test[numerical_features])

X_train_num = pd.DataFrame(X_train_num, columns=numerical_features)
X_test_num = pd.DataFrame(X_test_num, columns=numerical_features)
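
The following small sketch illustrates the idea on a single made-up column (the values are hypothetical). The scaler learns mean and standard deviation from the training values only and reuses them for the test values. Note that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values, e.g. a living area in square feet
toy_train = np.array([[1000.0], [2000.0], [3000.0]])
toy_test = np.array([[4000.0]])

toy_scaler = StandardScaler().fit(toy_train)  # statistics come from the training data only
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# The same calculation by hand, using the training mean and standard deviation
by_hand = (toy_test - toy_train.mean()) / toy_train.std()

print(toy_train_scaled.ravel())
print(toy_test_scaled.ravel(), by_hand.ravel())
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics.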

In [None]:
X_train_num.head()

Let's fuse the engineered categorical and numerical variables again.

In [None]:
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

In [None]:
X_train.head()

## LASSO regression

We will start by initializing a LASSO model with an arbitrary lambda (called *alpha* in sklearn) value and fitting it on the training data.

In [None]:
lasso_mod = Lasso(alpha=1)
lasso_mod.fit(X_train, y_train)
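
For reference, scikit-learn's `Lasso` minimizes the following objective, where $n$ is the number of training observations, $w$ is the coefficient vector, and $\alpha$ is the penalty strength (the lambda from above):

$$\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

The first term is the usual squared error, the second term penalizes the sum of the absolute coefficient values. The larger alpha, the stronger the shrinkage and the more coefficients are set to exactly zero.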

Evaluate the model on both training and test set using *R2* and *RMSE* as metrics.

In [None]:
# Training data
pred_train = lasso_mod.predict(X_train)
r2_train = r2_score(y_train, pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))  # RMSE = square root of MSE
print('R2 on training set:', round(r2_train, 2))
print('RMSE on training set:', round(rmse_train, 2))

print("===")

# Test data
pred_test = lasso_mod.predict(X_test)
r2_test = r2_score(y_test, pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))  # RMSE = square root of MSE
print('R2 on test set:', round(r2_test, 2))
print('RMSE on test set:', round(rmse_test, 2))
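
Since the same evaluation is repeated several times in this notebook, it could be wrapped in a small helper. The function `evaluate` below is a hypothetical convenience function (it is not part of scikit-learn), shown here on synthetic data from `make_regression` so that the sketch runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y, label):
    """Hypothetical helper: print and return R2 and RMSE of a fitted model."""
    pred = model.predict(X)
    r2 = r2_score(y, pred)
    rmse = np.sqrt(mean_squared_error(y, pred))
    print(f"R2 on {label} set: {r2:.2f}")
    print(f"RMSE on {label} set: {rmse:.2f}")
    return r2, rmse


# Synthetic stand-in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_mod = Lasso(alpha=1).fit(X_tr, y_tr)
r2_train_demo, rmse_train_demo = evaluate(demo_mod, X_tr, y_tr, "training")
print("===")
r2_test_demo, rmse_test_demo = evaluate(demo_mod, X_te, y_te, "test")
```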

Next, we will try 100 different lambda (*alpha*) values, evenly spaced between 0 and 1000. Note that *alpha = 0* means no penalty at all, i.e., ordinary least squares. The loop below fits 100 different LASSO models, each with a different alpha, and collects the test set RMSE and the estimated coefficients in a dataframe.

In [None]:
alphas = np.linspace(0, 1000, 100)

lasso_mod = Lasso()

results = []
for a in alphas:
    result = {}
    lasso_mod.set_params(alpha=a)
    lasso_mod.fit(X_train, y_train)
    pred_test = lasso_mod.predict(X_test)

    rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))  # RMSE = square root of MSE
    
    coef_names = lasso_mod.feature_names_in_
    coef_values = lasso_mod.coef_

    result["alpha"] = a
    result["rmse"] = rmse_test
    # store one entry per feature: feature name -> estimated coefficient
    result.update(zip(coef_names, coef_values))

    results.append(result)


In [None]:
results_df = pd.DataFrame(results)
results_df.head()
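
The coefficient part of this loop can also be computed in one call with scikit-learn's `lasso_path` function. The sketch below is an alternative under stated assumptions: it uses synthetic data from `make_regression` instead of the housing data, and a logarithmic alpha grid chosen for illustration only. `lasso_path` does not fit an intercept, so the data are centered first.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Synthetic stand-in: 20 features, of which only 5 influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=42)

# lasso_path does not fit an intercept, so we center features and target ourselves
X_demo = X_demo - X_demo.mean(axis=0)
y_demo = y_demo - y_demo.mean()

alphas_demo = np.logspace(-1, 3, 50)  # illustrative grid from 0.1 to 1000
alphas_out, coefs, _ = lasso_path(X_demo, y_demo, alphas=alphas_demo)

# coefs has one row per feature and one column per alpha;
# the returned alphas are sorted from largest to smallest
n_nonzero = (coefs != 0).sum(axis=0)
for a, k in zip(alphas_out[::10], n_nonzero[::10]):
    print(f"alpha={a:9.2f}  non-zero coefficients: {k}")
```

For the largest alpha all coefficients are zero, and more and more features enter the model as alpha decreases.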

Let's visualize how the coefficients shrink with increasing alpha.

In [None]:
# wrangle the data into long format: one row per (alpha, variable) pair;
# alpha and rmse are kept as identifier columns, so no feature count is hard-coded
results_df_long = pd.melt(results_df, id_vars=['alpha', 'rmse'])

# create lineplot
sns.lineplot(data=results_df_long, x="alpha", y="value", hue="variable")
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('Standardized Coefficients')
plt.legend().remove()
plt.show()
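
Why do coefficients hit exactly zero instead of just getting small? In the special case of a single standardized feature, the LASSO solution has a closed form: the least squares coefficient is moved towards zero by alpha and set to zero once it would cross zero (so-called soft-thresholding). The sketch below checks this against scikit-learn on simulated data. The function name `soft_threshold` and the simulated data are assumptions made for illustration, and with many correlated features the picture is more complicated.

```python
import numpy as np
from sklearn.linear_model import Lasso


def soft_threshold(z, alpha):
    """Shrink z towards zero by alpha; return 0 if |z| <= alpha."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


rng = np.random.default_rng(42)
x = rng.normal(size=500)
x = (x - x.mean()) / x.std()               # standardized feature
y_demo = 3.0 * x + rng.normal(size=500)    # true coefficient is 3

# Least squares coefficient for a standardized feature and a centered target
z = x @ (y_demo - y_demo.mean()) / len(x)

alphas_demo = [0.5, 2.0, 5.0]
by_formula = [soft_threshold(z, a) for a in alphas_demo]
by_sklearn = [Lasso(alpha=a).fit(x.reshape(-1, 1), y_demo).coef_[0] for a in alphas_demo]

for a, f, s in zip(alphas_demo, by_formula, by_sklearn):
    print(f"alpha={a}: formula={f:.4f}, sklearn={s:.4f}")
```

Once alpha exceeds the absolute value of the least squares coefficient, the feature drops out of the model entirely.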

Similarly, we can plot the test set RMSE against alpha.

In [None]:
# results_df has exactly one row per alpha, so we plot it directly
sns.lineplot(data=results_df, x="alpha", y="rmse")
plt.xlabel('alpha')
plt.ylabel('RMSE')
plt.show()


## Hyperparameter tuning

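A more robust way to choose the alpha value that leads to the best out-of-sample predictive accuracy is to use k-fold cross-validation on the training set. With a single train/test split, as above, we might just have been lucky, and picking alpha by its test set RMSE lets information from the test set leak into the model choice.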

In [None]:
lasso_mod_cv = LassoCV(cv=5, alphas=alphas, random_state=42)
lasso_mod_cv.fit(X_train, y_train)

Which alpha value leads to the best out-of-sample predictive accuracy?

In [None]:
lasso_mod_cv.alpha_
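
To make the selection rule transparent, the sketch below reproduces it by hand on synthetic data (the data and the alpha grid are assumptions for illustration). `LassoCV` stores the validation MSE for every alpha and fold in `mse_path_`. The chosen `alpha_` is the alpha with the lowest MSE averaged over the folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=42)

alphas_demo = np.logspace(-2, 2, 30)  # illustrative grid
cv_demo = LassoCV(cv=5, alphas=alphas_demo).fit(X_demo, y_demo)

# mse_path_ has one row per alpha and one column per fold
mean_mse = cv_demo.mse_path_.mean(axis=1)
best_by_hand = cv_demo.alphas_[np.argmin(mean_mse)]

print("alpha chosen by LassoCV:", cv_demo.alpha_)
print("alpha with lowest mean CV MSE:", best_by_hand)
```

After the search, `LassoCV` refits the model on the full training data with the selected alpha, so the fitted object can also be used for prediction directly.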

Refit the model with the best alpha value.

In [None]:
lasso_mod_tuned = Lasso(alpha=lasso_mod_cv.alpha_)
lasso_mod_tuned.fit(X_train, y_train)

Evaluate this model on the training and test sets.

In [None]:
# Training data
pred_train = lasso_mod_tuned.predict(X_train)
r2_train = r2_score(y_train, pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))  # RMSE = square root of MSE
print('R2 on training set:', round(r2_train, 2))
print('RMSE on training set:', round(rmse_train, 2))

print("===")

# Test data
pred_test = lasso_mod_tuned.predict(X_test)
r2_test = r2_score(y_test, pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))  # RMSE = square root of MSE
print('R2 on test set:', round(r2_test, 2))
print('RMSE on test set:', round(rmse_test, 2))

## A secret weapon: LASSO with A LOT OF interaction terms

The automatic feature selection capability of LASSO can be used to identify important interaction terms. In the following, we will add all possible interaction terms to the feature matrix, use cross-validation to identify the best lambda/alpha value, and then refit and evaluate a final model.

Check dimensions of *X_train*.

In [None]:
X_train.shape

Use the *PolynomialFeatures* transformer to create all possible two-way combinations of features.

In [None]:
interact = PolynomialFeatures(interaction_only=True)

X_train_interact = interact.fit_transform(X_train)
X_train_interact = pd.DataFrame(X_train_interact, columns=interact.get_feature_names_out())

X_test_interact = interact.transform(X_test)
X_test_interact = pd.DataFrame(X_test_interact, columns=interact.get_feature_names_out())

Check dimensions of *X_train_interact*.

In [None]:
X_train_interact.shape

Let's have a look at the feature matrix.

In [None]:
X_train_interact.head()

As before, we now perform k-fold cross-validation to search for the best lambda/alpha value and then refit the model with this parameter. WARNING: The next cell might take a while to run (approx. 15 minutes).

In [None]:
alphas = np.linspace(0, 10000, 100)
lasso_mod_interact_cv = LassoCV(cv=2, alphas=alphas, random_state=42)
lasso_mod_interact_cv.fit(X_train_interact, y_train)

In [None]:
lasso_mod_interact_tuned = Lasso(alpha=lasso_mod_interact_cv.alpha_)
lasso_mod_interact_tuned.fit(X_train_interact, y_train)

Evaluate the tuned LASSO model with all possible two-way interactions on the training and test sets.

In [None]:
# Training data
pred_train = lasso_mod_interact_tuned.predict(X_train_interact)
r2_train = r2_score(y_train, pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))  # RMSE = square root of MSE
print('R2 on training set:', round(r2_train, 2))
print('RMSE on training set:', round(rmse_train, 2))

print("===")

# Test data
pred_test = lasso_mod_interact_tuned.predict(X_test_interact)
r2_test = r2_score(y_test, pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))  # RMSE = square root of MSE
print('R2 on test set:', round(r2_test, 2))
print('RMSE on test set:', round(rmse_test, 2))