# **Lasso Regression in Python Notes**

By Noah Rubin

May 2021

---

* Like ridge, lasso regression addresses the bias-variance tradeoff in machine learning: reducing either bias or variance tends to increase the other
* Lasso purposely introduces bias into the regression model in an effort to reduce the variance, which can then potentially lower the mean squared error of our estimator, since $$\text{MSE} = \text{Bias}^2 + \text{Variance}$$
* Even though, by the Gauss-Markov theorem, OLS has the lowest sampling variance of any linear unbiased estimator, a biased estimator such as the lasso may achieve a lower mean squared error
* In essence, lasso regression is useful because OLS might fit the training data well but may not generalise as well to out-of-sample data
* Lasso regression is also a tool to help reduce the impact of multicollinearity within our feature matrix, just as ridge is
* One major advantage that lasso has over ridge is that while ridge can only shrink coefficients towards zero, lasso can shrink coefficients all the way to zero by adding an L1 regularisation penalty to the OLS loss function. The loss function for lasso regression is defined as:

$$J(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j}\right)^2 + \lambda\sum_{j=1}^p |\beta_j|.$$

In matrix form, with the response and predictors centred so that the intercept $\beta_0$ drops out, this is

$$J(\vec{\beta}) = (\vec{y} - X\vec{\beta})^T(\vec{y} - X\vec{\beta}) + \lambda\|\vec{\beta}\|_1$$
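As a quick sanity check on the formula, the sketch below evaluates the lasso objective on a tiny made-up dataset. The function name `lasso_loss` and the numbers are illustrative assumptions, not part of the notebook; the intercept is left out of the penalty, as in the sum form above.

```python
import numpy as np

def lasso_loss(beta0, beta, X, y, lam):
    """Residual sum of squares plus lam times the L1 norm of the slopes.

    Hypothetical helper for illustration; the intercept beta0 is not penalised.
    """
    residuals = y - beta0 - X @ beta
    return residuals @ residuals + lam * np.sum(np.abs(beta))

# Tiny made-up dataset: 3 observations, 2 predictors
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([3.0, 2.0, 5.0])

# Residuals are [2, -1, 1.5], so RSS = 7.25; the penalty is 2 * (1 + 0.5) = 3
loss = lasso_loss(beta0=1.0, beta=np.array([1.0, -0.5]), X=X, y=y, lam=2.0)
print(loss)  # 10.25
```

Note that scikit-learn's `Lasso` minimises `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so under that scaling `alpha` corresponds to $\lambda / (2n)$ rather than to $\lambda$ itself.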

Because the penalty acts on the sum of the absolute values of the $\beta_j$ coefficients, certain coefficients can be shrunk all the way to zero, "...thus the lasso yields models that simultaneously use regularization to improve the model and to conduct feature selection." - Applied Predictive Modeling (By Max Kuhn and Kjell Johnson). In this sense, lasso further encourages parsimonious models through embedded feature selection.
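To see where the exact zeros come from, consider the simplest case: a single predictor with $\sum_i x_i^2 = 1$ and no intercept (an assumption made here purely for illustration). Minimising $\sum_i (y_i - \beta x_i)^2 + \lambda|\beta|$ gives the soft-thresholded OLS estimate $\text{sign}(b)\max(|b| - \lambda/2, 0)$, whereas the ridge penalty $\lambda\beta^2$ gives $b / (1 + \lambda)$. A minimal sketch with made-up OLS estimates:

```python
import numpy as np

def lasso_shrink(b_ols, lam):
    """Soft-thresholding: lasso solution for a unit-norm predictor."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def ridge_shrink(b_ols, lam):
    """Proportional shrinkage: ridge solution for a unit-norm predictor."""
    return b_ols / (1.0 + lam)

# Made-up OLS estimates, two large and two small
b_ols = np.array([3.0, -1.5, 0.4, -0.2])
lam = 1.0

lasso_coefs = lasso_shrink(b_ols, lam)
ridge_coefs = ridge_shrink(b_ols, lam)

print("lasso:", lasso_coefs)  # small estimates are set to exactly zero
print("ridge:", ridge_coefs)  # every estimate shrinks but stays non-zero
```

Any OLS estimate with $|b| \le \lambda/2$ is removed by the lasso, while ridge only rescales it.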

---

Both ridge and lasso can lessen the impact of multicollinearity, but they do so differently. In ridge regression, the coefficients of correlated predictors tend to be close to each other in value, while in lasso, one of the correlated predictors tends to stand out while the coefficients of the remaining correlated predictors shrink towards zero (or to exactly zero).
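The contrast can be sketched on an extreme case in which one predictor is an exact copy of another. Everything below is an illustrative assumption: the data are simulated, there is no intercept, and `lasso_cd` is a bare-bones cyclic coordinate descent written for this note rather than the solver the notebook uses. With exact copies the lasso solution is not unique, so which copy keeps the weight depends on the update order.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning 0 if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for RSS + lam * ||beta||_1 (no intercept)."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with predictor j's current contribution added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

def ridge_closed_form(X, y, lam):
    """Ridge solution (X'X + lam * I)^{-1} X'y (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1.copy()              # perfectly correlated with x1
x3 = rng.normal(size=n)     # independent predictor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(scale=0.5, size=n)

lam = 10.0
lasso_beta = lasso_cd(X, y, lam)
ridge_beta = ridge_closed_form(X, y, lam)

print("lasso:", np.round(lasso_beta, 3))  # weight lands on one of the copies
print("ridge:", np.round(ridge_beta, 3))  # weight is split evenly across copies
```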

In [1]:
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Personal display settings
#===========================

# Suppress scientific notation
np.set_printoptions(suppress=True)

# Get dataset values showing only 2dp
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_colwidth', None)

# For clear plots with a nice background
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6
%matplotlib inline

%load_ext autoreload
%autoreload 2

# python files
import data_prep
import helper_funcs

In [2]:
train = pd.read_csv('../datasets/train_updated.csv')
test = pd.read_csv('../datasets/test_updated.csv')

In [3]:
# Split data
to_drop = ['Country', 'HDI', 'Life_exp']

X_train = train.drop(to_drop, axis='columns')
X_test = test.drop(to_drop, axis='columns')

y_train = train['Life_exp']
y_test = test['Life_exp']

In [4]:
pipe = data_prep.create_pipeline(Lasso())
pipe

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('identity',
                                                                   FunctionTransformer())]),
                                                  ['GDP_cap']),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Status'])])),
                ('imputation', KNNImputer()), ('ss', StandardScaler()),
                ('model', Lasso())])

In [5]:
param_grid = {
    'imputation__n_neighbors': np.arange(3, 21, 2), 
    'imputation__weights': ['uniform', 'distance'],
    'model__alpha': np.linspace(0.01, 3, 15)  # sklearn calls it alpha instead of lambda
}

best_estimator, best_params = data_prep.exhaustive_search(
    X_train, y_train, pipe, param_grid, cv=10, scoring='neg_mean_squared_error'
)
final_model = best_estimator.fit(X_train, y_train)
print(f"Best parameters: {best_params}")

Best parameters: {'imputation__n_neighbors': 17, 'imputation__weights': 'distance', 'model__alpha': 0.01}
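The body of `data_prep.exhaustive_search` is not shown in these notes. A minimal sketch of what such a helper could look like, assuming it wraps scikit-learn's `GridSearchCV`, is given below; the name `exhaustive_search_sketch`, the synthetic data and the demo pipeline are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def exhaustive_search_sketch(X, y, pipe, param_grid, cv=10,
                             scoring='neg_mean_squared_error'):
    """Hypothetical stand-in: cross-validated search over every grid point."""
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring=scoring)
    search.fit(X, y)
    # With the default refit=True, best_estimator_ is already refitted on all of X, y
    return search.best_estimator_, search.best_params_

# Synthetic data, used only so that the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

demo_pipe = Pipeline(steps=[('ss', StandardScaler()), ('model', Lasso())])
demo_grid = {'model__alpha': np.logspace(-3, 1, 9)}  # log-spaced candidate alphas

demo_estimator, demo_params = exhaustive_search_sketch(X_demo, y_demo, demo_pipe, demo_grid, cv=5)
print(demo_params)
```

The demo grid is log-spaced because penalty strengths are often searched across orders of magnitude. In the run above the selected `model__alpha` of 0.01 is the smallest value in the grid, so a grid reaching below 0.01 may be worth trying.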


### Evaluate model on the test set

In [6]:
r2, mse, rmse, mae = helper_funcs.display_regression_metrics(y_test, final_model.predict(X_test))

R^2 = 0.9354627953926437
Mean Squared Error = 5.228546530170564
Root Mean Squared Error = 2.286601524133701
Mean Absolute Error = 1.7660982540681942
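The implementation of `helper_funcs.display_regression_metrics` is not shown here. Assuming it reports the standard definitions of these four metrics, they can be sketched in NumPy as follows; the function name and the four made-up observations are illustrative only.

```python
import numpy as np

def regression_metrics_sketch(y_true, y_pred):
    """Hypothetical stand-in returning (R^2, MSE, RMSE, MAE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return r2, mse, rmse, mae

# Made-up life expectancies and predictions
demo_r2, demo_mse, demo_rmse, demo_mae = regression_metrics_sketch(
    [70.0, 80.0, 75.0, 65.0], [72.0, 79.0, 74.0, 67.0]
)
print(demo_r2, demo_mse, demo_rmse, demo_mae)
```

Because RMSE is the square root of MSE, it is expressed in the same units as the target (years of life expectancy here), which makes it easier to interpret alongside MAE.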


### Save pipeline for future use

In [7]:
joblib.dump(final_model, './saved_models/Lasso Regression.joblib')

['./saved_models/Lasso Regression.joblib']

### Make a prediction

- Year 2050
- Infant Mortality of 6.99
- 32.56% of GDP is spent on health in 2050
- GDP per capita is 18,898
- Employment to population ratio (age 15+) is 30.08%
- Developing Country
- Average years of schooling is 9.66
- 85% of the population has access to electricity

In [9]:
saved_pipeline = joblib.load('./saved_models/Lasso Regression.joblib')
input_data = [2050, 6.99, 32.56, 18898, 30.08, 'Developing', 9.66, 85]

print(f"Predicted life expectancy (using lasso regression) = {helper_funcs.make_prediction(input_data, saved_pipeline, X_test)}")

Predicted life expectancy (using lasso regression) = 83.4312345456778
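`helper_funcs.make_prediction` is not shown in these notes. One plausible reason it takes `X_test` is to recover the column names and order the pipeline expects, since the `ColumnTransformer` selects columns such as `GDP_cap` and `Status` by name. A minimal sketch under that assumption, with a hypothetical stand-in model and made-up column names:

```python
import pandas as pd

def make_prediction_sketch(input_data, pipeline, reference_frame):
    """Hypothetical stand-in: wrap a raw list in a one-row DataFrame and predict."""
    # Reuse the reference frame's column names so name-based column selection works
    row = pd.DataFrame([input_data], columns=reference_frame.columns)
    return pipeline.predict(row)[0]

class SchoolingOnlyModel:
    """Toy object with a predict method, standing in for the saved pipeline."""
    def predict(self, frame):
        return (60.0 + 2.0 * frame['Schooling']).to_numpy()

# Empty frame that only carries the (made-up) column names and their order
reference = pd.DataFrame(columns=['Year', 'Status', 'Schooling'])

prediction = make_prediction_sketch([2050, 'Developing', 9.5], SchoolingOnlyModel(), reference)
print(prediction)  # 79.0
```

The values in `input_data` must follow the same order as the columns of the reference frame, otherwise they are silently assigned to the wrong features.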
