# Lasso Regression
* Lasso Regression shares the same assumptions as OLS, but it is particularly useful for feature selection and for high-dimensional data, because its L1 penalty can shrink some coefficients to exactly 0
    * Linearity: There should be a linear relationship between the features and the target
    * Independence of Errors: Errors (residuals) should be independent of each other. Patterns in a residual plot may suggest a lack of independence
    * Homoscedasticity: The variance of the errors should be constant. An example of heteroscedasticity (bad) is a cone shape in your residual plot
    * Normality of Errors: The errors should be normally distributed. Check the Q-Q plot: if the residuals are normally distributed, the points should fall approximately along a straight line
    * No Perfect Collinearity: No feature should be an exact linear combination of the other features
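
The residual checks listed above can be approximated numerically as well as visually. The following is a minimal sketch, not a definitive diagnostic: it fits on a small synthetic dataset whose sizes are chosen only so that it runs quickly, and the names `X_demo`, `demo_model`, `trend_corr`, `spread_ratio`, and `qq_r` are illustrative assumptions rather than part of this notebook.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Small synthetic problem (hypothetical sizes, chosen only to run quickly)
X_demo, y_demo = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

demo_model = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
fitted = demo_model.predict(X_demo)
residuals = y_demo - fitted

# Linearity / independence: residuals should show no strong trend against fitted values
trend_corr = np.corrcoef(fitted, residuals)[0, 1]

# Homoscedasticity: residual spread in the upper half of fitted values vs. the lower half
order = np.argsort(fitted)
half = len(order) // 2
spread_ratio = residuals[order[half:]].std() / residuals[order[:half]].std()

# Normality: probplot returns the Q-Q points and the correlation r of the fitted line
(theoretical_q, ordered_resid), (slope, intercept, qq_r) = stats.probplot(residuals, dist="norm")

print(f"Correlation of residuals with fitted values: {trend_corr:.3f}")
print(f"Residual spread ratio (upper half / lower half): {spread_ratio:.3f}")
print(f"Q-Q plot correlation: {qq_r:.4f}")
```

A spread ratio near 1 and a Q-Q correlation near 1 are consistent with the assumptions; these summaries complement the plots and do not replace them.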

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

In [2]:
X, y = make_regression(n_samples=10000, n_features = 500, noise = 10, random_state = 42)
feature_names = [f"feature {i}" for i in range(1, X.shape[1] + 1)]
X = pd.DataFrame(X, columns=feature_names)

In [3]:
# Split the data into training and test sets
# If you have access to a separate test set, treat this split as a validation set to check that your data satisfies
# the assumptions of linear regression. Then retrain on the entire dataset and make predictions on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [4]:
# Scale the data before regularization so that the penalty treats all features equally

# Import Standard Scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data
scaler.fit(X_train)

# Transform X_train and X_test
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [5]:
# Lasso Regression on the scaled data
model = LassoCV(cv = 5, random_state = 42)
model.fit(X_train_scaled, y_train)
print(f"Best Lasso alpha: {model.alpha_}")
print(f"Training R-Squared: {model.score(X_train_scaled, y_train):.4f}")
print(f"Testing R-Squared: {model.score(X_test_scaled, y_test):.4f}")
print(f"Mean Squared Error: {mean_squared_error(y_test, model.predict(X_test_scaled)):.4f}")

Best Lasso alpha: 0.2301807986514151
Training R-Squared: 0.9980
Testing R-Squared: 0.9979
Mean Squared Error: 99.6313
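
To see how an alpha like the one above is chosen, the fitted `LassoCV` object exposes the grid of candidate alphas (`alphas_`) and the per-fold validation error for each (`mse_path_`). The sketch below is an illustration under assumptions: it uses a smaller synthetic dataset so that it runs quickly, and `X_small`, `cv_model`, `mean_cv_mse`, and `best_index` are hypothetical names, not variables from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Smaller synthetic problem than the notebook's (hypothetical sizes, for speed)
X_small, y_small = make_regression(n_samples=1000, n_features=50, noise=10, random_state=42)
X_small_scaled = StandardScaler().fit_transform(X_small)

cv_model = LassoCV(cv=5, random_state=42).fit(X_small_scaled, y_small)

# mse_path_ has shape (n_alphas, n_folds); average over the folds for each alpha
mean_cv_mse = cv_model.mse_path_.mean(axis=1)
best_index = int(np.argmin(mean_cv_mse))

print(f"Alphas tried: {len(cv_model.alphas_)}")
print(f"Alpha with the lowest mean CV MSE: {cv_model.alphas_[best_index]:.4f}")
print(f"alpha_ reported by LassoCV: {cv_model.alpha_:.4f}")
```

The two printed alphas should agree, since `LassoCV` selects the alpha that minimizes the mean cross-validated error.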


In [6]:
# Check the coefficient for each feature and see which ones were not shrunk to 0
feature_coef = dict(zip(X.columns, model.coef_))
for feature, coef in feature_coef.items():
    if coef != 0:
        print(f"{feature}: {coef}")

feature 3: 0.22678032001195544
feature 11: 83.55604152529011
feature 13: 0.004163867856155918
feature 32: -0.06700188386600144
feature 57: -0.07283213918619315
feature 88: 0.04528011865349032
feature 110: 0.0440845295045865
feature 133: 93.14766564327005
feature 150: 0.025824777935323475
feature 163: 0.02778617063519812
feature 179: 62.569708169085885
feature 209: 0.03621703231099764
feature 240: 88.19968581895158
feature 247: 40.601506314871855
feature 299: -0.05123638207301383
feature 304: -0.07041683689634096
feature 305: 96.40956827729042
feature 338: 0.005680273015467948
feature 364: 33.326729197954286
feature 368: -0.03981530959174944
feature 376: 83.15722980251165
feature 413: -0.0007360393469938345
feature 435: 11.284603410570345
feature 458: -0.04975307444701059
feature 470: 63.407261940983695
feature 493: 0.040553106765340935
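
In the output above, a handful of coefficients are large while the rest are close to 0 without being exactly 0. Because the data is synthetic, one way to judge the selection is to ask `make_regression` for the generating coefficients with `coef=True` and compare them with what Lasso kept. This is a minimal sketch on a smaller dataset; the `threshold` cutoff of 1.0 and the names `true_support`, `selected`, and `strong` are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# coef=True additionally returns the coefficients used to generate y
X_demo, y_demo, true_coef = make_regression(
    n_samples=1000, n_features=50, n_informative=10, noise=10, coef=True, random_state=42
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)
demo_model = LassoCV(cv=5, random_state=42).fit(X_demo_scaled, y_demo)

# Indices of features that truly drive y vs. those Lasso left non-zero
true_support = set(np.flatnonzero(true_coef))
selected = set(np.flatnonzero(demo_model.coef_))

# Hypothetical cutoff: treat coefficients with a small magnitude as noise
threshold = 1.0
strong = set(np.flatnonzero(np.abs(demo_model.coef_) > threshold))

print(f"Truly informative features: {len(true_support)}")
print(f"Non-zero Lasso coefficients: {len(selected)}")
print(f"Coefficients above the threshold: {len(strong)}")
print(f"Informative features among the strong ones: {len(strong & true_support)}")
```

With real data the generating coefficients are unknown, so any magnitude cutoff is a judgment call that should be validated, for example on held-out data.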
