# Lasso Regression

Lasso regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses shrinkage. Shrinkage here means that the coefficient estimates are pulled toward zero. The lasso technique encourages simple, sparse models (i.e., models with fewer non-zero parameters). It is well suited to problems with many candidate features, including correlated ones, and to cases where you want to automate parts of model selection such as variable selection/parameter elimination. Note that among a group of highly correlated features, lasso tends to keep one and drop the others somewhat arbitrarily.

### Key Features of Lasso Regression:

1. **Regularization Term**: The key characteristic of Lasso Regression is that it adds an L1 penalty to the least-squares objective: the sum of the absolute values of the coefficients, scaled by a regularization parameter. The cost function for Lasso regression is:

   $$ \text{Minimize } \sum_{i=1}^{n} (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| $$

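   where $ n $ is the number of samples, $ p $ is the number of features, and $ \lambda \ge 0 $ is the regularization parameter. Larger values of $ \lambda $ shrink the coefficients more strongly, and $ \lambda = 0 $ recovers ordinary least squares.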

2. **Feature Selection**: One advantage of lasso regression over ridge regression is that it can produce sparse models: some coefficients become exactly zero, so the corresponding features drop out of the model. This automatic feature selection is a form of embedded method, because the selection happens during model fitting rather than as a separate step.

3. **Parameter Tuning**: The strength of the L1 penalty is determined by a hyperparameter, written as $ \lambda $ in the formula above and called `alpha` in Scikit-Learn. Selecting a good value for this parameter is crucial and is typically done using cross-validation.

4. **Bias-Variance Tradeoff**: Similar to ridge regression, lasso also manages the bias-variance tradeoff in model training. Increasing the regularization strength increases bias but decreases variance, potentially leading to better generalization on unseen data.

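5. **Scaling**: Before applying lasso, it is recommended to standardize the features (for example, to zero mean and unit variance). The penalty acts on coefficient magnitudes, and those depend on the units of each feature, so unscaled features are penalized unevenly.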
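
The feature-selection and scaling points above can be illustrated with a short sketch. It assumes synthetic data in which only 5 of 15 features carry signal; the `n_informative`, `noise`, and `alpha` values and all variable names are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 15 features, only 5 of which drive the target
# (n_informative=5 is an assumption made for illustration).
X, y = make_regression(n_samples=500, n_features=15, n_informative=5,
                       noise=1.0, random_state=0)

# Standardize first so the penalty treats every coefficient on the same scale.
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso_pipe.fit(X, y)
ridge_pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name.
lasso_coef = lasso_pipe.named_steps["lasso"].coef_
ridge_coef = ridge_pipe.named_steps["ridge"].coef_

# Count coefficients that are exactly zero in each model.
n_zero_lasso = int(np.sum(lasso_coef == 0.0))
n_zero_ridge = int(np.sum(ridge_coef == 0.0))
print("Coefficients set exactly to zero by Lasso:", n_zero_lasso)
print("Coefficients set exactly to zero by Ridge:", n_zero_ridge)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, which avoids leaking test-set statistics when the pipeline is later cross-validated.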

### Implementation in Scikit-Learn:

Lasso regression can be implemented using the `Lasso` class from Scikit-Learn's `linear_model` module. Here's a basic example:

In [1]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate some regression data
X, y = make_regression(n_samples=1000, n_features=15, noise=0.1, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Lasso and Ridge regression objects
lasso = Lasso(alpha=1.0)
ridge = Ridge(alpha=1.0)

# Fit both models on the training set
lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lasso = lasso.predict(X_test)
y_pred_ridge = ridge.predict(X_test)

# Evaluate both models with mean squared error (lower is better)
print("MSE of Lasso:", mean_squared_error(y_test, y_pred_lasso))
print("MSE of Ridge:", mean_squared_error(y_test, y_pred_ridge))

MSE of Lasso: 9.387744740461226
MSE of Ridge: 0.05090866185225964


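In this example, `alpha` is Scikit-Learn's name for the regularization strength: in `Lasso` it scales the L1 penalty, and in `Ridge` it scales the L2 penalty. The two values are not directly comparable, because Scikit-Learn's `Lasso` divides the squared-error term by twice the number of samples while `Ridge` does not, so `alpha=1.0` is a much stronger penalty for `Lasso` here. That likely explains why Lasso's test MSE is higher on this low-noise data. Fine-tuning `alpha` through techniques like cross-validation is a common practice to find the best model.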

In [2]:
# Fine tune alpha value using cv
from sklearn.model_selection import GridSearchCV
import numpy as np

# Create a Lasso regression object
lasso = Lasso()

# Create a dictionary for the grid search key and values
param_grid = {'alpha': np.arange(1, 10, 0.1)}

# Use grid search to find the best value for alpha
lasso_cv = GridSearchCV(lasso, param_grid, cv=10)

# Fit the model
lasso_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Lasso Regression Parameters: {}".format(lasso_cv.best_params_))
print("Best score is {}".format(lasso_cv.best_score_))

# Create a Ridge regression object
ridge = Ridge()

# Create a dictionary for the grid search key and values
param_grid = {'alpha': np.arange(1, 10, 0.1)}

# Use grid search to find the best value for alpha
ridge_cv = GridSearchCV(ridge, param_grid, cv=10)

# Fit the model
ridge_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Ridge Regression Parameters: {}".format(ridge_cv.best_params_))
print("Best score is {}".format(ridge_cv.best_score_))

Tuned Lasso Regression Parameters: {'alpha': 1.0}
Best score is 0.9995685234915115
Tuned Ridge Regression Parameters: {'alpha': 1.0}
Best score is 0.9999981195099323
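
For regressors, `GridSearchCV` reports the mean cross-validated $R^2$ by default, so the scores above are $R^2$ values. Both searches selected `alpha = 1.0`, the smallest value in the grid, which suggests that smaller values may be worth trying. The sketch below is one possible way to do that for lasso, using `LassoCV` with log-spaced candidates and fitting on the training split only. The candidate range, `cv=5`, and `max_iter` are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the cells above.
X, y = make_regression(n_samples=1000, n_features=15, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Log-spaced candidates from 0.001 to 10 (an illustrative range).
alphas = np.logspace(-3, 1, 50)

# LassoCV evaluates every candidate alpha with k-fold cross-validation
# and refits on the full training set with the best one.
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000)
lasso_cv.fit(X_train, y_train)

test_r2 = lasso_cv.score(X_test, y_test)
print("Selected alpha:", lasso_cv.alpha_)
print("Test R^2:", test_r2)
```

Keeping the test split out of the search gives a less optimistic estimate of how the tuned model generalizes.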


# Assignment Alert: What are L1 and L2 regularization?

L1 and L2 regularization are techniques used in machine learning to prevent overfitting and improve the generalization of models. They are regularization methods that add a penalty term to the objective function during the training process.

### L1 Regularization (Lasso):

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), adds the absolute values of the coefficients as a penalty to the objective function. The L1 regularization term is proportional to the sum of the absolute values of the coefficients:

$$ \text{L1 regularization term} = \lambda \sum_{j=1}^{p} |w_j| $$

- $ \lambda $: Regularization strength, a hyperparameter that controls the degree of regularization.
- $ w_j $: Coefficients of the model, one per feature ($ p $ features in total).

L1 regularization has a sparsity-inducing property, meaning it tends to force some of the coefficients to exactly zero. This makes it useful for feature selection, as it can lead to sparse models where only a subset of features contributes significantly.
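
One way to see where the exact zeros come from is the soft-thresholding operator. Under the simplifying assumption of a single coefficient with objective $ \tfrac{1}{2}(z - w)^2 + \lambda |w| $, where $ z $ is the unpenalized estimate, the minimizer is $ \operatorname{sign}(z)\max(|z| - \lambda, 0) $. The sketch below applies it to a few hypothetical values of `z`; the function name and the numbers are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - w)**2 + lam * |w| with respect to w."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unpenalized coefficient estimates.
z = np.array([3.0, -0.4, 0.9, -2.5, 0.1])
lam = 1.0

# Every estimate with |z| <= lam is mapped to zero;
# the rest move toward zero by exactly lam.
w_l1 = soft_threshold(z, lam)
print("Unpenalized:", z)
print("After L1   :", w_l1)
print("Non-zero coefficients:", np.count_nonzero(w_l1))
```

Any estimate whose magnitude is below the threshold is removed entirely, which is the mechanism behind lasso's feature selection.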

### L2 Regularization (Ridge):

L2 regularization, also known as Ridge regularization, adds the squared values of the coefficients as a penalty to the objective function. The L2 regularization term is proportional to the sum of the squared values of the coefficients:

$$ \text{L2 regularization term} = \lambda \sum_{j=1}^{p} w_j^2 $$

- $ \lambda $: Regularization strength, a hyperparameter that controls the degree of regularization.
- $ w_j $: Coefficients of the model, one per feature ($ p $ features in total).

L2 regularization does not enforce sparsity in the coefficients. Instead, it tends to shrink the coefficients toward zero, effectively reducing their impact on the model. L2 regularization is particularly effective when dealing with multicollinearity (highly correlated features).
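
The behaviour under multicollinearity can be sketched with the closed-form ridge solution $ w = (X^\top X + \lambda I)^{-1} X^\top y $, which minimizes $ \|y - Xw\|^2 + \lambda \|w\|^2 $ when no intercept is fitted. The data below are synthetic, with two nearly identical features, and the helper name `ridge_closed_form` and the value `lam=10.0` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly identical (highly correlated) synthetic features.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_ols = ridge_closed_form(X, y, lam=0.0)     # ordinary least squares
w_ridge = ridge_closed_form(X, y, lam=10.0)  # ridge

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
print("Coefficient norms (OLS, ridge):",
      np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

With correlated features the unpenalized fit can assign large offsetting weights to the two columns, whereas the ridge solution has a smaller norm and keeps both coefficients non-zero.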

### Key Differences:

1. **Effect on Coefficients:**
   - L1 regularization tends to produce sparse models with some coefficients exactly equal to zero.
   - L2 regularization tends to shrink coefficients toward zero but does not force them to be exactly zero.

2. **Feature Selection:**
   - L1 regularization is often used for feature selection because of its sparsity-inducing property.
   - L2 regularization may not lead to exact feature selection but is effective in handling multicollinearity.

3. **Objective Function:**
   - L1 regularization adds the absolute values of coefficients to the objective function.
   - L2 regularization adds the squared values of coefficients to the objective function.

4. **Geometric Interpretation:**
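   - L1 regularization corresponds to a diamond-shaped penalty region in the coefficient space. Its corners lie on the axes, so the constrained solution often lands where some coefficients are exactly zero.
   - L2 regularization corresponds to a circular penalty region (a sphere in higher dimensions), which has no corners and therefore rarely yields exact zeros.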

In practice, a combination of L1 and L2 regularization is sometimes used, leading to what is known as Elastic Net regularization. The choice between L1 and L2 regularization often depends on the specific characteristics of the data and the goals of the modeling task. Cross-validation is commonly used to find the optimal values for the regularization hyperparameters.
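
A minimal Elastic Net sketch, using Scikit-Learn's `ElasticNet` class on the same kind of synthetic data as the earlier cells. The values `alpha=0.1` and `l1_ratio=0.5` are illustrative and untuned; in practice both would be chosen by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Synthetic regression data, split into training and test sets.
X, y = make_regression(n_samples=1000, n_features=15, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# alpha sets the overall penalty strength; l1_ratio mixes the two penalties
# (1.0 is pure L1, values near 0.0 are mostly L2).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)

r2 = enet.score(X_test, y_test)
n_nonzero = int((enet.coef_ != 0).sum())
print("Elastic Net test R^2:", r2)
print("Non-zero coefficients:", n_nonzero)
```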

---