# L1 and L2 Regularization

## What is Regularization?

Regularization is any modification to a learning algorithm that is intended to reduce its error on unseen (test) data, commonly known as the generalization error, but not necessarily its error on the training data.

There are many regularization techniques, each addressing a different challenge. They can be grouped by the approach they take:

1. Some put extra constraints on what an ML model can learn, such as restricting the range or type of parameter values.
2. Some add extra terms to the objective or cost function, which act as a soft constraint on the parameter values. A careful choice of constraints and penalties in the cost function can substantially improve the model's performance, specifically on the test dataset.
3. Some encode prior information about the dataset or the problem statement as extra constraints or penalty terms.
4. Some combine multiple models into an ensemble, one of the most commonly used techniques, which takes into account the collective decision of models that are each trained on a different sample of the data.

The main aim of regularization is to reduce unnecessary complexity in a machine learning model so that it learns a simpler function that generalizes better.

### L1 Regularization

L1 regularization, also known as **Lasso Regularization** (Least Absolute Shrinkage and Selection Operator), is a technique used in machine learning to prevent overfitting and improve model generalization by adding a penalty term to the loss function. The penalty is proportional to the **absolute values** of the model's weights.

#### Objective Function with L1 Regularization
The modified loss function with L1 regularization is:

$$
\mathcal{L} = \text{Loss}(\hat{y}, y) + \lambda \sum_{i=1}^{n} |w_i|
$$

Where:
- $ \text{Loss}(\hat{y}, y) $: The original loss function (e.g., mean squared error or cross-entropy).
- $ \lambda $: The regularization strength (hyperparameter).
- $ w_i $: Model parameters (weights).
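As a minimal sketch of the formula above, assuming mean squared error as the base loss and a linear model $ \hat{y} = Xw $ (the function name `l1_penalized_loss` and the toy arrays are illustrative, not part of any library):

```python
import numpy as np

def l1_penalized_loss(w, X, y, lam):
    """Mean squared error plus lam times the sum of absolute weights."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    l1_penalty = lam * np.sum(np.abs(w))
    return mse + l1_penalty

# Toy values chosen only to make the arithmetic easy to follow.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])

loss_value = l1_penalized_loss(w, X, y, lam=0.1)
print(loss_value)
```

With these values the mean squared error is 1.625 and the penalty is 0.1 × (0.5 + 0.25) = 0.075, so the total is 1.7.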

#### Key Properties of L1 Regularization
1. **Feature Selection**:
   - L1 regularization drives some weights $ w_i $ to exactly **zero**.
   - This leads to sparse models, where only the most important features are retained, effectively performing feature selection.

2. **Overfitting Prevention**:
   - By penalizing large weights, the model is less likely to overfit the training data.

3. **Sparsity**:
   - The L1 norm encourages sparsity (many weights being zero), which can make the model easier to interpret and cheaper to evaluate.

4. **Hyperparameter Tuning**:
   - The value of $ \lambda $ controls the trade-off between fitting the data and applying the regularization penalty. A larger $ \lambda $ increases regularization strength, leading to a sparser model.
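
To see where the exact zeros come from, consider a one-weight sketch (a simplification, not the general case). Suppose the unregularized optimum is $ a $ and we minimise $ \frac{1}{2}(w - a)^2 + \lambda |w| $. The minimiser is the soft-thresholding rule:

$$
w^{\ast} = \text{sign}(a) \cdot \max(|a| - \lambda, 0)
$$

Whenever $ |a| \le \lambda $, the solution is exactly $ w^{\ast} = 0 $, and raising $ \lambda $ widens that interval, which is why a larger $ \lambda $ gives a sparser model. With a squared penalty $ \lambda w^2 $ in place of $ \lambda |w| $, the minimiser is $ a / (1 + 2\lambda) $, which shrinks toward zero but stays nonzero for any $ a \neq 0 $.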

#### Applications
- Feature selection in high-dimensional datasets.
- Regression problems (e.g., Lasso regression).
- Models requiring interpretability, where sparsity is valuable.

### L2 Regularization

L2 regularization, also known as **Ridge Regularization**, is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function. The penalty is proportional to the **squared magnitude** of the model's weights.

---

#### Objective Function with L2 Regularization
The modified loss function with L2 regularization is:

$$
\mathcal{L} = \text{Loss}(\hat{y}, y) + \lambda \sum_{i=1}^{n} w_i^2
$$

Where:
- $ \text{Loss}(\hat{y}, y) $: The original loss function (e.g., mean squared error or cross-entropy).
- $ \lambda $: The regularization strength (hyperparameter).
- $ w_i $: Model parameters (weights).
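
A minimal sketch of this objective, again assuming mean squared error and a linear model $ \hat{y} = Xw $. The helper names `l2_penalized_loss` and `l2_penalty_gradient` are illustrative. The second function shows that the penalty's gradient, $ 2 \lambda w_i $, is proportional to the weight itself, so large weights are pulled toward zero more strongly than small ones.

```python
import numpy as np

def l2_penalized_loss(w, X, y, lam):
    """Mean squared error plus lam times the sum of squared weights."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.sum(w ** 2)

def l2_penalty_gradient(w, lam):
    """Gradient of lam * sum(w_i^2) with respect to w."""
    return 2.0 * lam * w

# Toy values chosen only to make the arithmetic easy to follow.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])

loss_value = l2_penalized_loss(w, X, y, lam=0.1)
penalty_grad = l2_penalty_gradient(w, lam=0.1)
print(loss_value, penalty_grad)
```

Here the mean squared error is 1.625 and the penalty is 0.1 × (0.25 + 0.0625) = 0.03125, giving 1.65625; the penalty gradient is [0.1, -0.05].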

---

#### Key Properties of L2 Regularization
1. **Weight Shrinkage**:
   - L2 regularization penalizes large weights by shrinking them toward zero, but in general it does not set them to exactly zero.

2. **Overfitting Prevention**:
   - By discouraging large coefficients, the model generalizes better to unseen data.

3. **Smoothness**:
   - Unlike L1, which creates sparse models, L2 regularization ensures all features are included but with reduced impact.

4. **Mathematical Stability**:
   - L2 regularization is particularly useful when features are highly correlated or the dataset is ill-conditioned.

5. **Hyperparameter Tuning**:
   - The value of $ \lambda $ controls the trade-off between fitting the data and regularization. A larger $ \lambda $ increases regularization strength, making the model simpler but potentially underfitting the data.
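
The shrinkage and stability properties above can be sketched with the closed-form ridge solution for linear regression. This assumes a sum-of-squares loss $ \|Xw - y\|^2 + \lambda \|w\|^2 $ with no intercept; the function name `ridge_closed_form` and the synthetic, deliberately correlated data are illustrative only.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimiser of ||Xw - y||^2 + lam * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    # Adding lam to the diagonal is what improves the conditioning of X^T X.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data: x2 is almost a copy of x1, so the features are highly correlated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=100)

lams = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_closed_form(X, y, lam) for lam in lams]
norms = [float(np.linalg.norm(w)) for w in weights]

for lam, w, norm in zip(lams, weights, norms):
    print(f"lambda={lam:6.1f}  w={np.round(w, 3)}  ||w||={norm:.3f}")
```

The norm of the weight vector decreases as $ \lambda $ grows, while both weights stay nonzero: the penalty spreads the effect across the two correlated features instead of dropping one of them.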

---

#### Applications
- Stabilizing regression problems with multicollinearity (e.g., Ridge Regression).
- Improving generalization for complex models with numerous parameters.
- Controlling overfitting in neural networks.


In order to understand why we need any sort of regularization, let’s go through a quick example.

Let’s say we want to predict people’s height from a dataset with several predictors, such as weight, salary, ethnicity, and eyesight. Our linear regression equation would look like the one below.

$$
\text{height} = \beta_0 + \beta_1 \cdot \text{weight} + \beta_2 \cdot \text{salary} + \beta_3 \cdot \text{ethnicity} + \beta_4 \cdot \text{eyesight}
$$

We know that some of these predictors are irrelevant and provide no useful insight into someone’s height, but linear regression will still assign them nonzero coefficients whenever doing so lowers the loss function on the training data, in this case, let’s say, the RSS (Residual Sum of Squares). This leads to overfitting, which essentially means that even the noise within the dataset is being modelled.

Regularization combats this by adding a penalty term that weakens the coefficients of irrelevant predictors such as eyesight and salary (L2) or removes them from the model entirely by setting their coefficients to zero (L1).

<div style="display:flex;justify-content:center;">
<img src="images/overfitting.png" width=800 style="background:white;" />
</div>
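
The height example can be sketched on synthetic data. Everything below is an assumption made for illustration: the data is generated so that height depends on weight only, ethnicity is left out because a categorical predictor would first need encoding, and the L1 fit uses a small proximal gradient (ISTA) solver with the common $ \frac{1}{2n} $ scaling of the squared error rather than any particular library's implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry toward zero by t and set small entries to exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1 / 2n) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n_samples, n_features = X.shape
    step = n_samples / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(n_features)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n_samples
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic data (made up for illustration): height depends on weight only.
rng = np.random.default_rng(0)
n = 500
weight = rng.normal(70.0, 12.0, size=n)            # kg
salary = rng.normal(50_000.0, 15_000.0, size=n)    # irrelevant predictor
eyesight = rng.normal(0.0, 1.0, size=n)            # irrelevant predictor
height = 100.0 + 1.0 * weight + rng.normal(0.0, 5.0, size=n)  # cm

names = ["weight", "salary", "eyesight"]
X = np.column_stack([weight, salary, eyesight])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # put all predictors on one scale
y_centered = height - height.mean()            # centring removes the intercept

ols_w = np.linalg.lstsq(X_std, y_centered, rcond=None)[0]
lasso_w = lasso_ista(X_std, y_centered, lam=2.0)

for name, b_ols, b_l1 in zip(names, ols_w, lasso_w):
    print(f"{name:9s} OLS={b_ols: .4f}  L1={b_l1: .4f}")
```

On this synthetic data, ordinary least squares assigns small nonzero coefficients to salary and eyesight, while the L1-penalized fit sets them to exactly zero and keeps a (shrunken) coefficient for weight.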