# **Regularization**

**Regularization** is a technique for improving the generalization of machine learning models by adding a penalty term to the cost function. This helps prevent **overfitting**, where the model performs well on training data but poorly on unseen data.

In regression, regularization modifies the cost function by adding a term proportional to the magnitude of the model coefficients. The two most common types of regularization are **Lasso (L1)** and **Ridge (L2)**.

---

## **Why Use Regularization?**

- **Overfitting Prevention:** Reduces the complexity of the model by shrinking the coefficients, making it less likely to overfit the data.
- **Feature Selection:** In some regularization techniques (e.g., Lasso), certain coefficients are reduced to zero, effectively selecting only the most relevant features.
- **Multicollinearity Handling:** Helps stabilize the model when features are highly correlated.

---

## **Regularized Cost Functions**

In linear regression, the standard cost function is:
$$ J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 $$  
where $\hat{y}_i = \mathbf{w} \cdot \mathbf{x}_i + b$.
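
As a minimal sketch of the cost above, the mean squared error can be computed with NumPy as follows. The function name `mse_cost` and the toy data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def mse_cost(X, y, w, b):
    """Mean squared error J(w) for the linear model y_hat = X @ w + b."""
    y_hat = X @ w + b
    return np.mean((y - y_hat) ** 2)

# Toy data (hypothetical): 3 samples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 7.0])
w = np.array([1.0, 1.0])
b = 1.0

# Predictions are [4, 3, 5], residuals are [1, 1, 2], so J = (1 + 1 + 4) / 3.
cost = mse_cost(X, y, w, b)
print(cost)  # 2.0
```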

Regularization adds a penalty term to this cost function:

### 1. **Ridge Regression (L2 Regularization)**

**Penalty Term:** Sum of squared coefficients  
$$ J_{\text{ridge}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^p w_j^2 $$

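- $\lambda \ge 0$: Regularization parameter (controls the strength of regularization); $\lambda = 0$ recovers ordinary least squares.
- $w_j$: Coefficients of the model; the intercept $b$ is not part of the penalty sum.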

**Effect of Ridge:**
- Shrinks coefficients towards zero but does not make them exactly zero.
- Works well when all features are relevant.
- Helps handle multicollinearity by stabilizing coefficient estimates.
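
For the ridge cost as written above (with the $\frac{1}{n}$ factor on the squared error), setting the gradient to zero gives the closed-form system $(X^\top X + n\lambda I)\,\mathbf{w} = X^\top \mathbf{y}$ on centered data. The sketch below assumes that formulation. The helper name `ridge_fit` and the synthetic data are hypothetical, and library implementations may scale $\lambda$ differently.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) * ||y - Xw - b||^2 + lam * ||w||^2 in closed form.

    The intercept b is left unpenalized by centering X and y first.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Zero gradient condition: (Xc^T Xc + n * lam * I) w = Xc^T yc.
    w = np.linalg.solve(Xc.T @ Xc + n * lam * np.eye(p), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data (assumption): two nearly identical, highly correlated features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=100)])
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

w_ols, _ = ridge_fit(X, y, lam=0.0)    # no penalty: ordinary least squares
w_ridge, _ = ridge_fit(X, y, lam=0.1)  # penalized: smaller, more stable weights

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

With correlated features, the unpenalized fit can split the shared signal between the two columns in an unstable way, while the penalized fit always has a smaller coefficient norm.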

---

### 2. **Lasso Regression (L1 Regularization)**

**Penalty Term:** Sum of absolute values of coefficients  
$$ J_{\text{lasso}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^p |w_j| $$

- $\lambda$: Regularization parameter (controls the strength of regularization).
- $w_j$: Coefficients of the model.

**Effect of Lasso:**
- Shrinks some coefficients to exactly zero, effectively performing feature selection.
- Useful when only a subset of features is important.
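
The exact zeros come from the soft-thresholding step that appears when the lasso cost is minimized one coefficient at a time. The following is a minimal coordinate descent sketch under the cost written above, assuming centered data and no intercept. The function names, the fixed iteration count, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/n) * ||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = (2.0 / n) * (X[:, j] @ r_j)
            z_j = (2.0 / n) * (X[:, j] @ X[:, j])
            w[j] = soft_threshold(rho_j, lam) / z_j
    return w

# Synthetic data (assumption): only features 0 and 2 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_hat = lasso_coordinate_descent(X, y, lam=0.5)
print(w_hat)  # irrelevant features end up at exactly zero
```

The surviving coefficients are also shrunk toward zero, which is the bias the penalty introduces in exchange for sparsity.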

---

## **Key Differences Between Ridge and Lasso**

| Aspect               | Ridge (L2)                     | Lasso (L1)                    |
|----------------------|--------------------------------|-------------------------------|
| **Penalty Term**     | $ \sum w_j^2 $                | $ \sum \lvert w_j \rvert $  |
| **Coefficient Effect** | Shrinks coefficients towards zero | Can shrink coefficients to exactly zero |
| **Feature Selection**| No                            | Yes                           |
| **Use Case**         | When all features are relevant | When feature selection is needed |

---

## **Choosing $\lambda$ (Regularization Strength)**

- A higher $\lambda$ leads to stronger regularization, shrinking coefficients more aggressively.
- A lower $\lambda$ results in weaker regularization, making the model closer to standard linear regression.
- $\lambda$ is typically chosen using techniques like **cross-validation**.
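
A minimal sketch of choosing $\lambda$ by k-fold cross-validation is shown below, using a closed-form ridge fit as the model. The helper names (`ridge_fit`, `cv_mse`, `choose_lambda`), the candidate grid, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge for (1/n) * ||y - Xw - b||^2 + lam * ||w||^2."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + n * lam * np.eye(p), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cv_mse(X, y, lam, k=5, seed=0):
    """Average held-out MSE of ridge with strength lam over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[val] - (X[val] @ w + b)) ** 2))
    return float(np.mean(errors))

def choose_lambda(X, y, candidates, k=5):
    """Return the candidate with the lowest cross-validated MSE, plus all scores."""
    scores = {lam: cv_mse(X, y, lam, k) for lam in candidates}
    return min(scores, key=scores.get), scores

# Synthetic data (assumption): 20 features, only the first 3 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(size=60)

candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
best_lam, scores = choose_lambda(X, y, candidates)
print("best lambda:", best_lam)
```

Candidates are usually spaced on a logarithmic grid, since the useful range of $\lambda$ often spans several orders of magnitude.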

---

## **Regularization and Bias-Variance Tradeoff**

- Regularization typically increases bias but reduces variance; if $\lambda$ is too large, the added bias dominates and the model underfits.
- By adding a penalty to large coefficients, the model becomes simpler and less sensitive to noise in the training data.
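
The tradeoff can be observed in a small simulation: refit ridge on many noisy copies of the same dataset, then compare how far the average estimate sits from the true weights (bias) and how much the estimates scatter (variance). This is a sketch under assumed settings. The function names, true weights, noise level, and trial count are all hypothetical.

```python
import numpy as np

def ridge_no_intercept(X, y, lam):
    """Closed-form ridge for (1/n) * ||y - Xw||^2 + lam * ||w||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def bias_and_variance(lam, n_trials=200, seed=0):
    """Estimate the bias norm and total variance of ridge coefficients."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(30, 5))                   # fixed design matrix
    w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])  # assumed true weights
    estimates = []
    for _ in range(n_trials):
        y = X @ w_true + rng.normal(size=30)       # fresh noise each trial
        estimates.append(ridge_no_intercept(X, y, lam))
    estimates = np.array(estimates)
    bias = np.linalg.norm(estimates.mean(axis=0) - w_true)
    variance = estimates.var(axis=0).sum()
    return bias, variance

bias_0, var_0 = bias_and_variance(lam=0.0)  # unregularized
bias_1, var_1 = bias_and_variance(lam=1.0)  # strongly regularized

print(f"lambda=0: bias={bias_0:.3f}, variance={var_0:.3f}")
print(f"lambda=1: bias={bias_1:.3f}, variance={var_1:.3f}")
```

Because the same seed is used for both calls, the two settings see identical data and noise, so the differences are due to $\lambda$ alone.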

---

## **Combined Regularization: Elastic Net**

Elastic Net combines Ridge (L2) and Lasso (L1):
$$ J_{\text{elastic}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \alpha \lambda \sum_{j=1}^p |w_j| + (1-\alpha) \lambda \sum_{j=1}^p w_j^2 $$

- $\alpha \in [0, 1]$: Balances the contribution of the L1 and L2 penalties; $\alpha = 1$ recovers Lasso and $\alpha = 0$ recovers Ridge.
- Useful when dealing with correlated features and when some feature selection is desired.
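
The Elastic Net cost above can be evaluated directly, which makes the role of $\alpha$ easy to check at its two extremes. The function name `elastic_net_cost` and the toy values are assumptions for illustration. Libraries may parameterize the same penalty with different scaling factors.

```python
import numpy as np

def elastic_net_cost(X, y, w, b, lam, alpha):
    """MSE plus alpha * lam * ||w||_1 plus (1 - alpha) * lam * ||w||_2^2."""
    mse = np.mean((y - (X @ w + b)) ** 2)
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return mse + alpha * lam * l1 + (1.0 - alpha) * lam * l2

# Toy data (hypothetical), built so the model fits exactly and the MSE term is 0.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
w = np.array([1.0, -2.0])
b = 0.0
y = X @ w + b

# Here sum|w_j| = 3 and sum w_j^2 = 5, so only the penalty contributes.
lasso_like = elastic_net_cost(X, y, w, b, lam=0.5, alpha=1.0)  # 0.5 * 3
ridge_like = elastic_net_cost(X, y, w, b, lam=0.5, alpha=0.0)  # 0.5 * 5
mixed = elastic_net_cost(X, y, w, b, lam=0.5, alpha=0.5)       # 0.25 * 3 + 0.25 * 5
print(lasso_like, ridge_like, mixed)  # 1.5 2.5 2.0
```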

---

## **Summary**

- **Ridge (L2):** Penalizes the square of coefficients, reduces their magnitude, and helps when all features are important.
- **Lasso (L1):** Penalizes the absolute value of coefficients, can set some coefficients to zero, and performs feature selection.
- **Elastic Net:** Combines the strengths of Ridge and Lasso for flexibility in handling different types of data.

Used with a well-chosen $\lambda$, regularization helps the model stay robust, easier to interpret, and better able to generalize to new data.
