The model may overfit the training data. That is where **regularization** helps.

---

### Regularization Penalty

We add a penalty to the loss function to discourage overly large coefficients.  
Two common techniques are:

---

#### 1. Ridge Regression (L2 Regularization)

Ridge regression adds a penalty term to the cost function. The penalty is the sum of the squared coefficients, scaled by $\lambda$:

$$
\lambda \sum_{j=1}^p \beta_j^2
$$

where $\lambda$ is the **regularization parameter**.

The cost function becomes:

$$
J(\beta_0, \beta_1, \dots, \beta_p) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^p \beta_j^2
$$
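As a minimal sketch of the ridge cost above, the function below evaluates the mean squared error plus the L2 penalty for a given set of coefficients. The function name `ridge_cost`, the toy data, and the choice to leave the intercept $\beta_0$ unpenalized (following the penalty sum starting at $j=1$) are assumptions made for illustration.

```python
def ridge_cost(X, y, beta0, beta, lam):
    """Mean squared error plus the L2 penalty lam * sum(beta_j ** 2).

    X is a list of rows (p feature values per observation), y is the list
    of targets, beta0 is the intercept (not penalized here), and beta is
    the list of p slope coefficients.
    """
    n = len(y)
    # Predictions: y_hat_i = beta0 + sum_j beta_j * x_ij
    y_hat = [beta0 + sum(b * x for b, x in zip(beta, row)) for row in X]
    mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n
    penalty = lam * sum(b ** 2 for b in beta)
    return mse + penalty


# Toy data, made up for illustration only.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 7.0]
beta0, beta = 1.0, [1.0, 2.0]

print(ridge_cost(X, y, beta0, beta, lam=0.0))  # 1.0 (plain MSE)
print(ridge_cost(X, y, beta0, beta, lam=0.5))  # 3.5 (MSE + 0.5 * (1 + 4))
```

With $\lambda = 0$ the cost reduces to the ordinary mean squared error. Increasing $\lambda$ raises the cost of the same coefficients, which is what pushes the optimizer toward smaller values.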

---

#### 2. Lasso Regression (L1 Regularization)

Lasso regression also adds a penalty term to the cost function. Here the penalty is the sum of the absolute values of the coefficients, scaled by $\lambda$:

$$
\lambda \sum_{j=1}^p |\beta_j|
$$

where $\lambda$ is the **regularization parameter**.

The cost function becomes:

$$
J(\beta_0, \beta_1, \dots, \beta_p) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^p |\beta_j|
$$
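The lasso cost can be sketched the same way. The only change from a squared penalty is that each coefficient contributes its absolute value, so a negative coefficient is penalized as much as a positive one of the same size. The name `lasso_cost` and the toy data are illustrative assumptions, and the intercept is again left unpenalized.

```python
def lasso_cost(X, y, beta0, beta, lam):
    """Mean squared error plus the L1 penalty lam * sum(|beta_j|).

    X is a list of rows (p feature values per observation), y is the list
    of targets, beta0 is the intercept (not penalized here), and beta is
    the list of p slope coefficients.
    """
    n = len(y)
    # Predictions: y_hat_i = beta0 + sum_j beta_j * x_ij
    y_hat = [beta0 + sum(b * x for b, x in zip(beta, row)) for row in X]
    mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n
    penalty = lam * sum(abs(b) for b in beta)
    return mse + penalty


# Toy data, made up for illustration only.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 0.0, 1.0]
beta0, beta = 1.0, [-1.0, 2.0]

print(lasso_cost(X, y, beta0, beta, lam=0.0))  # 1.0 (plain MSE)
print(lasso_cost(X, y, beta0, beta, lam=0.5))  # 2.5 (MSE + 0.5 * (1 + 2))
```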

---

### Choosing the Regularization Parameter

- **Small $\lambda$**:  
  The penalty is small, resulting in a model similar to **Ordinary Least Squares**.  
  The coefficients can grow large, which may lead to overfitting.  
  *Example Range:* $[10^{-6}, \, 10^{-3}]$

- **Large $\lambda$**:  
  The penalty is high, shrinking the coefficients more aggressively.  
  This reduces overfitting but may **underfit** the data.  
  *Example Range:* $[10, \, 1000]$

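- **Optimal $\lambda$**:  
  Balances the trade-off between **bias** and **variance**, leading to a model that generalizes well.  
  In practice it is usually selected by cross-validation over a grid of candidate values. All ranges listed here are only illustrative, since suitable values depend on the data and on how the features are scaled.  
  *Example Range:* $[10^{-2}, \, 10]$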
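One way to see these regimes is to fit the model on a training split for each $\lambda$ on a log-spaced grid and compare the error on held-out data. The sketch below is a simplified illustration under stated assumptions: a single feature with no intercept (so the ridge minimizer has the closed form $\beta = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda)$), synthetic data, and a single validation split rather than full cross-validation. The helper names `fit_ridge_1d` and `mse` are hypothetical.

```python
import random


def fit_ridge_1d(x, y, lam):
    """Closed-form minimizer of (1/n) * sum((y_i - b * x_i)**2) + lam * b**2."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + n * lam)


def mse(x, y, b):
    """Mean squared error of the one-coefficient model y_hat = b * x."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(40)]
y = [2.0 * xi + rng.gauss(0, 0.5) for xi in x]

# Hold out the last 10 points for validation.
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

# Log-spaced grid covering the small, intermediate, and large ranges.
grid = [10.0 ** k for k in range(-6, 4)]

results = []
for lam in grid:
    b = fit_ridge_1d(x_train, y_train, lam)
    results.append((lam, b, mse(x_val, y_val, b)))

# Pick the lambda with the lowest validation error.
best_lam, best_b, best_err = min(results, key=lambda r: r[2])

for lam, b, err in results:
    print(f"lambda={lam:g}  beta={b:.4f}  validation MSE={err:.4f}")
print("selected lambda:", best_lam)
```

The printed coefficient shrinks toward zero as $\lambda$ grows. For very small $\lambda$ it is close to the least-squares estimate, and for very large $\lambda$ it is pushed so close to zero that the model underfits. With more features or less data, averaging the validation error over several folds generally gives a more stable choice than a single split.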
