# Ridge Regression (L2 Regularization)



* Earlier, we derived the **equation of the best-fit line**:
  $$
  h_\theta(x) = \theta_0 + \theta_1 x
  $$
  where $x$ is a single independent feature.

* For multiple features, the equation becomes:
  $$
  h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
  $$

* We also saw the **cost function** for linear regression, which is the **Mean Squared Error (MSE)** scaled by $\frac{1}{2}$ (the extra factor only simplifies the gradient):
  $$
  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big(h_\theta(x^{(i)}) - y^{(i)} \Big)^2
  $$

Our objective is to minimize this cost function using **Gradient Descent** to reach the **global minimum**.
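
As a quick check of the formulas above, here is a minimal sketch of the hypothesis and the cost function in plain Python. The function names `predict` and `mse_cost` and the toy data are illustrative assumptions, not part of the original notes.

```python
def predict(theta, x):
    """h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n for one sample x."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals over the m samples."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Hypothetical toy data generated by y = 1 + 2*x (a single feature).
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]

cost_exact = mse_cost([1.0, 2.0], X, y)  # parameters that generated the data
cost_off = mse_cost([0.0, 2.0], X, y)    # intercept off by 1

print(cost_exact, cost_off)
```

With the generating parameters every residual is zero, so the cost is zero; shifting the intercept by 1 makes every residual equal to $-1$ and the cost becomes $\frac{3}{2 \cdot 3} = 0.5$.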

---

## Problem of Overfitting in Linear Regression

Consider a dataset with very few points. If we apply linear regression, the model can fit those points perfectly (for example, a straight line passes exactly through any two points with different $x$ values), making the error **zero** on the training data.

* **Training Accuracy:** Very high (close to $100\%$)
* **Training Error:** Very low (close to $0$)
* **Test Error:** High (poor performance on unseen data)

This is the classic case of **overfitting**.

* **Bias (on training data):** Low
* **Variance (on test data):** High
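
The situation described above can be reproduced with a small sketch. It assumes hypothetical data: an underlying trend $y = x$, two noisy training points, and three clean test points. The helper names are illustrative only.

```python
def fit_line_through_two_points(p, q):
    """Return (theta0, theta1) of the line passing exactly through p and q."""
    (x1, y1), (x2, y2) = p, q
    theta1 = (y2 - y1) / (x2 - x1)
    theta0 = y1 - theta1 * x1
    return theta0, theta1


def cost(theta0, theta1, points):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(points)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in points) / (2 * m)


# Hypothetical data: the true trend is y = x, but the two training
# points carry noise, so the fitted slope is far too flat.
train = [(1.0, 1.5), (2.0, 1.6)]
test = [(3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]

theta0, theta1 = fit_line_through_two_points(*train)
train_cost = cost(theta0, theta1, train)
test_cost = cost(theta0, theta1, test)

print(f"theta0={theta0:.2f}, theta1={theta1:.2f}")
print(f"train cost={train_cost:.4f}, test cost={test_cost:.4f}")
```

The line reproduces both training points (training cost is zero up to floating-point rounding), yet its slope of about $0.1$ is far from the true trend, so the cost on the test points is large.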

---

## Ridge Regression (L2 Regularization)

To address **overfitting**, we introduce **Ridge Regression**.

* Ridge Regression is also called **L2 Regularization**.
* It modifies the cost function by adding a **penalty term** proportional to the sum of the squared coefficients.

### Ridge Regression Cost Function

The new cost function is:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big(h_\theta(x^{(i)}) - y^{(i)} \Big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2
$$

where:

* $\lambda \geq 0$ is the **regularization parameter (hyperparameter)**.
* $\sum_{j=1}^{n} \theta_j^2$ is the penalty term (squared coefficients).
* Note: $\theta_0$ (bias term) is **not penalized**.
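
A minimal sketch of this cost function in plain Python follows. The name `ridge_cost` and the parameter name `lam` (used because `lambda` is a reserved word in Python) are assumptions made for illustration, as is the toy data.

```python
def ridge_cost(theta, X, y, lam):
    """Ridge cost: 1/(2m) * sum of squared residuals + lam * sum_{j>=1} theta_j^2."""
    m = len(y)
    residuals = [
        theta[0] + sum(t * xj for t, xj in zip(theta[1:], xi)) - yi
        for xi, yi in zip(X, y)
    ]
    data_term = sum(r ** 2 for r in residuals) / (2 * m)
    # theta[0] (the bias term) is deliberately excluded from the penalty.
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return data_term + penalty


# Hypothetical toy data generated by y = 1 + 2*x.
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]
theta = [1.0, 2.0]  # fits the data exactly, so the data term is zero

cost_no_penalty = ridge_cost(theta, X, y, lam=0.0)
cost_with_penalty = ridge_cost(theta, X, y, lam=0.1)

print(cost_no_penalty, cost_with_penalty)
```

Because the data term is zero here, the whole cost for $\lambda = 0.1$ comes from the penalty: $0.1 \cdot 2^2 = 0.4$. The intercept $\theta_0 = 1$ contributes nothing to it.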

---

## Intuition Behind Ridge Regression

1. **Without Regularization ($\lambda = 0$):**

   * Cost function reduces to standard linear regression.
   * High risk of overfitting.

2. **With Regularization ($\lambda > 0$):**

   * Large coefficients ($\theta_j$) are penalized.
   * Model is forced to keep coefficients small.
   * Prevents the model from fitting noise in the data.
   * Reduces **variance** (better generalization).
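
To see why the penalty keeps coefficients small, it helps to write out one gradient descent step on the ridge cost. The sketch below assumes a learning rate $\alpha$ (a symbol not used elsewhere in these notes) and writes $x_j^{(i)}$ for feature $j$ of sample $i$. For $j \geq 1$:

$$
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \Big(h_\theta(x^{(i)}) - y^{(i)} \Big) x_j^{(i)} + 2\lambda \theta_j
$$

$$
\theta_j := \theta_j (1 - 2\alpha\lambda) - \frac{\alpha}{m} \sum_{i=1}^{m} \Big(h_\theta(x^{(i)}) - y^{(i)} \Big) x_j^{(i)}
$$

Under these assumptions, each step first scales $\theta_j$ by the factor $(1 - 2\alpha\lambda)$, which is below $1$ whenever $\lambda > 0$ (and stays positive as long as $2\alpha\lambda < 1$), and then applies the usual linear regression update. Since $\theta_0$ is not penalized, its update has no such shrink factor.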

---

## Relationship Between $\lambda$ and $\theta$ (Slope)

* As $\lambda$ increases:

  * Coefficients ($\theta_j$) decrease.
  * Model becomes simpler.
  * Overfitting is reduced.

* But if $\lambda$ is too large:

  * Coefficients shrink too much.
  * Model underfits the data.

**Key Relationship:**
$$
\lambda \uparrow \quad \Rightarrow \quad |\theta_j| \downarrow
$$
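
The relationship can be checked numerically. The sketch below assumes the single-feature case, where setting the derivatives of the ridge cost to zero gives a closed-form solution (stated in the docstring). The function name `ridge_fit_1d` and the data are hypothetical.

```python
def ridge_fit_1d(xs, ys, lam):
    """Minimize 1/(2m) * sum (theta0 + theta1*x - y)^2 + lam * theta1^2.

    Setting both partial derivatives to zero gives
        theta1 = Sxy / (Sxx + 2*m*lam),  theta0 = mean(y) - theta1 * mean(x),
    where Sxy and Sxx are the centered sums of products and squares.
    """
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + 2 * m * lam)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Hypothetical data with a roughly linear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
slopes = [ridge_fit_1d(xs, ys, lam)[1] for lam in lambdas]

for lam, slope in zip(lambdas, slopes):
    print(f"lambda={lam:>6}: theta_1={slope:.4f}")
```

On this data the slope decreases steadily as $\lambda$ grows and approaches zero for very large $\lambda$, which is the underfitting regime described above.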

---

## Example with Multiple Features

Suppose we have a multiple linear regression model:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3
$$

Assume the coefficients fitted without regularization are:

$$
\theta_0 = 0.3, \quad \theta_1 = 0.52, \quad \theta_2 = 0.48, \quad \theta_3 = 0.24
$$

After applying Ridge Regression with some $\lambda$:

$$
\theta_0 = 0.3, \quad \theta_1 = 0.40, \quad \theta_2 = 0.38, \quad \theta_3 = 0.14
$$

Notice:

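* All penalized coefficients ($\theta_1, \theta_2, \theta_3$) shrink, but **none become zero**; $\theta_0$ stays at $0.3$ because it is not penalized.
* In relative terms, the feature with the **least impact** ($x_3$) is reduced the most: $\theta_3$ drops by about $42\%$ ($0.24 \to 0.14$), versus about $23\%$ for $\theta_1$ and $21\%$ for $\theta_2$.
* This limits how much weakly informative features can influence the predictions.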
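
The arithmetic behind this example can be verified with a short sketch. It only uses the coefficient values given above; the variable and function names are illustrative.

```python
# Coefficients from the example above (theta_0 is unpenalized and omitted).
before = {"theta_1": 0.52, "theta_2": 0.48, "theta_3": 0.24}
after = {"theta_1": 0.40, "theta_2": 0.38, "theta_3": 0.14}


def l2_penalty_sum(coeffs):
    """Sum of squared coefficients, i.e. the penalty term without lambda."""
    return sum(v ** 2 for v in coeffs.values())


# Fraction of its original value that each coefficient loses.
relative_drop = {k: (before[k] - after[k]) / before[k] for k in before}

penalty_before = l2_penalty_sum(before)
penalty_after = l2_penalty_sum(after)

for name, drop in relative_drop.items():
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} ({drop:.0%} smaller)")
print(f"sum of squares: {penalty_before:.4f} -> {penalty_after:.4f}")
```

The sum of squared coefficients falls from $0.5584$ to $0.3240$, and the smallest coefficient, $\theta_3$, loses the largest share of its value.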

---

## Important Notes

* Ridge Regression **does not eliminate features**: for any finite $\lambda$, coefficients generally don’t become exactly zero.
* It only **shrinks them** to reduce their impact.
* This makes Ridge Regression different from **Lasso Regression**, which can set coefficients exactly to zero.
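
One way to see why Ridge shrinks a coefficient without driving it to zero is the single-feature case, where setting the derivatives of the ridge cost to zero gives a closed form. This is a sketch under the assumption of one feature and an unpenalized $\theta_0$; $\bar{x}$ and $\bar{y}$ denote the sample means (notation introduced here):

$$
\theta_1 = \frac{\sum_{i=1}^{m} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{m} (x^{(i)} - \bar{x})^2 + 2m\lambda}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}
$$

Increasing $\lambda$ enlarges the denominator, so $\theta_1$ moves toward zero, but for any finite $\lambda$ it is exactly zero only if the numerator is already zero.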

---

## Summary

* **Linear Regression** may overfit, especially with high-dimensional data.
* **Ridge Regression** introduces an $L2$ penalty to reduce overfitting.
* Cost Function:
  $$
  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big(h_\theta(x^{(i)}) - y^{(i)} \Big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2
  $$
* **Effect of $\lambda$:**

  * Small $\lambda \to$ similar to linear regression (risk of overfitting).
  * Large $\lambda \to$ stronger penalty, smaller coefficients (risk of underfitting).

Next, we’ll discuss **Lasso Regression (L1 Regularization)** and compare it with Ridge.


![alt text](image.png)