
## Ridge regression
Linear regression + a penalty on large coefficients (the squared L2 norm of the weights) → gives a more stable, less overfit model.
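
Written in the same notation as the Elastic Net cost later in these notes, the ridge objective can be sketched as follows. Here $\lambda$ is the penalty strength, and the intercept is conventionally left out of the penalty:

$$
J(w) = \text{MSE} + \lambda \sum w_j^2
$$

With $\lambda = 0$ this reduces to plain linear regression; a larger $\lambda$ pulls the weights closer to zero.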

### ✅ Use Ridge Regression when:

1. **You have multicollinearity** (features are correlated).
2. **Your model is overfitting** (low training error, high test error).
3. **You want to keep all features** but just reduce their influence (shrink, not remove).
4. **You expect coefficients to be small but not exactly zero** (like in polynomial regression or high-dimensional data).

### Small example

### 📊 Data

Suppose we have just 3 points:

| x | y |
| - | - |
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |

This is a **perfect line**: $y = 2x$.

---

### 🔹 Plain Linear Regression

* Model can fit exactly:

  $$
  y = 2x
  $$
* **Training cost = 0** (no error).
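* With three noise-free points there is nothing to overfit here. On real, noisy data (say a later point $x=4, y=9$ instead of 8), an unpenalized fit can chase the noise, and its coefficients can swing widely from one sample to the next.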

---

### 🔹 With Ridge Regression

* Ridge says: “Fit the line, but don’t let coefficients get too extreme.”
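* So instead of exactly $m = 2, b = 0$, Ridge shrinks the slope a little. The intercept is usually not penalized, so it adjusts to keep the line passing through the mean point $(2, 4)$, giving something like:

  $$
  y = 1.9x + 0.2
  $$
* **Training cost > 0** (a little error).
* On noisy or unseen data it often performs **better**, because smaller weights are less sensitive to noise in the training set.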
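
The comparison above can be reproduced with scikit-learn. This is a minimal sketch: the value `alpha=0.1` is an arbitrary choice for illustration, and a different `alpha` gives a different amount of shrinkage.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# The three points from the table: they lie exactly on y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)  # alpha chosen only for illustration

ols_slope, ols_intercept = ols.coef_[0], ols.intercept_
ridge_slope, ridge_intercept = ridge.coef_[0], ridge.intercept_

# Training mean squared error of each fit.
ols_mse = np.mean((ols.predict(X) - y) ** 2)
ridge_mse = np.mean((ridge.predict(X) - y) ** 2)

# For a single feature, the ridge slope is Sxy / (Sxx + alpha) on centered data,
# and the unpenalized intercept keeps the line through the mean point (2, 4).
print(f"plain: y = {ols_slope:.3f}x + {ols_intercept:.3f}, training MSE = {ols_mse:.4f}")
print(f"ridge: y = {ridge_slope:.3f}x + {ridge_intercept:.3f}, training MSE = {ridge_mse:.4f}")
```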

---

✅ **Takeaway:**

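* **Plain regression**: fits the training data as closely as possible (here cost = 0), with a risk of overfitting when the data are noisy.
* **Ridge regression**: accepts a tiny training error in exchange for smaller, more stable weights, which often improves **generalization**.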

## Lasso regression 
Linear Regression + L1 penalty → makes the model simpler by shrinking some feature weights to zero.
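
In the same notation as the Elastic Net cost later in these notes, the Lasso objective can be sketched as:

$$
J(w) = \text{MSE} + \lambda \sum |w_j|
$$

The absolute value has a corner at $w_j = 0$, which is why the minimum can sit at exactly zero for some weights. A squared penalty only makes them small.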


### ✅ Use Lasso Regression when:

1. You have **many features** but only a few are truly important.
2. You want the model to **automatically remove irrelevant features** (by setting their weights to zero).
3. You care about a **simpler, more interpretable model**.

## In short:
**Use Lasso when you want prediction + feature selection at the same time.**


## 📊 Example Data

Suppose we want to predict a score (`y`) using 3 features:

| x1 | x2 | x3 | y  |
| -- | -- | -- | -- |
| 1  | 2  | 0  | 10 |
| 2  | 4  | 0  | 20 |
| 3  | 6  | 0  | 30 |
| 4  | 8  | 0  | 40 |

Here:

* `x1` and `x2` are useful (they explain y). Note that x2 = 2 × x1, so they carry the same information.
* `x3` is **useless** (always 0).

---

### 🔹 Plain Linear Regression

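Plain least squares has no mechanism for pushing a weight to exactly zero, so every feature tends to keep **some weight**. With a realistic irrelevant feature (one that is noisy rather than constant like the toy `x3`), the fit typically looks like:

* $y \approx 0.2x1 + 4.9x2 + 0.05x3$
* Notice `x3` got a tiny weight (not exactly zero).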

---

### 🔹 Lasso Regression

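Because of the L1 penalty, Lasso will **shrink the weights of useless features to exactly zero**:

* $y \approx 0x1 + 4.9x2 + 0x3$
* 👉 `x3` dropped out completely. Since `x1` and `x2` carry the same information here, Lasso also zeroes one of them and keeps a single, slightly shrunk weight.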
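
A minimal scikit-learn sketch of this example follows. The value `alpha=0.5` is an arbitrary choice for illustration. Because `x2` is exactly twice `x1` in this table, Lasso should zero out one of the two redundant features as well as `x3`.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Rows of the table above: columns are x1, x2, x3.
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 0.0],
])
y = np.array([10.0, 20.0, 30.0, 40.0])
names = ["x1", "x2", "x3"]

lasso = Lasso(alpha=0.5).fit(X, y)  # alpha chosen only for illustration

for name, weight in zip(names, lasso.coef_):
    print(f"{name}: {weight:.3f}")

# Features whose weight is exactly zero have been dropped from the model.
dropped = [name for name, weight in zip(names, lasso.coef_) if weight == 0.0]
print("dropped:", dropped)
```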

---

### ✅ Takeaway

* **Plain regression**: keeps all features, even useless ones.
* **Lasso regression**: keeps only important features, sets useless ones to **0**.

## 🔹 What is Elastic Net Regression?

Elastic Net = **Linear Regression with both Ridge (L2) and Lasso (L1) penalties**.

$$
J(w) = \text{MSE} + \lambda_1 \sum |w_j| + \lambda_2 \sum w_j^2
$$

* **Lasso part (L1):** can set useless features to 0 (feature selection).
* **Ridge part (L2):** shrinks large weights to keep them stable when features are correlated.

👉 It combines the strengths of both.
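
For reference, scikit-learn's `ElasticNet` expresses the same idea with an overall strength $\alpha$ (the `alpha` parameter) and a mixing ratio $\rho$ (the `l1_ratio` parameter) instead of two separate $\lambda$ values. Its documented objective is:

$$
\frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \sum |w_j| + \frac{\alpha (1 - \rho)}{2} \sum w_j^2
$$

Up to a constant factor from how the error term is scaled, $\lambda_1$ plays the role of $\alpha \rho$ and $\lambda_2$ the role of $\alpha (1 - \rho) / 2$. Setting $\rho = 1$ gives pure Lasso.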

---

## 🔹 When to Use Elastic Net

✅ Use **Elastic Net** when:

1. You have **many features**, some irrelevant, some correlated.
2. You want **feature selection** (like Lasso) but also **stability with correlated features** (like Ridge).
3. You’re not sure whether Ridge or Lasso alone is best → Elastic Net is a safer middle ground.

---

## 🔹 Simple Example

### Data

| x1 | x2 | x3 | y  |
| -- | -- | -- | -- |
| 1  | 2  | 0  | 10 |
| 2  | 4  | 0  | 20 |
| 3  | 6  | 0  | 30 |
| 4  | 8  | 0  | 40 |

* `x1` and `x2` are correlated (x2 = 2 × x1).
* `x3` is useless (always 0).

---

### 🔹 Plain Linear Regression

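* Because x2 = 2 × x1, any weights with $w_1 + 2 w_2 = 10$ fit the data equally well. The solution is not unique, so the reported weights can be unstable, e.g.:
  $y = 0 \cdot x1 + 5 \cdot x2 + 0.1 \cdot x3$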

---

### 🔹 Lasso

* Drops `x3` completely, but keeps only one of `x1` and `x2` (since they are correlated). Which one survives can be arbitrary and may flip with small changes in the data.
  $y = 0 \cdot x1 + 4.9 \cdot x2 + 0 \cdot x3$

---

### 🔹 Ridge

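* Keeps both `x1` and `x2` and splits the weight between them in a stable way (here in a 1:2 ratio, because x2 = 2 × x1). The penalty never sets a weight to exactly zero, so an irrelevant feature that is noisy rather than constant keeps a small weight, shown here as 0.05.
  $y = 1.9 \cdot x1 + 3.8 \cdot x2 + 0.05 \cdot x3$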

---

### 🔹 Elastic Net

* Drops the useless feature (`x3`, like Lasso).
* Shares the weight between the correlated features instead of discarding one of them (`x1`, `x2`, like Ridge).
  $y = 1.5 \cdot x1 + 4.0 \cdot x2 + 0 \cdot x3$
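
To check the four behaviours on the table above, the sketch below fits each scikit-learn estimator on it. The `alpha` and `l1_ratio` values are arbitrary choices for illustration, not tuned. Two caveats about this toy data: the `x3` column is constant, so every model (including plain regression and Ridge) should report a weight of 0 for it, and because x2 = 2 × x1 exactly, the plain least-squares split between `x1` and `x2` is simply whatever the solver returns. The contrast to look for is how each model divides the weight between `x1` and `x2`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Rows of the table above: columns are x1, x2, x3.
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 0.0],
])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Penalty settings are illustrative assumptions, not recommendations.
models = {
    "plain": LinearRegression(),
    "lasso": Lasso(alpha=0.5),
    "ridge": Ridge(alpha=1.0),
    "elastic net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    # One row per model: weights for x1, x2, x3.
    print(f"{name:12s}", np.round(model.coef_, 3))
```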

---

✅ **Takeaway:**

* Lasso → feature selection but unstable with correlated features.
* Ridge → stable with correlated features but keeps everything.
* **Elastic Net → best of both: drops useless features, keeps stability with correlated ones.**

# Ridge(), RidgeCV(), Lasso(), LassoCV() in scikit-learn

### 1. **Ridge()**

* Think of it as **regular linear regression** with an added L2 penalty that discourages the model from giving too much weight to any single feature.
* You **set a strength (`alpha`)** yourself.
* Helps prevent the model from **memorizing the data** (overfitting).

---

### 2. **Lasso()**

* Like Ridge, but with an L1 penalty instead of L2: it can actually **ignore unimportant features** by setting their weight to exactly zero.
* You **choose the strength (`alpha`)**.
* Useful if you want the model to focus on the **most important features only**.
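
The effect of a hand-picked `alpha` can be sketched on synthetic data. The data set, the seed, and the list of `alpha` values below are all assumptions made up for illustration. Both estimators take the penalty strength through the `alpha` argument.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 5 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

alphas = [0.01, 0.1, 1.0, 100.0]

# Ridge: a larger alpha shrinks the whole weight vector toward zero.
ridge_norms = [float(np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)) for a in alphas]

# Lasso: a large enough alpha sets weights to exactly zero.
lasso_zero_counts = [int(np.sum(Lasso(alpha=a).fit(X, y).coef_ == 0.0)) for a in alphas]

for a, norm, zeros in zip(alphas, ridge_norms, lasso_zero_counts):
    print(f"alpha={a:<6} ridge |w| = {norm:.3f}   lasso zero weights = {zeros} of 5")
```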

---

### 3. **RidgeCV()**

* Same as Ridge, but it **picks the best alpha for you** from a list of candidate values (the `alphas` argument).
* It does this by scoring each candidate with **cross-validation** (an efficient leave-one-out scheme by default).
* Saves you from guessing the right alpha by hand.

---

### 4. **LassoCV()**

* Same as Lasso, but **automatically finds the best alpha** using cross-validation.
* Shrinks unimportant features to zero **and** chooses the best strength for the penalty.
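
A minimal sketch of the two cross-validated estimators follows. The synthetic data and the candidate list passed to `RidgeCV` are assumptions for illustration. After fitting, the chosen strength is available as the `alpha_` attribute.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data (an assumption for illustration): 5 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# RidgeCV picks the best alpha from the candidates it is given.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X, y)

# LassoCV builds its own grid of alphas and scores each with 5-fold cross-validation.
lasso_cv = LassoCV(cv=5).fit(X, y)

print("RidgeCV chose alpha =", ridge_cv.alpha_)
print("LassoCV chose alpha =", round(float(lasso_cv.alpha_), 4))
print("LassoCV weights:", np.round(lasso_cv.coef_, 3))
```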

---

✅ **In short:**

| Method    | Manual or Auto alpha? | Feature Selection? | What it does                                                      |
| --------- | --------------------- | ------------------ | ----------------------------------------------------------------- |
| Ridge()   | Manual                | No                 | Shrinks coefficients toward zero to reduce overfitting            |
| Lasso()   | Manual                | Yes                | Shrinks some coefficients to zero, keeps only important features  |
| RidgeCV() | Automatic             | No                 | Shrinks coefficients + chooses best alpha                         |
| LassoCV() | Automatic             | Yes                | Shrinks coefficients, drops unimportant ones + chooses best alpha |