# Gradient Boosting Regressor with Three Weak Learners

### Tiny Example (3 weak learners)

**Data:**

* (Size=1 → Price=100)
* (Size=2 → Price=200)
* (Size=3 → Price=300)

---

#### Step 1: First weak learner (Tree #1)

Predicts the **average price (200)** for every input. The error is the true price minus the prediction.

* Size=1 → Predict 200 (error = -100)
* Size=2 → Predict 200 (error = 0)
* Size=3 → Predict 200 (error = +100)

---

#### Step 2: Second weak learner (Tree #2)

Learns to **predict the errors** from step 1:

* If Size=1 → predict -100
* If Size=2 → predict 0
* If Size=3 → predict +100

Add these corrections to Tree #1's predictions (this example uses a learning rate of 1, so each correction is added in full):

* Size=1 → 200 + (-100) = 100 ✅
* Size=2 → 200 + (0) = 200 ✅
* Size=3 → 200 + (+100) = 300 ✅

The predictions now match the training prices exactly.

---

#### Step 3: Third weak learner (Tree #3)

All residuals are now 0, so Tree #3 has nothing to correct and predicts 0 for every input.

---

✅ **Final Model = Tree #1 + Tree #2**
Predictions exactly match the true prices.
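
The three steps above can be replayed in a few lines of plain Python. This is only a sketch: the variable names are made up for illustration, and the second "tree" is replaced by a lookup table that memorizes one correction per size, which is an assumption standing in for a real decision tree.

```python
# Toy data from the example: house size -> price.
sizes = [1, 2, 3]
prices = [100, 200, 300]

# Learner #1: a constant model that predicts the mean price.
baseline = sum(prices) / len(prices)
predictions = [baseline for _ in sizes]

# Residual = true value - current prediction.
residuals = [y - p for y, p in zip(prices, predictions)]

# Learner #2 (stand-in for a small tree): one stored correction per size.
correction = dict(zip(sizes, residuals))

# Add the corrections in full (learning rate = 1).
predictions = [p + correction[x] for x, p in zip(sizes, predictions)]

# Learner #3 would be trained on these, and they are all zero.
remaining = [y - p for y, p in zip(prices, predictions)]

print(residuals)    # [-100.0, 0.0, 100.0]
print(predictions)  # [100.0, 200.0, 300.0]
print(remaining)    # [0.0, 0.0, 0.0]
```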

# 🌱 Gradient Boosting Regressor – Key Terms

### 1. **Weak Learner**

* A small, simple model (usually a shallow decision tree).
* On its own, it’s not very accurate.

---

### 2. **Ensemble**

* The final model is a **collection of many weak trees**.
* Each new tree improves on the mistakes of the ones before it.

---

### 3. **Loss Function**

* A way to measure prediction error.
* Common ones for regression:

  * **Mean Squared Error (MSE):** the average of the squared errors; penalizes big errors more heavily.
  * **Mean Absolute Error (MAE):** the average absolute distance between prediction and true value; less sensitive to outliers.
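
A small sketch may help show how the two losses differ. The function names and the numbers below are invented for illustration: both sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```python
def mean_squared_error(y_true, y_pred):
    # Average of squared differences.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    # Average of absolute differences.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100, 200]
pred_a = [110, 210]  # two medium misses (10 and 10)
pred_b = [100, 220]  # one perfect guess, one big miss (0 and 20)

print(mean_absolute_error(y_true, pred_a))  # 10.0
print(mean_absolute_error(y_true, pred_b))  # 10.0
print(mean_squared_error(y_true, pred_a))   # 100.0
print(mean_squared_error(y_true, pred_b))   # 200.0
```

MAE rates both sets of predictions the same, while MSE doubles for the set containing the single big miss.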

---

### 4. **Residuals (Errors)**

* The difference between the true value and the model’s prediction:

  $$
  \text{Residual} = y - \hat{y}
  $$
* Each new tree is trained to predict these residuals (the mistakes).

---

### 5. **Gradient**

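* A more general way of saying “direction of error”: the negative gradient of the loss with respect to the current prediction. For squared error it equals the residual (up to a constant factor); for other loss functions it differs.
* The new tree learns to follow this direction to reduce mistakes.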
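
To make the link between gradients and residuals concrete, the sketch below writes out the negative gradient for two losses. The function names are illustrative, and the squared loss is written with a factor of 1/2 (an assumed convention) so that its negative gradient is exactly the residual.

```python
def negative_gradient_squared(y, y_hat):
    # Loss: 0.5 * (y - y_hat)**2
    # -dLoss/dy_hat = y - y_hat, i.e. the plain residual.
    return y - y_hat


def negative_gradient_absolute(y, y_hat):
    # Loss: |y - y_hat|
    # -dLoss/dy_hat = sign(y - y_hat): only the direction, not the size.
    diff = y - y_hat
    return (diff > 0) - (diff < 0)


# True price 300, current prediction 200.
print(negative_gradient_squared(300, 200))   # 100
print(negative_gradient_absolute(300, 200))  # 1

# True price 100, current prediction 200.
print(negative_gradient_squared(100, 200))   # -100
print(negative_gradient_absolute(100, 200))  # -1
```

With squared error the next tree is trained on the residuals themselves; with absolute error it is trained only on their signs.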

---

### 6. **Learning Rate**

* A small multiplier that controls how much each new tree affects the model.
* A small learning rate means slower progress (more trees are needed), but the final model usually generalizes better.
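
The effect of the multiplier can be seen on the three-house data from the tiny example. In this sketch, `boost` is a hypothetical helper, and each round assumes an idealized learner that predicts the current residuals exactly, so the only thing slowing progress is the learning rate.

```python
def boost(prices, learning_rate, n_rounds):
    # Start from the mean, as in the tiny example.
    mean = sum(prices) / len(prices)
    predictions = [mean] * len(prices)
    for _ in range(n_rounds):
        # Idealized weak learner: it predicts the residuals exactly.
        residuals = [y - p for y, p in zip(prices, predictions)]
        # Only a fraction of each correction is applied.
        predictions = [p + learning_rate * r
                       for p, r in zip(predictions, residuals)]
    return predictions


prices = [100, 200, 300]
print(boost(prices, learning_rate=1.0, n_rounds=1))  # [100.0, 200.0, 300.0]
print(boost(prices, learning_rate=0.5, n_rounds=1))  # [150.0, 200.0, 250.0]
print(boost(prices, learning_rate=0.5, n_rounds=3))  # [112.5, 200.0, 287.5]
```

With a learning rate of 0.5 the remaining error halves each round, so several rounds are needed to get close to what a learning rate of 1 reaches in one.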

---

### 7. **Number of Estimators**

* The number of trees added.
* More trees fit the training data better, but too many increase the risk of overfitting.

---

### 8. **Tree Depth**

* Controls how complex each weak tree is.
* Shallow trees (depth=1–3) are common, because they focus on small corrections.

---

### 9. **Subsampling**

* Instead of using all the data for every tree, each tree is trained on a random part of it.
* This makes the model more robust and less likely to overfit.
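
One simple way to picture subsampling is to draw a fresh random set of row indices before fitting each tree. The helper name `subsample_indices`, the fraction of 0.5, and the seed are all assumptions chosen for this sketch.

```python
import random


def subsample_indices(n_rows, fraction, rng):
    # Pick a random subset of row indices, without replacement.
    k = max(1, int(n_rows * fraction))
    return sorted(rng.sample(range(n_rows), k))


rng = random.Random(0)  # fixed seed so repeated runs draw the same rows
for round_number in range(3):
    rows = subsample_indices(n_rows=10, fraction=0.5, rng=rng)
    # A real implementation would fit this round's tree on these rows only.
    print(round_number, rows)
```

Each round sees a different half of the data, so no single tree can tailor itself to every training row.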

---

### 10. **Regularization**

* Ways to keep the model simpler and prevent overfitting:

  * Limit tree depth.
  * Use fewer features per tree.
  * Use a small learning rate.

---

### 11. **Additive Model**

* Gradient boosting builds the model step by step:

  $$
  \text{New prediction} = \text{Old prediction} + \text{Small correction}
  $$
* Over many steps, predictions get closer to the true values.
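
Putting the terms together, the additive update can be sketched as a short training loop. This is a minimal illustration under stated assumptions, not a reference implementation: it handles a single numeric feature, uses squared error (so each stump is fit to the residuals), and `fit_stump` and `fit_gradient_boosting` are names invented for this example.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (a stump) on a single feature.

    Assumes xs contains at least two distinct values.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds sit halfway between neighbouring feature values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Squared error left over if each side predicts its own mean.
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    initial = sum(ys) / len(ys)  # first prediction: the mean
    stumps = []
    predictions = [initial] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # weak learner fit to the mistakes
        stumps.append(stump)
        # New prediction = old prediction + small correction.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return initial + learning_rate * sum(s(x) for s in stumps)

    return predict


xs, ys = [1, 2, 3], [100, 200, 300]
for n in (1, 10, 200):
    model = fit_gradient_boosting(xs, ys, n_estimators=n, learning_rate=0.1)
    worst = max(abs(y - model(x)) for x, y in zip(xs, ys))
    print(n, round(worst, 2))
```

On this toy data the printed worst-case training error should shrink as more stumps are added, which is the "closer over many steps" behaviour described above.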