# **Gradient Boosting classifier using 3 weak learners**

We want to classify fruits as **Apple (0)** or **Orange (1)** based on **size**.

**Training data:**

| Size | Label      |
| ---- | ---------- |
| 1    | Apple (0)  |
| 2    | Apple (0)  |
| 3    | Orange (1) |

---

### Step 1: First weak learner (Tree #1)

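* Starts from a constant prediction: the log-odds of the positive class. With 1 Orange out of 3 fruits, p = 1/3 and the log-odds are ln(1/2) ≈ -0.69, so after thresholding at 0.5 it predicts the **majority class** → Apple (0) for everything.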

| Size | True Label | Pred (Tree #1) | Error     |
| ---- | ---------- | -------------- | --------- |
| 1    | 0          | 0              | ✅ correct |
| 2    | 0          | 0              | ✅ correct |
| 3    | 1          | 0              | ❌ mistake |

So, only Size=3 is misclassified.
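
The starting point can be reproduced in a few lines. This is a minimal sketch, assuming the first learner is a constant log-odds score (the usual choice for log-loss) and that a probability of 0.5 or more means Orange; the variable names are illustrative.

```python
import math

labels = [0, 0, 1]  # Apple, Apple, Orange

# Constant starting score: log-odds of the positive class (Orange).
p_orange = sum(labels) / len(labels)          # 1/3
f0 = math.log(p_orange / (1 - p_orange))      # log(0.5), about -0.693

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Every fruit gets the same probability, so every fruit gets the same class.
probs = [sigmoid(f0) for _ in labels]
preds = [1 if p >= 0.5 else 0 for p in probs]

# Residuals (label minus probability) are what the next tree will be fitted to.
residuals = [y - p for y, p in zip(labels, probs)]

print(preds)                             # [0, 0, 0]
print([round(r, 3) for r in residuals])  # [-0.333, -0.333, 0.667]
```

The largest residual belongs to Size=3, the one misclassified fruit.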

---

### Step 2: Second weak learner (Tree #2)

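This tree is fitted to the **residual errors** of Tree #1: the difference between each true label and the predicted probability (here -1/3, -1/3, and +2/3), not only the misclassified points.

* Learns: “If Size=3 → push the score toward Orange (1); otherwise push it toward Apple (0).”

Now combine Tree #1 and Tree #2: add Tree #2's output, scaled by the learning rate, to Tree #1's log-odds, then apply the sigmoid (softmax is the multiclass analogue). Assuming a learning rate large enough to flip Size=3 in one step (for example 1):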

* Size=1 → Still Apple ✅
* Size=2 → Still Apple ✅
* Size=3 → Corrected to Orange ✅
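
The combination step can be checked by hand. The sketch below rests on several assumptions that are not in the text above: Tree #2 is a stump that splits at size < 2.5, each leaf value is one Newton step for log-loss, and the learning rate is 1.0.

```python
import math

sizes = [1, 2, 3]
labels = [0, 0, 1]

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Tree #1: constant log-odds of Orange, log((1/3) / (2/3)).
f0 = math.log(0.5)
scores = [f0 for _ in sizes]
probs = [sigmoid(s) for s in scores]

# Residuals that Tree #2 is fitted to.
residuals = [y - p for y, p in zip(labels, probs)]

def leaf_value(indices):
    # One Newton step for log-loss: sum(residual) / sum(p * (1 - p)).
    num = sum(residuals[i] for i in indices)
    den = sum(probs[i] * (1 - probs[i]) for i in indices)
    return num / den

# Tree #2 as a stump with an assumed split at size < 2.5.
left = [i for i, s in enumerate(sizes) if s < 2.5]
right = [i for i, s in enumerate(sizes) if s >= 2.5]
left_value, right_value = leaf_value(left), leaf_value(right)  # about -1.5 and 3.0

learning_rate = 1.0  # assumed; smaller values need more trees to flip Size=3
new_scores = [
    scores[i] + learning_rate * (left_value if i in left else right_value)
    for i in range(len(sizes))
]
new_preds = [1 if sigmoid(s) >= 0.5 else 0 for s in new_scores]
print(new_preds)  # [0, 0, 1]
```

After the update, the log-odds for Size=3 are positive (about 2.3), so its probability of being Orange rises above 0.5.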

---

### Step 3: Third weak learner (Tree #3)

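Since every class prediction is already correct, this tree doesn’t change any label. The residuals are small but not exactly zero, so it only pushes the predicted probabilities a little closer to 0 and 1.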

---

✅ **Final Model = Tree #1 + Tree #2 (+ Tree #3, which changes no labels)**

* Size=1 → Apple
* Size=2 → Apple
* Size=3 → Orange

---

🔑 **Difference vs Regression case:**

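* In regression, weak learners fit **numerical residuals** (true value minus prediction).
* In classification, weak learners are still regression trees, but they fit **pseudo-residuals** (true label minus predicted probability), and updates are made to the log-odds, which are then converted to probabilities.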

# 🌱 Gradient Boosting Classifier – Key Terms

### 1. **Weak Learner**

* A very simple model, usually a small decision tree (a tree with a single split is called a *stump*).
* On its own, it’s “weak” (not very accurate).

---

### 2. **Ensemble**

* The final Gradient Boosting model is not just one tree.
* It’s a **team of many weak trees**, each one correcting the errors left by the trees before it.

---

### 3. **Loss Function**

* A way to measure how wrong the model is.
* For classification, usually **log-loss** (penalizes confident wrong predictions most heavily).
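
To make the loss concrete, here is a small sketch of binary log-loss in plain Python. The function name, the `eps` clipping value, and the example probabilities are illustrative choices, not taken from any particular library.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary log-loss; probs are predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [0, 0, 1]  # Apple, Apple, Orange

hesitant = log_loss(labels, [0.4, 0.4, 0.6])         # right, but unsure
confident_right = log_loss(labels, [0.1, 0.1, 0.9])  # right and sure
confident_wrong = log_loss(labels, [0.1, 0.1, 0.1])  # sure, but wrong on the Orange

print(confident_right < hesitant < confident_wrong)  # True
```

The loss rewards confident correct probabilities and grows quickly when the model is confident about the wrong class.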

---

### 4. **Residuals (Errors)**

* After making predictions, we measure how far each predicted probability is from the true label.
* These differences (the negative gradients of the loss) are the targets the next tree is fitted to.

---

### 5. **Gradient**

* Tells us the **direction to move** each prediction to reduce the loss.
* Each new tree is fitted to the negative gradient, so adding it moves the predictions in that direction.
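
For binary log-loss the gradient has a simple closed form. The symbols here are assumptions introduced for illustration: $y \in \{0, 1\}$ is the true label, $F$ is the model's score (log-odds), and $p$ is the predicted probability of class 1.

$$
L(y, F) = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right], \qquad p = \frac{1}{1 + e^{-F}}
$$

$$
\frac{\partial L}{\partial F} = p - y \quad\Longrightarrow\quad -\frac{\partial L}{\partial F} = y - p
$$

So the negative gradient is just "true label minus predicted probability", which is why it is often called a residual.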

---

### 6. **Learning Rate**

* A small step size that controls how much each new tree contributes.
* A smaller learning rate needs more trees to reach the same fit, but often generalizes better.
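
The effect of the step size can be seen on the three-fruit example. This sketch counts how many correction trees are added before Size=3 is classified as Orange. It assumes a fixed stump split at size < 2.5 and Newton-step leaf values; `trees_needed` is a hypothetical helper, not a library function.

```python
import math

SIZES = [1, 2, 3]
LABELS = [0, 0, 1]

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def trees_needed(learning_rate, max_trees=1000):
    """Count the correction trees added before Size=3 is classified as Orange."""
    scores = [math.log(0.5)] * len(SIZES)  # constant starting log-odds
    for n_trees in range(1, max_trees + 1):
        probs = [sigmoid(s) for s in scores]
        for group in ([0, 1], [2]):  # fixed stump: size < 2.5 vs size >= 2.5
            num = sum(LABELS[i] - probs[i] for i in group)
            den = sum(probs[i] * (1 - probs[i]) for i in group)
            for i in group:
                scores[i] += learning_rate * num / den
        if sigmoid(scores[2]) >= 0.5:
            return n_trees
    return None

for rate in (1.0, 0.1, 0.01):
    print(rate, trees_needed(rate))
```

On this toy data a rate of 1.0 needs one correction tree and 0.1 needs three; 0.01 needs many more. The smaller steps are slower but less likely to overshoot on noisy data.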

---

### 7. **Number of Estimators**

* The number of trees we add to the model.
* More trees = closer fit to the training data, but too many = risk of overfitting.

---

### 8. **Tree Depth**

* How deep each tree is (how many splits it can make).
* Shallow trees (like depth=1 or 2) are common, because boosting works best with simple learners.

---

### 9. **Subsampling**

* Instead of using all the data for each tree, we randomly take part of it.
* Helps prevent overfitting and adds variety.
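
Row subsampling can be sketched as drawing a fresh random subset of row indices for every tree. The dataset size, the fraction of 0.5, and the helper name `subsample_indices` are hypothetical; sampling is done without replacement, which is an assumption about the variant being described.

```python
import random

def subsample_indices(n_rows, fraction, rng):
    """Pick a random subset of row indices, without replacement, for one tree."""
    k = max(1, int(round(fraction * n_rows)))
    return sorted(rng.sample(range(n_rows), k))

rng = random.Random(0)      # fixed seed so the sketch is repeatable
n_rows, fraction = 10, 0.5  # hypothetical dataset size and subsample fraction

for tree_number in range(1, 4):
    rows = subsample_indices(n_rows, fraction, rng)
    # A real implementation would fit this tree's residuals on `rows` only.
    print(f"Tree #{tree_number} trains on rows {rows}")
```

Because each tree sees a different half of the rows, the trees are less alike, which is where the added variety comes from.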

---

### 10. **Regularization**

* Tricks to stop the model from memorizing the data (overfitting).
* Examples:

  * Keep trees small.
  * Use fewer features per tree.
  * Use a smaller learning rate.

---

### 11. **Additive Model**

* Gradient boosting adds trees **one by one**, each improving the previous result:

  $$
  \text{New model} = \text{Old model} + \text{Small correction}
  $$
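
Putting the pieces together, the additive update can be written as a short loop. This is a minimal sketch for the fruit example only. It assumes a fixed stump split at size < 2.5, Newton-step leaf values, and a learning rate of 1.0, and it counts the constant starting model as Tree #1; `fit_stump` and `fit_boosting` are illustrative names.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def fit_stump(sizes, residuals, probs, threshold=2.5):
    """Return a correction function h(size) using a fixed, assumed split."""
    def newton_leaf(rows):
        num = sum(residuals[i] for i in rows)
        den = sum(probs[i] * (1 - probs[i]) for i in rows)
        return num / den
    left = newton_leaf([i for i, s in enumerate(sizes) if s < threshold])
    right = newton_leaf([i for i, s in enumerate(sizes) if s >= threshold])
    return lambda size: left if size < threshold else right

def fit_boosting(sizes, labels, n_trees=3, learning_rate=1.0):
    n_pos = sum(labels)
    base = math.log(n_pos / (len(labels) - n_pos))  # starting log-odds
    trees = []

    def score(size):
        # Additive model: old model plus one small correction per tree.
        return base + sum(learning_rate * h(size) for h in trees)

    for _ in range(n_trees - 1):  # the constant model counts as Tree #1
        probs = [sigmoid(score(s)) for s in sizes]
        residuals = [y - p for y, p in zip(labels, probs)]
        trees.append(fit_stump(sizes, residuals, probs))
    return score

score = fit_boosting([1, 2, 3], [0, 0, 1])
predictions = {
    size: ("Orange" if sigmoid(score(size)) >= 0.5 else "Apple")
    for size in (1, 2, 3)
}
print(predictions)  # {1: 'Apple', 2: 'Apple', 3: 'Orange'}
```

Each pass through the loop recomputes the residuals from the current model and appends one more correction, so the final score is the starting log-odds plus the sum of all scaled tree outputs.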