
# Lab 7 - Gradient Boosting

### Author: Gloria Rivas

1. **Derivation and Analysis**

### Scenario A

Derive explicitly the optimal $\lambda$ for fitting from scratch, i.e., solve

$$
\lambda^* = \arg\min_{\lambda} \sum_{i=1}^{n} L(y_i, \lambda)
$$

where the loss is:

$$
L(y_i, \lambda) = -y_i \log(\sigma(\lambda)) - (1 - y_i) \log(1 - \sigma(\lambda))
$$

with the sigmoid function:

$$
\sigma(\lambda) = \frac{1}{1 + e^{-\lambda}}
$$

Let:
- $ m $: number of samples where $ y_i = 1 $
- $ k $: number of samples where $ y_i = 0 $
- $ n = m + k $

Then the total loss becomes:

$$
\mathcal{L}(\lambda) = -m \log(\sigma(\lambda)) - k \log(1 - \sigma(\lambda))
$$

Take the derivative with respect to $ \lambda $, using the chain rule and the identity $ \sigma'(\lambda) = \sigma(\lambda)(1 - \sigma(\lambda)) $:

$$
\frac{d\mathcal{L}}{d\lambda} = \sigma(\lambda)(1 - \sigma(\lambda)) \left( -\frac{m}{\sigma(\lambda)} + \frac{k}{1 - \sigma(\lambda)} \right)
$$

Simplify:

$$
\frac{d\mathcal{L}}{d\lambda} = -m (1 - \sigma(\lambda)) + k \sigma(\lambda)
$$
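
The simplified derivative can be sanity-checked against a central finite difference of the total loss. This is a minimal sketch using only the standard library; the class counts `m = 30`, `k = 70` and the helper names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(lam, m, k):
    """Total log-loss for m positives and k negatives under a constant logit lam."""
    s = sigmoid(lam)
    return -m * math.log(s) - k * math.log(1.0 - s)

def analytic_grad(lam, m, k):
    """Simplified derivative from the text: -m (1 - sigma) + k sigma."""
    s = sigmoid(lam)
    return -m * (1.0 - s) + k * s

def numeric_grad(lam, m, k, h=1e-6):
    """Central finite-difference approximation of dL/dlam."""
    return (total_loss(lam + h, m, k) - total_loss(lam - h, m, k)) / (2.0 * h)

m, k = 30, 70  # hypothetical class counts
for lam in (-2.0, 0.0, 1.5):
    # The two columns should agree up to finite-difference error.
    print(lam, analytic_grad(lam, m, k), numeric_grad(lam, m, k))
```

If the two printed values differ noticeably at any test point, the algebra in the simplification step would be the first place to look.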

Set derivative to zero:

$$
-m + m \sigma(\lambda) + k \sigma(\lambda) = 0
\quad\Rightarrow\quad \sigma(\lambda)(m + k) = m
\quad\Rightarrow\quad \sigma(\lambda) = \frac{m}{m + k} = \frac{m}{n}
$$

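Solve for $ \lambda $ (assuming $ m > 0 $ and $ k > 0 $, so that the log-odds are finite):

$$
\frac{1}{1 + e^{-\lambda}} = \frac{m}{n}
\quad\Rightarrow\quad e^{-\lambda} = \frac{n}{m} - 1 = \frac{k}{m}
\quad\Rightarrow\quad \lambda^* = \log\left( \frac{m}{k} \right)
$$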

---
**Final Result**

$$
\boxed{\lambda^* = \log\left( \frac{m}{k} \right)}
$$
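
As a numerical cross-check of the closed form, the sketch below minimizes the same loss by plain gradient descent and compares the result with $\log(m/k)$. The function name `fit_constant_logit`, the step size $1/n$, the iteration count, and the label vector are assumptions made for this example only.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_constant_logit(y, steps=500):
    """Minimize the total log-loss over a single constant logit by gradient descent."""
    n = len(y)
    m = sum(y)          # number of positive labels
    lam = 0.0           # start from the neutral prediction
    lr = 1.0 / n        # assumed step size; the curvature is at most n / 4
    for _ in range(steps):
        # Gradient -m (1 - sigma) + k sigma, rewritten as n * sigma - m.
        grad = n * sigmoid(lam) - m
        lam -= lr * grad
    return lam

y = [1] * 30 + [0] * 70            # hypothetical labels: m = 30, k = 70
lam_numeric = fit_constant_logit(y)
lam_closed = math.log(30 / 70)     # closed form log(m / k)
print(lam_numeric, lam_closed)
```

The example needs both classes to be present; with only one class the minimizer moves off to infinity and the iteration never settles.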

---
**Interpretation**

- The optimal constant $ \lambda^* $ is the log-odds of the positive class in the training data.
- If $ m = k $, then $ \lambda^* = 0 $ (neutral prediction).
- If $ m > k $, then $ \lambda^* > 0 $ : model favors class 1.
- If $ m < k $, then $ \lambda^* < 0 $ : model favors class 0.

### Scenario B

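Now assume that we already have predictions $ f_i = f_{m-1}(x_i) $ from the previous boosting iteration (the subscript $ m-1 $ is an iteration index, unrelated to the count $ m $ of positive samples in Scenario A), and we want to add a constant shift $ \lambda $: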

$$
\lambda^* = \arg\min_{\lambda} \sum_{i=1}^n L(y_i, f_i + \lambda)
$$

Where:

$$
L(y_i, f_i + \lambda) = -y_i \log(\sigma(f_i + \lambda)) - (1 - y_i) \log(1 - \sigma(f_i + \lambda))
$$

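Let $ \mathcal{L}(\lambda) = \sum_{i=1}^n L(y_i, f_i + \lambda) $ and $ s_i = \sigma(f_i + \lambda) $. Since $ \frac{d s_i}{d\lambda} = s_i (1 - s_i) $, each term contributes $ -y_i (1 - s_i) + (1 - y_i) s_i = s_i - y_i $, so: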

$$
\frac{d\mathcal{L}}{d\lambda} = \sum_{i=1}^n \left( s_i - y_i \right)
= \sum_{i=1}^n \left( \sigma(f_i + \lambda) - y_i \right)
$$

Set derivative to zero:

$$
\boxed{
\sum_{i=1}^{n} \left[ \sigma(f_i + \lambda) - y_i \right] = 0
}
$$
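
The boxed condition can be solved numerically. The left-hand side is increasing in $\lambda$, because its derivative is $\sum_i s_i (1 - s_i) > 0$, so a root is unique when it exists. The sketch below applies plain Newton iterations without any safeguard (no line search or bracketing), which is an assumption of this example; the function name `optimal_shift` and the sample data are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def optimal_shift(y, f, tol=1e-10, max_iter=50):
    """Solve sum_i [sigmoid(f_i + lam) - y_i] = 0 for lam with Newton's method."""
    lam = 0.0
    for _ in range(max_iter):
        s = [sigmoid(fi + lam) for fi in f]
        grad = sum(si - yi for si, yi in zip(s, y))   # first derivative of the loss
        hess = sum(si * (1.0 - si) for si in s)       # second derivative, always positive
        step = grad / hess
        lam -= step
        if abs(step) < tol:
            break
    return lam

y = [1, 0, 1, 1, 0]                  # hypothetical labels
f = [0.5, -1.0, 2.0, -0.3, 0.1]      # hypothetical previous predictions (logits)
lam_star = optimal_shift(y, f)
residual = sum(sigmoid(fi + lam_star) - yi for fi, yi in zip(f, y))
print(lam_star, residual)            # residual should be close to zero
```

Both classes must be present in `y` for a finite solution to exist. For predictions that are far from the optimum, a bracketed method such as bisection would be a more robust choice than unsafeguarded Newton steps.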

---

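- In Scenario A, every sample shares the same logit $ \lambda $, so the stationarity condition collapses to the single equation $ \sigma(\lambda) = m/n $, which inverts in closed form.
- In Scenario B:
  - Each $ f_i $ is different, so the terms $ \sigma(f_i + \lambda) $ cannot be collapsed into a single sigmoid.
  - The loss is still **convex** in $ \lambda $ (its second derivative $ \sum_i s_i (1 - s_i) $ is positive), so the minimizer is unique whenever both classes are present.
  - The stationarity condition is a sum of differently shifted sigmoids and in general has **no closed-form** solution for $ \lambda $.
  - It must be solved **numerically** (e.g., gradient descent, Newton's method, or bisection).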

The difficulty arises from the interaction between the **non-linearity of the sigmoid function** and the **variability in the previous predictions**.
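
As a consistency check between the two scenarios, consider the special case where all previous predictions are equal, $ f_i = c $ for every $ i $ (the constant $ c $ is introduced here only for illustration), with $ m $ and $ k $ denoting the class counts from Scenario A. The stationarity condition then collapses again:

$$
\sum_{i=1}^{n} \left[ \sigma(c + \lambda) - y_i \right] = n \, \sigma(c + \lambda) - m = 0
\quad\Rightarrow\quad \sigma(c + \lambda) = \frac{m}{n}
\quad\Rightarrow\quad \lambda^* = \log\left( \frac{m}{k} \right) - c
$$

Setting $ c = 0 $ recovers the Scenario A result, and any spread among the $ f_i $ is exactly what prevents this collapse.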