# Bias-Variance Decomposition

In machine learning, predictive models aim to minimize their error on unseen data. For squared-error loss, this expected error can be broken down into three components: **Bias**, **Variance**, and **Irreducible Error**. The bias-variance decomposition helps us understand how these components contribute to the overall error.

Let's consider a regression problem where we aim to predict a target variable $y$ given features $x$. The true relationship between $x$ and $y$ can be expressed as:

$$y = f(x) + \varepsilon$$

Where:
- $f(x)$ is the true (but unknown) function that maps features to the target.
- $\varepsilon$ is random noise with zero mean and constant variance $\sigma^2$:

$$\mathbb{E}\,\varepsilon = 0, \qquad \mathbb{V}\text{ar}\,\varepsilon = \mathbb{E}\,\varepsilon^2 = \sigma^2.$$

The goal of our model is to estimate $f(x)$.

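Short refresher. For a random quantity, written here as $f(x)$ with $x$ random, the variance expands as shown below. For an estimator $\hat{f}(x)$ of a fixed target value $f(x)$, the bias is the gap between its expected prediction (over training data) and that target:

$$
\begin{align*}
\mathbb{V}\text{ar} [f(x)] &= \mathbb{E}_x \left[ \left( f(x) - \mathbb{E}_x[f(x)] \right)^2 \right] \\
    &= \mathbb{E}_x \left[ f^2(x) - 2f(x)\, \mathbb{E}_x [f(x)] + \left( \mathbb{E}_x [f(x)] \right)^2 \right] \\
    &= \mathbb{E}_x [f^2(x)] - 2\, \mathbb{E}_x \left[ f(x)\, \mathbb{E}_x [f(x)] \right] + \mathbb{E}_x \left[ \left( \mathbb{E}_x [f(x)] \right)^2 \right] \\
    &= \mathbb{E}_x [f^2(x)] - 2\, \mathbb{E}_x [f(x)]\, \mathbb{E}_x [f(x)] + \left( \mathbb{E}_x [f(x)] \right)^2 \\
    &= \mathbb{E}_x [f^2(x)] - 2 \left( \mathbb{E}_x [f(x)] \right)^2 + \left( \mathbb{E}_x [f(x)] \right)^2 \\
    &= \mathbb{E}_x [f^2(x)] - \left( \mathbb{E}_x [f(x)] \right)^2 \\
    \\
\text{Bias} [\hat{f}(x)] &= \mathbb{E}[\hat{f}(x)] - f(x) \\
\end{align*}
$$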
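The variance identity can be checked numerically. The snippet below is a minimal sketch using only the Python standard library; the sample of values is an arbitrary illustration and is not taken from the text.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical sample standing in for values of f(x); any numeric sample works.
values = [rng.uniform(-1.0, 1.0) ** 2 for _ in range(1000)]

mean = statistics.fmean(values)
# Left-hand side: E[(f - E[f])^2]
lhs = statistics.fmean([(v - mean) ** 2 for v in values])
# Right-hand side: E[f^2] - (E[f])^2
rhs = statistics.fmean([v * v for v in values]) - mean ** 2

print(f"E[(f - E f)^2] = {lhs:.6f}")
print(f"E[f^2] - (E f)^2 = {rhs:.6f}")
```

Both expressions agree up to floating-point rounding, since the identity holds for empirical averages as well as for true expectations.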



Let's assume we have a training sample $X$ of size $l$ drawn from the dataset:  
$$ X = ((x_1, y_1), \ldots, (x_l, y_l))$$ 

Let $a(x)$ be the estimator of $y$ trained on this sample $X$:
$$a(x) = a(x, X)$$
The goal of the estimator is to approximate the function $f(x)$ given the sample $X$. Note that the target value $y$ is a function of $x$ and $\varepsilon$:  
$$y(x) = y(x, \varepsilon)$$

The joint expectation factorizes as $\mathbb{E}_{X,\varepsilon} = \mathbb{E}_{X}\mathbb{E}_{\varepsilon}$ because the training sample $X$ and the noise $\varepsilon$ in the new observation are independent.
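With this setup, the decomposition follows in a few lines. The derivation below is a sketch for a fixed point $x$; the shorthand $\bar{a}(x) = \mathbb{E}_X[a(x)]$ is introduced here only for convenience.

$$
\begin{align*}
\mathbb{E}_{X,\varepsilon} \left[ (y - a(x))^2 \right]
    &= \mathbb{E}_{X,\varepsilon} \left[ (f(x) + \varepsilon - a(x))^2 \right] \\
    &= \mathbb{E}_{X} \left[ (f(x) - a(x))^2 \right] + 2\, \mathbb{E}_{\varepsilon}[\varepsilon]\, \mathbb{E}_{X} [f(x) - a(x)] + \mathbb{E}_{\varepsilon}[\varepsilon^2] \\
    &= \mathbb{E}_{X} \left[ (f(x) - a(x))^2 \right] + \sigma^2 \\
    &= \mathbb{E}_{X} \left[ (f(x) - \bar{a}(x) + \bar{a}(x) - a(x))^2 \right] + \sigma^2 \\
    &= (f(x) - \bar{a}(x))^2 + 2\, (f(x) - \bar{a}(x))\, \mathbb{E}_{X} [\bar{a}(x) - a(x)] + \mathbb{E}_{X} \left[ (a(x) - \bar{a}(x))^2 \right] + \sigma^2 \\
    &= \underbrace{(f(x) - \mathbb{E}_X[a(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{V}\text{ar}_X [a(x)]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}
\end{align*}
$$

The first cross term vanishes because $\mathbb{E}_{\varepsilon}[\varepsilon] = 0$, and the second because $\mathbb{E}_{X}[\bar{a}(x) - a(x)] = 0$ by the definition of $\bar{a}(x)$.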

# Bias-Variance Decomposition and its Connection with Random Forest

As before, consider a regression problem, now written with capital letters: we predict a target variable $Y$ from features $X$ (in this section $X$ denotes the features, not the training sample). The true relationship between $X$ and $Y$ can be expressed as:

$$Y = f(X) + \varepsilon$$

Where:
- $f(X)$ is the true (but unknown) function that maps features to the target.
- $\varepsilon$ is random noise with zero mean and constant variance $\sigma^2$.

The goal of our model $\hat{f}$ is to estimate $f(X)$. A Random Forest is an ensemble of decision trees whose predictions are averaged. As for any regression model, its expected squared error decomposes as:

$$ \text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$

Where:
- **MSE (Mean Squared Error)** is the expected squared prediction error on a new observation.
- **Bias** measures the error due to overly simplistic assumptions in the model. It quantifies how much the model's average prediction, taken over training datasets, differs from the true function $f(X)$.
- **Variance** measures the error due to model instability. It quantifies how much the predictions at a given $X$ vary as we fit the model to different training datasets.
- **Irreducible Error** represents the noise inherent in the data, which cannot be reduced no matter how complex the model is.
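The decomposition can be checked by simulation. The sketch below is an illustration under assumptions that are not in the text: a hypothetical true function $f(x) = x^2$, Gaussian noise, and a straight-line least-squares fit as the model. It repeatedly draws a training set, fits the line, and compares the Monte Carlo MSE at one test point with the sum of the three terms.

```python
import random
import statistics

def f(x):
    """Hypothetical 'true' function, chosen only for this illustration."""
    return x * x

def fit_line(xs, ys):
    """Ordinary least squares for a straight line y = b0 + b1 * x."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def decompose(x0=0.9, sigma=0.3, n_train=20, n_trials=20000, seed=0):
    """Estimate MSE, bias^2, variance and noise at the test point x0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_trials):
        # Fresh training set for every trial.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
        ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
        b0, b1 = fit_line(xs, ys)
        pred = b0 + b1 * x0
        # Fresh noisy observation at the test point.
        y0 = f(x0) + rng.gauss(0.0, sigma)
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    noise = sigma ** 2
    mse = statistics.fmean(sq_errors)
    return mse, bias_sq, variance, noise

mse, bias_sq, variance, noise = decompose()
print(f"MSE (Monte Carlo)        = {mse:.4f}")
print(f"Bias^2 + Variance + Noise = {bias_sq + variance + noise:.4f}")
print(f"  Bias^2 = {bias_sq:.4f}, Variance = {variance:.4f}, Noise = {noise:.4f}")
```

The two totals should agree up to Monte Carlo error. A straight line cannot follow a parabola, so in this toy setting the bias term is expected to dominate.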

Now, let's derive the expressions for Bias and Variance:

### Bias:
Bias is the difference between our model's expected prediction and the true function $f(X)$, where the expectation is taken over training datasets (and over the randomness of tree construction) at a fixed $X$:

$$
\text{Bias} = \mathbb{E}[\hat{f}(X)] - f(X)
$$

In Random Forest, each tree provides an estimate $\hat{f}_i(X)$, and the forest averages them:

$$
\hat{f}(X) = \frac{1}{N}\sum_{i=1}^{N} \hat{f}_i(X)
$$

Where $N$ is the number of trees. By linearity of expectation, and assuming the trees are identically distributed (each is grown by the same randomized procedure):

$$
\begin{align*}
\mathbb{E}[\hat{f}(X)] &= \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}[\hat{f}_i(X)] \\
&= \mathbb{E}[\hat{f}_1(X)]
\end{align*}
$$

This step needs no assumption about the correlation between trees. Note also that averaging does not make a tree unbiased: in general $\mathbb{E}[\hat{f}_i(X)] \neq f(X)$.

Finally, the Bias term simplifies to:

$$
\text{Bias} = \mathbb{E}[\hat{f}_1(X)] - f(X)
$$

In other words, the bias of the forest equals the bias of a single tree, and adding more trees does not reduce it.

### Variance:
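Variance measures how much the predictions of our model vary across different training datasets. For Random Forest, whose prediction is the average $\hat{f}(X) = \frac{1}{N}\sum_{i=1}^{N} \hat{f}_i(X)$ of its $N$ trees, this can be expressed as:

$$
\text{Variance} = \text{Var}\left(\frac{1}{N}\sum_{i=1}^{N} \hat{f}_i(X)\right) = \frac{1}{N^2}\sum_{i=1}^{N} \text{Var}(\hat{f}_i(X)) + \frac{1}{N^2}\sum_{i \neq j} \text{Cov}(\hat{f}_i(X), \hat{f}_j(X))
$$

Where $\text{Var}(\hat{f}_i(X))$ is the variance of the predictions of the $i$-th tree. If every tree has the same variance $\sigma_T^2$ and every pair of trees has the same correlation $\rho$, this becomes:

$$
\text{Variance} = \rho\, \sigma_T^2 + \frac{1 - \rho}{N}\, \sigma_T^2
$$

For uncorrelated trees ($\rho = 0$) the variance is $\sigma_T^2 / N$. Trees trained on the same data are usually positively correlated, so the first term remains however many trees are added. This is why Random Forest decorrelates its trees with bootstrap samples and random feature subsets.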
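A small simulation can make the effect of averaging concrete. The sketch below does not train real decision trees. It models each tree's prediction as a hypothetical Gaussian variable with a common mean, a common standard deviation, and pairwise correlation $\rho$, then compares the empirical variance of the average with $\rho s^2 + (1 - \rho) s^2 / N$, the standard formula for the variance of an average of $N$ equally correlated variables with variance $s^2$. All names and parameter values are assumptions made for illustration.

```python
import math
import random
import statistics

def forest_prediction(rng, n_trees, rho, tree_sd, tree_mean):
    """Average of n_trees toy 'tree predictions' with pairwise correlation rho.

    Each prediction is tree_mean + tree_sd * (sqrt(rho) * shared + sqrt(1 - rho) * own),
    a hypothetical stand-in for the output of a real decision tree.
    """
    shared = rng.gauss(0.0, 1.0)  # component common to all trees
    total = 0.0
    for _ in range(n_trees):
        own = rng.gauss(0.0, 1.0)  # component specific to one tree
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own
        total += tree_mean + tree_sd * z
    return total / n_trees

def simulate(n_trees, rho=0.3, tree_sd=1.0, tree_mean=0.5, n_trials=10000, seed=0):
    """Empirical mean and variance of the averaged prediction."""
    rng = random.Random(seed)
    preds = [forest_prediction(rng, n_trees, rho, tree_sd, tree_mean)
             for _ in range(n_trials)]
    return statistics.fmean(preds), statistics.pvariance(preds)

def theoretical_variance(n_trees, rho=0.3, tree_sd=1.0):
    """Variance of an average of equally correlated variables."""
    return rho * tree_sd ** 2 + (1.0 - rho) * tree_sd ** 2 / n_trees

for n in (1, 10, 100):
    mean, var = simulate(n)
    print(f"N={n:3d}  mean={mean:.3f}  variance={var:.3f}  "
          f"theory={theoretical_variance(n):.3f}")
```

In this toy model the mean of the averaged prediction stays near `tree_mean` for every `N`, while the variance shrinks towards `rho * tree_sd ** 2` and no further.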

In summary, the bias-variance decomposition describes a trade-off: simple models tend to have high bias and low variance, while complex models tend to have low bias and high variance. Random Forest mitigates overfitting by averaging the predictions of many deep, low-bias trees. Averaging reduces variance while leaving the bias roughly equal to that of a single tree.

Understanding this decomposition is essential for model selection and tuning to achieve optimal predictive performance.
