# Excess Risk Decomposition

## Notes to self: Excess risk theory reminder

Ideally we'd like to find the Bayes decision function, which minimises the risk $R(f) = \mathbb E\left[l(f(x), y)\right]$ over all functions
$$
f^* = \arg \min_f \mathbb E\left[l(f(x), y)\right]
$$

However, to keep optimisation feasible (and add some regularisation) we only look for functions within a hypothesis space $\mathbb F$.
$$
f^\mathbb F = \arg \min_{f \in \mathbb F} \mathbb E\left[l(f(x), y)\right]
$$

The difference between these is the **approximation error**
$$
R(f^\mathbb F) - R(f^*)
$$

However, in general we can't compute $f^\mathbb F$, as we only have a limited amount of data. As such we can only compute the **empirical risk minimiser**
$$
\hat{f_n} = \arg \min_{f \in \mathbb F} \frac{1}{n} \sum_{i=1}^n l\left(f(x_i), y_i\right)
$$
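To make the definition concrete for myself, here is a minimal sketch of empirical risk minimisation over a small finite hypothesis space. The squared loss, the toy data and the grid of constant functions are all assumptions chosen for illustration, and `empirical_risk` and `erm` are hypothetical helper names.

```python
def empirical_risk(f, data):
    """Average squared loss of f over a list of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def erm(hypotheses, data):
    """Return the hypothesis with the smallest empirical risk."""
    return min(hypotheses, key=lambda f: empirical_risk(f, data))


# Hypothetical toy data and a hypothesis space of constant functions.
data = [(1, 1), (4, 4), (7, 7)]
constants = [c / 2 for c in range(2, 21)]           # 1.0, 1.5, ..., 10.0
hypotheses = [lambda x, c=c: c for c in constants]  # c=c binds each constant

f_hat = erm(hypotheses, data)
print(f_hat(0))  # 4.0, the sample mean of the y values
```

With squared loss and constant functions the minimiser is the sample mean of the targets, which is why the brute-force search lands on 4.0 here.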

This introduces the **estimation error**.
$$
R(\hat{f_n}) - R(f^\mathbb F)
$$

However, when trying to find $\hat{f_n}$, the optimisation algorithm doesn't give the perfect result; instead it yields some $\tilde{f_n}$. Thus we also get an **optimisation error**
$$
R(\tilde{f_n}) - R(\hat{f_n})
$$

The **Excess Risk** is the difference between the final risk, and the risk of the Bayes decision function
$$
R(\tilde{f_n}) - R(f^*) \\
= \underbrace{R(f^\mathbb F) - R(f^*)}_{\text{Approximation Error}} 
  + \underbrace{R(\hat{f_n}) - R(f^\mathbb F)}_{\text{Estimation Error}}
  + \underbrace{R(\tilde{f_n}) - R(\hat{f_n})}_{\text{Optimisation Error}}
$$

### Some notes

* The approximation error and the estimation error are always non-negative.
* The optimisation error could be negative.
* If the approximation error and estimation error dominate the optimisation error (which seems to be quite often the case),
  there is no point in using advanced optimisation methods to minimise the optimisation error,
  as it will have a negligible effect on the total excess risk.
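The decomposition and the sign claims above can be checked numerically. This is only a sketch on a hypothetical toy problem: I assume $X$ uniform on $\{1, \dots, 10\}$ with $Y = X$, squared loss, constant functions as hypothesis space, and a deliberately sloppy optimiser (three gradient steps). All names in the code are made up for the illustration.

```python
import random

XS = range(1, 11)  # assumption: X uniform on {1, ..., 10} and Y = X


def risk(c):
    """Exact risk of the constant predictor f(x) = c under squared loss."""
    return sum((c - x) ** 2 for x in XS) / len(XS)


bayes_risk = 0.0             # f*(x) = x predicts Y perfectly
c_best = sum(XS) / len(XS)   # best constant in the hypothesis space: 5.5

rng = random.Random(0)
sample = [rng.choice(XS) for _ in range(5)]  # training targets y_i
c_hat = sum(sample) / len(sample)            # exact ERM: the sample mean

# Imperfect optimiser: a few gradient steps on the empirical risk from c = 0.
c_tilde = 0.0
for _ in range(3):
    grad = 2 * (c_tilde - c_hat)  # derivative of mean((c - y_i)^2) w.r.t. c
    c_tilde -= 0.2 * grad

approximation = risk(c_best) - bayes_risk
estimation = risk(c_hat) - risk(c_best)
optimisation = risk(c_tilde) - risk(c_hat)
excess = risk(c_tilde) - bayes_risk

print(approximation, estimation, optimisation, excess)
```

The three terms telescope, so they always sum to the excess risk. The optimisation error is negative whenever the sloppy optimiser happens to stop closer to 5.5 than the sample mean is.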



## Concept Check Questions

### Exercise 1: Uniform distribution with $X=Y$

#### (a) Bayes risk

Bayes risk is 0.

#### (b) Approx error when using constant functions

In this case the best constant predictor is
$$f^\mathbb F(x) = \mathbb E[Y] = \mathbb E[X] = 5.5$$
Since the Bayes risk is 0, the approximation error becomes
$$
R(f^\mathbb F) - R(f^*) = \frac{1}{10}\sum_{x=1}^{10}(x-5.5)^2 - 0 = 8.25
$$
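A quick sanity check of the arithmetic, assuming (as the sum above suggests) that $X$ is uniform on $\{1, \dots, 10\}$ and the loss is the squared loss. Exact fractions avoid any rounding doubts.

```python
from fractions import Fraction

xs = range(1, 11)

# Best constant under squared loss is the mean of Y = X.
best_constant = Fraction(sum(xs), len(xs))

# Its risk is the variance of X.
risk = sum((x - best_constant) ** 2 for x in xs) / len(xs)

print(best_constant, risk)  # 11/2 33/4
```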

#### (c) Hypothesis space of affine functions

The approximation error will be 0.

When $\hat{f}(x) = x + 1$, the estimation error is
$$
R(\hat f) - R(f^\mathbb F) = \mathbb E\left[(X + 1 - X)^2\right] - 0 = 1^2 = 1
$$
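The same kind of check for the affine case, again assuming $X$ uniform on $\{1, \dots, 10\}$, $Y = X$ and squared loss. The helper `risk` is a hypothetical name.

```python
xs = range(1, 11)


def risk(f):
    """Exact risk of f under squared loss when Y = X."""
    return sum((f(x) - x) ** 2 for x in xs) / len(xs)


# f(x) = x is the best affine function here, f(x) = x + 1 is the given estimate.
estimation_error = risk(lambda x: x + 1) - risk(lambda x: x)
print(estimation_error)  # 1.0
```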

### Exercise 3: Characterise in terms of error

#### (a) Overfitting

Estimation error

#### (b) Underfitting

Most likely approximation error. However, in some cases it might also be caused by optimisation error.

#### (c) Computationally intractable precise empirical risk minimisation

I would be tempted to say that this one is going to be optimisation error.

#### (d) Not enough data

Likely to cause an estimation error

### Exercise 4: Brain teasers

#### (a) Randomness of $R(\hat f_n)$
$R(\hat f_n)$ depends on the chosen training data. If we assume the training data to be fixed, it is deterministic. However, if we train on randomly selected/generated data, $\hat f_n$ is random and therefore so is $R(\hat f_n)$.
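A small simulation makes this visible. It is a sketch under assumptions that are mine, not part of the exercise: $X$ uniform on $\{1, \dots, 10\}$, $Y = X$, squared loss, constant functions, and the sample mean as ERM. `risk_of_erm` is a hypothetical helper.

```python
import random

XS = range(1, 11)


def risk(c):
    """Exact risk of the constant predictor f(x) = c under squared loss."""
    return sum((c - x) ** 2 for x in XS) / len(XS)


def risk_of_erm(seed, n=5):
    """Draw a training set of size n, fit the ERM constant, return its true risk."""
    rng = random.Random(seed)
    sample = [rng.choice(XS) for _ in range(n)]
    return risk(sum(sample) / n)


# A different seed means a different training set, hence a different R(f_hat_n).
risks = [risk_of_erm(seed) for seed in range(8)]
print(risks)

# Fixing the training set (same seed) makes the risk deterministic.
print(risk_of_erm(3) == risk_of_erm(3))  # True
```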

#### (b) Increasing hypothesis space
**False**: Increasing the hypothesis space will not always leave $R(\hat f_n)-R(f^*)$ constant. For a counterexample, look at a situation like exercise 1, where switching from constant functions to affine functions will (in all but the most extreme cases) decrease $R(\hat f_n)-R(f^*)$.

#### (c) Data as a random sample
**False**: When we treat the data as a random sample, the approximation error will still be deterministic, as it doesn't depend on the data. However, the estimation error, as it depends on the actual data used, will indeed be random.

#### (d) Empirical risk of ERM
**False**: The empirical risk of the ERM, which is what we minimise on our training data, will generally be smaller than its true risk.
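A short argument for why, as a sketch. I write $\hat R_n$ for the empirical risk on the training sample (my notation) and assume the sample is i.i.d., so that $\hat R_n(f)$ is unbiased for any fixed $f$:

$$
\mathbb E\left[\hat R_n(\hat f_n)\right] \leq \mathbb E\left[\hat R_n(f^\mathbb F)\right] = R(f^\mathbb F) \leq R(\hat f_n)
$$

The first inequality holds because $\hat f_n$ minimises $\hat R_n$ over $\mathbb F$, the equality because $f^\mathbb F$ does not depend on the sample, and the last inequality because $f^\mathbb F$ minimises $R$ over $\mathbb F$. So on average the training risk of the ERM is an optimistic estimate of its true risk.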

#### (e) Implicit sample space

1. Sample space of the test set to compute $\hat R$
2. Sample space to compute the different training sets to compute $\hat f_n$
3. Sample space is again the training data set. Note that for minibatch gradient descent the other aspect of the sample space is which training data are chosen for which batch.

### Exercise 5: Brain teasers ct'd (comparisons)

(a) More training data should yield a smaller estimation error. However, given that it's only five more data points, I expect the effect to be negligible and somewhat random.

(b) Approximation error $f_1 \geq$ approximation error $f_2$.

(c) Approximation error $f_1 \geq$ approximation error $f_2$. Note that there might be a risk of overfitting.

(d) I think the estimation error in the second case will be larger, as there's more freedom, and thus more data is needed to get an equally good fit.

(e) Unknown. It might be improving, or there might be overfitting.

### Exercise 6

Haven't seen decision trees (yet).

## Topic 3: Concept Check Questions

1. $$\lambda = \frac{1}{C}$$
2. 
3. 
2. The norm of the regularised parameters will always be smaller, given that large parameters are penalised, as the specific goal of the regularisation term is to minimise the size of the parameters. (Note: the model solution here is much more formal; I quite like the approach, and I would never have thought of it myself.)
3. Feature normalisation ensures that the 'regularisation force' applied to each of the parameters is comparable. If the features had greatly different scales, the matching parameters would differ by orders of magnitude, and the large parameters would be affected by regularisation much more than the small ones.
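The last two answers can be illustrated with a tiny experiment. This is a sketch assuming one-dimensional ridge regression without intercept, i.e. minimising $\sum_i (w x_i - y_i)^2 + \lambda w^2$, which has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The data and the helper names `ridge_weight` and `shrinkage` are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimiser of sum((w * x - y)^2) + lam * w^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def shrinkage(xs, ys, lam):
    """Ratio of the regularised weight to the unregularised one."""
    return ridge_weight(xs, ys, lam) / ridge_weight(xs, ys, 0.0)


ys = [1.0, 2.0, 3.0, 4.0]
x_small = [0.001 * y for y in ys]   # tiny feature values: unregularised w is about 1000
x_large = [1000.0 * y for y in ys]  # huge feature values: unregularised w is about 0.001

lam = 1.0
print(shrinkage(x_small, ys, lam))  # close to 0: the large weight is crushed
print(shrinkage(x_large, ys, lam))  # close to 1: the small weight is barely touched
```

The same feature expressed in different units gets shrunk by wildly different amounts, which is the motivation for normalising first. In both cases the regularised weight is no larger in magnitude than the unregularised one.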