# Measuring performance

Previous notebooks described neural network models, loss functions, and training algorithms. This one considers how to measure the performance of the trained models.
With sufficient capacity (i.e., number of hidden units), a neural network model will often
perform perfectly on the training data. However, this does not necessarily mean it will
generalize well to new test data.
We will see that the test errors have three distinct causes and that their relative
contributions depend on (i) the inherent uncertainty in the task, (ii) the amount of
training data, and (iii) the choice of model. The latter dependency raises the issue of
*hyperparameter* search.

Hyperparameters fall into two groups:

1. Model hyperparameters: the number of hidden units and the number of hidden layers
2. Learning algorithm hyperparameters: the learning rate and batch size

There are three possible sources of test error, known as *noise*, *bias*, and *variance*.

## Mathematical formulation of test error

We now make the notions of noise, bias, and variance mathematically precise. Consider a 1D regression problem where the data generation process has additive noise with variance $\sigma^2$; we can observe different outputs $y$ for the same input $x$, so for each $x$, there is a distribution $P_{r}(y|x)$ with expected value (mean) $\mu[x]$:

$$
\mu[x] = \mathbb{E}_y[y[x]] = \int y[x] \, P_{r}(y|x) \, dy,
$$

and fixed noise $\sigma^2 = \mathbb{E}_y [(\mu[x] - y[x])^2]$. Here we have used the notation $y[x]$ to specify that we are considering the output $y$ at a given input position $x$.

Now consider a least squares loss between the model prediction $f[x, \phi]$ at position $x$ and the observed value $y[x]$ at that position:

$$
L[x] = (f[x, \phi] - y[x])^2
$$

$$
= \left((f[x, \phi] - \mu[x]) + (\mu[x] - y[x])\right)^2
$$

$$
= (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x])(\mu[x] - y[x]) + (\mu[x] - y[x])^2,
$$

where we have both added and subtracted the mean $\mu[x]$ of the underlying function in the second line and have expanded out the squared term in the third line.

The underlying function is stochastic, so this loss depends on the particular $y[x]$ we observe. The expected loss is:

$$
\mathbb{E}_y [L[x]] = \mathbb{E}_y \left[ (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x])(\mu[x] - y[x]) + (\mu[x] - y[x])^2 \right]
$$

$$
= (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x])(\mu[x] - \mathbb{E}_y [y[x]]) + \mathbb{E}_y [(\mu[x] - y[x])^2]
$$

$$
= (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x]) \cdot 0 + \mathbb{E}_y [(\mu[x] - y[x])^2]
$$

$$
= (f[x, \phi] - \mu[x])^2 + \sigma^2,
$$

where we have made use of the rules for manipulating expectations. In the second line, we have distributed the expectation operator and removed it from terms with no dependence on $y[x]$, and in the third line, we note that the second term is zero since $\mathbb{E}_y [y[x]] = \mu[x]$ by definition. Finally, in the fourth line, we have substituted in the definition of the fixed noise $\sigma^2$. We can see that the expected loss has been broken down into two terms; the
first term is the squared deviation between the model and the true function mean, and
the second term is the noise.
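
The two-term result can be checked numerically at a single input position. The following is a minimal Monte Carlo sketch, assuming Gaussian noise and purely illustrative values for $f[x, \phi]$, $\mu[x]$, and $\sigma$ (none of these numbers come from the text above). The function name `expected_loss_check` is hypothetical.

```python
import random


def expected_loss_check(f_x, mu_x, sigma, n_samples=200_000, seed=0):
    """Compare a Monte Carlo estimate of E_y[L[x]] with (f - mu)^2 + sigma^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        y = rng.gauss(mu_x, sigma)   # one noisy observation y[x] (Gaussian is an assumption)
        total += (f_x - y) ** 2      # least squares loss for this observation
    empirical = total / n_samples
    predicted = (f_x - mu_x) ** 2 + sigma ** 2
    return empirical, predicted


# Hypothetical values chosen only for illustration.
empirical, predicted = expected_loss_check(f_x=1.3, mu_x=1.0, sigma=0.5)
print(f"Monte Carlo estimate of E_y[L[x]]: {empirical:.4f}")
print(f"(f - mu)^2 + sigma^2:              {predicted:.4f}")
```

The two printed numbers should agree up to sampling error, which shrinks as `n_samples` grows.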

The first term can be further partitioned into bias and variance. The parameters $\phi$ of the model $f[x, \phi]$ depend on the training dataset $D = \{x_i, y_i\}$, so more properly, we should write $f[x, \phi[D]]$. The training dataset is a random sample from the data generation process; with a different sample of training data, we would learn different parameter values. The expected model output $f_\mu[x]$ with respect to all possible datasets $D$ is hence:

$$
f_\mu[x] = \mathbb{E}_D [f[x, \phi[D]]].
$$

Returning to the first term of the expected loss, $(f[x, \phi] - \mu[x])^2$, we add and subtract $f_\mu[x]$ and expand:

$$
(f[x, \phi[D]] - \mu[x])^2
$$

$$
= \left((f[x, \phi[D]] - f_\mu[x]) + (f_\mu[x] - \mu[x])\right)^2
$$

$$
= (f[x, \phi[D]] - f_\mu[x])^2 + 2(f[x, \phi[D]] - f_\mu[x])(f_\mu[x] - \mu[x]) + (f_\mu[x] - \mu[x])^2.
$$

We then take the expectation with respect to the training dataset $D$:

$$
\mathbb{E}_D [(f[x, \phi[D]] - \mu[x])^2] = \mathbb{E}_D [(f[x, \phi[D]] - f_\mu[x])^2] + (f_\mu[x] - \mu[x])^2,
$$

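where we have simplified using the same steps as for the expected loss over $y$: the cross term vanishes because $\mathbb{E}_D [f[x, \phi[D]] - f_\mu[x]] = 0$ by the definition of $f_\mu[x]$, and the final term is unchanged because it does not depend on $D$. Finally, we substitute this result into the expected loss: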

$$
\mathbb{E}_D [\mathbb{E}_y [L[x]]] = \mathbb{E}_D [(f[x, \phi[D]] - f_\mu[x])^2] + (f_\mu[x] - \mu[x])^2 + \sigma^2
$$

$$
= \underbrace{\mathbb{E}_D [(f[x, \phi[D]] - f_\mu[x])^2]}_{\text{variance}} + \underbrace{(f_\mu[x] - \mu[x])^2}_{\text{bias}} + \underbrace{\sigma^2}_{\text{noise}}.
$$

This equation says that the expected loss after considering the uncertainty in the training data $D$ and the test data $y$ consists of three additive components. The variance is uncertainty in the fitted model due to the particular training dataset we sample. The bias is the systematic deviation of the model from the mean of the function we are modeling. The noise is the inherent uncertainty in the true mapping from input to output. These three sources of error will be present for any task. They combine additively for regression tasks with a least squares loss. However, their interaction can be more complex for other types of problems.
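
The three-way decomposition can be illustrated with a small simulation. The sketch below is not a neural network: it assumes a straight-line model fitted by closed-form least squares, a hypothetical mean function $\mu[x] = \sin(2\pi x)$, Gaussian noise, and inputs drawn uniformly from $[0, 1]$. It estimates the variance, squared bias, and noise at one test position and compares their sum with a direct estimate of the expected loss.

```python
import math
import random

SIGMA = 0.3     # assumed noise standard deviation
X_TEST = 0.25   # input position x where the loss is evaluated


def true_mean(x):
    """Assumed mean function mu[x]; chosen for illustration only."""
    return math.sin(2.0 * math.pi * x)


def sample_dataset(rng, n_points):
    """Draw one training dataset D = {x_i, y_i} from the assumed process."""
    xs = [rng.random() for _ in range(n_points)]
    ys = [rng.gauss(true_mean(x), SIGMA) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def decompose(n_datasets=5000, n_points=10, seed=0):
    rng = random.Random(seed)
    preds, losses = [], []
    for _ in range(n_datasets):
        intercept, slope = fit_line(*sample_dataset(rng, n_points))
        f_x = intercept + slope * X_TEST            # f[x, phi[D]]
        y_x = rng.gauss(true_mean(X_TEST), SIGMA)   # fresh test observation y[x]
        preds.append(f_x)
        losses.append((f_x - y_x) ** 2)
    f_mu = sum(preds) / n_datasets                  # estimate of f_mu[x]
    variance = sum((p - f_mu) ** 2 for p in preds) / n_datasets
    bias_sq = (f_mu - true_mean(X_TEST)) ** 2
    expected_loss = sum(losses) / n_datasets        # direct estimate of E_D[E_y[L[x]]]
    return variance, bias_sq, SIGMA ** 2, expected_loss


variance, bias_sq, noise, expected_loss = decompose()
print(f"variance + bias^2 + noise = {variance + bias_sq + noise:.3f}")
print(f"direct expected loss      = {expected_loss:.3f}")
```

The two printed values should match up to sampling error. A straight line cannot follow a sinusoid, so the squared bias is expected to be the largest of the three terms in this particular setup.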

**Reducing variance**:
Recall that the variance results from limited noisy training data. Fitting the model
to two different training sets results in slightly different parameters. It follows we can
reduce the variance by increasing the quantity of training data. This averages out the
inherent noise and ensures that the input space is well sampled.
Consider the effect of training with 6, 10, and 100 samples, fitting the best
model to three different training datasets at each dataset size. With only six samples,
the fitted function is quite different each time: the variance is significant. As we increase
the number of samples, the fitted models become very similar, and the variance reduces.
In general, adding training data almost always improves test performance.
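
The effect of dataset size on the variance can be sketched numerically. The example below assumes the same illustrative setup as a simple regression toy problem (a hypothetical mean function $\sin(2\pi x)$, Gaussian noise, uniformly sampled inputs, and a straight-line model standing in for the network) and estimates the variance of the prediction at one test position for 6, 10, and 100 training samples.

```python
import math
import random

SIGMA = 0.3     # assumed noise standard deviation
X_TEST = 0.25   # position at which the fitted models are compared


def true_mean(x):
    """Assumed mean function mu[x]; chosen for illustration only."""
    return math.sin(2.0 * math.pi * x)


def fit_line_predict(xs, ys, x_test):
    """Fit y = intercept + slope * x by least squares and predict at x_test."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x_test - x_bar)


def prediction_variance(n_points, n_datasets=2000, seed=0):
    """Variance of the prediction at X_TEST across independently drawn datasets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_datasets):
        xs = [rng.random() for _ in range(n_points)]
        ys = [rng.gauss(true_mean(x), SIGMA) for x in xs]
        preds.append(fit_line_predict(xs, ys, X_TEST))
    mean_pred = sum(preds) / n_datasets
    return sum((p - mean_pred) ** 2 for p in preds) / n_datasets


variances = {n: prediction_variance(n) for n in (6, 10, 100)}
for n, v in variances.items():
    print(f"{n:3d} training samples: variance of prediction = {v:.4f}")
```

Under these assumptions the estimated variance should fall as the number of training samples grows. The bias is unaffected, since the straight-line model is equally inflexible at every dataset size.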

**Reducing bias**:
The bias term results from the inability of the model to describe the true underlying
function. This suggests that we can reduce this error by making the model more flexible.
This is usually done by increasing the model capacity. For neural networks, this means
adding more hidden units and/or hidden layers.

However, increasing the model capacity has an unexpected side effect.
For a fixed-size training dataset, the variance term typically increases as the model
capacity increases. Consequently, increasing the model capacity does not necessarily
reduce the test error. This is known as the bias-variance trade-off.
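
The trade-off can be sketched with a model whose capacity is easy to vary. The example below swaps the neural network for a piecewise-constant model with `num_bins` regions (one parameter per region), which is an assumption made for simplicity. It also assumes a hypothetical mean function $\sin(2\pi x)$, Gaussian noise, and 24 training inputs fixed on a regular grid so that every region contains data; only the outputs $y$ are resampled between datasets.

```python
import math
import random

SIGMA = 0.3    # assumed noise standard deviation
N_TRAIN = 24   # fixed training set size
TRAIN_X = [(i + 0.5) / N_TRAIN for i in range(N_TRAIN)]   # inputs fixed on a grid
TEST_X = [(j + 0.5) / 48 for j in range(48)]              # finer grid of test positions


def true_mean(x):
    """Assumed mean function mu[x]; chosen for illustration only."""
    return math.sin(2.0 * math.pi * x)


def bin_index(x, num_bins):
    return min(int(x * num_bins), num_bins - 1)


def fit_bins(ys, num_bins):
    """Piecewise-constant model: each parameter is the mean of y in one bin."""
    sums, counts = [0.0] * num_bins, [0] * num_bins
    for x, y in zip(TRAIN_X, ys):
        b = bin_index(x, num_bins)
        sums[b] += y
        counts[b] += 1
    return [s / c for s, c in zip(sums, counts)]


def bias_variance(num_bins, n_datasets=1000, seed=0):
    """Return (squared bias, variance), each averaged over the test positions."""
    rng = random.Random(seed)
    pred_sum = [0.0] * len(TEST_X)
    pred_sq_sum = [0.0] * len(TEST_X)
    for _ in range(n_datasets):
        ys = [rng.gauss(true_mean(x), SIGMA) for x in TRAIN_X]
        params = fit_bins(ys, num_bins)
        for j, x in enumerate(TEST_X):
            f_x = params[bin_index(x, num_bins)]
            pred_sum[j] += f_x
            pred_sq_sum[j] += f_x ** 2
    bias_sq = variance = 0.0
    for j, x in enumerate(TEST_X):
        f_mu = pred_sum[j] / n_datasets
        variance += pred_sq_sum[j] / n_datasets - f_mu ** 2
        bias_sq += (f_mu - true_mean(x)) ** 2
    return bias_sq / len(TEST_X), variance / len(TEST_X)


# Capacities are chosen to divide N_TRAIN so that no bin is ever empty.
results = {k: bias_variance(k) for k in (1, 4, 12, 24)}
for k, (bias_sq, variance) in results.items():
    total = bias_sq + variance + SIGMA ** 2
    print(f"bins={k:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}  expected loss={total:.3f}")
```

In this setup the squared bias shrinks and the variance grows as the number of bins increases, so the expected test loss is lowest at an intermediate capacity rather than at the largest one.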