# Model assessment

## Theory

Model assessment aims to quantify a model's generalization performance by estimating the test error. While there are analytical methods, such as the Bayesian information criterion (BIC) and the Akaike information criterion (AIC), numerical methods are widely used.

When enough data are available, numerical methods divide the data into three parts: a training set, a validation set and a test set. The training set is used to fit the model, the validation set is used to estimate the prediction error for model selection and parameter tuning, and the test set is used for a final estimate of the generalization error. A typical division is 50 % for training and 25 % each for validation and test. However, for extremely large data sets, the training set may take 90 % or more, leaving a few thousand data points for validation and test.
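A minimal sketch of such a split, using only the Python standard library. The function name `train_val_test_split` and its parameters are hypothetical, chosen for illustration; the default fractions follow the 50/25/25 division mentioned above.

```python
import random

def train_val_test_split(n, frac_train=0.5, frac_val=0.25, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # random assignment of points to parts
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]          # whatever remains goes to the test set
    return train, val, test

train, val, test = train_val_test_split(1000)
print(len(train), len(val), len(test))    # 500 250 250
```

The three index lists are disjoint, so the test set is never seen during fitting or tuning.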

When data are scarce, resampling methods are used instead. Cross-validation (CV) directly estimates the prediction error, and the bootstrap estimates statistical accuracy.

### Generalization error

Let $X$ be a vector of inputs, $Y$ the target variable, $f(X)$ the prediction model and $L(Y,f(X))$ the loss function. The *test error*, also named *generalization error*, is the expected loss on new data
$$
Err=E[L(Y,f(X))],
$$
where $X$ and $Y$ are randomly sampled from their joint distribution. 
The training error is the average loss over the training set
$$
Err_{train}=\frac{1}{N}\sum_{i=1}^N L(y_i,f(x_i)).
$$

For regression problems, variables are quantitative and typical losses are the squared error,
$$
L(Y,f(X))=(Y-f(X))^2,
$$
or the absolute error, $L(Y,f(X))=|Y-f(X)|$.

For classification problems the response is qualitative: $G$ takes values $1, 2, \ldots, K$. Typical losses are the $0$-$1$ loss, $L(G,\hat{G}(X))=I(G\neq\hat{G}(X))$, and $-2$ times the log-likelihood, also named *cross-entropy loss* or *deviance*, where $\hat{p}_k(X)$ is the predicted probability of class $k$:
$$
L(G,\hat{p}(X))=-2\sum_{k=1}^KI(G=k)\log \hat{p}_k(X)=-2\log \hat{p}_G(X).
$$
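The losses above can be written as small Python functions. This is an illustrative sketch: the function names and the toy numbers are assumptions, and `probs` is a hypothetical dictionary mapping each class label to its predicted probability.

```python
import math

def squared_error(y, f):
    return (y - f) ** 2

def absolute_error(y, f):
    return abs(y - f)

def zero_one_loss(g, g_hat):
    # 1 if the predicted class differs from the true class, 0 otherwise
    return int(g != g_hat)

def deviance(g, probs):
    # -2 * log of the probability assigned to the true class g
    return -2.0 * math.log(probs[g])

def average_loss(loss, targets, predictions):
    """Average loss over a data set (the training error if it is the training set)."""
    return sum(loss(y, f) for y, f in zip(targets, predictions)) / len(targets)

mse = average_loss(squared_error, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
error_rate = average_loss(zero_one_loss, [1, 2, 3], [1, 3, 3])
dev = deviance(2, {1: 0.2, 2: 0.5, 3: 0.3})
print(mse, error_rate, dev)
```

Here `mse` is $(0+0.25+1)/3$, `error_rate` is $1/3$ and `dev` is $-2\log 0.5$.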

For squared-error loss, the expected prediction error (EPE) at a given point $x_0$ can then be decomposed into an irreducible error, a variance and a squared bias, which leads to the [bias-variance trade-off](https://colab.research.google.com/drive/1R5NFTzQqUwz01_1kFdB713auQU-fOgLR#scrollTo=i8k39JmkDwVc) described previously:

$$
EPE(x_0)=\sigma^2+MSE(\hat{f}(x_0))=\sigma^2+\textrm{var}(\hat{f}(x_0))+\textrm{bias}(\hat{f}(x_0))^2.
$$
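The decomposition can be checked numerically with a small simulation. The sketch below is an illustration under assumed settings, not a general recipe: the true function $f(x)=2x$, the noise level, and the deliberately crude model (which predicts the mean of the training targets and ignores $x$) are all hypothetical choices.

```python
import random

def simulate_decomposition(x0=0.9, sigma=0.5, n_train=20, n_rep=5000, seed=0):
    rng = random.Random(seed)
    f = lambda x: 2.0 * x                      # hypothetical true regression function
    preds, sq_errors = [], []
    for _ in range(n_rep):
        # Draw a fresh training set; the fitted "model" is the mean of its targets.
        ys = [f(rng.random()) + rng.gauss(0.0, sigma) for _ in range(n_train)]
        f_hat = sum(ys) / n_train
        # Draw a fresh response at x0 and record the squared prediction error.
        y0 = f(x0) + rng.gauss(0.0, sigma)
        preds.append(f_hat)
        sq_errors.append((y0 - f_hat) ** 2)
    mean_pred = sum(preds) / n_rep
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_rep
    bias_sq = (mean_pred - f(x0)) ** 2
    epe_direct = sum(sq_errors) / n_rep        # Monte Carlo estimate of the EPE at x0
    epe_decomposed = sigma ** 2 + variance + bias_sq
    return epe_direct, epe_decomposed, variance, bias_sq

epe_direct, epe_decomposed, variance, bias_sq = simulate_decomposition()
print(epe_direct, epe_decomposed, variance, bias_sq)
```

Up to Monte Carlo noise, the directly estimated EPE and the sum of the three terms should agree; with this crude model the squared bias dominates.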

One way to estimate the EPE is to estimate the optimism (the expected difference between this error and the training error) and add it to the training error, as the AIC and BIC methods do. Cross-validation and bootstrap methods instead estimate the error by resampling.
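As a worked illustration of the optimism approach, consider a linear model with $d$ parameters fitted by least squares under squared-error loss. The symbols $\omega$, $d$ and $\sigma_\varepsilon^2$ are not used elsewhere in this section and are introduced here as assumptions. In that setting the average optimism $\omega$ is a standard result, and adding it to the training error gives an estimate of the in-sample prediction error:

$$
\omega=\frac{2d}{N}\sigma_\varepsilon^2,
\qquad
\widehat{Err}=Err_{train}+\frac{2d}{N}\hat{\sigma}_\varepsilon^2,
$$

where $\hat{\sigma}_\varepsilon^2$ is an estimate of the noise variance. For instance, with $N=100$, $d=5$ and $\hat{\sigma}_\varepsilon^2=1$, the correction added to the training error is $2\cdot 5/100=0.1$.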

### Cross-validation

When the amount of data is not large, instead of splitting the data into three parts, resampling methods reuse the available data to estimate the EPE, $EPE=E[L(Y,\hat{f}(X))]$, by averaging the loss over resampled splits.

In $K$-fold cross-validation (CV), the data are split into $K$ roughly equal-sized parts, $C_1, C_2, \ldots, C_K$, where $C_k$ holds the indices of the $k$-th fold, and the EPE is estimated as follows:

$$
CV=\frac{1}{N}\sum_{i=1}^NL(y_i,\hat{y}_{-k(i)}), 
$$

where $k(i)$ is the fold that contains observation $i$ and $\hat{y}_{-k(i)}$ is the prediction for $x_i$ from the model fitted with that fold removed. An advantage of CV is that it uses the entire data set to estimate the EPE.

The lower the number of folds $K$, the higher the bias and the lower the variance of the error estimate. Typical choices for $K$ are $5$ or $10$.

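For a regression problem with squared-error loss, CV is the weighted average of the per-fold MSEs, where $n_k$ is the number of observations in fold $k$: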

$$
CV=\frac{1}{N}\sum_{i=1}^NL(y_i,\hat{y}_{-k(i)})=\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_{-k(i)})^2= \frac{1}{N}\sum_{k=1}^Kn_k \sum_{i\in C_k}\frac{1}{n_k}(y_i-\hat{y}_{-k(i)})^2=\frac{1}{N}\sum_{k=1}^Kn_k MSE_k.
$$
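The regression formula above can be sketched in plain Python. This is a minimal illustration under assumptions: the model is a one-dimensional least-squares line, the data are synthetic, and the helper names `fit_line`, `k_fold_indices` and `cv_squared_error` are hypothetical. The function returns both the pooled form and the fold-weighted form of the CV estimate, which should coincide.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def k_fold_indices(n, K, seed=0):
    """Shuffle 0..n-1 and deal the indices into K folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]

def cv_squared_error(xs, ys, K=5, seed=0):
    N = len(xs)
    pooled, weighted = 0.0, 0.0
    for fold in k_fold_indices(N, K, seed):
        held_out = set(fold)
        train = [i for i in range(N) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Losses on the held-out fold, predicted by the model fitted without it.
        losses = [(ys[i] - (a + b * xs[i])) ** 2 for i in fold]
        pooled += sum(losses)
        mse_k = sum(losses) / len(fold)
        weighted += len(fold) * mse_k          # n_k * MSE_k
    return pooled / N, weighted / N

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(53)]   # 53 points, so folds are unequal
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
cv_pooled, cv_weighted = cv_squared_error(xs, ys, K=5)
print(cv_pooled, cv_weighted)
```

Because each fold's MSE is weighted by its size $n_k$, the result does not depend on the folds having exactly equal sizes.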

For a classification problem with $0$-$1$ loss, CV is the weighted average of the per-fold misclassification rates:

$$
CV=\frac{1}{N}\sum_{i=1}^NL(y_i,\hat{y}_{-k(i)})=\frac{1}{N}\sum_{k=1}^Kn_k\sum_{i\in C_k}\frac{1}{n_k}I(y_i\neq \hat{y}_{-k(i)})=\frac{1}{N}\sum_{k=1}^Kn_k Err_k.
$$



The validation approach splits the data into a training and a validation set, where the validation set is used for model selection, for instance, to choose a parameter $\alpha$ by computing $MSE(\alpha)$. This works well as long as the location of the minimum of the MSE is well estimated. However, the MSE itself can be overestimated, as methods perform worse when trained on fewer observations. CV reduces this problem, although it may still misestimate the test MSE; $K$-fold CV often provides better estimates than leave-one-out CV (LOOCV, the case $K=N$), whose estimates have higher variance.

The training error has a downward bias. In contrast, $K$-fold cross-validation has an upward bias, which may be negligible in LOOCV but sometimes cannot be neglected in $5$-fold or $10$-fold CV.

### Bootstrap methods

*Bootstrap* methods can be used to estimate the statistical accuracy of a parameter estimate, such as the standard error or confidence interval of a coefficient. The term bootstrap comes from the phrase "to pull oneself up by one's bootstraps", from the adventures of Baron Munchausen [Hastie et al]: after falling into the river, he gets himself out by pulling on his own bootstraps, the idea being to make do with what one has. Bootstrap methods work by randomly drawing data with replacement from the training data set (within each draw, a point can be sampled more than once or not at all), which yields bootstrap data sets, and then estimating the parameter of interest on each of them.



As an example, let $\alpha$ be a parameter whose value and standard error (SE) we want to estimate. From the training data we can compute a single estimate of $\alpha$. If we had access to the underlying distribution, we could sample from it many times, re-estimate $\alpha$ on each sample, and compute the SE from the spread of those estimates. The bootstrap, instead of drawing new independent samples, resamples the training data with replacement and estimates $\alpha$ on each bootstrap data set, $\hat{\alpha}_r$ for $r=1,\ldots,B$, so the mean can be computed as

$$
\bar{\alpha}=\frac{1}{B}\sum_{r=1}^B\hat{\alpha}_r,
$$

and its SE

$$
SE=\sqrt{\frac{1}{B-1}\sum_{r=1}^B(\hat{\alpha}_r-\bar{\alpha})^2}.
$$
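A minimal sketch of the bootstrap standard error, using only the standard library. The function name `bootstrap_se`, the synthetic Gaussian data and the choice of the sample mean as the estimator are assumptions made for illustration. The SE is computed as the standard deviation of the $B$ bootstrap estimates, that is, from their squared deviations around the bootstrap mean.

```python
import math
import random

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Return the bootstrap mean and standard error of `estimator` on `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(B):
        sample = rng.choices(data, k=n)        # draw n points with replacement
        estimates.append(estimator(sample))
    mean = sum(estimates) / B
    se = math.sqrt(sum((a - mean) ** 2 for a in estimates) / (B - 1))
    return mean, se

def sample_mean(values):
    return sum(values) / len(values)

rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(100)]
boot_mean, boot_se = bootstrap_se(data, sample_mean)

# For the sample mean, the SE can also be computed analytically as s / sqrt(n),
# which gives a reference value for the bootstrap result.
m = sample_mean(data)
s = math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))
print(boot_mean, boot_se, s / math.sqrt(len(data)))
```

The same function works for estimators without a closed-form SE (for example a median), which is where the bootstrap is most useful.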

For the case of estimating EPE we get the following: 

$$
Err_{\textrm{boot}}=\frac{1}{BN}\sum_b\sum_i L(y_i,\hat{f}^b(x_i)),
$$

where $\hat{f}^b(x_i)$ is the prediction at point $x_i$ from the model fitted to the $b$-th bootstrap data set, and $B$ is the number of bootstrap sets. However, the same data are used for training and testing, whereas cross-validation uses non-overlapping data. A better choice is the *leave-one-out bootstrap* (LOOBS) estimate, which predicts each sample only from models fitted to bootstrap sets that do not contain it:

$$
Err_{\textrm{LOOBS}}=\frac{1}{N}\sum_i\frac{1}{|C^{-i}|}\sum_{b\in C^{-i}}L(y_i,\hat{f}^b(x_i))
$$

where $C^{-i}$ is the set of indices of the bootstrap sets that do not contain observation $i$. The LOOBS estimate is biased, though, due to the similarity between the bootstrap sets. Thus, for estimating the EPE, CV is better than the bootstrap.
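The two bootstrap error estimates above can be sketched as follows. This is an illustration under assumptions: squared-error loss, a one-dimensional least-squares line as the model, synthetic data, and hypothetical helper names (`fit_line`, `bootstrap_errors`). Points that happen to appear in every bootstrap set are skipped in the leave-one-out average, which is a design choice of this sketch.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def bootstrap_errors(xs, ys, B=200, seed=0):
    rng = random.Random(seed)
    N = len(xs)
    boot_total = 0.0
    loo_sums = [0.0] * N      # summed losses over bootstrap sets that exclude i
    loo_counts = [0] * N      # |C^{-i}|: number of bootstrap sets that exclude i
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(N)]   # sample with replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        in_sample = set(idx)
        for i in range(N):
            loss = (ys[i] - (a + b * xs[i])) ** 2
            boot_total += loss
            if i not in in_sample:
                loo_sums[i] += loss
                loo_counts[i] += 1
    err_boot = boot_total / (B * N)
    per_point = [s / c for s, c in zip(loo_sums, loo_counts) if c > 0]
    err_loobs = sum(per_point) / len(per_point)
    return err_boot, err_loobs

rng = random.Random(3)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
err_boot, err_loobs = bootstrap_errors(xs, ys)
print(err_boot, err_loobs)
```

`err_boot` evaluates each model partly on points it was trained on, while `err_loobs` only uses points left out of the corresponding bootstrap set.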