## What is cross-validation (CV) used for?

First and foremost, it is used to **estimate the prediction error (and its statistics)** that the model will incur once it is released into the wild.

An implication is that CV can be used to **choose hyperparameters** (e.g. those that determine the degree of regularization) or even to **perform feature selection**.
- For example, sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature. After choosing a first feature, SFS tests all models that add a second feature to this first chosen feature. The best pair is then selected. Next the same procedure is done for three, then four, and so on. When adding a feature does not improve classification accuracy on the validation data, the SFS process stops. 
- There is a similar procedure called sequential backward elimination of features. As you might guess, it works by starting with all features and discarding features one at a time. It continues to discard features as long as there is no performance loss.
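
The forward procedure above can be sketched with scikit-learn's `SequentialFeatureSelector`. This is only an illustration: the iris data set, the k-nearest-neighbours classifier, and the target of two features are assumptions made for the example. Note also that, as configured here, the selector stops at a fixed number of features rather than when validation accuracy stops improving.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data set and classifier (assumptions, not from the text above)
X, y = load_iris(return_X_y=True)

# Forward selection: start from no features and greedily add the one that
# gives the best cross-validated score, until two features are selected.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",  # "backward" gives sequential backward elimination
    cv=5,
)
sfs.fit(X, y)
selected_mask = sfs.get_support()  # boolean mask over the original columns
print(selected_mask)
```

Switching `direction` to `"backward"` starts from all features and discards them one at a time, as in the second bullet.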

Note that using CV to pick hyperparameters suffers from the so-called 'multiple comparisons' problem (e.g., choosing the best complexity for a model by comparing many complexities), i.e. many statistical tests are run and then only the results that look good are picked. The winning score is therefore optimistically biased, so CV should be seen as a safeguard against overfitting rather than a guarantee of generalization.

Often a 'one-standard-error' rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model (though what counts as 'parsimonious' may sometimes be ambiguous). The standard error can be estimated from the spread of the per-fold CV errors.
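
A minimal sketch of the one-standard-error rule follows. It assumes ridge regression on scikit-learn's diabetes data set, and it takes 'most parsimonious' to mean 'largest regularization strength `alpha`'; both choices are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
# Assumption: a larger alpha means stronger regularization, i.e. a simpler model
alphas = np.logspace(-3, 2, 20)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mean_err, se_err = [], []
for alpha in alphas:
    # cross_val_score returns negated MSE, so flip the sign to get an error
    fold_mse = -cross_val_score(Ridge(alpha=alpha), X, y, cv=cv,
                                scoring="neg_mean_squared_error")
    mean_err.append(fold_mse.mean())
    # standard error of the mean, estimated from the spread across folds
    se_err.append(fold_mse.std(ddof=1) / np.sqrt(len(fold_mse)))
mean_err, se_err = np.array(mean_err), np.array(se_err)

best = mean_err.argmin()
threshold = mean_err[best] + se_err[best]
# Among all candidates within one SE of the best, take the largest alpha
one_se = np.flatnonzero(mean_err <= threshold).max()
print("best alpha:", alphas[best], "one-SE alpha:", alphas[one_se])
```

Because `alphas` is sorted in increasing order, the one-SE choice is never less regularized than the error-minimizing one.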

## What are the different types of cross validations?

### K-Fold Cross-Validation

Let $\kappa : \{1, \ldots, N\} \rightarrow \{1, \ldots, K\}$ be an indexing function that indicates the partition to which observation $n$ is allocated by the randomization. Denote by $\hat{f}^{-k}(x)$ the fitted function, computed with the $k$-th part of the data removed. Then the cross-validation estimate of prediction error is
\begin{align}
CV(\hat{f}) = \frac{1}{N}\sum_{n=1}^NL(y_n, \hat{f}^{-\kappa(n)}(x_n)),
\end{align}
where $L$ denotes the loss function. Note that each observation is left out only when fitting the model that is used to predict it: one needs to train $K$ models, one per fold, and observation $n$ is scored by the model $\hat{f}^{-\kappa(n)}$ that did not see it. The sum therefore runs over all $N$ samples, but every term is an out-of-sample error.
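
The estimator $CV(\hat{f})$ can be computed by hand as in the sketch below. It assumes squared-error loss, ordinary linear regression, and scikit-learn's diabetes data set, none of which are prescribed by the formula.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)
N, K = len(y), 5

# kappa[n] is the fold to which observation n is assigned by the randomization
kappa = np.empty(N, dtype=int)
splitter = KFold(n_splits=K, shuffle=True, random_state=0)
for k, (_, held_out_idx) in enumerate(splitter.split(X)):
    kappa[held_out_idx] = k

losses = np.empty(N)
for k in range(K):
    held_out = kappa == k
    # f^{-k}: fitted with the k-th part of the data removed
    model = LinearRegression().fit(X[~held_out], y[~held_out])
    # squared-error loss, evaluated only on observations the model has not seen
    losses[held_out] = (y[held_out] - model.predict(X[held_out])) ** 2

cv_estimate = losses.mean()  # (1/N) * sum_n L(y_n, f^{-kappa(n)}(x_n))
print(cv_estimate)
```

Each of the $N$ entries of `losses` is filled exactly once, by the one model whose training set excluded that observation.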

### Leave-One-Out

The case where $K=N$ is known as **leave-one-out cross validation**. 
- With $K=N$, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the $N$ 'training sets' obtained by leaving one observation out are so similar to one another. The computational burden is also considerable, since $N$ models must be trained.
- On the other hand, with $K=5$, say, cross-validation has lower variance. But bias could be a problem, depending on how the performance of the learning method varies with the size of the training set.

Overall, five- or tenfold cross-validation is recommended as a good compromise.

### Approximations: Generalized Cross-validation

Generalized cross-validation provides a convenient approximation to leave-one-out cross-validation for linear fitting under squared-error loss.

A linear fitting method is one for which we can write
\begin{align}
\hat{y} = Sy.
\end{align}
It can be shown that for many linear fitting methods,
\begin{align}
\frac{1}{N}\sum_{n=1}^N[y_n-\hat{f}^{-n}(x_n)]^2 = \frac{1}{N}\sum_{n=1}^N\left[\frac{y_n-\hat{f}(x_n)}{1-S_{nn}}\right]^2,
\end{align}
where $S_{nn}$ is the $n$-th diagonal element of $S$. The GCV approximation is then 
\begin{align}
GCV(\hat{f}) = \frac{1}{N}\sum_{n=1}^N\left[\frac{y_n-\hat{f}(x_n)}{1-\mathrm{trace}(S)/N}\right]^2,
\end{align}
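which replaces each $S_{nn}$ by the average of the diagonal, $\mathrm{trace}(S)/N$. The approximation $1/(1-x)^2\approx 1+2x$ then shows the similarity between GCV and AIC.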
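
The leave-one-out identity and the GCV approximation can be checked numerically. The sketch below assumes ridge regression without an intercept on synthetic data, for which $S = X(X^TX + \lambda I)^{-1}X^T$; the sizes, the seed, and the value of $\lambda$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 40, 5, 1.0  # arbitrary illustrative sizes and penalty
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

def ridge_smoother(X, lam):
    """Return S such that y_hat = S @ y for ridge regression without intercept."""
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)

S = ridge_smoother(X, lam)
resid = y - S @ y

# Shortcut: leave-one-out error from a single fit
loo_shortcut = np.mean((resid / (1 - np.diag(S))) ** 2)

# Brute force: refit N times, each time leaving one observation out
loo_errors = []
for n in range(N):
    keep = np.arange(N) != n
    beta = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p),
                           X[keep].T @ y[keep])
    loo_errors.append((y[n] - X[n] @ beta) ** 2)
loo_brute = np.mean(loo_errors)

# GCV: replace each S_nn by the average trace(S) / N
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)
print(loo_brute, loo_shortcut, gcv)
```

The first two numbers should agree to floating-point precision, while GCV is only an approximation to them and needs nothing beyond the trace of $S$.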

## What is the usual blunder (the wrong way) in doing cross-validation?

It is called **data leakage**, whereby information is inadvertently shared between the training and test sets, or between the training and validation sets within a fold. The test or validation set then stops being a good proxy for 'data that we will see if the model is released into the wild'.

Though this concept seems straightforward, data leakage can be sneaky. The following is by no means an exhaustive list of the scenarios that trip people up.

#### Improper pre-processing

- An example is given in Section 7.10.2 of ESL, where features are selected using the *whole training set* before CV is carried out. Since the choice of features then depends on every observation, including those later held out, the held-out folds are no longer independent of the model, which sabotages the effectiveness of CV.

- Another example is when one looks at why the model performs badly on the test data and improves it accordingly (see below). In this sense, even optimizing hyperparameters using CV is suspect; see the comment about 'multiple comparisons' above.

- In CV, the model (including any pre-processing) should be completely re-trained using just the training data in each fold and evaluated only on that fold's held-out data; aggregated over the folds, every observation in the data set is scored exactly once.
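
The feature-selection pitfall in the first bullet can be reproduced on pure noise, in the spirit of the ESL example. The sketch below is an illustration under assumed settings (50 samples, 2000 random features, 20 selected features, logistic regression); these are not the settings used in ESL.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))  # pure noise: no feature carries any signal
y = np.repeat([0, 1], 25)        # labels are independent of X

# Wrong: pick features using all the labels, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Right: selection is a pipeline step, so it is re-fitted on each training fold only
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky: {leaky:.2f}, honest: {honest:.2f}")
```

Since the labels carry no information, the true accuracy of any classifier here is 50%; the leaky estimate should come out well above that, while the pipeline estimate should stay near chance.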

#### Duplicates

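Leakage occurs when the train and test sets contain identical data points (duplicates), so that the model is tested on examples it has effectively already seen.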

#### Temporal applications

Leakage occurs when information from the future is used to train a model that is then tested on the past.
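
One way to avoid this kind of leakage is to split so that training indices always precede validation indices. The sketch below uses scikit-learn's `TimeSeriesSplit` on a toy array of 12 observations assumed to be in chronological order; it is one possible strategy, not necessarily the one intended elsewhere in these notes.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 observations, assumed to be sorted in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # every training index precedes every validation index
    print("train:", train_idx, "validate:", val_idx)
```

The training window grows from split to split, and no validation observation ever precedes the data used to fit the model.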

## If a model performs badly on the test data after cross-validation, can we go back and tune the model parameters/hyperparameters further so that it performs better on the test data?

If one looks at how the model performs badly on some instances of the *training* data, seeks to remedy those, and this also improves model performance on the test data, that is acceptable.

On the other hand, a big taboo is to directly look at instances in the test data to improve model performance.

## What is a good CV strategy for time-series?

(Look at MLEDU Slides 1, Page 41, among others)

## What is the difference between a learning curve and a validation curve?

A learning curve shows the generalization performance (the performance on testing data only), plotted against the amount of training data used. A validation curve shows the generalization performance as well as the performance on the training data, but plotted against some model parameter. Validation curves are generally shown for a fixed amount of training data.

The implementation of `learning_curve` is introduced in [debugging-ml-algorithms](debugging-ml-algorithms.ipynb). In what follows we discuss `validation_curve` in `sklearn`.
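
A minimal usage sketch of `validation_curve` follows. The digits data set, the `SVC` estimator, and the range of `gamma` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -1, 5)  # illustrative grid for gamma

train_scores, valid_scores = validation_curve(
    SVC(), X, y,
    param_name="gamma", param_range=param_range,
    cv=5, scoring="accuracy",
)

# Both arrays have one row per parameter value and one column per fold
for gamma, tr, va in zip(param_range,
                         train_scores.mean(axis=1),
                         valid_scores.mean(axis=1)):
    print(f"gamma={gamma:.0e}  train={tr:.3f}  validation={va:.3f}")
```

Plotting the two row-wise means against `param_range` gives the validation curve: a widening gap between training and validation scores suggests overfitting, while two low scores suggest underfitting.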

## References

- *Data Science for Business*, Chapter 5.
- ESL (*The Elements of Statistical Learning*), Section 7.10.