In [1]:
import sys
sys.path.append('..')

import matplotlib.pyplot as plt
import math
import numpy as np

import metrics
import utils

# Bias, Variance, and Model Complexity

Consider the case of a quantitative response. We have a target variable $Y$, a vector of inputs $X$, and a prediction model $\hat{f}(X)$ estimated from a training set $\mathcal{T}$.  
The loss function measures the error between $Y$ and $\hat{f}(X)$:
- squared error: $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$
- absolute error: $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$

The test error, also called generalization error, is the prediction error over an independent test sample:
$$\text{Err}_\mathcal{T} = E[L(Y,\hat{f}(X))|\mathcal{T}]$$

The training set $\mathcal{T}$ is fixed, so this is the error for that specific training set.  

The expected prediction error (or expected test error) is:
$$\text{Err} = E[L(Y, \hat{f}(X))] = E[\text{Err}_\mathcal{T}]$$

This expectation averages over all sources of randomness, including the randomness in the training set $\mathcal{T}$ that produced $\hat{f}$.  

The training error is the average loss over the training sample:
$$\bar{\text{err}} = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{f}(x_i))$$

As the model complexity increases, the bias decreases but the variance increases. There is an intermediate model complexity that gives minimum expected test error.  

The training error is not a good estimate of the test error: it decreases toward 0 as we increase the model complexity, giving us a model that overfits the training data and generalizes poorly.
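
The contrast between training and test error can be seen on a toy problem. The following is a minimal sketch, not part of the original analysis: the sine target, the noise level, the sample sizes, and the polynomial model family are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical regression function used only for this illustration
    return np.sin(2 * np.pi * x)

# Small training set and a large independent test set
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(0, 1, 2000)
y_test = true_f(x_test) + rng.normal(0, 0.3, 2000)

# Model complexity = polynomial degree
degrees = range(0, 9)
train_err, test_err = [], []
for deg in degrees:
    coefs = np.polyfit(x_train, y_train, deg)
    train_err.append(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))
    test_err.append(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))

for deg, tr, te in zip(degrees, train_err, test_err):
    print('degree {}: train = {:.3f}, test = {:.3f}'.format(deg, tr, te))
```

Because the polynomial models are nested, the training error can only go down as the degree grows, while the test error is expected to reach a minimum at an intermediate degree.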

The story is similar for a qualitative response $G$ taking one of $K$ values. We usually model the probabilities $p_k(X) = P(G=k|X)$, or some monotone transformation $f_k(X)$ of them.  
The prediction is $\hat{G}(X) = \arg \max_k \hat{p}_k(X)$.  
The typical loss functions are:

- 0-1 loss: $L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))$
- negative log-likelihood (scaled by 2, also called the deviance): $L(G, \hat{p}(X)) = -2\log \hat{p}_G(X)$  

We will use $Y$ and $f(X)$ to represent all situations; the translation to the other setting (qualitative response) is straightforward.  
We describe methods to estimate the expected test error of a model.  
Our model will typically have a tuning parameter $\alpha$ that varies the model complexity, and we look for the $\alpha$ that minimizes the test error.

There are 2 separate goals:
- Model selection: estimate the performance of different models in order to choose the best one.
- Model assessment: estimate the prediction error of the chosen model.  

If we have a lot of data, the best approach is to divide the dataset into 3 parts:
- The training set, used to fit the models.
- The validation set, used to estimate prediction error for model selection.
- The test set, used to assess the prediction error of the chosen model.  

The test set shouldn't be used for model selection; otherwise the test set error of the chosen model will underestimate the true test error.  
A typical split is 50\% training, 25\% validation, 25\% test.  
The methods in this chapter are for situations where there is insufficient data to perform this 3-part split. They approximate the validation step either analytically or by efficient sample re-use.  
Besides model selection, they can also give us a reliable estimate of the test error of the chosen model.

# The Bias-Variance Decomposition

Let $Y = f(X) + \epsilon$, with $E(\epsilon)=0$ and $\text{Var}(\epsilon)=\sigma_\epsilon^2$.  
The expected prediction error of a regression fit $\hat{f}(X)$ at an input point $x_0$ is:

$$
\begin{split}
\text{Err}(x_0) & = E[(Y-\hat{f}(x_0))^2|X=x_0]\\
& = \sigma_\epsilon^2 + [E \hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E \hat{f}(x_0)]^2
\end{split}
$$

This expression splits into 3 terms:
- the irreducible error, which is the variance of the target around its true mean and cannot be avoided: $\sigma_\epsilon^2$
- the squared bias: $[E \hat{f}(x_0) - f(x_0)]^2$
- the variance of $\hat{f}(x_0)$: $E[\hat{f}(x_0) - E \hat{f}(x_0)]^2$
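
The decomposition can be checked numerically by refitting a model on many simulated training sets. The sketch below is only an illustration under assumed settings: a fixed design on $[0, 1]$, a sine for $f$, Gaussian noise, and a deliberately underfitting straight-line model.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                      # noise standard deviation (assumed)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
x0, degree, N, n_rep = 0.3, 1, 40, 5000
x = np.linspace(0, 1, N)         # fixed design, only y is random

preds = np.empty(n_rep)          # f_hat(x0) for each simulated training set
sq_err = np.empty(n_rep)         # squared error on a fresh Y drawn at x0
for r in range(n_rep):
    y = f(x) + rng.normal(0, sigma, N)
    preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    y0 = f(x0) + rng.normal(0, sigma)
    sq_err[r] = (y0 - preds[r]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
err_direct = sq_err.mean()
err_decomposed = sigma ** 2 + bias2 + variance

print('direct estimate of Err(x0):', err_direct)
print('sigma^2 + bias^2 + variance:', err_decomposed)
```

Up to Monte Carlo error, the two printed values should agree. With a straight line fitted to a sine, the squared bias dominates the variance.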

## k-nearest neighbors

$$\text{Err}(x_0) = \sigma^2_\epsilon + \left(f(x_0) - \frac{1}{k} \sum_{l=1}^k f(x_l)\right)^2 + \frac{\sigma^2_\epsilon}{k}$$

As the number of neighbors $k$ increases, the model complexity decreases.
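
As a rough check of the k-nearest-neighbors expression, the closed-form terms can be compared with a simulation. This is a sketch under assumed settings (fixed one-dimensional inputs, a quadratic $f$, Gaussian noise); none of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, k, N, n_rep = 1.0, 5, 50, 20000
f = lambda x: x ** 2             # hypothetical true function
x = np.linspace(-1, 1, N)        # fixed training inputs
x0 = 0.1
nn = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest neighbors of x0

# Closed-form terms of the decomposition
bias2 = (f(x0) - f(x[nn]).mean()) ** 2
variance = sigma ** 2 / k
err_formula = sigma ** 2 + bias2 + variance

# Monte Carlo: one row per simulated training set
y = f(x)[None, :] + rng.normal(0, sigma, (n_rep, N))
fhat = y[:, nn].mean(axis=1)     # k-NN prediction at x0
y0 = f(x0) + rng.normal(0, sigma, n_rep)
err_mc = np.mean((y0 - fhat) ** 2)

print('formula:', err_formula, 'simulation:', err_mc)
```

Increasing `k` shrinks the variance term $\sigma^2_\epsilon / k$ but averages over points farther from $x_0$, which usually increases the bias term.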

## linear regression

$$\text{Err}(x_0) = \sigma_\epsilon^2 + [f(x_0) - E \hat{f}_p(x_0)]^2 + ||h(x_0)||^2\sigma_\epsilon^2$$

$$\text{with } h(x_0) = X(X^TX)^{-1}x_0$$

$$\frac{1}{N} \sum_{i=1}^N \text{Err}(x_i) = \sigma_\epsilon^2 + \frac{1}{N} \sum_{i=1}^N [f(x_i) - E \hat{f}_p(x_i)]^2 + \frac{p}{N} \sigma_\epsilon^2$$

As the number of parameters $p$ increases, the model complexity increases.

## ridge regression

Identical to the expression for linear regression, except for $h(x_0)$:
$$h(x_0) = X(X^TX + \alpha I)^{-1}x_0$$

Ordinary least squares has no estimation bias, whereas ridge regression has a positive one. In exchange, ridge has lower variance than least squares.  
It's worthwhile when the decrease in variance exceeds the increase in squared bias.
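
The variance terms of least squares and ridge can be compared directly from $h(x_i)$. The snippet below is a small sketch on random inputs (an assumption for illustration); `avg_h_norm2` is a hypothetical helper name, and `alpha = 0` recovers ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, alpha = 200, 6, 10.0
X = rng.normal(size=(N, p))

def avg_h_norm2(X, alpha):
    # Column i of S is h(x_i) = X (X^T X + alpha I)^{-1} x_i
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T)
    # Average of ||h(x_i)||^2 over the training points
    return np.mean(np.sum(S ** 2, axis=0))

ols_term = avg_h_norm2(X, 0.0)      # should equal p / N
ridge_term = avg_h_norm2(X, alpha)  # smaller than p / N

print('least squares:', ols_term, '(p/N = {})'.format(p / N))
print('ridge:', ridge_term)
```

Multiplying either value by $\sigma_\epsilon^2$ gives the average variance term of the in-sample error.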

# Optimism of the training error rate

Training set $\mathcal{T} = \{(x_1,y_1),(x_2,y_2),\text{...},(x_N,y_N)\}$.  
The generalization error of a model $\hat{f}$ is:
$$\text{Err}_\mathcal{T} = E_{X^0,Y^0}[L(Y^0,\hat{f}(X^0))|\mathcal{T}]$$

with $(X^0,Y^0)$ a new point drawn from the data distribution $F$.  

The expected error is:
$$\text{Err} = E_\mathcal{T}E_{X^0,Y^0}[L(Y^0,\hat{f}(X^0))|\mathcal{T}]$$

Most methods try to estimate $\text{Err}$.  

The training error is:
$$\bar{\text{err}} = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{f}(x_i))$$

The training error will be an overly optimistic estimate of $\text{Err}_\mathcal{T}$.

We define the in-sample error:

$$\text{Err}_\text{in} = \frac{1}{N} \sum_{i=1}^N E_{Y^0}[L(Y_i^0, \hat{f}(x_i))|\mathcal{T}]$$

$Y^0$ indicates that we observe $N$ new response values, one at each training point $x_i$.

We define the optimism as:
$$\text{op} \equiv \text{Err}_\text{in} - \bar{\text{err}}$$

The average optimism is its expectation over training sets, with the inputs $x_i$ held fixed and only the outcome values $y_i$ random:
$$\omega \equiv E_y[\text{op}]$$

For squared error, 0-1, and other loss functions, its expression is generally:
$$\omega = \frac{2}{N} \sum_{i=1}^N \text{Cov}(\hat{y_i}, y_i)$$

Thus the amount by which the training error underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater the optimism will be.  

We get the relation:

$$E_y[\text{Err}_\text{in}] = E_y[\bar{\text{err}}] + \frac{2}{N} \sum_{i=1}^N \text{Cov}(\hat{y_i}, y_i)$$

If the model is linear with $d$ parameters, the optimism expression simplifies:
$$\sum_{i=1}^N \text{Cov}(\hat{y_i}, y_i) = d\sigma_\epsilon^2$$

Thus we have:
$$E_y[\text{Err}_\text{in}] = E_y[\bar{\text{err}}] + 2 \frac{d}{N} \sigma^2_\epsilon$$

The optimism increases with the number $d$ of parameters, and decreases as the training size $N$ increases.  

We can estimate the prediction error by estimating the optimism and adding it to the training error.  
In-sample error is not usually of direct interest because future values of $X$ are unlikely to coincide with the training inputs, but it's convenient for model selection.
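
The identity $\sum_i \text{Cov}(\hat{y}_i, y_i) = d\sigma_\epsilon^2$ and the resulting optimism $2\frac{d}{N}\sigma_\epsilon^2$ can be verified by simulation for least squares. This is a sketch assuming a fixed random design, a linear true function, and Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, n_rep = 50, 5, 1.0, 20000
X = rng.normal(size=(N, d))             # fixed inputs
f_true = X @ rng.normal(size=d)         # hypothetical true mean response
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix, y_hat = H y

# One row per simulated training response vector
Y = f_true + rng.normal(0, sigma, (n_rep, N))
Y_hat = Y @ H                           # H is symmetric, so this is (H y)^T row by row

# Sum over i of the empirical Cov(y_hat_i, y_i)
cov_sum = np.sum(np.mean((Y - Y.mean(0)) * (Y_hat - Y_hat.mean(0)), axis=0))

# Optimism: in-sample error (new responses at the same x_i) minus training error
Y_new = f_true + rng.normal(0, sigma, (n_rep, N))
train_err = np.mean((Y - Y_hat) ** 2, axis=1)
in_err = np.mean((Y_new - Y_hat) ** 2, axis=1)
omega_mc = np.mean(in_err - train_err)
omega_theory = 2 * d / N * sigma ** 2

print('sum of covariances:', cov_sum, '(d * sigma^2 = {})'.format(d * sigma ** 2))
print('average optimism:', omega_mc, '(theory = {})'.format(omega_theory))
```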

# Estimates of In-Sample Prediction Error

The in-sample estimate has the form:

$$\widehat{\text{Err}}_\text{in} = \bar{\text{err}} + \hat{\omega}$$

For linear models, we get the $C_p$ statistic:
$$C_p = \bar{\text{err}} + 2 \frac{d}{N}\hat{\sigma}_\epsilon^2$$

$\hat{\sigma}_\epsilon^2$ is an estimate of the noise variance.  

The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of $\text{Err}_\text{in}$, used when the loss is a log-likelihood. It relies on a relationship that holds asymptotically as $N \to \infty$:

$$-2E[\log P_{\hat{\theta}}(Y)] \approx -\frac{2}{N} E[\text{loglik}] + 2 \frac{d}{N}$$

$$\text{with loglik } = \sum_{i=1}^N \log P_{\hat{\theta}}(y_i)$$

For model selection, given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, we choose the $\alpha$ that gives the minimum AIC:
$$\text{AIC}(\alpha) = \bar{\text{err}}(\alpha) + 2 \frac{d(\alpha)}{N} \hat{\sigma}_\epsilon^2$$

with $\bar{\text{err}}(\alpha)$ and $d(\alpha)$ respectively the training error and the number of parameters of model $f_\alpha(x)$.
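
As an illustration of this kind of selection, the sketch below computes the criterion for polynomial fits of increasing degree. The data, the polynomial family, and the choice to estimate the noise variance from the largest (low-bias) model are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 60, 0.3
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, N)   # hypothetical data

degrees = np.arange(0, 10)
train_err = np.empty(len(degrees))
for j, deg in enumerate(degrees):
    coefs = np.polyfit(x, y, deg)
    train_err[j] = np.mean((y - np.polyval(coefs, x)) ** 2)

# Noise variance estimated from the residuals of the most complex model
d = degrees + 1                       # number of parameters of each model
sigma2_hat = train_err[-1] * N / (N - d[-1])

# Training error plus estimated optimism
cp = train_err + 2 * d / N * sigma2_hat
best_degree = degrees[np.argmin(cp)]
print('selected degree:', best_degree)
```

The training error alone would always pick the largest degree; the penalty term is what lets a smaller model win.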

# The Effective Number of Parameters

In [2]:
X = np.random.randn(100, 4)
y = np.random.randn(100)

S = X @ np.linalg.inv(X.T @ X) @ X.T
preds = S @ y
print(np.trace(S))

3.9999999999999996


A linear fitting model is one for which we can write:
$$\hat{y} = Sy$$
with $S \in \mathbb{R}^{N \times N}$ depending on the $x_i$ but not on the $y_i$.

The effective number of parameters is defined as:
$$\text{df}(S) = \text{trace}(S)$$

If $y$ comes from an additive-error model $Y = f(X) + \epsilon$, it can be shown that:
$$\sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i) = \text{trace}(S)\sigma^2_\epsilon$$
This motivates the more general definition:
$$\text{df}(\hat{y}) = \frac{\sum_{i=1}^N \text{Cov}(\hat{y}_i, y_i)}{\sigma^2_\epsilon}$$  
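
For a regularized linear fit the effective number of parameters is no longer an integer. The following sketch computes $\text{trace}(S)$ for ridge regression on random inputs (an assumed example), and cross-checks it against the equivalent expression $\sum_j d_j^2 / (d_j^2 + \alpha)$ in terms of the singular values $d_j$ of $X$. The helper name `ridge_df` is hypothetical.

```python
import numpy as np

def ridge_df(X, alpha):
    # S = X (X^T X + alpha I)^{-1} X^T, effective df = trace(S)
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T)
    return np.trace(S)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
sv = np.linalg.svd(X, compute_uv=False)   # singular values of X

alphas = [0.0, 1.0, 10.0, 100.0]
dfs = [ridge_df(X, a) for a in alphas]
dfs_svd = [np.sum(sv ** 2 / (sv ** 2 + a)) for a in alphas]

for a, df in zip(alphas, dfs):
    print('alpha = {:6.1f}: df = {:.3f}'.format(a, df))
```

With `alpha = 0` the value is the number of columns of `X`, and it shrinks toward 0 as the penalty grows.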

For neural network models where we minimize an error function $R(w)$ with a weight-decay penalty $\alpha \sum_m w_m^2$, the effective number of parameters is:
$$\text{df}(\alpha) = \sum_{m=1}^M \frac{\theta_m}{\theta_m+\alpha}$$

with $\theta_m$ the eigenvalues of the Hessian matrix:
$$\frac{\partial^2 R(w)}{\partial w \partial w^T}$$

# The Bayesian Approach and BIC

The Bayesian Information Criterion (BIC) is:

$$\text{BIC} = -2 \cdot \text{loglik} + (\log N) \cdot d$$

Under a Gaussian model with known variance $\sigma_\epsilon^2$, we can write it (up to a constant) as:

$$\text{BIC} = \frac{N}{\sigma_\epsilon^2}[\bar{\text{err}} + \log N \frac{d}{N} \sigma_\epsilon^2]$$

For model selection, there is no clear choice between AIC and BIC.  
When $N \to \infty$, BIC tends to select the correct model, whereas AIC tends to select models that are too complex.  
For finite $N$, BIC often chooses models that are too simple, because of its heavy penalty on complexity.
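
The difference between the two penalties can be made concrete with a small comparison. The sketch below is an assumed example: polynomial fits to simulated data, a Gaussian log-likelihood with a plug-in variance estimate, and AIC written on the same scale as BIC (that is, multiplied by $N$). The function names are hypothetical.

```python
import numpy as np

def gaussian_loglik(y, y_hat, sigma2):
    # Log-likelihood of y under independent N(y_hat_i, sigma2) errors
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

def aic(loglik, d):
    return -2 * loglik + 2 * d

def bic(loglik, d, n):
    return -2 * loglik + np.log(n) * d

rng = np.random.default_rng(0)
N = 60
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)   # hypothetical data

degrees = np.arange(0, 10)
d = degrees + 1
fits = [np.polyval(np.polyfit(x, y, deg), x) for deg in degrees]
# Plug-in noise variance from the largest model
sigma2_hat = np.sum((y - fits[-1]) ** 2) / (N - d[-1])
logliks = np.array([gaussian_loglik(y, fit, sigma2_hat) for fit in fits])

aic_vals = aic(logliks, d)
bic_vals = bic(logliks, d, N)
aic_degree = degrees[np.argmin(aic_vals)]
bic_degree = degrees[np.argmin(bic_vals)]
print('AIC picks degree', aic_degree, ', BIC picks degree', bic_degree)
```

Since $\log N > 2$ as soon as $N > e^2 \approx 7.4$, BIC charges more per parameter than AIC, so on nested candidates like these it never selects a larger model than AIC does.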

# Minimum Description Length

We define the Shannon entropy of a random variable $Z$ as:
$$H(Z) = - \sum_{z_i} P(z_i) \log_2 P(z_i)$$

This value represents the minimum average message length to transmit messages from $Z$, in bits.  
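
A short numerical illustration of the entropy formula follows. The helper `shannon_entropy` is a hypothetical name, and the distributions are made-up examples.

```python
import numpy as np

def shannon_entropy(probs):
    # Entropy in bits of a discrete distribution given as a list of probabilities
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

h_coin = shannon_entropy([0.5, 0.5])                   # fair coin
h_uniform8 = shannon_entropy(np.full(8, 1 / 8))        # 8 equally likely messages
h_skewed = shannon_entropy([0.5, 0.25, 0.125, 0.125])  # unequal probabilities

print(h_coin, h_uniform8, h_skewed)
```

The skewed distribution has 4 possible messages but needs fewer than 2 bits on average, because frequent messages can be given shorter codes.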

We have a model $M$ with parameters $\theta$, and data $Z = (X, y)$. The conditional probability of the outputs under the model is $P(y|\theta,M,X)$.  
Assuming the receiver knows the inputs and we wish to transmit the outputs, the message length is:

$$\text{length } = - \log P(y|\theta,M,X) - \log P(\theta|M)$$

The MDL principle says we should choose the model that minimizes this length, i.e. that minimizes the negative log-posterior.  
Hence we choose the model that maximizes the posterior probability.

# Vapnik-Chervonenkis Dimension

# Cross-Validation

Cross-validation is a technique to estimate the expected prediction error.

## K-Fold Cross-Validation

We divide the data into $K$ parts of roughly equal size.  
For the $k$-th part, we fit a model on the other $K-1$ parts of the data, and calculate the prediction error of this model on the $k$-th part.  
We do this for $k=1,\text{...},K$ and combine the $K$ prediction error estimates.  

Let $\kappa: \{1,\text{...},N\} \to \{1,\text{...},K\}$ be the indexing function giving the part $\kappa(i)$ to which observation $i$ is allocated. This allocation should be random.  
Let $\hat{f}^{-k}(x)$ be the model fitted without the $k$-th part of the data. The cross-validation estimate of prediction error is:
$$\text{CV}(\hat{f}) = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{f}^{-\kappa(i)}(x_i))$$

Typical choices of $K$ are $5$ or $10$.

In [3]:
from sklearn.datasets import load_digits

data = load_digits()
X, y = data.data, data.target

print(X.shape)
print(y.shape)
print(X[0,:15])
print(y[:15])

(1797, 64)
(1797,)
[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.]
[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4]


In [4]:
# y: 1D array of integers, the correct class of each observation
# preds: 2D array of probabilities, one row per observation, one column per class
# Returns the summed (not averaged) negative log-likelihood
def neg_log_loss(y, preds):
    res = 0
    for i in range(len(y)):
        res += np.log(preds[i,y[i]])
    return -res

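def cv_split(X, y, K):
    # Returns K tuples (X_train, y_train, X_held_out, y_held_out)
    
    # shuffle the data so that the fold allocation is random
    p = np.random.permutation(len(X))
    X, y = X[p], y[p]

    # split into K folds of size ceil(N / K); the last fold may be smaller
    split_size = (len(X) + K - 1) // K
    splits = []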
    
    for k in range(K):
        Xk1 = X[:k*split_size]
        yk1 = y[:k*split_size]
        Xk2 = X[(k+1)*split_size:]
        yk2 = y[(k+1)*split_size:]
        Xk = np.concatenate((Xk1, Xk2), axis=0)
        yk = np.concatenate((yk1, yk2), axis=0)
        
        Xo = X[k*split_size: (k+1)*split_size]
        yo = y[k*split_size: (k+1)*split_size]
        
        splits.append((Xk, yk, Xo, yo))
        
    return splits

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone

def cv_eval(X, y, K, model, loss_fn):
    # Returns the held-out loss summed over the K folds (not divided by N)
    splits = cv_split(X, y, K)
    loss = 0

    for (X_train, y_train, X_test, y_test) in splits:
        clf = clone(model)
        clf.fit(X_train, y_train)
        preds = clf.predict_proba(X_test)
        lossk = loss_fn(y_test, preds)
        loss += lossk
        #print('accuracy:', np.mean(y_test == clf.predict(X_test)))
        
    return loss
    
model = LogisticRegression(solver='liblinear', multi_class='ovr')
loss = cv_eval(X, y, 10, model, neg_log_loss)
print('Loss:', loss)

Loss: 318.9508387627867


The case $K=N$ is called leave-one-out cross-validation: $N$ models are fitted, each on all but one observation, and each model is tested on the single observation that was left out.

In [6]:
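Xs, ys = X[:100], y[:100]
model = LogisticRegression(solver='liblinear', multi_class='ovr')
# Note: this call passes the full X, y with K = len(Xs) = 100, so it runs
# 100-fold CV on all 1797 digits (this is what produced the output below).
# True leave-one-out on the subset would be cv_eval(Xs, ys, len(Xs), ...).
loss = cv_eval(X, y, len(Xs), model, neg_log_loss)
print('Loss:', loss)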

Loss: 316.0156542201092


We can also use Cross-Validation to find a tuning parameter $\alpha$:
    
$$\text{CV}(\hat{f}, \alpha) = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha))$$

$\text{CV}$ produces an estimate of the test error curve, and we find $\alpha$ that minimizes it.  

One-standard-error rule: choose the simplest model whose error is no more than one standard error above the error of the best model.
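
The rule can be sketched as follows when the per-fold errors of each candidate model are available. The function name and the error table are hypothetical; the rows are assumed to be ordered from the simplest to the most complex model.

```python
import numpy as np

def one_standard_error_rule(fold_errors):
    # fold_errors: array of shape (n_models, K), rows ordered simplest first
    fold_errors = np.asarray(fold_errors, dtype=float)
    K = fold_errors.shape[1]
    mean_err = fold_errors.mean(axis=1)
    # Standard error of the mean CV error of each model
    std_err = fold_errors.std(axis=1, ddof=1) / np.sqrt(K)
    best = int(np.argmin(mean_err))
    threshold = mean_err[best] + std_err[best]
    # Simplest model whose mean error is within the threshold
    chosen = int(np.flatnonzero(mean_err <= threshold)[0])
    return chosen, best

# Made-up errors for 4 models evaluated with 5-fold CV
errors = [
    [0.50, 0.52, 0.48, 0.51, 0.49],
    [0.30, 0.34, 0.28, 0.32, 0.31],
    [0.28, 0.34, 0.27, 0.33, 0.28],
    [0.30, 0.36, 0.29, 0.34, 0.31],
]
chosen, best = one_standard_error_rule(errors)
print('best model index:', best, ', one-standard-error choice:', chosen)
```

Here the third model has the lowest mean error, but the second one is within one standard error of it and is simpler, so the rule prefers it.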

In [8]:
def cv_tune(X, y, K, loss_fn, vals):
    errs = np.empty(len(vals))
    for i in range(len(vals)):
        print('Eval model {}/{}...'.format(i+1, len(vals)))
        model = LogisticRegression(solver='liblinear', 
                                   multi_class='ovr',
                                  C=vals[i])
        errs[i] = cv_eval(X, y, K, model, loss_fn)
    return errs
    
vals = np.linspace(0.01, 0.99, 100)
errs = cv_tune(X, y, 10, neg_log_loss, vals)


mini = np.argmin(errs)
print('min = {}, max = {}, avg = {}'.format(np.min(errs),
                                            np.max(errs),
                                            np.mean(errs)))
print('best: C = {}, loss = {}'.format(vals[mini], errs[mini]))

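# One-standard-error rule, approximated here with the standard deviation of the
# errors across all candidate values of C (per-fold errors are not kept).
# C is the inverse regularization strength, so the smallest C is the simplest model.
minp1s = np.min(errs) + np.std(errs)
osr_c = vals[errs < minp1s][0]
print('osr solution: C = {}'.format(osr_c))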

Eval model 1/100...
Eval model 2/100...
Eval model 3/100...
...

With $K = N$, the CV estimator is approximately unbiased for the expected prediction error, but can have high variance.  
Decreasing $K$ lowers the variance but increases the bias.

Generalized Cross-Validation (GCV) provides an approximation to leave-one-out cross-validation for linear fitting under squared-error loss, that is, for fits of the form:

$$\hat{y} = Sy$$

The GCV approximation is:
$$\text{GCV}(\hat{f}) = \frac{1}{N} \sum_{i=1}^N \left(\frac{y_i - \hat{f}(x_i)}{1 - \text{trace}(S)/N}\right)^2$$
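
To see how close GCV is to leave-one-out, the sketch below compares three quantities for ridge regression on simulated data (an assumed example): brute-force leave-one-out, the exact shortcut that divides each residual by $1 - S_{ii}$ (a standard identity for this kind of linear fit, not stated in the text above), and GCV, which replaces each $S_{ii}$ by their average $\text{trace}(S)/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, alpha = 80, 5, 1.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(0, 1.0, N)   # hypothetical data

# Ridge smoother matrix and training residuals
S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
resid = y - S @ y

loo_shortcut = np.mean((resid / (1 - np.diag(S))) ** 2)
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)

# Brute force: refit N times, each time leaving one observation out
loo_brute = 0.0
for i in range(N):
    mask = np.arange(N) != i
    beta = np.linalg.solve(X[mask].T @ X[mask] + alpha * np.eye(p),
                           X[mask].T @ y[mask])
    loo_brute += (y[i] - X[i] @ beta) ** 2
loo_brute /= N

print('brute-force LOO:', loo_brute)
print('LOO shortcut:   ', loo_shortcut)
print('GCV:            ', gcv)
```

GCV needs only the trace of $S$, which can be cheaper to obtain than the individual diagonal elements.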

## The Wrong and Right Way to Do Cross-validation

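With a multi-step modeling procedure, cross-validation must be applied to the entire sequence of modeling steps.  
No step that uses the response may see the whole dataset: any supervised step (for example, selecting features by their association with the labels) must be redone inside each fold.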
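
The effect is easy to reproduce on pure noise. In the sketch below (an assumed setup: 50 samples, 2000 random features, labels independent of the features), the true error rate of any classifier is 50%. Selecting features on the full dataset before cross-validating gives an optimistic accuracy, while wrapping the selection in a scikit-learn pipeline redoes it inside each fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(50, 2000))   # features carry no information
y_noise = np.repeat([0, 1], 25)         # balanced labels, independent of X_noise

# Wrong: feature selection sees all the labels before cross-validation
X_sel = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
wrong_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_sel, y_noise, cv=5).mean()

# Right: selection is part of the model and is refitted on each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
right_acc = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print('selection outside CV: accuracy =', wrong_acc)
print('selection inside CV:  accuracy =', right_acc)
```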

# Bootstrap Methods

The bootstrap is used to estimate the expected prediction error.  
Suppose we have a training set $Z = (z_1,\text{...},z_N)$, with $z_i=(x_i,y_i)$. We draw datasets of size $N$ from $Z$ by sampling with replacement, producing $B$ bootstrap datasets $Z^{*1},\text{...},Z^{*B}$.  

Let $S(Z)$ be any quantity computed from the data $Z$. From the bootstrap samples we can estimate the variance of $S(Z)$:

$$\widehat{\text{Var}}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^B (S(Z^{*b}) - \bar{S}^*)^2$$

with $\bar{S}^* = \frac{1}{B} \sum_{b=1}^B S(Z^{*b})$.
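
A minimal sketch of this variance estimate, taking the sample mean as the statistic $S$ so that the result can be compared with the usual formula $s^2/N$. The data and the helper name `bootstrap_variance` are assumptions for the example.

```python
import numpy as np

def bootstrap_variance(z, statistic, B, rng):
    # Variance of statistic(z) estimated from B bootstrap resamples of z
    n = len(z)
    stats = np.array([statistic(z[rng.integers(0, n, n)]) for _ in range(B)])
    return stats.var(ddof=1)          # 1 / (B - 1) normalization

rng = np.random.default_rng(0)
z = rng.normal(0, 2.0, 200)           # hypothetical sample

var_mean_boot = bootstrap_variance(z, np.mean, 2000, rng)
var_mean_formula = z.var(ddof=1) / len(z)

print('bootstrap:', var_mean_boot, ', formula:', var_mean_formula)
```

The same function works for statistics with no simple variance formula, for example `np.median`.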

The same resampling idea can be used to estimate the prediction error, with the leave-one-out bootstrap estimate:

$$\hat{\text{err}}^{(1)} = \frac{1}{N} \sum_{i=1}^N \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i))$$

with $C^{-i}$ the set of indices of the bootstrap samples that do not contain observation $i$. We need to choose $B$ large enough that no $C^{-i}$ is empty, or leave out the observations for which it is empty.

In [18]:
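def bootstrap_data(X, y, B):
    # Returns B tuples (X_boot, y_boot, X_out_of_bag, y_out_of_bag)
    res = []
    for i in range(B):
        # draw N indices with replacement
        p = np.random.randint(0, len(X), len(X))
        # out-of-bag observations: rows never drawn in this bootstrap sample
        Xo = np.delete(X, p, axis=0)
        yo = np.delete(y, p, axis=0)
        res.append((X[p], y[p], Xo, yo))
    return res

def bootstrap_eval(X, y, B, model, loss_fn):
    # Sums the out-of-bag loss over the B bootstrap samples; unlike the
    # formula above, it does not average per observation
    sets = bootstrap_data(X, y, B)
    loss = 0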
    for Xk, yk, Xo, yo in sets:
        clf = clone(model)
        clf.fit(Xk, yk)
        preds = clf.predict_proba(Xo)
        lossk = loss_fn(yo, preds)
        loss += lossk
        

    return loss


model = LogisticRegression(solver='liblinear', multi_class='ovr')        
bootstrap_eval(X, y, 10, model, neg_log_loss)

1431.0110937028146
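
The leave-one-out bootstrap formula averages, for each observation, the loss over the bootstrap fits that did not see it. A sketch of that per-observation averaging is given below on an assumed least-squares regression problem; the function `loo_bootstrap_error` and its `fit`, `predict`, and `loss` arguments are hypothetical names introduced for the example.

```python
import numpy as np

def loo_bootstrap_error(X, y, B, fit, predict, loss, rng):
    n = len(X)
    loss_sum = np.zeros(n)   # accumulated loss of each observation
    count = np.zeros(n)      # |C^{-i}|: number of samples not containing i
    for _ in range(B):
        idx = rng.integers(0, n, n)                # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)      # observations left out
        if oob.size == 0:
            continue
        model = fit(X[idx], y[idx])
        loss_sum[oob] += loss(y[oob], predict(model, X[oob]))
        count[oob] += 1
    seen = count > 0         # ignore observations whose C^{-i} is empty
    return np.mean(loss_sum[seen] / count[seen])

# Hypothetical regression data and a least-squares model
rng = np.random.default_rng(0)
N, p = 100, 5
X_reg = rng.normal(size=(N, p))
y_reg = X_reg @ rng.normal(size=p) + rng.normal(0, 1.0, N)

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
loss = lambda y, y_hat: (y - y_hat) ** 2

err1 = loo_bootstrap_error(X_reg, y_reg, 200, fit, predict, loss, rng)
train_err = np.mean((y_reg - predict(fit(X_reg, y_reg), X_reg)) ** 2)
print('training error:', train_err, ', leave-one-out bootstrap:', err1)
```

Unlike the training error, this estimate only scores each observation with models that were fitted without it.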