# <center>Introduction to Model Selection Methods </center>
<br><br>
<center> Zhangyi Hu </center>
<center> Oct. 16, 2016</center>

# <center> Contents </center>

- Review of basic terminology of statistical learning 
- Motivation of model selection
- Three approaches of model selection
    - Estimate test error
    - Estimate information loss
    - Estimate the posterior probability

# <center> Basic terminology of supervised statistical learning </center>

###  Data = feature + response 

### Model  predicts response based on feature

### Data can be *Training* set or *Test* set 

### <center> Data = feature + response </center>
Other names for *feature*:
- regressor, predictor
- independent (input, explanatory) variable
- $X$

Other names for *response*:
- regressand
- dependent (output, explained) variable
- $Y$

### <center> Model  predicts response based on feature </center>
- Model is mathematically defined by a set of parameters $\{\theta_i\}$
  - True model: $\mathbf{y}=f_{\theta}\left(\mathbf{X}\right)+\mathbf{\varepsilon}$
  - Trained model:
$\hat{\mathbf{y}}=g_{\hat{\theta}}\left(\mathbf{X}\right)$

- The process of finding the values of the parameters is called *Model Training*: 
$$\{\theta_{i}\}\rightarrow\{\hat{\theta}_{i}\}$$

- The process of choosing a subset of the parameter space is called *Model Selection*: 
$$\{\hat{\theta}_{i}\}_{i=1}^{p}\mbox{ or }\{\hat{\theta}_{i}\}_{i=1}^{p+q}$$

### <center> Training set and Test set </center>
- Both are data, with feature and response
- The training set is used to train the model, i.e. obtain $\{\hat{\theta}_i\}$
  - The difference between model prediction and response in the training set: **Training Error**

- The test set is used to evaluate the model
  - The test set is usually not available to the model designer
  - e.g. the data provided by the end user of the trained model
  - The difference between prediction and response in the test set: **Test Error**
    - When the test set has the same features as the training set but new observations of the response: **In-sample Test Error**
    - When the test set has different features: **Extra-sample Test Error**

- Usually, the model designer randomly sets aside part of the available data, treats it as unseen and, at the final stage, uses it as the test set
- Generally, reducing *Test Error* is the main goal of statistical learning engineering
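
The split described above can be sketched in a few lines of NumPy. This is only an illustration under assumed conditions: the data are synthetic, the true parameters `true_theta`, the noise level, and the 25% test fraction are all hypothetical choices, and ordinary least squares stands in for "the model".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a linear true model plus Gaussian noise.
n_obs = 200
X = rng.normal(size=(n_obs, 3))
true_theta = np.array([1.0, -2.0, 0.5])   # assumed, for illustration only
y = X @ true_theta + rng.normal(scale=0.5, size=n_obs)

# Randomly set aside 25% of the available data as the test set.
perm = rng.permutation(n_obs)
test_idx, train_idx = perm[:50], perm[50:]

# Model training: estimate theta_hat from the training set only (OLS).
theta_hat, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

def mean_squared_error(X_part, y_part, theta):
    """Average squared difference between prediction and response."""
    residual = y_part - X_part @ theta
    return float(np.mean(residual ** 2))

training_error = mean_squared_error(X[train_idx], y[train_idx], theta_hat)
test_error = mean_squared_error(X[test_idx], y[test_idx], theta_hat)
print(f"training error: {training_error:.3f}, test error: {test_error:.3f}")
```

The test rows never enter `np.linalg.lstsq`, so `test_error` plays the role of the extra-sample test error for this one split.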

# <center> Motivation of model selection </center>
### A model, by definition, needs to be relatively simple to be practical
### With no constraint on parameter space, a model can fit anything

## <center>One extreme example</center>

In [2]:
import matplotlib.pyplot as plt
import matplotlib
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
matplotlib.rcParams.update({'font.size': 14})
ax.set_xlabel('feature')
ax.set_ylabel('response')
ax.set_xlim([0.8, 6.0])
ax.set_ylim([1.0, 9.0])
x = [1,2,3,4]
y = [2.1, 3.9, 6.0, 7.8]
ax.plot(x, y, 'o', ms=8)
for xy in zip(x, y):
    ax.annotate(' (%s, %s)' % xy, xy=xy, textcoords='data') 
fig.savefig('resource/FourPoints.png')

<center><img src='resource/FourPoints.png' width='70%'></center>

<center> A type of model that can fit any available data </center>
<center> Its training error is zero! </center>
<center> $y=\begin{cases}
2.1 & x=1\\
3.9 & x=2\\
6.0 & x=3\\
7.8 & x=4\\
x & \mbox{otherwise}
\end{cases}$ </center>

<center> Do we have a better model? </center>

#### Your brain just drew a straight line subconsciously 
#### It is the nature of intelligence to favor one simple line over four separate points
#### Artificial intelligence behaves in a similar way
#### The purpose of model selection is to find a proper constraint on the parameters
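
To make the comparison concrete, here is a small sketch that puts the lookup-table model and a least-squares line side by side on the same four points. The function names `lookup_model` and `line_model` are made up for this example, and evaluating at $x=5$ is an arbitrary choice of a point outside the training data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.0, 7.8])

# The "memorizing" model from the slide: a lookup table, y = x otherwise.
lookup = dict(zip(x.tolist(), y.tolist()))

def lookup_model(x_new):
    return lookup.get(float(x_new), float(x_new))

# A model constrained to two parameters: slope and intercept (least squares).
slope, intercept = np.polyfit(x, y, deg=1)

def line_model(x_new):
    return slope * x_new + intercept

lookup_train_error = np.mean([(lookup_model(xi) - yi) ** 2 for xi, yi in zip(x, y)])
line_train_error = np.mean((line_model(x) - y) ** 2)

print("lookup training error:", lookup_train_error)
print("line training error:  ", line_train_error)
print("predictions at x = 5: ", lookup_model(5.0), line_model(5.0))
```

The lookup model wins on training error by construction, yet at a new point it falls back to $y=x$, while the fitted line continues the trend visible in the data.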

## <center> Examples of model parameter Constraints </center>
Let the parameters be a $p$-vector $\mathbf{\beta} = [\beta_1,\dots,\beta_p]^T$

- $\left\Vert \mathbf{\beta}\right\Vert _{0} \le n$, subset selection
- $\left\Vert \mathbf{\beta}\right\Vert _{1} \le \lambda$, LASSO
- $\left\Vert \mathbf{\beta}\right\Vert _{2} \le \lambda$, ridge regression (shrinkage)

Model selection finds the suitable $n$ or $\lambda$, which defines a subset of the parameter space
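
As a quick illustration of what the three constraints measure, the snippet below evaluates each norm for one parameter vector. The vector `beta` is a hypothetical example, not taken from any fitted model.

```python
import numpy as np

beta = np.array([0.0, 3.0, 0.0, -4.0])   # hypothetical parameter vector

l0 = np.count_nonzero(beta)          # number of non-zero parameters
l1 = np.sum(np.abs(beta))            # sum of absolute values
l2 = np.sqrt(np.sum(beta ** 2))      # Euclidean length

print(l0, l1, l2)   # 2 7.0 5.0
```

A bound on `l0` limits how many parameters may be active, while bounds on `l1` or `l2` limit how large the parameters may be in total.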

## <center>Three approaches of model selection</center>
#### Estimate test error (Mallows's $C_p$, cross-validation, bootstrap)
#### Estimate information loss (Akaike Information Criterion)
#### Estimate the posterior probability (Bayesian Information Criterion)

### <center>Test error function</center>
- The test error (or loss) function measures how far a prediction is from the response on test data:

<center>e.g. squared-error loss: $L(y,\hat{y})=\left(y-\hat{y}\right)^{2}$</center>

#### <center>Decomposition of expected test error</center>

\begin{align*}
\mbox{Err}\left(x_{0}\right) & =\mathbb{E}\left[(Y-\hat{f}(x_{0}))^{2}\right]\\
 & =\mathbb{E}\left[\left(f(x_{0})+\varepsilon-\hat{f}(x_{0})\right)^{2}\right]\\
 & =\sigma_{\varepsilon}^{2}+\mathbb{E}\left[\left(f(x_{0})-\mathbb{E}\left[\hat{f}(x_{0})\right]+\mathbb{E}\left[\hat{f}(x_{0})\right]-\hat{f}(x_{0})\right)^{2}\right]\\
 & =\sigma_{\varepsilon}^{2}+\mathbb{E}\left[\left(f(x_{0})-\mathbb{E}\left[\hat{f}(x_{0})\right]\right)^{2}\right]+\mathbb{E}\left[\left(\mathbb{E}\left[\hat{f}(x_{0})\right]-\hat{f}(x_{0})\right)^{2}\right]\\
 & =\sigma_{\varepsilon}^{2}+\mbox{Bias}^{2}\left(\hat{f}(x_{0})\right)+\mbox{Var}\left(\hat{f}(x_{0})\right)
\end{align*}
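
The cross terms in the expansion vanish because $\varepsilon$ has zero mean and is independent of $\hat{f}(x_0)$, and because $\hat{f}(x_{0})-\mathbb{E}[\hat{f}(x_{0})]$ has zero mean. A Monte Carlo sketch can check the decomposition numerically. Everything below is assumed for illustration: the true function `f`, the noise level `sigma`, the test point `x0`, and the use of a straight-line fit as the model.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Assumed true regression function (hypothetical)."""
    return np.sin(2.0 * x)

sigma = 0.3        # noise standard deviation (assumed)
x0 = 0.8           # the fixed test point
n_train = 30
n_repeats = 5000
degree = 1         # a deliberately simple model, so the bias is visible

f_hat_x0 = np.empty(n_repeats)
for r in range(n_repeats):
    # Draw a fresh training set and refit the model each time.
    x_train = rng.uniform(0.0, 2.0, size=n_train)
    y_train = f(x_train) + rng.normal(scale=sigma, size=n_train)
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    f_hat_x0[r] = np.polyval(coeffs, x0)

bias_sq = (f(x0) - f_hat_x0.mean()) ** 2
variance = f_hat_x0.var()
reducible = np.mean((f(x0) - f_hat_x0) ** 2)

# New responses at x0, independent of the training sets.
y0 = f(x0) + rng.normal(scale=sigma, size=n_repeats)
err_direct = np.mean((y0 - f_hat_x0) ** 2)
err_decomposed = sigma ** 2 + bias_sq + variance

print("bias^2 + variance:", bias_sq + variance, "reducible error:", reducible)
print("direct estimate:", err_direct, "decomposed:", err_decomposed)
```

`reducible` matches `bias_sq + variance` exactly in sample terms, while `err_direct` and `err_decomposed` agree only up to Monte Carlo noise.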

## <center> Bias-Variance trade-off </center>
- As the model gets more complex, the bias usually decreases
- However, at the same time the variance usually goes up

- e.g. in OLS

\begin{eqnarray*}
\frac{1}{N}\sum_{i=1}^{N}\mbox{Var}\left(\hat{f}(x_{i})\right) & = & \frac{p}{N}\sigma_{\varepsilon}^{2}\\
\mathbb{E}_{x_{0}}\left[\mbox{Var}\left(\hat{f}(x_{0})\right)\right] & \sim & \frac{p}{N}\sigma_{\varepsilon}^{2},\quad N\rightarrow\infty
\end{eqnarray*}

<center>
$\begin{array}{cc}
x_{i} & x_{0}\\
\overline{\mbox{in sample}} & \overline{\mbox{out of sample}}
\end{array}$
</center>
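
For the in-sample case, the variance of a single fitted value is $\sigma_{\varepsilon}^{2}$ times the corresponding diagonal entry of the hat matrix, and the $\frac{p}{N}\sigma_{\varepsilon}^{2}$ figure is the average over the $N$ training points. The sketch below checks this on a random design matrix; the sizes `N`, `p` and the noise level `sigma` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 4
sigma = 1.5                      # assumed noise standard deviation
X = rng.normal(size=(N, p))      # hypothetical full-rank design matrix

# Hat matrix H maps the responses to the fitted values: y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)

# With Cov(y) = sigma^2 I, Var(y_hat_i) = sigma^2 * H[i, i].
var_fitted = sigma ** 2 * np.diag(H)

print("trace of H:", np.trace(H))
print("average variance:", var_fitted.mean(), "p/N * sigma^2:", p / N * sigma ** 2)
```

Individual entries of `var_fitted` differ from point to point; only their mean equals $\frac{p}{N}\sigma_{\varepsilon}^{2}$.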

## <center> Bias-Variance trade-off </center>
- Training error generally decreases as the model becomes more complex (or with more fitting effort)
- It is a common mistake to use too complex a model, whose variance is large: **overfitting**

<center> Perhaps the most overfitted model </center>

<center> $y=\begin{cases}
y_i & \mathrm{if}\quad x=x_i\\
x & \mbox{otherwise}
\end{cases}$ </center>
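
The behavior of the training error can be seen with nested polynomial fits, where each model contains the previous one as a special case. This is a sketch on made-up data: the linear truth, the noise level, and the range of degrees are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)   # assumed linear truth

train_mse = []
for degree in range(0, 7):
    # Columns 1, x, ..., x^degree: each model nests the previous one.
    design = np.vander(x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    train_mse.append(float(np.mean((y - design @ coeffs) ** 2)))

for degree, mse in enumerate(train_mse):
    print(degree, round(mse, 4))
```

For nested least-squares fits like these, the training error cannot go up when a column is added, even though the true relationship here is only linear.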

## <center> Bias-Variance trade-off </center>
<center><img src='resource/ESLII_Fig7.2.png' width='60%'/></center>
<center><font size="3">T. Hastie, R. Tibshirani, and J. Friedman, Elements of statistical learning, Figure 7.2</font></center>

## <center>Estimate test error</center>
- Select the subset of parameter space whose best fitted model produces the smallest test error

- With many loss functions, the expected in-sample test error equals the expected training error plus a term that can be calculated explicitly
$$\mathbb{E}_{\mathbf{y}}\left[\mbox{Err}_{\mbox{in}}\right]=\mathbb{E}_{\mathbf{y}}\left[\overline{\mbox{err}}\right]+\frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}\left(\hat{y}_{i},y_{i}\right)$$

- <a href='http://nbviewer.jupyter.org/github/hzzyyy/Presentations/blob/master/Model%20Selection/proofs/In-sample%20test%20error%20and%20training%20error.ipynb'>Proof for squared error loss function </a>

## <center> Mallows's $C_p$ </center>
- Mallows's $C_p$ estimates in-sample test error for OLS
- Let's denote the projection matrix of OLS as:
$$\mathbf{S}=\mathbf{X}\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}$$
- Then we can prove that:

\begin{align*}
\sum_{i=1}^{N}\mbox{Cov}\left(\hat{y}_{i},y_{i}\right) & =\sigma_{\varepsilon}^{2}\mbox{Tr}\left(\mathbf{S}\right)=\sigma_{\varepsilon}^{2}p
\end{align*}
$$\mathbb{E}_{\mathbf{y}}\left[\mbox{Err}_{\mbox{in}}\right]=\mathbb{E}_{\mathbf{y}}\left[\overline{\mbox{err}}\right]+\frac{2p}{N}\sigma_{\varepsilon}^{2}$$


## <center> Mallows's $C_p$ </center>
- Mallows's $C_p$ estimates in-sample test error for OLS
- $\frac{\mbox{RSS}}N$ estimates the expected training error $\mathbb{E}_{\mathbf{y}}\left[\overline{\mbox{err}}\right]$, since $\overline{\mbox{err}}=\mbox{RSS}/N$ under squared-error loss
- $\sigma_{\varepsilon}^2$ is estimated from another low-bias model (e.g. the full model with all $K$ features), which is most likely overfitted
- In subset selection
$C_{p}=\frac{\mbox{RSS}_{p}}{N}+\frac{2p}{N}\frac{\mbox{RSS}_{K}}{N-K}$
- In OLS, extra-sample test error approaches in-sample test error 
as $N$ gets large (<a href='http://nbviewer.jupyter.org/github/hzzyyy/Presentations/blob/master/Model%20Selection/proofs/In-sample%20test%20error%20and%20extra-sample%20test%20error.ipynb'>proof</a>)
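
The $C_p$ formula above translates directly into code. The sketch below is an illustration under assumptions: the data are synthetic, the helper names `mallows_cp` and `rss` are made up here, and only nested candidates (the first $p$ columns) are compared, which is a simplification of full subset selection.

```python
import numpy as np

def mallows_cp(rss_p, p, rss_full, K, N):
    """C_p = RSS_p / N + (2 p / N) * RSS_K / (N - K)."""
    sigma2_hat = rss_full / (N - K)       # noise variance from the full model
    return rss_p / N + 2.0 * p / N * sigma2_hat

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on the columns of X."""
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coeffs) ** 2))

rng = np.random.default_rng(4)
N, K = 100, 8
X = rng.normal(size=(N, K))
# Hypothetical truth: only the first three features matter.
beta = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=1.0, size=N)

rss_full = rss(X, y)
# Nested candidates: the first p columns, p = 1..K (a simplification of
# subset selection, which would search over all subsets of each size).
cp = {p: mallows_cp(rss(X[:, :p], y), p, rss_full, K, N) for p in range(1, K + 1)}
best_p = min(cp, key=cp.get)

for p, value in cp.items():
    print(p, round(value, 3))
print("selected p:", best_p)
```

The first term falls as $p$ grows while the penalty term rises linearly in $p$, so the minimum of `cp` balances the two.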

## <center>Estimate test error</center>
- We can also directly estimate extra-sample test error, by using part of the training set as "test set"


- It is not the real test set, because the test set is used only at the final stage, after the model selection is completed

- It is not used to train the model under a given parameter-space constraint, so it is not part of the training set

- It sits between the training set and the test set, and we name it the **validation set**

## <center>Estimate test error</center>
| <font size='5'>Training set </font>|<font size='5'> Validation set </font>|
| :-----------------: | :------------: |
| $\mathcal{T}$ | $\mathcal{V} = \{(x^j, y^j)\}_{j=1}^{M} $ |
- With one random partition between training and validation set, 
we can obtain one realization of the mean extra-sample test error
$$\frac{1}{M}\sum_{j=1}^{M}L(y^{j},\hat{y}^{j})$$
- This is one realization of the moment estimator of the expected extra-sample test error
$$\mbox{Err}_{\mathcal{T}}=\mathbb{E}_{y^j}\left[L(y^{j},\hat{y}^{j})\vert\mathcal{T}\right]$$

- If we use only one such realization, the variance of the estimate is too large. 
Averaging $K$ such realizations reduces the variance; if the realizations were independent, it would shrink by a factor of $K$, and less than that in practice because the training sets overlap

| <font size='5'>Training sets </font>|<font size='5'> Validation sets </font>|
| :-----------------: | :------------: |
| $\mathcal{T}_1$ | $\mathcal{V}_1 = \{(x_1^j, y_1^j)\}_{j=1}^{M} $ |
| $\cdots$ | $\cdots$ |
| $\mathcal{T}_i$ | $\mathcal{V}_i = \{(x_i^j, y_i^j)\}_{j=1}^{M} $ |
| $\cdots$ | $\cdots$ |
| $\mathcal{T}_K$ | $\mathcal{V}_K = \{(x_K^j, y_K^j)\}_{j=1}^{M} $ |

- The mean of the $K$ means of the extra-sample test error is
$$\frac{1}{K}\sum_{i=1}^{K}\frac{1}{M}\sum_{j=1}^{M}L(y_{i}^{j},\hat{y}_{i}^{j})$$
- This is a moment estimator of the iterated expected extra-sample test error
$$\mbox{Err}=\mathbb{E}_{\mathcal{T}}\left[\mbox{Err}_{\mathcal{T}}\right]=\mathbb{E}_{\mathcal{T}}\left[\mathbb{E}_{y^{j}}\left[L(y^{j},\hat{y}^{j})\vert\mathcal{T}\right]\right]$$
- There are two sources of randomness
  - realization of the random test response $y_i^j$, conditional on $\mathcal{T}_i$
  - realization of the random partition between $\mathcal{T}_i$ and $\mathcal{V}_i$
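
The averaging over $K$ random partitions can be sketched as follows. This is a minimal illustration, assuming synthetic data, OLS as the model, squared-error loss, and a hypothetical helper `holdout_error`; the values of `N`, `M` and `K` are arbitrary.

```python
import numpy as np

def holdout_error(X, y, n_val, rng):
    """One realization: random partition, train on T, evaluate on V."""
    perm = rng.permutation(len(y))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    coeffs, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    residual = y[val_idx] - X[val_idx] @ coeffs
    return float(np.mean(residual ** 2))    # (1/M) sum_j L(y^j, y_hat^j)

rng = np.random.default_rng(5)
N, M, K = 120, 30, 50
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.7, size=N)

# K random partitions, each giving one mean validation error.
errors = np.array([holdout_error(X, y, M, rng) for _ in range(K)])

print("single realization:", errors[0])
print("mean of K realizations:", errors.mean())
print("spread of single realizations:", errors.std(ddof=1))
```

Each entry of `errors` is one realization of the inner mean, and `errors.mean()` is the double average over partitions and validation points.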

## <center> Cross validation </center> 

In [12]:
from IPython.core.display import HTML
HTML(filename='../slides.html')
