# Resampling Methods
### by [Richard W. Evans](https://sites.google.com/site/rickecon/), February 2020
The code in this Jupyter notebook was written using Python 3.7. It uses the [`Titanic dataset`](https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv) data file. For the code to run properly, you will need either internet access or a copy of the data file in the same folder as the Jupyter notebook file. Otherwise, you will have to change the lines of code that read in the data to reflect the location of that data.

**Resampling methods** are a way to test the sensitivity of statistical results to estimation using a different sample. It is often too difficult or too expensive to draw a new sample from the population. Resampling methods take advantage of the training-set test-set paradigm to evaluate the sensitivity of estimates to sample variance. The two main classes of resampling methods are:

1. Cross validation
2. Bootstrapping

In choosing models to predict or match data or to infer relationships between variables, James et al. (2013) decompose the process into *model assessment* and *model selection*.

**Model assessment** is treated in this notebook. It is the process and various means of evaluating the fit or accuracy of a given model. 

**Model selection** is the process of adjusting parameters, variables, or functional relationships between variables to better fit the data.

## 1. Cross validation

### 1.1. Validation set approach
This is an approach that we have already studied in the [classifiers notebook](https://github.com/UC-MACSS/persp-model-econ_W19/blob/master/Notebooks/Classification/LogitKNN.ipynb).

1. Partition the data into a training set and a test set.
2. Estimate the model using the training set.
3. Evaluate the fit or predictive accuracy on the test set.

The primary measure of fit is **the mean squared error (MSE)** of the estimated model on the test set. Let the test set have $N$ observations. The MSE of the test set is the average of the squared deviations of the actual dependent variable values from the predicted values.

$$ MSE = \frac{1}{N}\sum_{i=1}^N\left(y_i - \hat{y}_i\right)^2 $$
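
For a binary outcome with 0/1 predictions, each squared deviation is either 0 or 1, so the MSE coincides with the misclassification rate (one minus accuracy). A minimal sketch of that equivalence, using made-up labels (the arrays `y_true` and `y_hat` are illustrative and are not taken from the Titanic data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical 0/1 outcomes and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# MSE from the formula above
mse = mean_squared_error(y_true, y_hat)
# Share of observations that are misclassified
error_rate = 1.0 - accuracy_score(y_true, y_hat)

print('MSE =', mse, ' misclassification rate =', error_rate)
```

The two numbers agree because $(y_i - \hat{y}_i)^2$ equals 1 exactly when the prediction is wrong and 0 otherwise.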

In classification problems, researchers sometimes use **the ROC curve** (receiver operating characteristic) and **the area under the ROC curve (AUROC)** because the ROC curve captures both type 1 and type 2 errors in a single visualization (false positive rate versus true positive rate). You want the false positive rate to be low for any given true positive rate. **The AUROC is the opposite of a measure of error: the bigger the AUROC, the more accurate a predictor the model is.**
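
As a rough sketch of how the ROC curve and the AUROC might be computed with scikit-learn, the example below fits a logistic regression on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, `clf`, and `p_te` are assumptions for illustration and are not part of the Titanic example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4,
                                          random_state=0)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# ROC analysis uses predicted probabilities rather than hard 0/1 predictions
p_te = clf.predict_proba(X_te)[:, 1]

# False positive rate and true positive rate at each probability threshold
fpr, tpr, thresholds = roc_curve(y_te, p_te)
auroc = roc_auc_score(y_te, p_te)
print('AUROC =', auroc)
```

Plotting `tpr` against `fpr` traces out the ROC curve. A value of `auroc` near 0.5 corresponds to random guessing and a value near 1 to a nearly perfect classifier.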

[Insert figure]

Let's calculate the MSE from our logistic regression of the titanic example.

In [11]:
# Import needed libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import sklearn
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold
from sklearn import metrics 
from sklearn.metrics import classification_report, mean_squared_error
from pylab import rcParams

import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')

import warnings
warnings.filterwarnings("ignore")

In [4]:
# Read in Titanic data
url = ('https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/' +
      'master/titanic-train.csv')
titanic = pd.read_csv(url)
titanic.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age',
                   'SibSp','Parch','Ticket','Fare','Cabin','Embarked']

# Get rid of columns we don't use
titanic = titanic.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)

# Impute missing age values
def age_approx(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

titanic['Age'] = \
    titanic[['Age', 'Pclass']].apply(age_approx, axis=1)
    
# Drop any observations with missing values
titanic.dropna(inplace=True)

# Make gender dummies and embark dummies and get rid of
# original variables
gender = pd.get_dummies(titanic['Sex'], drop_first=True)
embark_location = pd.get_dummies(titanic['Embarked'],
                                 drop_first=True)
titanic.drop(['Sex', 'Embarked'], axis=1, inplace=True)
titanic = pd.concat([titanic, gender, embark_location], axis=1)

# Drop Pclass variable due to excessive correlation with Fare
titanic.drop(['Pclass'], axis=1, inplace=True)

titanic.head()

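|   | Survived | Age  | SibSp | Parch | Fare    | male | Q | S |
|---|----------|------|-------|-------|---------|------|---|---|
| 0 | 0        | 22.0 | 1     | 0     | 7.25    | 1    | 0 | 1 |
| 1 | 1        | 38.0 | 1     | 0     | 71.2833 | 0    | 0 | 0 |
| 2 | 1        | 26.0 | 0     | 0     | 7.925   | 0    | 0 | 1 |
| 3 | 1        | 35.0 | 1     | 0     | 53.1    | 0    | 0 | 1 |
| 4 | 0        | 35.0 | 0     | 0     | 8.05    | 1    | 0 | 1 |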
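
The class-based age imputation in `age_approx` can also be written without a row-wise `apply`. The sketch below shows one possible vectorized version on a tiny made-up frame. The frame `df` and the dictionary `fill_by_class` are hypothetical names. The mapping lists class 3 explicitly, whereas `age_approx` uses 24 for any class other than 1 or 2, so the two agree only when `Pclass` takes the values 1, 2, or 3.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the Titanic data
df = pd.DataFrame({'Age': [22.0, np.nan, np.nan, np.nan, 40.0],
                   'Pclass': [3, 1, 2, 3, 1]})

# Fill-in ages by passenger class, the same constants used in age_approx
fill_by_class = {1: 37, 2: 29, 3: 24}

# Map each row's class to its fill-in age, then use it only where Age is missing
df['Age'] = df['Age'].fillna(df['Pclass'].map(fill_by_class))
print(df)
```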


Now partition the data into the same 60% training set and 40% test set that we used in the [logistic regression notebook](https://github.com/UC-MACSS/persp-model_W18/blob/master/Notebooks/Classfcn1/KKNlogitLDA.ipynb) and estimate the logistic regression with all the variables.

In [12]:
X = titanic[['Age', 'SibSp', 'Parch', 'Fare', 'male', 'Q', 'S']]
y = titanic['Survived']

# The function train_test_split is imported from sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=25)
LogReg = LogisticRegression(max_iter=200)
print(type(LogReg))
LogReg.fit(X_train, y_train)
print(type(LogReg))
y_pred = LogReg.predict(X_test)
# Note that y_test and y_pred are both 0/1 in a logistic model, so each
# squared error equals the absolute error and squaring changes nothing
y_pred

<class 'sklearn.linear_model.logistic.LogisticRegression'>
<class 'sklearn.linear_model.logistic.LogisticRegression'>


array([1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,

In [13]:
# You can code the MSE yourself
MSE_vs = ((y_test - y_pred) ** 2).sum() / y_pred.shape[0]
print('Validation set MSE = ', MSE_vs)

# Or you can use scikit-learn's method
print('Validation set MSE = ', mean_squared_error(y_test, y_pred))

Validation set MSE =  0.2247191011235955
Validation set MSE =  0.2247191011235955


Below is the estimation of the logistic regression model on all the observations in the dataset, not just the training set observations. What advantages are there to estimating on a subsample of the data (training set) rather than the full data set?

In [14]:
LogReg1 = LogisticRegression(max_iter=200)
LogReg1.fit(X, y)
y1_pred = LogReg1.predict(X)

# You can code the MSE yourself
MSE1_vs = ((y - y1_pred) ** 2).sum() / y1_pred.shape[0]
print('Full sample MSE = ', MSE1_vs)

# Or you can use scikit-learn's method
print('Full sample MSE = ', mean_squared_error(y, y1_pred))

Full sample MSE =  0.21034870641169853
Full sample MSE =  0.21034870641169853
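
One limitation of the validation set approach is that the estimated test MSE depends on which observations happen to land in the test set. The sketch below illustrates this by repeating the 60/40 split with different seeds. It uses synthetic data from `make_classification` rather than the Titanic data, and the names `X_demo`, `y_demo`, and `mse_by_seed` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     random_state=1)

# Repeat the validation set approach with ten different random partitions
mse_by_seed = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4,
                                              random_state=seed)
    model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
    mse_by_seed.append(mean_squared_error(y_te, model.predict(X_te)))

mse_by_seed = np.array(mse_by_seed)
print('min MSE =', mse_by_seed.min(), ' max MSE =', mse_by_seed.max())
```

The spread between the smallest and largest values gives a sense of how much a single split can move the estimate.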


### 1.2. Leave-one-out cross validation
Leave-one-out cross validation (LOOCV) is an approach in which the model is assessed using $N$ different training sets and test sets of a specific size. Let the data have $N$ observations. LOOCV chooses a training set with $N-1$ observations, such that the test set has only one observation $y_i$. Repeat this $N$ times, each time with a slightly different training set, such that each data point is the test set in exactly one of these subsets.

In this case, the mean squared error MSE has no summation because there is only one observation in the test set.

$$ MSE_i = (y_i - \hat{y}_i)^2 $$

The LOOCV estimate for the test MSE is the average of these $N$ test error estimates.

$$ CV_{loo} = \frac{1}{N}\sum_{i=1}^N MSE_i $$
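
The two formulas above can be sketched compactly with scikit-learn's `cross_val_score`. This is an illustrative alternative to an explicit loop, run on synthetic data. The names `X_demo`, `y_demo`, `mse_i`, and `cv_loo` are assumptions and are not part of the Titanic example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set so that the N model fits run quickly
X_demo, y_demo = make_classification(n_samples=60, n_features=4,
                                     random_state=0)

# Each score is the accuracy (0 or 1) on a single held-out observation
scores = cross_val_score(LogisticRegression(max_iter=200), X_demo, y_demo,
                         cv=LeaveOneOut(), scoring='accuracy')

# With 0/1 labels, the squared error on one observation is 1 - accuracy
mse_i = 1.0 - scores
cv_loo = mse_i.mean()
print('LOOCV estimate of test MSE =', cv_loo)
```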

In [15]:
# Define loo as a leave-one-out object, then
# split it into N different partitions

# Note that the LeaveOneOut() function does not work
# well with pandas DataFrames
Xvars = titanic[['Age', 'SibSp', 'Parch', 'Fare',
                 'male', 'Q', 'S']].values
yvars = titanic['Survived'].values
print(type(Xvars), type(yvars))
Xvars, yvars

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


(array([[22.,  1.,  0., ...,  1.,  0.,  1.],
        [38.,  1.,  0., ...,  0.,  0.,  0.],
        [26.,  0.,  0., ...,  0.,  0.,  1.],
        ...,
        [24.,  1.,  2., ...,  0.,  0.,  1.],
        [26.,  0.,  0., ...,  1.,  0.,  0.],
        [32.,  0.,  0., ...,  1.,  1.,  0.]]),
 array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
        1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
        0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
        1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,

In [16]:
N_loo = Xvars.shape[0]
loo = LeaveOneOut()

MSE_vec = np.zeros(N_loo)
loo.get_n_splits(Xvars)

889

Loop over the entire dataset so that each observation serves as the test set exactly once.

In [22]:
# This loop will take 20 or 30 seconds
for train_index, test_index in loo.split(Xvars):
#     print(train_index, test_index)
#     print("-" * 20)
    X_train, X_test = Xvars[train_index], Xvars[test_index]
    y_train, y_test = yvars[train_index], yvars[test_index]
    LogReg = LogisticRegression(max_iter=200)
    LogReg.fit(X_train, y_train)
    y_pred = LogReg.predict(X_test)
    MSE_vec[test_index] = (y_test - y_pred) ** 2
    print('MSE for test set', test_index, ' is', MSE_vec[test_index])

MSE for test set [0]  is [0.]
MSE for test set [1]  is [0.]
MSE for test set [2]  is [0.]
MSE for test set [3]  is [0.]
MSE for test set [4]  is [0.]
MSE for test set [5]  is [0.]
MSE for test set [6]  is [0.]
MSE for test set [7]  is [0.]
MSE for test set [8]  is [0.]
MSE for test set [9]  is [0.]
MSE for test set [10]  is [0.]
MSE for test set [11]  is [0.]
MSE for test set [12]  is [0.]
MSE for test set [13]  is [0.]
MSE for test set [14]  is [1.]
MSE for test set [15]  is [0.]
MSE for test set [16]  is [0.]
MSE for test set [17]  is [1.]
MSE for test set [18]  is [1.]
MSE for test set [19]  is [0.]
MSE for test set [20]  is [0.]
MSE for test set [21]  is [1.]
MSE for test set [22]  is [0.]
MSE for test set [23]  is [1.]
MSE for test set [24]  is [1.]
MSE for test set [25]  is [1.]
MSE for test set [26]  is [0.]
MSE for test set [27]  is [1.]
MSE for test set [28]  is [0.]
MSE for test set [29]  is [0.]
MSE for test set [30]  is [0.]
MSE for test set [31]  is [0.]
MSE for test set [

MSE for test set [310]  is [0.]
MSE for test set [311]  is [1.]
MSE for test set [312]  is [0.]
MSE for test set [313]  is [0.]
MSE for test set [314]  is [0.]
MSE for test set [315]  is [0.]
MSE for test set [316]  is [0.]
MSE for test set [317]  is [0.]
MSE for test set [318]  is [0.]
MSE for test set [319]  is [0.]
MSE for test set [320]  is [0.]
MSE for test set [321]  is [0.]
MSE for test set [322]  is [0.]
MSE for test set [323]  is [0.]
MSE for test set [324]  is [0.]
MSE for test set [325]  is [0.]
MSE for test set [326]  is [0.]
MSE for test set [327]  is [0.]
MSE for test set [328]  is [0.]
MSE for test set [329]  is [0.]
MSE for test set [330]  is [0.]
MSE for test set [331]  is [1.]
MSE for test set [332]  is [0.]
MSE for test set [333]  is [0.]
MSE for test set [334]  is [0.]
MSE for test set [335]  is [0.]
MSE for test set [336]  is [0.]
MSE for test set [337]  is [1.]
MSE for test set [338]  is [0.]
MSE for test set [339]  is [1.]
MSE for test set [340]  is [0.]
MSE for 

MSE for test set [640]  is [0.]
MSE for test set [641]  is [1.]
MSE for test set [642]  is [1.]
MSE for test set [643]  is [0.]
MSE for test set [644]  is [1.]
MSE for test set [645]  is [0.]
MSE for test set [646]  is [1.]
MSE for test set [647]  is [0.]
MSE for test set [648]  is [0.]
MSE for test set [649]  is [0.]
MSE for test set [650]  is [0.]
MSE for test set [651]  is [0.]
MSE for test set [652]  is [0.]
MSE for test set [653]  is [1.]
MSE for test set [654]  is [0.]
MSE for test set [655]  is [0.]
MSE for test set [656]  is [1.]
MSE for test set [657]  is [0.]
MSE for test set [658]  is [0.]
MSE for test set [659]  is [1.]
MSE for test set [660]  is [0.]
MSE for test set [661]  is [0.]
MSE for test set [662]  is [0.]
MSE for test set [663]  is [1.]
MSE for test set [664]  is [0.]
MSE for test set [665]  is [0.]
MSE for test set [666]  is [0.]
MSE for test set [667]  is [0.]
MSE for test set [668]  is [0.]
MSE for test set [669]  is [0.]
MSE for test set [670]  is [0.]
MSE for 

In [24]:
MSE_loo_mean = MSE_vec.mean()
MSE_loo_std = MSE_vec.std()
print('test estimate MSE loocv=', MSE_loo_mean,
      ', test estimate MSE standard err=', MSE_loo_std)

test estimate MSE loocv= 0.21147356580427445 , test estimate MSE standard err= 0.40835339691289413
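
Note that `MSE_vec.std()` is the standard deviation of the individual squared errors. A standard error of the mean would additionally divide by $\sqrt{N}$. The sketch below shows the distinction on a made-up vector. The array `sq_err` is hypothetical, and the simple $\sqrt{N}$ formula treats the errors as independent, which is only an approximation for cross validation errors.

```python
import numpy as np

# Hypothetical per-observation squared errors (0 = correct, 1 = wrong)
sq_err = np.array([0., 0., 1., 0., 0., 0., 1., 0., 0., 0.])

mean_err = sq_err.mean()
# Spread of the individual squared errors
std_dev = sq_err.std()
# Approximate uncertainty of the mean itself, assuming independent errors
std_err = std_dev / np.sqrt(sq_err.size)

print('mean =', mean_err, ' std dev =', std_dev, ' std err =', std_err)
```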


### 1.3. k-fold cross validation
$k$-fold cross validation is a method in which the dataset is randomly divided into $k$ groups (folds) of approximately equal size. Each fold $j=1,...,k$ serves as the test set exactly once, and for each test fold the model is estimated on the data from the other $k-1$ folds.

Let the number of observations in the $j$th fold be $N_j$, and let $\mathcal{K}_j$ be the set of observations in the $j$th fold. The $MSE_j$ of the $j$th fold is:

$$ MSE_j = \frac{1}{N_j}\sum_{i\in\mathcal{K}_j}(y_i - \hat{y}_i)^2 $$

Then **the $k$-fold estimate for the test MSE** is the average of these $k$ test error estimates.

$$ CV_{kf} = \frac{1}{k}\sum_{j=1}^k MSE_j $$

**LOOCV is a special case of $k$-fold cross validation in which $k=N$**.
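
A small check of this equivalence, assuming scikit-learn's `KFold` and `LeaveOneOut` splitters and a made-up array of six observations (`X_demo` is a hypothetical name):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Six hypothetical observations with two features each
X_demo = np.arange(12).reshape(6, 2)
N = X_demo.shape[0]

# Collect the test indices produced by each splitter
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_demo)]
kf_tests = [test.tolist() for _, test in KFold(n_splits=N).split(X_demo)]

# With k = N and no shuffling, every fold holds exactly one observation
print(loo_tests == kf_tests)
```

With `shuffle=True` the folds would still each contain a single observation, only visited in a different order.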

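Let's use the Titanic data again and test our logit model performance with a $k$-fold cross validation. The function `k_fold_stat` below wraps the procedure for any value of $k$, and the cells after it run the same steps with $k=800$.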

In [38]:
def k_fold_stat(k):
    kf = KFold(n_splits=k, random_state=10, shuffle=True)
    MSE_vec_kf = np.zeros(k)
    kf.get_n_splits(Xvars) 
    
    k_ind = int(0)
    for train_index, test_index in kf.split(Xvars):
        X_train, X_test = Xvars[train_index], Xvars[test_index]
        y_train, y_test = yvars[train_index], yvars[test_index]
        LogReg = LogisticRegression(max_iter=300)
        LogReg.fit(X_train, y_train)
        y_pred = LogReg.predict(X_test)
        MSE_vec_kf[k_ind] = ((y_test - y_pred) ** 2).mean()
        k_ind += 1
    MSE_kf_mean = MSE_vec_kf.mean()
    MSE_kf_std = MSE_vec_kf.std()
    return k, MSE_kf_mean, MSE_kf_std

In [32]:
k = 800
kf = KFold(n_splits=k, random_state=10, shuffle=True)
print(type(kf))
MSE_vec_kf = np.zeros(k)
kf.get_n_splits(Xvars)

<class 'sklearn.model_selection._split.KFold'>


800

In [27]:
help(KFold)

Help on class KFold in module sklearn.model_selection._split:

class KFold(_BaseKFold)
 |  KFold(n_splits='warn', shuffle=False, random_state=None)
 |  
 |  K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets. Split
 |  dataset into k consecutive folds (without shuffling by default).
 |  
 |  Each fold is then used once as a validation while the k - 1 remaining
 |  folds form the training set.
 |  
 |  Read more in the :ref:`User Guide <cross_validation>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=3
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.20
 |          ``n_splits`` default value will change from 3 to 5 in v0.22.
 |  
 |  shuffle : boolean, optional
 |      Whether to shuffle the data before splitting into batches.
 |  
 |  random_state : int, RandomState instance or None, optional, default=None
 |      If int, random_state is the seed used by the random number generator;
 |      If R

In [36]:
k_ind = int(0)
for train_index, test_index in kf.split(Xvars):
    print("TRAIN:", train_index, "TEST:", test_index)
    print("-" * 30)
#     print('k index=', k_ind)
    X_train, X_test = Xvars[train_index], Xvars[test_index]
    y_train, y_test = yvars[train_index], yvars[test_index]
    LogReg = LogisticRegression(max_iter=300)
    LogReg.fit(X_train, y_train)
    y_pred = LogReg.predict(X_test)
    MSE_vec_kf[k_ind] = ((y_test - y_pred) ** 2).mean()
    print('MSE for test set', k_ind, ' is', MSE_vec_kf[k_ind])
    k_ind += 1

TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 237 238 239 240 241 242 243 244

 884 885 886 887 888] TEST: [188 475]
------------------------------
MSE for test set 52  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220

MSE for test set 102  is 1.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

MSE for test set 154  is 1.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

MSE for test set 210  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

MSE for test set 259  is 1.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
 235 236 237 23

MSE for test set 309  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

MSE for test set 359  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

MSE for test set 414  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
 235 236 237 23

MSE for test set 469  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

 883 884 885 886 887 888] TEST: [310]
------------------------------
MSE for test set 525  is 0.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 22

MSE for test set 574  is 1.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 23

MSE for test set 628  is 1.0
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 ...
MSE for test set 684  is 0.0
TRAIN: [  0   1   2 ...
 ... 883 884 885 886 887 888] TEST: [15]
------------------------------
MSE for test set 740  is 0.0
TRAIN: [  0   1   2 ...
 ... 883 884 885 886 887 888] TEST: [496]
------------------------------
MSE for test set 791  is 0.0
TRAIN: [  0   1   2 ...

In [37]:
MSE_kf_mean = MSE_vec_kf.mean()
MSE_kf_std = MSE_vec_kf.std()
print('test estimate MSE k-fold=', MSE_kf_mean,
      'test estimate MSE standard err=', MSE_kf_std)

test estimate MSE k-fold= 0.215 test estimate MSE standard err= 0.4000312487793922


### 1.4. Bias versus variance
Recall the test estimate MSE from the LOOCV of approximately 0.2115 and the MSE(LOOCV) standard error of about 0.4084. What happens to the estimated MSE and MSE standard error in the $k$-fold cross validation above as $k$ increases? Try values of $k=2, 10, 50, 100, 800$.

**Note that the LOOCV method has low bias (each model is estimated on nearly the full sample) but high variance (each test error is based on a single observation).** **In contrast, the $k$-fold method has more bias (each model is estimated with less data) but lower variance**, because each test set has more observations.

* $k$-fold cross validation can often provide more accurate estimates of the test error rate.
* $k$-fold is less computationally intensive than LOOCV.
* LOOCV has the least bias.
* LOOCV is the most computationally expensive.

In [40]:
print(k_fold_stat(2))
print(k_fold_stat(10))
print(k_fold_stat(50))
print(k_fold_stat(100))
print(k_fold_stat(800))

(2, 0.20811316934912438, 0.012607551371596318)
(10, 0.2137002042900919, 0.04862207626141106)
(50, 0.2124836601307189, 0.10383210273172495)
(100, 0.21125, 0.1424988493564114)
(800, 0.215, 0.4000312487793922)
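The rise in the reported standard error as $k$ grows can be rationalized with a back-of-the-envelope calculation. This is only a rough sketch under stated assumptions: each squared prediction error is treated as an independent 0/1 draw with a common error rate `p` (set to the LOOCV MSE of 0.2115 quoted above), and the sample size `N = 889` is inferred from the printed indices 0 to 888. The helper name `approx_fold_std` is hypothetical. Under these assumptions, the MSE of a fold with $m = N/k$ test observations has standard deviation $\sqrt{p(1-p)/m}$.

```python
import math

p = 0.2115   # LOOCV test MSE quoted in the text (share of wrong predictions)
N = 889      # assumed sample size, inferred from the printed indices 0..888


def approx_fold_std(k, p=p, N=N):
    """Approximate std. dev. of a single fold's MSE when each fold holds
    about N / k test observations and errors are independent 0/1 draws."""
    m = N / k                      # average test-set size per fold
    return math.sqrt(p * (1 - p) / m)


# Smaller test sets (larger k) give noisier fold-level MSE values.
for k in (2, 10, 50, 100, 800, N):
    print(k, round(approx_fold_std(k), 4))
```

For $k$ equal to the sample size the formula collapses to $\sqrt{p(1-p)} \approx 0.408$, in line with the LOOCV figure quoted above. The approximation is loosest for very small $k$, where the standard deviation is computed from only a handful of folds.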


## 2. Bootstrapping
This name comes from the expression "to pull oneself up by one's own bootstraps." In a way similar to the cross validation methods of the last section, we can use *the bootstrap* to quantify the uncertainty associated with a given estimator, learning model, or method. In the econometrics and statistics literature, this often shows up as "bootstrapped standard errors". Bootstrapping is valuable because it is so widely applicable to a range of models.

1. Randomly draw $S$ datasets of size $N_S$ with replacement. Define each training set of observations as $\mathcal{K}_s$ and each corresponding test set as $\mathcal{-K}_{s}$.
2. Estimate the model on each training set $\mathcal{K}_s$ and calculate the MSE for the corresponding test set $\mathcal{-K}_{s}$.


**The bootstrap estimate for the test MSE is the average MSE from each random test set.**

$$ CV_{boot} = \frac{1}{S}\sum_{s=1}^S MSE_s $$
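The first step of the procedure above, drawing with replacement, can be sketched in plain Python. This is a minimal sketch under assumptions that are not in the original notebook: the training draw has the same size as the sample ($N_S = N$), the test set for each draw is the set of observations that were never drawn (often called "out-of-bag" observations), and the helper name `bootstrap_split` and the sample size of 889 are chosen only for illustration.

```python
import random


def bootstrap_split(n_obs, rng):
    """Draw n_obs training indices with replacement. The test set is the
    observations that were never drawn (an assumption of this sketch)."""
    train_idx = [rng.randrange(n_obs) for _ in range(n_obs)]
    drawn = set(train_idx)
    test_idx = [i for i in range(n_obs) if i not in drawn]
    return train_idx, test_idx


rng = random.Random(25)   # fixed seed so the draw is reproducible
n_obs = 889               # hypothetical sample size
train_idx, test_idx = bootstrap_split(n_obs, rng)

# Sampling with replacement repeats some observations and omits others,
# so the number of distinct training observations is below n_obs.
print(len(train_idx), len(set(train_idx)), len(test_idx))
```

With replacement, each observation is left out of a given draw with probability $(1 - 1/N)^N \approx e^{-1} \approx 0.368$, so roughly a third of the sample is available as a test set in each draw. The index lists could then be used to select rows of `X` and `y` inside a loop over $s = 1, \dots, S$.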

In [41]:
N_bs = 800  # S = 800
MSE_vec_bs = np.zeros(N_bs)

In [42]:
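for bs_ind in range(N_bs):
    # The sizes of the training set and test set are fixed in this step.
    # Note: train_test_split partitions the sample without replacement.
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.4)
    # Estimate the logit model on the training set
    LogReg = LogisticRegression(max_iter=200)
    LogReg.fit(X_train, y_train)
    # Predict on the test set and store the MSE for this draw
    y_pred = LogReg.predict(X_test)
    MSE_vec_bs[bs_ind] = ((y_test - y_pred) ** 2).mean()
    print('MSE for test set', bs_ind, ' is', MSE_vec_bs[bs_ind])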

MSE for test set 0  is 0.2303370786516854
MSE for test set 1  is 0.19101123595505617
MSE for test set 2  is 0.19662921348314608
MSE for test set 3  is 0.21629213483146068
MSE for test set 4  is 0.21348314606741572
MSE for test set 5  is 0.21067415730337077
MSE for test set 6  is 0.21629213483146068
MSE for test set 7  is 0.20224719101123595
MSE for test set 8  is 0.18820224719101122
MSE for test set 9  is 0.24157303370786518
MSE for test set 10  is 0.2247191011235955
MSE for test set 11  is 0.21067415730337077
MSE for test set 12  is 0.22752808988764045
MSE for test set 13  is 0.21629213483146068
MSE for test set 14  is 0.23314606741573032
MSE for test set 15  is 0.2050561797752809
MSE for test set 16  is 0.23595505617977527
MSE for test set 17  is 0.19101123595505617
MSE for test set 18  is 0.20224719101123595
MSE for test set 19  is 0.2303370786516854
MSE for test set 20  is 0.21067415730337077
MSE for test set 21  is 0.22752808988764045
MSE for test set 22  is 0.25280898876404495
...

MSE for test set 222  is 0.1797752808988764
MSE for test set 223  is 0.21348314606741572
MSE for test set 224  is 0.1797752808988764
MSE for test set 225  is 0.20224719101123595
MSE for test set 226  is 0.2303370786516854
MSE for test set 227  is 0.24719101123595505
MSE for test set 228  is 0.21629213483146068
MSE for test set 229  is 0.20786516853932585
MSE for test set 230  is 0.22191011235955055
MSE for test set 231  is 0.20786516853932585
MSE for test set 232  is 0.199438202247191
MSE for test set 233  is 0.17696629213483145
MSE for test set 234  is 0.23876404494382023
MSE for test set 235  is 0.21910112359550563
MSE for test set 236  is 0.23876404494382023
MSE for test set 237  is 0.18258426966292135
MSE for test set 238  is 0.20224719101123595
MSE for test set 239  is 0.2303370786516854
MSE for test set 240  is 0.2247191011235955
MSE for test set 241  is 0.20224719101123595
MSE for test set 242  is 0.21348314606741572
MSE for test set 243  is 0.21067415730337077
...

MSE for test set 411  is 0.22191011235955055
MSE for test set 412  is 0.20786516853932585
MSE for test set 413  is 0.19382022471910113
MSE for test set 414  is 0.21629213483146068
MSE for test set 415  is 0.199438202247191
MSE for test set 416  is 0.2050561797752809
MSE for test set 417  is 0.2050561797752809
MSE for test set 418  is 0.21629213483146068
MSE for test set 419  is 0.19382022471910113
MSE for test set 420  is 0.20786516853932585
MSE for test set 421  is 0.21910112359550563
MSE for test set 422  is 0.20224719101123595
MSE for test set 423  is 0.21067415730337077
MSE for test set 424  is 0.23314606741573032
MSE for test set 425  is 0.21067415730337077
MSE for test set 426  is 0.24157303370786518
MSE for test set 427  is 0.24157303370786518
MSE for test set 428  is 0.2247191011235955
MSE for test set 429  is 0.18820224719101122
MSE for test set 430  is 0.199438202247191
MSE for test set 431  is 0.199438202247191
MSE for test set 432  is 0.20224719101123595
...

MSE for test set 601  is 0.23595505617977527
MSE for test set 602  is 0.21348314606741572
MSE for test set 603  is 0.1853932584269663
MSE for test set 604  is 0.2050561797752809
MSE for test set 605  is 0.21348314606741572
MSE for test set 606  is 0.199438202247191
MSE for test set 607  is 0.17696629213483145
MSE for test set 608  is 0.21910112359550563
MSE for test set 609  is 0.2443820224719101
MSE for test set 610  is 0.20224719101123595
MSE for test set 611  is 0.199438202247191
MSE for test set 612  is 0.20224719101123595
MSE for test set 613  is 0.21067415730337077
MSE for test set 614  is 0.1797752808988764
MSE for test set 615  is 0.20786516853932585
MSE for test set 616  is 0.21067415730337077
MSE for test set 617  is 0.23314606741573032
MSE for test set 618  is 0.23595505617977527
MSE for test set 619  is 0.199438202247191
MSE for test set 620  is 0.22752808988764045
MSE for test set 621  is 0.20224719101123595
MSE for test set 622  is 0.20224719101123595
...

In [43]:
MSE_bs_mean = MSE_vec_bs.mean()
MSE_bs_std = MSE_vec_bs.std()
print('test estimate MSE bootstrap=', MSE_bs_mean,
      'test estimate MSE standard err=', MSE_bs_std)

test estimate MSE bootstrap= 0.21225421348314608 test estimate MSE standard err= 0.017542375190641984
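The "bootstrapped standard errors" mentioned at the start of this section apply the same resampling idea to an estimator rather than to a test MSE. The following is a minimal sketch on synthetic data, not the notebook's Titanic data: it resamples a made-up sample with replacement, recomputes the sample mean each time, and takes the standard deviation of those means as the standard error. The variable names and the choice of $S = 2000$ draws are illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(0)
# Synthetic sample of 200 standard normal draws (for illustration only)
data = [rng.gauss(0, 1) for _ in range(200)]

S = 2000                  # number of bootstrap resamples (assumed)
boot_means = []
for _ in range(S):
    # Draw a resample of the same size as the data, with replacement
    resample = rng.choices(data, k=len(data))
    boot_means.append(sum(resample) / len(resample))

# Bootstrapped standard error: spread of the estimator across resamples
boot_se = statistics.stdev(boot_means)
# Textbook standard error of the mean, for comparison
analytic_se = statistics.stdev(data) / math.sqrt(len(data))
print(boot_se, analytic_se)
```

For the sample mean the two numbers should be close, which is a useful sanity check. The appeal of the bootstrap is that the same loop works for estimators with no convenient analytic standard error.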


## 3. How to use cross validation for model assessment and selection
**Model assessment** is the process of evaluating how accurately a particular model, estimated on training data, predicts test data.

There are many criteria for model assessment. **The most common measure of model accuracy on test data is the mean squared error $MSE$ or root mean squared error $rMSE$**. However, we have seen that the measure $MSE$ varies depending on which cross validation method is used.
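As a small worked example of these two measures, the sketch below computes $MSE$ and $rMSE$ for a made-up vector of binary outcomes and predictions. The function names `mse` and `rmse` and the data are illustrative assumptions, not part of the notebook.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average squared gap between actual and predicted."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [1, 0, 1, 1]   # made-up binary outcomes
y_pred = [1, 1, 1, 0]   # made-up binary predictions

print(mse(y_true, y_pred), rmse(y_true, y_pred))
```

When both outcomes and predictions are coded 0/1, each squared error is either 0 or 1, so the $MSE$ equals the share of misclassified observations.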

[JWHT13] define **Model selection** as the process of "selecting the proper level of flexibility for a model." That is, they define model selection as a process of tuning a particular family or class of models to maximize accuracy on test set prediction. **So model selection involves model assessment.** However, one can expand this definition of model selection to include testing multiple families or classes of model in terms of accuracy--a horse race.

The narrower [JWHT13] definition of model selection is analogous to maximizing the efficiency of a particular horse and rider in a horse race in which the horse is racing against itself. My broader definition of model selection is analogous to running a number of horses in a race, with each horse being optimized for efficiency. In this broader definition, many variables are at play and the data must be sampled many times (think of cross-validation optimization on each horse [model]).

This broader definition of model selection is computationally intensive. But that is where TensorFlow shines. [TensorFlow](https://www.tensorflow.org/) is an open source software library developed by the Google Brain AI group. Broadly, TensorFlow is a system of libraries that facilitate efficient, parallel, and scalable use of available processors (CPUs and GPUs) as well as memory management. For statistical learning, TensorFlow is optimized to efficiently run model assessment and model selection algorithms.

Many good empirical papers run a horse race on tuned statistical learning models to maximize predictive accuracy. It is hard to know *ex ante* which model will be the most accurate.
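The structure of such a horse race can be sketched without any modeling library. This is a minimal sketch under loud assumptions: the data are synthetic, the two "horses" (`MajorityClass` and `ThresholdRule`) are toy stand-ins for real model families, and the helper names `k_fold_indices` and `cv_mse` are hypothetical. Each model is scored by its average test MSE over the same $k$ folds, and the model with the lowest score wins.

```python
import random


def k_fold_indices(n_obs, k, rng):
    """Shuffle the indices 0..n_obs-1 and deal them into k test folds."""
    idx = list(range(n_obs))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


class MajorityClass:
    """Toy model 1: always predict the most common training label."""
    def fit(self, X, y):
        self.label = int(2 * sum(y) >= len(y))
        return self

    def predict(self, X):
        return [self.label for _ in X]


class ThresholdRule:
    """Toy model 2: predict 1 when x exceeds a cutoff tuned on training data."""
    def fit(self, X, y):
        def n_errors(t):
            return sum(int(x > t) != yi for x, yi in zip(X, y))
        self.cutoff = min(sorted(set(X)), key=n_errors)
        return self

    def predict(self, X):
        return [int(x > self.cutoff) for x in X]


def cv_mse(model_cls, X, y, k, rng):
    """Average test MSE of model_cls over k cross-validation folds."""
    fold_mse = []
    for test_idx in k_fold_indices(len(y), k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = model_cls().fit([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        pred = model.predict([X[i] for i in test_idx])
        sq_err = [(y[i] - p) ** 2 for i, p in zip(test_idx, pred)]
        fold_mse.append(sum(sq_err) / len(sq_err))
    return sum(fold_mse) / k


# Synthetic data: the outcome depends on x plus noise (illustration only)
rng = random.Random(0)
X = [rng.random() for _ in range(200)]
y = [int(x + rng.gauss(0, 0.2) > 0.5) for x in X]

# Re-seeding inside the loop gives every model the same folds
results = {cls.__name__: cv_mse(cls, X, y, k=10, rng=random.Random(1))
           for cls in (MajorityClass, ThresholdRule)}
best = min(results, key=results.get)
print(results, best)
```

Giving every model the same folds keeps the comparison fair, because differences in the scores then reflect the models rather than the luck of the partition.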

### 3.1. Gopalan, "Predicting Infant Mortality: Minimizing False Negatives"
[G18] is able to maintain overall accuracy and decrease false negatives from 74% to 7%. She uses a regularization method on one of her variables (class), called "Tomek links", to make the variable more informative. She then tests five different predictive models (random forest, AdaBoost, XGBoost, decision tree, logistic regression). She also tries a couple of additional data transformations.

In her case, the random forest model had the best performance.

**[G18, Table 6] Comparison of Classifiers: Tomek Links.** FNR=false negative rate, FPR=false positive rate, AUROC=area under the ROC curve.

| Model | Train-FNR | Train-FPR  | Train-AUROC | Test-FNR  | Test-FPR  | Test-AUROC |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 0.06 | 0.06 | 0.62  | 0.07 | 0.06 | 0.62  |
| AdaBoost      | 0.40 | 0.07 | 0.51  | 0.45 | 0.07 | 0.50  |
| XGBoost       | 0.40 | 0.07 | 0.51  | 0.43 | 0.07 | 0.51  |
| Decision Tree | 0.03 | 0.06 | 0.62  | 0.07 | 0.06 | 0.62  |
| Logistic Reg  | 0.41 | 0.07 | 0.51  | 0.43 | 0.07 | 0.65  |
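The two error rates reported in the table can be computed directly from actual and predicted labels. The sketch below uses made-up labels, not data from [G18], and the helper name `fnr_fpr` is hypothetical. It uses the standard definitions $FNR = FN / (FN + TP)$ and $FPR = FP / (FP + TN)$.

```python
def fnr_fpr(y_true, y_pred):
    """Return (false negative rate, false positive rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    return fn / (fn + tp), fp / (fp + tn)


# Made-up labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

fnr, fpr = fnr_fpr(y_true, y_pred)
print(fnr, fpr)
```

Here one of the four actual positives is missed and one of the six actual negatives is flagged. Reporting the two rates separately matters when, as in [G18], one kind of mistake is much more costly than the other.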

## 4. Problem with full-sample estimation
Overfitting. Maximizing accuracy in the training set can lead a model to incorporate noise from the data.
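A small simulation makes the point concrete. This is a minimal sketch on synthetic data, using a one-nearest-neighbor rule as a stand-in for any very flexible model; the function names and the data-generating process are assumptions made for illustration. Because each training point is its own nearest neighbor, the rule reproduces the training labels exactly, noise included.

```python
import random


def one_nn_predict(x_train, y_train, x_new):
    """Predict the label of the training observation closest to x_new."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return y_train[nearest]


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted labels."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(0)


def make_data(n):
    """Synthetic sample: the label depends on x plus noise."""
    X = [rng.random() for _ in range(n)]
    y = [int(x + rng.gauss(0, 0.2) > 0.5) for x in X]
    return X, y


X_train, y_train = make_data(100)
X_test, y_test = make_data(100)

# Evaluating on the same data used for fitting rewards memorized noise
train_mse = mse(y_train, [one_nn_predict(X_train, y_train, x) for x in X_train])
# Evaluating on fresh data reveals the prediction error that remains
test_mse = mse(y_test, [one_nn_predict(X_train, y_train, x) for x in X_test])
print(train_mse, test_mse)
```

The training MSE is zero by construction, while the test MSE is not, which is why the resampling methods in this notebook evaluate fit on observations that were held out of estimation.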

## 5. References
* Gopalan, Sushmita V., "Predicting Infant Mortality: Minimizing False Negatives," MACSS Thesis, University of Chicago (April 2018).
* James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, [*An Introduction to Statistical Learning with Applications in R*](http://link.springer.com.proxy.uchicago.edu/book/10.1007%2F978-1-4614-7138-7), New York, Springer (2013).