In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
run src/preprocessing.py

In [3]:
dataset_1.shape, dataset_2.shape, dataset_3.shape, dataset_4.shape

((1444, 382), (1444, 390), (1444, 382), (1444, 390))

## Model Selection: Cross-Validation

In the next phase of this project we move into developing our machine learning models. We have previously discussed model selection and have considered managing the Bias-Variance Tradeoff as we fit our predictive model. We primarily focused on identifying the simplest possible model as a way of making sure that our model generalizes to new data. Now we expand on this by examining three new concepts in model assessment and selection.

1. using cross-validation to study model variance
1. applying regularization to help our models generalize
1. using ensembling to help our models generalize

One commonly held misconception is that cross-validation helps models to generalize. This is not the case. Rather, cross-validation can be used to identify potential issues and to optimize model hyperparameters, with the goal of choosing the best possible model.
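As a minimal sketch of using cross-validation to choose a hyperparameter, the example below searches over the regularization strength of a `Ridge` model. The synthetic data, the candidate `alphas`, and the choice of `GridSearchCV` with 5 folds are assumptions made for illustration; they are not taken from this project's datasets.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
coef = rng.normal(size=20)
y = X @ coef + rng.normal(scale=0.5, size=200)

# Hold out a final test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Cross-validation compares the candidate alphas on held-out folds;
# it does not change how any single fitted model behaves.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
test_r2 = search.score(X_test, y_test)  # R^2 of the refit best model
print(best_alpha, test_r2)
```

The regularization itself is what helps the model generalize; cross-validation only tells us which setting of it looks best on data the model was not fit to.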

#### The Validation Set Approach

Cross-validation is a resampling technique: it is simply the creative use of collected data. We have already seen a very simple cross-validation approach, the train-test split, also called the Validation Set Approach.

![](doc/img/Chapter5/5-1.png)

In [4]:
from time import time
from sklearn.model_selection import train_test_split

In [5]:
(dataset_1.shape,
 dataset_2.shape,
 dataset_3.shape,
 dataset_4.shape)

((1444, 382), (1444, 390), (1444, 382), (1444, 390))

In [6]:
np.testing.assert_allclose(dataset_1.index, target_1.index)
np.testing.assert_allclose(dataset_2.index, target_2.index)
np.testing.assert_allclose(dataset_3.index, target_3.index)
np.testing.assert_allclose(dataset_4.index, target_4.index)

In [7]:
ttsplit_1 = train_test_split(dataset_1, target_1, test_size=0.4, random_state=0)
ttsplit_2 = train_test_split(dataset_2, target_2, test_size=0.4, random_state=0)
ttsplit_3 = train_test_split(dataset_3, target_3, test_size=0.4, random_state=0)
ttsplit_4 = train_test_split(dataset_4, target_4, test_size=0.4, random_state=0)

In [10]:
[obj.shape for obj in ttsplit_1]

[(866, 382), (578, 382), (866,), (578,)]

In [11]:
def fit_score(model, data):
    # data is the 4-tuple returned by train_test_split
    X_train, X_test, y_train, y_test = data

    start = time()
    model.fit(X_train, y_train)
    elapsed = time() - start  # fit time in seconds
    return model.score(X_test, y_test), elapsed

In [12]:
from sklearn.linear_model import Lasso, Ridge

In [13]:
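# max_iter is passed as an integer; recent scikit-learn releases reject floats such as 1E5
print(fit_score(Ridge(max_iter=100_000), ttsplit_1))
print(fit_score(Ridge(max_iter=100_000), ttsplit_2))
print(fit_score(Ridge(max_iter=100_000), ttsplit_3))
print(fit_score(Ridge(max_iter=100_000), ttsplit_4))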

(0.8986086275180848, 0.04617810249328613)
(0.898581658076641, 0.01986098289489746)
(0.8992497770091992, 0.012306928634643555)
(0.8993027289220459, 0.012377023696899414)


In [14]:
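# max_iter is passed as an integer; recent scikit-learn releases reject floats such as 1E4
print(fit_score(Lasso(max_iter=10_000), ttsplit_1))
print(fit_score(Lasso(max_iter=100_000), ttsplit_2))
print(fit_score(Lasso(max_iter=10_000), ttsplit_3))
print(fit_score(Lasso(max_iter=100_000), ttsplit_4))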

(0.8758759487036974, 1.3665411472320557)
(0.8758706940367507, 6.971800804138184)
(0.8734491171077683, 0.6850090026855469)
(0.8734608343787743, 3.6014630794525146)


#### Leave-One-Out Cross-Validation

An alternative to using a single validation set is using **leave-one-out cross-validation** (LOOCV). 

![](doc/img/Chapter5/5-3.png)

Here, instead of creating two sets, we create $n$ sets and fit $n$ models. Using this method, each data point is used as a testing point exactly once. To assess the performance we simply take the average over all models:

$$\text{CV}_n=\mathbb{E}\left[MSE(f_i)\right]$$

One drawback to this approach is the substantial time required to fit a model for each data point.
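The averaging described above can be sketched with `cross_val_score` and a `LeaveOneOut` splitter. The synthetic data and the `alpha` value are assumptions for illustration. Mean squared error is used as the per-point score because $R^2$ is not well defined when the test set holds a single observation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

# scikit-learn reports the negated MSE so that larger is always better.
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X, y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)

n_models = len(neg_mse)   # one fitted model per observation
cv_n = -neg_mse.mean()    # average squared error over all n models
print(n_models, cv_n)
```

Even on this toy problem the cost is visible in the structure: 30 observations mean 30 separate fits.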

In [68]:
from sklearn.model_selection import LeaveOneOut

In [70]:
def fit_score_loo(model, dataset, target):
    loo = LeaveOneOut()
    scores = []
    start = time()
    for train, test in loo.split(dataset, target):
        # convert positional indices to index labels
        train = dataset.index[train]
        test = dataset.index[test]

        X_train = dataset.loc[train]
        X_test  = dataset.loc[test]
        y_train = target.loc[train]  # labels come from target, not dataset
        y_test  = target.loc[test]

        model.fit(X_train, y_train)
        # note: score() is R^2, which is not well defined on a single held-out point
        scores.append(model.score(X_test, y_test))

    end = time() - start
    scores = np.array(scores)
    print("Mean: {:6} Variance: {:6} Time: {:6}".format(scores.mean(), scores.var(), end))


In [71]:
fit_score_loo(Ridge(), dataset_1, target_1)
fit_score_loo(Ridge(), dataset_2, target_2)
fit_score_loo(Ridge(), dataset_3, target_3)
fit_score_loo(Ridge(), dataset_4, target_4)

Mean:    0.0 Variance:    0.0 Time: 179.01318764686584
Mean:    0.0 Variance:    0.0 Time: 187.70759654045105
Mean:    0.0 Variance:    0.0 Time: 167.30623126029968
Mean:    0.0 Variance:    0.0 Time: 159.89035868644714


In [72]:
fit_score_loo(Lasso(), dataset_1, target_1)
fit_score_loo(Lasso(), dataset_2, target_2)
fit_score_loo(Lasso(), dataset_3, target_3)
fit_score_loo(Lasso(), dataset_4, target_4)

Mean:    0.0 Variance:    0.0 Time: 5569.9769694805145
Mean:    0.0 Variance:    0.0 Time: 5727.205169916153
Mean:    0.0 Variance:    0.0 Time: 5659.567879199982
Mean:    0.0 Variance:    0.0 Time: 5761.360241889954


#### K-Fold Cross-Validation

It is usually not practical to use LOOCV. An acceptable alternative is to use **k-fold cross-validation** (KCV). In this method the data set is split into $k$ groups. Then, $k$ models are fit. Each model uses exactly one of the groups as a validation set and the remaining data as the training set. As before, the cross-validation score is simply the average of the scores across all of the models:

$$\text{CV}_k=\mathbb{E}\left[MSE(f_i)\right]$$

![](doc/img/Chapter5/5-5.png)

Typical values of $k$ are $k=5$ or $k=10$.
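To make the splitting concrete, here is a small sketch that checks two properties of `KFold`: the validation folds are nearly equal in size, and every observation is held out exactly once. The toy array of 23 observations is an assumption chosen so that the folds cannot all be the same size.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(23).reshape(-1, 1)  # 23 toy observations

kf = KFold(n_splits=5)
fold_sizes = []
times_validated = np.zeros(len(X), dtype=int)

for train_idx, test_idx in kf.split(X):
    fold_sizes.append(len(test_idx))
    times_validated[test_idx] += 1
    # Training and validation indices never overlap within a split.
    assert len(np.intersect1d(train_idx, test_idx)) == 0

print(fold_sizes)        # size of each validation fold
print(times_validated)   # how many times each point was held out
```

Without shuffling, `KFold` assigns contiguous blocks of rows to each fold, so the row order of the data matters.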

In [15]:
from sklearn.model_selection import KFold

In [16]:
X_df = pd.DataFrame([
    {'x' : 1},
    {'x' : 2},
    {'x' : 3},
    {'x' : 4},
    {'x' : 5},
    {'x' : 6},
    {'x' : 7},
    {'x' : 8},
    {'x' : 9},
    {'x' : 10},
])

X_df.index = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

In [17]:
X_df = X_df.drop('E')

In [18]:
X_df

Unnamed: 0,x
A,1
B,2
C,3
D,4
F,6
G,7
H,8
I,9
J,10


In [19]:
kf = KFold(n_splits=3)
splitter = kf.split(X_df)

In [24]:
train, test = next(splitter)
train, test

(array([3, 4, 5, 6, 7, 8]), array([0, 1, 2]))

In [21]:
train = X_df.index[train]

In [22]:
train

Index(['D', 'F', 'G', 'H', 'I', 'J'], dtype='object')

In [25]:
def fit_score_kfold(model, dataset, target, folds=5):
    kf = KFold(n_splits=folds)
    scores = []
    start = time()
    for train, test in kf.split(dataset, target):
        train = dataset.index[train]
        test = dataset.index[test]

        X_train = dataset.loc[train]
        X_test  = dataset.loc[test]
        y_train = target.loc[train]
        y_test  = target.loc[test]
    
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    
    scores = np.array(scores)
    end = time() - start 

    print("Mean: {:6} Variance: {:6} Time: {:6}".format(scores.mean(), scores.var(), end))

In [55]:
fit_score_kfold(Ridge(), dataset_1, target_1)
fit_score_kfold(Ridge(), dataset_2, target_2)
fit_score_kfold(Ridge(), dataset_3, target_3)
fit_score_kfold(Ridge(), dataset_4, target_4)

Mean: 0.8853121172449191 Variance: 0.0002229144246784079 Time: 0.1946556568145752
Mean: 0.8853098599136107 Variance: 0.00022254648260988997 Time: 0.24270319938659668
Mean: 0.8855857104448723 Variance: 0.00022384015204386532 Time: 0.2041628360748291
Mean: 0.8855848356167636 Variance: 0.00022431601124553446 Time: 0.33058595657348633


In [56]:
fit_score_kfold(Lasso(), dataset_1, target_1)
fit_score_kfold(Lasso(), dataset_2, target_2)
fit_score_kfold(Lasso(), dataset_3, target_3)
fit_score_kfold(Lasso(), dataset_4, target_4)



Mean: 0.870535279080752 Variance: 0.000367807488798026 Time: 3.989431381225586
Mean: 0.8705024797773 Variance: 0.00037061494130101127 Time: 3.5346145629882812
Mean: 0.8676419438563763 Variance: 0.00021040580637601814 Time: 3.3818252086639404
Mean: 0.8675954486257649 Variance: 0.00020691094833489974 Time: 3.176971912384033


### Stratified K-Fold

In [36]:
from sklearn.model_selection import StratifiedKFold

In [26]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=3)
splitter = kf.split(X)

train_index, test_index = next(splitter)

In [31]:
test_index

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

In [32]:
y[test_index]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

In [33]:
y[train_index]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [38]:
kf = StratifiedKFold(n_splits=3)
splitter = kf.split(X, y)

train_index, test_index = next(splitter)

In [39]:
test_index

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  50,  51,  52,  53,  54,  55,  56,  57,  58,
        59,  60,  61,  62,  63,  64,  65,  66, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116])

In [40]:
y[test_index]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2])

In [41]:
y[train_index]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
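The contrast between the two splitters can be summarized by counting classes in each validation fold. This is a sketch; the helper name `class_counts_per_fold` is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts_per_fold(splitter):
    """Return [count of class 0, 1, 2] for each validation fold."""
    return [
        np.bincount(y[test_idx], minlength=3).tolist()
        for _, test_idx in splitter.split(X, y)
    ]

# Iris is sorted by class, so unshuffled KFold puts one class in each fold.
plain = class_counts_per_fold(KFold(n_splits=3))

# StratifiedKFold keeps the class proportions roughly equal in every fold.
strat = class_counts_per_fold(StratifiedKFold(n_splits=3))

print(plain)
print(strat)
```

With plain `KFold` on sorted labels, each model is validated on a class it never saw during training, which is why stratification (or shuffling) matters for classification problems.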

### Bias-Variance Trade-Off for k-Fold Cross-Validation

In terms of bias, it is clear that LOOCV will have lower bias than KCV when $k < n$. This is because each model is trained using $n-1$ points, which is nearly all of the training data. Since KCV uses less of the data, it has less ability to learn the phenomenon represented by the data and is therefore more biased than LOOCV.

On the other hand, LOOCV has more variance than KCV. This is because LOOCV involves fitting and then averaging the performance of $n$ models, whereas KCV does this over $k$ models. Furthermore, the $n$ LOOCV models are more correlated with each other than are the $k$ KCV models. This is clear because each LOOCV model is identical to any other LOOCV model save for one point. Meanwhile each KCV model differs from any other KCV model in $n/k$ points. It can be shown that the mean of highly correlated quantities has higher variance than does the mean of quantities that are not as highly correlated. In other words, the LOOCV has higher variance than does the KCV.
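The claim about correlated quantities can be made concrete under a simplifying assumption that is not in the text above: suppose the $m$ per-model scores $Z_1,\dots,Z_m$ share a common variance $\sigma^2$ and a common pairwise correlation $\rho$. Then

$$\operatorname{Var}\left(\frac{1}{m}\sum_{i=1}^{m} Z_i\right)=\frac{1}{m^2}\left[m\sigma^2+m(m-1)\rho\sigma^2\right]=\rho\sigma^2+\frac{(1-\rho)\sigma^2}{m}$$

With $\rho=0$ the variance shrinks like $\sigma^2/m$, but as $\rho\to 1$ it approaches $\sigma^2$ no matter how many terms are averaged. Under this idealization, averaging the $n$ strongly correlated LOOCV scores buys less variance reduction than the count $n$ alone would suggest.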