In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
run src/preprocessing.py

## Model Selection: Cross-Validation

In the next phase of this project we move into developing our machine learning models. We have previously about model selection and have considered managing the Bias-Variance Tradeoff as we fit our predictive model. We primarily focused on identifying the simplest possible model as a way to making sure that our model generalizes to new data. Now we expand on this by examining three new concepts in model assessment and selection.

1. using cross-validation to study model variance
1. applying regularization to help our models generalize
1. using emsembling to help our models generalize 

One commonly held misconceptions is that cross-validation can to help models to generalize. This is not the case. Rather, cross-validation can be used to help to identify potential issues and to optimize model hyperparameters toward the end of choosing the best possible model.

#### The Validation Set Approach

Cross-validation is a resampling technique and is simply the creative use of collected data. We have already seen a very simple cross-validation approach, the train-test split also called The Validation Set Approach.

![](doc/img/Chapter5/5-1.png)

In [3]:
from time import time
from sklearn.model_selection import train_test_split

In [4]:
(dataset_1.shape,
 dataset_2.shape,
 dataset_3.shape,
 dataset_4.shape)

((1444, 382), (1444, 390), (1444, 382), (1444, 390))

In [5]:
np.testing.assert_allclose(dataset_1.index, target_1.index)
np.testing.assert_allclose(dataset_2.index, target_2.index)
np.testing.assert_allclose(dataset_3.index, target_3.index)
np.testing.assert_allclose(dataset_4.index, target_4.index)

In [6]:
ttsplit_1 = train_test_split(dataset_1, target_1, test_size=0.4, random_state=0)
ttsplit_2 = train_test_split(dataset_2, target_1, test_size=0.4, random_state=0)
ttsplit_3 = train_test_split(dataset_3, target_1, test_size=0.4, random_state=0)
ttsplit_4 = train_test_split(dataset_4, target_1, test_size=0.4, random_state=0)

In [7]:
def fit_score(model, data):
    X_train = data[0]
    X_test  = data[1]
    y_train = data[2]
    y_test  = data[3]
    
    start = time()
    model.fit(X_train, y_train)
    end = time() - start 
    return model.score(X_test, y_test), end

In [8]:
from sklearn.linear_model import Lasso, Ridge

In [9]:
print(fit_score(Ridge(max_iter=1E5), ttsplit_1))
print(fit_score(Ridge(max_iter=1E5), ttsplit_2))
print(fit_score(Ridge(max_iter=1E5), ttsplit_3))
print(fit_score(Ridge(max_iter=1E5), ttsplit_4))

(0.89860862751808546, 0.05988335609436035)
(0.89858127471660521, 0.04999995231628418)
(0.89924977700919972, 0.11284351348876953)
(0.89930376234587361, 0.15570759773254395)


In [11]:
print(fit_score(Lasso(max_iter=1E4), ttsplit_1))
print(fit_score(Lasso(max_iter=1E5), ttsplit_2))
print(fit_score(Lasso(max_iter=1E4), ttsplit_3))
print(fit_score(Lasso(max_iter=1E5), ttsplit_4))

(0.87587594870369756, 2.7831077575683594)
(0.87587066389618007, 14.507377624511719)
(0.87344492815283581, 1.4013075828552246)
(0.87345792431143909, 8.512555122375488)


#### Leave-One-Out Cross-Validation

An alternative to using a single validation set is using **leave-one-out cross-validation** (LOOCV). 

![](doc/img/Chapter5/5-3.png)

Here, instead of creating two sets, we create $n$ sets and fit $n$ models. Using this method, each data point is used as a testing point exactly once. To assess the performance we simply take the average over all models

$$\text{CV}_n=\mathbb{E}\left[MSE(f_i)\right]$$

One draw back to this approach is the substantial time required to set a model for each data point.

In [14]:
from sklearn.model_selection import LeaveOneOut

In [15]:
def fit_score_loo(model, dataset, target):
    loo = LeaveOneOut()
    scores = []
    for train, test in loo.split(dataset, target):
        train = dataset.index[train]
        test = dataset.index[test]

        X_train = dataset.loc[train]
        X_test  = dataset.loc[test]
        y_train = dataset.loc[train]
        y_test  = dataset.loc[test]
    
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    
    scores = np.array(scores)
    print("Mean: {} Variance: {}").format(scores.mean(), scores.var())

In [None]:
# print(fit_score_loo(Ridge(), dataset_1, target_1))
# print(fit_score_loo(Ridge(), dataset_2, target_2))
# print(fit_score_loo(Ridge(), dataset_3, target_3))
# print(fit_score_loo(Ridge(), dataset_4, target_4))

In [16]:
# print(fit_score_loo(Lasso(), dataset_1, target_1))
# print(fit_score_loo(Lasso(), dataset_2, target_2))
# print(fit_score_loo(Lasso(), dataset_3, target_3))
# print(fit_score_loo(Lasso(), dataset_4, target_4))

#### K-Fold Cross-Validation

It is usually not practical to use LOOCV. Unacceptable alternative is to use **k-fold cross-validation** (KCV). In this method the data set is split into $k$ groups. Then, $k$ models are fit. Uses exactly one of the groups as a validation set And the remaining data as the training set. As before, the cross validation score is simply the average of the scores across all of the models

$$\text{CV}_k=\mathbb{E}\left[MSE(f_i)\right]$$

![](doc/img/Chapter5/5-5.png)

Typical values of $k$ are $k=5$ or $k=10$.

In [17]:
from sklearn.model_selection import KFold

In [23]:
def fit_score_kfold(model, dataset, target, folds=5):
    kf = KFold(n_splits=folds)
    scores = []
    start = time()
    for train, test in kf.split(dataset, target):
        train = dataset.index[train]
        test = dataset.index[test]

        X_train = dataset.loc[train]
        X_test  = dataset.loc[test]
        y_train = dataset.loc[train]
        y_test  = dataset.loc[test]
    
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    
    scores = np.array(scores)
    end = time() - start 

    print("Mean: {:6} Variance: {:6} Time: {:6}".format(scores.mean(), scores.var(), end))

In [25]:
fit_score_kfold(Ridge(), dataset_1, target_1)
fit_score_kfold(Ridge(), dataset_2, target_2)
fit_score_kfold(Ridge(), dataset_3, target_3)
fit_score_kfold(Ridge(), dataset_4, target_4)

Mean: 0.9981262140537511 Variance: 9.051588022447506e-08 Time: 8.507905960083008
Mean: 0.99856147117169 Variance: 5.8902910436906337e-08 Time: 6.726489067077637
Mean: 0.9969821835724456 Variance: 2.5471613800450136e-07 Time: 5.750875234603882
Mean: 0.9976435153720546 Variance: 1.59744732253966e-07 Time: 6.65703558921814


In [26]:
fit_score_kfold(Lasso(), dataset_1, target_1)
fit_score_kfold(Lasso(), dataset_2, target_2)
fit_score_kfold(Lasso(), dataset_3, target_3)
fit_score_kfold(Lasso(), dataset_4, target_4)

Mean: 0.0005571268766496533 Variance: 2.7351592891329313e-06 Time: 39.25697994232178
Mean: 0.15485748423960147 Variance: 5.6508959079436564e-05 Time: 40.771477460861206
Mean: -0.003698071537686329 Variance: 2.3624052947981027e-07 Time: 39.45966958999634
Mean: 0.0849277256112875 Variance: 1.9073768861526476e-06 Time: 40.17126703262329


### Bias-Variance Trade-Off for k-Fold Cross-Validation

In terms of bias, it is clear that LOOCV will have lower bias than KCV when $k < n$. This is because each model is trained using $n-1$ points which is nearly all of the training data. Since KCV uses less of the data, it has less ability to learn the phenomenon represented by the data and is therefore more biased then LOOCV.

On the other hand, LOOCV has more variance than KCV. This is because LOOCV involve the fitting and then averaging of performance of $n$ models, whereas KCV does this over $k$ models. Furthermore, the $n$ LOOCV models are more correlated with each other than are the $k$ KCV models. This is clear because each LOOCV model is identical to any other LOOCV model save for one point. Meanwhile each KCV model differs from any other KCV model in $n/k$ points. It can be shown that the meani of highly correlated quantities has higher variance then does the mean of quantities that are not as highly correlated. In other words, the LOOCV has higher variance than does the KCV.