# Cross-Validation for (hyper)parameter tuning

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">

## Load data

The data is loaded the same way as in the Evaluate notebook.

In [4]:
from pandas import read_csv

path = "/data/"
data = read_csv(path + "boston-housing.csv", index_col=0)
target_column = "medv"
features = data.drop(target_column, axis=1)
outputs = data[target_column]
X = features.values.astype(float)
y = outputs.values

## Define models to compare

In [30]:
from sklearn.ensemble import RandomForestRegressor

SEED = 8

estimator1 = RandomForestRegressor(max_features=0.5, max_depth=3, random_state=SEED)
estimator2 = RandomForestRegressor(max_features=0.75, max_depth=3, random_state=SEED)
estimator3 = RandomForestRegressor(max_features=0.5, max_depth=9, random_state=SEED)
estimator4 = RandomForestRegressor(max_features=0.75, max_depth=9, random_state=SEED)

In [34]:
from sklearn.model_selection import cross_val_score

SCORING = "r2"
FOLDS = 10
verbose = 1
s1 = cross_val_score(estimator1, X, y, scoring=SCORING, cv=FOLDS, verbose=verbose)
s2 = cross_val_score(estimator2, X, y, scoring=SCORING, cv=FOLDS, verbose=verbose)
s3 = cross_val_score(estimator3, X, y, scoring=SCORING, cv=FOLDS, verbose=verbose)
s4 = cross_val_score(estimator4, X, y, scoring=SCORING, cv=FOLDS, verbose=verbose)
print("Estimator 1: " + str(s1.mean()))
print("Estimator 2: " + str(s2.mean()))
print("Estimator 3: " + str(s3.mean()))
print("Estimator 4: " + str(s4.mean()))

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.2s finished


Estimator 1: 0.3884065446793275
Estimator 2: 0.38356298484116846
Estimator 3: 0.5391733055255037
Estimator 4: 0.4907245368790047


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.2s finished


Remarks:

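- `verbose` is set to 1 here, which prints the `[Parallel(...)]` progress lines shown above; try setting it to 0 to silence them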
- See possible values of `scoring` parameter in [online documentation](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
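The four estimators above form a 2 x 2 grid over `max_features` and `max_depth`. As a rough sketch, the same comparison can be written with scikit-learn's `GridSearchCV`, which runs the cross-validation for every combination. The synthetic data from `make_regression`, the names `X_demo`, `y_demo` and `search`, and the reduced `n_estimators` and fold count are assumptions made to keep the example self-contained and fast; they are not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

SEED = 8

# Synthetic stand-in for the housing data (assumption, so the cell runs on its own)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=SEED
)

# Same hyperparameter values as estimator1..estimator4
param_grid = {"max_features": [0.5, 0.75], "max_depth": [3, 9]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=SEED),
    param_grid=param_grid,
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=SEED),
)
search.fit(X_demo, y_demo)

# Mean cross-validated score for each combination, then the best one
for params, mean_score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(params, round(mean_score, 3))
print("Best:", search.best_params_)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y` from the loading cell.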

## Fix the folds to be used for CV

In [36]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
results = cross_val_score(estimator1, X, y, scoring=SCORING, cv=kfold, verbose=verbose)
print(results)

[0.74011971 0.86028778 0.82026618 0.84185195 0.70178173 0.74864892
 0.76457725 0.74218364 0.7183325  0.69049269]


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.2s finished


In [37]:
results.mean()

0.7628542369259114

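Results are much better than the mean of about 0.39 obtained for the same estimator with `cv=FOLDS`. Can you guess why? (Hint: compare how the folds are built in each case.)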
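One likely explanation: when `cv` is an integer and the estimator is a regressor, scikit-learn builds the folds with `KFold` without shuffling, so each test fold is a block of consecutive rows. If the rows of the file follow some meaningful order, each test block can look quite different from the data used for training. The sketch below illustrates the effect on synthetic data that is deliberately sorted by target value; the data and the names `X_sorted`, `y_sorted` and `model` are assumptions for illustration and say nothing about how the housing file itself is ordered.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, then sorted by target so that row order is informative
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
order = np.argsort(y_toy)
X_sorted, y_sorted = X_toy[order], y_toy[order]

model = RandomForestRegressor(n_estimators=30, random_state=0)

# Integer cv: consecutive, unshuffled blocks of rows
ordered_scores = cross_val_score(model, X_sorted, y_sorted, scoring="r2", cv=5)

# Explicit KFold with shuffling: rows are assigned to folds at random
shuffled_folds = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(model, X_sorted, y_sorted, scoring="r2", cv=shuffled_folds)

print("Unshuffled mean R2:", ordered_scores.mean())
print("Shuffled mean R2:  ", shuffled_scores.mean())
```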

Let's inspect the splits, for instance the third one (index 2):

In [25]:
splits = []
for train_index, test_index in kfold.split(X):
    split = {
        'train_index': train_index,
        'test_index': test_index
    }
    splits.append(split)

In [26]:
train_index = splits[2]['train_index']
test_index = splits[2]['test_index']
print(test_index)

[  5  11  17  22  44  53  67  71  80  87  92  95 103 131 146 149 158 159
 166 175 176 195 197 208 213 225 233 238 239 258 266 268 287 300 301 321
 333 352 366 367 383 403 412 417 436 469 470 472 487 494 500]


In [28]:
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
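With the train and test arrays for one split in hand, the score for that split can be recomputed by hand. The sketch below checks, on synthetic data, that fitting a fresh copy of the estimator on each training part and scoring it on the matching test part gives the same values as `cross_val_score`. The data and the names `X_syn`, `y_syn`, `forest` and `folds` are assumptions chosen so the cell runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption)
X_syn, y_syn = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=8)

forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=8)
folds = KFold(n_splits=5, shuffle=True, random_state=8)

# Manual loop: fit an unfitted copy on each training part, score on the test part
manual_scores = []
for tr_idx, te_idx in folds.split(X_syn):
    fitted = clone(forest).fit(X_syn[tr_idx], y_syn[tr_idx])
    manual_scores.append(r2_score(y_syn[te_idx], fitted.predict(X_syn[te_idx])))

# Same folds, same estimator settings, via cross_val_score
cv_scores = cross_val_score(forest, X_syn, y_syn, scoring="r2", cv=folds)

print(np.allclose(manual_scores, cv_scores))
```

Applied to the notebook's own variables, the equivalent step would be to fit a copy of `estimator1` on `X_train`, `y_train` and compare its R2 on `X_test`, `y_test` with `results[2]`.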