#Model selection: choosing estimators and their parameters

##1.Score, and cross-validated scores
As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.

In [1]:
from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
# Train on all but the last 100 samples, then score on the held-out 100
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])

0.97999999999999998
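
For a classifier such as SVC, `score` returns the mean accuracy on the data it is given. The following is a minimal sketch (not part of the original notebook) that checks this by comparing `score` with `sklearn.metrics.accuracy_score` applied to the predictions; the variable names `held_out_score` and `manual_accuracy` are illustrative only.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

# Same estimator and split as above: train on all but the last 100 samples
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100])

# score() on a classifier is the fraction of correctly predicted labels
held_out_score = svc.score(X_digits[-100:], y_digits[-100:])
manual_accuracy = accuracy_score(y_digits[-100:], svc.predict(X_digits[-100:]))
print(held_out_score, manual_accuracy)
```

Both numbers should agree, since they are computed from the same predictions.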

To get a better measure of prediction accuracy (which we can use as a proxy for the goodness of fit of the model), we can successively split the data into folds that we use for training and testing:

In [2]:
import numpy as np
X_folds = np.array_split(X_digits, 3)  # split evenly into three folds
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    X_train = list(X_folds)  # list() makes a shallow copy, so pop() leaves X_folds intact
    X_test  = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test  = y_train.pop(k)
    y_train = np.concatenate(y_train)  # join the remaining folds into a single ndarray
    print(len(X_train))
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)

1198
1198
1198
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]


This is called KFold cross-validation.

##2.Cross-validation generators
The code above, which splits the data into train and test sets, is tedious to write. Scikit-learn exposes cross-validation generators that produce lists of indices for this purpose:

In [3]:
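# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide KFold and cross_val_score in sklearn.model_selection
from sklearn import cross_validation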
k_fold = cross_validation.KFold(n=6, n_folds=3)
for train_indices, test_indices in k_fold:
    print('Train: %s | test: %s' % (train_indices, test_indices))


Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
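
In scikit-learn 0.18 and later, the same generator lives in `sklearn.model_selection`, takes `n_splits` instead of `n` and `n_folds`, and yields the indices from its `split` method. A minimal sketch of the equivalent call, assuming such a release is installed (the dummy array `X` and the list `splits` are illustrative names, not from the original notebook):

```python
import numpy as np
from sklearn.model_selection import KFold

# Six dummy samples; only the number of rows matters for the split
X = np.arange(6).reshape(6, 1)

k_fold = KFold(n_splits=3)  # no shuffling by default, so folds are contiguous
splits = [(train.tolist(), test.tolist()) for train, test in k_fold.split(X)]
for train_indices, test_indices in splits:
    print('Train: %s | test: %s' % (train_indices, test_indices))
```

Without shuffling, the folds are the same contiguous index blocks as in the output above.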


The cross-validation can then be implemented easily:

In [4]:
kfold = cross_validation.KFold(len(X_digits), n_folds=3)
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test]) for train, test in kfold]


[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

To compute the cross-validated scores of an estimator, scikit-learn exposes a helper function:

In [None]:
cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)
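
For newer scikit-learn releases (0.18 and later), a comparable call can be sketched with `sklearn.model_selection`. This is an illustration under that assumption, not the notebook's original cell; `cv_scores` is a name introduced here, and `n_jobs` is left at its default to keep the example simple.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(C=1, kernel='linear')

# One score per fold: fit on two folds, score on the remaining one
cv_scores = cross_val_score(svc, digits.data, digits.target, cv=KFold(n_splits=3))
print(cv_scores, cv_scores.mean())
```

The helper returns an array with one score per fold, so the mean gives a single summary figure for the estimator.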

##3.Grid-search and cross-validated estimators
Scikit-learn provides an object that, given data, fits an estimator for each point of a parameter grid, computes its cross-validation score, and chooses the parameters that maximize that score. This object takes an estimator at construction and itself exposes the estimator API:


In [None]:
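>>> # Note: sklearn.grid_search was removed in scikit-learn 0.20;
>>> # newer releases provide GridSearchCV in sklearn.model_selection
>>> from sklearn.grid_search import GridSearchCV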
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])        
GridSearchCV(cv=None,...
>>> clf.best_score_                                  
0.925...
>>> clf.best_estimator_.C                            
0.0077...

>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])   
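
The grid search above can be sketched against `sklearn.model_selection` for scikit-learn 0.18 and later. This is a hedged illustration rather than the original cell: `cv=3` is set explicitly here as an assumption, because the default number of folds differs between releases, and `best_C` and `test_score` are names introduced for the example. The printed values depend on the installed version, so none are quoted.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

svc = svm.SVC(kernel='linear')
Cs = np.logspace(-6, -1, 10)  # ten candidate values of C from 1e-6 to 1e-1

# Fit one model per (C, fold) pair, then refit on all training data with the best C
clf = GridSearchCV(estimator=svc, param_grid={'C': Cs}, cv=3)
clf.fit(X_digits[:1000], y_digits[:1000])

best_C = clf.best_params_['C']
test_score = clf.score(X_digits[1000:], y_digits[1000:])  # held-out samples
print(best_C, clf.best_score_, test_score)
```

Because `GridSearchCV` refits the best estimator by default, `clf.score` and `clf.predict` can be used directly on the held-out samples.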