Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would score perfectly but would fail to predict anything useful on unseen data. This situation is called overfitting. To avoid it, it is common practice in a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. The word “experiment” is not meant to denote academic use only: even in commercial settings, machine learning usually starts out experimentally.

Here is a flowchart of a typical cross-validation workflow in model training. The best parameters can be determined by grid search techniques.

[Figure: Grid Search Workflow]
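
As a rough sketch of the grid search step mentioned above, scikit-learn's `GridSearchCV` can cross-validate each candidate setting on the training portion only, leaving the held-out test set for a final check. The parameter grid below is an assumption chosen purely for illustration, not a recommended search space.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# Hypothetical grid: these candidate values are illustrative assumptions.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
# The test set is touched only once, after the parameters have been chosen.
print("held-out accuracy: %0.2f" % search.score(X_test, y_test))
```

Keeping the test set out of the search means the final score is not biased by the parameter selection itself.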

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [2]:
iris=datasets.load_iris()

In [3]:
iris.data.shape,iris.target.shape

((150, 4), (150,))

In [4]:
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.4,random_state=0)

In [5]:
X_train.shape,y_train.shape

((90, 4), (90,))

In [6]:
X_test.shape,y_test.shape

((60, 4), (60,))

In [7]:
clf=svm.SVC(kernel='linear',C=1).fit(X_train,y_train)

In [8]:
clf.score(X_test,y_test)

0.9666666666666667
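
To make the opening point about scoring on already-seen data concrete, the sketch below compares training and test accuracy for a model that essentially memorises its training samples. A 1-nearest-neighbour classifier is used here as a stand-in for such a model; that choice is an assumption of this sketch and is not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# With n_neighbors=1, each training sample is its own nearest neighbour,
# so the model largely reproduces the labels it was fitted on.
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_score = memoriser.score(X_train, y_train)  # scored on data already seen
test_score = memoriser.score(X_test, y_test)     # scored on held-out data
print("train: %0.2f, test: %0.2f" % (train_score, test_score))
```

Only the second number says anything about how the model behaves on new data.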

In [9]:
from sklearn.model_selection import cross_val_score
clf=svm.SVC(kernel='linear',C=1)
scores=cross_val_score(clf,iris.data,iris.target,cv=5)
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

The mean score and an approximate 95% confidence interval of the score estimate (the mean plus or minus two standard deviations of the fold scores) are then given by

In [10]:
print("Accuracy: %0.2f(+/- %0.2f)"%(scores.mean(),scores.std()*2))

Accuracy: 0.98(+/- 0.03)


In [11]:
from sklearn import metrics
scores=cross_val_score(clf,iris.data,iris.target,cv=5,scoring='f1_macro')
scores

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])

In [12]:
from sklearn.model_selection import ShuffleSplit
n_samples=iris.data.shape[0]
cv=ShuffleSplit(n_splits=5,test_size=0.3,random_state=0)
cross_val_score(clf,iris.data,iris.target,cv=cv)

array([0.97777778, 0.97777778, 1.        , 0.95555556, 1.        ])

In [13]:
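def custom_cv_2folds(X):
    # Yield two (train, test) index pairs, one per half of the data.
    n=X.shape[0]
    i=1
    while i<=2:
        idx=np.arange(n*(i-1)/2,n*i/2,dtype=int)
        # Note: the same indices are used for both train and test here,
        # so each score is measured on the data the model was fitted on.
        yield idx,idx
        i+=1
custom_cv=custom_cv_2folds(iris.data)
cross_val_score(clf,iris.data,iris.target,cv=custom_cv)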

array([1.        , 0.97333333])
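
The generator in the cell above yields the same index array for training and testing, so its scores are computed on data the model has already seen. A possible variant with disjoint halves is sketched below. The function name `disjoint_cv_2folds` and its `seed` parameter are hypothetical, and the shuffle is an added assumption, made because the iris samples are ordered by class.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score


def disjoint_cv_2folds(X, seed=0):
    """Yield two (train, test) pairs in which the two halves swap roles."""
    n = X.shape[0]
    # Shuffle first so that each half contains samples from every class.
    perm = np.random.RandomState(seed).permutation(n)
    first, second = perm[: n // 2], perm[n // 2:]
    yield second, first   # train on the second half, test on the first
    yield first, second   # train on the first half, test on the second


iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)
scores = cross_val_score(
    clf, iris.data, iris.target, cv=disjoint_cv_2folds(iris.data)
)
print(scores)
```

Because no index appears in both the training and the test array of a fold, each score reflects performance on samples the model was not fitted on.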

In [20]:
import numpy as np
from sklearn.model_selection import KFold

In [21]:
X=['a','b','c','d']
kf=KFold(n_splits=2)
for train,test in kf.split(X):
    print("%s %s"%(train,test))

[2 3] [0 1]
[0 1] [2 3]


In [22]:
X=np.array([[0.,0.],[1.,1.],[-1.,-1.],[2.,2.]])
y=np.array([0,1,0,1])
# train and test still hold the indices of the last KFold split from the loop above
X_train,X_test,y_train,y_test=X[train],X[test],y[train],y[test]

In [23]:
import numpy as np
from sklearn.model_selection import RepeatedKFold
X=np.array([[1,2],[3,4],[1,2],[3,4]])
random_state=1288382
rkf=RepeatedKFold(n_splits=2,n_repeats=2,random_state=random_state)
for train,test in rkf.split(X):
    print("%s %s"%(train,test))

[2 3] [0 1]
[0 1] [2 3]
[1 3] [0 2]
[0 2] [1 3]


In [26]:
from sklearn.model_selection import StratifiedKFold
X=np.ones(10)
y=[0,0,0,0,1,1,1,1,1,1]
skf=StratifiedKFold(n_splits=3)
for train,test in skf.split(X,y):
    print("%s %s"%(train,test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]


In [28]:
from sklearn.model_selection import GroupKFold

In [29]:
X=[0.1,0.2,2.2,2.4,2.3,4.55,5.8,8.8,9,10]
y=['a','b','b','b','c','c','c','d','d','d']
groups=[1,1,1,2,2,2,3,3,3,3]

In [30]:
gkf=GroupKFold(n_splits=3)

In [31]:
for train,test in gkf.split(X,y,groups=groups):
    print("%s %s"%(train,test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
