# Scikit-learn CV 
Tutorial based on the scikit-learn [official documentation](http://scikit-learn.org/stable/modules/cross_validation.html).

In [1]:
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [3]:
iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

In [5]:
X_train.shape, y_train.shape

((90, 4), (90,))

In [6]:
X_test.shape, y_test.shape

((60, 4), (60,))

In [8]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.96666666666666667

## How to do CV

In [None]:
# from sklearn.model_selection import ShuffleSplit # Hold out 
# from sklearn.model_selection import KFold # kfold 
# from sklearn.model_selection import LeaveOneOut # LOO

In [10]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear',C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores

array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])

In [11]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)


### Change the score definition to **F1**

In [13]:
# from sklearn import metrics
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
scores

array([ 0.96658312,  1.        ,  0.96658312,  0.96658312,  1.        ])

It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:

In [14]:
from sklearn.model_selection import ShuffleSplit
n_samples = iris.data.shape[0]
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

In [17]:
cross_val_score(clf, iris.data, iris.target, cv=cv)

array([ 0.97777778,  0.97777778,  1.        ])

### Data transform with held out data

In [19]:
from sklearn import preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test)
clf.score(X_test_transformed, y_test)  

0.93333333333333335

### make pipeline

In [24]:
from sklearn.pipeline import make_pipeline
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, iris.data, iris.target, cv=cv)

array([ 0.97777778,  0.93333333,  0.95555556])

## `cross_validate`

In [26]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score

In [27]:
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=False)

In [31]:
scores['test_recall_macro']

array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])

The `scoring` argument can also be a dict mapping a scorer name to a predefined or custom scoring function.
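As a minimal sketch of the dict form, the example below mixes a predefined scorer string with a scorer built by `make_scorer`. The key names `prec_macro` and `rec_macro` are made up for illustration and are not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1, random_state=0)

# Keys are illustrative names; values are either a predefined scorer string
# or a scorer built from a metric function with make_scorer.
scoring = {
    'prec_macro': 'precision_macro',
    'rec_macro': make_scorer(recall_score, average='macro'),
}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5)

# Each scorer name shows up in the result dict with a 'test_' prefix.
print(sorted(scores.keys()))
print(scores['test_rec_macro'])
```

The returned dict also carries timing entries, so it is usually easiest to index it by the `test_<name>` keys you defined.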

# CV iterators

The iterators below assume the data is independent and identically distributed (i.i.d.):

- K-fold

In [32]:
from sklearn.model_selection import KFold
import numpy as np 

In [40]:
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)

In [34]:
kf

KFold(n_splits=2, random_state=None, shuffle=False)

In [41]:
for train, test in kf.split(X):
    print('{}{}'.format(train,test))

[2 3][0 1]
[0 1][2 3]


In [42]:
X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
y = np.array([0, 1, 0, 1])
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

- Leave one out (LOO)

In [46]:
from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


- Shuffle and split

In [47]:
from sklearn.model_selection import ShuffleSplit
X = np.arange(5)
ss = ShuffleSplit(n_splits=3, test_size=0.25,
                  random_state=0)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]


- StratifiedKFold 

`StratifiedKFold` is a variation of k-fold that returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

In [51]:
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = [0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]


- Grouped k-fold ([doc](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data))
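A short sketch of grouped k-fold using `GroupKFold`, which keeps all samples that share a group id on the same side of every split. The toy `X`, `y`, and `groups` arrays are invented for illustration (think of `groups` as a hypothetical patient or subject id).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Ten toy samples; 'groups' is a hypothetical subject id per sample.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups=groups))
for train, test in splits:
    # A group never appears in both the training and the test indices.
    assert set(groups[train]).isdisjoint(groups[test])
    print("%s %s" % (train, test))
```

This is useful when samples from the same group are correlated, so that a score measured on held-out groups reflects performance on groups the model has not seen.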

# Cross validation of time series data

[doc](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data)

## Time series split

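Time series data consists of samples observed at fixed time intervals. `TimeSeriesSplit` returns the first k folds as the training set and the (k+1)-th fold as the test set, so each training set is a superset of the previous ones and test indices always come after training indices.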

In [53]:
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

TimeSeriesSplit(max_train_size=None, n_splits=3)


In [54]:
for train, test in tscv.split(X):
    print("%s %s" % (train, test))

[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]


# Note on SHUFFLE
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
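The contiguous-label case can be seen on iris itself, whose samples are stored class by class. The sketch below compares an unshuffled and a shuffled 3-fold split; the choice of three folds is deliberate so that each unshuffled fold coincides with one class.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()  # samples are ordered by class: 50 + 50 + 50
clf = svm.SVC(kernel='linear', C=1)

# Three unshuffled folds line up exactly with the three classes, so every
# test fold contains only a class the model never saw during training.
unshuffled = cross_val_score(clf, iris.data, iris.target,
                             cv=KFold(n_splits=3))

# Shuffling the indices first mixes the classes across the folds.
shuffled = cross_val_score(clf, iris.data, iris.target,
                           cv=KFold(n_splits=3, shuffle=True, random_state=0))

print(unshuffled)
print(shuffled)
```

The unshuffled scores collapse because the classifier is always asked to predict a label that was absent from its training data, while the shuffled scores are in line with the earlier results.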

Some cross-validation iterators, such as `KFold`, have a built-in option to shuffle the data indices before splitting them. Note that:

- This consumes less memory than shuffling the data directly.
- By default no shuffling occurs, including for the (stratified) k-fold cross-validation performed by specifying `cv=some_integer` to `cross_val_score`, grid search, etc. Keep in mind that `train_test_split` still returns a random split.
- The `random_state` parameter defaults to `None`, meaning that the shuffling will be different every time `KFold(..., shuffle=True)` is iterated. However, `GridSearchCV` will use the same shuffling for each set of parameters validated by a single call to its `fit` method.
- To get identical results for each split, set `random_state` to an integer.
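A small sketch of the points above: the default `KFold` yields contiguous, unshuffled test folds, and `shuffle=True` with an integer `random_state` yields the same shuffled folds on every iteration. The helper `fold_indices` is introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

def fold_indices(cv, data):
    """Return the test indices of every split as a list of lists."""
    return [test.tolist() for _, test in cv.split(data)]

# Default: no shuffling, so the test folds are contiguous index blocks.
plain = fold_indices(KFold(n_splits=5), X)
print(plain)

# shuffle=True with an integer random_state: the folds are permuted, but
# iterating the same splitter twice gives identical folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
first = fold_indices(kf, X)
second = fold_indices(kf, X)
print(first == second)
```

Only the indices are shuffled, never the data array itself, which is why this costs less memory than shuffling `X` directly.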
