SKLearn model selection - Every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data.

Here, a simple hold-out check rather than full cross-validation: svc.fit is trained on everything except the last 100 samples, and score is evaluated on those last 100 samples.

In [1]:
from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])

0.97999999999999998
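
The same hold-out split can also be written with train_test_split. This is a minimal sketch, assuming shuffle=False so that the last 100 samples form the test set exactly as in the slices above; the variable names are illustrative and not part of the original cells.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

# shuffle=False keeps the original order: the last 100 rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=100, shuffle=False)

svc = svm.SVC(C=1, kernel='linear')
holdout_score = svc.fit(X_train, y_train).score(X_test, y_test)
print(holdout_score)
```

A single hold-out score depends on which 100 samples happen to be held out, which is why the folds below give a more stable picture.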

In [2]:
X_digits[:-1795]

array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
         15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
          8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
          5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
          1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
          0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.],
       [  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.,   0.,   0.,   0.,
         11.,  16.,   9.,   0.,   0.,   0.,   0.,   3.,  15.,  16.,   6.,
          0.,   0.,   0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.,   0.,
          0.,   1.,  16.,  16.,   3.,   0.,   0.,   0.,   0.,   1.,  16.,
         16.,   6.,   0.,   0.,   0.,   0.,   1.,  16.,  16.,   6.,   0.,
          0.,   0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.]])

In [3]:
y_digits[:20]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can successively split the data in folds that we use for training and testing:

1) Split the data into sets, or folds:

x = np.arange(8.0)

np.array_split(x, 3)

2) Loop over each fold.
3) Use 'list' to copy the folds, so that one can be popped later on.
4) Pop fold k; it becomes the test set.
5) Merge the remaining folds into a single training array:

X_train = np.concatenate(X_train)

6) Fit, score on the test fold, and append the score to a list.


In [4]:
import numpy as np
x = np.arange(8.0)
np.array_split(x, 3)

[array([ 0.,  1.,  2.]), array([ 3.,  4.,  5.]), array([ 6.,  7.])]

In [5]:

X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)

In [6]:
scores = list()
for k in range(3):  # loop over the folds
    X_train = list(X_folds)  # 'list' makes a shallow copy so we can 'pop' from it
    X_test = X_train.pop(k)  # fold k becomes the test set
    X_train = np.concatenate(X_train)  # merge the remaining folds into one array
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    # fit on the training folds, score on the held-out fold, and record the score
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)

[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
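
The manual loop above works on the data arrays themselves. The same idea can be expressed on indices, which is what the cross-validation generators in the next part do. The helper below is a hypothetical sketch (kfold_indices is not a scikit-learn function) that uses only NumPy.

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs built from contiguous folds."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        # every fold except fold k goes into the training set
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        yield train_idx, test_idx

splits = list(kfold_indices(8, 3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "| test:", test_idx)
```

Working with indices avoids copying the data until a fold is actually used, and the same index pair can slice both X and y.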


Cross-validation generators: Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-validation strategies.

The example below shows the split method, which yields one (train indices, test indices) pair per fold:

In [7]:
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))

Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
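
Without shuffling, KFold produces contiguous test folds as shown above. When the rows are ordered (for example sorted by class), shuffling is usually preferable. A minimal sketch, assuming ten toy samples and an arbitrary random_state chosen only for reproducibility:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples, for illustration only
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

test_seen = []
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
    test_seen.extend(test_indices.tolist())
```

Every sample still lands in exactly one test fold; only the assignment of samples to folds is randomized.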


In [8]:
k_fold = KFold(n_splits=3)
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
 for train, test in k_fold.split(X_digits)]

[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

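The cross-validation score can be calculated directly with the cross_val_score helper, which returns one score per fold. By default each fold is scored with the estimator's score method; the scoring argument selects another metric, such as macro-averaged precision.

n_jobs=-1 dispatches the computation across all the CPUs of the computer.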

In [9]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)

array([ 0.93489149,  0.95659432,  0.93989983])
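
The per-fold scores are usually summarized as a mean and a spread. A short sketch of that summary, reusing the same estimator and folds as the cell above (the print format is just one possible choice):

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=3)

scores = cross_val_score(svc, digits.data, digits.target, cv=k_fold)
# mean and standard deviation across the three folds
print("accuracy: %0.3f +/- %0.3f" % (scores.mean(), scores.std()))
```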

In [10]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold,
                scoring='precision_macro')

array([ 0.93969761,  0.95911415,  0.94041254])
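
When more than one metric is needed, calling cross_val_score once per metric refits the model each time. The cross_validate helper from the same module accepts a list of scorer names instead. A minimal sketch, assuming the two metrics used in this notebook:

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_validate

digits = datasets.load_digits()
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=3)

# one fit per fold, several metrics evaluated on each held-out fold
results = cross_validate(svc, digits.data, digits.target, cv=k_fold,
                         scoring=['accuracy', 'precision_macro'])
print(results['test_accuracy'])
print(results['test_precision_macro'])
```

The returned dictionary holds one array per metric under the key test_<metric name>, alongside fit and score timings.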

Stratified K-Folds cross-validator

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Simple Example with numpy Array:

In [11]:
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)

print(skf)  

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
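
To see what "preserving the percentage of samples for each class" means in practice, the sketch below counts the classes in each test fold for plain KFold and for StratifiedKFold. The toy labels (8 of class 0, 4 of class 1) and the helper fold_class_counts are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 8 + [1] * 4)   # toy labels: two thirds class 0, one third class 1
X = np.zeros((len(y), 1))         # features play no role in how the folds are built

def fold_class_counts(cv):
    """Return [n_class0, n_class1] for each test fold produced by cv."""
    return [np.bincount(y[test], minlength=2).tolist()
            for _, test in cv.split(X, y)]

plain_counts = fold_class_counts(KFold(n_splits=4))
stratified_counts = fold_class_counts(StratifiedKFold(n_splits=4))
print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With sorted labels, plain KFold produces test folds that contain only one class, while every stratified fold keeps the 2:1 ratio of the full label vector.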


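StratifiedKFold generation for the CrowdFlower data: three runs of stratified splitting, each with its own random seed, stratified first on median_relevance and then on qid. Each run breaks the data into sets of train and test indices.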

In [52]:
import os
os.chdir("C:\\Users\\hgamit\\workspace\\crowdf\\test")
import dill as cPickle
from sklearn.model_selection import StratifiedKFold

In [53]:
## load data
with open("../processed/train.processed.csv.pkl", "rb") as f:
    dfTrain = cPickle.load(f)

In [54]:
skf = [0]*3
skf

[0, 0, 0]

In [66]:
dfTrain.shape

(10158, 12)

In [58]:
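for stratified_label, key in zip(["relevance", "query"], ["median_relevance", "qid"]):
    for run in range(3):
        random_seed = 2015 + 1000 * (run+1)
        y = dfTrain[key]
        X = y.values.reshape(len(y), 1)  # placeholder features; only y drives the stratification
        # n_splits=3 is set explicitly because the default differs between scikit-learn versions
        skf[run] = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_seed)
        # NOTE: split() yields (train, test). The names below are unpacked in the
        # opposite order, so "Train" is the single held-out fold (about 1/3 of the
        # rows) and "Valid" is the remaining 2/3, as the counts in the output show.
        for fold, (validInd, trainInd) in enumerate(list(skf[run].split(X, y))):
            print("================================")
            print("Index for run: %s, fold: %s" % (run+1, fold+1))
            print("Train (num = %s)" % len(trainInd))
            print(trainInd[:10])
            print("Valid (num = %s)" % len(validInd))
            print(validInd[:10])
    # save the three splitter objects for this stratification label
    with open("%s/stratifiedKFold.%s.pkl" % ("./", stratified_label), "wb") as f:
        cPickle.dump(skf, f, -1)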


Index for run: 1, fold: 1
Train (num = 3386)
[ 3  8 10 11 12 20 21 22 28 33]
Valid (num = 6772)
[ 0  1  2  4  5  6  7  9 13 14]
Index for run: 1, fold: 2
Train (num = 3386)
[ 2  5  6 14 16 17 18 19 23 25]
Valid (num = 6772)
[ 0  1  3  4  7  8  9 10 11 12]
Index for run: 1, fold: 3
Train (num = 3386)
[ 0  1  4  7  9 13 15 24 27 30]
Valid (num = 6772)
[ 2  3  5  6  8 10 11 12 14 16]
Index for run: 2, fold: 1
Train (num = 3386)
[ 0  1  9 11 17 21 27 28 31 34]
Valid (num = 6772)
[ 2  3  4  5  6  7  8 10 12 13]
Index for run: 2, fold: 2
Train (num = 3386)
[ 3  5  6  8 10 14 23 25 26 30]
Valid (num = 6772)
[ 0  1  2  4  7  9 11 12 13 15]
Index for run: 2, fold: 3
Train (num = 3386)
[ 2  4  7 12 13 15 16 18 19 20]
Valid (num = 6772)
[ 0  1  3  5  6  8  9 10 11 14]
Index for run: 3, fold: 1
Train (num = 3386)
[ 1  3  8 10 11 12 13 14 15 19]
Valid (num = 6772)
[ 0  2  4  5  6  7  9 16 17 18]
Index for run: 3, fold: 2
Train (num = 3386)
[ 4  5 16 18 22 24 27 30 32 38]
Valid (num = 6772)
[ 0  1  



Index for run: 1, fold: 1
Train (num = 3469)
[ 4  5 12 14 16 18 20 22 29 30]
Valid (num = 6689)
[ 0  1  2  3  6  7  8  9 10 11]
Index for run: 1, fold: 2
Train (num = 3393)
[ 3  8 13 15 19 24 25 26 27 32]
Valid (num = 6765)
[ 0  1  2  4  5  6  7  9 10 11]
Index for run: 1, fold: 3
Train (num = 3296)
[ 0  1  2  6  7  9 10 11 17 21]
Valid (num = 6862)
[ 3  4  5  8 12 13 14 15 16 18]
Index for run: 2, fold: 1
Train (num = 3470)
[ 0  2  6  7  9 10 12 14 15 16]
Valid (num = 6688)
[ 1  3  4  5  8 11 13 18 20 22]
Index for run: 2, fold: 2
Train (num = 3393)
[13 18 24 25 38 39 40 42 43 48]
Valid (num = 6765)
[0 1 2 3 4 5 6 7 8 9]
Index for run: 2, fold: 3
Train (num = 3295)
[ 1  3  4  5  8 11 20 22 23 30]
Valid (num = 6863)
[ 0  2  6  7  9 10 12 13 14 15]
Index for run: 3, fold: 1
Train (num = 3470)
[ 0  3  7  8  9 13 16 20 26 27]
Valid (num = 6688)
[ 1  2  4  5  6 10 11 12 14 15]
Index for run: 3, fold: 2
Train (num = 3393)
[ 2  6 11 12 14 19 21 22 24 28]
Valid (num = 6765)
[ 0  1  3  4  5  7
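
The repeated, seeded splitting above can be tried without the CrowdFlower files. The sketch below is an assumption-laden stand-in: the label vector is synthetic (30 labels in 4 classes, standing in for median_relevance), and the indices are unpacked in the (train, test) order that split() yields.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for dfTrain["median_relevance"]: 30 labels in 4 classes
y = np.repeat([1, 2, 3, 4], [6, 6, 9, 9])
X = np.zeros((len(y), 1))  # placeholder features

n_runs, n_folds = 3, 3
all_splits = []  # all_splits[run][fold] = (train_idx, test_idx)
for run in range(n_runs):
    random_seed = 2015 + 1000 * (run + 1)  # same seed scheme as the cell above
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_seed)
    # split() yields (train, test), in that order
    all_splits.append(list(skf.split(X, y)))

for run, splits in enumerate(all_splits):
    for fold, (train_idx, test_idx) in enumerate(splits):
        print("run %d, fold %d: train=%d, test=%d"
              % (run + 1, fold + 1, len(train_idx), len(test_idx)))
```

Storing the index pairs rather than the splitter objects makes the folds independent of the scikit-learn version used to reload them; scikit-learn also provides RepeatedStratifiedKFold for the same repeated scheme.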

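StratifiedKFold, old version (cross_validation) vs. new version (model_selection): a comparison, and the None argument issue (https://github.com/scikit-learn/scikit-learn/issues/7126). The cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20, so the old-version cells below only run on older releases.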

In [39]:
import sklearn.cross_validation
import sklearn.model_selection
y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
X = y.reshape(len(y), 1)

In [46]:
# In the old version all that is needed is the labels
skf_old = sklearn.cross_validation.StratifiedKFold(y, random_state=0)
indicies_old = list(skf_old)
skf_old

sklearn.cross_validation.StratifiedKFold(labels=[0 0 1 1 1 0 0 1], n_folds=3, shuffle=False, random_state=0)

In [48]:
indicies_old

[(array([4, 5, 6, 7]), array([0, 1, 2, 3])),
 (array([0, 1, 2, 3, 6, 7]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5]), array([6, 7]))]

In [50]:
# In the new version the labels are passed to split(); X is only used for its number of samples
skf_new = sklearn.model_selection.StratifiedKFold(random_state=0)
indicies_new = list(skf_new.split(X, y))
skf_new

StratifiedKFold(n_splits=3, random_state=0, shuffle=False)

In [67]:
indicies_new

[(array([4, 5, 6, 7]), array([0, 1, 2, 3])),
 (array([0, 1, 2, 3, 6, 7]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5]), array([6, 7]))]
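
Since the new API only needs X for its length, a placeholder array can be passed when no feature matrix is at hand. A minimal sketch, using the same labels as above; the fold membership itself can differ between scikit-learn releases, so the check below only compares the two calls with each other.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
X = y.reshape(len(y), 1)

skf = StratifiedKFold(n_splits=3)
splits_with_X = list(skf.split(X, y))
# np.zeros(len(y)) is only a placeholder carrying the number of samples
splits_with_zeros = list(skf.split(np.zeros(len(y)), y))

same = all(np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
           for (a_train, a_test), (b_train, b_test)
           in zip(splits_with_X, splits_with_zeros))
print(same)
```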

LabelBinarizer: Binarize labels in a one-vs-all fashion
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html

At learning time, the one-vs-all scheme consists of learning one regressor or binary classifier per class. To do so, one needs to convert multi-class labels to binary labels (belongs or does not belong to the class). LabelBinarizer makes this process easy with the transform method.

At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform method.

In [94]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])
lb.classes_

array([1, 2, 4, 6])

In [98]:
lb.transform([1, 2, 4, 6, 2])

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0]], dtype=int32)
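
The prediction-time step described earlier, inverse_transform, is not exercised in the cells of this notebook. The sketch below shows it in two ways: a round trip on binarized labels, and a mapping from per-class confidences back to labels. The confidence values are hypothetical numbers made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

labels = [1, 2, 4, 6, 2]
binarized = lb.transform(labels)
recovered = lb.inverse_transform(binarized)    # back to the original labels

# Hypothetical per-class confidences for two samples (columns follow lb.classes_)
confidences = np.array([[0.1, 0.7, 0.2, 0.0],
                        [0.6, 0.1, 0.1, 0.2]])
predicted = lb.inverse_transform(confidences)  # picks the highest-scoring class per row
print(recovered, predicted)
```

The columns of the confidence matrix must follow the order of lb.classes_, which is the sorted order of the labels seen during fit.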

Binary targets are transformed into a single column vector:

In [89]:
lb = preprocessing.LabelBinarizer()
lb.fit(['yes', 'no', 'no', 'yes'])
lb.classes_
lb.transform(['no', 'yes'])

array([[0],
       [1]], dtype=int32)

In [104]:
import numpy as np
lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))
lb.classes_

array([0, 1, 2])

In [107]:
lb.transform([0, 1, 2, 0, 2])

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1]], dtype=int32)