## Model Evaluation, Part 1

- Obtain unbiased estimates of a model's performance
- Create a pipeline and use k-fold cross-validation
- Streamline workflows with pipelines


- Cross-validation
  - Holdout method: the classic and popular approach. Split the dataset into separate training and test sets
  - K-fold cross-validation: Randomly split the training data into k folds without replacement. In each of the k iterations, k-1 folds are used for training and the remaining 1 fold for testing
  - Leave-one-out (LOO): Special case of k-fold, where k = number of training samples (k=n)
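
The splitting schemes listed above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation; the helper name `k_fold_indices` and its `seed` parameter are hypothetical and chosen only for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffling once and slicing gives a random split without replacement
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # 1 fold for testing
        train_idx = indices[:start] + indices[start + size:]  # k-1 folds for training
        yield train_idx, test_idx
        start += size

# 5-fold CV on 10 samples: every sample is tested exactly once
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print('train size: %d, test indices: %s' % (len(train_idx), sorted(test_idx)))

# Leave-one-out is the special case k = n: each test fold holds one sample
loo = list(k_fold_indices(n_samples=10, k=10))
```

Each sample lands in exactly one test fold, so averaging the k scores uses every training sample for validation once.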

#### Import, transform, and split the dataset

In [12]:
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

from sklearn.preprocessing import LabelEncoder

# Assign features to NumPy array X
X = df.loc[:, 2:].values
y = df.loc[:, 1].values

# Transform class labels into integers
le = LabelEncoder()
y = le.fit_transform(y)

# Divide the dataset
# train_test_split lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Verify the transformation of class variables
le.transform(['M', 'B'])

array([1, 0])

#### Create a Pipeline

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Pass a list of tuples to create a pipeline
pipe_lr = Pipeline([('scl', StandardScaler()), 
                    ('pca', PCA(n_components=2)), 
                    ('clf', LogisticRegression(random_state=1))])

# Fit the data 
pipe_lr.fit(X_train, y_train)

print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))

Test Accuracy: 0.947
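
The pipeline above chains its steps: `fit` fits and applies each transformer in order before fitting the final estimator, while `predict` and `score` only apply the already fitted transformers. The sketch below imitates that mechanism in plain Python. `Centerer`, `SignClassifier`, and `MiniPipeline` are hypothetical toy classes written for this example; they are not scikit-learn's implementation.

```python
class Centerer:
    """Toy transformer: subtracts the mean learned during fit."""
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy estimator: predicts 1 for positive inputs, otherwise 0."""
    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) tuples

    def fit(self, X, y):
        # Intermediate steps are fitted on the training data, then applied to it
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # New data is only transformed with parameters learned during fit
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


pipe = MiniPipeline([('center', Centerer()), ('clf', SignClassifier())])
pipe.fit([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
accuracy = pipe.score([0.5, 3.5], [0, 1])
print('Test accuracy: %.3f' % accuracy)
```

Because the test data is transformed with the mean learned from the training data, no information from the test set leaks into the preprocessing step.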


#### Stratified K-fold Cross-validation

In [13]:
import numpy as np
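from sklearn.model_selection import StratifiedKFold

# StratifiedKFold keeps the class proportions roughly equal in every fold.
# random_state only has an effect with shuffle=True, so it is omitted here.
kfold = StratifiedKFold(n_splits=10)

scores = []

# split() yields one (train indices, test indices) pair per fold
for k, (train, test) in enumerate(kfold.split(X_train, y_train)):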
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1, np.bincount(y_train[train]), score))

Fold: 1, Class dist.: [256 153], Acc: 0.891
Fold: 2, Class dist.: [256 153], Acc: 0.978
Fold: 3, Class dist.: [256 153], Acc: 0.978
Fold: 4, Class dist.: [256 153], Acc: 0.913
Fold: 5, Class dist.: [256 153], Acc: 0.935
Fold: 6, Class dist.: [257 153], Acc: 0.978
Fold: 7, Class dist.: [257 153], Acc: 0.933
Fold: 8, Class dist.: [257 153], Acc: 0.956
Fold: 9, Class dist.: [257 153], Acc: 0.978
Fold: 10, Class dist.: [257 153], Acc: 0.956
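
The near-constant class distribution in the output above is what stratification is meant to achieve. A minimal sketch of the idea in plain Python deals the samples of each class round-robin into the folds. The function `stratified_folds` is hypothetical and is not scikit-learn's algorithm; the class counts 285 and 170 are inferred from the fold output above (training counts plus the held-out fold), not read from the dataset.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k test folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Dealing each class round-robin keeps its share equal across folds
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds

# Assumed class counts for y_train (inferred, see the note above)
labels = [0] * 285 + [1] * 170
folds = stratified_folds(labels, k=10)

train_dists = []
for fold_number, test_idx in enumerate(folds, start=1):
    held_out = set(test_idx)
    # Count the classes among the samples that remain for training
    counts = Counter(label for i, label in enumerate(labels) if i not in held_out)
    train_dists.append([counts[0], counts[1]])
    print('Fold: %2d, Class dist.: %s' % (fold_number, train_dists[-1]))
```

Under these assumptions every training split keeps a class ratio of roughly 63% to 37%, whichever fold is held out.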


#### Stratified K-fold Cross-validation Scoring in scikit-learn

In [16]:
# cross_val_score lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)

print('CV accuracy scores: %s' % scores)
print()
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy scores: [ 0.89130435  0.97826087  0.97826087  0.91304348  0.93478261  0.97777778
  0.93333333  0.95555556  0.97777778  0.95555556]

CV accuracy: 0.950 +/- 0.029
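
The summary line can be checked by hand from the ten fold scores printed above. The sketch below uses only the standard library and copies the scores from that output. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`; the sample standard deviation is shown only for comparison.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above
fold_scores = [0.89130435, 0.97826087, 0.97826087, 0.91304348, 0.93478261,
               0.97777778, 0.93333333, 0.95555556, 0.97777778, 0.95555556]

mean_acc = statistics.mean(fold_scores)
pop_std = statistics.pstdev(fold_scores)    # divides by n, like np.std(scores)
sample_std = statistics.stdev(fold_scores)  # divides by n - 1, slightly larger

print('CV accuracy: %.3f +/- %.3f' % (mean_acc, pop_std))
print('Sample std for comparison: %.3f' % sample_std)
```

With only ten folds the two estimates of spread differ visibly, so it is worth stating which one a reported "+/-" value refers to.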
