# Model Evaluation
* What we've done so far to evaluate our supervised models:
 * split our dataset into a training set and a test set: __`train_test_split()`__
 * built a model on the training set: __`fit()`__
 * evaluated the model on the test set: __`score()`__

In [1]:
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# create a synthetic dataset
X, y = make_blobs(random_state=0)
print(X, y)
# split data and labels into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# instantiate a model and fit it to the training set
logreg = LogisticRegression().fit(X_train, y_train)
# evaluate the model on the test set
print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))

[[ 2.63185834  0.6893649 ]
 [ 0.08080352  4.69068983]
 [ 3.00251949  0.74265357]
 [-0.63762777  4.09104705]
 [-0.07228289  2.88376939]
 [ 0.62835793  4.4601363 ]
 [-2.67437267  2.48006222]
 [-0.57748321  3.0054335 ]
 [ 2.72756228  1.3051255 ]
 [ 0.34194798  3.94104616]
 [ 1.70536064  4.43277024]
 [ 2.20656076  5.50616718]
 [ 2.52092996 -0.63858003]
 [ 2.50904929  5.7731461 ]
 [-2.27165884  2.09144372]
 [ 3.92282648  1.80370832]
 [-1.62535654  2.25440397]
 [ 0.1631238   2.57750473]
 [-1.59514562  4.63122498]
 [-2.63128735  2.97004734]
 [-2.17052242  0.69447911]
 [-1.56618683  1.74978876]
 [-0.88677249  1.30092622]
 [ 0.08848433  2.32299086]
 [ 0.9845149   1.95211539]
 [ 2.18217961  1.29965302]
 [ 1.28535145  1.43691285]
 [ 0.89011768  1.79849015]
 [-1.89608585  2.67850308]
 [-0.75511346  3.74138642]
 [ 1.12031365  5.75806083]
 [ 3.54351972  2.79355284]
 [ 1.64164854  0.15020885]
 [ 2.47034915  4.09862906]
 [-1.98243652  2.93536142]
 [ 0.85624076  3.86236175]
 [ 0.87305123  4.71438583]
 

# What are we actually interested in measuring?
* how well our model generalizes to new, unseen data
* (...NOT how well our model fit the training data)

# Cross Validation
* a statistical method of evaluating generalization performance
* more stable and thorough than a single split into a training set and a test set
* instead, data is split repeatedly and multiple models are trained
* most common method of cross-validation is _k-fold_ cross-validation (k = 5 or 10)
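
As a rough sketch of the idea in the bullets above, k-fold cross-validation can be written by hand: cut the sample indices into k folds, then let each fold take one turn as the test set while the rest is used for training. The shuffling step, the seed, the name `manual_scores`, and `max_iter=1000` are assumptions made for this illustration; scikit-learn's own helpers handle splitting differently in the details.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

k = 5
rng = np.random.RandomState(0)  # fixed seed, chosen only for repeatability
# shuffle the sample indices, then cut them into k roughly equal folds
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)

manual_scores = []
for i in range(k):
    test_idx = folds[i]
    # every fold except the i-th is used for training
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # max_iter is raised here (an assumption) to avoid convergence warnings
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

print("Per-fold accuracy:", np.round(manual_scores, 3))
print("Mean accuracy: {:.2f}".format(np.mean(manual_scores)))
```

Every sample ends up in a test fold exactly once, and k separate models are trained.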

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris()
logreg = LogisticRegression()
scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores: {}".format(scores))

Cross-validation scores: [0.96078431 0.92156863 0.95833333]


* How many folds were there by default?
* Let's increase that to see if we do better...

In [3]:
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print("Cross-validation scores: {}".format(scores))

Cross-validation scores: [1.         0.96666667 0.93333333 0.9        1.        ]


In [4]:
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Average cross-validation score: 0.96


* we can conclude that we expect the model to be around 96% accurate on average
* relatively high variance in the accuracy between folds, ranging from 90% to 100%
 * model could be very dependent on the particular folds used for training
 * could also just be a consequence of the small size of the dataset
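
One way to put a number on the spread between folds is to report the standard deviation next to the mean. The sketch below copies the five fold scores printed above by hand; the name `fold_scores` is used only for this illustration.

```python
import numpy as np

# the five fold scores printed above, copied by hand
fold_scores = np.array([1.0, 0.96666667, 0.93333333, 0.9, 1.0])

print("Mean: {:.2f}".format(fold_scores.mean()))
print("Standard deviation: {:.3f}".format(fold_scores.std()))
print("Range: {:.2f} to {:.2f}".format(fold_scores.min(), fold_scores.max()))
```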

# Uh-Oh!

In [5]:
from sklearn.datasets import load_iris
iris = load_iris()
print("Iris labels:\n{}".format(iris.target))

Iris labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


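* If the data is ordered in a particular way (here, the labels are sorted by class), the folds can differ greatly from one another.
* Even with cross-validation, the ordering of the data itself can cause problems: a fold cut from a contiguous block of samples may contain only one class.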

# Stratified Cross Validation

* Split the data such that the proportions between classes are the same in each fold as they are in the whole dataset
![alt-text](images/cv-2.png)
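
A minimal sketch of stratification on the iris data, using `StratifiedKFold` from scikit-learn. It counts the class labels in each test fold; since iris has 50 samples per class, five stratified folds should each hold 10 samples of every class. The name `test_fold_counts` is introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
skf = StratifiedKFold(n_splits=5)

test_fold_counts = []
# stratified splitting needs the labels, so they are passed to split()
for train_idx, test_idx in skf.split(iris.data, iris.target):
    counts = np.bincount(iris.target[test_idx], minlength=3)
    test_fold_counts.append(counts.tolist())
    print("test fold class counts:", counts)
```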

# KFold Validation

In [7]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)

In [8]:
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores:
[1.         0.93333333 0.43333333 0.96666667 0.43333333]


In [9]:
from collections import Counter

def show_fold_contents(kf):
    # each split is a (train_indices, test_indices) pair;
    # count the class labels in the training portion of every split
    for train_idx, _ in kf.split(iris.data):
        c = Counter(iris.target_names[iris.target[i]] for i in train_idx)
        print(c)

show_fold_contents(kfold)

Counter({'versicolor': 50, 'virginica': 50, 'setosa': 20})
Counter({'virginica': 50, 'versicolor': 40, 'setosa': 30})
Counter({'setosa': 50, 'virginica': 50, 'versicolor': 20})
Counter({'setosa': 50, 'versicolor': 40, 'virginica': 30})
Counter({'setosa': 50, 'versicolor': 50, 'virginica': 20})


In [10]:
kfold = KFold(n_splits=3)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores:
[0. 0. 0.]


In [11]:
show_fold_contents(kfold)

Counter({'versicolor': 50, 'virginica': 50})
Counter({'setosa': 50, 'virginica': 50})
Counter({'setosa': 50, 'versicolor': 50})


* What happened between 3 and 5 splits?
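
One way to investigate is to look at which classes land in each test fold when `KFold` is used without shuffling. The sketch below assumes nothing beyond the iris data and `KFold`; the dictionary `test_classes` is a name made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

iris = load_iris()
test_classes = {}
for n_splits in (3, 5):
    test_classes[n_splits] = []
    # without shuffle, each test fold is a contiguous block of the sorted data
    for _, test_idx in KFold(n_splits=n_splits).split(iris.data):
        classes = np.unique(iris.target[test_idx]).tolist()
        test_classes[n_splits].append(classes)
    print(n_splits, "splits -> classes in each test fold:", test_classes[n_splits])
```

With 3 splits, every test fold consists of a single class that is missing from the corresponding training set, which is consistent with the scores of 0 above. With 5 splits, some test folds still contain a class that is only partly represented in training.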

In [12]:
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print("Cross-validation scores:\n{}".format(
    cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores:
[0.9  0.96 0.96]


# Leave One Out Validation

In [13]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("Number of cv iterations: ", len(scores))
print("Mean accuracy: {:.2f}".format(scores.mean()))

Number of cv iterations:  150
Mean accuracy: 0.95


* Another cross-validation strategy
* Train on everything but one sample, rotate and do it again, leaving out a different sample.
* What is the tradeoff?
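
To make the rotation concrete, here is a small sketch on a toy array of four made-up samples (`X_toy` is hypothetical data, not part of the notebook). Each split holds out exactly one sample, so the number of models to fit equals the number of samples, which is 150 for iris as shown above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_toy = np.arange(8).reshape(4, 2)  # four hypothetical samples, two features each
loo_splits = list(LeaveOneOut().split(X_toy))

for train_idx, test_idx in loo_splits:
    print("train:", train_idx, "test:", test_idx)
print("number of splits:", len(loo_splits))
```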

# Shuffle Split Cross Validation

![alt-text](images/cv-3.png)
* Combines shuffling and splitting: each iteration draws a random training set and a disjoint random test set
* The data blender is on "High" now

In [14]:
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(test_size=.25, train_size=.75, n_splits=10)
scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
print('Cross-validation scores:\n', scores)

Cross-validation scores:
 [0.97368421 0.97368421 0.86842105 0.94736842 0.81578947 0.92105263
 0.94736842 0.94736842 0.97368421 0.89473684]
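
The following sketch looks at the splits themselves rather than the scores. It uses the same `test_size`, `train_size` and `n_splits` as above, with a `random_state` added only so the illustration is repeatable; `split_sizes` and `seen_in_test` are names made up here. Unlike k-fold, nothing guarantees that every sample is tested exactly once, since each iteration samples independently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit

iris = load_iris()
# random_state is fixed here only to make the sketch repeatable
ss = ShuffleSplit(test_size=.25, train_size=.75, n_splits=10, random_state=0)

split_sizes = []
seen_in_test = set()
for train_idx, test_idx in ss.split(iris.data):
    split_sizes.append((len(train_idx), len(test_idx)))
    seen_in_test.update(test_idx.tolist())

print("(train, test) sizes of the first split:", split_sizes[0])
print("distinct samples that appeared in some test set:", len(seen_in_test))
```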


# GroupKFold Validation

* suppose we want to build a system to recognize emotions from pictures of faces

![alt-text](images/emotions.jpg)

* we collect a dataset of pictures of 100 people
 * each person is captured multiple times, showing various emotions
* goal is to build a classifier that correctly identifies emotions of people not in the dataset
* you could use the default stratified cross-validation to measure the performance of a classifier
* however, it's likely that pictures of the same person will be in both training and test set
* much easier for a classifier to detect emotions in a face that is part of the training set, compared to a completely new face
* in order to accurately evaluate generalization to new faces, we must ensure that training and test sets contain images of different people
* to do this, we can use GroupKFold, which takes an array of groups as an argument that we can use to indicate which person is in the image
* the groups array here indicates groups in the data that should not be split when creating the training and test sets, and should not be confused with the class label
![alt-text](images/cv-4.png)

In [15]:
from sklearn.model_selection import GroupKFold
# create synthetic dataset
X, y = make_blobs(n_samples=12, random_state=0)
# assume the first three samples belong to the same group,
# then the next four, etc.
groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
scores = cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print("Cross-validation scores:\n", scores)

Cross-validation scores:
 [0.75       0.8        0.66666667]
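
As a quick check of the property described above, the sketch below iterates over the `GroupKFold` splits for the same 12-sample setup and verifies that no group appears on both sides of a split. The names `X_small`, `y_small` and `group_overlaps` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GroupKFold

X_small, y_small = make_blobs(n_samples=12, random_state=0)
groups = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3])

group_overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X_small, y_small, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    # a non-empty intersection would mean a group was split across train and test
    group_overlaps.append(train_groups & test_groups)
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))
```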


In [16]:
# Müller and Guido

# Model Performance - Linear Classifiers
![alt-text](images/svc.png)

# Regularization Parameters
* Do we care more about generalizing to new data, or about classifying every training point correctly?
* Support Vector Machines have hyperparameters C and gamma
* C = _cost_ for making errors
  * A large C gives you low bias and high variance
  * A small C gives you higher bias and lower variance

![alt-text](images/svc-1.png)

### gamma = kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’
 * higher values of gamma make the model fit the training set more exactly, which tends to cause overfitting and a larger generalization error
 * gamma controls how "pointy" the peaks are: low gamma = smooth, higher gamma = pointy

![alt-text](images/gamma.png)
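
A small sketch for exploring these two hyperparameters: it fits an `SVC` on an iris training split for a handful of hand-picked `(C, gamma)` pairs and prints training and test accuracy side by side. The pairs, the split, and the names `X_tr`, `X_te` and `results` are assumptions chosen for illustration; a large gap between training and test accuracy is the usual sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)

results = {}
# hand-picked (C, gamma) pairs, chosen only to contrast small and large values
for C, gamma in [(1, 0.001), (1, 100), (0.001, 0.1), (100, 0.1)]:
    svc = SVC(C=C, gamma=gamma).fit(X_tr, y_tr)
    results[(C, gamma)] = (svc.score(X_tr, y_tr), svc.score(X_te, y_te))
    print("C={:<6} gamma={:<6} train={:.2f} test={:.2f}".format(
        C, gamma, *results[(C, gamma)]))
```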

In [17]:
# naive grid search implementation
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
                    iris.data, iris.target, random_state=0)
print("Size of training set: {}   size of test set: {}".format(
         X_train.shape[0], X_test.shape[0]))

best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters, train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the test set
        score = svm.score(X_test, y_test)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

print("Best score: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))

Size of training set: 112   size of test set: 38
Best score: 0.97
Best parameters: {'C': 100, 'gamma': 0.001}


# What's the problem with the above?

# Train-Validate-Test

![alt-text](images/train-validate-test.png)
### Split the data 3 ways now....
* Training (to train the model on)
* Validation (to refine hyperparameters on)
* Test (to evaluate the final model on)

In [18]:
from sklearn.svm import SVC
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

In [19]:
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)
len(X_train)

84

In [20]:
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the validation set
        score = svm.score(X_valid, y_valid)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
            
print(best_score, best_parameters)

0.9642857142857143 {'C': 10, 'gamma': 0.001}


### A note about the `**` syntax below
* __`best_parameters`__ is a dict(ionary), i.e., a group of key/value pairs
* the __`SVC`__ constructor accepts _keyword arguments_, i.e., arguments passed by name as __`key=value`__ rather than by position
* prefacing a dict with __`**`__ in a call causes the key/value pairs in the dict to be "exploded" (unpacked) into __`key=value`__ arguments

In [23]:
def func(x, y, z):
    print('arg x is', x)
    print('arg y is', y)
    print('arg z is', z)
    print('done')

func(3, 4, 5)
d = { 'x': 1, 'z': 3, 'y': 2 }
func(**d)  # equivalent to func(x=1, y=2, z=3)

arg x is 3
arg y is 4
arg z is 5
done
arg x is 1
arg y is 2
arg z is 3
done


In [24]:
print("Size of training set: {} size of validation set: {} size of test set: {}".format(
         X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
# rebuild a model on the combined training and validation set,
# and evaluate it on the test set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))

Size of training set: 84 size of validation set: 28 size of test set: 38
Best score on validation set: 0.96
Best parameters:  {'C': 10, 'gamma': 0.001}
Test set score with best parameters: 0.92


# Cross Validation and Grid Search

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# reference: manual_grid_search_cv
# reset the running best so the score from the earlier validation-set search does not carry over
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters,
        # train an SVC
        svm = SVC(gamma=gamma, C=C)
        # perform cross-validation
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
        # compute mean cross-validation accuracy
        score = np.mean(scores)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
# rebuild a model on the combined training and validation set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
print(best_parameters, best_score)

# GridSearchCV

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

grid_search = GridSearchCV(SVC(),
                param_grid, cv=5, return_train_score=True)

In [26]:
X_train, X_test, y_train, y_test = train_test_split(
     iris.data, iris.target, random_state=0)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [27]:
print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))

Test set score: 0.97


In [29]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 100, 'gamma': 0.01}
Best cross-validation score: 0.97


In [30]:
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best estimator:
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
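
Beyond the single best result, the fitted search object keeps the mean cross-validation score of every parameter combination in its `cv_results_` attribute. The sketch below repeats the setup from the cells above so it can run on its own, then arranges those scores as a 6x6 table. The reshape relies on the combinations being enumerated with `C` varying slowest and `gamma` fastest; the name `mean_scores` is introduced only here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)

# one mean score per parameter combination: rows are C values, columns are gamma values
mean_scores = np.array(grid_search.cv_results_['mean_test_score']).reshape(6, 6)
print(np.round(mean_scores, 2))
print("Best mean score in the table: {:.2f}".format(mean_scores.max()))
```

The largest entry in the table corresponds to `grid_search.best_score_`, and its row and column identify the `C` and `gamma` reported in `best_params_`.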
