# Week 3: Model Evaluation and Selection


## Evaluation

A big loop: 

Representation --> Train models --> Evaluation --> Feature and model refinement --> 

Choose an evaluation method that matches the goal of your analysis. Compute your evaluation metric for multiple different models, then select the model with the 'best' performance for that evaluation metric.

## Accuracy

Accuracy alone may not give a good picture of performance. For example, a dataset may contain a very small number of "yes" events and a large number of "no" events. This is called class imbalance: the classes are "imbalanced."

Ex: Two classes  
* Relevant (R): the positive class
* Not_Relevant (NR): the negative class

Out of 1000 randomly selected items, on average only one has an R label and 999 have an NR label.

Accuracy is calculated as Number labeled correctly / Total labeled. 

If you misclassified the one R event as NR, you would get an accuracy of 999/1000 = 99.9%. That is misleading because you missed 100% of the R events. 
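
The arithmetic above can be checked with a small sketch. The label arrays here are made up for illustration (1 = R, 0 = NR) and are not part of the digits data used below.

```python
import numpy as np

# Hypothetical labels: 1 = Relevant (R), 0 = Not_Relevant (NR)
y_true = np.zeros(1000, dtype=int)
y_true[0] = 1                      # exactly one R item out of 1000

# A "classifier" that labels every item NR
y_pred = np.zeros(1000, dtype=int)

# Fraction of all items labeled correctly
accuracy = np.mean(y_pred == y_true)

# Fraction of the actual R items that were labeled R
recall_r = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(accuracy)
print(recall_r)
```

Accuracy comes out at 0.999 while the fraction of R items found is 0, which is the mismatch described above.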

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

dataset = load_digits()
X, y = dataset.data, dataset.target

There are 64 features and 1797 data points in the data.

In [4]:
print(X.shape)
print(y.shape)

(1797, 64)
(1797,)


In [6]:
for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name, class_count)

0 178
1 182
2 177
3 183
4 181
5 182
6 181
7 179
8 174
9 180


In [9]:
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0

print('Original labels', y[1:30])
print('New binary labels', y_binary_imbalanced[1:30])

Original labels [1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
New binary labels [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]


In [10]:
np.bincount(y_binary_imbalanced)

array([1615,  182], dtype=int64)

In [11]:
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state = 0)

svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)

0.9088888888888889

## Dummy Classifier 
Create a dummy classifier. Dummy classifiers don't even look at the features to classify; they just apply a predefined strategy:

* most_frequent  
* stratified  
* uniform  
* constant  

This provides a *null metric*: a baseline for how well classification would do without any information from the features. A dummy classifier shouldn't be used for real classification, but it is useful as a sanity check.

In [12]:
from sklearn.dummy import DummyClassifier

dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train,y_train)

y_dummy_predictions = dummy_majority.predict(X_test)
y_dummy_predictions

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [13]:
dummy_majority.score(X_test,y_test)

0.9044444444444445

Everything was labeled '0', yet the accuracy is still about 90%, nearly the same as the RBF-kernel SVM above (0.909).

If your classifier performs about as well as the null classifier, it could be a sign of:  
* Ineffective, erroneous or missing features  
* Poor choice of kernel or hyperparameter  
* Large class imbalance  

Try an SVM with a linear kernel instead.

In [14]:
svm = SVC(kernel = 'linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)

0.9777777777777777

In a dummy regressor, strategy may be:  
* mean
* median
* quantile
* constant
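
A minimal sketch of the regression counterpart, assuming scikit-learn's `DummyRegressor` from `sklearn.dummy`. The toy arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy data, made up for illustration; the dummy model ignores the features
X_toy = np.arange(5).reshape(-1, 1)
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

dummy_mean = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
dummy_median = DummyRegressor(strategy='median').fit(X_toy, y_toy)

# Every prediction is the same constant learned from the training targets
mean_pred = dummy_mean.predict(X_toy)
median_pred = dummy_median.predict(X_toy)

print(mean_pred)
print(median_pred)
```

With one large outlier in `y_toy`, the mean and median baselines differ noticeably (22.0 versus 3.0), so the choice of strategy can change what "no better than baseline" means.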

## Binary Prediction Outcomes, Confusion Matrix

Confusion matrix:

* True negative (TN): predicted negative, actually negative
* True positive (TP): predicted positive, actually positive
* False positive (FP): predicted positive, actually negative
* False negative (FN): predicted negative, actually positive

Organized as:

[[TN   FP]  
 [FN   TP]]
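
As a small sketch of this layout, the four counts can be unpacked from scikit-learn's `confusion_matrix` with `ravel()`. The label arrays `y_true_toy` and `y_pred_toy` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes,
# so flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

print(tn, fp, fn, tp)
```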

This can be done easily in scikit-learn. Try it with a **dummy classifier where everything is classified as the most frequent value** (in this case it is 0). 

In [15]:
from sklearn.metrics import confusion_matrix

# The negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train,y_train)

y_majority_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_majority_predicted)

print('Most frequent class (dummy classifier)\n', confusion)

Most frequent class (dummy classifier)
 [[407   0]
 [ 43   0]]


Look at this in detail:

* TN: 407, identified *correctly* as negative (0)
* FP: 0, nothing identified as positive that is actually negative
* FN: 43, identified *falsely* as negative (0)-- actually positive (1)
* TP: 0, identified *correctly* as positive (1)

This makes sense: the classifier never predicts the positive class, so TP and FP must both be 0.

Now try it with **Random predictions with same class proportion as training set**

In [16]:
dummy_classprop = DummyClassifier(strategy = 'stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)

print('Random class-proportional prediction (dummy classifier)\n', confusion)

Random class-proportional prediction (dummy classifier)
 [[356  51]
 [ 37   6]]


Now, since the predicted values contain positive and negative values, we have samples in each part of the confusion matrix. Move on...

**SVC Linear kernel**

In [18]:
svm = SVC(kernel='linear', C=1).fit(X_train,y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)

print('Support vector machine classifier (linear kernel, C=1)\n', confusion)

Support vector machine classifier (linear kernel, C=1)
 [[402   5]
 [  5  38]]


This is the best result so far.

**Logistic Regression**

In [19]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)

print('Logistic regression classifier (default settings)\n',confusion)

Logistic regression classifier (default settings)
 [[401   6]
 [  6  37]]


**Decision Tree Classifier**

In [20]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=2).fit(X_train,y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test,tree_predicted)

print('Decision tree classifier (max_depth = 2)\n',confusion)

Decision tree classifier (max_depth = 2)
 [[400   7]
 [ 17  26]]


Other interesting calculations:
    
Classification error = $(FN+FP)/(FN+FP+TN+TP)$

**Recall**, or True Positive Rate. What fraction of all positive instances does the classifier correctly identify as positive? $TP / (TP+FN)$

What if it's important to reduce false positives? Look at **precision**: what fraction of positive predictions are actually positive? $TP / (TP+FP)$

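**False positive rate** (equal to 1 - specificity). What fraction of all negative instances does the classifier incorrectly identify as positive? $FP/(TN+FP)$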

**Tradeoff between precision and recall**

* Recall-oriented machine learning tasks:
    * Search and information extraction in legal discovery
    * Tumor detection
    * Often paired with a human expert to filter out false positives
* Precision-oriented machine learning tasks:
    * Search engine ranking, query suggestion
    * Document classification
    * Many customer-facing tasks (users remember failures)
    
**F1 score** combines precision and recall into a single number.

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FN + FP} $$
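
As a worked check of the formulas above, here is a short sketch that plugs in the four counts from the decision tree confusion matrix shown earlier (TN = 400, FP = 7, FN = 17, TP = 26). The variable names are chosen for illustration only.

```python
# Counts copied from the decision tree confusion matrix above
tn, fp, fn, tp = 400, 7, 17, 26

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # fraction labeled correctly
error = (fp + fn) / total         # classification error = 1 - accuracy
recall = tp / (tp + fn)           # true positive rate
precision = tp / (tp + fp)        # fraction of positive predictions that are right
fpr = fp / (tn + fp)              # false positive rate

# The two forms of F1 should agree
f1 = 2 * precision * recall / (precision + recall)
f1_counts = 2 * tp / (2 * tp + fn + fp)

for name, value in [('accuracy', accuracy), ('error', error),
                    ('recall', recall), ('precision', precision),
                    ('fpr', fpr), ('f1', f1), ('f1 (from counts)', f1_counts)]:
    print(f'{name}: {value:.4f}')
```

Computing the metrics by hand like this is a quick way to confirm which cell of the matrix each formula draws on.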

In [21]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_test, tree_predicted))
print(precision_score(y_test, tree_predicted))
print(recall_score(y_test, tree_predicted))
print(f1_score(y_test, tree_predicted))

0.9466666666666667
0.7878787878787878
0.6046511627906976
0.6842105263157895


To get a summary of all of those metrics...

In [22]:
from sklearn.metrics import classification_report

print(classification_report(y_test, tree_predicted, target_names=['not 1','1']))

             precision    recall  f1-score   support

      not 1       0.96      0.98      0.97       407
          1       0.79      0.60      0.68        43

avg / total       0.94      0.95      0.94       450



## Decision Function

* Provides information on how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values)
* Choosing a fixed decision threshold gives a classification rule  
* By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve
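
The threshold sweep can be sketched with plain NumPy. The scores and labels below are made up for illustration; with scikit-learn, scores like these would typically come from a fitted classifier's `decision_function(X_test)`. The helper `precision_recall_at` is a hypothetical name, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
import numpy as np

# Made-up decision scores and true labels, for illustration only
scores = np.array([-2.1, -1.3, -0.4, 0.2, 0.5, 1.1, 1.8, 2.4])
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then evaluate that rule."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Assumed convention: precision is 1.0 if no positives are predicted
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Sweep the threshold across every observed score, lowest to highest
curve = [(t, *precision_recall_at(t)) for t in np.sort(scores)]

for t, p, r in curve:
    print(f'threshold={t:+.1f}  precision={p:.2f}  recall={r:.2f}')
```

Each threshold gives one (precision, recall) pair. Raising the threshold can only keep recall the same or lower it, which is the tradeoff the resulting curve makes visible.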