<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px">

# Classifier Evaluation and the Confusion Matrix

Week 4 | 4.2

---

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [2]:
spam = pd.read_csv('~/DSI-SF-5/datasets/spam/spam_words_wide.csv')

In [3]:
spam.head()

Unnamed: 0,is_spam,getzed,86021,babies,sunoco,ultimately,thk,voted,spatula,fiend,...,itna,borin,thoughts,iccha,videochat,freefone,pist,reformat,strict,69698
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
print(spam.shape, spam.is_spam.sum())

(5572, 1001) 747


---

### The Baseline Accuracy

The concept of baseline accuracy is extremely important and too often forgotten when evaluating a model. 

> **Baseline Accuracy**: The accuracy that can be achieved by a model simply by guessing the majority class every time.

We are trained to think of "50% accuracy" as guessing by chance. In fact, 50% accuracy corresponds to chance only in a very specific context: when we are predicting between two classes and the dataset has equal proportions of positive and negative (1 and 0) labels.

In reality, your dataset is unlikely to have balanced classes, and the more unbalanced it is, the higher the baseline accuracy becomes. This is important to remember: if 99% of your observations belong to one class, a model that predicts 99% of them correctly is doing no better than always guessing the majority class.

You can calculate baseline accuracy as: 

**`baseline_accuracy = majority_class_N / total_population`**
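As a quick worked check of the formula, the sketch below plugs in the counts reported earlier for the spam dataset (5572 rows, 747 of them spam). The variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the shape and is_spam sum printed earlier.
total_population = 5572
n_spam = 747
n_not_spam = total_population - n_spam

# The majority class is whichever label occurs most often.
majority_class_n = max(n_spam, n_not_spam)
baseline_accuracy = float(majority_class_n) / total_population

print(round(baseline_accuracy, 4))  # 0.8659
```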

#### Calculate the baseline accuracy for the spam dataset

In [5]:
baseline_acc = 1. - spam.is_spam.mean()
baseline_acc

0.8659368269921034

---

### Set up a kNN model to predict spam

It's up to you what predictors you want to use and how you want to parameterize the model.

In [13]:
from sklearn.neighbors import KNeighborsClassifier

y = spam.is_spam.values
X = spam.iloc[:, 1:100]

knn = KNeighborsClassifier()

#### Cross-validate the accuracy of the model

Use 10 folds. How does the performance compare to the baseline accuracy?

In [14]:
from sklearn.model_selection import cross_val_score

accs = cross_val_score(knn, X, y, cv=10)
print(accs)
print(np.mean(accs))

[ 0.8781362   0.88530466  0.88351254  0.87992832  0.8781362   0.87432675
  0.88689408  0.87589928  0.88309353  0.88669065]
0.881192220024


---

### Predicted labels and predicted probabilities

Sklearn classification models typically have methods to predict the labels (classes) of observations as well as the _predicted probabilities_ of the labels, that is, the estimated probability that an observation belongs to each class.

The `.predict()` method returns the predicted labels for a design matrix. The `.predict_proba()` method returns the probabilities of belonging to each class, with one column per class in ascending order of class label (the order is stored in the model's `classes_` attribute).

Fit the knn model and print out the predicted labels and predicted probabilities for a few points below.

In [15]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [16]:
knn.predict(X.iloc[0:10, :])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [17]:
print(y[0:10])

[0 0 1 0 0 1 0 0 1 1]


In [18]:
knn.predict_proba(X.iloc[0:10, :])

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])
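With the default `weights='uniform'`, a kNN classifier's predicted probability for a class is the fraction of the `n_neighbors` nearest neighbors that carry that label, so a row of `[1., 0.]` means all 5 nearest neighbors were labeled 0. The following is a minimal sketch of that voting step, assuming the neighbor labels have already been found. `vote` is a hypothetical helper written for illustration, not an sklearn function.

```python
from collections import Counter

def vote(neighbor_labels, classes=(0, 1)):
    """Return (probabilities, predicted_label) from the labels of the k neighbors."""
    counts = Counter(neighbor_labels)
    k = float(len(neighbor_labels))
    # One probability per class, in ascending class order like predict_proba.
    proba = [counts[c] / k for c in classes]
    # The predicted label is the class with the highest probability.
    label = classes[proba.index(max(proba))]
    return proba, label

print(vote([0, 0, 0, 0, 0]))  # ([1.0, 0.0], 0)
print(vote([0, 1, 1, 0, 1]))  # ([0.4, 0.6], 1)
```

With 5 neighbors the only possible probabilities are 0, 0.2, 0.4, 0.6, 0.8 and 1, which is why kNN probabilities look so coarse.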

---

### The Confusion Matrix

|   |Predicted Positive | Predicted Negative |   
|---|---|---|
|**Actual Positive** | True Positive (TP)  | False Negative (FN)  |  
|**Actual Negative**  | False Positive (FP)  | True Negative (TN)  | 

In a binary classifier, the positive ("true") class is conventionally labeled 1 and the negative ("false") class is labeled 0. In the spam data, `is_spam = 1` is the positive class.

> **True Positive**: A positive class observation (1) is correctly classified as positive by the model.

> **False Positive**: A negative class observation (0) is incorrectly classified as positive.

> **True Negative**: A negative class observation is correctly classified as negative.

> **False Negative**: A positive class observation is incorrectly classified as negative.

Columns of the confusion matrix sum to the predictions by class. Rows of the matrix sum to the actual values within each class.

As the name suggests, the labels can be confusing. Remember that the first word (True or False) indicates whether or not the guess is correct. The second word (Positive or Negative) indicates the label of the _guess_ (not the actual label).
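The naming rule can be written down as a small function. This is only an illustrative sketch for 0/1 labels, and `outcome` is a hypothetical helper name.

```python
def outcome(actual, predicted):
    """Name the confusion-matrix cell for one binary (0/1) observation."""
    first = "True" if actual == predicted else "False"     # was the guess right?
    second = "Positive" if predicted == 1 else "Negative"  # what was the guess?
    return first + " " + second

for actual, predicted in [(1, 1), (0, 1), (0, 0), (1, 0)]:
    print(actual, predicted, outcome(actual, predicted))
```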

#### Calculate the confusion matrix metrics for your model below

In [19]:
predicted = knn.predict(X)

In [20]:
tp = np.sum((y == 1) & (predicted == 1))
fp = np.sum((y == 0) & (predicted == 1))
tn = np.sum((y == 0) & (predicted == 0))
fn = np.sum((y == 1) & (predicted == 0))
print(tp, fp, tn, fn)

107 18 4807 640


#### Verify this is the same as the numbers you get from sklearn's `confusion_matrix`

In [21]:
from sklearn.metrics import confusion_matrix

In [22]:
confusion_matrix(y, predicted)

array([[4807,   18],
       [ 640,  107]])
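Note that this array is not laid out like the table at the start of this section. sklearn orders the rows (actual) and columns (predicted) by ascending label, so with 0/1 labels the negative class comes first: `[[TN, FP], [FN, TP]]`. The sketch below rebuilds both layouts from the four counts computed earlier. It is plain Python for illustration and does not call sklearn.

```python
# Counts from the kNN model above.
tp, fp, tn, fn = 107, 18, 4807, 640

# sklearn-style layout: rows = actual label, columns = predicted label,
# both in ascending label order (0 first, then 1).
sklearn_layout = [[tn, fp],
                  [fn, tp]]

# Layout of the table in this section: positive class first.
table_layout = [[tp, fn],
                [fp, tn]]

# Row sums give the actual class counts in either layout.
print([sum(row) for row in sklearn_layout])  # [4825, 747]
print([sum(row) for row in table_layout])    # [747, 4825]
```

With sklearn itself, `confusion_matrix(y, predicted).ravel()` returns the counts in the order `tn, fp, fn, tp` for a binary problem.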

---

### Type I error and p-values

In the context of hypothesis testing, false positives and false negatives are often referred to as Type I and Type II error, respectively.

Type I error is the incorrect rejection of the null hypothesis when in fact the null hypothesis is true. This is equivalent to a false positive in classification, in which the model labels an observation as positive when in fact it is negative.

The Type I error rate is controlled by the significance level, alpha: **if we reject the null hypothesis whenever the p-value falls below alpha, we will incorrectly reject a true null hypothesis with probability alpha.** The p-value itself is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed.
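To make the Type I error rate concrete, the simulation below repeatedly draws data for which the null hypothesis is true and counts how often a test at alpha = 0.05 rejects it anyway. It is a sketch under stated assumptions: a two-sided z-test with known standard deviation was chosen for simplicity, the sample size and number of trials are arbitrary, and it needs Python 3.8+ for `statistics.NormalDist`.

```python
import random
from statistics import NormalDist, mean

random.seed(0)            # fixed seed so the run is repeatable
alpha = 0.05              # significance level
n, trials = 30, 2000      # sample size per experiment, number of experiments
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

false_positives = 0
for _ in range(trials):
    # H0 is true here: the data really come from a mean-0, sd-1 normal.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = mean(sample) * n ** 0.5   # z statistic with known sd = 1
    if abs(z) > z_crit:           # equivalent to p-value < alpha
        false_positives += 1

type_i_rate = false_positives / trials
print(type_i_rate)  # should land close to alpha
```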

---

### Type II error and "power"

Type II error, on the other hand, directly corresponds to false negatives. A Type II error in the context of hypothesis testing is failing to reject the null hypothesis when in fact the alternative hypothesis is true.

Whereas Type I error corresponds to the concept of _statistical significance_, Type II error corresponds to the idea of _statistical power._ The power of a test is:

### $$ \text{power} = 1 - P(\text{Type II error}) $$

More intuitively, **power measures our ability to detect an effect that is present.**

We can visualize the ideas of significance, power, and error types in a matrix laid out like the confusion matrix above:

|   |Fail to reject $H_0$ | Reject $H_0$ |   
|---|---|---|
|**$H_0$ is True** | P(correct) <br> _(1 - alpha)_  | P(type I error) <br> _(alpha, significance)_  |  
|**$H_0$ is False**  | P(type II error) <br> _(beta)_  | P(correct) <br> _(1 - beta, power)_ | 
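As a worked instance of `power = 1 - beta`, the sketch below computes the power of a two-sided z-test analytically. The test, the effect size of 0.5 standard deviations and the sample size of 30 are all illustrative assumptions, `z_test_power` is a hypothetical helper, and the code needs Python 3.8+ for `statistics.NormalDist`.

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided z-test with known sd = 1."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5   # mean of the z statistic under H1
    # P(reject H0) = P(z > z_crit) + P(z < -z_crit) when z ~ Normal(shift, 1)
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

power = z_test_power(effect_size=0.5, n=30)
beta = 1 - power   # probability of a Type II error
print(round(power, 2), round(beta, 2))  # 0.78 0.22
```

When the effect size is 0 the null hypothesis is true, and the same function returns alpha, which is the top-right cell of the table.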

---

### Accuracy

The accuracy metric can be constructed using the components of the confusion matrix. With the total population as:

**`total_population = tp + fp + tn + fn`**

The accuracy can be calculated as:

**`accuracy = (tp + tn) / total_population`**

This is just the proportion of correct guesses, regardless of class. The `.score()` method attached to sklearn classification model objects defaults to returning the accuracy of the model's predictions given an `X` and `y`.

The complement of the accuracy (1 minus the accuracy) is known as the **misclassification rate**, which is calculated:

**`misclassification_rate = (fp + fn) / total_population`**

In [28]:
from sklearn.metrics import accuracy_score

total_population = tp + fp + tn + fn

print(accuracy_score(y, predicted))
print(float(tp + tn) / total_population)

0.881909547739
0.881909547739
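The sketch below recomputes both rates from the four confusion matrix counts found earlier and shows that they sum to 1.

```python
tp, fp, tn, fn = 107, 18, 4807, 640
total_population = tp + fp + tn + fn

accuracy = float(tp + tn) / total_population
misclassification_rate = float(fp + fn) / total_population

print(round(accuracy, 4), round(misclassification_rate, 4))  # 0.8819 0.1181
```

Compared with the baseline accuracy of about 0.866, the fitted model is only about 1.6 percentage points better, which is why accuracy alone is a weak summary for this dataset.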


---

### Sensitivity / Recall / True Positive Rate

The true positive rate is the fraction of observations whose actual label is 1 that the model also predicted as 1. It is also known as the **Sensitivity** or **Recall**.

This is calculated as:

**`sensitivity = tp / (tp + fn)`**


In [30]:
from sklearn.metrics import recall_score

print(recall_score(y, predicted))
print(float(tp) / (tp + fn))

0.143239625167
0.143239625167


---

### False Positive Rate

Alternatively, the false positive rate measures the fraction of times the model predicts a 1 when the target class is actually a 0. 

**`fpr = fp / (tn + fp)`**

In [31]:
print(float(fp) / (tn + fp))

0.00373056994819


---

### Specificity, or the True Negative Rate

The true negative rate measures the fraction of times the classifier predicted the class was 0 out of all the times the class was 0. It can be considered the sister metric to Sensitivity, which measures the same thing but for positives.

**`specificity = tn / (tn + fp)`**

In [32]:
specificity = float(tn) / (tn + fp)
print(specificity)

0.996269430052
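Because specificity and the false positive rate share the same denominator (all actual negatives), they always sum to 1. A short check with the counts from this model:

```python
fp, tn = 18, 4807
negatives = tn + fp   # all observations whose actual class is 0

fpr = float(fp) / negatives
specificity = float(tn) / negatives

print(round(fpr, 4), round(specificity, 4))  # 0.0037 0.9963
```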


---

### Precision, or Positive Predictive Value

The precision measures the fraction of times that the classifier guessed correctly when it was predicting the true (1) class.

**`precision = tp / (tp + fp)`**

The idea of the classifier being _precise_ is subtly different from being _accurate_. Precision measures correctness only on the observations the classifier assigns to the positive class, regardless of how many times it actually "tries".

In [34]:
from sklearn.metrics import precision_score

print(precision_score(y, predicted))
print(float(tp) / (tp + fp))

0.856
0.856
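The numbers for this model show how a classifier can be precise without finding many positives. It predicted spam only 125 times, so most of the 747 spam messages were never flagged. The sketch reuses the counts computed earlier.

```python
tp, fp, tn, fn = 107, 18, 4807, 640

positive_guesses = tp + fp   # how often the model "tried" predicting spam
actual_positives = tp + fn   # how many spam messages there really are

precision = float(tp) / positive_guesses
recall = float(tp) / actual_positives

print(positive_guesses, actual_positives)     # 125 747
print(round(precision, 3), round(recall, 3))  # 0.856 0.143
```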


---

## F1-score and the `classification_report`

sklearn contains a function `classification_report` in the `metrics` submodule that helps diagnose the effectiveness of your classifier. The report focuses on the precision, recall, and a metric known as the f1-score.

The f1-score is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of the precision and recall metrics. Blending the two is useful because they answer different questions: precision measures how often the classifier is correct when it predicts a 1, whereas recall measures what fraction of all the observations actually labeled 1 were predicted as 1.

### $$ F_1 = 2 \cdot \frac{1}{\tfrac{1}{\mathrm{recall}} + \tfrac{1}{\mathrm{precision}}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

By combining the two we get a single measure of both the classifier's ability to find the positive-labeled observations and how often its positive predictions are mistaken.
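As a worked example, the sketch below computes the f1-score for the spam (1) class from the counts found earlier and compares it with the plain arithmetic mean of precision and recall.

```python
tp, fp, fn = 107, 18, 640
precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(round(f1, 3), round(arithmetic_mean, 3))  # 0.245 0.5
```

The harmonic mean is pulled toward the smaller of the two values, so the low recall keeps the f1-score low even though the precision is high.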

You can print out the report of these three metrics on both of the classes (or more if you have a multi-class problem) using the `classification_report` function.

In [23]:
from sklearn.metrics import classification_report

In [25]:
print(classification_report(y, predicted))

             precision    recall  f1-score   support

          0       0.88      1.00      0.94      4825
          1       0.86      0.14      0.25       747

avg / total       0.88      0.88      0.84      5572
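Every number in this report can be rebuilt from the four confusion matrix counts. The sketch below does so in plain Python. `prf` is a hypothetical helper, and the last step assumes, consistent with the figures shown, that the `avg / total` row is the support-weighted mean of each column.

```python
tp, fp, tn, fn = 107, 18, 4807, 640

def prf(correct, wrongly_predicted, missed):
    """Precision, recall and f1 for one class treated as the positive class."""
    precision = float(correct) / (correct + wrongly_predicted)
    recall = float(correct) / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For class 0 the roles swap: tn are the correct guesses, fn the wrong "0" guesses.
rows = {0: prf(tn, fn, fp), 1: prf(tp, fp, fn)}
support = {0: tn + fp, 1: tp + fn}
total = float(sum(support.values()))

for label in (0, 1):
    print(label, [round(v, 2) for v in rows[label]], support[label])

# Each "avg / total" entry is the support-weighted mean of the column.
weighted = [sum(rows[c][i] * support[c] for c in (0, 1)) / total for i in range(3)]
print([round(v, 2) for v in weighted])
```

Because class 0 has far more support, the averaged row looks healthy even though recall on the spam class is only 0.14.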



---

### Table of Common Classification Terms

<br><br>

|  TERM | DESCRIPTION  |
|---|---|
|**TRUE POSITIVES** | The number of "true" classes correctly predicted to be true by the model. <br><br> `TP = Sum of observations predicted to be 1 that are actually 1`<br><br>The true class in a binary classifier is labeled with 1.|
|**TRUE NEGATIVES** | The number of "false" classes correctly predicted to be false by the model. <br><br> `TN = Sum of observations predicted to be 0 that are actually 0`<br><br>The false class in a binary classifier is labeled with 0.|
|**FALSE POSITIVES** | The number of "false" classes incorrectly predicted to be true by the model. This is the measure of **Type I error**.<br><br> `FP = Sum of observations predicted to be 1 that are actually 0`<br><br>Remember that the "true" and "false" refer to the veracity of your guess, and the "positive" and "negative" component refer to the guessed label.|
|**FALSE NEGATIVES** | The number of "true" classes incorrectly predicted to be false by the model. This is the measure of **Type II error.**<br><br> `FN = Sum of observations predicted to be 0 that are actually 1`<br><br>|
|**TOTAL POPULATION** | In the context of the confusion matrix, the sum of the cells. <br><br> `total population = tp + tn + fp + fn`<br><br>|
|**SUPPORT** | The marginal sum of rows in the confusion matrix, or in other words the total number of observations belonging to a class regardless of prediction. <br><br>|
|**ACCURACY** | The number of correct predictions by the model out of the total number of observations. <br><br> `accuracy = (tp + tn) / total_population`<br><br>|
|**PRECISION** | The ability of the classifier to avoid labeling observations from other classes as the current class. <br><br> `Precision = True Positives / (True Positives + False Positives)`<br><br>_A precision score of 1 indicates that every observation the classifier assigned to the current class actually belongs to it. A precision score of 0 would mean that none of the observations assigned to the current class actually belong to it._ |
|**RECALL/SENSITIVITY**    | The ability of the classifier to correctly identify the current class. <br><br>`Recall = True Positives / (True Positives + False Negatives)`<br><br>A recall of 1 indicates that the classifier correctly predicted all observations of the class.  0 means the classifier predicted all observations of the current class incorrectly.|
|**SPECIFICITY** | Percent of times the classifier predicted 0 out of all the times the class was 0.<br><br> `specificity = tn / (tn + fp)`<br><br>|
|**FALSE POSITIVE RATE** | Percent of times model predicts 1 when the class is 0.<br><br> `fpr = fp / (tn + fp)`<br><br>|
|**F1-SCORE** | The harmonic mean of the precision and recall. The harmonic mean is used here rather than the more conventional arithmetic mean because the harmonic mean is more appropriate for averaging rates. <br><br>`F1-Score = 2 * (Precision * Recall) / (Precision + Recall)` <br><br>_The f1-score's best value is 1 and worst value is 0, like the precision and recall scores. It is a useful metric for taking into account both measures at once._ |