# Week 5: Binary Classification

This notebook is heavily inspired by Aurélien Géron's work, which can be found here: https://github.com/ageron/handson-ml/blob/master/04_training_linear_models.ipynb

## Setup

In [None]:
# Python ≥ 3.9 is required
import sys
assert sys.version_info >= (3, 9)

# Scikit-Learn ≥1.0 is required
import sklearn
assert sklearn.__version__ >= "1.0"

# Common imports
import numpy as np
import pandas as pd
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

from time import time

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## The MNIST Dataset

We will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.

We will use `sklearn.datasets.fetch_openml()` to fetch the dataset from OpenML by name or dataset ID.

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml(
    'mnist_784',
    version=1,
    as_frame=False # we want the dataset as NumPy ndarray not as a pandas DataFrame
)
mnist.keys()

In [None]:
mnist['DESCR']

Let's import the dataset, inputs and labels:

In [None]:
X, y = mnist['data'], mnist['target']

In [None]:
X.shape

In [None]:
y.shape

`X` contains 70,000 images, each with 784 features, because each image is a 28x28 picture. Each feature is a pixel intensity encoded on an 8-bit scale: from 0 (white) to 255 (black).

Let's display one or more images using matplotlib's `imshow()`.

In [None]:
digit = X[9]
digit_img = digit.reshape(28, 28)

plt.imshow(digit_img, cmap='binary')
plt.axis('off')
plt.show()

In [None]:
y[9]

The label is a string. We must convert it to a number before passing it to a Machine Learning algorithm.

In [None]:
y = y.astype(np.uint8)

Here's a helper function to show a bunch of digit images from the dataset:

In [None]:
def show_digits(instances, images_per_row=10, **opts):
    """
    Utility function to display the MNIST digits on a grid
    """
    # the size of our images (28x28)
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # convert images from 1-D to 2-D arrays
    images = [instance.reshape(size, size) for instance in instances]
    # compute how many rows you need in the grid
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # create empty "dummy" images to fill potential remaining slots at the end of the grid
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    # concatenate all the images in a single grid image
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    # plot the grid image
    plt.imshow(image, cmap = mpl.cm.binary, **opts)
    plt.axis("off")
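
The grid assembly in `show_digits()` comes down to two kinds of `np.concatenate` calls: along `axis=1` to lay images side by side in a row, then along `axis=0` to stack the rows. Here is a minimal sketch of that idea, using made-up 2x2 tiles (the `tiles` list is purely illustrative and is not MNIST data):

```python
import numpy as np

# Four hypothetical 2x2 "images", each filled with a constant value (0, 1, 2, 3)
tiles = [np.full((2, 2), value) for value in range(4)]

# Lay two tiles side by side to form each row (axis=1 joins along columns)
row_top = np.concatenate(tiles[:2], axis=1)
row_bottom = np.concatenate(tiles[2:], axis=1)

# Stack the two rows vertically to form the full grid (axis=0 joins along rows)
grid = np.concatenate([row_top, row_bottom], axis=0)

print(grid.shape)
print(grid)
```

The result is a single 4x4 array, which is why one `imshow()` call is enough to draw the whole grid.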

In [None]:
plt.figure(figsize=(9,9))
example_images = X[:100]
show_digits(example_images, images_per_row=10)
plt.show()

## Split the dataset in training and test set

We'll set aside 10,000 samples for testing purposes. The dataset is already shuffled for us, so we can just take the last 10,000 samples as our test set.

In [None]:
# use slicing to create training and test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Training and evaluating a binary classifier

Let's define two sets of labels for the training and test set, named `y_train_8` and `y_test_8`. These must contain the value `True` whenever the original label is an 8, and `False` otherwise.

In [None]:
# Implemented using boolean masking
y_train_8 = y_train == 8
y_test_8 = y_test == 8

In [None]:
y_train_8[:20], y_train[:20]

### **Exercise:** 
Let's suppose we want to implement a binary classifier to classify even vs odd digits. Define two sets of labels for the training and test set, named `y_train_even` and `y_test_even`. These must contain the value `True` whenever the original label is a digit representing an even number, and `False` if it's an odd number.

In [None]:
# Write your solution here:


##### Logistic regression classifier

We can train a logistic regression classifier either with `sklearn.linear_model.LogisticRegression` or with `sklearn.linear_model.SGDClassifier`, setting the `loss` argument to `log` (or `log_loss` from scikit-learn v1.1 onwards).

In [None]:
from sklearn.linear_model import LogisticRegression

log_cl = LogisticRegression()
log_cl.fit(X_train, y_train_8)

### **Question:** 
A warning crops up if running the cell above. What is the issue? How could we fix it?

In [None]:
### Attempt a solution here (at home)



In [None]:
# make some predictions
log_cl.predict(X_train[10:20])

### **Exercise:**
Let's try using an `SGDClassifier` to fit a logistic regressor with Stochastic Gradient Descent rather than the default solver used by `LogisticRegression`.

**NB:** Check the SGDClassifier [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) to see what parameter you need to pass to `loss`

In [None]:
# Write your solution here


sgd_cl = ...


In [None]:
sgd_cl.predict(X_train[10:20])

In [None]:
y_train[10:20]

It has correctly predicted the "8" at index 17. However, this instance belongs to the data used during the training phase.
We need a validation set to fairly evaluate the performance of our logistic regression classifier.

### Performance Measures: Measuring Accuracy Using Cross-Validation

We'll now use `cross_val_score()` to assess the accuracy of our classifier `sgd_cl` on (X_train, y_train_8), using 3-fold cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_cl, X_train, y_train_8, cv=3, scoring="accuracy")

Our accuracy is 93% in the first two runs and 84% in the third. The classifier seems to perform very well, but is that really the case?

### **Exercise (at home):** 
Use `cross_val_score()` to assess the accuracy of an SGD classifier implementing an online support vector machine (SVM) on (X_train, y_train_8), using 5-fold cross-validation.

Is it more or less accurate than the SGD classifier implementing logistic regression?

In [None]:
# Write your solution here


Let's create a dummy classifier that never predicts that a digit is an "8". It will just always return `False` (i.e. `0`) as a predicted label.

In [None]:
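from sklearn.base import BaseEstimator

class Never8Classifier(BaseEstimator):
    """Dummy classifier that always predicts "not 8" (False)."""

    def fit(self, X, y=None):
        # nothing to learn; return self to follow the scikit-learn convention
        return self

    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

never_8_clf = Never8Classifier()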

### **Exercise:** 
Use `cross_val_score()` to assess the accuracy of our classifier `never_8_clf` on (X_train, y_train_8), using 3-fold cross-validation. What accuracy do you expect?

In [None]:
# Write your solution here


It has over 90% accuracy! This is simply because only about 10% of the images are 8s, so if you always guess that an image is not an 8, you will be right about 90% of the time.

Accuracy _per se_ is generally not the preferred metric for evaluating classifiers. This is even more true in this case, as we are dealing with a skewed dataset.
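
The arithmetic behind this can be checked on a small made-up example. The sketch below assumes a hypothetical label array with 10% positives (it does not use the MNIST data) and a "classifier" that always answers `False`:

```python
import numpy as np

# Hypothetical labels: 100 positives out of 1,000 samples (10% positive)
toy_true = np.zeros(1000, dtype=bool)
toy_true[:100] = True

# A "classifier" that always predicts the negative class
toy_pred = np.zeros(1000, dtype=bool)

# Fraction of predictions that match the labels
accuracy = np.mean(toy_pred == toy_true)
# Number of positive samples that were actually detected
n_positives_found = int(np.sum(toy_pred & toy_true))

print(f"accuracy: {accuracy:.2f}")              # 0.90
print(f"positives found: {n_positives_found}")  # 0
```

The accuracy is high even though not a single positive sample was detected, which is exactly the situation accuracy fails to expose on skewed data.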

## Performance Measures: Confusion Matrix

A more reliable way to measure the performance of a classifier is to look at the so-called _confusion matrix_. The aim is to count how many times members of a class C1 are misclassified as members of a class C2. To do that we will use the `cross_val_predict()` function (which returns the predictions made on each validation fold rather than the CV scores) together with the `confusion_matrix()` metric.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(
    sgd_cl, X_train, y_train_8, cv=3
)
confusion_matrix(y_train_8, y_train_pred)

The ideal perfect classifier would have true positives and true negatives only. In this case the confusion matrix would have zero values outside the main diagonal.

In [None]:
y_train_perfect_pred = y_train_8
confusion_matrix(
    y_train_8, 
    y_train_perfect_pred
)
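
To read the output of `confusion_matrix()` it helps to know its layout: rows correspond to the actual classes and columns to the predicted classes, so for a binary problem the matrix is `[[TN, FP], [FN, TP]]`. The following sketch uses a handful of made-up labels and predictions (`toy_true` and `toy_pred` are illustrative, not taken from MNIST):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for 8 samples
toy_true = [False, False, False, False, True, True, True, False]
toy_pred = [False, False, True, False, True, False, True, False]

cm = confusion_matrix(toy_true, toy_pred)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=1, FN=1, TP=2
```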

## Performance Measures: Precision, Recall and Harmonic mean (F1 score)

* Precision or Positive Predictive Value (PPV): $$ PPV = \frac{TP}{TP + FP} $$
* Recall or True Positive Rate (TPR) or Sensitivity: $$ TPR = \frac{TP}{TP + FN} $$
* Harmonic Mean of Precision and Recall (F1 score): $$ F1 = \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}} = \frac{TP}{TP + \frac{FN+FP}{2}} $$
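
As a quick sanity check of these formulas, here is a worked example in plain Python. The counts `tp`, `fp` and `fn` are made up for illustration; the point is only to show that the two forms of the F1 score agree:

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 / 40 = 0.75
recall = tp / (tp + fn)     # 30 / 50 = 0.60

# Two equivalent forms of the F1 score
f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_counts = tp / (tp + (fn + fp) / 2)  # 30 / 45

print(precision, recall)
print(f1_harmonic, f1_counts)
```

Both expressions evaluate to 2/3 (up to floating-point rounding), which lies between the precision and the recall but closer to the lower of the two.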

### **Exercise:** 
Compute precision, recall and F1-score for the `sgd_cl` classifier trained above, using the predictions obtained with `cross_val_predict()` and the `sklearn.metrics` module.

In [None]:
# write your solution here


Now our classifier no longer looks as good as before. It has a 50% precision and a 68% recall.

Notice that F1 tends to favour models that have similar precision and recall. But in some contexts you might prefer a higher precision, while in others a higher recall, depending on the task.

There is, however, a trade-off between precision and recall.

### Precision/Recall trade-off

A classifier such as our `SGDClassifier` performs the classification task by computing a score with a "decision function". If the score is greater than a given threshold, the instance is assigned to the positive class; otherwise it is assigned to the negative class. For logistic regression, as you may remember from the theory, the default threshold of 0 on the score corresponds to an estimated probability of 0.5 for class "1". Raising the threshold reduces the number of FP, which usually increases the precision. However, it also increases the number of FN, which reduces the recall.
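
The effect of moving the threshold can be illustrated on a few made-up scores. The sketch below is an assumption-laden toy: `toy_scores` and `toy_labels` are invented, `sigmoid()` is the standard logistic function, and `precision_recall_at()` is a hypothetical helper written just for this example:

```python
import numpy as np

def sigmoid(z):
    """Map a raw decision score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 corresponds to an estimated probability of exactly 0.5
print(sigmoid(0.0))

# Hypothetical decision scores and true labels for 8 instances
toy_scores = np.array([-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0])
toy_labels = np.array([False, False, True, False, True, False, True, True])

def precision_recall_at(thr):
    """Precision and recall when predicting positive for scores > thr."""
    predicted = toy_scores > thr
    tp = np.sum(predicted & toy_labels)
    fp = np.sum(predicted & ~toy_labels)
    fn = np.sum(~predicted & toy_labels)
    return tp / (tp + fp), tp / (tp + fn)

for thr in (0.0, 2.0):
    p, r = precision_recall_at(thr)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
```

With these numbers, raising the threshold from 0 to 2 lifts the precision from 0.60 to 1.00 while the recall drops from 0.75 to 0.50.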

Let's try to manipulate the `SGDClassifier`'s threshold manually, using the classifier's `.decision_function()` method

In [None]:
y_scores = sgd_cl.decision_function(X_train[:5])
y_scores

In [None]:
threshold = 0
y_pred_on_scores = y_scores > threshold
y_pred_on_scores

In [None]:
threshold = 2000
y_pred_on_scores = y_scores > threshold
y_pred_on_scores

Raising the threshold increases the number of FN, decreasing the recall. How can we then determine the right threshold value for our task? First, let's use `cross_val_predict()` with `method='decision_function'` on the entire training set, then pass the resulting scores to `precision_recall_curve()` to compute precision and recall for all the possible threshold values.

In [None]:
from sklearn.metrics import precision_recall_curve
y_scores = cross_val_predict(
    sgd_cl,
    X_train,
    y_train_8,
    cv=3,
    method='decision_function'
)
precisions, recalls, thresholds = precision_recall_curve(
    y_train_8, y_scores
)
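
One detail of `precision_recall_curve()` matters when plotting: the precision and recall arrays have one more element than the thresholds array, because a final point (precision 1, recall 0) is appended with no corresponding threshold. A small sketch with made-up labels and scores (the `toy_` names are illustrative and chosen so as not to overwrite the variables above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores for 6 instances
toy_true = np.array([0, 0, 1, 0, 1, 1])
toy_score = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

toy_prec, toy_rec, toy_thr = precision_recall_curve(toy_true, toy_score)

# precision and recall have one more entry than thresholds
print(len(toy_prec), len(toy_rec), len(toy_thr))
# the extra last pair is (precision=1.0, recall=0.0)
print(toy_prec[-1], toy_rec[-1])
```

This is why the last element of the precision and recall arrays has to be dropped (`[:-1]`) before plotting them against the thresholds.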

In [None]:
import seaborn as sns

def plot_precision_and_recall_vs_threshold(precisions, recalls, thresholds):
    """Plot precision and recall as functions of the decision threshold."""
    fig, ax = plt.subplots(figsize=(12, 8))
    # precisions and recalls have one more element than thresholds: drop the last one
    sns.lineplot(x=thresholds, y=precisions[:-1], ax=ax, label='Precision')
    sns.lineplot(x=thresholds, y=recalls[:-1], ax=ax, label='Recall')
    ax.legend(loc="center right", fontsize=16)
    ax.set_xlabel("Threshold", fontsize=16)
    ax.grid(True)

plot_precision_and_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

In [None]:
# Let's find the threshold for which we can achieve a 90% precision
threshold_90_prec = thresholds[np.argmax(precisions >= 0.90)]
threshold_90_prec

In [None]:
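# imported here so that the cell runs even if the exercise above was skipped
from sklearn.metrics import precision_score, recall_score

y_train_pred_90 = (y_scores >= threshold_90_prec)
precision_score(y_train_8, y_train_pred_90), recall_score(y_train_8, y_train_pred_90)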

Now we have reached a 90% precision, at the expense of recall, which is now 18%!
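
The threshold lookup above relies on a NumPy idiom: applied to a boolean array, `np.argmax()` returns the index of the first `True` value. A minimal sketch with made-up numbers (`toy_precisions` and `toy_thresholds` are illustrative only):

```python
import numpy as np

# Hypothetical precision values, one per candidate threshold
toy_precisions = np.array([0.50, 0.70, 0.85, 0.92, 0.97])
toy_thresholds = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# The comparison yields a boolean array; argmax returns the first True
mask = toy_precisions >= 0.90
first_idx = np.argmax(mask)

print(mask)
print(first_idx, toy_thresholds[first_idx])  # 3 1.0

# Caveat: if no value satisfies the condition, argmax silently returns 0
print(np.argmax(np.array([False, False])))   # 0
```

Because of the caveat in the last line, it is worth checking that at least one precision value reaches the target before trusting the returned index.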

## Performance Measures: The ROC curve

Another tool for evaluating a classifier's performance is the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR, i.e. recall) against the false positive rate (FPR).

Specificity or TNR: $$ TNR = \frac{TN}{TN + FP} $$

False positive rate: $$ FPR = 1 - TNR $$

In other words, the ROC curve plots sensitivity against (1 - specificity).
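
Each point of a ROC curve is a (FPR, TPR) pair obtained at one threshold. The sketch below computes a few such points by hand from the definitions above; the labels, scores and the helper `tpr_fpr_at()` are hypothetical and exist only for this illustration:

```python
import numpy as np

# Hypothetical labels and scores for 6 instances
toy_labels = np.array([0, 0, 1, 1, 0, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

def tpr_fpr_at(thr):
    """Return (TPR, FPR) when predicting positive for scores >= thr."""
    predicted = toy_scores >= thr
    positives = toy_labels == 1
    tp = np.sum(predicted & positives)
    fn = np.sum(~predicted & positives)
    fp = np.sum(predicted & ~positives)
    tn = np.sum(~predicted & ~positives)
    toy_tpr = tp / (tp + fn)   # recall / sensitivity
    toy_tnr = tn / (tn + fp)   # specificity
    return toy_tpr, 1 - toy_tnr  # FPR = 1 - TNR

for thr in (0.3, 0.5, 0.85):
    toy_tpr, toy_fpr = tpr_fpr_at(thr)
    print(f"threshold={thr}: TPR={toy_tpr:.2f}, FPR={toy_fpr:.2f}")
```

Lowering the threshold moves the point towards the top right of the plot (more true positives, but also more false positives).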

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_8, y_scores)

In [None]:
plt.figure(figsize=(14, 8))
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') 
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)
    plt.grid(True)
plot_roc_curve(fpr, tpr)
plt.show()


You can measure the _area under the curve_ (AUC) if you want to compare the performance of different classifiers. 

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_8, y_scores)
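
One way to interpret the ROC AUC is as the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The toy example below, with made-up labels and scores, shows the two extremes and an intermediate case:

```python
from sklearn.metrics import roc_auc_score

toy_labels = [0, 0, 1, 1]

# Scores that rank every positive above every negative
perfect_scores = [0.1, 0.2, 0.8, 0.9]
# Scores where one negative (0.4) outranks one positive (0.35)
imperfect_scores = [0.1, 0.4, 0.35, 0.8]
# Scores that rank every negative above every positive
inverted_scores = [0.9, 0.8, 0.2, 0.1]

auc_perfect = roc_auc_score(toy_labels, perfect_scores)
auc_imperfect = roc_auc_score(toy_labels, imperfect_scores)
auc_inverted = roc_auc_score(toy_labels, inverted_scores)

print(auc_perfect)    # 1.0
print(auc_imperfect)  # 0.75
print(auc_inverted)   # 0.0
```

In the intermediate case three of the four positive/negative pairs are ranked correctly, hence 0.75. A purely random scorer is expected to land around 0.5, which corresponds to the dashed diagonal in the ROC plot.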

As a general rule, prefer the Precision/Recall curve if the positive class is uncommon or if you care more about false positives than false negatives. In the other scenarios, prefer the ROC curve.

#### ROC AUC for KNNs and Random Forests

Let's try two different classifiers: a K-Nearest Neighbours classifier and a Random Forest classifier.

The K-Nearest Neighbours algorithm looks up the K closest (i.e. most similar) instances in the training set and assigns to the new instance the most represented class in that neighbourhood.
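
The idea can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes Euclidean distance and an unweighted majority vote, and the function `knn_predict()` and the toy data are hypothetical:

```python
from collections import Counter

import numpy as np

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbours."""
    # Euclidean distance from x_new to every reference point
    distances = np.linalg.norm(X_ref - x_new, axis=1)
    # Indices of the k closest reference points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    return Counter(y_ref[nearest].tolist()).most_common(1)[0][0]

# Hypothetical 2-D points forming two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))  # 0
print(knn_predict(X_toy, y_toy, np.array([4.5, 5.0])))  # 1
```

Note that there is no real training step: the "model" is the stored reference data, and all the work happens at prediction time.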

The Random Forest algorithm is an ensemble method which trains a number of decision tree classifiers on various sub-samples of the training set and uses averaging techniques to improve the predictive accuracy and control over-fitting.
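
To make the "averaging" concrete: according to the scikit-learn documentation, the class probabilities predicted by a `RandomForestClassifier` are the mean of the class probabilities predicted by its individual trees. The sketch below checks this on a small synthetic dataset (generated with `make_classification`, so unrelated to MNIST; all `toy_` names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic binary problem (illustrative only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_forest = RandomForestClassifier(n_estimators=10, random_state=0)
toy_forest.fit(X_toy, y_toy)

# Average the class probabilities predicted by each individual tree
tree_probs = np.mean(
    [tree.predict_proba(X_toy[:5]) for tree in toy_forest.estimators_],
    axis=0,
)
forest_probs = toy_forest.predict_proba(X_toy[:5])

print(tree_probs)
print(forest_probs)
print(np.allclose(tree_probs, forest_probs))
```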

We will see more on Decision Trees and Ensemble methods next week.

**NOTE:** K-Nearest Neighbour and Random Forest classifiers do not have a `decision_function()` method that returns the predicted scores for each instance in cross-validation. They do have, however, a `predict_proba()` method that returns an array containing a row per instance and a column per class. This array contains the predicted probability that each instance belongs to each class. It can be used to draw ROC curves in lieu of the output of `decision_function()`. Scikit-learn classifiers usually implement at least one of the two methods, so check their API to find out which one to use.
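
Besides reading the API documentation, the available method can also be checked programmatically with `hasattr()`. The helper `scoring_method()` below is a hypothetical convenience function written for this notebook, not part of scikit-learn:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

candidates = {
    "SGDClassifier": SGDClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
}

def scoring_method(classifier):
    """Pick the method name to pass to cross_val_predict(method=...)."""
    if hasattr(classifier, "decision_function"):
        return "decision_function"
    return "predict_proba"

for name, classifier in candidates.items():
    print(name, "->", scoring_method(classifier))
```

This sketch simply prefers `decision_function()` when it exists and falls back to `predict_proba()` otherwise; it assumes the classifier offers at least one of the two.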

In [None]:
from sklearn.neighbors import KNeighborsClassifier
kn_cl = KNeighborsClassifier(n_neighbors=9)
y_probs_kn = cross_val_predict(
    kn_cl,
    X_train,
    y_train_8,
    cv=3, 
    method='predict_proba'
)

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest_cl = RandomForestClassifier(
    n_estimators=100, # a "forest" of 100 decision trees
    random_state=77
)
y_probs_forest = cross_val_predict(
    forest_cl,
    X_train,
    y_train_8,
    cv=3,
    method="predict_proba"
)

If you check `y_probs_forest` you will see that it has a column with the probabilities for the `0` class and a column for the probabilities for the `1` class, one row per sample. Each row adds up to 1.

In [None]:
y_probs_forest

In [None]:
# Nearest neighbours scores
y_scores_kn = y_probs_kn[:, 1]    # score = proba of positive class
fpr_kn, tpr_kn, thresholds_kn = roc_curve(y_train_8, y_scores_kn)

### **Exercise:** 
Let's compute the scores, TPR and FPR for the Random Forest classifier.

In [None]:
# Write your solution here
y_scores_forest = ...
...

### **Exercise:** 
Use the scores computed above for `sgd_cl`, `kn_cl` and `forest_cl` to plot the three ROC curves on the same plot. If you wish, you can use the `plot_roc_curve()` function defined above. Afterwards, compute the area under the curve for `kn_cl` and `forest_cl`. Which is the best and which is the worst classifier?

In [None]:
# Write your solution here:
plt.figure(figsize=(14, 8))
...
...
...
plt.legend(loc="lower right")
plt.show()

In [None]:
roc_auc_score(y_train_8, y_scores_kn)

In [None]:
roc_auc_score(y_train_8, y_scores_forest)

### **Exercise (at home):** 
Train and evaluate a [Logistic Regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) and a [Linear SVM classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) in 5-fold cross-validation using the even vs odd labels we have defined above (`y_train_even`, `y_test_even`). Try to choose a sound combination of hyperparameters. What metric could you use for evaluation? Would accuracy be a good choice?

Plot the ROC curves for the two classifiers and compute the AUC.

In [None]:
# Write your solution down here
# Add as many cells as you need


## Addendum: more flexible cross-validation

Sometimes you need more control over cross-validation than `cross_val_score()` offers out of the box. In the cell below we use the `StratifiedKFold` class to implement cross-validation by hand.

#### <span style="color:red">**NB: this cell will take quite a while to run. Be patient**</span>

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=77, shuffle=True)

for train_index, val_index in skfolds.split(X_train, y_train_8):
    # make a clone (copy) of our Stochastic Gradient Classifier
    clone_sgd_cl = clone(sgd_cl) 
    
    # get training and validation set for current CV iteration
    X_train_f = X_train[train_index]
    X_val_f = X_train[val_index]
    # use the binary "8 vs not 8" labels, consistent with the stratified split
    y_train_f = y_train_8[train_index]
    y_val_f = y_train_8[val_index]
    
    # train the SGD classifier
    clone_sgd_cl.fit(X_train_f, y_train_f)
    # make predictions on validation set
    y_pred = clone_sgd_cl.predict(X_val_f)
    # count number of correct predictions
    n_correct = sum(y_pred == y_val_f)
    # print out accuracy score
    print(n_correct / len(y_val_f))  
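
What `StratifiedKFold` adds over plain k-fold splitting is that each fold keeps roughly the same class proportions as the full label array, which matters for skewed problems like "8 vs not 8". A minimal sketch on made-up labels (10 positives out of 100; `X_toy` is just a placeholder feature matrix):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical skewed labels: 10 positives out of 100 samples
y_toy = np.array([True] * 10 + [False] * 90)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder features

toy_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=77)

# Count the positives that end up in each validation fold
positives_per_fold = [
    int(y_toy[val_idx].sum()) for _, val_idx in toy_folds.split(X_toy, y_toy)
]
print(positives_per_fold)
```

Every validation fold receives 2 of the 10 positives, so each fold has the same 10% positive rate as the whole set.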