# Tutorial 6 (Week 8) - Classification Analysis

## Learning Objectives

After completing this tutorial, you should be able to:

+ Understand model thresholding
+ Use sklearn to plot ROC curves in binary classification
+ Use sklearn to calculate AUROC
+ Use sklearn to plot ROC curves in multi-class classification

This tutorial is based on this [ROC and AUC tutorial](https://www.kaggle.com/code/jacoporepossi/tutorial-roc-auc-clearly-explained) and the Scikit-learn [ROC User Guide](https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc).

## Table of Contents

* [Dataset](#Dataset)
* [Confusion Matrix](#Confusion-Matrix)
* [Model Thresholds](#Model-Thresholds)
* [Receiver Operating Characteristic (ROC)](#ROC)
* [Area Under the Curve (AUC or AUROC)](#AUROC)
* [ROC Curve Application -- Comparability](#Comparability)
* [AUROC Properties](#AUROC-Properties)
* [AUROC in Multi-Class Classification](#Multi-Class)

## Dataset <a class="anchor" id="Dataset"></a>

Let us first create a toy dataset for experimenting. The Scikit-learn `datasets` module has a handy function [`make_classification()`](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) to generate a random n-class classification problem. 

We create a dataset with 50 samples (with the default number of features) and 2 classes, with a 40:60 proportion of samples assigned to the two classes. We set the class separation to 0.1; this factor determines how far apart the classes are (the larger the value, the further apart the classes and the easier the classification task).

In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification

x, y = make_classification( n_classes = 2,
                            class_sep = 0.1,
                            n_samples = 50,
                            weights = [0.4, 0.6],
                            random_state = 42 )

# x contains the samples
print( x.shape )

# y contains the labels
print( y.shape )
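
As a quick sanity check on the `weights` argument, the following sketch counts the samples per class. It regenerates the same dataset so that it runs on its own; the names `x_check`, `y_check` and `class_counts` are illustrative. Because `make_classification()` also flips a small fraction of labels at random by default (`flip_y=0.01`), the realised proportions may deviate slightly from 40:60.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as above, stored under different names so nothing is overwritten
x_check, y_check = make_classification(
    n_classes=2, class_sep=0.1, n_samples=50, weights=[0.4, 0.6], random_state=42
)

# Number of samples with label 0 and label 1, respectively
class_counts = np.bincount(y_check)
print(class_counts, class_counts / class_counts.sum())
```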

Let's visualize the generated data, taking 2 dimensions.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Associate label value 0 (resp. 1) with the color blue (resp. orange) and label text 'negative' (resp. 'positive')
for c, i, t in zip( ['blue', 'orange'], [0, 1], ['negative', 'positive'] ):
    plt.scatter( x[y==i, 0], x[y==i, 1], color=c, alpha=.5, label=t )

plt.legend()
plt.title( 'Data' )
plt.show()

Next, we make a random prediction.

In [None]:
np.random.seed(42)

y_pred = np.random.choice([0, 1], size=(50))
y_pred

## Confusion Matrix <a class="anchor" id="Confusion-Matrix"></a>

Let's build the confusion matrix for our random prediction. We can use `sklearn.metrics.confusion_matrix` to get the raw counts as we have seen in Tutorial 4.

In [None]:
# TODO
# c = ?

We can then calculate the True Positive Rate (TPR) and False Positive Rate (FPR) from those counts.

In [None]:
# TODO
# tn, fp, fn, tp = ?
# tpr = ?
# fpr = ?
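
For reference, the two rates are defined as TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below applies these definitions to a small hand-made example, counting the four outcomes directly with NumPy rather than with `confusion_matrix`. The arrays `y_true_toy` and `y_pred_toy` are made up for illustration, and label 1 is taken as the positive class.

```python
import numpy as np

y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Count each of the four outcomes
tp_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 1))  # true positives
fn_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 0))  # false negatives
fp_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 1))  # false positives
tn_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 0))  # true negatives

tpr_toy = tp_toy / (tp_toy + fn_toy)  # 3 / 4
fpr_toy = fp_toy / (fp_toy + tn_toy)  # 1 / 6
print(tpr_toy, fpr_toy)
```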

## Model Thresholds <a class="anchor" id="Model-Thresholds"></a>

The goal of classification is to predict a class label. However, many machine learning algorithms predict a probability or score of class membership, which must then be mapped to a specific class label. This mapping is achieved using a _threshold_ (e.g., 0.5): all predictions at or above the threshold are mapped to one class, and all other predictions are mapped to the other class.

In scikit-learn we can generally use two functions to perform prediction on new data: `predict` and `predict_proba`.

The `predict_proba` function returns a two-dimensional array (`n_samples` x `n_classes`) containing the estimated probability of each class for each sample. For example, a prediction for 4 samples with 2 possible classes (0 or 'negative', and 1 or 'positive') may look like this:

```
array( [[0.90, 0.10],
        [0.25, 0.75],
        [0.78, 0.22],
        [0.05, 0.95]])
```

This prediction says that the first sample has a 90% probability of belonging to the negative class (and a 10% probability of belonging to the positive class), the second sample has a 75% probability of belonging to the positive class (and a 25% probability of belonging to the negative class), and so on.

The `predict` function simply returns the class with the highest probability. For the above example, it will return:

```
array( [0, 1, 0, 1] )
```

Using `predict_proba`, we can adjust how our model chooses between one class and the other by varying the threshold. For instance, we can set threshold = 0.8 for the positive class, so that the model predicts the positive class only for samples that have a probability >= 80% of belonging to it.

For the above example, the model will now predict the second sample as the negative class instead of the positive class, since it has only a 75% (<80%) probability of belonging to the positive class.

Varying the thresholds will thus _affect TPR and FPR_.
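
A minimal sketch of custom thresholding, using the four example probabilities above. The comparison is applied by hand to an array shaped like the `predict_proba` output; the names `proba_example` and `threshold` are illustrative.

```python
import numpy as np

# Columns hold the estimated probabilities of class 0 and class 1
proba_example = np.array([[0.90, 0.10],
                          [0.25, 0.75],
                          [0.78, 0.22],
                          [0.05, 0.95]])

# Default behaviour: pick the class with the highest probability
default_labels = np.argmax(proba_example, axis=1)

# Custom rule: predict class 1 only if its probability is at least 0.8
threshold = 0.8
custom_labels = (proba_example[:, 1] >= threshold).astype(int)

print(default_labels)  # [0 1 0 1]
print(custom_labels)   # [0 0 0 1]
```

Only the second sample changes label, because its class 1 probability (0.75) falls below the stricter threshold.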

## Receiver Operating Characteristic (ROC) <a class="anchor" id="ROC"></a>

ROC curves are typically used in _binary classification_ to study the output of a classifier. An ROC curve is built by plotting FPR on the X axis and TPR on the Y axis using different threshold values.

Let's test it with a simple model.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit( x, y )

We call `predict_proba()` and take the second column (index 1) of the result, i.e., the predicted probabilities for class 1, which `roc_curve()` treats as the positive class by default.

In [None]:
y_score = model.predict_proba(x)[:, 1]
y_score

Now let's build the ROC curve, using the sklearn function [`roc_curve()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) from the `metrics` module. We provide it with the true labels and the predicted scores `y_score`, and it essentially:
- determines various threshold values, and
- calculates the FPR and TPR values for each threshold.

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve( y, y_score )

We can gather them in a DataFrame for ease of viewing.

In [None]:
roc_df = pd.DataFrame( zip(fpr, tpr, thresholds), columns = ["FPR", "TPR", "Threshold"])

print( roc_df.shape )
roc_df.head()

As explained in the API reference, `thresholds[0]` represents no instances being predicted and is arbitrarily set to `np.inf`.
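
To make the per-threshold calculation concrete, the sketch below recomputes each (FPR, TPR) pair by hand on a four-sample toy example and compares it with the output of `roc_curve()`. The toy arrays and the `_toy` names are illustrative. A sample is predicted positive when its score is at or above the threshold, and `drop_intermediate=False` is passed so that no points are omitted from the output.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true_toy = np.array([0, 0, 1, 1])
score_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, thr_toy = roc_curve(y_true_toy, score_toy, drop_intermediate=False)

n_pos = np.sum(y_true_toy == 1)
n_neg = np.sum(y_true_toy == 0)

manual_points = []
for t in thr_toy:
    pred = score_toy >= t                      # predicted positive at this threshold
    tp = np.sum(pred & (y_true_toy == 1))      # true positives
    fp = np.sum(pred & (y_true_toy == 0))      # false positives
    manual_points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)

for t, (f, r) in zip(thr_toy, manual_points):
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={r:.2f}")
```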

_Question to ponder: How do you think the `roc_curve()` function determines what threshold values to use?_

Let's now create the plot with FPR on the X axis and TPR on the Y axis.

In [None]:
fig, ax = plt.subplots()

ax.plot( fpr, tpr )

ax.set_xlabel( 'False positive rate' )
ax.set_ylabel( 'True positive rate' )

plt.show()

The top left corner of the plot is the "ideal" point: FPR of zero and TPR of one.

## Area Under the Curve (AUC or AUROC) <a class="anchor" id="AUROC"></a>

The area underneath the entire ROC curve is called AUROC (or AUC) and is always a value between 0 and 1.

We can see that the nearer the ROC curve is to the "ideal" point, the larger the area under the curve will be. Thus, we usually want to maximize AUROC, as this means achieving the highest possible TPR and the lowest possible FPR.

_Question to ponder: Is it always the case?_

We can compute the AUROC using the sklearn function [`roc_auc_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), giving it the prediction scores, similar to how we use `roc_curve()`. 

In [None]:
from sklearn.metrics import roc_auc_score

auroc = roc_auc_score( y, y_score )
auroc

We can redraw the ROC curve plot with the area filled.

In [None]:
fig, ax = plt.subplots()

ax.plot(fpr, tpr)

ax.fill_between( fpr, tpr, step="pre", alpha=0.4 )

ax.set_xlabel( 'False positive rate' )
ax.set_ylabel( 'True positive rate' )

# Project points to x-axis
plt.vlines( fpr, 0, tpr, linestyle="dashed" )

plt.show()

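The dashed lines illustrate how the area is accumulated: `roc_auc_score` applies the trapezoidal rule to consecutive ROC points, which reduces to summing the areas of the rectangles between the dashed lines when the curve consists only of horizontal and vertical steps.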
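
As a cross-check on the area calculation, the sketch below accumulates the area between consecutive ROC points by hand and compares the result with `roc_auc_score()` and with `sklearn.metrics.auc()`. It uses the trapezoidal rule, which gives the same result as a sum of rectangles when the curve has only horizontal and vertical segments. The toy data and the `_toy` names are illustrative.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true_toy = np.array([0, 0, 1, 1])
score_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, score_toy)

# Width of each strip times the average height of its two end points
widths = np.diff(fpr_toy)
heights = (tpr_toy[1:] + tpr_toy[:-1]) / 2
manual_auc = np.sum(widths * heights)

# All three values are 0.75 for this toy example
print(manual_auc, auc(fpr_toy, tpr_toy), roc_auc_score(y_true_toy, score_toy))
```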

## ROC Curve Application - Comparability <a class="anchor" id="Comparability"></a>

The ROC curve is valuable mainly for two reasons:

- It lets us select an optimal threshold for that model, and 
- It gives us a visual way to compare different classifiers.
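
One common heuristic for the first point (an assumption of this sketch, not something the ROC curve itself prescribes) is to pick the threshold that maximises Youden's J statistic, TPR - FPR. The sketch below applies it to made-up scores; in practice the right criterion depends on the relative cost of false positives and false negatives.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores, for illustration only
y_true_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])
score_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

fpr_toy, tpr_toy, thr_toy = roc_curve(y_true_toy, score_toy)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_stat = tpr_toy - fpr_toy
best_idx = np.argmax(j_stat)
best_threshold = thr_toy[best_idx]

print(best_threshold, fpr_toy[best_idx], tpr_toy[best_idx])  # threshold 0.6 for this toy data
```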

Let's illustrate the second point by using another classifier together with the previous one.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( max_depth=1 )
rf.fit( x, y )

Following the same steps as the first classifier, perform the prediction, build the ROC curve, and compute the AUROC.

In [None]:
# TODO
# y_score_rf = ?

In [None]:
# TODO
# fpr_rf, tpr_rf, thresholds_rf = ?

In [None]:
# TODO
# auroc_rf = ?

Now we can put them together, and also add the ROC curve of a hypothetical _perfect classifier_, i.e., one that reaches TPR = 1 while FPR is still 0, so that its curve runs up the left edge and along the top of the plot.

In [None]:
fig, ax = plt.subplots()

# Logistic Regression
ax.plot( fpr, tpr )

# Random Forest
ax.plot( fpr_rf, tpr_rf )

# Perfect Classifier
ax.plot([0, 0, 1], [0, 1, 1])

ax.set_xlabel( 'False positive rate' )
ax.set_ylabel( 'True positive rate' )
ax.legend(
    [
        'LogisticRegression - AUROC {:.3f}'.format(auroc),
        'RandomForest - AUROC {:.3f}'.format(auroc_rf),
        'Perfect classifier'
    ]
)

plt.show()

We can see that, compared to the LogisticRegression classifier, the RandomForest classifier has:
- higher ROC curve (closer to the perfect classifier), and
- larger AUROC value.

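Therefore, we can say that the RandomForest classifier ranks positive samples above negative samples more consistently than the LogisticRegression classifier on this dataset. Note that both models are scored on the same data they were trained on, so this comparison measures how well each model fits the data rather than how well it generalizes.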

## AUROC Properties <a class="anchor" id="AUROC-Properties"></a>

AUROC has the following properties that are desirable for measuring classification performance:

- Scale-invariant: It measures how well predictions are ranked, rather than their absolute values.
- Threshold-invariant: It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
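
The scale-invariance property can be checked directly: applying a strictly increasing transformation to the scores preserves their ranking, so the AUROC does not change. The sketch below uses made-up labels and scores (the `_toy` names are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true_toy = np.array([0, 0, 1, 0, 1, 1])
score_toy = np.array([0.10, 0.30, 0.35, 0.60, 0.70, 0.90])

auc_original = roc_auc_score(y_true_toy, score_toy)
auc_rescaled = roc_auc_score(y_true_toy, 100 * score_toy - 5)  # linear rescaling
auc_logged = roc_auc_score(y_true_toy, np.log(score_toy))      # non-linear but increasing

# The three values are identical because the ordering of the scores is unchanged
print(auc_original, auc_rescaled, auc_logged)
```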


## AUROC in Multi-Class Classification <a class="anchor" id="Multi-Class"></a>

We have so far talked about binary classification. The `roc_auc_score` function can also be used in _multi-class classification_. Scikit-learn currently supports two averaging strategies:

- The __one-vs-one (OvO)__ algorithm computes the average of the pairwise AUROC scores.

- The __one-vs-rest (OvR)__ algorithm computes the average of the AUROC scores for each class against all other classes.

In both cases, the true labels are provided in an array with values from 0 to `n_classes - 1`, and the scores correspond to the probability estimates that a sample belongs to a particular class. Both algorithms support weighting uniformly (`average='macro'`) and by prevalence (`average='weighted'`).

Let's illustrate using the iris dataset, which we have seen in a past tutorial.

In [None]:
from sklearn import svm, datasets
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

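# Keep y as a 1-D array of class labels (0, 1, 2). If y were binarized into an
# indicator matrix here, roc_auc_score would treat the task as multilabel and
# ignore its multi_class argument.
n_classes = len( np.unique(y) )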

Let's shuffle and split it into training and test datasets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.5, random_state=0 )

For the purpose of this illustration, we use [`OneVsRestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) with [`SVC`](https://scikit-learn.org/1.0/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) as the estimator to learn from the data.

In [None]:
random_state = np.random.RandomState(0)

classifier = OneVsRestClassifier(
    svm.SVC( kernel="linear", probability=True, random_state=random_state )
)
y_score = classifier.fit( X_train, y_train ).decision_function( X_test )

We can now obtain the predicted class probabilities for the test set.

In [None]:
y_prob = classifier.predict_proba( X_test )

Let's calculate the AUROC values for each of the algorithms, OvO and OvR, each with macro-averaging and weighted-averaging. 

In [None]:
macro_roc_auc_ovo = roc_auc_score( y_test, y_prob, multi_class="ovo", average="macro" )
weighted_roc_auc_ovo = roc_auc_score( y_test, y_prob, multi_class="ovo", average="weighted" )
print(
    "One-vs-One ROC AUC scores:\n{:.6f} (macro),\n{:.6f} "
    "(weighted by prevalence)".format(macro_roc_auc_ovo, weighted_roc_auc_ovo)
)

In [None]:
macro_roc_auc_ovr = roc_auc_score( y_test, y_prob, multi_class="ovr", average="macro" )
weighted_roc_auc_ovr = roc_auc_score( y_test, y_prob, multi_class="ovr", average="weighted" )
print(
    "One-vs-Rest ROC AUC scores:\n{:.6f} (macro),\n{:.6f} "
    "(weighted by prevalence)".format(macro_roc_auc_ovr, weighted_roc_auc_ovr)
)
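
To see what the OvR macro average computes, the sketch below recomputes it by hand on the iris data: each class in turn is treated as the positive class against the rest, and the resulting binary AUROC values are averaged. `LogisticRegression` is used here only as a convenient stand-in classifier (an assumption of this sketch, not the model used above), and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0, stratify=y_demo
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_demo = clf_demo.predict_proba(X_te)  # shape (n_samples, 3), rows sum to 1

# One binary AUROC per class: "class k" vs. "any other class"
per_class_auc = [roc_auc_score(y_te == k, proba_demo[:, k]) for k in clf_demo.classes_]
manual_ovr_macro = np.mean(per_class_auc)

sklearn_ovr_macro = roc_auc_score(y_te, proba_demo, multi_class="ovr", average="macro")
print(manual_ovr_macro, sklearn_ovr_macro)
```

The two printed values should agree, since the macro average weights each class equally.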