# ROC curves, AUC (area under the curve) and MCC (Matthews correlation coefficient)

In this notebook, we illustrate how to:

1. draw the ROC curve
2. calculate the AUC (area under the curve)
3. calculate the MCC (Matthews correlation coefficient)

## Loading libraries

First of all, we load some necessary general libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Get the data

We are using results from a mock binary classification problem.
This was based on the [breast cancer Wisconsin dataset](https://github.com/scikit-learn/scikit-learn/blob/6e9039160f0dfc3153643143af4cfdca941d2045/sklearn/datasets/data/breast_cancer.csv) from the Python library `sklearn`.

In this dataset, the objective is to diagnose the status of breast cancer:

- `0`: malignant cancer
- `1`: benign cancer

The dataset contains **569 examples**:

- 212 malignant
- 357 benign

The prediction (classification) is based on 30 numeric features describing the cancer lesions (size, shape, etc.); the full description can be found [here](https://scikit-learn.org/1.5/datasets/toy_dataset.html#breast-cancer-dataset).

In this example, however, we used a random subset of 8 of the 30 features: this made the problem harder and produced more classification errors, which helps to illustrate the different metrics used to measure model performance.

The dataset is imbalanced: the ratio between the two classes (malignant / benign) is not 1, but about 0.6 (212 / 357).
Again, this helps to show the relative advantage of using different performance metrics rather than just looking at the error rate / overall accuracy.
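
To make the point about overall accuracy concrete, the short sketch below uses only the class counts reported above. The "always predict the majority class" classifier is a hypothetical baseline introduced here for illustration; it is not one of the models in this notebook.

```python
# Class counts of the full dataset, as reported above
n_malignant = 212
n_benign = 357
n_total = n_malignant + n_benign

# Ratio between the minority and the majority class
class_ratio = n_malignant / n_benign

# Accuracy of a hypothetical classifier that always predicts the majority class (benign)
baseline_accuracy = n_benign / n_total

print("class ratio:", round(class_ratio, 3))
print("majority-class baseline accuracy:", round(baseline_accuracy, 3))
```

Such a baseline never identifies a malignant case, yet its overall accuracy is well above 50%: this is why accuracy alone can be misleading with imbalanced classes.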

An 80% / 20% training / test data split was used to train the classification model and measure performance: the **test results** are used here to show ROC curves, AUC and MCC.

We trained two classification models: the second (`mod2`) was designed to overfit and to produce results biased towards the majority class (useful for illustrating performance metrics).

In [None]:
DATASET_URL = 'https://raw.githubusercontent.com/ne1s0n/bioinformateachers/refs/heads/main/dlb/data/predictions.csv'

The dataframe contains:

- the **original test observation** (the "truth": malignant or benign)
- the **predicted class** (binary) for the base and alternative (mod2) models
- the **two probabilities**: of being '0' (malignant) or '1' (benign), for the base and alternative models (mod2)

In [None]:
bc_data = pd.read_csv(DATASET_URL)
bc_data.head()

The test dataset was generated by taking a 20% random subset of the data: **114 test examples**.

In [None]:
len(bc_data)

In [None]:
## 1: benign
## 0: malignant
bc_data[['y_test']].value_counts()

---

Let's take a look at the **confusion matrix**: first we get the two vectors of predictions and observations, and then construct the matrix of correct predictions (diagonal) and errors (off diagonal).

(Remember: in the `sklearn` confusion matrix, true labels are on the rows, predicted labels are on the columns)

In [None]:
y_test = np.array(bc_data['y_test'])
y_pred = np.array(bc_data['y_pred'])

In [None]:
# import the metrics module
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
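
As a reminder of the layout (true labels on the rows, predicted labels on the columns), here is a minimal dependency-free sketch that builds a 2x2 confusion matrix by hand. The labels are made-up toy values, not taken from the dataset, and all `toy_*` names are hypothetical.

```python
# Made-up true and predicted labels (0 = negative, 1 = positive)
toy_truth = [0, 0, 1, 1, 1, 0, 1, 1]
toy_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# Rows are indexed by the true label, columns by the predicted label
toy_matrix = [[0, 0], [0, 0]]
for t, p in zip(toy_truth, toy_pred):
    toy_matrix[t][p] += 1

# Reading the matrix row by row gives TN, FP, FN, TP
(toy_tn, toy_fp), (toy_fn, toy_tp) = toy_matrix

print(toy_matrix)
print("TN:", toy_tn, "FP:", toy_fp, "FN:", toy_fn, "TP:", toy_tp)
```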

---

The next step is to bring the **prediction probabilities** into play, not just the predicted classes.

Keep in mind that the predicted classes were obtained using the **0.5 threshold**: each test example is assigned to the class whose probability is larger than 50%.

In [None]:
y_probs = np.array(bc_data[bc_data.columns[3:5]])
print(y_probs[0:5,:])
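
The link between probabilities and predicted classes can be sketched as follows. The probabilities are made-up toy values, and `classify` is a hypothetical helper (not part of `sklearn`) that applies the "larger than the threshold" rule described above.

```python
# Made-up P(y=1) values for five test examples (not taken from the dataset)
toy_prob_1 = [0.91, 0.48, 0.50, 0.07, 0.66]

def classify(probs, threshold=0.5):
    """Assign class 1 when P(y=1) is larger than the threshold, class 0 otherwise."""
    return [1 if p > threshold else 0 for p in probs]

toy_classes = classify(toy_prob_1)
print(toy_classes)

# Lowering the threshold turns more examples into predicted positives
print(classify(toy_prob_1, threshold=0.05))
```

Changing the threshold changes the predicted classes, and therefore the confusion matrix, without retraining the model.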

---

### ROC curves

This is a binary classification problem, therefore the model usually focuses on the probability of just one class (the other being unambiguously obtained as its complement to 1).
Most commonly, the probability of class `1` ("case") is modeled:

$$
P(y=1 | X)
$$

Therefore, this probability is used in the calculation of the **ROC curve**.

In [None]:
from sklearn.metrics import roc_curve, auc, roc_auc_score

In [None]:
prob_y_eq_1 = y_probs[:,1]

The **ROC curve** is based on looking at classification results from the perspective of all (many) classification thresholds, not just the standard 50%.

This means that the probabilities ($P(y=1)$) are evaluated against a range of thresholds (0%, 5%, 10%, $\ldots$, 90%, 95%, 100%): for each threshold, the **false positive rate** (FPR = FP/(FP+TN)) and the **true positive rate** (TPR = TP/(TP+FN) = 1-FNR) are calculated and then plotted against each other.
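
The threshold sweep described above can be sketched by hand. This is a minimal illustration on made-up toy values, with a hypothetical helper `rates_at`; it assumes the convention that an example is predicted positive when its probability is greater than or equal to the threshold.

```python
# Made-up true labels and P(y=1) values (not taken from the dataset)
toy_truth = [1, 1, 0, 1, 0, 1, 0, 0]
toy_prob_1 = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.20, 0.05]

def rates_at(threshold, truth, probs):
    """Return (FPR, TPR) when predicting positive for probabilities >= threshold."""
    tp = sum(1 for t, p in zip(truth, probs) if t == 1 and p >= threshold)
    fn = sum(1 for t, p in zip(truth, probs) if t == 1 and p < threshold)
    fp = sum(1 for t, p in zip(truth, probs) if t == 0 and p >= threshold)
    tn = sum(1 for t, p in zip(truth, probs) if t == 0 and p < threshold)
    return fp / (fp + tn), tp / (tp + fn)

# One (FPR, TPR) point per threshold: these points trace the ROC curve
toy_roc_points = []
for thr in [0.0, 0.25, 0.5, 0.75, 1.0]:
    point = rates_at(thr, toy_truth, toy_prob_1)
    toy_roc_points.append(point)
    print("threshold", thr, "-> FPR", point[0], "TPR", point[1])
```

At threshold 0 every example is predicted positive (FPR = TPR = 1); at threshold 1 none is (FPR = TPR = 0). The ROC curve connects the points in between.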

The function `roc_curve` from `sklearn` takes as input the correct test labels (the "truth") and the prediction probabilities obtained from the classification model.

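This function returns the ingredients needed to draw the ROC curve: the FPR and TPR calculated at several classification thresholds. The thresholds are not a fixed grid: they are taken from the distinct predicted probabilities (by default, thresholds that would not change the shape of the plotted curve are dropped), which here gives 20 thresholds: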

In [None]:
fpr, tpr, thrs = roc_curve(y_test, prob_y_eq_1)

We obtain **20 values** for **FPR** and for **TPR**:

In [None]:
len(fpr)

In [None]:
df = pd.DataFrame(np.vstack((fpr,tpr)).T, columns=['FPR','TPR'])
df.head(10)

We now have all the elements to plot the ROC curve for this classification problem.

Typically, a ROC curve is visually contrasted against chance accuracy, which for a binary classification problem is 50%, and it is represented by a straight line bisecting the plot:

In [None]:
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    #label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
#plt.legend(loc="lower right")
plt.show()

Ideally, the ROC curve of the chosen classification model should be as close as possible to the **top left corner** of the above plot; conversely, the closer the ROC curve gets to the dashed line (chance accuracy), the worse the performance of the model.

### AUC

The area under the (ROC) curve (the **AUC**) provides a summary numeric score for the model performance, in terms of **True** and **False Positive Rates**, over **multiple classification thresholds**.

This is a very effective way of summarising the information visually conveyed by the plot of the ROC curve:

- if the ROC curve follows exactly the top left corner of the plot, we have perfect classification accuracy, and the AUC is 100% (**AUC = 1**: the entire plotting area is under the ROC curve)
- if the ROC curve coincides with the diagonal bisecting the plot, we have pure chance accuracy, and the AUC is 50% (**AUC = 0.5**); our classification model is no better than tossing a coin in making predictions!

Intuitively, the higher the AUC the better the binary classification model.
However, what is to be considered a good AUC score obviously depends a lot on the specific classification problem at hand.
Generally speaking, **AUC > 0.8** usually indicates a good model performance.

In [None]:
roc_auc = auc(fpr, tpr)

In [None]:
print("AUC is", round(roc_auc,3))
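
The `auc` function integrates the curve with the trapezoidal rule. The sketch below reproduces that calculation by hand on a few made-up ROC points; `trapezoid_area` is a hypothetical helper written for illustration, and the `toy_*` values are not taken from the dataset.

```python
# Made-up ROC points, sorted by increasing FPR
toy_fpr = [0.0, 0.0, 0.25, 0.5, 1.0]
toy_tpr = [0.0, 0.5, 0.75, 1.0, 1.0]

def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        width = x[i] - x[i - 1]
        mean_height = (y[i] + y[i - 1]) / 2
        area += width * mean_height
    return area

toy_auc = trapezoid_area(toy_fpr, toy_tpr)
print("toy AUC:", toy_auc)

# Two reference cases: the chance diagonal and a perfect classifier
print("chance:", trapezoid_area([0.0, 1.0], [0.0, 1.0]))
print("perfect:", trapezoid_area([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))
```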

We can easily add the AUC value to the ROC curve plot:

In [None]:
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.3f)" % roc_auc
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()

As we can see below, the overall accuracy of classification (1 - error rate) is $0.904$, but with a sizeable difference between the two classes: the accuracy among positive cases (**TPR**) is $0.945$, and it is higher than the accuracy in the negative cases (**TNR**: $0.829$).

This is most likely linked to the class imbalance: in this dataset, positive cases (benign breast cancer) are roughly 1.7 to 1.8 times as many as negative cases (malignant breast cancer): 357 vs 212 in the full dataset, 73 vs 41 in the test set.

The AUC combines all this information into a single metric, by looking at both TPR and FPR (1 - TNR) over all possible classification thresholds: the AUC depends both on TPR / FPR and on the probabilities of classification.

In [None]:
accuracy = (y_test == y_pred).sum()/len(y_test)
print(round(accuracy, 3))

In [None]:
n_positives = len(bc_data[bc_data["y_test"]==1])
n_negatives = len(bc_data[bc_data["y_test"]==0])

print("N. of positive test examples:", n_positives)
print("N. of negative test examples:", n_negatives)

In [None]:
true_positive_preds = len(bc_data[(bc_data.y_test==1) & (bc_data.y_pred == 1)])
true_negative_preds = len(bc_data[(bc_data.y_test==0) & (bc_data.y_pred == 0)])

tpr_val = true_positive_preds/n_positives
tnr_val = true_negative_preds/n_negatives

print("TPR is:", round(tpr_val,3))
print("TNR is:", round(tnr_val,3))

#### Alternative model (with overfitting)

We classified the same data (breast cancer) using a second, alternative model, which we overfitted on purpose.

In [None]:
y_pred2 = np.array(bc_data['y_pred_mod2'])

### confusion matrix from model 2
cnf_matrix2 = metrics.confusion_matrix(y_test, y_pred2)
cnf_matrix2

This second model made the same number of errors on the test data (n_error = 11), and had therefore the exact same overall accuracy: $0.904$ (see below).

However, the distribution of errors is now different:

- TPR = 72/73 = $0.986$
- TNR = 31/41 = $0.756$

We see that the model is now more skewed towards the majority class, thus making many more errors in the minority class.
The overall accuracy is not able to capture this difference between the two classification models, therefore we can resort to the analysis of ROC curves.
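
The comparison can be reproduced from the raw counts alone. In this sketch the counts for model 2 come from the fractions above (72/73 and 31/41); the counts for model 1 are inferred from its reported TPR (0.945) and TNR (0.829) on the same 73 positive and 41 negative test examples, which is an assumption of this example.

```python
# Counts on the 114 test examples
# model_1 counts are inferred from TPR = 0.945 and TNR = 0.829 (assumption)
counts = {
    "model_1": {"tp": 69, "fn": 4, "tn": 34, "fp": 7},
    "model_2": {"tp": 72, "fn": 1, "tn": 31, "fp": 10},
}

summary = {}
for name, c in counts.items():
    total = sum(c.values())
    summary[name] = {
        "accuracy": round((c["tp"] + c["tn"]) / total, 3),
        "TPR": round(c["tp"] / (c["tp"] + c["fn"]), 3),
        "TNR": round(c["tn"] / (c["tn"] + c["fp"]), 3),
    }
    print(name, summary[name])
```

Both models make 11 errors and share the same overall accuracy, but the errors fall on different classes.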

We see (below) that the ROC curve for model 2 (green dotted line) lies, more often than not, closer to chance accuracy than that of the base model (solid orange line): there is therefore [second-order stochastic dominance](https://en.wikipedia.org/wiki/Stochastic_dominance) between the two models.

This is confirmed by the AUC: the AUC for model 2 is 0.949, lower than the 0.956 of the first model (not by much, but still lower, hence a worse TPR / TNR trade-off over all classification thresholds).

In [None]:
### overall accuracy from model 2
accuracy = (y_test == y_pred2).sum()/len(y_test)
print(round(accuracy, 3))

In [None]:
### TPR and TNR from model 2
true_positive_preds = len(bc_data[(bc_data.y_test==1) & (bc_data.y_pred_mod2 == 1)])
true_negative_preds = len(bc_data[(bc_data.y_test==0) & (bc_data.y_pred_mod2 == 0)])

tpr_mod2 = true_positive_preds/n_positives
tnr_mod2 = true_negative_preds/n_negatives

print("TPR is:", round(tpr_mod2,3))
print("TNR is:", round(tnr_mod2,3))

##### ROC curve and AUC from the alternative classification model

In [None]:
y_probs = np.array(bc_data[bc_data.columns[5:7]])
prob_y_eq_1_mod2 = y_probs[:,1]
fpr2, tpr2, thrs = roc_curve(y_test, prob_y_eq_1_mod2)

In [None]:
roc_auc2 = auc(fpr2, tpr2)
print("AUC for the alternative model is", round(roc_auc2,3))

In [None]:
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.3f)" % roc_auc,
)
plt.plot(
    fpr2,
    tpr2,
    color="darkgreen",
    lw=lw,
    linestyle=":",
    label="ROC curve (area = %0.3f)" % roc_auc2,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()

### MCC

The ROC AUC is an excellent metric for binary classification problems: it summarizes the accuracy in both classes and makes a trade-off between TPR and TNR (there are also extensions to multiclass classification: [volume under the ROC surface](https://link.springer.com/chapter/10.1007/978-3-540-39857-8_12)). Under some extreme circumstances, though, the AUC can be a misleading measure of model performance.

The AUC only takes into account two of the four basic ratios from the confusion matrix: TPR and FPR = 1-TNR (the accuracy measured from the perspective of the true labels). However, when data are strongly imbalanced this can be suboptimal:
- e.g. positive examples >> negative examples, and the number of false negatives (positive examples wrongly classified as negative) increases
- $\rightarrow$ small change in FNR (and in TPR = 1 - FNR): the denominator is the sum of the many true positives and the few false negatives
- $\rightarrow$ larger change in NPV: the calculation is restricted to the few negative predictions

For example, if we have 1020 'cases' (positive examples) and make twelve errors (FN = 12), the TPR would be 1008/1020 = 98.8% (calculations within positive cases only).
If FN increase by 50% (new FN = 18), the TPR would decrease to 98.2% (almost imperceptible change).
The FPR would remain unchanged (only the number of FN has increased), thereby producing almost no change in ROC-AUC.

However, with the same results and, say, 20 true negatives (remember,
this is the highly unbalanced minority class), the NPV would go from
62.5% to 52.6%: this sharp decrease would hardly go unnoticed!
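
The arithmetic of this example can be checked with a few lines of Python. The numbers are those used in the text (1020 positive examples, 20 true negatives, FN going from 12 to 18); `tpr_and_npv` is a hypothetical helper written for this illustration.

```python
n_pos = 1020    # positive examples ("cases")
tn_fixed = 20   # true negatives (unchanged between the two scenarios)

def tpr_and_npv(fn, n_pos, tn):
    """TPR is computed within the true positives, NPV within the negative predictions."""
    tp = n_pos - fn
    tpr = tp / (tp + fn)
    npv = tn / (tn + fn)
    return tpr, npv

tpr_before, npv_before = tpr_and_npv(12, n_pos, tn_fixed)
tpr_after, npv_after = tpr_and_npv(18, n_pos, tn_fixed)

print("TPR:", round(tpr_before, 3), "->", round(tpr_after, 3))
print("NPV:", round(npv_before, 3), "->", round(npv_after, 3))
```

The same six additional false negatives move the TPR by well under one percentage point, but the NPV by about ten.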

Instead of considering only two error metrics (the accuracy
from the perspective of the true labels), a better option would be to
consider all four: TPR, TNR, PPV and NPV. This is what the **Matthews
Correlation Coefficient** (**MCC**) does, offering another perspective on model performance:

$$
\phi = \frac{(TP \cdot TN - FP \cdot FN)}{\sqrt{(TP+FP) \cdot (TP+FN) \cdot (TN+FP) \cdot (TN+FN)}}
$$
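
To see how the formula involves all four ratios, note that it can be rewritten as $\sqrt{TPR \cdot TNR \cdot PPV \cdot NPV} - \sqrt{FNR \cdot FPR \cdot FDR \cdot FOR}$. The sketch below checks numerically that the two forms agree. The counts are made-up toy values, both functions are hypothetical helpers, and returning 0 when the denominator is zero is a convention adopted here as an assumption.

```python
from math import sqrt

def mcc_from_counts(tp, tn, fp, fn):
    """MCC from the four cells of the confusion matrix."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: convention for an undefined MCC
    return (tp * tn - fp * fn) / denominator

def mcc_from_rates(tp, tn, fp, fn):
    """Same quantity written with the four rates (needs all row/column totals > 0)."""
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    fnr, fpr, fdr, f_om_r = 1 - tpr, 1 - tnr, 1 - ppv, 1 - npv
    return sqrt(tpr * tnr * ppv * npv) - sqrt(fnr * fpr * fdr * f_om_r)

# Made-up counts (not taken from the dataset)
toy = dict(tp=40, tn=30, fp=10, fn=20)
mcc_a = mcc_from_counts(**toy)
mcc_b = mcc_from_rates(**toy)
print(round(mcc_a, 3), round(mcc_b, 3))
```

The MCC ranges from -1 (every prediction wrong) to +1 (every prediction right), with 0 corresponding to chance-level predictions.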

In [None]:
## get the prediction / error counts
tn, fp, fn, tp = cnf_matrix.ravel()

In [None]:
## calculate MCC (a.k.a. phi) by hand
phi =(tp*tn - fp*fn) / np.sqrt((tp+fp)*(fn+tn)*(tp+fn)*(fp+tn))
print("MCC is", round(phi,3))

In [None]:
## use the function matthews_corrcoef from scikit-learn
from sklearn.metrics import matthews_corrcoef

y_test = np.array(bc_data['y_test'])
y_pred = np.array(bc_data['y_pred'])
matthews_corrcoef(y_test, y_pred)

What about the second classification model (column `y_pred_mod2` in the dataset)?

In [None]:
y_test = np.array(bc_data['y_test'])
## use a separate name, so that the base-model predictions in y_pred are not overwritten
y_pred2 = np.array(bc_data['y_pred_mod2'])

In [None]:
phi2 = matthews_corrcoef(y_test, y_pred2)
print("MCC for model 2 is", round(phi2,3))

The MCC for model 2 is higher than that of model 1, although results in terms of AUC are the other way around.
Let's look at the distribution of predictions and errors for the two models, to try and understand what is going on.

Recap: this is the structure of the confusion matrix:

```

      | pred - | pred +
------|--------|--------
obs - |   TN   |   FP
obs + |   FN   |   TP


```

- more positive than negative observations: 73 vs 41
- model 2 gives more FP and fewer FN

<u>Observation-wise</u>
- we expect TPR to not change much (TP are most abundant)
- we expect FPR to be higher for model 2 (more errors in the minority class)
- therefore AUC will be lower for model 2 (similar TPR, larger FPR)

<u>Prediction-wise</u>
- we expect PPV to not change much (again, TP are most abundant)
- we expect NPV to be higher for model 2 (only 1 error among negative predictions)

In [None]:
np.concatenate((cnf_matrix, cnf_matrix2))

In [None]:
## model 1

tn, fp, fn, tp = cnf_matrix.ravel()

FDR = fp/(fp+tp)
FOR = fn/(tn+fn)
PPV = tp/(tp+fp)
NPV = tn/(tn+fn)

FPR = fp/(fp+tn)
FNR = fn/(tp+fn)
TPR = tp/(tp+fn)
TNR = tn/(fp+tn)

accuracy = (tp+tn)/(tp+tn+fp+fn)
ndec = 3

dict1 = {'accuracy' : round(accuracy, ndec), 'AUC' : round(roc_auc,ndec), 'TPR' : round(TPR,ndec),
         'TNR' : round(TNR,ndec), 'FPR' : round(FPR,ndec), 'FNR' : round(FNR,ndec),
         'FDR' : round(FDR,ndec), 'FOR' : round(FOR,ndec), 'PPV' : round(PPV,ndec),
         'NPV' : round(NPV,ndec), 'MCC' : round(phi,ndec)}

In [None]:
## model 2

tn, fp, fn, tp = cnf_matrix2.ravel()

FDR = fp/(fp+tp)
FOR = fn/(tn+fn)
PPV = tp/(tp+fp)
NPV = tn/(tn+fn)

FPR = fp/(fp+tn)
FNR = fn/(tp+fn)
TPR = tp/(tp+fn)
TNR = tn/(fp+tn)

accuracy = (tp+tn)/(tp+tn+fp+fn)
ndec = 3

dict2 = {'accuracy' : round(accuracy, ndec), 'AUC' : round(roc_auc2,ndec), 'TPR' : round(TPR,ndec),
         'TNR' : round(TNR,ndec), 'FPR' : round(FPR,ndec), 'FNR' : round(FNR,ndec),
         'FDR' : round(FDR,ndec), 'FOR' : round(FOR,ndec), 'PPV' : round(PPV,ndec),
         'NPV' : round(NPV,ndec), 'MCC' : round(phi2,ndec)}

In [None]:
df = pd.DataFrame.from_records([dict1, dict2],index=['model_1', 'model_2'])
df

As a matter of fact, we see:

- for model 2, TPR is 4.3% higher, but FPR is about 43% higher (relative changes with respect to model 1): these are the two ingredients of the AUC, which is therefore lower for model 2 (worse relative performance)
- for model 2, PPV is 3.3% lower, but NPV is 8.3% higher (again, relative changes), and this is reflected in a slightly higher MCC (better relative model performance)
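
The relative changes quoted above can be recomputed from the confusion-matrix counts. This is a sketch under the same assumption used for the per-class accuracies: the counts for model 1 are inferred from its reported TPR and TNR. The `rates` helper is hypothetical. Working from unrounded counts can give values that differ in the last digit from those obtained from the rounded table.

```python
def rates(tp, fn, tn, fp):
    """The two observation-wise and the two prediction-wise rates discussed above."""
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

rates_1 = rates(tp=69, fn=4, tn=34, fp=7)    # model 1 (counts inferred: assumption)
rates_2 = rates(tp=72, fn=1, tn=31, fp=10)   # model 2

# Relative change (in %) of model 2 with respect to model 1
relative_change = {
    key: round(100 * (rates_2[key] / rates_1[key] - 1), 1) for key in rates_1
}
print(relative_change)
```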
