# Confusion matrix

In this notebook, we illustrate how to:

1. build the confusion matrix
2. plot the confusion matrix
3. extract the basic error rates from the confusion matrix

## Load the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Get the data

We are using results from a mock binary classification problem.
This was based on the [breast cancer Wisconsin dataset](https://github.com/scikit-learn/scikit-learn/blob/6e9039160f0dfc3153643143af4cfdca941d2045/sklearn/datasets/data/breast_cancer.csv) from the Python library `sklearn`.

In this dataset, the objective is to diagnose the status of breast cancer:

- `0`: malignant cancer
- `1`: benign cancer

The dataset contains **569 examples**:

- 212 malignant
- 357 benign

The prediction (classification) is based on 30 numeric features describing the cancer lesions (size, shape, etc.); a full description can be found [here](https://scikit-learn.org/1.5/datasets/toy_dataset.html#breast-cancer-dataset).

In this example, however, we used a random subset of 8 of the 30 features: this made the problem harder and produced more classification errors, which helps to illustrate the different metrics used to measure model performance.

The dataset is imbalanced: the ratio between the two classes is not 1, but about 0.6 (212 malignant / 357 benign).
Again, this helps to show the advantage of using different performance metrics rather than looking only at the error rate or overall accuracy.
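
As a quick check of the figures above, the sketch below recomputes the class ratio from the counts given in this notebook. It also shows the accuracy that a hypothetical classifier would reach by always predicting the majority class; this baseline is our own illustration and not part of the original analysis.

```python
# Class counts reported above for the full dataset
n_malignant = 212
n_benign = 357
n_total = n_malignant + n_benign

# Ratio between the minority and the majority class
class_ratio = n_malignant / n_benign

# Accuracy of a hypothetical classifier that always predicts "benign"
baseline_accuracy = n_benign / n_total

print(f"class ratio: {class_ratio:.2f}")
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
```

With imbalanced classes, a trivial model can already look reasonably accurate, which is why accuracy alone can be misleading.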

An 80% / 20% training / test data split was used to train the classification model and measure its performance: the **test results** are used here to build the confusion matrix and derive the basic error rates.

From the **test dataset**, we read only the two columns that we need for the confusion matrix:

- the vector of **observed test labels**
- the vector of **predicted labels** (classes)

In [None]:
DATASET_URL = 'https://raw.githubusercontent.com/ne1s0n/bioinformateachers/refs/heads/main/dlb/data/predictions.csv'

In [None]:
columns = ['y_test', 'y_pred']

bc_data = pd.read_csv(DATASET_URL, usecols=columns)
bc_data.head()

In [None]:
predictions = bc_data['y_pred']
observations = bc_data['y_test']

predicted_labels = np.where(predictions == 1.0, "benign", "malignant")
target_labels = np.where(observations == 1.0, "benign", "malignant")

We can first have a look at the class distribution (malignant/benign) among observations and predictions:

In [None]:
# count how many predictions fall in each class
labs, counts = np.unique(predicted_labels, return_counts=True)
dict1 = dict(zip(labs, counts))
dict1['set'] = 'predictions'

# same counts for the observed labels
labs, counts = np.unique(target_labels, return_counts=True)
dict2 = dict(zip(labs, counts))
dict2['set'] = 'observations'


df = pd.DataFrame.from_records([dict1,dict2])
df = df.set_index('set')
df

In [None]:
df.plot.bar(rot=0)

## The confusion matrix

For a binary classification problem, the confusion matrix is a 2x2 table, usually with the observed classes on the rows and the predicted classes on the columns. Its four cells are:

- TN: true negatives
- FP: false positives
- FN: false negatives
- TP: true positives

```

      | pred - | pred +
------|--------|--------
obs - |   TN   |   FP
obs + |   FN   |   TP


```
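
To make the four cells concrete, here is a minimal sketch that counts them by hand in plain Python. The two label vectors are made up for illustration and are not the notebook's data.

```python
# Made-up labels for illustration only: 0 = negative, 1 = positive
observed = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

pairs = list(zip(observed, predicted))

# Count each combination of observed and predicted class
toy_tn = sum(1 for obs, pred in pairs if obs == 0 and pred == 0)
toy_fp = sum(1 for obs, pred in pairs if obs == 0 and pred == 1)
toy_fn = sum(1 for obs, pred in pairs if obs == 1 and pred == 0)
toy_tp = sum(1 for obs, pred in pairs if obs == 1 and pred == 1)

# Same layout as the table above: observed on rows, predicted on columns
toy_matrix = [[toy_tn, toy_fp],
              [toy_fn, toy_tp]]
print(toy_matrix)
```

The four counts always add up to the total number of predictions.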

We can use the `confusion_matrix` function from `scikit-learn` to construct the confusion matrix from our vectors of observed and predicted labels (more [details here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)).
By specifying `labels`, you control the order of the classes in the confusion matrix, and therefore which class ends up in the "positive" (second row and column) position: be mindful of this! Here we list `malignant` first, so `benign` (label `1`) is treated as the positive class.
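
The effect of the label order can be seen on a small example. This is a sketch on made-up labels (the `toy_*` names are ours, not part of the notebook's data): swapping the order in `labels` rearranges rows and columns, so the same cell position refers to a different class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
toy_true = ["malignant", "malignant", "benign", "benign", "benign"]
toy_pred = ["malignant", "benign", "benign", "benign", "malignant"]

# "malignant" first: "benign" sits in the second row/column (positive position)
cm_malignant_first = confusion_matrix(toy_true, toy_pred,
                                      labels=["malignant", "benign"])

# "benign" first: now "malignant" sits in the positive position
cm_benign_first = confusion_matrix(toy_true, toy_pred,
                                   labels=["benign", "malignant"])

print(cm_malignant_first)
print(cm_benign_first)
```

The counts are identical in both matrices; only their position changes, which is why TN, FP, FN and TP must always be read with the label order in mind.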

In [None]:
from sklearn.metrics import confusion_matrix

conf_mat_df = confusion_matrix(y_true = target_labels, y_pred = predicted_labels, labels=["malignant","benign"])
print(conf_mat_df)

In [None]:
34/41

The bottom-right cell contains the number of true positives (TP). As a quick sanity check, we can select from the test results only the rows where a positive observation is predicted correctly: we get 69, which matches the confusion matrix.

In [None]:
## sanity check
## let's get the n. of true positives (y_test == 1 AND y_pred == 1)
bc_data.loc[(bc_data['y_test'] == 1) & (bc_data['y_pred'] == 1)].shape[0]

The confusion matrix can be plotted by using a heatmap:

In [None]:
import seaborn as sn

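figure = plt.figure(figsize=(8, 8))
class_names = ["malignant", "benign"]  # same order as `labels` in confusion_matrix
sn.heatmap(conf_mat_df, annot=True, fmt="d", cmap=plt.cm.Blues,
           xticklabels=class_names, yticklabels=class_names)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()  # called after setting the axis labels so that they fit in the figure
plt.show()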

In [None]:
conf_mat_norm = confusion_matrix(target_labels, predicted_labels, normalize='true', labels=["malignant","benign"])
print(conf_mat_norm)
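
With `normalize='true'`, each row of the confusion matrix is divided by its row total, so the cells show per-class rates instead of counts. The same operation can be sketched with NumPy; the counts below are made up for illustration.

```python
import numpy as np

# Made-up counts for illustration only (rows: observed, columns: predicted)
toy_counts = np.array([[30, 10],
                       [5, 55]])

# Divide each row by its total, so that every row sums to 1
row_totals = toy_counts.sum(axis=1, keepdims=True)
toy_norm = toy_counts / row_totals

print(toy_norm)
```

After row-wise normalization, the diagonal holds the fraction of each observed class that was predicted correctly.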

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
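# display_labels follows the same order as `labels` used to build the matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat_norm, display_labels=["malignant", "benign"])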

In [None]:
disp.plot(xticks_rotation=45)

## Basic error metrics from the confusion matrix

The basic error metric is the overall **error rate**, which is the ratio between the number of errors and the total number of predictions:

```
                 # errors
error rate = -------------------
                # predictions

```

The obvious counterpart is the overall **accuracy**:

```
                 # correct predictions
accuracy = -----------------------------
                   # predictions

```

In [None]:
## ERROR RATE
results = bc_data['y_test'] != bc_data['y_pred']
error_rate = results.sum()/len(results)

# round after converting to a percentage, to avoid floating-point noise in the printed value
print("The error rate is:", round(error_rate, 3), "(or", round(error_rate * 100, 1), "%)")

In [None]:
## ACCURACY
results = bc_data['y_test'] == bc_data['y_pred']
accuracy = results.sum()/len(results)

# round after converting to a percentage, to avoid floating-point noise in the printed value
print("The overall accuracy is:", round(accuracy, 3), "(or", round(accuracy * 100, 1), "%)")

We can obtain the same values with the `accuracy_score()` function from `scikit-learn`; the error rate is then simply `1 - accuracy`:

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(target_labels, predicted_labels)
print("Accuracy is:", accuracy)

In [None]:
print("Error rate is", 1-accuracy)

### Error breakdown

In [None]:
tn, fp, fn, tp = conf_mat_df.ravel()

print("TN:", tn)
print("FP:", fp)
print("FN:", fn)
print("TP:", tp)

We can read the confusion matrix either from the perspective of the predictions (column-wise) or from that of the observed values (row-wise).

From the perspective of the observed (true) values, we get the following basic error and accuracy metrics:

- FPR: false positive rate
- FNR: false negative rate
- TNR: true negative rate
- TPR: true positive rate

```

      | pred - | pred + |     error        |    accuracy
------|--------|--------|------------------|-----------------
obs - |   TN   |   FP   | FPR = FP/(TN+FP) | TNR = TN/(TN+FP)
obs + |   FN   |   TP   | FNR = FN/(FN+TP) | TPR = TP/(FN+TP)


```
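
The four row-wise rates in the table can be computed directly from the cell counts. The sketch below uses hypothetical counts (the `toy_*` values are made up, not the notebook's results) to show the arithmetic.

```python
# Hypothetical counts for illustration only, not the notebook's results
toy_tn, toy_fp, toy_fn, toy_tp = 30, 10, 5, 55

# Rates over the observed negatives (first row of the table)
fpr = toy_fp / (toy_tn + toy_fp)  # false positive rate
tnr = toy_tn / (toy_tn + toy_fp)  # true negative rate

# Rates over the observed positives (second row of the table)
fnr = toy_fn / (toy_fn + toy_tp)  # false negative rate
tpr = toy_tp / (toy_fn + toy_tp)  # true positive rate

print(f"FPR: {fpr:.3f}  TNR: {tnr:.3f}")
print(f"FNR: {fnr:.3f}  TPR: {tpr:.3f}")
```

Within each row, the error rate and the accuracy rate are complementary: FPR + TNR = 1 and FNR + TPR = 1.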

## Extension to the multiclass confusion matrix