# Confusion matrix

In this notebook, we illustrate how to:

1. build the confusion matrix
2. plot the confusion matrix
3. extract the basic error rates from the confusion matrix

## Load the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Get the data

We use the results of a mock binary classification problem.
The results are based on the [breast cancer Wisconsin dataset](https://github.com/scikit-learn/scikit-learn/blob/6e9039160f0dfc3153643143af4cfdca941d2045/sklearn/datasets/data/breast_cancer.csv) distributed with the Python library `scikit-learn` (`sklearn`).

In this dataset, the objective is to diagnose the status of breast cancer:

- `0`: malignant cancer
- `1`: benign cancer

The dataset contains **569 examples**:

- 212 malignant
- 357 benign
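
As a quick arithmetic check on the counts above, the sketch below computes the share of each class and the accuracy that a hypothetical classifier would reach by always predicting the majority class. The variable names are illustrative, and the proportions refer to the full dataset: the predictions loaded later in this notebook may cover only part of it, so their class balance can differ.

```python
# Class counts as stated in the text above.
n_malignant = 212
n_benign = 357
n_total = n_malignant + n_benign  # 569 examples in total

# Share of each class in the full dataset.
frac_malignant = n_malignant / n_total
frac_benign = n_benign / n_total

# Accuracy of a hypothetical classifier that always predicts the majority class.
majority_baseline = max(frac_malignant, frac_benign)

print(f"malignant: {frac_malignant:.3f}, benign: {frac_benign:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

A baseline of this kind is a useful reference point when reading an accuracy value on an unbalanced problem.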

The prediction (classification) is based on 30 numeric features describing the cancer lesions (size, shape, etc.); the full description is available [here](https://scikit-learn.org/1.5/datasets/toy_dataset.html#breast-cancer-dataset).

In this example we actually used a random subset of 8 of the 30 features. This makes the problem harder and produces more classification errors, which helps to illustrate the different metrics used to measure model performance.
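
The feature subset that was actually used is not recorded in this notebook. As a minimal sketch of how such a random subset could be drawn, the snippet below samples 8 distinct column indices out of 30; the seed and the variable names are assumptions made only for illustration.

```python
import random

N_FEATURES = 30  # total number of features in the dataset
N_SELECTED = 8   # size of the random subset mentioned above

# Hypothetical seed, used only to make the sketch reproducible.
rng = random.Random(42)

# Sample without replacement, so each feature index appears at most once.
selected_idx = sorted(rng.sample(range(N_FEATURES), N_SELECTED))
print(selected_idx)
```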

In [None]:
DATASET_URL = 'https://raw.githubusercontent.com/ne1s0n/bioinformateachers/refs/heads/main/dlb/data/predictions.csv'

In [None]:
columns = ['y_test', 'y_pred']

bc_data = pd.read_csv(DATASET_URL, usecols=columns)
bc_data.head()

In [None]:
from sklearn.metrics import confusion_matrix

predictions = bc_data['y_pred']
observations = bc_data['y_test']

# Map numeric labels to names: 1 -> "case" (positive class), anything else -> "control"
predicted_labels = np.where(predictions == 1.0, "case", "control")
target_labels = np.where(observations == 1.0, "case", "control")

In [None]:
# Count how many examples fall in each class, for predictions and observations
labs, counts = np.unique(predicted_labels, return_counts=True)
dict1 = dict(zip(labs, counts))
dict1['set'] = 'predictions'

labs, counts = np.unique(target_labels, return_counts=True)
dict2 = dict(zip(labs, counts))
dict2['set'] = 'observations'

pd.DataFrame.from_records([dict1, dict2])

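The confusion matrix cross-tabulates observed classes (rows) against predicted classes (columns). With "control" as the negative class and "case" as the positive class, its layout is:

```
      | pred - | pred +
------|--------|--------
obs - |   TN   |   FP
obs + |   FN   |   TP
```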
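
To make the layout above concrete, here is a minimal pure-Python sketch that fills the four cells by counting. The function name `confusion_counts` and the toy labels are made up for illustration and are not taken from the notebook's data.

```python
def confusion_counts(observed, predicted, negative="control", positive="case"):
    """Return [[TN, FP], [FN, TP]]: observations on rows, predictions on columns."""
    index = {negative: 0, positive: 1}
    matrix = [[0, 0], [0, 0]]
    for obs, pred in zip(observed, predicted):
        # Row is chosen by the observed class, column by the predicted class.
        matrix[index[obs]][index[pred]] += 1
    return matrix

# Toy labels, invented for this example.
toy_obs = ["case", "case", "case", "control", "control", "control", "control", "case"]
toy_pred = ["case", "control", "case", "control", "case", "control", "control", "case"]

toy_matrix = confusion_counts(toy_obs, toy_pred)
(tn_toy, fp_toy), (fn_toy, tp_toy) = toy_matrix
print(toy_matrix)  # [[3, 1], [1, 3]]
```

The same row and column convention is what `sklearn.metrics.confusion_matrix(y_true, y_pred)` returns, with the true labels passed first.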

In [None]:
# confusion_matrix expects (y_true, y_pred): rows = observations, columns = predictions
conf_mat_df = confusion_matrix(target_labels, predicted_labels, labels=["control","case"])
print(conf_mat_df)

In [None]:
## sanity check
## let's get the n. of true positives (y_test == 1 AND y_pred == 1)
bc_data.loc[(bc_data['y_test'] == 1) & (bc_data['y_pred'] == 1)].shape[0]

In [None]:
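import seaborn as sn

class_names = ["control", "case"]

figure = plt.figure(figsize=(8, 8))
# fmt="d" prints integer counts instead of scientific notation
sn.heatmap(conf_mat_df, annot=True, fmt="d", cmap=plt.cm.Blues,
           xticklabels=class_names, yticklabels=class_names)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()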

In [None]:
conf_mat_norm = confusion_matrix(target_labels, predicted_labels, normalize='true', labels=["control","case"])
print(conf_mat_norm)
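
With `normalize='true'`, each row of the matrix is divided by its own sum, so every row adds up to 1 and the cells become rates per observed class: the first row holds the true negative and false positive rates, the second the false negative and true positive rates. A minimal sketch of the same operation, using a hypothetical helper `normalize_rows` and made-up counts:

```python
def normalize_rows(matrix):
    """Divide each row by its sum so that every row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against an empty class (row sum of zero).
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

# Made-up counts in the [[TN, FP], [FN, TP]] layout.
toy_counts = [[3, 1], [1, 3]]
toy_norm = normalize_rows(toy_counts)
print(toy_norm)  # [[0.75, 0.25], [0.25, 0.75]]
```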

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
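# display_labels puts the class names on the axes instead of 0 and 1
disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat_norm, display_labels=["control", "case"])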

In [None]:
disp.plot(xticks_rotation=45)

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(target_labels, predicted_labels)

In [None]:
accuracy

In [None]:
tn, fp, fn, tp = conf_mat_df.ravel()
tp
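
From the four cells, the basic error rates follow directly. The sketch below is one possible way to compute them; the function name `error_rates` and the example counts are assumptions for illustration, not values from this notebook. It assumes both classes are present, otherwise some denominators would be zero.

```python
def error_rates(tn, fp, fn, tp):
    """Compute basic rates from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,            # share of correct predictions
        "sensitivity": tp / (tp + fn),            # true positive rate (recall)
        "specificity": tn / (tn + fp),            # true negative rate
        "false_positive_rate": fp / (fp + tn),    # 1 - specificity
        "false_negative_rate": fn / (fn + tp),    # 1 - sensitivity
        "precision": tp / (tp + fp),              # positive predictive value
    }

# Made-up counts, used only to show the arithmetic.
toy_rates = error_rates(tn=50, fp=10, fn=5, tp=35)
for name, value in toy_rates.items():
    print(f"{name}: {value:.3f}")
```

With the values unpacked in the cell above, the call would be `error_rates(tn, fp, fn, tp)`; its accuracy entry should agree with the value returned by `accuracy_score`.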