
# ABA Tech Lesson 08: Creating the Confusion Matrix in `scikit-learn`

## Learning Objectives
In this lesson we will:
1. Learn the syntax for creating a confusion matrix in `scikit-learn`
2. Customize the confusion matrix for analysis and presentation
3. Learn how to call `classification_report` to get a nice summary of classification metrics



# Initial Imports

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


# Load Data

In this example we are going to load a popular machine learning dataset based on the historic Titanic tragedy.

>- Download/move this file to your working directory: [titanic_train_clean.csv](https://docs.google.com/spreadsheets/d/13ewKdSsxu-kMTq-2IjMHwv5J-H6qbJENbPkM7MX7xL4/edit?usp=sharing)
>- Note: this is not a complete analysis of the dataset; the focus here is on learning the syntax for the confusion matrix



In [None]:
mydir = '/content/drive/MyDrive/BAIM4205'

In [None]:
os.chdir(mydir)
os.getcwd()

'/content/drive/MyDrive/BAIM4205'

Load `titanic_train_clean.csv`

In [None]:
# The file is a CSV, so read it with read_csv rather than read_excel
tt = pd.read_csv('titanic_train_clean.csv')
tt.head()

# Define features and target

>- The usual goal with the Titanic dataset is to predict who survived, so the target is `Survived`
>- For this tutorial we select only a few features so that we can focus on the confusion matrix syntax

In [None]:
X = tt.iloc[:, -4:]   # features: the last four columns of the DataFrame
y = tt['Survived']    # target: 0 = died, 1 = lived
X.head()

In [None]:
y.head()

---

# End of Video 1

---

# Fit a Basic Model

Split the data into training and test sets with `train_test_split()`

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()

logmodel.fit(X_train, y_train)

# Make Predictions

In [None]:
y_pred = logmodel.predict(X_test)

# Evaluate with Confusion Matrix

>- Confusion Matrix: `confusion_matrix`
>- Classification Report: `classification_report`
>>- We can check precision, recall, f1-score using classification report

>- Import statement:

```python
from sklearn.metrics import classification_report, confusion_matrix
```

>- Note: by default, `scikit-learn` returns the confusion matrix in the layout below, with actual classes as rows and predicted classes as columns, so we will stay consistent with it. Other documentation may arrange the matrix differently:

\begin{array}{|c|c|}
\hline
(TN) &  (FP) \\
\hline
(FN) & (TP) \\
\hline
\end{array}
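
To see this layout on a concrete case, here is a minimal sketch using ten made-up labels. The names `toy_true`, `toy_pred`, and `toy_cm` are illustrative and are not part of the Titanic data. The sketch also shows the `labels` argument, which reorders the classes and is one reason other sources display the matrix differently.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only: 0 = died, 1 = lived
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# Default layout: rows are actual classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)
# [[4 1]    -> TN = 4, FP = 1
#  [2 3]]   -> FN = 2, TP = 3

# Listing the positive class first flips the layout to [[TP, FN], [FP, TN]]
toy_cm_flipped = confusion_matrix(toy_true, toy_pred, labels=[1, 0])
print(toy_cm_flipped)
# [[3 2]
#  [1 4]]
```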

---

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm
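
As a side note, `confusion_matrix` also accepts a `normalize` argument that returns proportions instead of counts. The sketch below uses made-up labels (`toy_true` and `toy_pred` are illustrative names, not the Titanic data) and normalizes over the actual classes, so each row sums to 1.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only: 0 = died, 1 = lived
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# normalize='true' divides each row by the number of actual cases in that class
toy_cm_norm = confusion_matrix(toy_true, toy_pred, normalize='true')
print(toy_cm_norm)
# [[0.8 0.2]
#  [0.4 0.6]]
```

Other accepted values are `'pred'` (each column sums to 1) and `'all'` (the whole matrix sums to 1).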

Calling `ravel()` on the `numpy` array flattens it to one dimension, in the order tn, fp, fn, tp, which lets us unpack all four counts in a single assignment.

In [None]:
tn, fp, fn, tp = cm.ravel()

The variables `tn`, `fp`, `fn`, and `tp` now hold the counts of true negatives, false positives, false negatives, and true positives.
>- This will help when calculating the various evaluation metrics

Calculate accuracy:

In [None]:
accuracy = (tp + tn) / (cm.sum())
accuracy
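
The same four counts give the other common metrics. The sketch below uses made-up counts (the `toy_` names and values are illustrative only, not results from the Titanic model) and applies the standard formulas for precision, recall, and F1 of the positive class.

```python
# Made-up counts for illustration only
toy_tn, toy_fp, toy_fn, toy_tp = 4, 1, 2, 3

# Precision: of everything predicted positive, how much was actually positive
toy_precision = toy_tp / (toy_tp + toy_fp)

# Recall: of everything actually positive, how much did we find
toy_recall = toy_tp / (toy_tp + toy_fn)

# F1: harmonic mean of precision and recall
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(toy_precision)     # 0.75
print(toy_recall)        # 0.6
print(round(toy_f1, 3))  # 0.667
```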

---
# End of Video 2

---

# Create a DataFrame from confusion matrix

>- Creating a DataFrame from the confusion matrix will allow us to perform other calculations and more easily see our results

In [None]:
cm_df = pd.DataFrame(cm, index = ['died(0)', 'lived(1)'], columns = ['pred_died(0)', 'pred_lived(1)'])

cm_df

Plot a heatmap of the confusion matrix

In [None]:
sns.heatmap(cm, annot=True, cmap = 'Blues', fmt='d')

plt.title('Confusion Matrix')
plt.xlabel('Prediction Labels')
plt.ylabel('Actual Labels')
plt.show()
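
As an alternative to the seaborn heatmap, `scikit-learn` ships its own plotting helper, `ConfusionMatrixDisplay`, which labels the axes and ticks for us. The sketch below is one possible way to use it, on made-up labels (`toy_true`, `toy_pred`, and `disp` are illustrative names).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Made-up labels for illustration only: 0 = died, 1 = lived
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
toy_cm = confusion_matrix(toy_true, toy_pred)

# display_labels replaces the 0/1 tick labels with readable class names
disp = ConfusionMatrixDisplay(confusion_matrix=toy_cm, display_labels=['died', 'lived'])

# values_format='d' prints the cell counts as integers
disp.plot(cmap='Blues', values_format='d')
disp.ax_.set_title('Confusion Matrix (toy data)')
```

In a notebook the figure renders at the end of the cell; in a plain script, a call to `plt.show()` would be needed.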

Now we can add a new column and a new row that hold the totals: `column_total` stores the sum of each row, and `row_total` stores the sum of each column.
>- This can help calculate other evaluation metrics

In [None]:
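# Sum across each row (axis=1): the new column holds one total per actual class
cm_df['column_total'] = cm_df.sum(axis=1)

# Sum down each column (axis=0): the new row holds one total per predicted class
cm_df.loc['row_total'] = cm_df.sum(axis=0)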

cm_df
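
If the goal is only a table with totals, `pd.crosstab` with `margins=True` is another route. This is a sketch on made-up labels rather than the approach used above; `toy_true`, `toy_pred`, and `toy_ct` are illustrative names.

```python
import pandas as pd

# Made-up labels for illustration only: 0 = died, 1 = lived
toy_true = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 0], name='actual')
toy_pred = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], name='predicted')

# margins=True appends a totals row and a totals column in one step
toy_ct = pd.crosstab(toy_true, toy_pred, margins=True, margins_name='total')
print(toy_ct)

# Number of passengers who actually died (row 0 total)
print(toy_ct.loc[0, 'total'])        # 5
# Number of passengers predicted to live (column 1 total)
print(toy_ct.loc['total', 1])        # 4
```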

Recall of the negative class (died) can be found by:
>- Later we will find this in the classification report

In [None]:
# True negatives divided by all passengers who actually died
cm_df.loc['died(0)', 'pred_died(0)'] / cm_df.loc['died(0)', 'column_total']
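
One way to check a hand calculation like this is `recall_score` with `pos_label=0`, which treats the negative class as the class of interest. The sketch below uses made-up labels (`toy_true` and `toy_pred` are illustrative names) and compares the two results.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only: 0 = died, 1 = lived
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# By hand: TN / (TN + FP)
toy_tn, toy_fp, toy_fn, toy_tp = confusion_matrix(toy_true, toy_pred).ravel()
recall_died_manual = toy_tn / (toy_tn + toy_fp)

# With scikit-learn: recall computed for class 0 instead of class 1
recall_died_sklearn = recall_score(toy_true, toy_pred, pos_label=0)

print(recall_died_manual)   # 0.8
print(recall_died_sklearn)  # 0.8
```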

# Classification Report

>- `scikit-learn` provides `classification_report`, which makes getting key classification metrics convenient.

In [None]:
print(classification_report(y_test, y_pred, target_names = ['died', 'lived']))

To convert the classification report to a DataFrame, pass `output_dict=True` to the call and then pass the resulting dictionary to the DataFrame constructor

In [None]:
report = classification_report(y_test, y_pred, target_names = ['died', 'lived'], output_dict = True)

pd.DataFrame(report).transpose().round(2)
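
With `output_dict=True` the report is a nested dictionary, so individual metrics can be looked up by class name and metric name. The sketch below shows this on made-up labels; `toy_true`, `toy_pred`, and `toy_report` are illustrative names.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only: 0 = died, 1 = lived
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

toy_report = classification_report(
    toy_true, toy_pred, target_names=['died', 'lived'], output_dict=True
)

# Top-level keys are the class names plus the summary entries
print(list(toy_report.keys()))
# ['died', 'lived', 'accuracy', 'macro avg', 'weighted avg']

# Per-class metrics are nested one level down
print(toy_report['died']['recall'])      # 0.8
print(toy_report['lived']['precision'])  # 0.75

# Accuracy is stored as a single number rather than a nested dictionary
print(toy_report['accuracy'])            # 0.7
```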

---
# End of Video 3

---