# ML Workflow - Supervised Learning (Classification)

![Image](./img/scikit_learn.png)


In [None]:
# imports 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

---

## The Classification Problem

- Binary Classification

- Multiclass Classification

- Multilabel Classification (closely related to Multioutput Classification, but not exactly the same)

- Multitask Classification (also known as Multiclass-multioutput classification)

![Image](./img/classification_problem.png)
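
The problem types listed above differ mainly in the shape and values of the target `y`. The sketch below uses small hand-made arrays (hypothetical, not taken from any dataset in this notebook) and scikit-learn's `type_of_target` helper to show how each kind of target is recognized.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, one per problem type (illustrative values only)
y_binary = np.array([0, 1, 1, 0])                   # one label per sample, two classes
y_multiclass = np.array([0, 2, 1, 2])               # one label per sample, more than two classes
y_multilabel = np.array([[0, 1], [1, 1], [1, 0]])   # several 0/1 labels per sample
y_multitask = np.array([[1, 2], [3, 1], [2, 2]])    # several multiclass labels per sample

examples = {
    "binary": y_binary,
    "multiclass": y_multiclass,
    "multilabel": y_multilabel,
    "multitask": y_multitask,
}

# Print the target type scikit-learn infers for each array
for name, y in examples.items():
    print(f"{name:<12} shape={y.shape!s:<8} -> {type_of_target(y)}")
```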

---

## Binary Classification

- [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

- [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

- [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [None]:
#Load toy dataset

cancer = datasets.load_breast_cancer(as_frame=True)
description = cancer.DESCR

cancer = cancer['data'].merge(cancer['target'], left_index=True, right_index=True)
cancer

In [None]:
print(description)

In [None]:
# Load features and target as NumPy arrays
# (uncomment the make_classification line to use a synthetic dataset instead)

X, y = datasets.load_breast_cancer(return_X_y=True)
#X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
print(X.shape, y.shape)

In [None]:
# Train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}")
print(f"X_train: {type(X_train)}, X_test: {type(X_test)}, y_train: {type(y_train)}, y_test: {type(y_test)}")

In [None]:
%%time

# Model definition

model = SGDClassifier()
#model = SVC()
#model = DecisionTreeClassifier()

hyperparameters = model.get_params()

print(type(model), '\n')
print('Model hyperparameters:', hyperparameters, '\n')

In [None]:
%%time

# Model training

model.fit(X_train, y_train)

print('Model:', model, '\n')
print('Model hyperparameters:', hyperparameters, '\n')

In [None]:
%%time

# Model predictions

predictions = model.predict(X_test)

print(type(predictions))

In [None]:
predictions

In [None]:
# Visual check

check = pd.DataFrame({'Ground truth':y_test, 'Predictions':predictions, 'Diff':y_test-predictions})
check.tail(20)

__Accuracy__

In [None]:
model.score(X_test, y_test)

In [None]:
accuracy_score(y_test, predictions)

In [None]:
accuracy_score(y_test, predictions, normalize=False)

---

### Classification Metrics Definitions

![Image](./img/confusion_matrix_.JPG)

- TP = True Positives (predict 1 when 1)

- TN = True Negatives (predict 0 when 0)

- FP = False Positives (predict 1 when 0)

- FN = False Negatives (predict 0 when 1)
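
As a minimal sketch of these four counts, the example below builds a confusion matrix from small hand-made label arrays (hypothetical values, not the notebook's data) and unpacks it. For binary labels 0/1, scikit-learn orders the flattened matrix as TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```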

---


![Image](./img/precision_recall_f1.jpg)

- Accuracy = (TP+TN)/(TP+TN+FP+FN)

- Precision = TP/(TP+FP)

- Recall = TP/(TP+FN)

- F1 = 2 * (Precision * Recall) / (Precision + Recall)
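
The formulas above can be checked by hand. This sketch counts TP, TN, FP and FN on a small hand-made example (hypothetical labels, not the notebook's data), applies each formula, and compares the results with the corresponding scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions (illustrative only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Count each outcome directly from its definition
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Apply the formulas listed above
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# Compare the manual values with scikit-learn's implementations
print("accuracy ", accuracy, accuracy_score(y_true, y_pred))
print("precision", precision, precision_score(y_true, y_pred))
print("recall   ", recall, recall_score(y_true, y_pred))
print("f1       ", f1, f1_score(y_true, y_pred))
```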

In [None]:
# Precision

precision_score(y_test, predictions)

In [None]:
# Recall

recall_score(y_test, predictions)

__F1 Score__ is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of Precision and Recall

In [None]:
f1_score(y_test, predictions)

__Confusion Matrix__

In [None]:
confusion_matrix(y_test, predictions)

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(13,8))
ax = sns.heatmap(confusion_matrix(y_test, predictions), annot=True, fmt='d')  # fmt='d' shows counts as integers
b, t = ax.get_ylim()
ax.set_ylim(b + 0.5, t - 0.5)
plt.title('Confusion Matrix')
plt.ylabel('Ground Truth')
plt.xlabel('Prediction')
plt.show();

__Receiver Operating Characteristics Curve__

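The ROC curve plots the True Positive Rate against the False Positive Rate as the decision threshold varies. The area under it (AUC) summarizes how well the model distinguishes between the classes.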

![Image](./img/roc_curve.JPG)


- __TPR__ = True Positive Rate = __TP/(TP+FN)__

- Specificity = TN/(TN+FP)

- __FPR__ = False Positive Rate = 1 - Specificity = __FP/(FP+TN)__
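
To make the TPR and FPR definitions concrete, the sketch below uses four hand-made scores (hypothetical values) and scikit-learn's `roc_curve`, which returns one (FPR, TPR) pair per candidate threshold. It also computes both rates manually for a single threshold, chosen here as 0.4 purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and continuous model scores (illustrative only)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per threshold, from the strictest to the loosest
fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# Manual check for one threshold: predict 1 when score >= 0.4
y_pred = (scores >= 0.4).astype(int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

manual_tpr = tp / (tp + fn)   # TPR = TP/(TP+FN)
manual_fpr = fp / (fp + tn)   # FPR = FP/(FP+TN)
print(f"manual at 0.4: FPR={manual_fpr:.2f}  TPR={manual_tpr:.2f}")
```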

In [None]:
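# Area under the curve (AUC)
# Note: hard 0/1 predictions describe a single threshold only. Passing continuous
# scores (e.g. model.decision_function(X_test) for SGDClassifier or SVC) uses the full curve.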

roc_auc_score(y_test, predictions)
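
AUC is usually computed from continuous scores rather than hard class labels, because hard labels collapse the ROC curve to a single threshold. The following is a minimal sketch, not this notebook's exact pipeline: it assumes a `StandardScaler` in front of `SGDClassifier` and a fixed `random_state`, both added here only to keep the example stable and reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumption: feature scaling and random_state are not part of the original cells
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
clf.fit(X_train, y_train)

# Hard labels: a single operating point on the ROC curve
labels = clf.predict(X_test)
auc_from_labels = roc_auc_score(y_test, labels)

# Continuous scores: signed distance to the decision boundary, one value per sample
scores = clf.decision_function(X_test)
auc_from_scores = roc_auc_score(y_test, scores)

print(f"AUC from hard labels: {auc_from_labels:.3f}")
print(f"AUC from scores:      {auc_from_scores:.3f}")
```

Models without `decision_function`, such as `DecisionTreeClassifier`, can supply `predict_proba(X_test)[:, 1]` as the score instead.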

__ROC Curve Interpretation__

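Each point on the ROC curve corresponds to one decision threshold applied to the model's scores (or predicted probabilities). The shape of the curve therefore reflects how much the score distributions of the two classes overlap.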


__Perfect Classifier:__

![Image](./img/roc_01.JPG)

__Real World Classifier:__

![Image](./img/roc_02.JPG)

__Random Classifier:__

![Image](./img/roc_03.JPG)

__Inverted Classifier (classes swapped):__

![Image](./img/roc_04.JPG)
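
The four cases pictured above can be reproduced numerically. This sketch uses synthetic labels and scores (all hypothetical, generated with a fixed seed) to show where the AUC lands in each case: close to 1 for a perfect classifier, between 0.5 and 1 for a realistic one, around 0.5 for a random one, and close to 0 when the classes are swapped.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 2000
y_true = rng.integers(0, 2, size=n)

# Perfect: every positive scores higher than every negative
perfect_scores = y_true.astype(float)

# Real world: informative scores with overlapping class distributions
noisy_scores = y_true + rng.normal(0.0, 1.0, size=n)

# Random: scores carry no information about the label
random_scores = rng.random(size=n)

# Inverted: the ranking is exactly backwards
inverted_scores = 1.0 - perfect_scores

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_noisy = roc_auc_score(y_true, noisy_scores)
auc_random = roc_auc_score(y_true, random_scores)
auc_inverted = roc_auc_score(y_true, inverted_scores)

print(f"perfect:    {auc_perfect:.3f}")
print(f"real world: {auc_noisy:.3f}")
print(f"random:     {auc_random:.3f}")
print(f"inverted:   {auc_inverted:.3f}")
```

An AUC well below 0.5 usually means the scores are informative but the class labels are flipped, so inverting the predictions recovers a useful model.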

---