In [30]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml

# Classification Metrics

Here I run through several classification metrics, using the equivalents in scikit-learn's `metrics` module as a benchmark to check that my implementations behave as expected.

First, I'll load MNIST as a default classification problem and use an `SGDClassifier` to get some baseline predictions. Then I'll compare my own metric implementations against scikit-learn's.

In [8]:
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']

In [9]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [10]:
# making it a binary classification problem if required
y_train_2 = (y_train == '2')
y_test_2 = (y_test == '2')

In [11]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier()
sgd_clf.fit(X_train, y_train_2)

SGDClassifier()

In [12]:
y_pred_2 = sgd_clf.predict(X_test)

## Accuracy

Accuracy is defined as the number of correct predictions divided by the total number of predictions made. Below I check scikit-learn's version of this against my own to see how my implementation performs.
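As a point of reference, the definition above could be sketched in NumPy roughly as follows. The name `accuracy_sketch` is hypothetical and this is not necessarily how `machine_learning.metrics.accuracy` is written; it is only a minimal illustration of the formula on a toy example.

```python
import numpy as np


def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that match the true labels (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Element-wise comparison gives a boolean array; its mean is the share of matches.
    return float(np.mean(y_true == y_pred))


# Toy labels, not MNIST: 3 of the 5 predictions are correct.
y_true = np.array([True, False, False, True, False])
y_pred = np.array([True, False, True, False, False])
print(accuracy_sketch(y_true, y_pred))  # 0.6
```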



In [13]:
# sklearn
from sklearn.metrics import accuracy_score

accuracy_score(y_test_2, y_pred_2)

0.9557

In [14]:
# my version
from machine_learning.metrics import accuracy

accuracy(y_test_2, y_pred_2)

0.9557

It works, which is good. However, accuracy is considered quite a flawed metric for evaluating classifiers, because it copes poorly with datasets where the target classes are unevenly distributed. Imagine a dataset where 99% of the samples have a target of `0` and 1% have a target of `1`. You can make a 99% accurate classifier by predicting `0` for every single instance. A practical example of this is shown below:

In [15]:
# check balance of whole dataset classes
print(f'Train set = {y_train_2.sum()/len(y_train_2)}')
print(f'Test set = {y_test_2.sum()/len(y_test_2)}')

Train set = 0.0993
Test set = 0.1032


In [16]:
# create an array of always false predictions
y_pred_never_2 = np.zeros(len(y_test_2), dtype=bool)

In [17]:
# evaluate this with our accuracy metrics
accuracy(y_test_2, y_pred_never_2)

0.8968

As shown, a classifier that never predicts the positive class still scores close to 90% accuracy here, so more nuanced metrics should be used for proper evaluation of a classifier.

## Precision

The precision of a classifier is defined as the number of true positives divided by the sum of true positives and false positives. This can be thought of as the accuracy of the classifier's positive predictions.
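To make the definition concrete, here is a minimal sketch for boolean labels. The name `precision_sketch` is hypothetical and is not the code in `machine_learning.metrics`; returning `0.0` when there are no positive predictions is an assumption of this sketch, since the ratio is undefined in that case.

```python
import numpy as np


def precision_sketch(y_true, y_pred):
    """TP / (TP + FP) for binary labels (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # predicted positive, actually positive
    fp = np.sum(~y_true & y_pred)   # predicted positive, actually negative
    if tp + fp == 0:
        return 0.0  # assumption: no positive predictions is scored as 0.0
    return float(tp / (tp + fp))


# Toy labels, not MNIST: 2 positive predictions, 1 of them correct.
y_true = np.array([True, True, True, False, False])
y_pred = np.array([True, False, False, False, True])
print(precision_sketch(y_true, y_pred))  # 0.5
```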



In [18]:
from sklearn.metrics import precision_score

precision_score(y_test_2, y_pred_2)

0.7196122296793438

In [19]:
from machine_learning.metrics import precision

precision(y_test_2, y_pred_2)

0.7196122296793438

So the implementation of precision looks like it's working.

## Recall

Recall is defined as the number of true positives divided by the sum of true positives and false negatives. In other words, it's the fraction of actual positive instances that are correctly identified by the classifier.
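The same kind of sketch works for recall. Again, `recall_sketch` is a hypothetical name rather than the function in `machine_learning.metrics`, and returning `0.0` when there are no actual positives is an assumption made here for the undefined case.

```python
import numpy as np


def recall_sketch(y_true, y_pred):
    """TP / (TP + FN) for binary labels (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # actual positives that were found
    fn = np.sum(y_true & ~y_pred)   # actual positives that were missed
    if tp + fn == 0:
        return 0.0  # assumption: no actual positives is scored as 0.0
    return float(tp / (tp + fn))


# Toy labels, not MNIST: 4 actual positives, 3 of them found.
y_true = np.array([True, True, True, True, False])
y_pred = np.array([True, True, True, False, True])
print(recall_sketch(y_true, y_pred))  # 0.75
```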

In [20]:
from sklearn.metrics import recall_score

recall_score(y_test_2, y_pred_2)

0.935077519379845

In [21]:
from machine_learning.metrics import recall

recall(y_test_2, y_pred_2)

0.935077519379845

This implementation looks to be working too.

### Precision Recall Curve

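All of the previous metrics were computed from `predict`, which applies a fixed decision threshold. For an `SGDClassifier` that threshold is 0 on the decision function score, rather than a probability of 0.5. We can change this value and observe the tradeoff we're making between precision and recall as a result.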
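The tradeoff can be illustrated on a handful of made-up scores before touching the real data. In the sketch below, `precision_recall_at` is a hypothetical helper and the scores are invented; with the classifier above, the scores would instead come from `sgd_clf.decision_function(X_test)`. Scoring precision as `1.0` when nothing is predicted positive is an assumption of this sketch.

```python
import numpy as np


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores above threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(scores) > threshold
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    # Assumption: with no positive predictions, precision is reported as 1.0.
    precision = float(tp / (tp + fp)) if tp + fp > 0 else 1.0
    recall = float(tp / (tp + fn)) if tp + fn > 0 else 0.0
    return precision, recall


# Invented decision scores and labels, not MNIST.
y_true = np.array([False, False, True, False, True, True])
scores = np.array([-2.0, -0.5, -0.2, 0.3, 0.8, 1.5])

# Raising the threshold trades recall for precision on this toy data.
for threshold in (-1.0, 0.0, 1.0):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:+.1f}  precision={p:.3f}  recall={r:.3f}")
```

scikit-learn's `precision_recall_curve` in `sklearn.metrics` computes these pairs across all candidate thresholds in one call, which is the more practical route for plotting the full curve.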