## Get Data - MNIST

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False)

In [None]:
X, y = mnist.data, mnist.target
X

In [None]:
print(X.shape, y.shape)

### Visualize a sample

In [None]:
import matplotlib.pyplot as plt
def plot_digit(image_data):
    image = image_data.reshape(28,28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

In [None]:
sample = X[5]
plot_digit(sample)
plt.show()

In [None]:
for i in range(9):
    plt.subplot(3,3,i+1)
    plot_digit(X[i])

plt.show()

### Split trainset and testset
The MNIST dataset returned by fetch_openml() is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images).

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] 

## Training a Binary Classifier
Let’s simplify the problem for now and only try to identify one digit—for example, the number 5. This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and non-5.

In [None]:
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

In [None]:
sgd_clf.predict([X[0]])

## Performance Measures

### Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

The accuracy is above 90% on every fold, which looks impressive. But only about
10% of the images are 5s, so a classifier that always guesses that an image is
not a 5 would also be right about 90% of the time. Beats Nostradamus.
This demonstrates why accuracy is generally not the preferred performance
measure for classifiers, especially when you are dealing with ***skewed datasets***
(i.e., when some classes are much more frequent than others). A much better
way to evaluate the performance of a classifier is to look at the confusion
matrix (CM).
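
To make the point about skewed classes concrete, here is a minimal sketch with made-up labels (not the MNIST data): 10% of the instances are positive, and the "classifier" always predicts the negative class. The array names are illustrative only.

```python
import numpy as np

# Hypothetical labels: 1,000 instances, 10% positive (mimicking the share of 5s).
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# A "classifier" that ignores its input and always predicts the negative class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
n_detected = int((y_pred & y_true).sum())
print(accuracy, n_detected)
```

The accuracy is 90% even though not a single positive instance is detected. Scikit-Learn's DummyClassifier can serve as the same kind of baseline on real data.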

### Confusion Matrices
To compute the confusion matrix, you first need to have a set of predictions
so that they can be compared to the actual targets. 

In [None]:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Just like the cross_val_score() function, cross_val_predict() performs k-fold
cross-validation, but instead of returning the evaluation scores, it returns the
predictions made on each test fold. This means that you get a clean prediction
for each instance in the training set.
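
The idea of out-of-fold predictions can be sketched by hand. The example below is a simplified illustration, not Scikit-Learn's implementation: it uses made-up one-dimensional data, a shuffled (non-stratified) split, and a toy "model" that just places a cut-off halfway between the two class means. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 1-D data: negatives centered at 0, positives centered at 3.
X_toy = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 1.0, 30)])
y_toy = np.concatenate([np.zeros(60, dtype=bool), np.ones(30, dtype=bool)])

# Shuffle the indices, then cut them into k folds.
k = 3
indices = rng.permutation(len(X_toy))
folds = np.array_split(indices, k)

y_toy_pred = np.empty(len(X_toy), dtype=bool)
n_predictions = np.zeros(len(X_toy), dtype=int)

for test_idx in folds:
    train_idx = np.setdiff1d(indices, test_idx)
    X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
    # "Training": put the cut-off halfway between the two class means,
    # using only the instances outside the held-out fold.
    cutoff = (X_tr[y_tr].mean() + X_tr[~y_tr].mean()) / 2
    # Predict only on the held-out fold.
    y_toy_pred[test_idx] = X_toy[test_idx] > cutoff
    n_predictions[test_idx] += 1
```

Every instance ends up with exactly one prediction, and that prediction comes from a model that never saw the instance during training. That is what "clean" means here.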

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train_5, y_train_pred)
cm

Each row in a confusion matrix represents an actual class, while each column
represents a predicted class. The first row of this matrix considers non-5
images (the **negative class**): 53,892 of them were correctly classified as
non-5s (they are called ***true negatives***), while the remaining 687 were wrongly
classified as 5s (***false positives***, also called type I errors). The second row
considers the images of 5s (the **positive class**): 1,891 were wrongly classified
as non-5s (***false negatives***, also called type II errors), while the remaining
3,530 were correctly classified as 5s (***true positives***).
> **|TN FP|** <br>
> **|FN TP|**
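
As a small illustration of this layout, the four cells can be counted directly with boolean masks. The labels and predictions below are made up for the example.

```python
import numpy as np

# Hypothetical labels and predictions for eight instances (True = positive class).
y_true = np.array([False, False, False, True, True, False, True, False])
y_pred = np.array([False, True, False, True, False, False, True, False])

tn = int((~y_true & ~y_pred).sum())  # actual negative, predicted negative
fp = int((~y_true & y_pred).sum())   # actual negative, predicted positive
fn = int((y_true & ~y_pred).sum())   # actual positive, predicted negative
tp = int((y_true & y_pred).sum())    # actual positive, predicted positive

# Same layout as above: rows are actual classes, columns are predicted classes.
cm_toy = np.array([[tn, fp],
                   [fn, tp]])
print(cm_toy)
```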

### Precision and Recall
The confusion matrix gives you a lot of information, but sometimes you may
prefer a more concise metric. An interesting one to look at is the accuracy of
the positive predictions; this is called the **precision** of the classifier. <br>
***precision = TP / (TP+FP)*** <br>
Precision is typically used along with another metric named
recall, also called sensitivity or the true positive rate (TPR): this is the ratio
of positive instances that are correctly detected by the classifier. <br>
***recall = TP / (TP + FN)***
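
As a quick check, both formulas can be applied to the counts quoted in the confusion matrix discussion above:

```python
# Counts quoted in the confusion matrix discussion.
tn, fp = 53892, 687
fn, tp = 1891, 3530

precision = tp / (tp + fp)  # share of "5" predictions that are correct
recall = tp / (tp + fn)     # share of actual 5s that are detected
print(f"precision = {precision:.4f}, recall = {recall:.4f}")
```

With these counts, the detector is right about 84% of the time when it claims an image is a 5, and it finds only about 65% of the 5s.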

In [None]:
from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_pred), recall_score(y_train_5, y_train_pred))

It is often convenient to combine precision and recall into a single metric
called the F1 score, especially when you need a single metric to compare two
classifiers. The F1 score is the harmonic mean of precision and recall. <br>
F1 = 2 / (1/precision + 1/recall) = 2 × (precision × recall) / (precision + recall)
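
The harmonic mean gives much more weight to low values than the regular mean does, so a classifier only gets a high score if both precision and recall are high. A minimal sketch, using a hypothetical helper named `f1` and made-up values:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.8, 0.8)       # both metrics equal: the score equals them
lopsided = f1(1.0, 0.1)       # perfect precision, poor recall
arithmetic = (1.0 + 0.1) / 2  # the plain average would hide the poor recall
print(balanced, lopsided, arithmetic)
```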

In [None]:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)

### The Precision/Recall Trade-off
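To understand this trade-off, consider how the SGDClassifier makes its
classification decisions. For each instance, it computes a score based on a
decision function. If that score is greater than a threshold, it assigns the
instance to the positive class; otherwise it assigns it to the negative class.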
Scikit-Learn does not let you set the threshold directly, but it does give you
access to the decision scores that it uses to make predictions. Instead of
calling the classifier’s predict() method, you can call its decision_function()
method, which returns a score for each instance, and then use any threshold
you want to make predictions based on those scores.
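
The effect of moving the threshold can be seen on a handful of made-up scores. This is a sketch with hypothetical values, not output from the SGDClassifier:

```python
import numpy as np

# Hypothetical decision scores and true labels for six instances.
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores > threshold."""
    preds = scores > threshold
    tp = (preds & labels).sum()
    fp = (preds & ~labels).sum()
    fn = (~preds & labels).sum()
    return tp / (tp + fp), tp / (tp + fn)

p_low, r_low = precision_recall_at(0.0)    # lenient threshold
p_high, r_high = precision_recall_at(1.5)  # stricter threshold
print(p_low, r_low, p_high, r_high)
```

Raising the threshold from 0 to 1.5 removes the one false positive (precision goes from 0.75 to 1.0) but also drops a true positive (recall goes from 1.0 to about 0.67).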

In [None]:
y_scores = sgd_clf.decision_function([X[0]])
y_scores

In [None]:
threshold = 0
y_some_digit_pred = (y_scores > threshold)

How do you decide which threshold to use? First, use the cross_val_predict()
function to get the scores of all instances in the training set, but this time
specify that you want to return decision scores instead of predictions:

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

With these scores, use the precision_recall_curve() function to compute
precision and recall for all possible thresholds (the function adds a last
precision of 1 and a last recall of 0, which have no corresponding threshold):

In [None]:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)


In [None]:
plt.figure(figsize=(9,3))
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
plt.legend()
plt.grid()
plt.axis([-40000,40000,0,1])
plt.show()


In [None]:
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
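
The cell above relies on a NumPy idiom: calling argmax() on a Boolean array returns the index of the first True value. A small sketch with made-up arrays (the `toy_` names are hypothetical) shows the mechanics:

```python
import numpy as np

# Hypothetical values in the layout precision_recall_curve() uses:
# one more precision value than thresholds.
toy_precisions = np.array([0.50, 0.70, 0.92, 0.88, 0.95, 1.00])
toy_thresholds = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])

mask = toy_precisions >= 0.90   # [False, False, True, False, True, True]
idx = mask.argmax()             # index of the first True
toy_threshold = toy_thresholds[idx]

# Caveat: argmax() also returns 0 when no entry is True, so check mask.any().
assert mask.any()
print(idx, toy_threshold)
```

Note that precision is not guaranteed to increase monotonically with the threshold, so the first index that reaches 90% is simply the lowest qualifying threshold.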

### The ROC Curve
The receiver operating characteristic (ROC) curve is another common tool
used with binary classifiers. It is very similar to the precision/recall curve, but
instead of plotting precision versus recall, the ROC curve plots the true
positive rate (another name for recall) against the false positive rate (FPR).
The FPR (also called the fall-out) is the ratio of negative instances that are
incorrectly classified as positive. It is equal to 1 – the true negative rate
(TNR), which is the ratio of negative instances that are correctly classified as
negative. The TNR is also called specificity. Hence, the ROC curve plots
sensitivity (recall) versus 1 – specificity. <br>
To plot the ROC curve, you first use the roc_curve() function to compute the
TPR and FPR for various threshold values:


In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
idx_for_threshold_at_90 = (thresholds <= threshold_for_90_precision).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")
[...] # beautify the figure: add labels, grid, legend, arrow, and text
plt.show()
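
Each point on the ROC curve is a (FPR, TPR) pair for one threshold. As an illustration, the counts from the confusion matrix computed earlier give the point that corresponds to the threshold used by predict():

```python
# Counts quoted in the confusion matrix discussion.
tn, fp = 53892, 687
fn, tp = 1891, 3530

tpr_point = tp / (tp + fn)  # true positive rate (recall, sensitivity)
fpr_point = fp / (fp + tn)  # false positive rate (fall-out)
tnr_point = tn / (tn + fp)  # true negative rate (specificity)
print(f"TPR = {tpr_point:.4f}, FPR = {fpr_point:.4f}, TNR = {tnr_point:.4f}")
```

The FPR and TNR always sum to 1, which is why the horizontal axis can be read as 1 – specificity.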

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
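
The score returned above is the area under the ROC curve. As a rough sketch of what that means, the area can be approximated with the trapezoidal rule over the curve's points. The points below are made up, and this is an illustration rather than Scikit-Learn's own code:

```python
import numpy as np

# Hypothetical ROC points, ordered by increasing false positive rate.
roc_fpr = np.array([0.0, 0.0, 0.5, 1.0])
roc_tpr = np.array([0.0, 0.5, 1.0, 1.0])

# Trapezoidal rule: width of each FPR step times the mean TPR over that step.
auc = float(np.sum(np.diff(roc_fpr) * (roc_tpr[1:] + roc_tpr[:-1]) / 2))

# The diagonal of a purely random classifier (TPR == FPR) encloses an area of 0.5.
random_auc = float(np.sum(np.diff(roc_fpr) * (roc_fpr[1:] + roc_fpr[:-1]) / 2))
print(auc, random_auc)
```

A perfect classifier has an area of 1, while a purely random one has an area of 0.5.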

Now let's create a RandomForestClassifier so that we can compare its PR curve and F1 score with those of the SGDClassifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)

The precision_recall_curve() function expects labels and scores for each
instance, so we need to train the random forest classifier and make it assign a
score to each instance. But the RandomForestClassifier class does not have a
decision_function() method, due to the way it works. Luckily, it has a predict_proba() method that returns class
probabilities for each instance, and we can just use the probability of the
positive class as a score, so it will work fine.

In [None]:
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_probas_forest[:2]

In [None]:
y_scores_forest = y_probas_forest[:, 1]
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(
    y_train_5, y_scores_forest)

# Plot both PR curves on the same axes
plt.plot(recalls_forest, precisions_forest, "b-", linewidth=2,
         label="Random Forest")
plt.plot(recalls, precisions, "--", linewidth=2, label="SGD")
plt.legend()  # without this call the labels above are never displayed
plt.show()

You now know how to train binary classifiers, choose the appropriate metric
for your task, evaluate your classifiers using cross-validation, select the
precision/recall trade-off that fits your needs, and use several metrics and
curves to compare various models. You’re ready to try to detect more than
just the 5s.

## Multiclass Classification
One way to create a system that can classify the digit images into 10 classes
(from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector,
a 1-detector, a 2-detector, and so on). Then when you want to classify an
image, you get the decision score from each classifier for that image and you
select the class whose classifier outputs the highest score. This is called the
one-versus-the-rest (OvR) strategy, or sometimes one-versus-all (OvA). <br>
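
The selection step of the OvR strategy boils down to an argmax over the per-class scores. A minimal sketch, with made-up scores standing in for the outputs of ten binary detectors:

```python
import numpy as np

classes = np.array([str(d) for d in range(10)])

# Hypothetical decision scores for one image; entry i is the score of the
# "digit i versus the rest" classifier.
ovr_scores = np.array([-4.2, -7.9, -3.1, 0.4, -9.6, 2.7, -6.3, -8.0, -1.5, -5.4])

# Pick the class whose detector is the most confident.
predicted = classes[ovr_scores.argmax()]
print(predicted)
```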



In [None]:
from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])

That was easy! We trained the SVC using the original target classes from 0 to
9 (y_train), instead of the 5-versus-the-rest target classes (y_train_5). Since
there are 10 classes (i.e., more than 2), Scikit-Learn used the one-versus-one
(OvO) strategy, which trains one binary classifier for every pair of digits,
so it trained 10 × 9 / 2 = 45 binary classifiers. Now let’s make a prediction on an image:


In [None]:
svm_clf.predict([X_train[0]])
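
As a quick check on the number of pairwise classifiers in a one-versus-one scheme, the pairs of digits can simply be enumerated:

```python
from itertools import combinations

digits = range(10)
pairs = list(combinations(digits, 2))  # (0, 1), (0, 2), ..., (8, 9)

n_ovo = len(pairs)
n_formula = len(digits) * (len(digits) - 1) // 2  # N * (N - 1) / 2
print(n_ovo, n_formula)
```

Each of these 45 classifiers only needs to be trained on the images of the two digits it distinguishes.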

If you call the
decision_function() method, you will see that it returns 10 scores per
instance: one per class.

In [None]:
some_digit_scores = svm_clf.decision_function([X_train[0]])
some_digit_scores.round(2)

In [None]:
some_digit_scores.argmax()

If you want to force Scikit-Learn to use one-versus-one or one-versus-the-rest,
you can use the OneVsOneClassifier or OneVsRestClassifier classes.
Simply create an instance and pass a classifier to its constructor (it doesn’t
even have to be a binary classifier).

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])


In [None]:
ovr_clf.predict([X_train[0]])

Training an SGDClassifier on a multiclass dataset and using it to make
predictions is just as easy:

In [None]:
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([X_train[0]])

In [None]:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype("float64"))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
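
Standardization rescales each feature to zero mean and unit variance. The sketch below applies the formula by hand to a tiny made-up matrix; it illustrates the idea and is not a replacement for StandardScaler, which additionally leaves constant features (such as MNIST's always-blank border pixels) unscaled instead of dividing by zero.

```python
import numpy as np

# Hypothetical feature matrix: 4 instances, 2 features on different scales.
X_toy = np.array([[0.0, 100.0],
                  [50.0, 110.0],
                  [100.0, 120.0],
                  [250.0, 150.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_toy_scaled = (X_toy - mean) / std

print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```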

## Error Analysis
Here, we will assume that you have found a promising model and you want
to find ways to improve it. One way to do this is to analyze the types of errors
it makes. <br>
First, look at the confusion matrix. For this, you first need to make
predictions using the cross_val_predict() function; then you can pass the
labels and predictions to the confusion_matrix() function, just like you did
earlier. However, since there are now 10 classes instead of 2, the confusion
matrix will contain quite a lot of numbers, and it may be hard to read.


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
plt.show()

In [None]:
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        normalize="true", values_format=".0%")
plt.show()
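
Normalizing by true class, as in the cell above, means dividing each row of the confusion matrix by its total. A minimal sketch on a made-up 3-class matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: actual class, columns: predicted).
cm_toy = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 1, 9]])

# Divide each row by its total so that every row sums to 1.
cm_by_true = cm_toy / cm_toy.sum(axis=1, keepdims=True)
print(cm_by_true)
```

Each diagonal entry then reads as the share of that class's images that were classified correctly, regardless of how many images the class contains.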


To make the errors stand out more, we try putting zero weight on the correct predictions.

In [None]:
sample_weight = (y_train_pred != y_train)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        sample_weight=sample_weight,
                                        normalize="pred", values_format=".0%")
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        sample_weight=sample_weight,
                                        normalize="true", values_format=".0%")
plt.show()
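
Giving zero weight to the correct predictions amounts to zeroing the diagonal of the confusion matrix before normalizing. The sketch below does this on a made-up 3-class matrix and normalizes both ways:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: actual class, columns: predicted).
cm_toy = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 1, 9]])

errors = cm_toy.copy()
np.fill_diagonal(errors, 0)  # drop the correct predictions

# Row-normalized: of the misclassified images of class i, what share went to j?
errors_by_true = errors / errors.sum(axis=1, keepdims=True)
# Column-normalized: of the wrong "j" predictions, what share was actually i?
errors_by_pred = errors / errors.sum(axis=0, keepdims=True)
print(errors_by_true)
print(errors_by_pred)
```

The two views answer different questions, so the percentages must be read carefully: a row-normalized cell says nothing about how often the class is misclassified overall, only where its errors go.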

## Multilabel Classification

Until now, each instance has always been assigned to just one class. But in
some cases you may want your classifier to output multiple classes for each
instance.

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')
y_train_odd = (y_train.astype('int8') % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

In [None]:
knn_clf.predict([X_train[0]])

There are many ways to evaluate a multilabel classifier, and selecting the
right metric really depends on your project. One approach is to measure the
F1 score for each individual label (or any other binary classifier metric
discussed earlier), then simply compute the average score.

In [None]:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
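
Macro averaging computes the F1 score of each label separately and then takes the plain mean. The sketch below does this by hand on made-up multilabel data and also shows a support-weighted mean for comparison; the helper `f1_binary` is hypothetical.

```python
import numpy as np

# Hypothetical multilabel targets and predictions: 6 instances, 2 labels.
y_true = np.array([[1, 1], [1, 0], [1, 0], [1, 0], [0, 0], [0, 0]], dtype=bool)
y_pred = np.array([[1, 1], [1, 1], [1, 0], [0, 0], [0, 0], [0, 0]], dtype=bool)

def f1_binary(t, p):
    """F1 score of one label, computed from its TP, FP, and FN counts."""
    tp = (t & p).sum()
    fp = (~t & p).sum()
    fn = (t & ~p).sum()
    return 2 * tp / (2 * tp + fp + fn)

per_label = np.array([f1_binary(y_true[:, j], y_pred[:, j])
                      for j in range(y_true.shape[1])])
support = y_true.sum(axis=0)  # number of positive instances per label

macro_f1 = per_label.mean()                                # every label counts equally
weighted_f1 = (per_label * support).sum() / support.sum()  # frequent labels count more
print(per_label, macro_f1, weighted_f1)
```

Here the first label has four positive instances and the second only one, so the weighted average is pulled toward the first label's score.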

This approach assumes that all labels are equally important, which may not
be the case.  One simple option is to give each label a weight equal to its
support (i.e., the number of instances with that target label). To do this,
simply set average="weighted" when calling the f1_score() function. <br><br>
If you wish to use a classifier that does not natively support multilabel
classification, such as SVC, one possible strategy is to train one model per
label. However, this strategy may have a hard time capturing the
dependencies between the labels. For example, a large digit (7, 8, or 9) is
twice as likely to be odd as even, but the classifier for the “odd” label
does not know what the classifier for the “large” label predicted. To solve this
issue, the models can be organized in a chain: when a model makes a
prediction, it uses the input features plus all the predictions of the models that
come before it in the chain. <br><br>
The good news is that Scikit-Learn has a class called ClassifierChain that
does just that! By default it will use the true labels for training, feeding each
model the appropriate labels depending on their position in the chain. But if
you set the cv hyperparameter, it will use cross-validation to get “clean” (out-of-sample) predictions from each trained model for every instance in the
training set, and these predictions will then be used to train all the models
later in the chain.


In [None]:
from sklearn.multioutput import ClassifierChain
chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X_train[:2000], y_multilabel[:2000])
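
The core of the chaining idea is a feature-augmentation step: each model after the first receives the original features plus the labels that precede it in the chain. The sketch below shows only that step on made-up arrays; it is not how ClassifierChain is implemented internally, and all names are hypothetical.

```python
import numpy as np

# Hypothetical data: 4 instances, 3 input features, 2 labels per instance.
X_toy = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9],
                  [1.0, 1.1, 1.2]])
Y_toy = np.array([[1, 0],
                  [0, 1],
                  [1, 1],
                  [0, 0]])

# The first model in the chain sees only the input features.
X_first = X_toy

# The second model sees the input features plus the first label (the true
# label here; with cv set, out-of-sample predictions would be used instead).
X_second = np.c_[X_toy, Y_toy[:, 0]]
print(X_first.shape, X_second.shape)
```

At prediction time the true labels are unknown, so the extra column is filled with the first model's prediction instead.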

## Multioutput Classification

The last type of classification task we’ll discuss here is called
multioutput–multiclass classification (or just multioutput classification). It is a
generalization of multilabel classification where each label can be multiclass.

To illustrate this, let’s build a system that removes noise from images. It will
take as input a noisy digit image, and it will (hopefully) output a clean digit
image, represented as an array of pixel intensities, just like the MNIST
images. Notice that the classifier’s output is multilabel (one label per pixel)
and each label can have multiple values (pixel intensity ranges from 0 to
255). This is thus an example of a multioutput classification system.


In [None]:
np.random.seed(42) # to make this code example reproducible
# Add noise to images
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
# Now, the original images are the truth
y_train_mod = X_train
y_test_mod = X_test

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plot_digit(clean_digit)
plt.show()