<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/marco-canas/intro-Machine-Learning/blob/main/classes/class_25_desempa%C3%B1o_clasificador/class_25_medidas_desempeno_clasificador.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table> 

# Performance Measures for a Classifier and Multiclass Classification: Class 26

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

from sklearn.datasets import fetch_openml


In [2]:
%%time 
mnist = fetch_openml('mnist_784', version = 1, as_frame = False) 

Wall time: 1min 1s


In [3]:
X,y = mnist['data'], mnist['target'] 

# Conversion from `str` to `int64`

In [4]:
y = y.astype(np.int64)

In [None]:
X_train, X_test, y_train, y_test = X[:60_000], X[60_000:], y[:60_000], y[60_000:] 

In [None]:
y_train_5 = (y_train==5)
y_test_5 = (y_test==5)

In [None]:
y[0]

In [None]:
from sklearn.linear_model import SGDClassifier 

In [None]:
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)

In [None]:
%%time
sgd_clf.fit(X_train, y_train_5)

In [None]:
sgd_clf.predict(X_train[:5])

# Cross-validation with accuracy as the performance measure

In [None]:
%%time 

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)  # fresh, unfitted copy for each fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = np.sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # accuracy on the held-out fold

In [None]:
from sklearn.model_selection import cross_val_score 

In [None]:
%%time 

puntajes = cross_val_score(sgd_clf, X_train, y_train_5, cv = 10, scoring = 'accuracy') 

In [None]:
puntajes 

Note:

`shuffle=True` was omitted by mistake in earlier versions of the book.

In [None]:
from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
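    def fit(self, X, y=None):
        return self  # nothing to learn; returning self follows the estimator convention
    def predict(self, X):
        # Always predict "not 5": one boolean per instance, as a 1-D array
        return np.zeros(len(X), dtype=bool)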

In [None]:
%%time 

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

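In [None]:
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict

In [None]:
# Out-of-fold predictions for the training set. The original cell used an
# undefined name (`y_predict2`); it is assumed here to be these predictions.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred)

In [None]:
recall_score(y_train_5, y_train_pred)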

Now your 5-detector does not look as shiny as it did when you looked at its accuracy. 

When it claims an image represents a 5, it is correct only 72.9% of the time. Moreover, it only detects 75.6% of the 5s.

It is often convenient to combine precision and recall into a single metric called the $F_{1}$ score, in particular if you need a simple way to compare two classifiers. 

The $F_{1}$ score is the harmonic mean of precision and recall (Equation 3-3). 

Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. 

As a result, the classifier will only get a high $F_{1}$ score if both recall and precision are high.

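$$ F_{1} = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{FN + FP}{2}} $$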
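
As a quick check of the formula, the following minimal sketch computes the harmonic mean by hand and compares it with the arithmetic mean. The helper name `f1_from` and the second, lopsided precision/recall pair are illustrative assumptions; the first pair uses the 72.9% and 75.6% figures quoted above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper, not part of Scikit-Learn)."""
    return 2 * precision * recall / (precision + recall)

# Figures quoted in the text for the 5-detector
balanced_f1 = f1_from(0.729, 0.756)
balanced_mean = (0.729 + 0.756) / 2

# Made-up lopsided classifier: perfect precision, very low recall
lopsided_f1 = f1_from(1.0, 0.1)
lopsided_mean = (1.0 + 0.1) / 2

print(f"balanced: F1={balanced_f1:.3f}, arithmetic mean={balanced_mean:.3f}")
print(f"lopsided: F1={lopsided_f1:.3f}, arithmetic mean={lopsided_mean:.3f}")
```

For the lopsided pair the arithmetic mean stays at 0.55 while the harmonic mean drops to about 0.18, which is the sense in which the $F_{1}$ score is high only when both values are high.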

To compute the $F_{1}$ score, simply call the `f1_score()` function:

In [None]:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)

The $F_{1}$ score favors classifiers that have similar precision and recall. 

This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. 

For example, if you trained a classifier to detect videos that are safe for kids, you would
probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier’s video selection). 

On the other hand, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get
caught).

Unfortunately, you can’t have it both ways: increasing precision reduces
recall, and vice versa. This is called the precision/recall trade-off.

## Precision/Recall Trade-off

To understand this trade-off, let’s look at how the SGDClassifier makes its classification decisions. 

For each instance, it computes a score based on a decision function. 

If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class.
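
The thresholding step can be sketched with a plain linear score. The weights, bias, and sample points below are made-up values for illustration only; they are not the parameters learned by `sgd_clf`, and the sketch assumes a linear decision function of the form $w \cdot x + b$.

```python
import numpy as np

# Hypothetical weights and bias of a linear model (not taken from sgd_clf)
w = np.array([2.0, -1.0])
b = -0.5

# Three made-up instances with two features each
X_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])

scores = X_demo @ w + b           # one decision score per instance
threshold = 0.0
predictions = scores > threshold  # True -> positive class, False -> negative class

print(scores)
print(predictions)
```

The third instance scores exactly 0, so with a strict `>` comparison it falls on the negative side.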

Figure 3-3 shows a few digits positioned from the lowest score on the left to the highest score on the right. 

Suppose the decision threshold is positioned at the central arrow (between the two 5s): you will find 4 true positives (actual 5s) on the right of that threshold, and 1 false positive (actually a 6). 

Therefore, with that threshold, the precision is 80% (4 out of 5). But out of 6 actual 5s, the classifier only detects 4, so the recall is 67% (4 out of 6). 

If you raise the threshold (move it to the arrow on the right), the false positive (the 6) becomes a true negative, thereby increasing the precision (up to 100% in this case), but one true positive becomes a false negative, decreasing recall down to 50%. 

Conversely, lowering the threshold increases recall and reduces precision.

<img src = 'https://github.com/marco-canas/intro-Machine-Learning/blob/main/classes/class_26_multiclase/figura_3_3_various_thresholds.jpg?raw=true'>
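
The arithmetic in this example can be reproduced with a few lines of NumPy. The label layout and the integer scores below are assumptions chosen only to match the counts quoted in the text (six actual 5s, four of them to the right of the central threshold alongside one non-5); they are not read from the figure's data.

```python
import numpy as np

# Hypothetical layout, lowest score on the left: True marks an actual 5
is_five = np.array([False, False, False, False, True, False,
                    True, True, False, True, True, True])
scores = np.arange(len(is_five))  # made-up scores, one per position

def precision_recall_at(threshold):
    """Return (precision, recall) when everything scoring above threshold is predicted positive."""
    predicted = scores > threshold
    tp = np.sum(predicted & is_five)
    fp = np.sum(predicted & ~is_five)
    fn = np.sum(~predicted & is_five)
    return tp / (tp + fp), tp / (tp + fn)

central = precision_recall_at(6.5)  # threshold between the two 5s
raised = precision_recall_at(8.5)   # threshold moved past the false positive

print("central:", central)
print("raised: ", raised)
```

Under these assumptions the central threshold gives 80% precision and 67% recall, and the raised threshold gives 100% precision and 50% recall, matching the figures in the text.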

Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions. 

Instead of calling the classifier’s `predict()` method, you can call its `decision_function()` method, which returns a score for each instance, and then use any threshold you want to make predictions based on those scores:

In [None]:
y_scores = sgd_clf.decision_function([X[0]])
y_scores


In [None]:
threshold = 0
y_some_digit_pred = (y_scores > threshold)


The `SGDClassifier` uses a threshold equal to 0, so the previous code returns the same result as the `predict()` method (i.e., `True`). 

Let’s raise the threshold:

In [None]:
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

This confirms that raising the threshold decreases recall. 

The image actually represents a 5, and the classifier detects it when the threshold is 0, but it misses it when the threshold is increased to 8,000.

How do you decide which threshold to use? 

First, use the `cross_val_predict()` function to get the scores of all instances in the training set, but this time specify that you want to return decision scores instead of predictions:

In [None]:
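from sklearn.model_selection import cross_val_predict

# Out-of-fold decision scores for every training instance
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")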

With these scores, use the `precision_recall_curve()` function to compute precision and recall for all possible thresholds:

In [None]:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Finally, use Matplotlib to plot precision and recall as functions of the threshold value (Figure 3-4):

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds, hence [:-1]
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center right")
    plt.grid(True)

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

<img src = 'https://github.com/marco-canas/intro-Machine-Learning/blob/main/classes/class_26_multiclase/figura_3_4_thresholds.jpg?raw=true'>

## NOTE

You may wonder why the precision curve is bumpier than the recall curve in Figure 3-4.

The reason is that precision may sometimes go down when you raise the threshold (although in general it will go up). 

To understand why, look back at Figure 3-3 and notice what happens when you start from the central threshold and move it just one digit to the right: precision goes from 4/5 (80%) down to 3/4 (75%). 

On the other hand, recall can only go down when the threshold is increased, which explains why its curve looks smooth.
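
This behavior can be checked with a small threshold sweep. The label layout below is a hypothetical one, chosen only to be consistent with the counts quoted for Figure 3-3; the sketch raises the threshold one position at a time and records precision and recall at each step.

```python
import numpy as np

# Hypothetical labels ordered from lowest to highest score (True = actual 5)
is_five = np.array([False, False, False, False, True, False,
                    True, True, False, True, True, True])

precisions_demo, recalls_demo = [], []
# Threshold at position k: everything from index k onward is predicted positive
for k in range(len(is_five)):
    predicted_positive = is_five[k:]
    tp = predicted_positive.sum()
    precisions_demo.append(tp / len(predicted_positive))
    recalls_demo.append(tp / is_five.sum())

precision_steps = np.diff(precisions_demo)
recall_steps = np.diff(recalls_demo)

print("precision ever decreases:", bool((precision_steps < 0).any()))
print("recall ever increases:   ", bool((recall_steps > 0).any()))
```

In this layout the step from position 7 to position 8 is the 4/5 to 3/4 drop described above, while recall never rises as the threshold moves right.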

Another way to select a good precision/recall trade-off is to plot precision directly against recall, as shown in Figure 3-5 (the same threshold as earlier is highlighted).

<img src = 'https://github.com/marco-canas/intro-Machine-Learning/blob/main/classes/class_26_multiclase/figura_3_5_recall_precision.jpg?raw=true'>

You can see that precision really starts to fall sharply around 80% recall.
You will probably want to select a precision/recall trade-off just before that
drop—for example, at around 60% recall. But of course, the choice depends
on your project.

Suppose you decide to aim for 90% precision. You look up the first plot and
find that you need to use a threshold of about 8,000. To be more precise you
can search for the lowest threshold that gives you at least 90% precision
(`np.argmax()` will give you the first index of the maximum value, which in
this case means the first True value):

In [None]:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)] # ~7816
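
The `np.argmax()` trick can be seen in isolation on a small array. The precision and threshold values below are made up for illustration, and for simplicity the two arrays have the same length, unlike the arrays returned by `precision_recall_curve()`.

```python
import numpy as np

# Made-up precision values with one matching threshold each
demo_precisions = np.array([0.60, 0.85, 0.92, 0.88, 0.95])
demo_thresholds = np.array([-500, 1000, 3000, 5000, 9000])

mask = demo_precisions >= 0.90      # [False, False, True, False, True]
first_hit = np.argmax(mask)         # index of the first True value
chosen_threshold = demo_thresholds[first_hit]

# Caveat: if no value reaches the target, argmax still returns 0
no_hit = np.argmax(demo_precisions >= 0.99)

print(first_hit, chosen_threshold, no_hit)
```

Because an all-`False` mask also yields index 0, it may be worth checking `mask.any()` before trusting the selected threshold.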

To make predictions (on the training set for now), instead of calling the
classifier’s `predict()` method, you can run this code:

In [None]:
y_train_pred_90 = (y_scores >= threshold_90_precision)

Let’s check these predictions’ precision and recall:

In [None]:
precision_score(y_train_5, y_train_pred_90)

In [None]:
recall_score(y_train_5, y_train_pred_90)

Great, you have a 90% precision classifier! As you can see, it is fairly easy
to create a classifier with virtually any precision you want: just set a high
enough threshold, and you’re done. But wait, not so fast. A high-precision
classifier is not very useful if its recall is too low!

## TIP

If someone says, “Let’s reach 99% precision,” you should ask, “At what recall?”

## The ROC Curve

The receiver operating characteristic (ROC) curve is another common tool
used with binary classifiers. It is very similar to the precision/recall curve,
but instead of plotting precision versus recall, the ROC curve plots the true
positive rate (another name for recall) against the false positive rate (FPR).
The FPR is the ratio of negative instances that are incorrectly classified as
positive. It is equal to 1 – the true negative rate (TNR), which is the ratio of
negative instances that are correctly classified as negative. The TNR is also
called specificity. Hence, the ROC curve plots sensitivity (recall) versus 1 –
specificity.

To plot the ROC curve, you first use the `roc_curve()` function to compute
the TPR and FPR for various threshold values:

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
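
The rates defined above can also be computed by hand at a single threshold. The confusion counts in this sketch are made-up numbers, used only to show how TPR, FPR, and TNR relate to one another.

```python
# Made-up confusion counts for a binary classifier at one threshold
tp, fn = 40, 10    # actual positives: detected / missed
fp, tn = 30, 920   # actual negatives: wrongly flagged / correctly rejected

tpr = tp / (tp + fn)   # true positive rate = recall = sensitivity
fpr = fp / (fp + tn)   # false positive rate
tnr = tn / (fp + tn)   # true negative rate = specificity

print(f"TPR={tpr:.3f}, FPR={fpr:.3f}, TNR={tnr:.3f}, 1-TNR={1 - tnr:.3f}")
```

FPR and 1 - TNR agree because both divide by the number of actual negatives, `fp + tn`.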

Then you can plot the TPR against the FPR using Matplotlib. This code
produces the plot in Figure 3-6:

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # Dashed diagonal: purely random classifier
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.grid(True)

plot_roc_curve(fpr, tpr)
plt.show()

Once again there is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces. 

The dashed diagonal represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

<img src = 'https://github.com/marco-canas/intro-Machine-Learning/blob/main/classes/class_26_multiclase/figura_3_6_false_positive_rate_vs_true_positive_rate.jpg?raw=true'>
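
To see why a purely random classifier lies on the diagonal, the sketch below scores synthetic instances at random and sweeps a threshold by hand. The labels, scores, sample size, and seed are all assumptions made for illustration; none of them come from MNIST or from `sgd_clf`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
y_true = rng.random(n) < 0.3   # made-up labels, roughly 30% positives
scores = rng.random(n)         # random scores, unrelated to the labels

thresholds_demo = np.linspace(0.0, 1.0, 21)
# Fraction of positives (TPR) and of negatives (FPR) scoring at or above each threshold
tpr_demo = np.array([(scores[y_true] >= t).mean() for t in thresholds_demo])
fpr_demo = np.array([(scores[~y_true] >= t).mean() for t in thresholds_demo])

max_gap = np.abs(tpr_demo - fpr_demo).max()
print(f"largest |TPR - FPR| over the sweep: {max_gap:.3f}")
```

Since the scores carry no information about the labels, positives and negatives pass any threshold at nearly the same rate, so each (FPR, TPR) point sits close to the line TPR = FPR.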