## Chapter 3. Classification

### MNIST

Shuffle the training set before training, for two reasons: (1) it guarantees that all cross-validation folds will be similar; (2) some learning algorithms are sensitive to the order of the training instances and perform poorly if they get many similar instances in a row.
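As a minimal sketch of this shuffling step, the snippet below permutes a toy dataset while keeping each instance paired with its label. The helper name `shuffle_together`, the seed, and the toy data are assumptions made for illustration, not part of the original notes.

```python
import random

def shuffle_together(X, y, seed=42):
    """Return shuffled copies of X and y, keeping each instance paired with its label."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the shuffle is reproducible
    return [X[i] for i in indices], [y[i] for i in indices]

# Hypothetical toy data, sorted by label: the worst case for order-sensitive learners.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
X_shuffled, y_shuffled = shuffle_together(X, y)
```

Shuffling one list of indices and applying it to both `X` and `y` is what keeps the features and labels aligned.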

### Training a Binary Classifier

_Stochastic Gradient Descent_ (SGD) classifier: the true gradient of the cost function is approximated by the gradient computed at a single training example.

Advantage: capable of handling very large datasets efficiently. 

This is in part because SGD deals with training instances independently, one at a time, which also makes it well suited to _online learning_.
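To make the "one instance at a time" idea concrete, here is a small sketch of SGD applied to a logistic-regression classifier in plain Python. The choice of log loss, the learning rate, the number of epochs, and the toy data are all assumptions made for illustration; a library implementation such as scikit-learn's `SGDClassifier` differs in many details.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, lr=0.5, epochs=200, seed=0):
    """Fit weights w and bias b, updating after every single training instance."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the instances in a new random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            error = p - y[i]  # gradient of the log loss w.r.t. the score, for this one instance
            w = [wj - lr * error * xj for wj, xj in zip(w, X[i])]
            b -= lr * error
    return w, b

def predict(w, b, x):
    """Assign the positive class when the decision score is above zero."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical, linearly separable toy data with a single feature.
X = [[0.0], [0.2], [0.4], [1.6], [1.8], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(X, y)
```

Each update uses only the current instance, so the same loop could consume instances from a stream, which is what makes the approach compatible with online learning.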

### Performance Measures

#### Measuring Accuracy Using Cross-Validation

K-fold cross-validation means splitting the training set into K folds, then making predictions and evaluating them on each fold using a model trained on the remaining folds.
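The splitting step can be sketched as follows. The generator `k_fold_splits` is a hypothetical helper written for this note, and it assumes the data has already been shuffled; in practice scikit-learn's `cross_val_score` and `cross_val_predict` handle the splitting, training, and evaluation together.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_instances))
    # Distribute instances so that fold sizes differ by at most one.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 3))
```

Every instance lands in exactly one test fold, so each prediction is made by a model that never saw that instance during training.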

#### Confusion Matrix

Each row in a confusion matrix represents an _actual class_, while each column represents a _predicted class_.

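| | Predicted negative | Predicted positive |
|---|---|---|
| **Actual negative** | true negatives (TN) | false positives (FP) |
| **Actual positive** | false negatives (FN) | true positives (TP) |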

Precision, positive predictive value (PPV):
$$precision = \frac{TP}{TP+FP}$$

Recall, true positive rate (TPR):
$$recall = \frac{TP}{TP+FN}$$
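As a worked example of the two formulas, the sketch below counts the four confusion-matrix cells for a small set of labels and derives precision and recall from them. The function name and the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # 3 / (3 + 2)
recall = tp / (tp + fn)     # 3 / (3 + 1)
```

Here the classifier finds three of the four positives (recall 0.75), but two of its five positive predictions are wrong (precision 0.6).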

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig3-2.png" width=400px alt="fig3-2" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 3-2. An illustrated confusion matrix_</div>

The _F1 score_ is the harmonic mean of precision and recall:

$$F_1 = 2 \frac{precision \cdot recall}{precision+recall}$$

F-measure:
$$F_\beta = (1+\beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision)+recall}$$

The $F_1$ score favors classifiers that have similar precision and recall. Two other commonly used F measures are the $F_2$ measure, which weighs recall higher than precision (by placing more emphasis on false negatives), and the $F_{0.5}$ measure, which weighs recall lower than precision (by attenuating the influence of false negatives).
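A short sketch of the $F_\beta$ formula shows how $\beta$ shifts the balance. The precision and recall values are arbitrary example numbers, and returning 0 when both are zero is a convention assumed here to avoid dividing by zero.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention to avoid a zero denominator
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier whose recall is higher than its precision.
precision, recall = 0.6, 0.75
f1 = f_beta(precision, recall)               # about 0.667
f2 = f_beta(precision, recall, beta=2)       # about 0.714
f_half = f_beta(precision, recall, beta=0.5) # 0.625
```

Because recall is the stronger of the two numbers in this example, $F_2$ rewards the classifier and $F_{0.5}$ penalizes it relative to $F_1$.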

#### Precision/Recall Tradeoff

For each instance, the classifier computes a score based on a _decision function_. If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class.

Lowering the threshold increases recall (or at least never decreases it) and generally reduces precision.
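The effect of the threshold can be checked on a handful of decision scores. The scores, the labels, and the convention of reporting precision as 1.0 when nothing is predicted positive are assumptions chosen for this sketch.

```python
def precision_recall_at(scores, y_true, threshold):
    """Classify as positive when score > threshold, then return (precision, recall)."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention: no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical decision scores and their true labels.
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0]
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
high = precision_recall_at(scores, y_true, threshold=1.0)   # precision 1.0, recall 0.6
low = precision_recall_at(scores, y_true, threshold=-1.0)   # precision 5/6, recall 1.0
```

Moving the threshold from 1.0 down to -1.0 recovers the two missed positives but also lets one negative through.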

Method 1 for choosing the threshold: plot precision and recall as functions of the threshold value.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig3-4.png" width=400px alt="fig3-4" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 3-4. Precision and recall versus the decision threshold_</div>

Method 2: plot precision directly against recall.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig3-5.png" width=400px alt="fig3-5" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 3-5. Precision versus recall_</div>

#### The ROC Curve

The _receiver operating characteristic_ (ROC) curve is another common tool used with binary classifiers. It plots the _true positive rate_ (TPR, also called sensitivity or recall) against the _false positive rate_ (FPR); equivalently, it plots _sensitivity_ versus _1 – specificity_.

The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to $1 - TNR$, where the _true negative rate_ (TNR) is the ratio of negative instances that are correctly classified as negative. The TNR is also called _specificity_.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig3-6.png" width=400px alt="fig3-6" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 3-6. ROC curve_</div>

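One way to compare classifiers is to measure the _area under the curve_ (AUC). A perfect classifier has a ROC AUC equal to 1, whereas a purely random classifier has a ROC AUC equal to 0.5.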
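The ROC curve and its AUC can be sketched by sweeping the threshold over every distinct score and integrating with the trapezoidal rule. The helper names and the toy scores are assumptions for illustration; scikit-learn offers `roc_curve` and `roc_auc_score` for real use.

```python
def roc_points(scores, y_true):
    """Return (FPR, TPR) points obtained by sweeping the threshold over each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, a in zip(scores, y_true) if s >= threshold and a == 1)
        fp = sum(1 for s, a in zip(scores, y_true) if s >= threshold and a == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical decision scores and their true labels.
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0]
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
points = roc_points(scores, y_true)
roc_auc = auc(points)  # 14/15, about 0.933
```

The value 14/15 matches the ranking interpretation of AUC: of the 15 positive/negative pairs in this toy data, 14 have the positive instance scored higher.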

<font color=blue>_TIP_</font>
Prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise.


