## Classification
This module shows how logistic regression can be used for classification tasks, and explores how to evaluate the effectiveness of classification models.

### Classification: Thresholding

Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).

A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam. However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a **classification threshold** (also called the **decision threshold**). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.
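
As a minimal sketch of thresholding, the snippet below maps the three prediction scores mentioned above to labels under two different thresholds. The helper name `classify` and the threshold value 0.7 are assumptions made for illustration, as is the choice to treat a score exactly equal to the threshold as "not spam" (the text only covers values above and below).

```python
def classify(probability: float, threshold: float = 0.5) -> str:
    """Map a predicted spam probability to a binary label.

    Assumption: a score exactly equal to the threshold counts as "not spam".
    """
    return "spam" if probability > threshold else "not spam"


# The three scores discussed in the text.
scores = [0.9995, 0.0003, 0.6]

for threshold in (0.5, 0.7):  # 0.7 is an arbitrary second threshold
    labels = [classify(p, threshold) for p in scores]
    print(threshold, labels)
# 0.5 ['spam', 'not spam', 'spam']
# 0.7 ['spam', 'not spam', 'not spam']
```

Only the borderline score of 0.6 changes label when the threshold moves, which is why the threshold matters most for uncertain predictions.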

The following sections take a closer look at metrics you can use to evaluate a classification model's predictions, as well as the impact of changing the classification threshold on these predictions.

##### Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.

### Classification: True vs. False and Positive vs. Negative

In this section, we'll define the primary building blocks of the metrics we'll use to evaluate classification models. But first, a fable:

![](img/13-1.png)

Let's make the following definitions:

- "Wolf" is a **positive class**.
- "No wolf" is a **negative class**.

We can summarize our "wolf-prediction" model using a 2x2 [confusion matrix](https://developers.google.com/machine-learning/glossary#confusion_matrix) that depicts all four possible outcomes:

![](img/13-2.png)

A **true positive** is an outcome where the model correctly predicts the positive class. Similarly, a **true negative** is an outcome where the model correctly predicts the negative class.

A **false positive** is an outcome where the model incorrectly predicts the positive class. And a **false negative** is an outcome where the model incorrectly predicts the negative class.
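
The four outcomes can be sketched in code as follows. The helper `outcome` and the list of observations are invented for illustration; `True` stands for "wolf" (the positive class) and `False` for "no wolf".

```python
from collections import Counter


def outcome(actual: bool, predicted: bool) -> str:
    """Name the confusion-matrix cell for one (actual, predicted) pair."""
    if predicted:
        return "TP" if actual else "FP"
    return "FN" if actual else "TN"


# Hypothetical (actual, predicted) pairs, not taken from the fable.
observations = [
    (True, True),    # wolf came, model said wolf
    (False, True),   # no wolf, model said wolf
    (True, False),   # wolf came, model said no wolf
    (False, False),  # no wolf, model said no wolf
    (False, False),
]

counts = Counter(outcome(a, p) for a, p in observations)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])  # 1 1 1 2
```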

In the following sections, we'll look at how to evaluate classification models using metrics derived from these four outcomes.

### Classification: Accuracy

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

$$ Accuracy = \frac{Number \ of \ correct \ predictions}{Total \ number \ of \ predictions} $$

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

Where *TP = True Positives, TN = True Negatives, FP = False Positives*, and *FN = False Negatives*.

Let's try calculating accuracy for the following model that classified 100 tumors as [malignant](https://en.wikipedia.org/wiki/Malignancy) (the positive class) or [benign](https://en.wikipedia.org/wiki/Benign_tumor) (the negative class):

![](img/13-3.png)

Accuracy comes out to 0.91, or 91% (91 correct predictions out of 100 total examples). That means our tumor classifier is doing a great job of identifying malignancies, right?

Actually, let's do a closer analysis of positives and negatives to gain more insight into our model's performance.

Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1 TP and 8 FNs).

Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good. However, of the 9 malignant tumors, the model only correctly identifies 1 as malignant—a terrible outcome, as 8 out of 9 malignancies go undiagnosed!

While 91% accuracy may seem good at first glance, another tumor-classifier model that always predicts benign would achieve the exact same accuracy (91/100 correct predictions) on our examples. In other words, our model is no better than one that has zero predictive ability to distinguish malignant tumors from benign tumors.
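
The comparison can be checked with a few lines of code. This is a sketch using the counts quoted above; the function name `accuracy` is a hypothetical helper, and the "always benign" counts follow from the 91 benign and 9 malignant examples (every benign tumor becomes a TN, every malignant tumor an FN).

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the tumor example: 1 TP, 90 TNs, 1 FP, 8 FNs.
model = accuracy(tp=1, tn=90, fp=1, fn=8)

# A model that always predicts benign: no positive predictions at all.
always_benign = accuracy(tp=0, tn=91, fp=0, fn=9)

print(model, always_benign)  # 0.91 0.91
```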

Accuracy alone doesn't tell the full story when you're working with a **class-imbalanced data set**, like this one, where there is a significant disparity between the number of positive and negative labels.

### Classification: Precision and Recall

#### Precision
**Precision** attempts to answer the following question:

<code>What proportion of positive identifications was actually correct?</code>

Precision is defined as follows:

$$ Precision = \frac{TP}{TP + FP} $$

###### Note: A model that produces no false positives has a precision of 1.0, provided it makes at least one positive prediction.

Let's calculate precision for our ML model from the previous section that analyzes tumors:

![](img/13-4.png)

Our model has a precision of 0.5—in other words, when it predicts a tumor is malignant, it is correct 50% of the time.
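
A minimal sketch of the same calculation, using the tumor counts from the previous section (1 TP, 1 FP). The helper name `precision` is hypothetical, and returning `None` when the model makes no positive predictions is one possible convention, chosen here because the formula would otherwise divide by zero.

```python
from typing import Optional


def precision(tp: int, fp: int) -> Optional[float]:
    """Precision = TP / (TP + FP); None if there are no positive predictions."""
    if tp + fp == 0:
        return None  # assumption: treat the 0/0 case as undefined
    return tp / (tp + fp)


print(precision(tp=1, fp=1))  # 0.5
print(precision(tp=0, fp=0))  # None
```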


#### Recall

**Recall** attempts to answer the following question:

<code>What proportion of actual positives was identified correctly?</code>

Mathematically, recall is defined as follows:

$$Recall = \frac{TP}{TP + FN}$$

Let's calculate recall for our tumor classifier:

![](img/13-5.png)

Our model has a recall of 0.11. In other words, it correctly identifies 11% of all malignant tumors.
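
The sketch below repeats this calculation with the tumor counts (1 TP, 8 FNs) and, for comparison, applies it to a model that always predicts benign (0 TPs, 9 FNs). The helper name `recall` and the `None` return for the no-actual-positives case are assumptions for illustration.

```python
from typing import Optional


def recall(tp: int, fn: int) -> Optional[float]:
    """Recall = TP / (TP + FN); None if there are no actual positives."""
    if tp + fn == 0:
        return None  # assumption: treat the 0/0 case as undefined
    return tp / (tp + fn)


print(round(recall(tp=1, fn=8), 2))  # 0.11
print(recall(tp=0, fn=9))            # 0.0
```

Unlike accuracy, recall separates the two models: the always-benign model scores 0.0 because it never finds a malignant tumor.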

#### Precision and Recall: A Tug of War
To fully evaluate the effectiveness of a model, you must examine **both** precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa. Explore this notion by looking at the following figure, which shows 30 predictions made by an email classification model. Those to the right of the classification threshold are classified as "spam", while those to the left are classified as "not spam."

![](img/13-6.png)

Let's calculate precision and recall based on the results shown in Figure 1:

![](img/13-7.png)

Precision measures the percentage of **emails flagged as spam** that were correctly classified, that is, the percentage of dots to the right of the threshold line that are green in Figure 1:

$$ Precision = \frac{TP}{TP + FP} = \frac{8}{8 + 2} = 0.8 $$

Recall measures the percentage of **actual spam emails** that were correctly classified, that is, the percentage of green dots that are to the right of the threshold line in Figure 1:

$$ Recall = \frac{TP}{TP + FN} = \frac{8}{8 + 3} = 0.73 $$

Figure 2 illustrates the effect of increasing the classification threshold.

![](img/13-8.png)

The number of false positives decreases, but false negatives increase. As a result, precision increases, while recall decreases:

![](img/13-9.png)

Conversely, Figure 3 illustrates the effect of decreasing the classification threshold (from its original position in Figure 1).

![](img/13-10.png)

False positives increase, and false negatives decrease. As a result, this time, precision decreases and recall increases:

![](img/13-11.png)
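
The same tug of war can be reproduced on a small data set. The scores and labels below are made up for illustration and are not the 30 predictions shown in the figures; `precision_recall` is a hypothetical helper, and a score must be strictly above the threshold to count as "spam".

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when scores above threshold mean "spam"."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Invented example: model scores and true labels (1 = spam, 0 = not spam).
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

for threshold in (0.35, 0.5, 0.65):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.35: precision=0.71, recall=1.00
# threshold=0.5: precision=0.80, recall=0.80
# threshold=0.65: precision=1.00, recall=0.60
```

On this toy data, raising the threshold trades recall for precision and lowering it does the opposite, matching the pattern in the figures.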

Various metrics have been developed that rely on both precision and recall. For example, see [F1 score](https://en.wikipedia.org/wiki/F1_score).
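
As a brief sketch, F1 is the harmonic mean of precision and recall. The helper name `f1_from_counts` is hypothetical, and it assumes at least one true positive so that no denominator is zero. The first call uses the counts from Figure 1 (8 TPs, 2 FPs, 3 FNs); the second uses the tumor classifier (1 TP, 1 FP, 8 FNs).

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall).

    Assumption: tp > 0, so every denominator below is nonzero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(round(f1_from_counts(tp=8, fp=2, fn=3), 3))  # 0.762
print(round(f1_from_counts(tp=1, fp=1, fn=8), 3))  # 0.182
```

Because a harmonic mean is pulled toward the smaller value, the tumor classifier's low recall drags its F1 well below its precision of 0.5.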