How do we **measure the quality or value of the prediction** in the selected business or
science problem? What will be the quantitative score? **How does the quantitative score
reflect the quality or value of the prediction?** How does the (possibly asymmetric)
prediction error convert into cost or decreased KPI?

## Prediction value measurement and quantitative score(s)

### Usual classification metrics and their limitations

As a Machine Learning problem, the aim of this project is to provide a binary probabilistic classifier that takes a dermoscopic image of a mole as input and outputs the probability that this mole is *malignant*, i.e. potentially dangerous for the patient. We thus define an image as belonging to the **positive class** if the mole is malignant, and to the **negative class** if it is *benign*, i.e. not harmful.

We introduce the standard terminology used for such classification problems:

* **TP** = number of samples that were predicted as belonging to the positive class that are indeed positive (malignant moles here)
* **FP** = number of samples that were predicted positive but that are in fact negative
* **TN** = number of samples that were predicted as belonging to the negative class that are indeed negative (benign moles here)
* **FN** = number of samples that were predicted negative but that are in fact positive

For readability, we also introduce two further quantities:

* **P** = number of samples that *actually* belong to the positive class, according to the gold standard malignancy diagnosis (so P = TP + FN)
* **N** = number of samples that *actually* belong to the negative class (so N = TN + FP)

With all of this set up, we can define the metrics that are usually used for evaluating binary classifiers:

The most obvious and famous one is the **Accuracy**, which is simply the fraction of correctly classified samples in the data set, and reads:

$$
AC = \frac{TP + TN}{P + N}
$$

Another very common classification metric is the **Recall**, also known as the **Sensitivity** in Statistics, or as the **True-Positive rate** TPR. This ratio expresses what fraction of the members of the positive class were correctly identified by the classifier:

$$
SE = \frac{TP}{TP + FN}
$$

One final common performance indicator that we introduce is the **Specificity** (the complement of what we will later call the False-Positive rate FPR, i.e. SP = 1 - FPR), which describes how well the classifier identifies the members of the negative class:

$$
SP = \frac{TN}{TN + FP}
$$
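The three definitions above can be checked on a small example. The following is a minimal sketch in plain Python, assuming labels are encoded as 1 for malignant and 0 for benign; the function names and the toy labels are illustrative and not part of the project code.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, with 1 = malignant (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    # AC = (TP + TN) / (P + N), with P = TP + FN and N = TN + FP
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    # SE = TP / (TP + FN), also called recall or true-positive rate
    return tp / (tp + fn)


def specificity(tn, fp):
    # SP = TN / (TN + FP), equal to 1 - false-positive rate
    return tn / (tn + fp)


# Toy labels (illustrative only): 3 malignant and 7 benign samples
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("TP, FP, TN, FN =", tp, fp, tn, fn)
print("AC =", accuracy(tp, fp, tn, fn))
print("SE =", sensitivity(tp, fn))
print("SP =", specificity(tn, fp))
```

On these toy labels the classifier misses one malignant sample and raises one false alarm, so the three scores differ even though they are computed from the same four counts.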

Intuitively, these scores describe well what we want to accomplish here: identify the right class for a given input. However, in our particular setting, as in most disease detection problems in Machine Learning, limiting ourselves to a balanced combination of just these three scores is not sufficient and does not reflect the true quality of the classifier's predictions.

First, our **dataset is quite imbalanced**: malignant moles represent only around 30% of the dataset. Although this is not a *severe* imbalance, it is enough to favor benign predictions when optimizing raw accuracy alone. For instance, predicting all moles as benign would yield an accuracy of around 70%, against around 30% for predicting all moles as malignant.
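The effect of the imbalance can be reproduced with two trivial baselines. This is a minimal sketch on synthetic labels that only mirror the roughly 30% malignant proportion; it does not use the actual dataset, and the helper names are illustrative.

```python
# Synthetic labels mirroring the ~30% malignant proportion (1 = malignant, 0 = benign)
y_true = [1] * 30 + [0] * 70


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of truly malignant samples that were predicted malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Two trivial baselines that ignore the input image entirely
all_benign = [0] * len(y_true)
all_malignant = [1] * len(y_true)

print(accuracy(y_true, all_benign), recall(y_true, all_benign))        # 0.7 0.0
print(accuracy(y_true, all_malignant), recall(y_true, all_malignant))  # 0.3 1.0
```

The all-benign baseline reaches the higher accuracy while detecting no malignant mole at all, which is why accuracy alone is a misleading target here.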

This leads us to the second important point. Since the prediction can directly influence the patient's decision to consult a dermatologist or not, **the prediction error cost is highly asymmetric**: we want the overall score to reflect the fact that **missing a truly malignant mole has far more dramatic consequences than sending someone to a dermatologist for a benign one**. For this reason, we want to put more weight on the recall score, as defined above; we will also add another score that goes in this direction later in this section.
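To make the asymmetry concrete, here is a minimal sketch comparing two hypothetical classifiers with identical accuracy but different error types. The cost weights (a false negative counted as 10 times a false positive) and the confusion counts are purely illustrative assumptions, not values used or proposed in this project.

```python
# Hypothetical, illustrative cost weights: a missed malignant mole (FN)
# is assumed to be 10 times as costly as an unnecessary consultation (FP).
COST_FN = 10.0
COST_FP = 1.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    return tp / (tp + fn)


def total_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Weighted sum of errors; the weights are assumptions for illustration."""
    return cost_fp * fp + cost_fn * fn


# Two made-up classifiers on 100 samples (30 malignant, 70 benign)
classifier_a = {"tp": 20, "fp": 0, "tn": 70, "fn": 10}   # errors are all misses
classifier_b = {"tp": 30, "fp": 10, "tn": 60, "fn": 0}   # errors are all false alarms

for name, c in [("A", classifier_a), ("B", classifier_b)]:
    print(
        name,
        "accuracy =", accuracy(**c),
        "recall =", recall(c["tp"], c["fn"]),
        "cost =", total_cost(c["fp"], c["fn"]),
    )
```

Both classifiers make ten errors and share the same accuracy, yet under the assumed weights classifier A is ten times as costly as classifier B, a difference that recall captures and accuracy hides.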

### Another metric: Area under the ROC curve
