Following Geron's text, we define several metrics for analyzing classifiers and apply Bayes' rule to measure how effective a classifier is in practice.

Also see Dr. Goldman's lecture on [logistic regression](https://github.com/doriang102/APM4990-/blob/master/lectures/Lecture%204%20-%20Classification%20.pdf).

### Precision and Recall

Once a threshold, say $0.5$, is chosen, we can define the precision and recall as follows. Let $N_{\text{pred}}$ be the number of examples the classifier predicts as positive, $N_{\text{pos}}$ be the number of examples that are actually positive, and $TP$ be the number of positive predictions of the classifier that are actually positive (the true positives).

Define the recall, $R$, and the precision, $P$, by:

$R = \frac{TP}{N_{\text{pos}}} \qquad P = \frac{TP}{N_{\text{pred}}}$
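As a minimal sketch of these two definitions, the snippet below counts true positives, predicted positives, and actual positives directly from label lists. The function name `precision_recall` and the toy labels are invented for illustration, and returning `0.0` when a denominator is zero is an assumed convention rather than something the definitions prescribe.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels in {0, 1}."""
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)  # number of positive predictions
    actual_pos = sum(y_true)     # number of actual positives
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


# Toy labels, made up for this example (1 = disaster tweet).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two of the three positive predictions are correct, and two of the four actual positives are found, so precision and recall differ even on the same predictions.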

### F1-score

Once we fix a threshold, a good aggregate measure of a classifier is the $F_1$-score, defined as the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of the precision and recall:

\begin{equation}
    F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}
\end{equation}

For $P, R > 0$, it satisfies the following chain of inequalities:

$0 \leq \operatorname{min}(P, R) \leq F_1 \leq \sqrt{P R} \leq \frac{P + R}{2} \leq \operatorname{max}(P, R) \leq 1$

with equality holding in the four inner inequalities if and only if $P = R$.
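The chain of inequalities can be checked numerically. The sketch below computes the $F_1$-score as a harmonic mean and compares it with the geometric and arithmetic means for one pair of values; the values `0.9` and `0.5` are arbitrary, and returning `0.0` when either input is zero is an assumed convention, since the harmonic mean is undefined there.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    # Assumed convention: F1 is 0 when either input is 0.
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Arbitrary illustrative values, not taken from any model.
p, r = 0.9, 0.5
f1 = f1_score(p, r)

# min <= harmonic mean <= geometric mean <= arithmetic mean <= max
chain = [min(p, r), f1, math.sqrt(p * r), (p + r) / 2, max(p, r)]
print(chain)
print(chain == sorted(chain))
```

Because the harmonic mean is the smallest of the three means, the $F_1$-score sits closer to the weaker of the two metrics, which is why it penalizes a classifier that trades one heavily for the other.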

### ROC AUC

Despite being an aggregate measure of classifier quality, the $F_1$-score has a clear downside: it depends on the threshold we chose, here $0.5$. Besides inspecting the precision-recall curve, we can obtain another aggregate measure, independent of the threshold, from the ROC curve.

The ROC curve plots the true positive rate (TPR, also called recall) against the false positive rate (FPR), with one point for each threshold. An aggregate measure is then obtained by calculating the area under the ROC curve, called the *ROC AUC*. See either the BERT or the Naive Bayes notebook for calculations of this metric.
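To make the construction concrete, here is a plain-Python sketch that sweeps over every distinct score as a threshold, records the (FPR, TPR) pairs, and integrates with the trapezoidal rule. It is not necessarily how the notebooks compute the metric; the function names and the four toy scores are assumptions for illustration, and the code assumes at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from (0, 0) to (1, 1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        # Predict positive whenever the score is at least the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(y_true, scores)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy scores, made up for this example.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

A classifier that ranks every positive above every negative gets an area of 1, while scores that carry no information about the label give an area near 0.5.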

### Avoiding the base rate fallacy

Following Murphy's text on machine learning (section 2.2.3.1), we describe how to obtain accurate predictions from classifiers, with disaster tweets as our main example.

Let $X$ be a Bernoulli random variable representing the classifier's prediction and $Y$ be a Bernoulli random variable representing whether or not a tweet actually corresponds to a disaster.

Suppose the recall, $R = p(x = 1 | y = 1)$, is:

In [9]:
R = 0.76

Let $m = p(y = 1)$ be the proportion of tweets that correspond to disasters. A generous estimate is:

In [10]:
m = 0.004

Note that this proportion is much lower than the proportion of disaster tweets in the dataset used to build our models.

The false positive rate, $p(x=1 | y=0)$ (one minus the recall of the negative class), is:

In [13]:
fp = 1 - 0.88
fp

0.12

Combining these three terms using Bayes' rule, we can compute the probability that a tweet flagged by the classifier actually corresponds to a disaster:

\begin{equation}
    p(y = 1 | x = 1) = \frac{p(x = 1 | y =1)p(y = 1)}{p(x = 1 | y = 1) p( y = 1) + p(x = 1 | y = 0) p( y = 0)}
\end{equation}

Thus $p(y = 1 | x = 1)$ is:

In [12]:
(R * m)/(R *m + fp *(1-m))

0.02480417754569191

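In other words, only about 2.5% of the tweets flagged by the classifier are actual disaster tweets. This probability can be increased by: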


(1) applying the classifier on tweets that are more likely to be disaster tweets,

(2) decreasing the false positive rate of the classifier.
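The effect of these two levers can be sketched by wrapping the Bayes' rule computation in a function and varying one input at a time. The function name `posterior` is invented for this example, and the alternative values (a prior of `0.04` and a false positive rate of `0.012`) are hypothetical, chosen only to show the direction of the change.

```python
def posterior(recall, prior, fpr):
    """p(y = 1 | x = 1) from Bayes' rule."""
    numerator = recall * prior
    return numerator / (numerator + fpr * (1 - prior))


R, m, fp = 0.76, 0.004, 1 - 0.88  # values used above

base = posterior(R, m, fp)

# (1) Hypothetical: apply the classifier to a pre-filtered stream
#     in which disaster tweets are ten times more common.
higher_prior = posterior(R, 10 * m, fp)

# (2) Hypothetical: a classifier with a ten times lower false positive rate.
lower_fpr = posterior(R, m, fp / 10)

print(base, higher_prior, lower_fpr)
```

Under these assumed values both changes raise the posterior well above the base case, because when $m$ is small the denominator is dominated by the false positive term $p(x = 1 | y = 0) \, p(y = 0)$.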