# Naive Bayes

### Introduction

To estimate the probability that people like a movie, Netflix can simply count: out of $n$ viewers, $n(E)$ liked it.

$P(E) = \displaystyle{\lim_{n \to \infty}} \frac{n(E)}{n}$

In practice, though, Netflix puts this in context by conditioning on something else it knows, $F$:

$P(E|F) = \frac{P(EF)}{P(F)}$
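
To make the counting idea concrete, here is a minimal sketch using a small, made-up list of viewing records (the records and what the two fields mean are assumptions for illustration, not Netflix data). It estimates $P(E)$ by counting over everyone, and $P(E|F)$ by dividing the count of both events by the count of $F$.

```python
# Hypothetical records: (liked_movie, watched_trailer). Made up for illustration.
records = [
    (True, True), (True, True), (True, True), (False, True),
    (True, False), (False, False), (False, False), (False, False),
    (False, False), (False, False),
]

n = len(records)
n_E = sum(1 for liked, _ in records if liked)            # E: liked the movie
n_F = sum(1 for _, trailer in records if trailer)        # F: watched the trailer
n_EF = sum(1 for liked, trailer in records if liked and trailer)

p_E = n_E / n                # P(E) by counting
p_F = n_F / n                # P(F) by counting
p_EF = n_EF / n              # P(EF): both events together
p_E_given_F = p_EF / p_F     # P(E|F) = P(EF) / P(F)

print(f"P(E)   = {p_E:.2f}")
print(f"P(E|F) = {p_E_given_F:.2f}")
```

With so few records the estimates are rough; the limit in the first formula is what makes counting converge to a probability.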

### Flu Test Example

Let's say that we are testing people for the flu.  We are told the following:

1. The probability of someone from the population having the flu is $.005$.  
2. The test is $.98$ effective at detecting the flu, meaning that if someone has the flu, it detects the flu 98% of the time.  
3. The test has a false positive rate of .01, meaning that 1% of the time that someone does not have the flu, it detects the flu.

Now, what's the probability that given a positive test, someone has the flu?

> Given the effectiveness of the test above, it might be surprising to learn that it's only about $.33$.  

We'll get into how the above is calculated in a little bit, but the main point to take from it is that new evidence does not determine beliefs in a vacuum, but rather should update prior beliefs.

In this example, only $.005$ of the population has the flu, or 5 individuals out of 1000.

In [34]:
1000*.005

5.0

So while a positive test does increase the probability of someone actually having the flu, it still only takes that probability to about .33.
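
One way to check this claim before doing any algebra is to simulate it. The sketch below is an assumption-laden illustration, not part of the original lesson: it draws a hypothetical population with Python's `random` module, gives each person a test result using the three rates listed above, and then looks only at the people who tested positive.

```python
import random

rng = random.Random(0)       # fixed seed so the run is repeatable

P_FLU = 0.005                # prior: share of the population with the flu
P_POS_GIVEN_FLU = 0.98       # the test detects the flu 98% of the time
P_POS_GIVEN_NO_FLU = 0.01    # false positive rate

n_people = 500_000           # hypothetical population size
positives = 0
positives_with_flu = 0

for _ in range(n_people):
    has_flu = rng.random() < P_FLU
    p_positive = P_POS_GIVEN_FLU if has_flu else P_POS_GIVEN_NO_FLU
    tests_positive = rng.random() < p_positive
    if tests_positive:
        positives += 1
        positives_with_flu += has_flu   # True counts as 1

# Share of positive tests that belong to people who really have the flu
estimate = positives_with_flu / positives
print(f"{positives} positive tests, {positives_with_flu} of them with the flu")
print(f"estimated P(flu | positive) = {estimate:.3f}")
```

The estimate varies a little with the seed and population size, but it should land near one third rather than near .98.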

### Understanding the Prior

Now let's define $H$ as someone having the flu, and $E$ as the event of testing positive.  

> The $H$ stands for *hypothesis* and $E$ stands for evidence.

Then, given the data above:

* $P(H)  = .005$
> $P(H)$ is called the prior.

We can picture the prior by looking at a sample of 1000 people.  While 5 of them do have the flu ($H$), almost all of the sample does not ($H^c$).

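In [50]:
import pandas as pd
import numpy as np

# Show every column of the grid instead of truncating it
pd.options.display.max_columns = 35

# 40 x 25 grid = 1000 people, all without the flu to start
df = pd.DataFrame(np.full((40, 25), '$H^c$'))
# Mark the first 5 people in the top row as having the flu
df.iloc[0, :] = np.concatenate([np.full((1, 5), '$H$'), np.full((1, 20), '$H^c$')], axis = 1)
df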

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
0,$H$,$H$,$H$,$H$,$H$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
1,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
2,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
3,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
4,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
5,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
6,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
7,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
8,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
9,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$


The real consequence of this imbalance is that, even though the test mostly predicts no flu when there is no flu, people without the flu are so prevalent that a prediction of flu can still be wrong more often than we would expect.

Let's fill in some numbers.

* $P(H) = .005$
* $P(E|H) = .98$
* $P(E | H^c) = .01$

So the probability of both **no flu** and **predicting flu** (a false positive) is the following: 
    
> $P(EH^c) = P(H^c)*P(E|H^c) = .995*.01$

Multiplying by 1000 gives the expected number of false positives in our sample:

In [52]:
.995*.01*1000

9.950000000000001

And the expected number of people out of 1000 with both the flu and a positive test, $1000*P(H)*P(E|H)$, is the following: 

In [53]:
.005*.98*1000

4.8999999999999995
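
The two cells above cover the people who test positive. As a rough sketch, the same arithmetic can be extended to all four outcomes for a sample of 1000 (the variable names here are illustrative choices, not from the original notebook):

```python
population = 1000
p_flu = 0.005                 # P(H)
sensitivity = 0.98            # P(E|H)
false_positive_rate = 0.01    # P(E|H^c)

flu = population * p_flu
no_flu = population - flu

# Expected number of people in each of the four outcomes
true_positives = flu * sensitivity
false_negatives = flu * (1 - sensitivity)
false_positives = no_flu * false_positive_rate
true_negatives = no_flu * (1 - false_positive_rate)

for name, count in [
    ("true positives", true_positives),
    ("false negatives", false_negatives),
    ("false positives", false_positives),
    ("true negatives", true_negatives),
]:
    print(f"{name:>16}: {count:8.2f}")
```

The four counts add back up to the full population, and the false positives outnumber the true positives by roughly two to one.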

So we can see that because *most* people do not have the flu, there are still more individuals who don't have the flu than who do, even among those with a positive test.  Let's show our grid again, rounding to 5 true positives and 10 false positives, but this time we won't show all the true negative examples.

In [61]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.full((10, 15), '$H^c$'))
df.iloc[0, :] = np.concatenate([np.full((1, 5), '$EH$'), np.full((1, 10), '$EH^c$')], axis = 1)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,$EH$,$EH$,$EH$,$EH$,$EH$,$EH^c$,$EH^c$,$EH^c$,$EH^c$,$EH^c$,$EH^c$,$EH^c$,$EH^c$,$EH^c$,$EH^c$
1,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
2,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
3,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
4,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
5,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
6,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
7,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
8,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$
9,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$,$H^c$


Now if we want to answer the question of what $P(H|E)$ is, we should restrict our consideration to just the top row, the people who tested positive.  There we see that only $5/15$ have the flu, or:

In [62]:
5/15

0.3333333333333333

> Our number is slightly off from the exact answer because we rounded the expected counts ($4.9$ and $9.95$) to whole people.

### The Problem via Statistics

Let's begin to wrap up by seeing how we solved the problem, but without the visual aid.  We define $E$ to be testing positive and $H$ to be actually having the flu.  Then we were given the following:

* $P(H) = .005$
* $P(E|H^c) = .01$ 
* $P(E|H) = .98$ 

Now our task is to find the probability of actually having the flu given a positive test, and this $P(H|E)$ is just:
    
> $P(H|E) = \frac{P(EH)}{P(E)}$

Next we use the chain rule to replace $P(EH)$, so we get:

> $P(H|E) = \frac{P(H)*P(E|H)}{P(E)}$

Finally, a positive test is either a true positive or a false positive, so $P(E) = P(EH) + P(EH^c)$, and we can replace the denominator:

> $P(H|E) = \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)}$

> The last step applies the chain rule to each term in the denominator.

Now let's plug our numbers into our formula to find the probability of the flu given a positive test.

$P(H|E) = \frac{.005*.98}{.005*.98 + .995*.01}$

In [64]:
(.005*.98)/(.005*.98 + .995*.01)

0.32996632996632996

So we see that the probability of someone having the flu given a positive test is about .33.
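
The formula can be wrapped in a small helper to see how strongly the answer depends on the prior. This is a minimal sketch: the function name `posterior` and the priors other than $.005$ are assumptions chosen for illustration.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) for a yes/no hypothesis H, using Bayes' theorem."""
    numerator = prior * p_e_given_h                        # P(H) * P(E|H)
    evidence = numerator + (1 - prior) * p_e_given_not_h   # P(E)
    return numerator / evidence

# The flu example from this lesson
print(f"flu example: {posterior(0.005, 0.98, 0.01):.4f}")

# Same test, different (hypothetical) priors
for prior in (0.001, 0.005, 0.05, 0.5):
    print(f"prior={prior:<6} posterior={posterior(prior, 0.98, 0.01):.3f}")
```

The test itself never changes in this loop; only the prior does, and the posterior moves with it.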

### Summary

In this lesson we learned about Bayes' theorem.  We saw that the main takeaway from Bayes' theorem is that new evidence does not determine beliefs in a vacuum, but rather should *update* our prior beliefs.  Because certain events are quite rare, a small false positive rate can still lead to $P(H|E)$ being low. 

Statistically, the formula for Bayes' theorem derives from our formula for conditional probability:

$P(H|E) = \frac{P(H)*P(E|H)}{P(E)}$

Use Bayes' theorem when we have some *hypothesis*, observe some evidence, and then want the probability of the hypothesis given that the evidence is true.

* $P(H) = \text{prior}$
* $P(E|H) = \text{likelihood}$
> Probability of the evidence given a specific hypothesis (for example, the probability of being shy given that someone is a librarian)

* $P(E | H^c)$
> Probability of the evidence given that the hypothesis is not true 

* $P(H|E) = \text{posterior}$, our belief after seeing the evidence.

A test is .98 effective at detecting HIV and has a false positive rate of 1%.  Also, .5% of the population has HIV.  What's the probability that we have HIV if we test positive?

* Let E = test positive for HIV with the test
* Let F = you actually have HIV.

> False positive rate is defined $\frac{\text{false positive}}{\text{condition negative}}$.

* $P(F) = .005$
* $P(E|F^c) = .01$ 
* $P(F^c) = .995$
* $P(E|F) = .98$ (Recall)

$P(F|E) = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{.98*.005}{.98*.005 + .01*.995}$

In [3]:
.98*.005/(0.0049 + 0.00995)

0.32996632996632996

In [2]:
.01*.995

0.00995
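
Because the posterior is an updated belief, it can serve as the prior for the next piece of evidence. The sketch below applies the same formula twice for two positive tests in a row. It assumes the two test results are independent given the person's true status, which the original example does not state, and the helper name `update` is hypothetical.

```python
def update(prior, p_e_given_f, p_e_given_not_f):
    """One Bayes update: return P(F|E) from the prior P(F) and the two test rates."""
    numerator = prior * p_e_given_f
    return numerator / (numerator + (1 - prior) * p_e_given_not_f)

belief = 0.005   # P(F): prior before any test
for test_number in (1, 2):
    # Assumption: each positive result is independent given true status
    belief = update(belief, 0.98, 0.01)
    print(f"after positive test {test_number}: P(F|E) = {belief:.4f}")
```

The first update reproduces the result above, and the second starts from that much larger prior.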

### Background for Bayes

Let's bring our conversation about probability into the context of classification metrics.  Let's say that, in the diagram below, the circle F is all of our predicted positive observations, and the circle E represents all of our actual positive observations.  

* $F$: Predicted positive
* $E$: Actual positive 

Let's say that we want to calculate our precision.  Remember that $precision = \frac{TP}{TP + FP}.$

<img src="./conditional-probability.png" width="60%">

In other words, precision is the smaller wedge (the overlap of F and E) divided by the full region F.

$Precision = \frac{P(TP)}{P(\text{Predicted Positive})} = \frac{P(FE)}{P(F)}$

And of course, we can break $F$ down into our true positives and false positives:

$Precision = \frac{P(TP)}{P(TP) + P(FP)} = \frac{P(FE)}{P(FE) + P(FE^c)}$

And just to state what precision is in terms of probability, it is the probability of an actual positive given a predicted positive.

$Precision = P(E|F) = \frac{P(TP)}{P(TP) + P(FP)} = \frac{P(FE)}{P(FE) + P(FE^c)}$

From here, we can apply the chain rule.  That is, we can replace: 
* $P(FE) = P(F|E)*P(E)$ and 
* $P(FE^c) = P(F|E^c)*P(E^c)$.

So this gives us the following:

$P(E|F) = \frac{P(TP)}{P(TP) + P(FP)} = \frac{P(FE)}{P(FE) + P(FE^c)} = \frac{P(F|E)*P(E)}{P(F|E)*P(E) + P(F|E^c)*P(E^c)}$

This is called Bayes' theorem.  It is important because many times we will have $P(E|F)$ (precision) and will want recall, $P(F|E)$, or vice versa.  
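
As a quick check of the identity, the sketch below computes precision two ways from one confusion matrix: directly as $TP/(TP + FP)$, and through Bayes' theorem from recall, the false positive rate, and the share of actual positives. The four counts are made up for illustration.

```python
# Hypothetical confusion-matrix counts, made up for illustration
tp, fp, fn, tn = 40, 10, 20, 930
total = tp + fp + fn + tn

# Precision straight from the counts: P(E|F) = TP / (TP + FP)
precision_counts = tp / (tp + fp)

# The ingredients Bayes' theorem needs
p_E = (tp + fn) / total              # P(E): actual positive
p_F_given_E = tp / (tp + fn)         # recall, P(F|E)
p_F_given_not_E = fp / (fp + tn)     # false positive rate, P(F|E^c)

# P(E|F) = P(F|E)P(E) / (P(F|E)P(E) + P(F|E^c)P(E^c))
precision_bayes = (p_F_given_E * p_E) / (
    p_F_given_E * p_E + p_F_given_not_E * (1 - p_E)
)

print(f"precision from counts: {precision_counts:.3f}")
print(f"precision from Bayes:  {precision_bayes:.3f}")
```

Both routes give the same number, since the Bayes form only rewrites $P(FE)$ and $P(FE^c)$ with the chain rule.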

### Working through an example

Now let's see some numbers.  Let's again define the following:

* $F$: Predicted positive
* $E$: Actual positive 

$P(F | E) = P(\text{Predicted positive} | \text{Actual positive})$

### Intuition of why it works

<img src="./bayes-intuition.png">

[Bayes Video](https://youtu.be/wB0z0nQebNc?list=PLcmJYc2muOR9H96hGlUBV2DkviVZFmHAh&t=3768)

* General idea with bayes:
    * Conditional probability - have a belief, and update belief given more information

<img src="./conf-bayes.png" width="70%">

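Using the notation from the HIV example ($E$: test positive, $F$: actually has HIV), the probability of having HIV given a negative test is:

$P(F|E^c) = \frac{P(E^c | F) P(F)}{P(E^c|F)P(F) + P(E^c | F^c)P(F^c)} \approx .0001$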

In [4]:
(.02*.005)/((.02*.005) + (.99*.995))
# 0.00010150738466223418

0.00010150738466223418

### Spam problem

<img src="./email-spam.png" width="40%">

### Resources

[ML Mastery Naive Bayes Classifier](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)

[Naive Bayes - Python Data Science](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html)

[Sklearn Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)