# Bayesian Methods

- $P(A\vert B)={\frac{P(A)P(B\vert A)}{P(B)}}$
- Let’s use it for machine learning! I want a spam classifier.
- Example: how would we express the probability of an email being spam if it contains the word “Free”?
- $P(\text{Spam}\vert \text{Free})={\frac{P(\text{Spam})P(\text{Free}\vert \text{Spam})}{P(\text{Free})}}$
- The numerator is the joint probability of a message being spam and containing the word “Free”. This is subtly different from what we’re looking for, which is the probability of spam given that “Free” is present.
- The denominator is the overall probability of an email containing the word “Free”.
    - Equivalent to $P(\text{Free}\vert \text{Spam})P(\text{Spam}) + P(\text{Free}\vert \text{Not Spam})P(\text{Not Spam})$
- Together, this ratio is the fraction of emails containing the word “Free” that are spam.
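
As a rough illustration of the formula above, the sketch below computes $P(\text{Spam}\vert \text{Free})$ with the denominator expanded as in the sub-bullet. The function name `posterior` and all three input probabilities are hypothetical values chosen for the example, not figures from a real mailbox.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' theorem for a yes/no hypothesis (here: spam vs. not spam)."""
    # Denominator: total probability of seeing the word at all.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
p_spam = 0.2                  # P(Spam)
p_free_given_spam = 0.5       # P(Free | Spam)
p_free_given_not_spam = 0.05  # P(Free | Not Spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free_given_not_spam)
print(round(p_spam_given_free, 3))  # prints 0.714
```

Even with a low prior of 0.2, a word that is ten times more common in spam pushes the posterior above 0.7 under these assumed inputs.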

# What about all the other words?

- We can estimate P(Word | Spam) and P(Word | Not Spam) for every (meaningful) word we encounter during training
- Then, when analyzing a new email, multiply the values for its words together with the prior P(Spam) to get a score proportional to the probability of it being spam.
- This assumes that the presence of each word is independent of the others, given the class. That assumption is one reason the method is called “Naïve Bayes”.
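
Written out, the independence assumption gives the following form. The symbols $w_1, \dots, w_n$ for the words of an email are notation introduced here for illustration:

$$P(\text{Spam}\vert w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i\vert \text{Spam})$$

The same expression is evaluated for Not Spam, and the larger of the two scores decides the class. Dividing each score by their sum recovers the actual posterior probabilities.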

# Sounds like a lot of work

- Scikit-learn to the rescue!
- The CountVectorizer lets us operate on lots of words at once, and MultinomialNB does all the heavy lifting on Naïve Bayes.
- We’ll train it on known sets of spam and “ham” (non-spam) emails
    - So this is supervised learning!
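
A minimal sketch of that workflow might look like the following. The four training sentences and their labels are made up for illustration, and `classify` is a hypothetical helper. `CountVectorizer` and `MultinomialNB` are used with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, not a real corpus.
train_texts = [
    "free money now",
    "win free prize money",
    "lunch with a friend tomorrow",
    "dear friend see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn each email into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit the multinomial Naive Bayes model on the labeled counts.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)


def classify(texts):
    """Vectorize new emails with the fitted vocabulary and predict labels."""
    return list(classifier.predict(vectorizer.transform(texts)))


predictions = classify(["free money", "lunch with friend"])
print(predictions)
```

Note that new emails go through `transform`, not `fit_transform`, so they are encoded with the vocabulary learned during training.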

# Naive Bayes (Multinomial Naive Bayes Classification) Example

Suppose my inbox has:
- normal emails: 8
- spam emails: 4

Then, the prior probabilities are:
- $p(\text{Normal}) = \frac{8}{8 + 4} = 0.67$
- $p(\text{Spam}) = \frac{4}{8 + 4} = 0.33$

My normal emails include the following words and their frequencies:
- Dear: 8
- Friend: 5
- Lunch: 3
- Money: 1

The total number of words in normal emails is 17. Therefore, the conditional probabilities are:
- $p(\text{Dear}|\text{Normal}) = \frac{8}{17} = 0.47$
- $p(\text{Friend}|\text{Normal}) = \frac{5}{17} = 0.29$
- $p(\text{Lunch}|\text{Normal}) = \frac{3}{17} = 0.18$
- $p(\text{Money}|\text{Normal}) = \frac{1}{17} = 0.06$

My spam emails include the following words and their frequencies:
- Dear: 2
- Friend: 1
- Lunch: 0
- Money: 4

The total number of words in spam emails is 7. Therefore, the conditional probabilities are:
- $p(\text{Dear}|\text{Spam}) = \frac{2}{7} = 0.29$
- $p(\text{Friend}|\text{Spam}) = \frac{1}{7} = 0.14$
- $p(\text{Lunch}|\text{Spam}) = \frac{0}{7} = 0$
- $p(\text{Money}|\text{Spam}) = \frac{4}{7} = 0.57$
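
The conditional probabilities above are just each word's count divided by the class total. A short sketch, using the counts from this example and a hypothetical helper named `word_likelihoods`:

```python
# Word counts from the example inbox.
normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
spam_counts = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}


def word_likelihoods(counts):
    """Return p(word | class) as count / total words in that class."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


p_word_given_normal = word_likelihoods(normal_counts)
p_word_given_spam = word_likelihoods(spam_counts)

for word in normal_counts:
    print(word, round(p_word_given_normal[word], 2), round(p_word_given_spam[word], 2))
```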

I receive a new email that includes:
- Dear: 1
- Friend: 1

Is it a normal email or a spam email?

Using the naive Bayes formula, I can calculate a score proportional to each posterior probability (the shared denominator $p(\text{Dear Friend})$ is dropped because it does not change which class wins):

- $p(\text{Normal}|\text{Dear Friend}) \propto p(\text{Normal}) * p(\text{Dear}|\text{Normal}) * p(\text{Friend}|\text{Normal}) = 0.67 * 0.47 * 0.29 = 0.09$
- $p(\text{Spam}|\text{Dear Friend}) \propto p(\text{Spam}) * p(\text{Dear}|\text{Spam}) * p(\text{Friend}|\text{Spam}) = 0.33 * 0.29 * 0.14 = 0.01$

Since $p(\text{Normal}|\text{Dear Friend}) > p(\text{Spam}|\text{Dear Friend})$, I can conclude that the email is **normal**.
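
The comparison for the “Dear Friend” email can be reproduced with the sketch below. The helper names `unnormalized_scores` and `normalize` are hypothetical. The second helper goes one step beyond the text: it divides by the sum of the scores so the two values add up to 1.

```python
# Priors and conditional probabilities from the example, kept as exact fractions.
priors = {"Normal": 8 / 12, "Spam": 4 / 12}
likelihoods = {
    "Normal": {"Dear": 8 / 17, "Friend": 5 / 17, "Lunch": 3 / 17, "Money": 1 / 17},
    "Spam": {"Dear": 2 / 7, "Friend": 1 / 7, "Lunch": 0 / 7, "Money": 4 / 7},
}


def unnormalized_scores(words):
    """Prior times the product of p(word | class), for each class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    return scores


def normalize(scores):
    """Rescale the scores so they sum to 1 (assumes at least one is nonzero)."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


scores = unnormalized_scores(["Dear", "Friend"])
posteriors = normalize(scores)
print({label: round(s, 2) for label, s in scores.items()})
print({label: round(p, 2) for label, p in posteriors.items()})
```

After normalization, the probability of Normal comes out at roughly 0.87, so the conclusion is the same as with the raw scores.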

I receive another email that includes:
- Lunch: 1
- Money: 4

Is it a normal email or a spam email?

Using the naive Bayes formula, I can again calculate a score proportional to each posterior probability:

- $p(\text{Normal}|\text{Lunch Money Money Money Money}) \propto p(\text{Normal}) * p(\text{Lunch}|\text{Normal}) * p(\text{Money}|\text{Normal}) ^ 4 = 0.67 * 0.18 * 0.06 ^ 4 = 0.000002$
- $p(\text{Spam}|\text{Lunch Money Money Money Money}) \propto p(\text{Spam}) * p(\text{Lunch}|\text{Spam}) * p(\text{Money}|\text{Spam}) ^ 4 = 0.33 * 0 * 0.57 ^ 4 = 0$

Since $p(\text{Normal}|\text{Lunch Money Money Money Money}) > p(\text{Spam}|\text{Lunch Money Money Money Money})$, I can conclude that the email is **normal**. However, this is **wrong**. The email is clearly a spam email, but the naive Bayes classifier fails to detect it because of the zero probability problem. This happens when one of the conditional probabilities is zero, which makes the whole posterior probability zero regardless of the other factors.

To fix this problem, I can use a technique called **Laplace smoothing**, which adds a small constant (usually 1) to each word's frequency to avoid zero probabilities. This also helps to account for words that may not appear in the training data but may appear in the test data.

Using Laplace smoothing with alpha = 1, each word count increases by 1 and each class total increases by 4 (alpha times the vocabulary size of 4 words). The new conditional probabilities are:

- $p(\text{Lunch}|\text{Spam}) = \frac{0 + 1}{7 + 4} = 0.09$
- $p(\text{Lunch}|\text{Normal}) = \frac{3 + 1}{17 + 4} = 0.19$
- $p(\text{Money}|\text{Spam}) = \frac{4 + 1}{7 + 4} = 0.45$
- $p(\text{Money}|\text{Normal}) = \frac{1 + 1}{17 + 4} = 0.10$

Using the naive Bayes formula with the smoothed probabilities, I can calculate new scores proportional to the posterior probabilities:

- $p(\text{Normal}|\text{Lunch Money Money Money Money}) \propto p(\text{Normal}) * p(\text{Lunch}|\text{Normal}) * p(\text{Money}|\text{Normal}) ^ 4 = 0.67 * 0.19 * 0.10 ^ 4 = 0.00001$
- $p(\text{Spam}|\text{Lunch Money Money Money Money}) \propto p(\text{Spam}) * p(\text{Lunch}|\text{Spam}) * p(\text{Money}|\text{Spam}) ^ 4 = 0.33 * 0.09 * 0.45 ^ 4 = 0.0012$

Now, $p(\text{Spam}|\text{Lunch Money Money Money Money}) > p(\text{Normal}|\text{Lunch Money Money Money Money})$, which is the correct result. The email is **spam**.
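
The smoothed classifier can be sketched in a few lines of Python. The function names below are hypothetical. Summing logarithms instead of multiplying probabilities is a choice of this sketch, not something the example above requires: it avoids numerical underflow on long emails and gives the same ranking.

```python
import math

VOCABULARY = ["Dear", "Friend", "Lunch", "Money"]
email_counts = {"Normal": 8, "Spam": 4}
word_counts = {
    "Normal": {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1},
    "Spam": {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4},
}


def smoothed_likelihood(label, word, alpha=1.0):
    """p(word | class) with alpha added to every word count."""
    counts = word_counts[label]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(VOCABULARY))


def log_score(label, words, alpha=1.0):
    """log prior plus the sum of log likelihoods; alpha must be > 0 here,
    because log(0) is undefined for words with a zero count."""
    score = math.log(email_counts[label] / sum(email_counts.values()))
    for word in words:
        score += math.log(smoothed_likelihood(label, word, alpha))
    return score


def classify(words, alpha=1.0):
    """Return the class with the highest log score."""
    return max(email_counts, key=lambda label: log_score(label, words, alpha))


email = ["Lunch"] + ["Money"] * 4
print(classify(email))              # prints Spam
print(classify(["Dear", "Friend"]))  # prints Normal
```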