In this blog, I will cover:

1. What is Naive Bayes
2. How it works
3. How it is calculated
4. Why it is Naive
5. How it gets calculated if we don't know P(B|A)
6. How to implement Naive Bayes classifiers in Python

### 1. What’s Naive Bayes?

![title](Capture5.PNG)

At the core of the Naive Bayes algorithm is Bayes' Theorem, which gives the probability of an event happening given one or more observed pieces of evidence (features).<br>
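
Written out in text, assuming the figure shows the standard form of the theorem, with A as the event we want to predict and B as the observed evidence:

$$P(A|B) = \frac{P(A) * P(B|A)}{P(B)}$$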

### 2. How does it work?

Let's use some examples to illustrate this formula in different scenarios.<br>

For example:
1. You are a firefighter, and you want to predict the chance of a fire when you get a call reporting smoke.<br>
2. You are a botanist, and you want to predict the chance that a flower is a jasmine when you measure its petal width/length.<br>
3. You work in cybersecurity, and you want to predict the chance that an email is spam when the word 'Viagra' shows up.<br>

The smoke, the petal width/length, and the presence of 'Viagra' in an email are all the 'B' in our formula: the known evidence/features.<br>

The fire occurrence, the flower type, and whether the email is spam are all the 'A' in our formula: the event we want to predict.<br>

The 'when' in each of the three examples is the '|' in the formula. It is a conditional term, meaning: given what we have observed so far, how likely is the event to happen?

### 3. How does it get calculated?

Now that we have some intuition for Bayes' Theorem, let's dive into how to calculate the probability we are interested in, P(A|B), given perfect or imperfect information. We will use the fire and smoke example to demonstrate the calculation.

Before diving in, here are four formulas we will reuse heavily:<br>
F1. $P(A|B) = \frac{P(A) * P(B|A)}{P(B)}$<br>
F2. $P(B) = P(A) * P(B|A) + P(\text{not } A) * P(B|\text{not } A)$<br>
F3. $P(A) + P(\text{not } A) = 1$<br>
F4. $P(B,C|A) = P(B|A) * P(C|A)$
                                    

***Perfect Scenario: assume you know every individual component of Bayes' Theorem:***<br>

$$P(fire|smoke) = \frac{P(fire) * P(smoke|fire)}{P(smoke)} = \frac{0.01 * 0.8}{0.2} = 0.04$$<br>


$P(fire)$ - the probability of a fire - is 1%<br>
$P(smoke)$ - the probability of seeing smoke - is 20%<br>
$P(smoke|fire)$ - out of 100 fires, 80 of them start with smoke - is 80%<br>
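
The same arithmetic can be sketched in a few lines of Python. The function name `posterior` and its parameter names are made up for this illustration; it is just F1 applied to the three numbers above.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Theorem (F1): P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return prior * likelihood / evidence


# P(fire) = 0.01, P(smoke|fire) = 0.8, P(smoke) = 0.2
p_fire_given_smoke = posterior(prior=0.01, likelihood=0.8, evidence=0.2)
print(round(p_fire_given_smoke, 4))  # 0.04
```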

***Not-Perfect Scenario 1: assume you do not know P(smoke)***<br>

$$P(fire|smoke) = \frac{P(fire) * P(smoke|fire)}{???} = \frac{0.01 * 0.8}{???}$$

What you do know in addition:<br>

$P(smoke|not fire)$ - out of 100 normal events, 2 of them have smoke, maybe from intense cooking - is 2%<br>

Based on F2 and F3, <br>
$$P(smoke) = P(fire) * P(smoke|fire) + P(not fire) * P(smoke | not fire) = 0.01 * 0.80 + 0.99 * 0.02 = 0.0278 $$<br>

Let's plug in: <br>
$$P(fire|smoke) = \frac{P(fire) * P(smoke|fire)}{P(smoke)} = \frac{0.01 * 0.8}{0.0278} = 0.2878$$
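
As a rough sketch of this scenario in Python: the helper name `evidence` is hypothetical, and it simply combines F2 and F3 to rebuild the missing denominator before applying F1.

```python
def evidence(prior, likelihood, likelihood_if_not):
    """F2 + F3: P(B) = P(A) * P(B|A) + (1 - P(A)) * P(B|not A)."""
    return prior * likelihood + (1 - prior) * likelihood_if_not


p_fire = 0.01
p_smoke_given_fire = 0.80
p_smoke_given_not_fire = 0.02

# Rebuild the unknown denominator, then apply F1
p_smoke = evidence(p_fire, p_smoke_given_fire, p_smoke_given_not_fire)
p_fire_given_smoke = p_fire * p_smoke_given_fire / p_smoke

print(round(p_smoke, 4))             # 0.0278
print(round(p_fire_given_smoke, 4))  # 0.2878
```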



***Not-Perfect Scenario 2: assume we collect more than one feature (smoke and high temp), and you don't know P(smoke, High temp|fire) or P(smoke, High temp)***<br> 

$$P(fire|smoke,High temp) = \frac{P(fire) * P(smoke,High temp|fire)}{P(smoke,High temp)} = \frac{0.01 * ???}{???}$$<br>

What we do know in addition:<br>
$P(smoke|fire)$ - out of 100 fires, 80 of them start with smoke - is 80%<br>
$P(smoke|not fire)$ - out of 100 normal events, 2 of them have smoke, maybe from intense cooking - is 2% <br>
$P(High temp|fire)$ - out of 100 fires, 50 of them start with high temp - is 50%<br>
$P(High temp|not fire)$ - out of 100 normal events, 30 of them have high temp, e.g. A/C failures during summer - is 30% <br>


Let's leverage the formulas and the information we do have!<br>
Based on F4:
$$P(smoke,High temp|fire) = P(smoke|fire) * P(High temp|fire) = 0.80 * 0.50 = 0.40$$<br>
$$P(smoke,High temp |not fire) = P(smoke|not fire) * P(High temp|not fire) = 0.02 * 0.30 = 0.006$$<br>

Based on F2:
$$P(smoke,High temp) = P(fire) * P(smoke,High temp|fire) + P(not fire) * P(smoke,High temp | not fire) = 0.01 * 0.40 + 0.99 * 0.006 = 0.00994$$<br>

Let's plug those two values into F1:
$$P(fire|smoke,High temp) = \frac{P(fire) * P(smoke,High temp|fire)}{P(smoke,High temp)} = \frac{0.01 * 0.40}{0.00994} = 0.4024$$
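
Here is a minimal sketch of the two-feature calculation, using the prior and the four conditional probabilities listed for this scenario. The function `naive_posterior` is a made-up helper, not a library call: it applies F4 to each class, F2 for the denominator, and F1 for the final ratio.

```python
from math import prod


def naive_posterior(prior, likelihoods, likelihoods_if_not):
    """P(A | all features), multiplying per-feature likelihoods as in F4."""
    joint_if_a = prior * prod(likelihoods)                 # P(A) * P(B,C|A)
    joint_if_not = (1 - prior) * prod(likelihoods_if_not)  # P(not A) * P(B,C|not A)
    return joint_if_a / (joint_if_a + joint_if_not)        # F1, with F2 as denominator


p = naive_posterior(
    prior=0.01,                       # P(fire)
    likelihoods=[0.80, 0.50],         # P(smoke|fire), P(High temp|fire)
    likelihoods_if_not=[0.02, 0.30],  # P(smoke|not fire), P(High temp|not fire)
)
print(round(p, 4))  # 0.4024
```

Because the likelihoods are passed as lists, the same helper would work for three or more features without changes.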


### 4. Why is Naive Bayes Naive?

Hopefully the three scenarios above gave you a taste of how to use known probabilities to derive the probability you are interested in. You may say: okay, this method seems reasonable, so why is it called 'Naive'?<br>

Remember the last of the four formulas above, which we use when we have more than one feature: $P(B,C|A) = P(B|A) * P(C|A)$<br>

This formula holds only when features B and C are independent of each other given A, which is rarely exactly true in practice.<br>

***In other words, as soon as there is more than one feature, Naive Bayes naively assumes the features are independent of each other given the class.***
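
To see what the assumption can cost, here is a small sketch with made-up counts. The breakdown of 100 fires below is hypothetical: it is chosen so that 80 fires have smoke and 50 have high temp, but the two features tend to show up together.

```python
# Hypothetical breakdown of 100 fires (invented counts, for illustration only)
fires = {
    ("smoke", "high_temp"): 45,
    ("smoke", "normal_temp"): 35,
    ("no_smoke", "high_temp"): 5,
    ("no_smoke", "normal_temp"): 15,
}
total = sum(fires.values())

p_smoke = sum(n for (s, _), n in fires.items() if s == "smoke") / total
p_high = sum(n for (_, t), n in fires.items() if t == "high_temp") / total

p_both = fires[("smoke", "high_temp")] / total  # true P(smoke, High temp | fire)
naive_estimate = p_smoke * p_high               # what F4 assumes it is

print(p_smoke, p_high)  # 0.8 0.5
print(p_both)           # 0.45
print(naive_estimate)   # 0.4
```

With these counts the naive product (0.40) underestimates the true joint probability (0.45), because smoke and high temp are correlated among fires.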

### 5. How does it get calculated, if we just don't know P(B|A)?

Furthermore, you may have noticed that the bulk of the calculation depends on $P(B|A)$.<br>
In reality, we do not always have a direct value for $P(B|A)$. What do we do? __WE MAKE ASSUMPTIONS!__<br>

Three common assumptions about the distribution of the features:<br>
- For continuous features (e.g. 1.234, 2.345, 3.672): Gaussian Naive Bayes assumes each feature follows a Gaussian (normal) distribution within each class. <br>
- For discrete count features (e.g. 1, 2, 3): Multinomial Naive Bayes assumes the features follow a multinomial distribution, like rolling a die. <br>
- For binary features (e.g. present or absent): Bernoulli Naive Bayes assumes each feature follows a Bernoulli distribution, like tossing a coin.<br>
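
For reference, each assumption gives $P(B|A)$ a concrete form. The symbols $\mu_A$, $\sigma_A^2$, $\theta_{A,i}$ and $p_A$ are notation introduced here for the per-class parameters estimated from the training data; they do not appear elsewhere in this post.

Gaussian, for a continuous feature value $x$:
$$P(x|A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\left(-\frac{(x-\mu_A)^2}{2\sigma_A^2}\right)$$

Multinomial, for counts $x_1, \dots, x_n$ (up to a factor that does not depend on the class):
$$P(x_1, \dots, x_n|A) \propto \prod_{i=1}^{n} \theta_{A,i}^{x_i}$$

Bernoulli, for a binary feature $x \in \{0, 1\}$:
$$P(x|A) = p_A^{x} * (1 - p_A)^{1 - x}$$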

### 6. Naive Bayes Implementation using Sklearn

In [1]:
# Load Library
import numpy as np
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB

#### Gaussian Naive Bayes Example

In [8]:
smoke_density = np.array([[1.120], [2.256], [3.800], [0.768], [0.965], [0.362]])
X = smoke_density
Y = np.array(['FIRE', 'FIRE', 'FIRE', 'NOT FIRE', 'NOT FIRE', 'NOT FIRE'])
clf = GaussianNB()
clf.fit(X, Y)

print(clf.predict([[0.1]]))
print(clf.predict([[2.0]]))

['NOT FIRE']
['FIRE']
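
To make the Gaussian assumption concrete, here is a rough pure-Python sketch of the same prediction. It estimates a mean and variance per class, evaluates the normal density, and picks the class with the larger prior times likelihood. The helper names are invented for this example, and the sketch leaves out the small variance smoothing term that scikit-learn's `GaussianNB` applies, so treat it as an approximation of what the library does.

```python
import math
from statistics import fmean, pvariance

# Same training data as above, grouped by class
smoke_density = {
    "FIRE": [1.120, 2.256, 3.800],
    "NOT FIRE": [0.768, 0.965, 0.362],
}


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def predict(x):
    """Return the class with the largest prior * Gaussian likelihood."""
    n_total = sum(len(values) for values in smoke_density.values())
    scores = {}
    for label, values in smoke_density.items():
        prior = len(values) / n_total
        # pvariance divides by n (the maximum-likelihood estimate)
        scores[label] = prior * gaussian_pdf(x, fmean(values), pvariance(values))
    return max(scores, key=scores.get)


print(predict(0.1))  # NOT FIRE
print(predict(2.0))  # FIRE
```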


#### Multinomial Naive Bayes Example

Multinomial Naive Bayes is commonly used in text classification, where the features are word frequencies. <br>
For example, we want to use the frequencies of three words ("Free", "Viagra", "Trial") to predict whether an email is spam or not.

Emails = [<br>
       [Free Free Viagra Viagra!],<br>
       [Viagra for Free.],<br>
       [Free Trial Viagra ~],<br>
       [Free Nail TRIAL],<br>
       [Free Spa or Free Nail polish Trial !!!],<br>
       [Top 10 Hiking Trial]]<br>

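In [2]:
# Columns: counts of "Free", "Viagra", "Trial" in each email
Text_Freq = np.array([[2, 2, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [1, 0, 1],
                      [2, 0, 1],
                      [0, 0, 1]])

X = Text_Freq
Y = np.array(['Spam', 'Spam', 'Spam', 'Not Spam', 'Not Spam', 'Not Spam'])
clf = MultinomialNB()
clf.fit(X, Y)

# 'Moonnight' is not one of the three words, so only 'Trial' is counted
New_Email = 'Moonnight Trial'
print(clf.predict([[0, 0, 1]]))

# 'BitCoins' is not one of the three words, so only 'Free' and 'Viagra' are counted
New_Email2 = 'Free Viagra BitCoins'
print(clf.predict([[1, 1, 0]]))

['Not Spam']
['Spam']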
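
For intuition, here is a sketch of the multinomial model in plain Python: count the three words in each email, estimate smoothed word probabilities per class, and compare log scores. All function names are invented for this example, and `alpha=1.0` (add-one smoothing) is chosen to mirror the default of scikit-learn's `MultinomialNB`. The test email 'Free Viagra' is a made-up example.

```python
import math
import re

VOCAB = ["free", "viagra", "trial"]
emails = [
    ("Free Free Viagra Viagra!", "Spam"),
    ("Viagra for Free.", "Spam"),
    ("Free Trial Viagra ~", "Spam"),
    ("Free Nail TRIAL", "Not Spam"),
    ("Free Spa or Free Nail polish Trial !!!", "Not Spam"),
    ("Top 10 Hiking Trial", "Not Spam"),
]


def count_words(text):
    """Count how often each vocabulary word occurs, ignoring case."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in VOCAB]


def train(samples, alpha=1.0):
    """Return {label: (log prior, smoothed log word probabilities)}."""
    word_totals, doc_counts = {}, {}
    for text, label in samples:
        running = word_totals.setdefault(label, [0] * len(VOCAB))
        for i, c in enumerate(count_words(text)):
            running[i] += c
        doc_counts[label] = doc_counts.get(label, 0) + 1
    model = {}
    for label, totals in word_totals.items():
        denom = sum(totals) + alpha * len(VOCAB)
        log_probs = [math.log((c + alpha) / denom) for c in totals]
        model[label] = (math.log(doc_counts[label] / len(samples)), log_probs)
    return model


def predict(model, text):
    """Pick the class with the highest log prior + count-weighted log likelihood."""
    counts = count_words(text)

    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(c * lp for c, lp in zip(counts, log_probs))

    return max(model, key=score)


model = train(emails)
print(count_words("Free Spa or Free Nail polish Trial !!!"))  # [2, 0, 1]
print(predict(model, "Moonnight Trial"))  # Not Spam
print(predict(model, "Free Viagra"))      # Spam
```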


#### Bernoulli Naive Bayes

Bernoulli Naive Bayes is also commonly used in text classification. Here the features record whether each word is present, regardless of its frequency. <br>
Let's use the same email list.<br>

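In [12]:
# Columns: presence (1) or absence (0) of "Free", "Viagra", "Trial" in each email
Text_Pres = np.array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 1],
       [1, 0, 1],
       [1, 0, 1],
       [0, 0, 1]])

X = Text_Pres
Y = np.array(['Spam', 'Spam', 'Spam', 'Not Spam', 'Not Spam', 'Not Spam'])
clf = BernoulliNB()
clf.fit(X, Y)

# 'Moonnight' is not one of the three words, so only 'Trial' is present
New_Email = 'Moonnight Trial'
print(clf.predict([[0, 0, 1]]))

# 'BitCoins' is not one of the three words, so only 'Free' and 'Viagra' are present
New_Email2 = 'Free Viagra BitCoins'
print(clf.predict([[1, 1, 0]]))

['Not Spam']
['Spam']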
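
The Bernoulli model can be sketched by hand as well. Unlike the multinomial model, an absent word also contributes to the score, through $1 - p$. The helper names are made up, and `alpha=1.0` is used to mirror the default smoothing of scikit-learn's `BernoulliNB`.

```python
import math

# Presence matrix from above, grouped by class
presence = {
    "Spam": [[1, 1, 0], [1, 1, 0], [1, 1, 1]],
    "Not Spam": [[1, 0, 1], [1, 0, 1], [0, 0, 1]],
}


def train(rows_by_class, alpha=1.0):
    """Return {label: (prior, smoothed P(word present | label) per word)}."""
    n_total = sum(len(rows) for rows in rows_by_class.values())
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        # "+ 2 * alpha" in the denominator covers the two outcomes (present/absent)
        p_present = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[label] = (n / n_total, p_present)
    return model


def predict(model, x):
    """Pick the class with the highest log prior + log Bernoulli likelihood."""

    def log_score(label):
        prior, p_present = model[label]
        total = math.log(prior)
        for xi, p in zip(x, p_present):
            # An absent word contributes (1 - p) instead of being ignored
            total += math.log(p if xi else 1 - p)
        return total

    return max(model, key=log_score)


model = train(presence)
print(model["Spam"][1])           # [0.8, 0.8, 0.4]
print(predict(model, [0, 0, 1]))  # Not Spam
print(predict(model, [1, 1, 0]))  # Spam
```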


Further Links:
If you would like to know more about the Bernoulli and multinoulli distributions and the differences between them, check out the following link:
https://geekyisawesome.blogspot.com/2016/12/bernoulli-vs-binomial-vs-multinoulli-vs.html

If you would like to know more about using Naive Bayes for text classification, refer to the following link:
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html