# Naive Bayes

Naive Bayes is a simple yet powerful classification algorithm. It is based on Bayes' theorem together with the assumption that the predictors are independent of one another given the class: the presence of a particular feature in a class is assumed to be unrelated to the presence of any other feature.
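Written out, the independence assumption lets the class posterior factor into one term per feature. The following is a sketch in the usual notation, where c is a class and x_1, ..., x_n are the feature values (these symbols are chosen here for illustration):

\begin{equation}
P(c|x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i|c)
\end{equation}

The classifier picks the class c for which the right-hand side is largest.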

## Bayes Theorem

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.

Given a hypothesis (H) and evidence (E), Bayes' theorem states that the relationship between the probability of the hypothesis before getting the evidence, P(H), and the probability of the hypothesis after getting the evidence, P(H|E), is:

\begin{equation}
P(H|E) = \frac{P(E|H)P(H)}{P(E)}
\end{equation}

P(H) is called the prior probability,

P(H|E) is called the posterior probability,

P(E|H)/P(E) is called the likelihood ratio.
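As a minimal sketch of the formula in code, the helper functions below (hypothetical names, not part of any library) compute the posterior and, for a hypothesis with only two outcomes, the evidence term P(E) via the law of total probability. The numbers are taken from the weather dataset used in the example that follows, with H = "Play is Yes" and E = "Outlook is Sunny".

```python
from fractions import Fraction

def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    if evidence <= 0:
        raise ValueError("P(E) must be positive")
    return likelihood * prior / evidence

def total_evidence(lik_h, prior_h, lik_not_h):
    """Law of total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H)."""
    return lik_h * prior_h + lik_not_h * (1 - prior_h)

# H = "Play is Yes", E = "Outlook is Sunny"
p_e_given_h = Fraction(2, 9)      # 2 of the 9 'Yes' days are sunny
p_h = Fraction(9, 14)             # 9 of the 14 days are 'Yes'
p_e_given_not_h = Fraction(3, 5)  # 3 of the 5 'No' days are sunny

p_e = total_evidence(p_e_given_h, p_h, p_e_given_not_h)
p_h_given_e = posterior(p_e_given_h, p_h, p_e)
print(p_e, p_h_given_e)  # 5/14 2/5
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to compare against the hand calculation.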

### Example

In [1]:
import pandas as pd
data = pd.DataFrame({
    "Day": pd.Series(list(range(1,15)), index=list(range(1,15))),
    "Outlook": pd.Series(["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"], index=list(range(1,15))),
    "Humidity": pd.Series(["High"]*4 + ["Normal"]*3 + ["High"] + ["Normal"]*3 + ["High", "Normal", "High"], index=list(range(1,15))),
    "Wind": pd.Series(["Weak", "Strong"] + ["Weak"]*3 + ["Strong"]*2 + ["Weak"]*3 + ["Strong"]*2 + ["Weak", "Strong"], index=list(range(1,15))),
    "Play": pd.Series(["No"]*2 + ["Yes"]*3 + ["No", "Yes"]*2 + ["Yes"]*4 + ["No"], index=list(range(1,15)))
})
data

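| | Day | Outlook | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 1 | 1 | Sunny | High | Weak | No |
| 2 | 2 | Sunny | High | Strong | No |
| 3 | 3 | Overcast | High | Weak | Yes |
| 4 | 4 | Rain | High | Weak | Yes |
| 5 | 5 | Rain | Normal | Weak | Yes |
| 6 | 6 | Rain | Normal | Strong | No |
| 7 | 7 | Overcast | Normal | Strong | Yes |
| 8 | 8 | Sunny | High | Weak | No |
| 9 | 9 | Sunny | Normal | Weak | Yes |
| 10 | 10 | Rain | Normal | Weak | Yes |
| 11 | 11 | Sunny | Normal | Strong | Yes |
| 12 | 12 | Overcast | High | Strong | Yes |
| 13 | 13 | Overcast | Normal | Weak | Yes |
| 14 | 14 | Rain | High | Strong | No |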


Next, we build a frequency table for each attribute of the dataset, counting how often each attribute value occurs together with Play = Yes and Play = No.

In [2]:
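# Index the table by the distinct Outlook values
freq_outlook = data[['Outlook']].groupby('Outlook').count()
# Count the days per Outlook value separately for Play == 'Yes' and Play == 'No'
freq_outlook['Yes'] = data[data['Play'] == 'Yes'].groupby('Outlook')['Play'].count()
freq_outlook['No'] = data[data['Play'] == 'No'].groupby('Outlook')['Play'].count()
# Combinations that never occur (Overcast with 'No') come back as NaN
freq_outlook = freq_outlook.fillna(0)
freq_outlook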

| Outlook | Yes | No |
|---|---|---|
| Overcast | 4 | 0.0 |
| Rain | 3 | 2.0 |
| Sunny | 2 | 3.0 |


In [3]:
# Same counting scheme, now per Humidity value
freq_humidity = data[['Humidity']].groupby('Humidity').count()
freq_humidity['Yes'] = data[data['Play'] == 'Yes'].groupby('Humidity')['Play'].count()
freq_humidity['No'] = data[data['Play'] == 'No'].groupby('Humidity')['Play'].count()
freq_humidity = freq_humidity.fillna(0)
freq_humidity

| Humidity | Yes | No |
|---|---|---|
| High | 3 | 4 |
| Normal | 6 | 1 |


In [4]:
# Same counting scheme, now per Wind value
freq_wind = data[['Wind']].groupby('Wind').count()
freq_wind['Yes'] = data[data['Play'] == 'Yes'].groupby('Wind')['Play'].count()
freq_wind['No'] = data[data['Play'] == 'No'].groupby('Wind')['Play'].count()
freq_wind = freq_wind.fillna(0)
freq_wind

| Wind | Yes | No |
|---|---|---|
| Strong | 3 | 3 |
| Weak | 6 | 2 |
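The three frequency tables above can also be produced in one step with `pd.crosstab`. The following is a sketch of that alternative, not the method used in the cells above; it rebuilds a compact copy of the weather data so that it runs on its own, and the dictionary name `freq` is chosen here for illustration.

```python
import pandas as pd

# Compact copy of the weather data from the first cell (Day column omitted)
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Humidity": ["High"] * 4 + ["Normal"] * 3 + ["High"] + ["Normal"] * 3
                + ["High", "Normal", "High"],
    "Wind": ["Weak", "Strong"] + ["Weak"] * 3 + ["Strong"] * 2 + ["Weak"] * 3
            + ["Strong"] * 2 + ["Weak", "Strong"],
    "Play": ["No"] * 2 + ["Yes"] * 3 + ["No", "Yes"] * 2 + ["Yes"] * 4 + ["No"],
})

# One Yes/No frequency table per attribute; crosstab fills missing pairs with 0
freq = {
    col: pd.crosstab(data[col], data["Play"])[["Yes", "No"]]
    for col in ["Outlook", "Humidity", "Wind"]
}
print(freq["Outlook"])
```

Unlike the `groupby` approach, `crosstab` keeps all counts as integers and needs no `fillna` call.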


Now for each frequency table we will generate a likelihood table.

In [5]:
likelihood_outlook = freq_outlook.copy()
# Class totals: number of 'Yes' days and 'No' days
yes_sum = freq_outlook['Yes'].sum()
no_sum = freq_outlook['No'].sum()
total = yes_sum + no_sum

# P(x|Yes) and P(x|No): share of each class's days that have this outlook
likelihood_outlook['Yes'] = freq_outlook['Yes'] / yes_sum
likelihood_outlook['No'] = freq_outlook['No'] / no_sum
# P(x): share of all days that have this outlook
likelihood_outlook['P(x)'] = (freq_outlook['Yes'] + freq_outlook['No']) / total
likelihood_outlook

| Outlook | Yes | No | P(x) |
|---|---|---|---|
| Overcast | 0.444444 | 0.0 | 0.285714 |
| Rain | 0.333333 | 0.4 | 0.357143 |
| Sunny | 0.222222 | 0.6 | 0.357143 |


The posterior probability of 'Yes' given 'Sunny' is:
\begin{equation}
P(c|x) = P(Yes|Sunny) = \frac{P(Sunny|Yes) \times P(Yes)}{P(Sunny)} = \frac{\frac{2}{9} \times \frac{9}{14}}{\frac{5}{14}} = \frac{0.222222 \times 0.642857}{0.357143} = 0.4
\end{equation}

Similarly, the posterior probability of 'No' given 'Sunny' is:
\begin{equation}
P(c|x) = P(No|Sunny) = \frac{P(Sunny|No) \times P(No)}{P(Sunny)} = \frac{\frac{3}{5} \times \frac{5}{14}}{\frac{5}{14}} = 0.6
\end{equation}

In the same way, we need to create the likelihood table for the other attributes.

In [6]:
likelihood_humidity = freq_humidity.copy()
# Class totals: number of 'Yes' days and 'No' days
yes_sum = freq_humidity['Yes'].sum()
no_sum = freq_humidity['No'].sum()
total = yes_sum + no_sum

# P(x|Yes), P(x|No) and P(x) for each Humidity value
likelihood_humidity['Yes'] = freq_humidity['Yes'] / yes_sum
likelihood_humidity['No'] = freq_humidity['No'] / no_sum
likelihood_humidity['P(x)'] = (freq_humidity['Yes'] + freq_humidity['No']) / total
likelihood_humidity

| Humidity | Yes | No | P(x) |
|---|---|---|---|
| High | 0.333333 | 0.8 | 0.5 |
| Normal | 0.666667 | 0.2 | 0.5 |


\begin{equation}
P(Yes|High) = \frac{P(High|Yes) \times P(Yes)}{P(High)} = \frac{\frac{3}{9} \times \frac{9}{14}}{\frac{7}{14}} = \frac{0.333333 \times 0.642857}{0.5} = 0.429
\end{equation}

\begin{equation}
P(No|High) = \frac{P(High|No) \times P(No)}{P(High)} = \frac{\frac{4}{5} \times \frac{5}{14}}{\frac{7}{14}} = 0.571
\end{equation}

In [7]:
likelihood_wind = freq_wind.copy()
# Class totals: number of 'Yes' days and 'No' days
yes_sum = freq_wind['Yes'].sum()
no_sum = freq_wind['No'].sum()
total = yes_sum + no_sum

# P(x|Yes), P(x|No) and P(x) for each Wind value
likelihood_wind['Yes'] = freq_wind['Yes'] / yes_sum
likelihood_wind['No'] = freq_wind['No'] / no_sum
likelihood_wind['P(x)'] = (freq_wind['Yes'] + freq_wind['No']) / total
likelihood_wind

| Wind | Yes | No | P(x) |
|---|---|---|---|
| Strong | 0.333333 | 0.6 | 0.428571 |
| Weak | 0.666667 | 0.4 | 0.571429 |


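\begin{equation}
P(Yes|Weak) = \frac{P(Weak|Yes) \times P(Yes)}{P(Weak)} = \frac{\frac{6}{9} \times \frac{9}{14}}{\frac{8}{14}} = 0.75
\end{equation}

\begin{equation}
P(No|Weak) = \frac{P(Weak|No) \times P(No)}{P(Weak)} = \frac{\frac{2}{5} \times \frac{5}{14}}{\frac{8}{14}} = 0.25
\end{equation}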
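The three likelihood cells repeat the same steps, so they can be folded into one helper. The sketch below assumes a frequency table with `Yes` and `No` columns like the ones built earlier; the function name `likelihood_table` and the extra posterior columns are additions made here for illustration. It is shown on the Wind counts, copied in by hand so the block runs on its own.

```python
import pandas as pd

def likelihood_table(freq: pd.DataFrame) -> pd.DataFrame:
    """Turn a Yes/No frequency table into likelihoods, evidence and posteriors."""
    yes_sum = freq["Yes"].sum()
    no_sum = freq["No"].sum()
    total = yes_sum + no_sum

    table = pd.DataFrame(index=freq.index)
    table["Yes"] = freq["Yes"] / yes_sum                 # P(x|Yes)
    table["No"] = freq["No"] / no_sum                    # P(x|No)
    table["P(x)"] = (freq["Yes"] + freq["No"]) / total   # P(x)
    # Bayes' theorem: P(class|x) = P(x|class) * P(class) / P(x)
    table["P(Yes|x)"] = table["Yes"] * (yes_sum / total) / table["P(x)"]
    table["P(No|x)"] = table["No"] * (no_sum / total) / table["P(x)"]
    return table

# Wind counts copied from the frequency table above
freq_wind = pd.DataFrame({"Yes": [3, 6], "No": [3, 2]}, index=["Strong", "Weak"])
wind = likelihood_table(freq_wind)
print(wind.round(3))
```

For each attribute value the two posterior columns sum to 1, which is a quick sanity check on the hand calculations.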

Now we have to predict whether we can play on a given day, here a day with Outlook = Rain, Humidity = High and Wind = Weak.
- Likelihood of 'Yes' on that Day = P(Outlook = Rain|Yes) \* P(Humidity= High|Yes) \* P(Wind= Weak|Yes) \* P(Yes)
- Likelihood of 'No' on that Day = P(Outlook = Rain|No) \* P(Humidity= High|No) \* P(Wind= Weak|No) \* P(No)

Next we normalize the values so that the probabilities of 'Yes' and 'No' on that day sum to 1:
\begin{equation}
P(Yes|\text{that Day}) = \frac{\text{Likelihood of 'Yes' on that Day}}{(\text{Likelihood of 'Yes' on that Day}) + (\text{Likelihood of 'No' on that Day})}
\end{equation}
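The prediction for a day with Outlook = Rain, Humidity = High and Wind = Weak can be worked through numerically. The sketch below copies the conditional probabilities from the likelihood tables above and uses exact fractions; the variable and function names are chosen here for illustration.

```python
from fractions import Fraction as F

# Priors and conditional probabilities read from the likelihood tables
p_yes, p_no = F(9, 14), F(5, 14)
cond_yes = {"Outlook=Rain": F(3, 9), "Humidity=High": F(3, 9), "Wind=Weak": F(6, 9)}
cond_no = {"Outlook=Rain": F(2, 5), "Humidity=High": F(4, 5), "Wind=Weak": F(2, 5)}

def joint(prior, conds):
    """Prior times the product of the per-feature conditionals."""
    score = prior
    for p in conds.values():
        score *= p
    return score

like_yes = joint(p_yes, cond_yes)  # unnormalized score for 'Yes'
like_no = joint(p_no, cond_no)     # unnormalized score for 'No'

# Normalize so the two probabilities sum to 1
prob_yes = like_yes / (like_yes + like_no)
prob_no = like_no / (like_yes + like_no)
prediction = "Yes" if prob_yes > prob_no else "No"

print(float(like_yes), float(like_no))
print(float(prob_yes), float(prob_no), prediction)
```

The two scores are 1/21 and 8/175, which normalize to 25/49 (about 0.51) for 'Yes' and 24/49 (about 0.49) for 'No', so the model narrowly predicts 'Yes'.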

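The same kind of classifier is available in scikit-learn. The cells below fit a Gaussian Naive Bayes model to the Iris dataset, holding out 40% of the samples for testing, and then evaluate its predictions.

In [8]: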
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.4, random_state=4)

In [9]:
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
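GaussianNB is the variant for continuous features such as the Iris measurements. Instead of counting category frequencies, it models each feature within each class as a normal distribution, so the per-feature likelihood takes the form below. This is a sketch of the standard formulation; the symbols \mu_{c,i} and \sigma_{c,i}^2, introduced here, denote the mean and variance of feature x_i among the training samples of class c.

\begin{equation}
P(x_i|c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^2}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)
\end{equation}

The `var_smoothing` parameter shown in the output adds a small portion of the largest feature variance to every variance for numerical stability.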

In [10]:
expected = y_test
predicted = model.predict(X_test)

In [11]:
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        25
           1       0.89      1.00      0.94        17
           2       1.00      0.89      0.94        18

    accuracy                           0.97        60
   macro avg       0.96      0.96      0.96        60
weighted avg       0.97      0.97      0.97        60

[[25  0  0]
 [ 0 17  0]
 [ 0  2 16]]
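The figures in the classification report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A minimal sketch in plain Python, with the matrix copied from the output above:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[25, 0, 0],
      [0, 17, 0],
      [0, 2, 16]]

n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = sum(cm[i][i] for i in range(n)) / total
# Recall: diagonal entry over its row sum (all samples that truly belong to the class)
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: diagonal entry over its column sum (all samples predicted as the class)
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

print(round(accuracy, 2))
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
```

The two class-2 samples predicted as class 1 are what lower the precision of class 1 and the recall of class 2 to 0.89.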
