In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

Contents : 
1. Probability Intro
    - Mutually exclusive events
    - Independent events
    - Dependent events
2. Conditional Probability (Bayes Theorem)
3. Naive Bayes Algorithm:
    - Multinomial Naive Bayes
    - MNB Smoothing : Laplace
    - Gaussian Naive Bayes
    - Bernoulli (optional)

# Probability Intro

Probability measures how likely an event is to occur, based on all the outcomes observed. I assume you are already familiar with the formula:

$$ P(A) = \frac{\text{occurrences of } A}{\text{all occurrences}}$$

To determine the probability of a combination of two or more events, we first need to know how the events relate to each other. There are three kinds of relation: mutually exclusive, independent, and dependent events.


## Mutually Exclusive Events


<p style="text-align: justify"> 
Events are mutually exclusive when they cannot happen together: only one of them can occur at a time.
Examples:
- When you flip a coin, you get either heads or tails, not both. The two events are: getting heads on a flip, and getting tails on a flip.
- When you're riding, you can turn either right or left, not both. The two events are: turning right, and turning left.
- When you draw a card from a complete deck, you get exactly one of ace, number, jack, queen, or king. **Only one**. The events are: drawing an ace, drawing a number, drawing a jack, drawing a queen, and drawing a king.

To find the probability that one of two or more mutually exclusive events occurs, simply add their probabilities.
Example:

In one single draw, the probability of drawing an ace **or** a king is $P(Ace) + P(King)$

When the mutually exclusive events also cover every possible outcome, as the five card groups do, their probabilities sum to one.

</p>


$$
    P(Ace) + P(Numbers) + P(Jack) + P(Queen) + P(King) = 1\\
    \frac{4}{52} +\frac{36}{52} +\frac{4}{52} +\frac{4}{52} +\frac{4}{52} = \frac{52}{52}\\
$$
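
The addition rule can be checked by counting cards. The sketch below is only an illustration and not part of the original notebook; the `deck` dictionary is a hypothetical helper that stores how many cards fall into each group.

```python
from fractions import Fraction

# Number of cards per group in a standard 52-card deck
# (numbers are the ranks 2 to 10 in four suits: 9 * 4 = 36).
deck = {"ace": 4, "number": 36, "jack": 4, "queen": 4, "king": 4}
total = sum(deck.values())

# Probability of each mutually exclusive event in a single draw
p = {group: Fraction(count, total) for group, count in deck.items()}

# Addition rule: P(Ace or King) = P(Ace) + P(King)
p_ace_or_king = p["ace"] + p["king"]
print(p_ace_or_king)      # 2/13

# The groups cover the whole deck, so their probabilities add up to 1
print(sum(p.values()))    # 1
```

Using `Fraction` keeps the results exact, so the sum is exactly 1 rather than a rounded float.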

## Independent Events

<p style="text-align: justify"> 
Events are independent when none of them affects the others. The most representative example is the coin toss. Say we toss a fair coin 100 times and all 100 tosses come up "HEAD". The probability that the 101st toss comes up heads is still 1/2, because each toss is isolated and has no effect on the following tosses.

There is a belief that when the same outcome happens many times in a row (like getting "HEAD" 100 times in a row), the streak is due to break. This is known as the [gambler's fallacy](https://www.stat.berkeley.edu/~aldous/157/Papers/croson.pdf). The belief is incorrect because coin tosses are independent events, which means one toss does not affect the others.

To find the probability that several independent events all occur, multiply their probabilities.

Example:
The probability of the first coin toss giving heads **and** the second toss giving tails is $P(Head) \times P(Tail)$


 </p>
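
The multiplication rule can be verified by listing every outcome of two tosses. This is a minimal sketch assuming a fair coin; the `prob` helper is hypothetical and simply counts matching outcomes.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses of a fair coin
outcomes = list(product(["H", "T"], repeat=2))

def prob(event):
    """Fraction of outcomes for which event(outcome) is True."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), len(outcomes))

p_first_head = prob(lambda o: o[0] == "H")
p_second_tail = prob(lambda o: o[1] == "T")
p_both = prob(lambda o: o == ("H", "T"))

# For independent events the joint probability is the product
print(p_first_head, p_second_tail, p_both)      # 1/2 1/2 1/4
print(p_both == p_first_head * p_second_tail)   # True
```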

## Dependent Events

Dependent events are the opposite of independent events: the occurrence of one event affects the probability of the others.

Example: whether someone passes a test can be affected by whether he/she finished reading the materials.

Take a look at the example below
![](res/naive-bayes.png)

<p style="text-align: justify"> 
    
From the statement above, we know that, logically, answering question A correctly increases our probability of being hired by Google. But what is that probability? 5/100? And how do we actually calculate it?

Let's discuss this in the next section: Conditional Probability
</p>

# Conditional Probability 


<p style="text-align: justify"> 
The answer to the previous question is Bayes' theorem. Bayes' theorem is used to calculate the probability of dependent events, also called conditional probability. The probability of event A given that event B has occurred is
</p>

$$
\begin{equation}
P(A | B) = \frac{P(A \cap B)}{P(B)} 
\label{eq:bayesian}
\tag{1}
\end{equation}
$$

Referring to equation $\eqref{eq:bayesian}$, we get
$$
\begin{equation}
P(A | B) \times P(B) = P(A \cap B)
\end{equation}
$$

$$
\begin{equation}
P(B | A) \times P(A) = P(B \cap A)
\end{equation}
$$

Looking at the diagram, $P(B\cap A) = P(A\cap B)$, so we can infer that

$$
\begin{equation}
P(A \cap B) = P(B \cap A) = P(B | A) \times P(A)
\tag{2}
\end{equation}
$$

Hence, the final equation will look like this
$$
\begin{equation}
P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} 
\label{eq:bayesian_complete}
\tag{3}
\end{equation}
$$

Are equations $\eqref{eq:bayesian}$ and $\eqref{eq:bayesian_complete}$ the same? Let's find out by answering the question from the previous section.

Let: 
- $H$ be the event "hired by Google"
- $C$ be the event "correctly answered question A"

$$
\begin{align*}
|H| &= 20\\
|C| &= 20\\
|H\cap C| &= 5 \\
P(H) &= \frac{20}{100}\\
P(C) &= \frac{20}{100}\\
P(H \cap C) = P(C \cap H) &=\frac{5}{100}
\end{align*}
$$



$$
\begin{align*}
P(H | C) &= \frac{P(H \cap C)}{P(C)} &&= \frac{P(C | H) \times P(H)}{P(C)}\\
&= \frac{(\frac{5}{100})}{(\frac{20}{100})} &&= \frac{(\frac{5}{20}) \times (\frac{20}{100})}{(\frac{20}{100})}\\
&= \frac{5}{20} &&= \frac{5}{20}\\
\end{align*}
$$

It is easier to calculate the probability from a Venn diagram because it shows the information clearly (you can actually just read it off). The problem is that a Venn diagram becomes difficult to draw once many variables are involved.
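
The same arithmetic can be reproduced in a few lines of Python. The sketch below uses the counts from the example ($|H| = 20$, $|C| = 20$, $|H \cap C| = 5$, out of 100 candidates); the variable names are chosen here for illustration.

```python
from fractions import Fraction

n_total = 100      # all candidates
n_hired = 20       # |H|
n_correct = 20     # |C|
n_both = 5         # |H ∩ C|

p_h = Fraction(n_hired, n_total)
p_c = Fraction(n_correct, n_total)
p_h_and_c = Fraction(n_both, n_total)

# Equation (1): definition of conditional probability
p_h_given_c_def = p_h_and_c / p_c

# Equation (3): Bayes' theorem, using P(C | H) = |H ∩ C| / |H|
p_c_given_h = Fraction(n_both, n_hired)
p_h_given_c_bayes = p_c_given_h * p_h / p_c

print(p_h_given_c_def, p_h_given_c_bayes)   # 1/4 1/4
```

Both routes give 1/4, which is the reduced form of 5/20.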

# Naive Bayes Algorithm

Naive Bayes is a classification algorithm that assigns each data point to the class with the highest probability. Generally, there are three Naive Bayes classification algorithms.

## Multinomial Naive Bayes

<p style="text-align: justify"> 
Multinomial Naive Bayes is used here for categorical data. It is also the most commonly used Naive Bayes algorithm because of its simplicity. Note that scikit-learn's `MultinomialNB` models count features, while its `CategoricalNB` estimates a separate probability for each category value.

To illustrate multinomial naive Bayes, we will use data from [Amazon Employee Access](https://www.kaggle.com/c/amazon-employee-access-challenge/data).
The data consists of real historical categorical data collected in 2010 and 2011. Employees are manually allowed or denied access to resources over time. The goal is to build an algorithm that learns from this historical data to predict approval or denial for an unseen set of employees. 


Let's first take a look at the data

</p>

In [2]:
employee = pd.read_csv('data/amazon-employee-access-challenge/train.csv')
employee.head()

Unnamed: 0,ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
0,1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,1,17183,1540,117961,118343,123125,118536,118536,308574,118539
2,1,36724,14457,118219,118220,117884,117879,267952,19721,117880
3,1,36135,5396,117961,118343,119993,118321,240983,290919,118322
4,1,42680,5905,117929,117930,119569,119323,123932,19793,119325


| Column Name      | Description                                                                                                       |
|------------------|-------------------------------------------------------------------------------------------------------------------|
| ACTION           | ACTION is 1 if the resource was approved, 0 if the resource was not                                               |
| RESOURCE         | An ID for each resource                                                                                           |
| MGR_ID           | The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time |
| ROLE_ROLLUP_1    | Company role grouping category id 1 (e.g. US Engineering)                                                         |
| ROLE_ROLLUP_2    | Company role grouping category id 2 (e.g. US Retail)                                                              |
| ROLE_DEPTNAME    | Company role department description (e.g. Retail)                                                                 |
| ROLE_TITLE       | Company role business title description (e.g. Senior Engineering Retail Manager)                                  |
| ROLE_FAMILY_DESC | Company role family extended description (e.g. Retail Manager, Software Engineering)                              |
| ROLE_FAMILY      | Company role family description (e.g. Retail Manager)                                                             |
| ROLE_CODE        | Company role code; this code is unique to each role (e.g. Manager)                                                |

First, let's split the data into train, validation, and test sets. Luckily, the competition already provides a separate test set, so we only need to split off a validation set.

In [3]:
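# The seed comes from the clock, so the split (and every score below) changes between runs
train, valid = train_test_split(employee, test_size=0.2, random_state=int(time.time()))
# Separate the predictors from the ACTION target and cast the ID columns to strings
X_train = train.loc[:, employee.columns != 'ACTION'].astype(str)
y_train = train.loc[:, 'ACTION']
X_valid = valid.loc[:, employee.columns != 'ACTION'].astype(str)
y_valid = valid.loc[:, 'ACTION']
# The provided test set has an extra `id` column and no ACTION column
X_test = pd.read_csv('data/amazon-employee-access-challenge/test.csv').astype(str).drop(columns=['id'])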

In [4]:
y_train.value_counts()

1    24678
0     1537
Name: ACTION, dtype: int64

In [5]:
X_train.head()

Unnamed: 0,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
21808,75078,41782,118315,118463,118522,120765,120765,118467,120767
30420,74355,1350,117961,118052,120096,117905,117906,290919,117908
3729,74038,5093,117961,118446,124170,118396,233714,118398,118399
13012,45610,21556,118006,118007,117941,117899,117897,19721,117900
3227,42085,21340,11146,118491,117920,118568,131694,19721,118570


In [6]:
X_test.head()

Unnamed: 0,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
0,32642,7792,118573,118574,117945,136261,128463,292795,119082
1,4696,14638,117961,118343,118514,118321,289122,255851,118322
2,22662,1760,118887,118888,120171,118396,255118,118398,118399
3,75078,7576,117961,118052,120671,118321,117906,257051,118322
4,39879,55668,117902,118041,117945,135951,134458,19776,119082


<p style="text-align: justify"> 
The class counts show that the data is heavily imbalanced. Now let's continue with how the Naive Bayes algorithm works. 

Based on the data, there are 9 predictors: RESOURCE, MGR_ID, ROLE_ROLLUP_1, and so on. Let's call these predictors $X$. We want the probability that $X$ belongs to class $1$ and the probability that it belongs to class $0$. In other words, we compare $P(1|X)$ and $P(0|X)$. If $P(1|X) > P(0|X)$, we classify $X$ as class $1$, and vice versa.

</p>

$$
\begin{equation}
P(1| X): P(0|X)\\
\frac{P(X | 1) \times P(1)}{P(X)} : \frac{P(X | 0) \times P(0)}{P(X)}\\
P(X | 1) \times P(1) : P(X | 0) \times P(0)\\
\tag{4}
\end{equation}
$$

Now, let's break down $P(X|1)$ (the same applies to $P(X|0)$)

$$P(X|1) = P(RESOURCE=x_1, MGR\_ID=x_2, ... , ROLE\_CODE=x_9 | 1)$$

In naive Bayes, we make the strong assumption that the variables/predictors are **independent** of each other given the class. This means we assume that no variable affects any other. Thus, $P(X|1)$ can be calculated as below

$$P(X|1) = P(RESOURCE=x_1|1) \times P(MGR\_ID=x_2|1) \times ... \times P(ROLE\_CODE=x_9 | 1)$$

Note that this assumption is so strong that it rarely holds in practice: predictors are seldom truly independent of each other. This is why the algorithm is called **Naive** Bayes: it naively applies this assumption to Bayes' theorem. 
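
To make the product of per-feature probabilities concrete, here is a minimal from-scratch sketch. It is not how scikit-learn implements the classifier; the five rows are hypothetical toy data with only two predictors, and `score` is a helper name chosen for this illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical toy data: (RESOURCE, ROLE_CODE) -> ACTION
rows = [
    (("r1", "c1"), 1),
    (("r1", "c2"), 1),
    (("r2", "c1"), 1),
    (("r2", "c2"), 0),
    (("r1", "c2"), 0),
]

class_counts = Counter(label for _, label in rows)
# value_counts[label][feature_index][value] = number of occurrences
value_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in rows:
    for i, value in enumerate(features):
        value_counts[label][i][value] += 1

def score(features, label):
    """Unnormalised posterior P(X | label) * P(label), as in equation (4)."""
    result = Fraction(class_counts[label], len(rows))   # prior P(label)
    for i, value in enumerate(features):
        # Naive assumption: multiply the per-feature likelihoods
        result *= Fraction(value_counts[label][i][value], class_counts[label])
    return result

x = ("r1", "c1")
scores = {label: score(x, label) for label in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

For this input, class 1 scores $\frac{3}{5} \times \frac{2}{3} \times \frac{2}{3} = \frac{4}{15}$, while class 0 scores exactly 0 because `c1` never appears with class 0 in the toy data. That zero is the problem addressed in the Smoothing section.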



## Smoothing

<p style="text-align: justify"> 
In multinomial naive Bayes, a value of a variable may never occur together with a class in the training data. This leads to the zero-probability problem: because one value has zero occurrences, its probability is zero, and the whole product becomes zero. The most common way to deal with this is **Laplace smoothing** (not to be confused with Laplacian smoothing). Laplace smoothing works by virtually adding occurrences (usually 1) to every count: add 1 to the numerator and, so that the probabilities still sum to one, add the number of distinct values of the variable to the denominator.

Example, for a variable with 2 possible values:

Before smoothing: 
$P (A) = 0/100$

After smoothing:
$P (A) = (0 + 1)/(100 + 2) = 1/102$

Note: by default, scikit-learn's discrete naive Bayes classifiers apply Laplace smoothing (`alpha=1.0`).

We have talked about how the algorithm works, now let's just call it using the scikit-learn library

</p>
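
As a rough illustration of additive smoothing, the sketch below assumes the common textbook form in which the denominator grows by `alpha` times the number of distinct values. The function name and parameters are hypothetical and not taken from scikit-learn.

```python
from fractions import Fraction

def smoothed_probability(count, total, n_values, alpha=1):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * n_values)."""
    return Fraction(count + alpha, total + alpha * n_values)

# A value never seen among 100 training rows, for a feature with 2 possible values
print(smoothed_probability(0, 100, n_values=2))    # 1/102

# The smoothed estimates over both values still add up to 1
print(smoothed_probability(0, 100, 2) + smoothed_probability(100, 100, 2))  # 1
```

Adding to the denominator as well as the numerator is what keeps the smoothed estimates a valid probability distribution.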

In [9]:
mnb = MultinomialNB()
mnb.fit(X_train.values, y_train.values.flatten())
y_mnb = mnb.predict(X_valid)

In [10]:
print(confusion_matrix(y_valid, y_mnb))
tn, fp, fn, tp = confusion_matrix(y_valid, y_mnb).ravel()
print ("TP : {} \nTN : {} \nFP : {}\nFN : {}".format(tp, tn, fp, fn))

[[ 207  153]
 [3245 2949]]
TP : 2949 
TN : 207 
FP : 153
FN : 3245


In [11]:
# rec = tp/(tp + fn)
# prec = tp/(tp + fp)

In [12]:
target_names = ['Class 0', 'Class 1']
print(classification_report(y_valid, y_mnb, target_names=target_names))

              precision    recall  f1-score   support

     Class 0       0.06      0.57      0.11       360
     Class 1       0.95      0.48      0.63      6194

    accuracy                           0.48      6554
   macro avg       0.51      0.53      0.37      6554
weighted avg       0.90      0.48      0.61      6554



<p style="text-align: justify"> 
The macro average treats every class equally. It is simply $\frac{\sum_{i=1}^{n}{f(c_i)}}{n}$

The weighted average weights each class by its share of the samples: $\sum_{i=1}^{n}{f(c_i) \times P(c_i)}$

Since our positive class is 1, let's take a look at its performance in the report above: 0.95 precision, 0.48 recall, 0.63 F1, with an overall accuracy of 0.48. 
In this case, we don't want anyone who actually has no access to be mistakenly classified as having access. This means we want to reduce the number of false positives, so we use a metric that penalizes the model every time it gives a false positive: precision. The precision for class 1 looks good, but keep in mind that class 1 makes up about 95% of the validation set (6194 of 6554 rows), so a model that always predicts 1 would reach a similar precision. 

If you still find it difficult to choose the appropriate metric, please check our article: [Recall or Precision and when to use it](google.com)
</p>
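
The numbers in the classification report can be recomputed by hand from the confusion matrix printed earlier. The sketch below hard-codes those four counts and shows how the macro and weighted averages differ; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 207, 153, 3245, 2949

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
precision_0 = tn / (tn + fn)   # class 0 treated as the "positive" class
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Support = number of true samples in each class
support_0, support_1 = tn + fp, tp + fn

# Macro average: plain mean over classes
macro_precision = (precision_0 + precision_1) / 2
# Weighted average: each class weighted by its support
weighted_precision = (
    support_0 * precision_0 + support_1 * precision_1
) / (support_0 + support_1)

print(f"precision(1)={precision_1:.2f} recall(1)={recall_1:.2f} accuracy={accuracy:.2f}")
# precision(1)=0.95 recall(1)=0.48 accuracy=0.48
print(f"macro precision={macro_precision:.2f} weighted precision={weighted_precision:.2f}")
# macro precision=0.51 weighted precision=0.90
```

The weighted precision is pulled towards class 1 because that class has far more samples, which is why it looks much better than the macro average.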

## Gaussian Naive Bayes

<p style="text-align: justify">
![](res/Titanic.png)


You might ask how to calculate the probability of a non-categorical, continuous value. One way is to convert the continuous number into a categorical one. The most common way to do this is binning: we create several ranges, and a value belongs to whichever range it falls in. For example, we can bin the data into 4 categories based on quartiles: a bin for values below Q1, a bin for values from Q1 to Q2, a bin for values from Q2 to Q3, and a bin for values above Q3. Then we can do the classification using Multinomial Naive Bayes. 

The second way is to replace the probability function with the Gaussian function. This means we assume that the **values are normally distributed**. Going back to an introductory statistics course, the probability density of a value x under a normal distribution is given by the formula 

$$
\begin{equation}
    f(x,\mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \times e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\tag{5}
\end{equation}
$$

where $x$ is the predictor value, and $\mu$ and $\sigma^2$ are the mean and variance of the predictor (remember that they are computed separately for each class, because the population changes with the condition). That is the only difference. Let's implement Gaussian Naive Bayes classification using the most famous dataset for beginner data scientists: [Titanic : Machine Learning From Disaster](https://www.kaggle.com/c/titanic/data)

</p>
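
Equation (5) translates directly into code. The following is a minimal sketch, not scikit-learn's implementation: the `ages` values are hypothetical, and the variance is the plain population variance without any extra smoothing term.

```python
import math

def gaussian_pdf(x, mu, var):
    """Equation (5): density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def mean_and_variance(values):
    """Mean and population variance of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

# Hypothetical ages per class, used only to illustrate the per-class estimates
ages = {0: [22.0, 35.0, 54.0, 40.0], 1: [38.0, 26.0, 35.0, 27.0]}

x = 30.0
for label, values in ages.items():
    mu, var = mean_and_variance(values)
    # Likelihood of observing age x if the passenger belongs to this class
    print(label, round(mu, 2), round(var, 2), gaussian_pdf(x, mu, var))
```

The key point is that the mean and variance are estimated once per class, so the same value of `x` gets a different likelihood under each class.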

In [13]:
titanic = pd.read_csv("data/titanic/train.csv")
test = pd.read_csv("data/titanic/test.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [14]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

In [69]:
titanic = pd.read_csv("data/titanic/train.csv")
test = pd.read_csv("data/titanic/test.csv")

# Convert categorical variable to numeric
titanic["Sex_cleaned"]=np.where(titanic["Sex"]=="male",0,1)
titanic["Embarked_cleaned"]=np.where(titanic["Embarked"]=="S",0,
                                  np.where(titanic["Embarked"]=="C",1,
                                           np.where(titanic["Embarked"]=="Q",2,3)
                                          )
                                 )
test["Sex_cleaned"]=np.where(test["Sex"]=="male",0,1)
test["Embarked_cleaned"]=np.where(test["Embarked"]=="S",0,
                                  np.where(test["Embarked"]=="C",1,
                                           np.where(test["Embarked"]=="Q",2,3)
                                          )
                                 )

# Keep only the model columns and drop training rows that contain NaN
titanic=titanic[[
    "Survived",
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]].dropna(axis=0, how='any')

# Fill the test set's NaN with 0, because the model can't evaluate NaN values
test=test[[
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]].fillna(value=0)

# Split the dataset into training and validation sets
train, valid = train_test_split(titanic, test_size=0.2, random_state=int(time.time()))


# Instantiate the classifier
gnb = GaussianNB()

used_features =[
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]

X_train = train.loc[:, used_features]
y_train = train.loc[:, 'Survived']
X_valid = valid.loc[:, used_features]
y_valid = valid.loc[:, 'Survived']

In [70]:
gnb.fit(
    X_train.values,
    y_train
)
y_gnb = gnb.predict(X_valid)

In [71]:
target_names = ['Not Survived', 'Survived']
print(classification_report(y_valid, y_gnb, target_names=target_names))

              precision    recall  f1-score   support

Not Survived       0.79      0.84      0.82        82
    Survived       0.77      0.70      0.74        61

    accuracy                           0.78       143
   macro avg       0.78      0.77      0.78       143
weighted avg       0.78      0.78      0.78       143



<p style="text-align: justify">
Now that we have trained our model on the training data and checked it on the validation data, let's classify the test data and see if the model can make it to the top 10% of the competition!
</p>

In [87]:
y_gnb_test = gnb.predict(test[used_features])
submission = pd.read_csv('data/titanic/test.csv')
submission['Survived'] = y_gnb_test

In [88]:
submission = submission[['PassengerId', 'Survived']]
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,1
3,895,0
4,896,1


In [86]:
submission.to_csv('submission.csv', index=False)

You have generated the submission file; now upload it and see the result! For your convenience, I have already uploaded it for you. Here's the result:

![](res/titanic-gnb.PNG)

70% accuracy is not bad, right? Wrong. This is a competition, where model interpretability is not the priority. We can build a more complex model later that could give better performance. 

Stay tuned!

## Bernoulli Naive Bayes

<p style="text-align: justify">
The Bernoulli Naive Bayes classifier is best used for **binary data**. This algorithm measures the probability of binary data using the Bernoulli distribution. Just like Multinomial Naive Bayes, it is commonly used for text classification. The difference is that, instead of using "frequency", this algorithm is best suited for data with binary values such as "the document contains word A or not". The probability is computed as shown below:
    </p>

$$
\begin{equation}
    P(x_i = v | c_k ) = P(x_i|c_k)\times v + (1-P(x_i|c_k)) \times (1-v)
\tag{6}
\end{equation}
$$

where 
- $v \in \{0,1\}$ denotes the occurrence of variable $x_i$ in the current document
- $P(x_i|c_k)$ is the proportion of samples in class $c_k$ that contain the variable $x_i$ (where $x_i = 1$).

Unlike Multinomial Naive Bayes, this algorithm takes the absence of a variable into account.
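
Equation (6) can be sketched as a one-line function. The probabilities and the document below are hypothetical values chosen only to show how presence and absence both contribute to the likelihood.

```python
def bernoulli_likelihood(v, p):
    """Equation (6): returns p when v == 1 and 1 - p when v == 0."""
    return p * v + (1 - p) * (1 - v)

# Hypothetical per-word probabilities P(x_i = 1 | c_k) for one class
p_word_given_class = [0.8, 0.1, 0.5]
# A document that contains the first word only
document = [1, 0, 0]

likelihood = 1.0
for v, p in zip(document, p_word_given_class):
    # Absent words (v == 0) contribute the factor 1 - p
    likelihood *= bernoulli_likelihood(v, p)

print(likelihood)
```

Here the likelihood is $0.8 \times (1 - 0.1) \times (1 - 0.5)$, so the two absent words still change the result.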

Let's reuse the data from the Multinomial Naive Bayes section, [Amazon Employee Access](https://www.kaggle.com/c/amazon-employee-access-challenge/data).

In [21]:
employee = pd.read_csv('data/amazon-employee-access-challenge/train.csv')
dummy_employee = pd.get_dummies(employee.loc[:, employee.columns != 'ACTION'].astype(str))
dummy_employee['ACTION'] = employee['ACTION']

In [28]:
train, valid = train_test_split(dummy_employee, test_size=0.2, random_state=int(time.time()))
X_train = train.drop(columns=['ACTION'])
y_train = train.loc[:, 'ACTION']
X_valid = valid.drop(columns=['ACTION'])
y_valid = valid.loc[:, 'ACTION']

In [30]:
print(X_train.shape, y_train.shape,  X_valid.shape, y_valid.shape)

(26215, 15626) (26215,) (6554, 15626) (6554,)


In [25]:
train.head()

Unnamed: 0,RESOURCE_0,RESOURCE_100003,RESOURCE_100028,RESOURCE_100031,RESOURCE_100037,RESOURCE_100038,RESOURCE_100039,RESOURCE_1001,RESOURCE_1003,RESOURCE_100413,...,ROLE_CODE_216827,ROLE_CODE_239004,ROLE_CODE_240105,ROLE_CODE_247660,ROLE_CODE_254396,ROLE_CODE_258436,ROLE_CODE_266863,ROLE_CODE_268610,ROLE_CODE_270691,ACTION
3581,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3768,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
25132,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
30253,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
28167,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


<p style="text-align: justify">
In the example above, we converted the dataframe to a one-hot encoding to make it binary. What you need to watch is that the predictors of the training, validation, and test data must be the same. 

In the previous section (Gaussian NB) we selected the same features from the test data as from the training and validation data. Here we will not try to match the features of the test data, so we can't use the model to predict the test set: matching them is not as simple as one-hot encoding the test data, because the resulting columns depend on which values appear in it.

</p>
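
One possible way to align the columns, shown here only as a sketch and not used in this notebook, is to one-hot encode the test data separately and then reindex it to the training columns with `DataFrame.reindex`. The two miniature frames below are hypothetical.

```python
import pandas as pd

# Hypothetical miniature train and test frames with one categorical column
train_raw = pd.DataFrame({"ROLE_CODE": ["117908", "118539", "117880"]})
test_raw = pd.DataFrame({"ROLE_CODE": ["118539", "999999"]})  # 999999 is unseen

train_dummies = pd.get_dummies(train_raw, dtype=int)
test_dummies = pd.get_dummies(test_raw, dtype=int)

# Keep exactly the training columns: unseen categories are dropped,
# categories missing from the test set become all-zero columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_dummies.columns))  # True
print(test_aligned)
```

A row whose value was never seen in training ends up with zeros in every column, so the model can only fall back on the other predictors and the class prior for it.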

In [36]:
bnb = BernoulliNB()

In [37]:
bnb.fit(
    X_train.values,
    y_train
)
y_bnb = bnb.predict(X_valid)

In [35]:
target_names = ['Class 0', 'Class 1']
print(classification_report(y_valid, y_bnb, target_names=target_names))

              precision    recall  f1-score   support

     Class 0       0.38      0.10      0.15       369
     Class 1       0.95      0.99      0.97      6185

    accuracy                           0.94      6554
   macro avg       0.66      0.54      0.56      6554
weighted avg       0.92      0.94      0.92      6554



## Summary

### Pros :

- Computationally cheap: training needs only a single pass over the data to count or average each feature per class
- Easy to implement
- Works well with small datasets
- Works well with high dimensions
- Performs well even if the naive assumption is not perfectly met. In many cases, the approximation is enough to build a good classifier.

### Cons : 
- Correlated features should be removed, because they are effectively counted twice in the model, which can over-inflate their importance.


Sources : 

Use of machine learning :
[google](https://developers.google.com/machine-learning/problem-framing/cases)

Naive Bayes:
[Nb1](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Gaussian-Naive-Bayes), 
[sicara](https://www.sicara.ai/blog/2018-02-28-naive-bayes-classification-sklearn)
[kamalnigam](www.kamalnigam.com/papers/multinomial-aaaiws98.pdf)
[Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Bernoulli_naive_Bayes)


