# Naive Bayes Classifier

The Naive Bayes classifier applies Bayes' theorem under the assumption that, given the class, each feature is independent of every other feature.

$$
P(h|D) = \frac{P(D|h) P(h)}{P(D)}
$$

- P(h): the probability of hypothesis h being true, regardless of the data. This is known as the prior probability of h.
- P(D): the probability of the data, regardless of the hypothesis. This is known as the evidence (or marginal likelihood).
- P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
- P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
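
As a minimal sketch of how these four quantities combine, the snippet below computes a posterior by hand for a toy two-class problem with two binary features. The probabilities and the feature names are made-up illustrative values, not estimates from the Titanic data. The likelihood P(D|h) is factorized per feature, which is the naive independence assumption.

```python
# Toy illustration of Bayes' theorem with the naive independence assumption.
# All numbers below are hypothetical, chosen only for illustration.

priors = {"survived": 0.4, "died": 0.6}  # P(h)

# P(feature = 1 | h) for two binary features (hypothetical values)
likelihoods = {
    "survived": {"male": 0.3, "first_class": 0.4},
    "died": {"male": 0.8, "first_class": 0.1},
}

def posterior(observation):
    """Return P(h | D) for each hypothesis h, given a dict of 0/1 features."""
    joint = {}
    for h, prior in priors.items():
        p = prior
        for feature, value in observation.items():
            p_one = likelihoods[h][feature]
            # Naive assumption: multiply the per-feature likelihoods
            p *= p_one if value == 1 else 1.0 - p_one
        joint[h] = p  # P(D | h) * P(h)
    evidence = sum(joint.values())  # P(D), summed over all hypotheses
    return {h: p / evidence for h, p in joint.items()}

post = posterior({"male": 1, "first_class": 0})
print(post)
```

The evidence P(D) only rescales the scores so that they sum to 1. The predicted class is the same whether or not the division is performed.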

In [3]:
import pickle as pkl

with open('../data/titanic_tansformed.pkl', 'rb') as f:
    df_data = pkl.load(f)

In [4]:
df_data.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,2,3,male,Q,S
0,0,22.0,1,0,7.25,0,1,1,0,1
1,1,38.0,1,0,71.2833,0,0,0,0,0
2,1,26.0,0,0,7.925,0,1,0,0,1
3,1,35.0,1,0,53.1,0,0,0,0,1
4,0,35.0,0,0,8.05,0,1,1,0,1


In [5]:
df_data.shape

(889, 10)

In [6]:
data = df_data.drop("Survived",axis=1)
label = df_data["Survived"]

In [7]:
from sklearn.model_selection import train_test_split  
data_train, data_test, label_train, label_test = train_test_split(data, label, test_size = 0.2, random_state = 101)

In [8]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
import time

# GaussianNB models each feature as normally distributed within each class
tic = time.time()
nb_cla = GaussianNB()
nb_cla.fit(data_train, label_train)
print('Time taken for training Naive Bayes', (time.time()-tic), 'secs')

# Evaluate on the held-out test set
predictions = nb_cla.predict(data_test)
print('Accuracy', nb_cla.score(data_test, label_test))

# Rows of the confusion matrix are true classes, columns are predicted classes
print(confusion_matrix(label_test, predictions))
print(classification_report(label_test, predictions))

Time taken for training Naive Bayes 0.0013301372528076172 secs
Accuracy 0.8202247191011236
[[96 11]
 [21 50]]
             precision    recall  f1-score   support

          0       0.82      0.90      0.86       107
          1       0.82      0.70      0.76        71

avg / total       0.82      0.82      0.82       178
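
The figures in the report can be recomputed directly from the confusion matrix printed above. The sketch below does that arithmetic in plain Python for class 1, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = [[96, 11],
      [21, 50]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)  # of samples predicted as class 1, fraction truly class 1
recall_1 = tp / (tp + fn)     # of samples truly class 1, fraction predicted as class 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.8202 precision=0.82 recall=0.70 f1=0.76
```

These values agree with the accuracy line and with the class 1 row of the classification report.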



## Multinomial Naive-Bayes
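- Used when the features are discrete counts, for example word counts in text classification
- `MultinomialNB` requires non-negative feature values. The Titanic features include continuous columns such as Age and Fare, which are not counts, and this may explain the lower accuracy reported below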
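
To illustrate the kind of input this variant expects, here is a minimal from-scratch sketch of multinomial Naive Bayes on a tiny made-up count matrix. The data, the function names `fit_multinomial` and `predict_multinomial`, and the smoothing value `alpha=1.0` are illustrative assumptions. The code is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_multinomial(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, count in enumerate(row):
            feature_counts[label][j] += count
    model = {}
    for label, n in class_counts.items():
        # Additive (Laplace) smoothing avoids zero probabilities
        total = sum(feature_counts[label]) + alpha * n_features
        log_probs = [math.log((c + alpha) / total) for c in feature_counts[label]]
        model[label] = (math.log(n / len(y)), log_probs)
    return model

def predict_multinomial(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        label: log_prior + sum(c * lp for c, lp in zip(row, log_probs))
        for label, (log_prior, log_probs) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical count features: each column could be the count of one word
X = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]]
y = ["a", "a", "b", "b"]

model = fit_multinomial(X, y)
print(predict_multinomial(model, [2, 0, 1]))  # a
print(predict_multinomial(model, [0, 2, 1]))  # b
```

Each feature value acts as an exponent on its class-conditional probability, which is why the model is a natural fit for counts and a poor fit for measurements such as age or fare.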


In [9]:
from sklearn.naive_bayes import MultinomialNB
import time

tic = time.time()
nb_cla = MultinomialNB()
nb_cla.fit(data_train,label_train)
print('Time taken for training Naive Bayes', (time.time()-tic), 'secs')

predictions = nb_cla.predict(data_test)
print('Accuracy', nb_cla.score(data_test, label_test))

Time taken for training Naive Bayes 0.15154290199279785 secs
Accuracy 0.7303370786516854


## Bernoulli Naive-Bayes
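- Used when all features are binary (0/1)
- scikit-learn's `BernoulliNB` binarizes its input by default (`binarize=0.0`), so non-binary columns such as Age and Fare are mapped to 1 when greater than 0 and to 0 otherwise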
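
The following is a minimal from-scratch sketch of the Bernoulli likelihood, written only for illustration. The `binarize` helper is a stand-in for thresholding at 0, and the per-class probabilities are hypothetical values rather than estimates from the data.

```python
import math

def binarize(row, threshold=0.0):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

def bernoulli_log_likelihood(row, probs):
    """log P(row | class) for binary features with P(x_j = 1 | class) = probs[j]."""
    return sum(
        math.log(p) if x == 1 else math.log(1.0 - p)
        for x, p in zip(row, probs)
    )

# Age, SibSp, Parch, Fare taken from the first row of the table above
raw = [22.0, 1, 0, 7.25]
binary = binarize(raw)
print(binary)  # [1, 1, 0, 1]

# Hypothetical per-class probabilities P(x_j = 1 | class)
probs_by_class = {0: [0.9, 0.4, 0.2, 0.9], 1: [0.9, 0.5, 0.5, 0.9]}
scores = {c: bernoulli_log_likelihood(binary, p) for c, p in probs_by_class.items()}
print(scores)
```

After thresholding at 0, any positive Age or Fare collapses to 1, so those columns carry little information in this model. Unlike the multinomial variant, the Bernoulli likelihood also penalizes the absence of a feature through the `1 - p` term.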

In [10]:
from sklearn.naive_bayes import BernoulliNB
import time

tic = time.time()
nb_cla = BernoulliNB()
nb_cla.fit(data_train,label_train)
print('Time taken for training Naive Bayes', (time.time()-tic), 'secs')

predictions = nb_cla.predict(data_test)
print('Accuracy', nb_cla.score(data_test, label_test))

Time taken for training Naive Bayes 0.0020689964294433594 secs
Accuracy 0.8033707865168539
