# Naive Bayes
## Titanic data

This notebook fits Naive Bayes classifiers to the Titanic data to predict passenger survival from passenger class, sex, and age.



In [1]:
### load the data
import pandas as pd
df = pd.read_csv('data/titanic3.csv', usecols=['pclass', 'survived', 'sex', 'age'])
print(df.head())
print('\nDimensions of data frame:', df.shape)

   pclass  survived     sex      age
0       1         1  female  29.0000
1       1         1    male   0.9167
2       1         0  female   2.0000
3       1         0    male  30.0000
4       1         0  female  25.0000

Dimensions of data frame: (1309, 4)


In [2]:
# encode categorical columns as integer codes (the pandas analogue of factors)
df.survived = df.survived.astype('category').cat.codes
df.pclass = df.pclass.astype('category').cat.codes
df.sex = df.sex.astype('category').cat.codes
df.head()

   pclass  survived  sex      age
0       0         1    0  29.0000
1       0         1    1   0.9167
2       0         0    0   2.0000
3       0         0    1  30.0000
4       0         0    0  25.0000


In [3]:
# count missing values

df.isnull().sum()

pclass        0
survived      0
sex           0
age         263
dtype: int64

In [4]:
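# fill missing ages with the mean age
# (assign the result back; calling fillna with inplace=True on a column
# accessed from the frame is not reliable in recent pandas versions)
age_mean = df['age'].mean()
df['age'] = df['age'].fillna(age_mean)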

In [5]:
# train test split
from sklearn.model_selection import train_test_split

X = df.loc[:, ['pclass', 'age', 'sex']]
y = df.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print('train size:', X_train.shape)
print('test size:', X_test.shape)

train size: (1047, 3)
test size: (262, 3)


In [6]:
# fit a multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, y_train)
# accuracy on the training set (not the test set)
clf.score(X_train, y_train)

0.7277936962750716

In [7]:
# make predictions

pred = clf.predict(X_test)

In [8]:
# evaluate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.6641221374045801
precision score:  0.62
recall score:  0.31
f1 score:  0.4133333333333334


In [9]:
# confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred)

array([[143,  19],
       [ 69,  31]])
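
The accuracy, precision, recall, and F1 values printed earlier can be recovered by hand from this matrix. The sketch below hard-codes the four counts from the output above and assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels; the variable names are chosen here for illustration.

```python
import numpy as np

# Counts copied from the confusion matrix above
# (rows: true class 0/1, columns: predicted class 0/1)
cm = np.array([[143, 19],
               [69, 31]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions / all predictions
precision = tp / (tp + fp)           # predicted survivors who did survive
recall = tp / (tp + fn)              # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy:  {accuracy:.4f}')   # 0.6641
print(f'precision: {precision:.4f}')  # 0.6200
print(f'recall:    {recall:.4f}')     # 0.3100
print(f'f1:        {f1:.4f}')         # 0.4133
```

The low recall comes from the 69 survivors in the bottom-left cell that the model predicted as non-survivors.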

In [10]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.67      0.88      0.76       162
           1       0.62      0.31      0.41       100

    accuracy                           0.66       262
   macro avg       0.65      0.60      0.59       262
weighted avg       0.65      0.66      0.63       262



### Try Bernoulli NB instead of Multinomial

Both the Multinomial and Bernoulli models handle discrete data. The difference is that MultinomialNB models per-class feature counts (frequencies), whereas BernoulliNB models binary (0/1) features. In the Titanic data, sex is binary, pclass is categorical with 3 levels, and age is continuous, which is commonly modeled as Gaussian. The sklearn library provides a model for each of these 3 types of data. See the documentation:

* [BernoulliNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)
* [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
* [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
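
As a rough illustration of the third model, the sketch below fits GaussianNB to a single continuous feature. The data are synthetic stand-ins generated here, not the Titanic ages, and the names `X_demo`, `y_demo`, and `gnb` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic continuous feature: two classes drawn from normal
# distributions with different means (assumed values, for illustration)
x0 = rng.normal(loc=25.0, scale=5.0, size=200)
x1 = rng.normal(loc=45.0, scale=5.0, size=200)
X_demo = np.concatenate([x0, x1]).reshape(-1, 1)
y_demo = np.array([0] * 200 + [1] * 200)

gnb = GaussianNB()
gnb.fit(X_demo, y_demo)

# theta_ holds the estimated mean of each feature for each class
print(gnb.theta_)
print(gnb.predict([[20.0], [50.0]]))
```

GaussianNB estimates a mean and variance per class for each feature, so it needs no binning or binarization of a continuous predictor.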

In [11]:
from sklearn.naive_bayes import BernoulliNB

clf2 = BernoulliNB()
clf2.fit(X_train, y_train)
pred2 = clf2.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred2))
print('precision score: ', precision_score(y_test, pred2))
print('recall score: ', recall_score(y_test, pred2))
print('f1 score: ', f1_score(y_test, pred2))
print(classification_report(y_test, pred2))

accuracy score:  0.7786259541984732
precision score:  0.7560975609756098
recall score:  0.62
f1 score:  0.6813186813186813
              precision    recall  f1-score   support

           0       0.79      0.88      0.83       162
           1       0.76      0.62      0.68       100

    accuracy                           0.78       262
   macro avg       0.77      0.75      0.76       262
weighted avg       0.78      0.78      0.77       262



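The Bernoulli model outperformed the Multinomial model on the test set: accuracy rose from 0.66 to 0.78, and recall for survivors rose from 0.31 to 0.62. BernoulliNB binarizes its predictors; with the default setting (`binarize=0.0`), every value greater than 0 becomes 1 and everything else becomes 0. The sex predictor is already binary. The pclass predictor has 3 levels, coded 0, 1, and 2, so it is binarized into first class versus second or third class. The age predictor is thresholded at 0 as well, and since ages are positive it becomes 1 for every passenger and carries no information; splitting age into above/below a meaningful cutoff would require binning it before fitting.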
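
To make the binarization concrete, the sketch below applies the same thresholding rule that BernoulliNB uses with its default `binarize=0.0` (values greater than the threshold become 1). The three rows are illustrative values in the notebook's encoding, not rows taken from the data.

```python
import numpy as np

# Illustrative rows: columns are pclass code, age, sex code
X_demo = np.array([
    [0, 29.0,   0],
    [1, 0.9167, 1],
    [2, 30.0,   1],
])

threshold = 0.0  # BernoulliNB's default `binarize` value
X_bin = (X_demo > threshold).astype(int)
print(X_bin)
# [[0 1 0]
#  [1 1 1]
#  [1 1 1]]
```

The age column is 1 in every row, and pclass codes 1 and 2 collapse into the same value.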

Determining beforehand which algorithm to use is difficult, which is why two different algorithms were tried here. Beyond how they treat the input data, the two algorithms also differ in their likelihood models. MultinomialNB scores the counts observed for each feature, whereas BernoulliNB models both the presence *and* the absence of each feature, so an absent feature also counts as evidence.
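
The difference between the two likelihoods can be sketched directly. The functions below are simplified, hypothetical helpers written for this illustration (they omit smoothing and class priors and are not scikit-learn's implementation), and the probabilities are made-up values.

```python
import numpy as np

def bernoulli_log_likelihood(x, p):
    """log P(x | class) for binary features, where feature i is 1 with probability p[i]."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    # Present features contribute log(p); absent features contribute log(1 - p)
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def multinomial_log_likelihood(x, theta):
    """log P(x | class) up to an additive constant, for counts x and feature probabilities theta."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # Only features with a nonzero count contribute
    return float(np.sum(x * np.log(theta)))

p = [0.9, 0.5, 0.2]      # hypothetical per-feature presence probabilities
theta = [0.6, 0.3, 0.1]  # hypothetical feature probabilities (sum to 1)

x_absent = [0, 0, 0]     # no feature present
print(bernoulli_log_likelihood(x_absent, p))        # negative: absences are penalized
print(multinomial_log_likelihood(x_absent, theta))  # zero: absences are ignored
```

In the Bernoulli form, a class that usually has a feature is penalized when that feature is missing; in the multinomial form, a zero count simply drops out of the sum.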