# Titanic Bayes
Using the Titanic dataset, clean up the data (handle missing values either by removal or filling, and transforming non-numerical data into number values) and then build Gaussian and Bernoulli Naive Bayes models to predict Titanic passengers' survival status (1=survived, 0=did not survive). Compare the two models against each other. Did one model perform better than the other? How does the performance of these two models compare to the other classification algorithms, logistic regression and decision trees?

For a bonus challenge, try different methods of preparing your data (cleaning, choosing rows/columns) to see if that affects your results.

*To see an example of predictive output of the logistic regression and decision trees, run the code in the notebooks for the Lv 1 Module 8: Logistic Regression and Module 9: Decision Trees notebooks (Links to an external site.).

Upload your Jupyter notebook to Github and submit the URL to turn in this assignment.

In [4]:
import pandas as pd
import numpy as np

In [5]:
filename="datasets/titanic.xls"
df=pd.read_excel(filename)

In [6]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [8]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [17]:
#transform sex column to binary values (0,1)
df['sex']=df['sex'].map({'female':0,'male':1})

In [20]:
df['age']=df.fillna(df['age'].mean())

In [24]:
df=df.drop('name',axis=1)

In [26]:
df=df.drop('embarked',axis=1)

In [27]:
df=df.drop('body',axis=1)

In [31]:
df=df.drop('boat',axis=1)

In [32]:
df=df.drop('cabin',axis=1)

In [33]:
df=df.drop('home.dest',axis=1)

In [35]:
df['fare']=df.fillna(df['fare'].mean())

In [40]:
df['age']=df['age'].astype(str).astype(int)

In [42]:
df['fare']=df['fare'].astype(str).astype(int)

In [44]:
df=df.drop('ticket',axis=1)

In [45]:
df.count()

pclass      1309
survived    1309
sex         1309
age         1309
sibsp       1309
parch       1309
fare        1309
dtype: int64

In [46]:
df.corr()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare
pclass,1.0,-0.312469,0.124617,1.0,0.060832,0.018322,1.0
survived,-0.312469,1.0,-0.528693,-0.312469,-0.027825,0.08266,-0.312469
sex,0.124617,-0.528693,1.0,0.124617,-0.109609,-0.213125,0.124617
age,1.0,-0.312469,0.124617,1.0,0.060832,0.018322,1.0
sibsp,0.060832,-0.027825,-0.109609,0.060832,1.0,0.373587,0.060832
parch,0.018322,0.08266,-0.213125,0.018322,0.373587,1.0,0.018322
fare,1.0,-0.312469,0.124617,1.0,0.060832,0.018322,1.0


In [47]:
X=df.drop('survived',axis=1)
y=df['survived']

In [48]:
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=109)

In [49]:
gnb=GaussianNB()

In [50]:
gnb.fit(X_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [51]:
y_pred=gnb.predict(X_test)

In [52]:
cm=pd.DataFrame(confusion_matrix(y_test,y_pred),columns=['Predicted Failed','Predicted Passed'],index=['True Failed','True Passed'])
cm

Unnamed: 0,Predicted Failed,Predicted Passed
True Failed,152,48
True Passed,42,86


In [60]:
gnb.score(X_test,y_test)

0.725609756097561

In [53]:
from sklearn.naive_bayes import BernoulliNB

In [54]:
bnb=BernoulliNB()

In [55]:
bnb.fit(X_train,y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [56]:
bnb.score(X_train,y_train)

0.7849133537206932

In [57]:
y_pred=bnb.predict(X_test)

In [59]:
cm=pd.DataFrame(confusion_matrix(y_test,y_pred),columns=['Predicted Failed','Predicted Passed'],
               index=['True Failed','True Passed'])
cm

Unnamed: 0,Predicted Failed,Predicted Passed
True Failed,170,30
True Passed,47,81


In [61]:
bnb.score(X_test,y_test)

0.7652439024390244

The Bernoulli model is better at determining the survival rate based on the overall score.  One way to improve the results could be to look at fewer features, for instance only sex and age.  

The overall score of the Logistic Regression is higher than the Bernoulli and Gaussian. (78%)

The overall score of the Decision Tree method is higher than the Bernoulli and Gaussian (80%).  