<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_Naive_Bayes-Titanic-Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NB Titanic Tutorial
Converted to Python by Matthew Pecsok from Dr. Olivia Sheng's original tutorial in R
June 12, 2021

1 Data description

2 Library Setup

3 Overall data inspection

4 NB model building using sklearn package

5 Explanatory data exploration

6 Generate performance metrics

7 Simple hold-out evaluation


# 1 Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

VARIABLE DESCRIPTIONS:

PassengerID - Unique passenger identifier

Survived - Survival (0 = No; 1 = Yes)

Pclass - Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd). Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Name

Sex

Age - Age in years; fractional if the age is less than one. If the age is estimated, it is in the form xx.5

Sibsp - Number of siblings/spouses aboard

Parch - Number of parents/children aboard

Ticket - Ticket number

Fare - Passenger fare

Cabin

Embarked - Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


# 2 Library Setup

https://scikit-learn.org/stable/modules/naive_bayes.html

In [None]:
import pandas as pd
import numpy as np

import os

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import GaussianNB

import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn import metrics


# 3 Overall data inspection

In [None]:
titanic = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv")
titanic_orig = titanic.copy()

In [None]:
type(titanic)

The data was loaded with pd.read_csv, so it is already a pandas DataFrame.

In [None]:
titanic.shape

In [None]:
titanic.columns

Inspect the column types and non-null counts of the dataframe for exploratory data analysis

In [None]:
titanic.info()

In [None]:
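# keep only the target and the categorical feature columns
# Cabin is also dropped as it currently causes issues when splitting
# .copy() gives an independent frame, so later edits do not raise SettingWithCopyWarning
titanic = titanic[['Survived','Sex','Embarked','Pclass']].copy()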

In [None]:
titanic

In [None]:
titanic.info()

In [None]:
titanic.describe(include='all')

In [None]:
titanic.head(2)

## 3.1 Data Type Update Notes

One key change in this notebook, which was not explained in the video, is that we use a single categorical model for Naive Bayes. The sklearn package doesn't support a single model that handles both continuous data (e.g. 1.3, 2.5, 1.98), which calls for a Gaussian model, and categorical data such as A, B, C. To reproduce the predictions of the R package e1071, which can handle multiple variable types, two models would need to be fit. The root cause is that in R, factor variables are created in the data transformation phase, and the model then uses that information to apply a Gaussian model to continuous data and a categorical model to categorical data. In Python we do not code factor data in quite the same way, so the models do not support this out of the box.

When both kinds of features are needed, the way to resolve this is to predict with both models on their respective data and then do a tiny bit of math using Bayes' formula to get back to a single classification prediction. In this notebook we keep only categorical features, so a single CategoricalNB model is enough.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html
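
The "tiny bit of math" mentioned above can be sketched as follows. Under the naive Bayes independence assumption, P(y | x_num, x_cat) is proportional to P(y | x_num) * P(y | x_cat) / P(y), because each model's posterior already contains the class prior once. This is only a minimal sketch on synthetic data, not the Titanic data; X_num, X_cat and the other names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data (assumption): two continuous and two binary categorical features
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_num = rng.normal(loc=y_demo[:, None] * 1.5, scale=1.0, size=(n, 2))
X_cat = (rng.random((n, 2)) < (0.3 + 0.4 * y_demo[:, None])).astype(int)

# Fit one model per feature type
gnb_demo = GaussianNB().fit(X_num, y_demo)
cnb_demo = CategoricalNB().fit(X_cat, y_demo)

# Add the two log posteriors and subtract the log prior that was counted twice
log_prior = np.log(gnb_demo.class_prior_)
log_joint = (
    gnb_demo.predict_log_proba(X_num)
    + cnb_demo.predict_log_proba(X_cat)
    - log_prior
)

# Normalize back to probabilities (subtract the row max for numerical stability)
log_joint -= log_joint.max(axis=1, keepdims=True)
combined_proba = np.exp(log_joint)
combined_proba /= combined_proba.sum(axis=1, keepdims=True)

# Final class prediction from the combined probabilities
combined_pred = gnb_demo.classes_[combined_proba.argmax(axis=1)]
print(combined_proba[:5])
print(combined_pred[:5])
```

Working in log space avoids underflow when many small probabilities are multiplied together.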


In [None]:
y = titanic.pop('Survived')

## 3.2 Dummy encode the data

In [None]:
# convert Pclass to string so that it is treated as a categorical column when dummy encoding
titanic['Pclass'] = titanic['Pclass'].astype(str)
titanic.dtypes


In [None]:
titanic_enc = pd.get_dummies(titanic)

In [None]:
titanic_enc.dtypes
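
To see what dummy encoding produces, here is a small sketch on a hypothetical three-row frame. The values are made up, and dtype=int is passed only so that the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": ["3", "1", "3"],
})

# Each distinct value becomes its own indicator column named <column>_<value>
toy_enc = pd.get_dummies(toy, dtype=int)
print(toy_enc)
```

Each row has exactly one 1 among the columns that came from the same original variable.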

## 3.3 Build a predictive model with ONLY categorical features, return the predictions


In [None]:
cnb = CategoricalNB() # create a categorical naive Bayes model

cnb.fit(titanic_enc,y)
cnb_pred_proba = cnb.predict_proba(titanic_enc) # predict and get probabilities value

cnb_pred_proba[0:10]

# 4 NB model building using sklearn package

In [None]:
cnb_pred = cnb.predict(titanic_enc) # predict and get the class labels

cnb_pred[0:10]

In [None]:
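# class_prior is only the constructor argument (None by default);
# the fitted priors are stored as logs in class_log_prior_
print(np.exp(cnb.class_log_prior_))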
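
To make it concrete what the fitted model stores, the following sketch reproduces a CategoricalNB prediction by hand on a made-up toy dataset with a single binary feature. It assumes the default Laplace smoothing (alpha=1), where each conditional probability is (count + alpha) / (class count + alpha * number of categories). The names X_toy, y_toy and toy_model are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data (assumption): one categorical feature with values 0/1, binary target
X_toy = np.array([[0], [0], [1], [1], [1], [0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

alpha = 1.0          # Laplace smoothing, the sklearn default
n_categories = 2     # the feature takes the values 0 and 1
classes = np.unique(y_toy)

# Prior P(y = c): share of each class in the training data
prior = np.array([(y_toy == c).mean() for c in classes])

# Smoothed likelihood P(x = t | y = c), rows = classes, columns = categories
likelihood = np.array([
    [((X_toy[y_toy == c, 0] == t).sum() + alpha)
     / ((y_toy == c).sum() + alpha * n_categories)
     for t in range(n_categories)]
    for c in classes
])

# Posterior for a new observation with x = 0: prior * likelihood, then normalize
x_new = 0
unnormalized = prior * likelihood[:, x_new]
posterior_by_hand = unnormalized / unnormalized.sum()

# The same prediction from sklearn
toy_model = CategoricalNB(alpha=alpha).fit(X_toy, y_toy)
posterior_sklearn = toy_model.predict_proba([[x_new]])[0]

print(posterior_by_hand)
print(posterior_sklearn)
```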

# 5 Explanatory data exploration

In [None]:
df = pd.crosstab(titanic_orig['Survived'], titanic_orig['Sex'])
df

In [None]:
df.plot(kind='bar')

In [None]:
# overall proportions
df = pd.crosstab(titanic_orig['Sex'], titanic_orig['Survived'])/titanic_orig.shape[0]
df = df.round(2)
df

In [None]:
df.plot(kind='bar')

In [None]:
# share of each gender within each survival outcome (each column sums to 1)
ct = pd.crosstab(titanic_orig['Sex'], titanic_orig['Survived'])
ct = ct.div(ct.sum(axis=0), axis=1).round(2)
ct

In [None]:
ct.plot(kind='bar')

In [None]:
# survival proportions within each gender (each row sums to 1)
ct = pd.crosstab(titanic_orig['Sex'], titanic_orig['Survived'])
ct = ct.div(ct.sum(axis=1), axis=0).round(2)
ct

In [None]:
ct.plot(kind='bar')
plt.show()
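
The row and column proportions computed above with div can also be obtained through the normalize argument of pd.crosstab. Here is a minimal sketch on made-up data (the toy frame is hypothetical) that checks both approaches give the same numbers.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

counts = pd.crosstab(toy["Sex"], toy["Survived"])

# normalize="index": each row sums to 1 (survival rate within each gender)
row_props = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

# normalize="columns": each column sums to 1 (gender share within each outcome)
col_props = pd.crosstab(toy["Sex"], toy["Survived"], normalize="columns")

# The manual versions used in this notebook
manual_rows = counts.div(counts.sum(axis=1), axis=0)
manual_cols = counts.div(counts.sum(axis=0), axis=1)

print(row_props.round(2))
print(col_props.round(2))
```

The column-normalized table is the one Naive Bayes cares about: it estimates P(Sex | Survived), the conditional probabilities the model multiplies together.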

In [None]:
# share of each port of embarkation within each survival outcome (each column sums to 1)
ct = pd.crosstab(titanic_orig['Embarked'], titanic_orig['Survived'])
ct = ct.div(ct.sum(axis=0), axis=1).round(2)
ct

In [None]:
ct.plot(kind='bar')
plt.show()

In [None]:
# share of each passenger class within each survival outcome (each column sums to 1)
ct = pd.crosstab(titanic_orig['Pclass'], titanic_orig['Survived'])
ct = ct.div(ct.sum(axis=0), axis=1).round(2)
ct

In [None]:
ct.plot(kind='bar')
plt.show()

In [None]:
# survival proportions within each passenger class (each row sums to 1)
ct = pd.crosstab(titanic_orig['Pclass'], titanic_orig['Survived'])
ct = ct.div(ct.sum(axis=1), axis=0).round(2)
ct

In [None]:
ct.plot(kind='bar')
plt.show()

# 6 Generate performance metrics

In [None]:
# use the predictions we made a little bit ago to create a confusion matrix
cm = confusion_matrix(y,cnb_pred,labels=[0,1])
print(cm)

In [None]:
# show a confusion matrix in a more legible format

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=cnb.classes_
                              )
disp.plot(values_format='',cmap=plt.cm.Blues)
plt.show()


In [None]:
print(metrics.classification_report(y,cnb_pred))
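
The numbers in the classification report follow directly from the confusion matrix. As a minimal sketch on made-up labels (y_true and y_hat are hypothetical), accuracy, precision and recall for class 1 can be computed by hand and compared with sklearn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of the predicted 1s, how many are truly 1
recall = tp / (tp + fn)                      # of the true 1s, how many were found

print(accuracy, precision, recall)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```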

It's worth noting that the hold-out split in the next section passes the stratify=y argument to train_test_split to make sure that each target class is represented in the same proportion in the train and test sets. sklearn does NOT do this by default, while createDataPartition from the caret package in R does.
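
The effect of stratify can be seen in a small sketch on a made-up target with 60% zeros and 40% ones (X_toy and y_toy are hypothetical). With stratify the class share is preserved in both parts; without it the share is left to chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target: 60 zeros and 40 ones
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# Stratified split: class proportions are kept in train and test
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Plain random split: class proportions may drift
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print("stratified share of class 1:", y_tr_strat.mean(), y_te_strat.mean())
print("plain share of class 1:     ", y_tr_plain.mean(), y_te_plain.mean())
```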

# 7 Simple hold-out evaluation

In [None]:
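# alpha=0 requests no Laplace smoothing
# (some sklearn versions clip it to a tiny positive value and warn)
cnb_split_model = CategoricalNB(alpha=0)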

In [None]:
titanic_pre_train_test_split = titanic.copy()
#titanic_pre_train_test_split = titanic[['Survived','Sex','Pclass','Embarked']]
titanic_pre_train_test_split_enc = pd.get_dummies(titanic_pre_train_test_split)
titanic_pre_train_test_split_enc.info()

In [None]:
# now that we have encoded our data split it into train test
X = titanic_pre_train_test_split_enc
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


In [None]:
y_train.value_counts()

In [None]:
y_train.value_counts()/len(y_train)

In [None]:
y_test.value_counts()

In [None]:
y_test.value_counts()/len(y_test)

In [None]:
# FIT the model
cnb_split_model.fit(X_train,y_train)

In [None]:
# predict on the TRAIN data
y_pred_train = cnb_split_model.predict(X_train)

In [None]:
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_train,y_pred_train),
    display_labels=cnb_split_model.classes_
    )
disp.plot(values_format='',cmap=plt.cm.Blues)
plt.show()

In [None]:
print(metrics.classification_report(y_train,y_pred_train))

In [None]:
# Now predict on our hold-out data. This dataset is intended to replicate the "real" world by including data
# that the model did not get to see when being fitted. It is simply a subset of our original data.
# predict on the TEST data

y_pred_test = cnb_split_model.predict(X_test)

In [None]:
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_test,y_pred_test),
    display_labels=cnb_split_model.classes_
    )
disp.plot(values_format='',cmap=plt.cm.Blues)
plt.show()

In [None]:
print(metrics.classification_report(y_test,y_pred_test))