NB Titanic Tutorial
Converted to Python by Matthew Pecsok from Dr. Olivia Sheng's original tutorial in R
June 12, 2021

1 Data description

2 Library Setup

3 Overall data inspection

4 NB model building using sklearn package

5 Explanatory data exploration

6 Generate performance metrics

7 Simple hold-out evaluation


# 1 Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

VARIABLE DESCRIPTIONS:

PassengerID - Unique passenger identifier

Survived - Survival (0 = No; 1 = Yes)

Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd). Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Name

Sex

Age - Age in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5

Sibsp - Number of Siblings/Spouses Aboard

Parch - Number of Parents/Children Aboard

Ticket - Ticket Number

Fare - Passenger Fare

Cabin

Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


# 2 Library Setup

https://scikit-learn.org/stable/modules/naive_bayes.html

In [3]:
import pandas as pd
import numpy as np

import os

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB

import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn import metrics

from sklearn.model_selection import cross_validate


# 3 Overall data inspection

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
titanic = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/6482_to_4482/titanic_cleaned.csv")

In [None]:
type(titanic)

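pd.read_csv returns a pandas DataFrame, not the sklearn Bunch object (https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html) that sklearn's built-in dataset loaders return.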

In [None]:
titanic.shape

In [None]:
titanic.columns

The data is already a pandas dataframe, so it can be inspected directly for exploratory data analysis.

In [None]:
titanic.info()

In [None]:
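# keep only the target (Survived) and the categorical columns
# Cabin is also dropped, as it currently causes issues when splitting
# .copy() makes an independent dataframe, so later column assignments do not trigger SettingWithCopyWarning
titanic = titanic[['Survived','Sex','Embarked','Pclass']].copy()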

In [None]:
titanic

In [None]:
titanic.info()

In [None]:
titanic.describe(include='all')
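Before encoding, it can help to confirm that the selected columns have no missing values, since pd.get_dummies does not raise an error on them. The sketch below uses a small made-up frame (the rows and the name toy are illustrative assumptions, not the Titanic file) to show how a missing value is counted and how it ends up encoded.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing port, purely for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", np.nan, "C"],
})

# Count missing values per column before encoding.
missing = toy.isna().sum()
print(missing)

# With the default dummy_na=False, the row with a missing port gets 0/False
# in every Embarked_* column instead of raising an error.
enc = pd.get_dummies(toy)
embarked_cols = [c for c in enc.columns if c.startswith("Embarked_")]
row_totals = enc[embarked_cols].astype(int).sum(axis=1)
print(row_totals)
```

If titanic.isna().sum() reports zeros for the four selected columns, this case does not arise in the tutorial data.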

## 3.1 Dummy encoding the dataframe

In [None]:
titanic.head(2)

## 3.2 Encode the data

In [None]:
# convert Pclass to string so that get_dummies treats it as a categorical column
titanic['Pclass'] = titanic['Pclass'].astype(str)
titanic.dtypes


In [None]:
titanic_enc = pd.get_dummies(titanic)

In [None]:
titanic_enc.dtypes

In [None]:
titanic_enc.head(2)
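To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame. Only the column names mirror the tutorial; the rows and the names toy and toy_enc are assumptions for illustration. It shows that pd.get_dummies leaves the integer Survived column alone and creates one indicator column per category for the string columns, including Pclass once it has been cast to str.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the tutorial.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Pclass": [3, 1, 2],
})

# Pclass is stored as an integer, so get_dummies would leave it as-is.
# Casting to str makes it a string column that gets one indicator per class.
toy["Pclass"] = toy["Pclass"].astype(str)

toy_enc = pd.get_dummies(toy)
print(list(toy_enc.columns))
print(toy_enc.astype(int))
```

The indicator columns may be stored as booleans or small integers depending on the pandas version, which is why the sketch casts to int before printing.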

# 4 Build an NB model and use cross-validation to see how it performs across folds

In [None]:
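# separate the target; pop also removes Survived from titanic_enc, leaving only the feature columns
y = titanic_enc.pop('Survived')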

In [None]:
cnb = CategoricalNB() # create a categorical NB model
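As a rough illustration of what the model does once fitted, the sketch below trains CategoricalNB on six made-up rows with two indicator features. The data and the names X, y and cnb_toy are assumptions for illustration, not the Titanic data. CategoricalNB treats each column as a categorical feature with integer codes, so each 0/1 indicator column is handled as a two-category feature.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Made-up indicator features (columns: is_female, is_first_class); not Titanic data.
X = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
    [1, 1],
    [0, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])

cnb_toy = CategoricalNB()  # alpha=1.0 (Laplace smoothing) by default
cnb_toy.fit(X, y)

# Class probabilities for one new row; each row of predict_proba sums to 1.
new_row = np.array([[1, 0]])
proba = cnb_toy.predict_proba(new_row)
pred = cnb_toy.predict(new_row)
print(cnb_toy.classes_, proba, pred)
```

The predicted class is the one with the larger product of class prior and per-feature conditional probabilities, each smoothed by alpha.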

In [None]:
cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=False)

In [None]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=False)
pd.DataFrame(scores)

In [None]:
five_fold = pd.DataFrame(scores)


In [None]:
print("mean\n\n",five_fold.mean(axis=0))
print("\n\nstd\n\n",five_fold.std(axis=0))
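The dictionary returned by cross_validate holds one array per key, with one entry per fold: fit_time, score_time, and a test_ key for each requested scorer. The sketch below runs the same call on synthetic stand-in data (randomly generated, not the Titanic data; the names rng, folds and summary are illustrative) and summarizes only the metric columns. When cv is an integer and the estimator is a classifier, scikit-learn uses stratified folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 rows, 3 binary indicator features.
X = rng.integers(0, 2, size=(200, 3))
# Label follows the first feature about 80% of the time and is flipped otherwise.
y = np.where(rng.random(200) < 0.8, X[:, 0], 1 - X[:, 0])

scores = cross_validate(
    CategoricalNB(), X, y, cv=5,
    scoring=["f1", "accuracy", "recall", "precision"],
)
folds = pd.DataFrame(scores)  # one row per fold

# Keep only the metric columns; fit_time and score_time are timings, not scores.
metric_cols = [c for c in folds.columns if c.startswith("test_")]
summary = folds[metric_cols].agg(["mean", "std"]).T
print(summary)
```

Dropping the timing columns keeps the mean and standard deviation table focused on model quality.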

In [None]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=False)
pd.DataFrame(scores)

In [None]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=False)
ten_fold = pd.DataFrame(scores)

print("mean\n\n",ten_fold.mean(axis=0))
print("\n\nstd\n\n",ten_fold.std(axis=0))

        

In [None]:
# copy the notebook from Google Drive into the current working directory
!cp "/content/drive/My Drive/Colab Notebooks/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb" ./

# run the second shell command, jupyter nbconvert --to html "file name of the notebook",
# to create an HTML file from the copied ipynb

!jupyter nbconvert --to html "4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb"