<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NB Titanic Tutorial
Converted to Python by Matthew Pecsok from Dr. Olivia Sheng's original tutorial in R
June 12, 2021

1 Data description

2 Library Setup

3 Overall data inspection

4 NB model building using sklearn package

5 Explanatory data exploration

6 Generate performance metrics

7 Simple hold-out evaluation


# 1 Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

VARIABLE DESCRIPTIONS:

PassengerID - Unique passenger identifier

Survived - Survival (0 = No; 1 = Yes)

Pclass - Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd). Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.

Name

Sex

Age - Age in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

SibSp - Number of Siblings/Spouses Aboard

Parch - Number of Parents/Children Aboard

Ticket - Ticket Number

Fare - Passenger Fare

Cabin

Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


# 2 Library Setup

https://scikit-learn.org/stable/modules/naive_bayes.html

In [54]:
import pandas as pd
import numpy as np

import os

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import GaussianNB

import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn import metrics

from sklearn.model_selection import cross_validate


# 3 Overall data inspection

In [55]:
titanic = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv")

In [56]:
type(titanic)

pandas.core.frame.DataFrame


In [57]:
titanic.shape

(714, 9)

In [58]:
titanic.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

The data is already loaded as a pandas DataFrame, so it can be used directly for exploratory data analysis.

In [59]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Sex       714 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     714 non-null    int64  
 5   Parch     714 non-null    int64  
 6   Fare      714 non-null    float64
 7   Cabin     714 non-null    object 
 8   Embarked  714 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 50.3+ KB


In [60]:
# remove all non-categorical type columns
# also remove cabin as it is causing issues currently when splitting
titanic = titanic[['Survived','Sex','Embarked','Pclass']]

In [61]:
titanic

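| | Survived | Sex | Embarked | Pclass |
|---|---|---|---|---|
| 0 | 0 | male | S | 3 |
| 1 | 1 | female | C | 1 |
| 2 | 1 | female | S | 3 |
| 3 | 1 | female | S | 1 |
| 4 | 0 | male | S | 3 |
| ... | ... | ... | ... | ... |
| 709 | 0 | female | Q | 3 |
| 710 | 0 | male | S | 2 |
| 711 | 1 | female | S | 1 |
| 712 | 1 | male | C | 1 |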


In [62]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  714 non-null    int64 
 1   Sex       714 non-null    object
 2   Embarked  714 non-null    object
 3   Pclass    714 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 22.4+ KB


In [63]:
titanic.describe(include='all')

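| | Survived | Sex | Embarked | Pclass |
|---|---|---|---|---|
| count | 714.0 | 714 | 714 | 714.0 |
| unique | | 2 | 4 | |
| top | | male | S | |
| freq | | 453 | 554 | |
| mean | 0.406162 | | | 2.236695 |
| std | 0.49146 | | | 0.83825 |
| min | 0.0 | | | 1.0 |
| 25% | 0.0 | | | 1.0 |
| 50% | 0.0 | | | 2.0 |
| 75% | 1.0 | | | 3.0 |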
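
The summary above gives overall frequencies only. Naive Bayes works from frequencies within each class, so a table of conditional shares can help when reading the data. The following is a minimal sketch on a small made-up frame (`toy` and `cond` are illustrative names, not part of the tutorial) that uses `pd.crosstab` to estimate P(Sex | Survived):

```python
import pandas as pd

# Hypothetical toy data using the same column names as the tutorial's frame.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
})

# normalize="columns" divides each column by its total, so every
# Survived class sums to 1: the entries estimate P(Sex | Survived).
cond = pd.crosstab(toy["Sex"], toy["Survived"], normalize="columns")
print(cond)
```

The same call could be applied to `titanic` before the target is popped, with `Embarked` or `Pclass` in place of `Sex`.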


## 3.1 Dummy encoding the dataframe

In [64]:
titanic.head(2)

| | Survived | Sex | Embarked | Pclass |
|---|---|---|---|---|
| 0 | 0 | male | S | 3 |
| 1 | 1 | female | C | 1 |


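Pop the target variable. Survived is already coded as 0/1 integers, so no recoding is needed.

In [65]:
y = titanic.pop('Survived')

In [66]:
# Survived is already 0/1, so compare against the integer 1
balanced_y_target = y.eq(1).mul(1)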
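
As a minimal sketch of what `pop` does, the example below uses a made-up three-row frame (`toy` and `target` are illustrative names). The column is removed from the frame in place and returned as a Series. The sketch also shows that `eq` needs a value of the same type as the column: comparing an integer column with a string label matches nothing.

```python
import pandas as pd

# Hypothetical toy frame; the real tutorial frame has 714 rows.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

# pop() removes the column from `toy` in place and returns it as a Series.
target = toy.pop("Survived")
print(list(toy.columns))
print(target.tolist())

# eq() builds a boolean mask and mul(1) turns it into 0/1 integers.
as_int = target.eq(1).mul(1)
# An integer column never equals a string, so this mask is all zeros.
as_str = target.eq("yes").mul(1)
print(as_int.tolist(), as_str.tolist())
```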

## 3.2 Encode the data

In [67]:
# convert Pclass to string so that every column is categorical (object dtype)
titanic['Pclass'] = titanic['Pclass'].astype(str)
titanic.dtypes


Sex         object
Embarked    object
Pclass      object
dtype: object

In [68]:
titanic_enc = pd.get_dummies(titanic)

In [69]:
titanic_enc.dtypes

Sex_female          uint8
Sex_male            uint8
Embarked_C          uint8
Embarked_Q          uint8
Embarked_S          uint8
Embarked_missing    uint8
Pclass_1            uint8
Pclass_2            uint8
Pclass_3            uint8
dtype: object

In [70]:
titanic_enc.head(2)

| | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_missing | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
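
To make the encoding step concrete, here is a small sketch of `pd.get_dummies` on a made-up frame (`toy` and `toy_enc` are illustrative names). Each object column becomes one 0/1 column per distinct value, named `<column>_<value>`. The `dtype=int` argument is an addition for this sketch: pandas 2.x returns `bool` columns by default rather than the `uint8` shown above.

```python
import pandas as pd

# Hypothetical toy frame; Pclass is stored as strings, as in the tutorial.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": ["3", "1", "3"],
})

# One indicator column per (column, value) pair; dtype=int forces 0/1 integers.
toy_enc = pd.get_dummies(toy, dtype=int)
print(toy_enc)

# Within each original column, exactly one indicator is set per row.
print(toy_enc.filter(like="Sex_").sum(axis=1).tolist())
```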


# 4 Build an NB model and use cross-validation to see how it performs across folds

In [71]:
cnb = CategoricalNB() # create a categorical NB model
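
For intuition about what a categorical naive Bayes model computes, the following pure-Python sketch works through the calculation on six made-up passengers. It is not scikit-learn's implementation. It assumes add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `CategoricalNB`, and class priors taken from class frequencies. The helper names `fit_counts` and `posterior` are hypothetical.

```python
from collections import Counter, defaultdict

def fit_counts(rows, labels):
    """Count classes and, per (feature index, class), each feature value."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen for each feature
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(j, label)][v] += 1
            values[j].add(v)
    return class_counts, feat_counts, values

def posterior(row, class_counts, feat_counts, values, alpha=1.0):
    """Return P(class | row) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # prior P(class)
        for j, v in enumerate(row):
            # Smoothed estimate of P(feature j = v | class c).
            p *= (feat_counts[(j, c)][v] + alpha) / (n_c + alpha * len(values[j]))
        scores[c] = p
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

# Hypothetical (Sex, Pclass) rows with Survived labels.
rows = [("male", "3"), ("female", "1"), ("female", "3"),
        ("female", "1"), ("male", "3"), ("male", "2")]
labels = [0, 1, 1, 1, 0, 0]

post = posterior(("female", "1"), *fit_counts(rows, labels))
print(post)
```

For the query `("female", "1")` the unnormalised scores are 0.5 × 0.8 × 0.5 for class 1 and 0.5 × 0.2 × 1/6 for class 0, so the posterior for survival is 12/13, about 0.923.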

In [72]:
cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=True)

{'fit_time': array([0.00583243, 0.00309157, 0.00310874, 0.00365138, 0.0048604 ]),
 'score_time': array([0.00440836, 0.00352597, 0.0035789 , 0.00504065, 0.00549006]),
 'test_f1': array([0.70967742, 0.76521739, 0.7       , 0.70175439, 0.75438596]),
 'train_f1': array([0.7300216 , 0.71610169, 0.73233405, 0.73150106, 0.71881607]),
 'test_accuracy': array([0.74825175, 0.81118881, 0.74825175, 0.76223776, 0.8028169 ]),
 'train_accuracy': array([0.78108581, 0.76532399, 0.78108581, 0.77758319, 0.76748252]),
 'test_recall': array([0.75862069, 0.75862069, 0.72413793, 0.68965517, 0.74137931]),
 'train_recall': array([0.72844828, 0.72844828, 0.73706897, 0.74568966, 0.73275862]),
 'test_precision': array([0.66666667, 0.77192982, 0.67741935, 0.71428571, 0.76785714]),
 'train_precision': array([0.73160173, 0.70416667, 0.72765957, 0.71784232, 0.70539419])}
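
When `cv` is an integer and the estimator is a classifier, `cross_validate` splits the data with stratified k-fold, so each fold keeps roughly the same class balance. The sketch below reproduces that loop by hand on synthetic binary features (the data and the names `X`, `y_toy`, `manual`, and `auto` are made up for illustration) and checks that the per-fold accuracies agree with `cross_validate`.

```python
import itertools

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import CategoricalNB

# Synthetic data: every combination of three binary features, repeated 10 times.
X = np.array(list(itertools.product([0, 1], repeat=3)) * 10)
y_toy = X[:, 0] | X[:, 1]  # hypothetical target: 1 if either of the first two is set

# Manual loop: fit on four folds, score on the held-out fold.
manual = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y_toy):
    fitted = CategoricalNB().fit(X[train_idx], y_toy[train_idx])
    manual.append(accuracy_score(y_toy[test_idx], fitted.predict(X[test_idx])))

# The same evaluation through cross_validate.
auto = cross_validate(CategoricalNB(), X, y_toy, cv=5,
                      scoring=["accuracy"])["test_accuracy"]

print(np.allclose(manual, auto))
```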

In [73]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
pd.DataFrame(scores)

| | fit_time | score_time | test_f1 | train_f1 | test_accuracy | train_accuracy | test_recall | train_recall | test_precision | train_precision |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00448 | 0.003575 | 0.709677 | 0.730022 | 0.748252 | 0.781086 | 0.758621 | 0.728448 | 0.666667 | 0.731602 |
| 1 | 0.002908 | 0.003438 | 0.765217 | 0.716102 | 0.811189 | 0.765324 | 0.758621 | 0.728448 | 0.77193 | 0.704167 |
| 2 | 0.002938 | 0.003409 | 0.7 | 0.732334 | 0.748252 | 0.781086 | 0.724138 | 0.737069 | 0.677419 | 0.72766 |
| 3 | 0.003265 | 0.003539 | 0.701754 | 0.731501 | 0.762238 | 0.777583 | 0.689655 | 0.74569 | 0.714286 | 0.717842 |
| 4 | 0.00288 | 0.003359 | 0.754386 | 0.718816 | 0.802817 | 0.767483 | 0.741379 | 0.732759 | 0.767857 | 0.705394 |


In [74]:
five_fold = pd.DataFrame(scores)


In [75]:
print("mean\n\n",five_fold.mean(axis=0))
print("\n\nstd\n\n",five_fold.std(axis=0))

mean

 fit_time           0.003294
score_time         0.003464
test_f1            0.726207
train_f1           0.725755
test_accuracy      0.774549
train_accuracy     0.774512
test_recall        0.734483
train_recall       0.734483
test_precision     0.719632
train_precision    0.717333
dtype: float64


std

 fit_time           0.000681
score_time         0.000091
test_f1            0.031120
train_f1           0.007679
test_accuracy      0.030316
train_accuracy     0.007578
test_recall        0.028850
train_recall       0.007213
test_precision     0.049185
train_precision    0.012514
dtype: float64
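
The mean and standard deviation can also be collected into one table with `agg`, which makes the train/test comparison easier to read. The sketch below uses the per-fold accuracies from the 5-fold `cross_validate` output shown earlier; `scores_toy`, `summary`, and `gap` are illustrative names. A train/test gap close to zero, as here, is usually read as a sign of little overfitting.

```python
import pandas as pd

# Per-fold accuracies copied from the 5-fold cross_validate output.
scores_toy = {
    "test_accuracy": [0.74825175, 0.81118881, 0.74825175, 0.76223776, 0.8028169],
    "train_accuracy": [0.78108581, 0.76532399, 0.78108581, 0.77758319, 0.76748252],
}

# One row per metric, with mean and sample standard deviation as columns.
summary = pd.DataFrame(scores_toy).agg(["mean", "std"]).T
print(summary)

# Difference between mean train and mean test accuracy.
gap = summary.loc["train_accuracy", "mean"] - summary.loc["test_accuracy", "mean"]
print(gap)
```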


In [76]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
pd.DataFrame(scores)

| | fit_time | score_time | test_f1 | train_f1 | test_accuracy | train_accuracy | test_recall | train_recall | test_precision | train_precision |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00852 | 0.003683 | 0.75 | 0.722753 | 0.777778 | 0.774143 | 0.827586 | 0.724138 | 0.685714 | 0.721374 |
| 1 | 0.003027 | 0.003486 | 0.666667 | 0.732448 | 0.722222 | 0.780374 | 0.689655 | 0.739464 | 0.645161 | 0.725564 |
| 2 | 0.002947 | 0.003427 | 0.716981 | 0.726592 | 0.791667 | 0.772586 | 0.655172 | 0.743295 | 0.791667 | 0.710623 |
| 3 | 0.002939 | 0.003873 | 0.806452 | 0.71619 | 0.833333 | 0.767913 | 0.862069 | 0.720307 | 0.757576 | 0.712121 |
| 4 | 0.004237 | 0.003572 | 0.644068 | 0.734848 | 0.704225 | 0.782271 | 0.655172 | 0.743295 | 0.633333 | 0.726592 |
| 5 | 0.003018 | 0.003531 | 0.741935 | 0.72381 | 0.774648 | 0.774495 | 0.793103 | 0.727969 | 0.69697 | 0.719697 |
| 6 | 0.002893 | 0.003392 | 0.724138 | 0.725898 | 0.774648 | 0.774495 | 0.724138 | 0.735632 | 0.724138 | 0.716418 |
| 7 | 0.002972 | 0.003399 | 0.690909 | 0.729323 | 0.760563 | 0.77605 | 0.655172 | 0.743295 | 0.730769 | 0.715867 |
| 8 | 0.00297 | 0.003443 | 0.758621 | 0.722117 | 0.802817 | 0.771384 | 0.758621 | 0.731801 | 0.758621 | 0.712687 |
| 9 | 0.003087 | 0.003698 | 0.75 | 0.723164 | 0.802817 | 0.771384 | 0.724138 | 0.735632 | 0.777778 | 0.711111 |


In [77]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
ten_fold = pd.DataFrame(scores)

print("mean\n\n",ten_fold.mean(axis=0))
print("std\n\n",ten_fold.std(axis=0))

        

mean

 fit_time           0.003706
score_time         0.004338
test_f1            0.724977
train_f1           0.725714
test_accuracy      0.774472
train_accuracy     0.774509
test_recall        0.734483
train_recall       0.734483
test_precision     0.720173
train_precision    0.717205
dtype: float64
std

 fit_time           0.001337
score_time         0.001619
test_f1            0.047705
train_f1           0.005428
test_accuracy      0.038350
train_accuracy     0.004267
test_recall        0.074580
train_recall       0.008287
test_precision     0.054087
train_precision    0.005876
dtype: float64


In [78]:
!cp "/content/drive/My Drive/Colab Notebooks/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb" ./

# run the second shell command, jupyter nbconvert --to html "file name of the notebook"
# create html from ipynb

!jupyter nbconvert --to html "4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb"

cp: cannot stat '/content/drive/My Drive/Colab Notebooks/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb': No such file or directory
This application is used to convert notebook files (*.ipynb)
        to various other formats.


Options
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    E