<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NB Titaic Tutorial
Converted to Python by Matthew Pecsok from Dr. Olivia Sheng's original tutorial in R
June 12, 2021

1 Data description

2 Library Setup

3 Overall data inspection

4 NB model building using sklearn package

5 Explanatory data exploration

6 Generate performance metrics

7 Simple hold-out evaluation


# 1 Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people such as women, children, and the upper-class were more likely to survive than others.

VARIABLE DESCRIPTIONS:

PassengerID Unique passenger identifier

Survived Survival (0 = No; 1 = Yes)

Pclass Passenger Class(1 = 1st; 2 = 2nd; 3 = 3rd) (Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower)

Name

Sex

Age - (Age is in Years; Fractional if Age less than One (1) If the Age is Estimated, it is in the form xx.5)

Sibsp - Number of Siblings/Spouses Aboard Parch Number of Parents/Children Aboard

Ticket Number

Fare - Passenger Fare

Cabin

Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


# 2 Library Setup

https://scikit-learn.org/stable/modules/naive_bayes.html

In [1]:
import pandas as pd
import numpy as np

import os

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import GaussianNB

import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn import metrics

from sklearn.model_selection import cross_validate


# 3 overall data inspection

In [2]:
titanic = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv")

In [3]:
type(titanic)

pandas.core.frame.DataFrame

https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html

In [4]:
titanic.shape

(714, 9)

In [5]:
titanic.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

tranform the data from a numpy array and a list into a pandas dataframe for exploratory data analyisi

In [6]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Sex       714 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     714 non-null    int64  
 5   Parch     714 non-null    int64  
 6   Fare      714 non-null    float64
 7   Cabin     714 non-null    object 
 8   Embarked  714 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 50.3+ KB


In [7]:
# remove all non-categorical type columns
# also remove cabin as it is causing issues currently when splitting
titanic = titanic[['Survived','Sex','Embarked','Pclass']]

In [8]:
titanic

Unnamed: 0,Survived,Sex,Embarked,Pclass
0,0,male,S,3
1,1,female,C,1
2,1,female,S,3
3,1,female,S,1
4,0,male,S,3
...,...,...,...,...
709,0,female,Q,3
710,0,male,S,2
711,1,female,S,1
712,1,male,C,1


In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  714 non-null    int64 
 1   Sex       714 non-null    object
 2   Embarked  714 non-null    object
 3   Pclass    714 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 22.4+ KB


In [10]:
titanic.describe(include='all')

Unnamed: 0,Survived,Sex,Embarked,Pclass
count,714.0,714,714,714.0
unique,,2,4,
top,,male,S,
freq,,453,554,
mean,0.406162,,,2.236695
std,0.49146,,,0.83825
min,0.0,,,1.0
25%,0.0,,,1.0
50%,0.0,,,2.0
75%,1.0,,,3.0


## Dummy encoding the dataframe 

In [11]:
titanic.head(2)

Unnamed: 0,Survived,Sex,Embarked,Pclass
0,0,male,S,3
1,1,female,C,1


Pop the target variable and change it to 0,1

In [12]:
y = titanic.pop('Survived')

In [13]:
y = y.eq('yes').mul(1)

## 3.2 encode the data 

In [14]:
#convert all columns to 
titanic['Pclass'] = titanic['Pclass'].astype(str)
titanic.dtypes


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Sex         object
Embarked    object
Pclass      object
dtype: object

In [15]:
titanic_enc = pd.get_dummies(titanic)

In [16]:
titanic_enc.dtypes

Sex_female          uint8
Sex_male            uint8
Embarked_C          uint8
Embarked_Q          uint8
Embarked_S          uint8
Embarked_missing    uint8
Pclass_1            uint8
Pclass_2            uint8
Pclass_3            uint8
dtype: object

In [17]:
titanic_enc.head(2)

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Embarked_missing,Pclass_1,Pclass_2,Pclass_3
0,0,1,0,0,1,0,0,0,1
1,1,0,1,0,0,0,1,0,0


## 4 build a NB model and use cross validation to see how it performs across folds

In [18]:
cnb = CategoricalNB() # create a categorical NB model

In [19]:
cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=True)

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(averag

{'fit_time': array([0.0052371 , 0.00243711, 0.00239229, 0.00262451, 0.00243878]),
 'score_time': array([0.00656414, 0.0039537 , 0.0040319 , 0.00412679, 0.00413942]),
 'test_f1': array([0., 0., 0., 0., 0.]),
 'train_f1': array([0., 0., 0., 0., 0.]),
 'test_accuracy': array([1., 1., 1., 1., 1.]),
 'train_accuracy': array([1., 1., 1., 1., 1.]),
 'test_recall': array([0., 0., 0., 0., 0.]),
 'train_recall': array([0., 0., 0., 0., 0.]),
 'test_precision': array([0., 0., 0., 0., 0.]),
 'train_precision': array([0., 0., 0., 0., 0.])}

In [20]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
pd.DataFrame(scores)

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(averag

Unnamed: 0,fit_time,score_time,test_f1,train_f1,test_accuracy,train_accuracy,test_recall,train_recall,test_precision,train_precision
0,0.004113,0.007484,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
1,0.002793,0.005285,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
2,0.004086,0.004968,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
3,0.005537,0.005543,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
4,0.004368,0.004997,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0


In [21]:
five_fold = pd.DataFrame(scores)


In [22]:
print("mean\n\n",five_fold.mean(axis=0))
print("\n\nstd\n\n",five_fold.std(axis=0))

mean

 fit_time           0.004179
score_time         0.005656
test_f1            0.000000
train_f1           0.000000
test_accuracy      1.000000
train_accuracy     1.000000
test_recall        0.000000
train_recall       0.000000
test_precision     0.000000
train_precision    0.000000
dtype: float64


std

 fit_time           0.000977
score_time         0.001049
test_f1            0.000000
train_f1           0.000000
test_accuracy      0.000000
train_accuracy     0.000000
test_recall        0.000000
train_recall       0.000000
test_precision     0.000000
train_precision    0.000000
dtype: float64


In [23]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
pd.DataFrame(scores)

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(averag

Unnamed: 0,fit_time,score_time,test_f1,train_f1,test_accuracy,train_accuracy,test_recall,train_recall,test_precision,train_precision
0,0.008881,0.005092,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
1,0.002591,0.004251,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
2,0.002511,0.00416,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
3,0.002607,0.005463,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
4,0.002643,0.004565,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
5,0.002853,0.004532,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
6,0.00296,0.004425,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
7,0.002828,0.004542,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
8,0.003752,0.004981,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
9,0.002823,0.004721,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0


In [24]:
scores = cross_validate(
    cnb, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
ten_fold = pd.DataFrame(scores)

print("mean\n\n",ten_fold.mean(axis=0))
print("std\n\n",ten_fold.std(axis=0))

        

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(averag

mean

 fit_time           0.003467
score_time         0.004590
test_f1            0.000000
train_f1           0.000000
test_accuracy      1.000000
train_accuracy     1.000000
test_recall        0.000000
train_recall       0.000000
test_precision     0.000000
train_precision    0.000000
dtype: float64
std

 fit_time           0.001195
score_time         0.000249
test_f1            0.000000
train_f1           0.000000
test_accuracy      0.000000
train_accuracy     0.000000
test_recall        0.000000
train_recall       0.000000
test_precision     0.000000
train_precision    0.000000
dtype: float64


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [25]:
!cp "/content/drive/My Drive/Colab Notebooks/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb" ./

# run the second shell command, jupyter nbconvert --to html "file name of the notebook"
# create html from ipynb

!jupyter nbconvert --to html "4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb"

cp: cannot stat '/content/drive/My Drive/Colab Notebooks/4482_Naive_Bayes_CV-Titanic-Tutorial.ipynb': No such file or directory
This application is used to convert notebook files (*.ipynb)
        to various other formats.


Options
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    E