# Naive Bayes

*October 2019 | Hilary Goh, Perth - Australia*

---

A Naive Bayes notebook based on the following references:

Sklearn Iris: https://scikit-learn.org/stable/modules/naive_bayes.html

Titanic blog: https://www.sicara.ai/blog/2018-02-28-naive-bayes-classification-sklearn
* nice explanation of Naive Bayes
* also talks about NaNs and how to deal with them

---

# Iris example

### Load libraries

In [None]:
%matplotlib inline
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split 
from sklearn.datasets import load_iris 
import matplotlib.pyplot as plt

### Load data

In [None]:
iris = load_iris()
x = iris['data']
y = iris['target']

In [None]:
iris  # note: pandas is not used in this part of the example

### Create data splits and train classifier

In [None]:
x_train, x_test, y_train, y_test = train_test_split( x, y, random_state = 42) 
# try different random states
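
To see what "try different random states" looks like in practice, here is a minimal sketch that repeats the split and fit for a few seeds and records the test accuracy of each. The seed values and the names `scores`, `x_tr`, `x_te`, `y_tr`, `y_te` are arbitrary choices for illustration, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
x, y = iris["data"], iris["target"]

# Seeds picked arbitrarily for illustration
scores = {}
for seed in [0, 1, 42]:
    # A different random_state gives a different train/test partition
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=seed)
    scores[seed] = GaussianNB().fit(x_tr, y_tr).score(x_te, y_te)

for seed, score in scores.items():
    print(f"random_state={seed}: accuracy={score:.3f}")
```

The accuracy can shift a little from seed to seed because the test set is small, so a single split is only a rough estimate.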

In [None]:
clf = GaussianNB()
clf = clf.fit(x_train, y_train)
# make predictions
clf.predict(x_test)

### Examine results

In [None]:
y_pred = clf.predict(x_test)
print("Number of mislabeled points out of a total %d points : %d" 
      % (x_test.shape[0],(y_test != y_pred).sum())) 

In [None]:
clf.score(x_test, y_test)
# score is the mean accuracy on the test set; close to 1 is a good result
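
The score is a single number, so it does not show which classes get confused with each other. As a possible follow-up, the sketch below builds a confusion matrix with `sklearn.metrics.confusion_matrix` and recovers the accuracy from its diagonal. The names `cm` and `accuracy` are introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=42)

clf = GaussianNB().fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total
accuracy = cm.trace() / cm.sum()
print(accuracy, clf.score(x_test, y_test))
```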

In [None]:
import pandas as pd

In [None]:
train = pd.DataFrame(x_train, columns=iris.feature_names)
train['target'] = y_train

In [None]:
iris.target_names

In [None]:
# class indices follow iris.target_names: 0 = setosa, 1 = versicolor, 2 = virginica
setosa = train[train.target == 0]
versicolor = train[train.target == 1]
virginica = train[train.target == 2]

In [None]:
iris.feature_names[0]

In [None]:
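import seaborn as sns
plt.figure(figsize=(15,8))
# note: distplot is deprecated from seaborn 0.11; histplot(..., kde=True) is the suggested replacement
sns.distplot(virginica['sepal length (cm)'], label='virginica')
sns.distplot(setosa['sepal length (cm)'], label='setosa')
sns.distplot(versicolor['sepal length (cm)'], label='versicolor')
plt.legend()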
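
The plot shows one distribution per class, which is what Gaussian Naive Bayes models: a normal distribution per feature and per class. As a rough check, the sketch below compares the per-class feature means computed with pandas against the fitted `theta_` attribute of `GaussianNB`. The name `class_means` is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=42)
clf = GaussianNB().fit(x_train, y_train)

train = pd.DataFrame(x_train, columns=iris.feature_names)
train["target"] = y_train

# Mean of every feature within each class (rows: class 0, 1, 2)
class_means = train.groupby("target").mean()
print(class_means)

# theta_ holds the per-class feature means learned by the classifier
print(clf.theta_)
print(np.allclose(class_means.values, clf.theta_))
```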

In [None]:
y_train[:5]

In [None]:
clf.class_prior_

In [None]:
for i in [0,1,2]:
    print(f'number of class {i} is {(y_train == i).sum()}')
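
The class counts and `class_prior_` are directly related: when no priors are passed in, the prior for each class is its share of the training labels. A small sketch of that check follows, with `counts` and `manual_prior` as illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=42)
clf = GaussianNB().fit(x_train, y_train)

# Number of training samples per class, in class order 0, 1, 2
counts = np.bincount(y_train)

# Prior = class count divided by the total number of training samples
manual_prior = counts / counts.sum()

print(manual_prior)
print(clf.class_prior_)
```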

# Titanic example

Which dataset to use?

Stanford: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html
* use this one if you don't want to make a Kaggle account
* smaller dataset (fewer columns)

Kaggle: https://www.kaggle.com/c/titanic
* requires Kaggle account to download data
* larger dataset

### Load libraries

* this assumes a new notebook/experiment, separate from the Iris example above
* in the new notebook, try typing in the code rather than copying it; this is a good way to practise Python

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

### Load data

In [None]:
data = pd.read_csv("titanic.csv")
# you will get an error here because the dataset is not saved on your PC/locally
# find the Titanic dataset - see blog
# save this dataset to the same directory as this Jupyter notebook 

In [None]:
data.columns

### Data preparation

Convert categorical data to numerical values, e.g. male = 0, female = 1.

Note which Titanic dataset you are using, because the column names differ between the Stanford and Kaggle files.

In [None]:
# Convert categorical variable to numeric
data["Sex_cleaned"]=np.where(data["Sex"]=="male",0,1)
#data["Embarked_cleaned"]=np.where(data["Embarked"]=="S",0,
#                                  np.where(data["Embarked"]=="C",1,
#                                           np.where(data["Embarked"]=="Q",2,3)))
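
As an alternative to nested `np.where` calls, the same encoding can be sketched with a dictionary and `Series.map`. The tiny `toy` DataFrame below is made up for illustration and is not the Titanic data. One difference to keep in mind: `np.where` sends every value that is not "male" (including missing values) to 1, whereas `map` leaves unmapped values as NaN until they are filled.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", np.nan],
})

# Explicit lookup table: male -> 0, female -> 1
toy["Sex_cleaned"] = toy["Sex"].map({"male": 0, "female": 1})

# S -> 0, C -> 1, Q -> 2, anything else (including NaN) -> 3
toy["Embarked_cleaned"] = (
    toy["Embarked"].map({"S": 0, "C": 1, "Q": 2}).fillna(3).astype(int)
)

print(toy)
```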

In [None]:
# Keep only the columns of interest and drop every row that contains a NaN
data=data[[
    "Survived",
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
]].dropna(axis=0, how='any')
# changed column titles in csv
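
Before dropping rows it can help to see how many values are missing and what `dropna(axis=0, how='any')` removes. The sketch below uses a made-up `toy` DataFrame (not the Titanic data) to show that a row is dropped as soon as any one of its columns is NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Count of missing values in each column
missing = toy.isna().sum()
print(missing)

# how='any' drops a row if at least one of its values is NaN
cleaned = toy.dropna(axis=0, how="any")
print(cleaned)
```

Here rows 1 and 2 are removed, so half of the toy data is lost. On a real dataset it is worth checking how many rows survive this step.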

### Create data splits and train classifier

In [None]:
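# Split dataset into training and test datasets
# random_state=int(time.time()) gives a different split on every run;
# use the fixed seed in the commented line below for reproducible results
X_train, X_test = train_test_split(data, test_size=0.5, random_state=int(time.time()))
#X_train, X_test = train_test_split(data, test_size=0.5, random_state=0)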
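
The two versions of the split differ only in `random_state`. A minimal sketch of why that matters: with a fixed seed, two calls return exactly the same rows, so results can be compared between runs. The `toy` DataFrame and the names `first_train`, `second_train` and so on are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1] * 5,
    "Fare": [float(i) for i in range(10)],
})

# Same seed twice: the partitions are identical
first_train, first_test = train_test_split(toy, test_size=0.5, random_state=0)
second_train, second_test = train_test_split(toy, test_size=0.5, random_state=0)

print(first_train.index.equals(second_train.index))

# Train and test never share a row
print(set(first_train.index) & set(first_test.index))
```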

In [None]:
# Instantiate the classifier
gnb = GaussianNB()
used_features =[
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",]

In [None]:
# Train classifier
gnb.fit(X_train[used_features].values, X_train["Survived"])
# Predict on the test set (use .values here too, to match how the model was fitted)
y_pred = gnb.predict(X_test[used_features].values)

### Examine results

In [None]:
# Print results
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(X_test.shape[0],(X_test["Survived"] != y_pred).sum(),
    100*(1-(X_test["Survived"] != y_pred).sum()/X_test.shape[0])))

## Illustration with 1 feature

We want to know the probability of survival given the fare.

In [None]:
mean_survival=np.mean(X_train["Survived"])

In [None]:
mean_not_survival=1-mean_survival

In [None]:
print("Survival prob = {:03.2f}%, Not survival prob = {:03.2f}%"
      .format(100*mean_survival,100*mean_not_survival))

In [None]:
1502/2224
# died/total check: the fraction of all passengers who died
# is it the same as above?
# why might it be different?

In [None]:
mean_fare_survived = np.mean(X_train[X_train["Survived"]==1]["Fare"]) # 1 is survived, 0 is died
std_fare_survived = np.std(X_train[X_train["Survived"]==1]["Fare"])
mean_fare_not_survived = np.mean(X_train[X_train["Survived"]==0]["Fare"])
std_fare_not_survived = np.std(X_train[X_train["Survived"]==0]["Fare"])

In [None]:
print("mean_fare_survived = {:03.2f}".format(mean_fare_survived))
print("std_fare_survived = {:03.2f}".format(std_fare_survived))
print("mean_fare_not_survived = {:03.2f}".format(mean_fare_not_survived))
print("std_fare_not_survived = {:03.2f}".format(std_fare_not_survived))
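
With the class priors and the per-class mean and standard deviation of the fare, Bayes' rule gives the probability of survival for a given fare. The sketch below works through that on synthetic fares (randomly generated, NOT the Titanic data) and compares the manual result with `GaussianNB.predict_proba`. The helper `gaussian_pdf`, the value of `query_fare` and all generated numbers are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic fares for illustration only, NOT the Titanic data
fare_survived = rng.normal(50.0, 20.0, size=200)
fare_not_survived = rng.normal(25.0, 10.0, size=300)
fare = np.concatenate([fare_survived, fare_not_survived])
survived = np.concatenate([np.ones(200, dtype=int), np.zeros(300, dtype=int)])

# Priors: share of each class in the training labels
prior_survived = survived.mean()
prior_not_survived = 1 - prior_survived

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * np.sqrt(2 * np.pi))

query_fare = 40.0  # hypothetical fare to evaluate

# Likelihood of this fare under each class
like_s = gaussian_pdf(query_fare, fare_survived.mean(), fare_survived.std())
like_n = gaussian_pdf(query_fare, fare_not_survived.mean(), fare_not_survived.std())

# Bayes' rule: posterior = prior * likelihood / evidence
evidence = prior_survived * like_s + prior_not_survived * like_n
posterior_survived = prior_survived * like_s / evidence

# Same quantity from scikit-learn, trained on the single fare feature
gnb = GaussianNB().fit(fare.reshape(-1, 1), survived)
sklearn_posterior = gnb.predict_proba([[query_fare]])[0, 1]

print(posterior_survived, sklearn_posterior)
```

The two numbers should agree closely; scikit-learn adds a very small smoothing term to the variances, so a tiny difference is expected.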

Check out the rest of the blog for further discussion of the results.

FIN