# SVM and Naive Bayes 

Use the Titanic data to predict whether or not a passenger survived.

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.


Variables:
- survived: Survival (0 = No, 1 = Yes)
- pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- name: Name of the passenger
- sex: Sex of the passenger
- age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


In [1]:
# import the packages needed 
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.svm import SVC 
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
import matplotlib.pyplot as plt
%matplotlib inline

# define a function that computes the evaluation metrics used later
def classification_metrics(Y_pred, Y_true):
    acc = accuracy_score(Y_true, Y_pred)
    precision = precision_score(Y_true, Y_pred)
    recall = recall_score(Y_true, Y_pred)
    f1score = f1_score(Y_true, Y_pred)
    # note: AUC is computed here from hard 0/1 predictions, not from scores or probabilities
    auc = roc_auc_score(Y_true, Y_pred)

    return acc, precision, recall, f1score, auc

# define a function for printing the metrics
def display_metrics(classifierName, Y_pred, Y_true):
    print("______________________________________________")
    print(f"Model: {classifierName}")
    acc, precision, recall, f1score, auc = classification_metrics(Y_pred, Y_true)
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1score}")
    print(f"AUC: {auc}")
    print("______________________________________________")
    print("")


## Get the Data

In [2]:
df = pd.read_csv('titanic.csv') # import the titanic data

In [3]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


## Data Processing
- Do one or more of the following to process the data before using it to train the model: 
    - Creating dummy variables for categorical data (use pd.get_dummies)
    - Dropping null values (use dropna) 
    - Dropping unwanted variables (use drop)
    - Feature engineering: Derive new features from the original features 
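The four steps listed above can be sketched on a small frame. This is only an illustration: the rows are made up (one age is deliberately missing), and `FamilySize` is a hypothetical derived feature that the notebook itself does not use.

```python
import pandas as pd

# Made-up toy rows, not the real Titanic data; one Age is missing on purpose.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Cabin": [None, "C85", None, "C123", None],
})

# Feature engineering: FamilySize is a hypothetical feature (relatives aboard + self).
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Drop unwanted variables. Cabin goes first so its nulls do not trigger dropna later.
toy = toy.drop(["SibSp", "Parch", "Cabin"], axis=1)

# Dummy variables: drop_first=True keeps a single indicator column for Sex.
toy = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

# Drop rows that still contain null values (here, the row with the missing Age).
toy = toy.dropna()

print(toy)
```

The order matters in this sketch: dropping a mostly empty column such as `Cabin` before calling `dropna` avoids discarding rows that are otherwise complete.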

In [5]:
cat_feats = ['Sex'] 
final_data = pd.get_dummies(df, columns=cat_feats, prefix_sep=' - ', drop_first=True) 

In [6]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [7]:
final_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex - male'],
      dtype='object')

In [8]:
final_data = final_data.drop(['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Embarked', 'Name', 'Cabin', 'Ticket'], axis=1) 
final_data = final_data.dropna() 
final_data.head()

| | Survived | Age | Fare | Sex - male |
|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 |
| 3 | 1 | 35.0 | 53.1 | 0 |
| 4 | 0 | 35.0 | 8.05 | 1 |


In [9]:
# check the number of rows and columns

In [10]:
final_data.shape 

(714, 4)

## Train Test Split

In [11]:
X = final_data.drop('Survived',axis=1) 
y = final_data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
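To make the effect of `test_size` and `random_state` concrete, here is a small sketch that splits a stand-in array of 714 row indices (the row count reported by `final_data.shape`) instead of the real features. The array `rows` is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 714 rows left after dropna(); only the row count matters here.
rows = np.arange(714)

# Same arguments as in the notebook cell above.
train_a, test_a = train_test_split(rows, test_size=0.30, random_state=101)
train_b, test_b = train_test_split(rows, test_size=0.30, random_state=101)

print(len(train_a), len(test_a))

# A fixed random_state makes the shuffle reproducible across calls.
same_split = np.array_equal(test_a, test_b)
print(same_split)
```

With 714 rows, a 30% test share is rounded up to 215 test rows, leaving 499 for training. The 215 matches the total of the confusion matrix reported in the evaluation section.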


In [12]:
X_train.head()

| | Age | Fare | Sex - male |
|---|---|---|---|
| 83 | 28.0 | 47.1 | 1 |
| 534 | 30.0 | 8.6625 | 0 |
| 588 | 22.0 | 8.05 | 1 |
| 163 | 17.0 | 8.6625 | 1 |
| 71 | 16.0 | 46.9 | 0 |


## Training an SVM
- Use the SVC class from sklearn.svm
- Some parameters:
    - C: Penalty parameter of the error term. Larger values penalize misclassified training points more heavily.
    - kernel: Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used.
    - gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
    - See all parameters in the documentation: https://scikit-learn.org/0.21/modules/generated/sklearn.svm.SVC.html
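To give a feel for what `gamma` controls in the default ‘rbf’ kernel, the sketch below evaluates the standard RBF formula, exp(-gamma * ||x - z||^2), by hand for two rows taken from the `X_train.head()` output. The helper `rbf_kernel` is written here for illustration and is not the scikit-learn function of the same name.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two rows from X_train.head(), in the order (Age, Fare, Sex - male).
a = (30.0, 8.6625, 0)
b = (17.0, 8.6625, 1)

# gamma=0.5 is the value used in the notebook; 0.001 is an arbitrary smaller value.
k_large_gamma = rbf_kernel(a, b, gamma=0.5)
k_small_gamma = rbf_kernel(a, b, gamma=0.001)

print(k_large_gamma)  # practically zero: the two passengers look unrelated
print(k_small_gamma)  # close to 1: the two passengers look similar
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences only its close neighbours. On unscaled features such as age and fare, distances are large, so a `gamma` of 0.5 already treats most pairs as unrelated.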

In [13]:
model_svm = SVC(C=5, gamma=0.5, random_state=101)

model_svm.fit(X_train, y_train)

y_pred = model_svm.predict(X_test)

display_metrics('SVM', y_pred, y_test)


______________________________________________
Model: SVM
Accuracy: 0.6697674418604651
Precision: 0.6721311475409836
Recall: 0.44565217391304346
F1-score: 0.5359477124183005
AUC: 0.6415252739483916
______________________________________________
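The SVM above is trained on unscaled features, and RBF kernels are sensitive to feature scale (fares reach about 512 while the sex indicator is 0 or 1). One option worth trying, which is not part of the original notebook, is to standardize the features inside a pipeline. The sketch below uses synthetic stand-in data with a purely illustrative label, so it shows only the mechanics and says nothing about the Titanic results.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(101)

# Synthetic stand-in for (Age, Fare, Sex - male); NOT the Titanic data.
n = 200
X_demo = np.column_stack([
    rng.uniform(1, 80, n),    # age-like column
    rng.uniform(0, 500, n),   # fare-like column
    rng.integers(0, 2, n),    # binary indicator column
])
# Illustrative label only: it simply copies the binary column.
y_demo = X_demo[:, 2].astype(int)

# StandardScaler rescales each column to zero mean and unit variance
# before the SVC sees it; the SVC settings mirror the notebook cell.
scaled_svm = make_pipeline(StandardScaler(), SVC(C=5, gamma=0.5, random_state=101))
scaled_svm.fit(X_demo, y_demo)
pred_demo = scaled_svm.predict(X_demo)

print(pred_demo[:10])
```

In the notebook itself, the same pipeline object could be fitted on `X_train` and `y_train` in place of `model_svm`. Whether it improves the metrics would need to be checked by running it.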



## Training the Naive Bayes model
- Use GaussianNB in sklearn 
    - See documentation: https://scikit-learn.org/stable/modules/naive_bayes.html

In [14]:
model_nb = GaussianNB()

model_nb.fit(X_train, y_train)
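As a rough picture of what a Gaussian Naive Bayes classifier does when it is fitted, the sketch below estimates a prior and a per-feature mean and variance for each class, then predicts the class with the highest log prior plus summed log Gaussian likelihoods. It is a simplified hand-written version on made-up (age, fare) rows, not scikit-learn's implementation. The function names and the `eps` constant are assumptions of this sketch.

```python
import math

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps variances strictly positive (a choice made for this sketch)
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        params[label] = (n / len(y), means, variances)
    return params

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(params, x):
    """Return the class with the largest log prior + summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(scores, key=scores.get)

# Made-up (age, fare) rows: class 0 has low fares, class 1 has high fares.
X_toy = [(22, 7), (35, 8), (40, 9), (26, 50), (38, 70), (30, 60)]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
pred_low_fare = predict_gaussian_nb(params, (25, 8))
pred_high_fare = predict_gaussian_nb(params, (33, 65))
print(pred_low_fare, pred_high_fare)
```

The "naive" part is the sum over features: each feature contributes its own likelihood term independently of the others, given the class.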


## Predictions and Evaluation

In [15]:
y_pred = model_nb.predict(X_test)

In [16]:
confusion_matrix_results = confusion_matrix(y_test, y_pred)

print('confusion matrix: \n', confusion_matrix_results)

display_metrics('Naive Bayes', y_pred, y_test)

confusion matrix: 
 [[102  21]
 [ 28  64]]
______________________________________________
Model: Naive Bayes
Accuracy: 0.772093023255814
Precision: 0.7529411764705882
Recall: 0.6956521739130435
F1-score: 0.7231638418079096
AUC: 0.7624602332979851
______________________________________________
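The Naive Bayes metrics above can be recovered by hand from the printed confusion matrix, which is a useful check on how each one is defined. The sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
tn, fp = 102, 21
fn, tp = 28, 64

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)         # true positive rate
specificity = tn / (tn + fp)    # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area reduces to the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"AUC:       {auc_from_labels:.4f}")
```

Because `display_metrics` passes predicted labels to `roc_auc_score`, the AUC it reports is this label-based value. Passing scores from `predict_proba` or `decision_function` would generally give a different number.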

