## Preliminaries

In [1]:
# Load libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score


from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn import svm




## Load Preprocessed data and target

In [75]:
X_train = np.loadtxt('data_preproc_train.txt', delimiter=',')
y_train = np.loadtxt('target_train.txt',delimiter=',')
X_test = np.loadtxt('data_preproc_test.txt',delimiter=',')

### Bayes Theorem

Bayes theorem is a famous equation that allows us to make predictions based on data. Here is the classic version of the Bayes theorem:

P
(
A
∣
B
)
=
P
(
B
∣
A
)
P
(
A
)
/
P
(
B
)
This might be too abstract, so let us replace some of the variables to make it more concrete. In a bayes classifier, we are interested in finding out the class (e.g. male or female, spam or ham) of an observation given the data:

p
(
class
∣
data
)
=
p
(
data
∣
class
)
∗
p
(
class
)
/
p
(
data
)

where:

class is a particular class (e.g. male)

data is an observation’s data

p(class∣data) is called the posterior

p(data|class)is called the likelihood

p(class) is called the prior

p(data) is called the marginal probability

### Gaussian Naive Bayes Classifier

A gaussian naive bayes is probably the most popular type of bayes classifier. To explain what the name means, let us look at what the bayes equations looks like when we apply our two classes of survival(yes and no) and six feature variables (Pclass, Sex, Age, SibSp,Parch, Embarked):

\begin{equation*}
posterior(yes) = \frac{P(yes)p(pclass∣yes)p(sex∣yes)p(age∣yes)p(SibSp∣yes)p(Parch∣yes)p(Embarked∣yes)}{marginal probability}
\end{equation*}

\begin{equation*}
posterior(no) = \frac{P(no)p(pclass∣no)p(sex∣no)p(age∣no)p(SibSp∣no)p(Parch∣no)p(Embarked∣no)}{marginal probability}
\end{equation*}

Now let us unpack the top equation a bit:

P(yes) is the prior probabilities. It is, as you can see, simply the probability an observation is yes. This is just the number of yes in the dataset divided by the total number of people in the dataset.

p(pclass∣no)p(sex∣no)p(age∣no)p(SibSp∣no)p(Parch∣no)p(Embarked∣no) is the likelihood. Notice that we have unpacked person’s data so it is now every feature in the dataset. The “gaussian” and “naive” come from two assumptions present in this likelihood:

	1.If you look each term in the likelihood you will notice that we assume each feature is uncorrelated from each other. That is, foot size is independent of weight or age etc.. This is obviously not true, and is a “naive” assumption - hence the name “naive bayes.”

	2.Second, we assume have that the value of the features (e.g. the age of a survial outcome, the sex of survival outcome) are normally (gaussian) distributed. This means that p(age∣no) is calculated by inputing the required parameters into the probability density function of the normal distribution:

p(age∣no)= (1 / √2π * variance of no survival age in the data) * e−(observation’s age − average age of no's in the data)^2 / 2 * variance of no age in the data

marginal probability is probably one of the most confusing parts of bayesian approaches. In toy examples (including ours) it is completely possible to calculate the marginal probability. However, in many real-world cases, it is either extremely difficult or impossible to find the value of the marginal probability. This is not as much of a problem for our classifier as you might think. Why? Because we don’t care what the true posterior value is, we only care which class has a the highest posterior value. And because the marginal probability is the same for all classes 

1) we can ignore the denominator

2) calculate only the posterior’s numerator for each class

3) pick the largest numerator. That is, we can ignore the posterior’s denominator and make a prediction solely on the relative values of the posterior’s numerator.

### Calculate Priors

Priors can be either constants or probability distributions. In our example, this is simply the probability of being a gender. Calculating this is simple:

In [77]:
np.shape(X_train)

(714, 6)

In [78]:
np.shape(X_test)

(332, 6)

In [76]:
np.shape(y_train)

(714,)

In [79]:
n_yes = np.count_nonzero(y_train)
print(n_yes)

290


In [80]:
np.size(y_train)

714

In [81]:
# Total rows
total_ppl = np.size(y_train)

# Number of yes == 1
n_yes = np.count_nonzero(y_train)
print(n_yes)

# Number of no == 0
n_no = total_ppl - n_yes
print(n_no)

# Number of survived divided by the total rows
P_yes = n_yes/total_ppl
print(P_yes)

# Number of not survived divided by the total rows
P_no = n_no/total_ppl
print(P_no)

290
424
0.4061624649859944
0.5938375350140056


### Calculate Likelihood

\begin{equation*}
p(age∣no) =  \frac{1}{2π * variance of no survival age in the data}* e− ^\frac{(observations age − average age of nos in the data)^2} {2* variance of no age in the data}
\end{equation*} is the likelihood. 

This means that for each class (e.g. no) and feature (e.g. age) combination we need to calculate the variance and mean value from the data. 

In [82]:
df_X_train = pd.DataFrame(X_train, columns=['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Embarked'])

In [83]:
df_X_train['Survial'] = pd.DataFrame(y_train)

In [84]:
df_X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Survial
0,0.911,0.759,-0.53,0.525,-0.506,0.519,0.0
1,-1.476,-1.317,0.572,0.525,-0.506,-2.053,1.0
2,0.911,-1.317,-0.255,-0.552,-0.506,0.519,1.0
3,-1.476,-1.317,0.365,0.525,-0.506,0.519,1.0
4,0.911,0.759,0.365,-0.552,-0.506,0.519,0.0


In [85]:
# Group the data by gender and calculate the means of each feature
data_means = df_X_train.groupby('Survial').mean()

# View the values
data_means

Unnamed: 0_level_0,Pclass,Sex,Age,SibSp,Parch,Embarked
Survial,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,0.297255,0.445642,0.06387,0.014229,-0.077217,0.148972
1.0,-0.434917,-0.651248,-0.093352,-0.021024,0.112845,-0.217124


In [86]:
# Group the data by gender and calculate the variance of each feature
data_variance = df_X_train.groupby('Survial').var()

# View the values
data_variance

Unnamed: 0_level_0,Pclass,Sex,Age,SibSp,Parch,Embarked
Survial,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,0.787827,0.553644,0.953078,1.264646,1.061262,0.738524
1.0,0.996964,0.942124,1.060719,0.619537,0.896301,1.310329


In [87]:
# Means for survived
yes_Pclass_mean = data_means['Pclass'][data_variance.index == 1.0].values[0]
yes_Sex_mean = data_means['Sex'][data_variance.index == 1.0].values[0]
yes_Age_mean = data_means['Age'][data_variance.index == 1.0].values[0]
yes_SibSp_mean = data_means['SibSp'][data_variance.index == 1.0].values[0]
yes_Parch_mean = data_means['Parch'][data_variance.index == 1.0].values[0]
yes_Embarked_mean = data_means['Embarked'][data_variance.index == 1.0].values[0]

# Variance for survived
yes_Pclass_variance= data_variance['Pclass'][data_variance.index == 1.0].values[0]
yes_Sex_variance= data_variance['Sex'][data_variance.index == 1.0].values[0]
yes_Age_variance= data_variance['Age'][data_variance.index == 1.0].values[0]
yes_SibSp_variance= data_variance['SibSp'][data_variance.index == 1.0].values[0]
yes_Parch_variance= data_variance['Parch'][data_variance.index == 1.0].values[0]
yes_Embarked_variance= data_variance['Embarked'][data_variance.index == 1.0].values[0]

# Means for not survived
no_Pclass_mean = data_means['Pclass'][data_variance.index == 0.0].values[0]
no_Sex_mean = data_means['Sex'][data_variance.index == 0.0].values[0]
no_Age_mean = data_means['Age'][data_variance.index == 0.0].values[0]
no_SibSp_mean = data_means['SibSp'][data_variance.index == 0.0].values[0]
no_Parch_mean = data_means['Parch'][data_variance.index == 0.0].values[0]
no_Embarked_mean = data_means['Embarked'][data_variance.index == 0.0].values[0]

# Variance for not survived
no_Pclass_variance= data_variance['Pclass'][data_variance.index == 0.0].values[0]
no_Sex_variance= data_variance['Sex'][data_variance.index == 0.0].values[0]
no_Age_variance= data_variance['Age'][data_variance.index == 0.0].values[0]
no_SibSp_variance= data_variance['SibSp'][data_variance.index == 0.0].values[0]
no_Parch_variance= data_variance['Parch'][data_variance.index == 0.0].values[0]
no_Embarked_variance= data_variance['Embarked'][data_variance.index == 0.0].values[0]

Finally, we need to create a function to calculate the probability density of each of the terms of the likelihood (e.g. p(age∣no)).

In [88]:
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):

    # Input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p

### Apply Bayes Classifier To New Data Point

In [89]:
df_X_test = pd.DataFrame(X_test, columns=['Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Embarked'])

In [92]:
print(df_X_test.shape)
df_X_test.head()

(332, 6)


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked
0,1.012,0.787,0.299,-0.552,-0.491,-0.511
1,1.012,-1.271,1.181,0.594,-0.491,0.651
2,-0.171,0.787,2.241,-0.552,-0.491,-0.511
3,1.012,0.787,-0.231,-0.552,-0.491,0.651
4,1.012,-1.271,-0.584,0.594,0.744,0.651


In [93]:
print(df_X_train.shape)
df_X_train.head()

(714, 7)


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Survial
0,0.911,0.759,-0.53,0.525,-0.506,0.519,0.0
1,-1.476,-1.317,0.572,0.525,-0.506,-2.053,1.0
2,0.911,-1.317,-0.255,-0.552,-0.506,0.519,1.0
3,-1.476,-1.317,0.365,0.525,-0.506,0.519,1.0
4,0.911,0.759,0.365,-0.552,-0.506,0.519,0.0


In [94]:
# Numerator of the posterior if the unclassified observation is a male
P_yes * \
p_x_given_y(df_X_test['Pclass'][0], yes_Pclass_mean, yes_Pclass_variance) * \
p_x_given_y(df_X_test['Sex'][0], yes_Sex_mean, yes_Sex_variance) * \
p_x_given_y(df_X_test['Age'][0], yes_Age_mean, yes_Age_variance) * \
p_x_given_y(df_X_test['SibSp'][0], yes_SibSp_mean, yes_SibSp_variance) * \
p_x_given_y(df_X_test['Parch'][0], yes_Parch_mean, yes_Parch_variance) * \
p_x_given_y(df_X_test['Embarked'][0], yes_Embarked_mean, yes_Embarked_variance) 

0.00013129843663910163

In [95]:
# Numerator of the posterior if the unclassified observation is a female
P_no * \
p_x_given_y(df_X_test['Pclass'][0], no_Pclass_mean, no_Pclass_variance) * \
p_x_given_y(df_X_test['Sex'][0], no_Sex_mean, no_Sex_variance) * \
p_x_given_y(df_X_test['Age'][0], no_Age_mean, no_Age_variance) * \
p_x_given_y(df_X_test['SibSp'][0], no_SibSp_mean, no_SibSp_variance) * \
p_x_given_y(df_X_test['Parch'][0], no_Parch_mean, no_Parch_variance) * \
p_x_given_y(df_X_test['Embarked'][0], no_Embarked_mean, no_Embarked_variance) 

0.0014269063962945774

In [132]:
def model_predict(df):
    prediction = []
    probability = []
    for index, row in df.iterrows():
        yes = P_yes * \
p_x_given_y(row[0], yes_Pclass_mean, yes_Pclass_variance) * \
p_x_given_y(row[1], yes_Sex_mean, yes_Sex_variance) * \
p_x_given_y(row[2], yes_Age_mean, yes_Age_variance) * \
p_x_given_y(row[3], yes_SibSp_mean, yes_SibSp_variance) * \
p_x_given_y(row[4], yes_Parch_mean, yes_Parch_variance) * \
p_x_given_y(row[5], yes_Embarked_mean, yes_Embarked_variance) 

        no = P_no * \
p_x_given_y(row[0], no_Pclass_mean, no_Pclass_variance) * \
p_x_given_y(row[1], no_Sex_mean, no_Sex_variance) * \
p_x_given_y(row[2], no_Age_mean, no_Age_variance) * \
p_x_given_y(row[3], no_SibSp_mean, no_SibSp_variance) * \
p_x_given_y(row[4], no_Parch_mean, no_Parch_variance) * \
p_x_given_y(row[5], no_Embarked_mean, no_Embarked_variance) 
        
        # Compile probabilities
        probability.append([no, yes])
        
        # Predict class based on probabilities
        if yes > no:
            prediction.append(1.0)
        else:
            prediction.append(0.0)
            
    return prediction, probability


In [133]:
prediction, probability = model_predict(df_X_test)

In [129]:
print(np.shape(prediction))
print(np.shape(probability))

(332,)
(332, 2)


In [134]:
print(prediction)

[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0,

In [135]:
print(probability)

[[0.0014269063962945774, 0.00013129843663910163], [6.665866291665353e-05, 0.00011513984774513449], [0.00014704070424402654, 2.9857728402909053e-05], [0.0015890760888979832, 0.00010847110343541907], [8.12090810147909e-05, 0.00021688339196688805], [0.0007687641089845151, 6.472284814518319e-05], [0.0001135874455402823, 0.0003442621725850155], [0.001462874620047816, 0.00026864321753798903], [1.0267991786240879e-05, 0.00011982525232961133], [0.00044408765292286494, 9.72068937287029e-06], [0.00022881943836993376, 0.00010324509106946658], [2.6331057392956757e-05, 0.0004256647547912549], [0.00014066140154122498, 1.831511216317755e-05], [1.6301005002926854e-05, 0.00021519720838544348], [1.690767014579252e-05, 0.0003829279447185302], [0.0017011362153974648, 0.0003574478775678998], [0.00015895249529037262, 5.600836096201061e-05], [0.00012256971092135965, 0.00024535875901830523], [9.813431456739301e-06, 8.67205109250858e-05], [1.1482073578428228e-05, 2.280763375968956e-05], [0.00036253592551701284

### Gaussian Naiive Bayes Classifier

In [136]:
from sklearn.naive_bayes import GaussianNB

### Train Gaussian Naive Bayes Classifier

In [137]:
# Create Gaussian Naive Bayes object with prior probabilities of each class
clf = GaussianNB(priors=[P_yes,P_no])

# Train model
model = clf.fit(X_train, y_train)

### Predict Class

In [144]:
# Predict class
model.predict(X_test)

array([0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0.,
       1., 1., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0.,
       0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 1., 0.,
       0., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1.,
       1., 0., 0., 1., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 0.,
       1., 0., 0., 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
       0., 1., 0., 1., 0., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 1.,
       0., 0., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0.,
       1., 0., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1.,
       0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0.,
       0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 1.,
       0., 1., 1., 0., 1.

### View The Model’s Score
How good is our trained model compared to our training data?

In [138]:
print("Our model is %.2f%% accurate!" % (model.score(X_train, y_train)*100))

Our model is 73.81% accurate!


### Create Pipeline

In [139]:
# Create standardizer
standardizer = StandardScaler()

# Create SVM classifier
naiive = clf

# Create a pipeline that standardizes, then runs logistic regression
pipeline = make_pipeline(standardizer, naiive)

### Create k-Fold Cross-Validation

In [140]:
# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

### Conduct k-Fold Cross-Validation

In [141]:
# Do k-fold cross-validation
cv_results = cross_val_score(pipeline, # Pipeline
                             X_train, # Feature matrix
                             y_train, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Loss function
                             n_jobs=-1) # Use all CPU scores

In [142]:
cv_results

array([0.72222222, 0.77777778, 0.73611111, 0.69444444, 0.69014085,
       0.78873239, 0.78873239, 0.61971831, 0.76056338, 0.76056338])

### Calculate Mean Performance Score

In [143]:
# Calculate mean
cv_results.mean()

0.7339006259780907