<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Naive Bayesian Classifiers</h1></center>

In this notebook, we'll be focusing on **_Naive Bayesian Classifiers_**, or "Naive Bayes" for short.  This is an algorithm that uses **_Bayes' Theorem_** to make a classification based on probability.  In case you're unfamiliar with Bayes' Theorem, let's look at the formula:
<br>
<br>
<center><img src='img/bayes_theorem.png' height=40% width=40%></center>
<br>
<br>
Don't worry if you haven't seen this mathematical notation before. In plain English, the formula reads:

"The probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B".  

Let's run through an example case here and see if we can demystify this equation a little bit more. 

<center><h3>Scenario: Spam Detection</h3></center>

We have a dataset of emails, and we're trying to build a classifier that can predict if an email is spam or not by examining the words contained in the emails.  Each email in our training set has been labeled as "spam" or "ham" (a real email, not spam).  We've counted how often each word appears in the emails, and found the following:

**_65% of the emails in the dataset are "Spam"._**

**_"Spam"_** emails contain the word _"deal"_ 80% of the time, and _"win"_ 40% of the time.  

**_35% of the emails in the dataset are "Ham"._**

**_"Ham"_** emails contain the word _"deal"_ 17% of the time, and _"win"_ 6% of the time.  

The next email we try to classify contains both of the words "deal" and "win". Given the information above, we can plug these numbers into Bayes' Theorem and predict the likelihood that this email is Spam.

$$P(\text{Spam} \mid \text{deal}, \text{win}) = \frac{P(\text{deal}, \text{win} \mid \text{Spam}) \, P(\text{Spam})}{P(\text{deal}, \text{win})}$$

Naive Bayes assumes that the words occur independently of one another within each class, so the joint likelihood factors into per-word terms. Expanding the denominator over both classes gives:

$$P(\text{Spam} \mid \text{deal}, \text{win}) = \frac{P(\text{deal} \mid \text{Spam}) \, P(\text{win} \mid \text{Spam}) \, P(\text{Spam})}{P(\text{deal} \mid \text{Spam}) \, P(\text{win} \mid \text{Spam}) \, P(\text{Spam}) + P(\text{deal} \mid \text{!Spam}) \, P(\text{win} \mid \text{!Spam}) \, P(\text{!Spam})}$$

In the equation above, "P(deal|!Spam)" can be read as "the probability that 'deal' appears in a 'Ham' email".  

In the next step, we'll list the probabilities for everything in that equation so we can plug them in:

1. P(deal|Spam) = .8
1. P(win|Spam) = .4
1. P(Spam) = .65
1. P(deal|!Spam) = .17
1. P(win|!Spam) = .06
1. P(!Spam) = .35

Let's replace the terms with the probabilities listed above and see how it works out:

$$\frac{.8 \times .4 \times .65}{.8 \times .4 \times .65 + .17 \times .06 \times .35} = \frac{.208}{.208 + .00357} \approx 0.9831$$

Based on Bayes' Theorem and the independence assumption, we predict that the probability that a new email containing both "deal" and "win" is "Spam" is approximately **98.3%**!
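
The spam calculation can be checked in a few lines of Python. This is a minimal sketch that assumes "deal" and "win" occur independently within each class (the naive assumption); the variable names are illustrative and do not come from any library.

```python
# Probabilities taken from the spam example above.
p_spam, p_ham = 0.65, 0.35
p_deal_spam, p_win_spam = 0.80, 0.40
p_deal_ham, p_win_ham = 0.17, 0.06

# Naive assumption: word occurrences are independent within each class,
# so the joint likelihood is the product of the per-word likelihoods.
spam_score = p_deal_spam * p_win_spam * p_spam
ham_score = p_deal_ham * p_win_ham * p_ham

# Normalize so that the two class scores sum to 1.
posterior_spam = spam_score / (spam_score + ham_score)
print(round(posterior_spam, 4))
```

Dividing by the sum of both class scores is what turns the two unnormalized products into a probability.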

<center><h3>Using Naive Bayes in the Real World</h3></center>

In the above example, we did the math by hand.  That isn't very practical in the real world.  Luckily, `sklearn` contains several implementations of Naive Bayesian Classifiers.  

For this assignment, we're going to use a `GaussianNB()` object.  There are a few different kinds of Naive Bayesian Classifiers, but for this one we'll stick to one that assumes each feature follows a Gaussian (normal) distribution within each class.  
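
To make the Gaussian assumption concrete, here is a rough sketch of how a class score can be built from a normal density for a single feature. The class names and statistics are hypothetical values chosen for illustration, and this is not how `sklearn` is implemented internally; with several features, the per-feature densities would be multiplied together.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for one feature (illustrative values only).
class_stats = {
    "A": {"prior": 0.5, "mean": 1.0, "var": 0.25},
    "B": {"prior": 0.5, "mean": 3.0, "var": 0.25},
}

def posterior(x):
    # Unnormalized score per class: prior times Gaussian likelihood.
    scores = {c: s["prior"] * gaussian_pdf(x, s["mean"], s["var"])
              for c, s in class_stats.items()}
    total = sum(scores.values())
    # Normalize so the class probabilities sum to 1.
    return {c: v / total for c, v in scores.items()}

probs = posterior(1.2)
print(max(probs, key=probs.get))
```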

Let's Get Started!

In [103]:
import numpy as np
np.random.seed(0)
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score


# Use load_iris() to load the data into the iris variable, and then assign iris.data and iris.target to the appropriate
# variables
iris = load_iris()
data = iris.data
labels = iris.target

# Use train_test_split to split the data into X_train, X_test, y_train, and y_test variables
X_train, X_test, y_train, y_test  = train_test_split(data, labels)

# Create a GaussianNB() object and fit it using the training data
clf = GaussianNB()
clf.fit(X_train, y_train)

# Use the fitted model to create predictions for the X_test data.
preds = clf.predict(X_test)


# Run it all and see how you did!
print(preds)
print(y_test)
print("accuracy score: {}".format(accuracy_score(y_test, preds)))
print("f1 score: {}".format(f1_score(y_test, preds, average='weighted')))

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
accuracy score: 1.0
f1 score: 1.0


<center><h3>Some Caveats</h3></center>

You may have wondered why this particular model is called a **_Naive_** Bayesian Classifier.  In this scenario, the word "Naive" simply means that the model makes the "naive" assumption that all features are independent of one another, given the class.  This leads us to the main caveat of this model: if you have feature columns that are highly correlated, this model may not work as well as we'd like.  **_If you're going to use Naive Bayes, make sure you check for highly correlated features beforehand!_**
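
One way to check for highly correlated features is to look at the pairwise Pearson correlations. The sketch below does this for the iris data used earlier; the 0.9 threshold is an arbitrary illustrative cutoff, not a fixed rule.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Pearson correlation between feature columns (rowvar=False: columns are variables).
corr = np.corrcoef(iris.data, rowvar=False)

threshold = 0.9  # illustrative cutoff, chosen for this sketch
names = iris.feature_names
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
for pair in high_pairs:
    print(pair)
```

Any pair that shows up here is a candidate for dropping one of the two columns before fitting.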


<center><h3>Where to Go From Here</h3></center>

For the latter part of this assignment, you're going to use the famous Pima Indians Diabetes Dataset to build a Naive Bayesian Classifier that predicts whether or not an individual has diabetes.  You'll find the `pima_indians_diabetes.csv` file inside the `datasets` folder.  

To build this classifier successfully, you'll want to follow the best practices for loading in and preprocessing a data set that you've learned in class:

1. Importing the data
1. Exploring the data
1. "Cleaning" the data
1. Splitting the data into training and testing sets (or using K-Fold cross-validation; more on this below)
1. Fitting the model
1. Validating the model (checking predictions on the test set)

Be sure to consider the following questions as you solve this problem:

* How will you deal with null values?
* For this model, does scaling the data improve your results? (HINT: test your assumption!)
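
As a sketch of how the scaling question could be tested, the snippet below compares cross-validated accuracy with and without standardization. It uses the iris data as a stand-in (an assumption made so the example is self-contained); the same comparison applies to the diabetes features.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in data for the sketch

raw_scores = cross_val_score(GaussianNB(), X, y, cv=5)
# The pipeline refits the scaler on each training fold, which avoids leakage.
scaled_scores = cross_val_score(
    make_pipeline(StandardScaler(), GaussianNB()), X, y, cv=5
)

print("unscaled mean accuracy:", raw_scores.mean())
print("scaled mean accuracy:  ", scaled_scores.mean())
```

Because `GaussianNB` estimates a separate mean and variance for every feature, a per-feature linear rescaling would be expected to change the results very little, but it is worth confirming on your own data.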

On top of cleaning and preprocessing this data set, you'll also use **_Cross Validation_** to get a better measure of the accuracy of your model.  We did not use K Fold Cross Validation in the above model on purpose--instead, you'll need to work your way through `sklearn`'s [model_selection documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to figure out how to effectively make use of cross-validation.  

(**_Hint:_** There are several ways to implement cross validation using sklearn.  In the `model_selection` section of the documentation, pay special attention to the `KFold` object, as well as the methods available under the _Model Selection_ subsection.)


Good luck!

In [104]:
# path to file: "datasets/pima_indians_diabetes.csv". The first row of the .csv contains the column names.
# Note that in the "Outcome" column, 0 denotes someone that does NOT have diabetes, and 1 denotes someone that does.  
import pandas as pd
path = "datasets/pima_indians_diabetes.csv"
df = pd.read_csv(path)
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [105]:
# Keep only the rows that have no null value in any column
no_null_df = df.dropna()
no_null_df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [153]:
no_zero_df = no_null_df[(no_null_df["Glucose"] > 0)
                 & (no_null_df["BloodPressure"] > 0)
                 & (no_null_df["SkinThickness"] > 0)
                 & (no_null_df["Insulin"] > 0)
                 & (no_null_df["BMI"] > 0)]
no_zero_df_reset = no_zero_df.reset_index(drop=True)
no_zero_df_reset.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 3.30102 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 | 0.331633 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.84169 | 7.027659 | 0.345488 | 10.200777 | 0.471401 |
| min | 0.0 | 56.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.085 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 21.0 | 76.75 | 28.4 | 0.26975 | 23.0 | 0.0 |
| 50% | 2.0 | 119.0 | 70.0 | 29.0 | 125.5 | 33.2 | 0.4495 | 27.0 | 0.0 |
| 75% | 5.0 | 143.0 | 78.0 | 37.0 | 190.0 | 37.1 | 0.687 | 36.0 | 1.0 |
| max | 17.0 | 198.0 | 110.0 | 63.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
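
Dropping every row that contains a zero removes almost half of the data (768 rows down to 392). One alternative, sketched below on a small hypothetical frame, is to treat the zeros as missing values and fill them with the column median. This is a different cleaning strategy from the one used in this notebook, shown only for comparison.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same kind of zero placeholders.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 94, 0, 168],
    "Outcome": [1, 0, 1, 0],
})
cols = ["Glucose", "Insulin"]  # columns where 0 is not a plausible measurement

imputed = toy.copy()
# Mark zeros as missing, then fill each column with the median of its observed values.
imputed[cols] = imputed[cols].replace(0, np.nan)
imputed[cols] = imputed[cols].fillna(imputed[cols].median())
print(imputed)
```

When this is combined with cross-validation, the medians should ideally be computed from the training fold only.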


In [154]:
# save the Outcome column as the labels
labels = no_zero_df_reset['Outcome']

In [155]:
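# remove the Outcome column from the features
# (the positional axis argument of drop() is no longer accepted by recent pandas)
cleaned_df = no_zero_df_reset.drop(columns="Outcome")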
cleaned_df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 3.30102 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.84169 | 7.027659 | 0.345488 | 10.200777 |
| min | 0.0 | 56.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.085 | 21.0 |
| 25% | 1.0 | 99.0 | 62.0 | 21.0 | 76.75 | 28.4 | 0.26975 | 23.0 |
| 50% | 2.0 | 119.0 | 70.0 | 29.0 | 125.5 | 33.2 | 0.4495 | 27.0 |
| 75% | 5.0 | 143.0 | 78.0 | 37.0 | 190.0 | 37.1 | 0.687 | 36.0 |
| max | 17.0 | 198.0 | 110.0 | 63.0 | 846.0 | 67.1 | 2.42 | 81.0 |


In [140]:
from sklearn.model_selection import KFold

def predict_with_K_Fold_Cross_Validation(n_splits, dataframe, labels):
    """Fit GaussianNB on each K-Fold split and print per-fold and average scores."""
    kf = KFold(n_splits=n_splits)
    accuracies = []
    f1_scores = []
    for train_index, test_index in kf.split(dataframe):
        # Select rows by position so the split works regardless of the index labels
        X_train, X_test = dataframe.iloc[train_index], dataframe.iloc[test_index]
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]

        # Create a GaussianNB() object and fit it using the training data
        clf = GaussianNB()
        clf.fit(X_train, y_train)

        # Use the fitted model to create predictions for the X_test data.
        preds = clf.predict(X_test)

        # Score this fold once, then print and store the results
        accuracy = accuracy_score(y_test, preds)
        f1 = f1_score(y_test, preds, average='weighted')
        print("accuracy score: {}".format(accuracy))
        print("f1 score: {}".format(f1))
        accuracies.append(accuracy)
        f1_scores.append(f1)
    print("average accuracy scores: {}".format(np.average(accuracies)))
    print("average f1 scores: {}".format(np.average(f1_scores)))

## Prediction with all features
### Number of folds (n_splits) = 5

In [156]:
predict_with_K_Fold_Cross_Validation(5, cleaned_df, labels)

accuracy score: 0.8354430379746836
f1 score: 0.8347458857966417
accuracy score: 0.6708860759493671
f1 score: 0.6676036897387557
accuracy score: 0.8076923076923077
f1 score: 0.8064218064218063
accuracy score: 0.782051282051282
f1 score: 0.7892507892507893
accuracy score: 0.7948717948717948
f1 score: 0.796844181459566
average accuracy scores: 0.778188899707887
average f1 scores: 0.7789732705335118


### Number of folds (n_splits) = 10

In [157]:
predict_with_K_Fold_Cross_Validation(10, cleaned_df, labels)

accuracy score: 0.825
f1 score: 0.8257898130238557
accuracy score: 0.825
f1 score: 0.816949152542373
accuracy score: 0.6410256410256411
f1 score: 0.6367521367521367
accuracy score: 0.6666666666666666
f1 score: 0.6695715323166304
accuracy score: 0.8205128205128205
f1 score: 0.8232118758434547
accuracy score: 0.7948717948717948
f1 score: 0.7901234567901235
accuracy score: 0.7435897435897436
f1 score: 0.7535612535612537
accuracy score: 0.8461538461538461
f1 score: 0.8491124260355029
accuracy score: 0.7692307692307693
f1 score: 0.7765457184325109
accuracy score: 0.8205128205128205
f1 score: 0.8189486620859169
average accuracy scores: 0.7752564102564102
average f1 scores: 0.7760566027383757
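
The per-fold accuracies in the 10-fold run vary quite a bit (from about 0.64 to about 0.85), so it helps to report the spread along with the mean. The sketch below shows one way to do that with `cross_val_score`; it uses the iris data as a stand-in so that it runs on its own, and the shuffling and `random_state` are choices made for this example, not settings used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # stand-in data; swap in the cleaned diabetes features

# shuffle=True mixes the rows before splitting; random_state makes it repeatable.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=kf, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print("mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```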


# Make assumptions

## Patients

In [158]:
patients_df = no_zero_df_reset[no_zero_df_reset['Outcome'] == 1]
patients_df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 130.0 | 130.0 | 130.0 | 130.0 | 130.0 | 130.0 | 130.0 | 130.0 | 130.0 |
| mean | 4.469231 | 145.192308 | 74.076923 | 32.961538 | 206.846154 | 35.777692 | 0.625585 | 35.938462 | 1.0 |
| std | 3.916153 | 29.839388 | 13.021518 | 9.64277 | 132.699898 | 6.734687 | 0.40591 | 10.634705 | 0.0 |
| min | 0.0 | 78.0 | 30.0 | 7.0 | 14.0 | 22.9 | 0.127 | 21.0 | 1.0 |
| 25% | 1.0 | 124.25 | 66.5 | 26.0 | 127.5 | 31.6 | 0.32975 | 27.25 | 1.0 |
| 50% | 3.0 | 144.5 | 74.0 | 33.0 | 169.5 | 34.6 | 0.546 | 33.0 | 1.0 |
| 75% | 7.0 | 171.75 | 82.0 | 39.75 | 239.25 | 38.35 | 0.7865 | 43.0 | 1.0 |
| max | 17.0 | 198.0 | 110.0 | 63.0 | 846.0 | 67.1 | 2.42 | 60.0 | 1.0 |


## Not Patients

In [159]:
# Use a separate name so the patients frame above is not overwritten
non_patients_df = no_zero_df_reset[no_zero_df_reset['Outcome'] == 0]
non_patients_df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 262.0 | 262.0 | 262.0 | 262.0 | 262.0 | 262.0 | 262.0 | 262.0 | 262.0 |
| mean | 2.721374 | 111.431298 | 68.969466 | 27.251908 | 130.854962 | 31.750763 | 0.472168 | 28.347328 | 0.0 |
| std | 2.617844 | 24.642133 | 11.892841 | 10.434135 | 102.626177 | 6.794971 | 0.29924 | 8.989008 | 0.0 |
| min | 0.0 | 56.0 | 24.0 | 7.0 | 15.0 | 18.2 | 0.085 | 21.0 | 0.0 |
| 25% | 1.0 | 94.0 | 60.0 | 18.25 | 66.0 | 26.125 | 0.261 | 22.0 | 0.0 |
| 50% | 2.0 | 107.5 | 70.0 | 27.0 | 105.0 | 31.25 | 0.4135 | 25.0 | 0.0 |
| 75% | 4.0 | 126.0 | 76.0 | 34.0 | 163.75 | 36.1 | 0.62425 | 30.0 | 0.0 |
| max | 13.0 | 197.0 | 106.0 | 60.0 | 744.0 | 57.3 | 2.329 | 81.0 | 0.0 |
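
Per-class summaries like the two above are closely related to what `GaussianNB` learns: it estimates a mean (and a variance) for every feature within every class. The sketch below checks this on the iris data, used here as a stand-in so the example runs on its own, by comparing the fitted `theta_` attribute with a `groupby` mean.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()  # stand-in data for the sketch
frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame["target"] = iris.target

# Per-class feature means, the same kind of number describe() reports class by class.
class_means = frame.groupby("target").mean()

clf = GaussianNB().fit(iris.data, iris.target)
# theta_ holds the mean of each feature for each class.
print(np.allclose(clf.theta_, class_means.to_numpy()))
```

Features whose class means differ a lot relative to their spread, such as Glucose in the tables above, tend to be the most useful to this kind of model.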


# 1. Prediction with Glucose, Insulin and DiabetesPedigreeFunction only (without Pregnancies, BloodPressure, SkinThickness, BMI and Age)

In [160]:
new_df = no_null_df[['Glucose', 'Insulin', 'DiabetesPedigreeFunction', 'Outcome']]
new_df.describe()

| | Glucose | Insulin | DiabetesPedigreeFunction | Outcome |
|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 120.894531 | 79.799479 | 0.471876 | 0.348958 |
| std | 31.972618 | 115.244002 | 0.331329 | 0.476951 |
| min | 0.0 | 0.0 | 0.078 | 0.0 |
| 25% | 99.0 | 0.0 | 0.24375 | 0.0 |
| 50% | 117.0 | 30.5 | 0.3725 | 0.0 |
| 75% | 140.25 | 127.25 | 0.62625 | 1.0 |
| max | 199.0 | 846.0 | 2.42 | 1.0 |


In [161]:
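no_zero_new_df = new_df[(new_df['Glucose'] > 0) & (new_df['Insulin'] > 0)]
# Note: reset_index() without drop=True keeps the old index as an "index" column,
# so it stays among the features used below; pass drop=True to discard it.
no_zero_new_df = no_zero_new_df.reset_index()
no_zero_new_df.describe()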

| | index | Glucose | Insulin | DiabetesPedigreeFunction | Outcome |
|---|---|---|---|---|---|
| count | 393.0 | 393.0 | 393.0 | 393.0 | 393.0 |
| mean | 387.048346 | 122.615776 | 155.885496 | 0.52612 | 0.330789 |
| std | 215.742276 | 30.822276 | 118.738199 | 0.350386 | 0.471097 |
| min | 3.0 | 56.0 | 14.0 | 0.085 | 0.0 |
| 25% | 204.0 | 99.0 | 77.0 | 0.27 | 0.0 |
| 50% | 384.0 | 119.0 | 125.0 | 0.452 | 0.0 |
| 75% | 567.0 | 143.0 | 190.0 | 0.687 | 1.0 |
| max | 765.0 | 198.0 | 846.0 | 2.42 | 1.0 |
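
The `index` column in the summary above is a side effect of calling `reset_index()` without `drop=True`: the old row labels are kept as a regular column, which is then passed to the classifier as if it were a feature unless it is removed. A minimal sketch of the difference, on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame whose index has gaps after filtering.
toy = pd.DataFrame({"Glucose": [89, 137, 78]}, index=[3, 4, 6])

kept = toy.reset_index()              # old index becomes a column named "index"
dropped = toy.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
```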


In [162]:
# new labels
new_labels = no_zero_new_df['Outcome']

In [163]:
# remove Outcome from no_zero_new_df
new_clean_df = no_zero_new_df.drop(columns='Outcome')

### Number of folds (n_splits) = 5

In [164]:
predict_with_K_Fold_Cross_Validation(5, new_clean_df, new_labels)

accuracy score: 0.7088607594936709
f1 score: 0.7118514039775572
accuracy score: 0.6962025316455697
f1 score: 0.6815232422082759
accuracy score: 0.7848101265822784
f1 score: 0.780110327526846
accuracy score: 0.7948717948717948
f1 score: 0.7912120588302569
accuracy score: 0.7948717948717948
f1 score: 0.7825332562174667
average accuracy scores: 0.7559234014930217
average f1 scores: 0.7494460577520805


### Number of folds (n_splits) = 10

In [165]:
predict_with_K_Fold_Cross_Validation(10, new_clean_df, new_labels)

accuracy score: 0.775
f1 score: 0.7647798742138365
accuracy score: 0.8
f1 score: 0.8
accuracy score: 0.7
f1 score: 0.6907928388746803
accuracy score: 0.6923076923076923
f1 score: 0.6753246753246752
accuracy score: 0.7692307692307693
f1 score: 0.765113566184039
accuracy score: 0.7948717948717948
f1 score: 0.7901234567901235
accuracy score: 0.7948717948717948
f1 score: 0.7948717948717948
accuracy score: 0.7948717948717948
f1 score: 0.7892107892107892
accuracy score: 0.8205128205128205
f1 score: 0.8109060134037832
accuracy score: 0.7692307692307693
f1 score: 0.7546366676801458
average accuracy scores: 0.7710897435897436
average f1 scores: 0.7635759676553867


# 2. Prediction with DiabetesPedigreeFunction and Age
- More samples, fewer features

In [166]:
D_A_Df = no_null_df[['DiabetesPedigreeFunction', 'Age', 'Outcome']]
D_A_Df.describe()

| | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 |
| mean | 0.471876 | 33.240885 | 0.348958 |
| std | 0.331329 | 11.760232 | 0.476951 |
| min | 0.078 | 21.0 | 0.0 |
| 25% | 0.24375 | 24.0 | 0.0 |
| 50% | 0.3725 | 29.0 | 0.0 |
| 75% | 0.62625 | 41.0 | 1.0 |
| max | 2.42 | 81.0 | 1.0 |


In [167]:
# outcomes
D_A_labels = D_A_Df['Outcome']

# remove Outcome from D_A_Df
cleaned_D_A_Df = D_A_Df.drop(columns='Outcome')

In [168]:
predict_with_K_Fold_Cross_Validation(5, cleaned_D_A_Df, D_A_labels)

accuracy score: 0.6688311688311688
f1 score: 0.6022600284925189
accuracy score: 0.6103896103896104
f1 score: 0.5521072265258312
accuracy score: 0.5974025974025974
f1 score: 0.5572525107408828
accuracy score: 0.7450980392156863
f1 score: 0.7363288617814522
accuracy score: 0.6601307189542484
f1 score: 0.5952428428998717
average accuracy scores: 0.6563704269586622
average f1 scores: 0.6086382940881114


In [169]:
predict_with_K_Fold_Cross_Validation(10, cleaned_D_A_Df, D_A_labels)

accuracy score: 0.5714285714285714
f1 score: 0.5073179491784143
accuracy score: 0.7662337662337663
f1 score: 0.7142857142857143
accuracy score: 0.6233766233766234
f1 score: 0.5563056260730679
accuracy score: 0.6233766233766234
f1 score: 0.5777683452102057
accuracy score: 0.6623376623376623
f1 score: 0.6145379121785656
accuracy score: 0.5844155844155844
f1 score: 0.528756957328386
accuracy score: 0.7012987012987013
f1 score: 0.7121092911808884
accuracy score: 0.7792207792207793
f1 score: 0.7458256029684602
accuracy score: 0.6578947368421053
f1 score: 0.6077556803274388
accuracy score: 0.6447368421052632
f1 score: 0.5698435277382646
average accuracy scores: 0.661431989063568
average f1 scores: 0.6134506606469405


# Summary

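- More features, higher accuracy? In the runs above, average 5-fold accuracy fell from about 0.78 with all eight features to about 0.76 in section 1 and about 0.66 in section 2, even though section 2 used all 768 rows.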