# Naive Bayes (Suman Nokhwal)

Naive Bayes (NB) is a simple algorithm built on conditional probability and counting.
In effect, the model is a probability table whose entries are estimated from the training data.
To predict a new observation, you "look up" the class probabilities in that table based on the observation's feature values.
It's called "naive" because its core assumption of conditional independence (i.e. the input features are independent of one another given the class)
rarely holds true in the real world.
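
The "probability table" idea can be sketched in a few lines of plain Python. This is a minimal illustration for categorical features, not the estimator used later in this notebook (which is scikit-learn's `GaussianNB`). The function names, the toy data, and the Laplace smoothing parameter `alpha` are assumptions made for the example.

```python
from collections import Counter, defaultdict
import math


def train_counts(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[(feature_index, class_label)][value] -> count
    feature_counts = defaultdict(Counter)
    # values_seen[feature_index] -> distinct values, needed for smoothing
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen


def predict(row, class_counts, feature_counts, values_seen, alpha=1.0):
    """Return the class with the highest log-posterior for one observation."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        # Log prior P(class)
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            count = feature_counts[(i, label)][value]
            # Laplace-smoothed "table lookup" of P(value | class)
            n_values = len(values_seen[i])
            score += math.log((count + alpha) / (n_label + alpha * n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (hypothetical): (income band, has co-applicant) -> loan decision
rows = [("high", "yes"), ("high", "no"), ("low", "no"),
        ("low", "no"), ("high", "yes"), ("low", "yes")]
labels = ["approved", "approved", "denied", "denied", "approved", "denied"]

tables = train_counts(rows, labels)
prediction = predict(("high", "no"), *tables)
print(prediction)
```

Training only counts, and prediction only multiplies looked-up probabilities (summed here as logs to avoid underflow), which is why the conditional independence assumption matters: each feature contributes its own factor without regard to the others.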

Strengths: Even though the conditional independence assumption rarely holds true, NB models perform surprisingly well in practice, especially given how simple they are.
They are easy to implement and scale well as the dataset grows.

Weaknesses: Because they are so simple, NB models are often beaten by properly trained and tuned models built with the algorithms listed previously.

Reference: 
http://scikit-learn.org/stable/modules/naive_bayes.html
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

In [1]:
# Pretty display for notebooks
%matplotlib inline

import os
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
from time import time
#import cPickle
from IPython.display import display 
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# Import supplementary visualization code visuals.py
#import visuals as vs
import xlsxwriter
from sklearn.metrics import precision_recall_fscore_support as score
import pickle as cPickle  # standard-library pickle; it uses the C accelerator automatically

# Implement Naive Bayes


In [2]:
df_onehot = pd.read_csv('data/ny_hmda_2015_minmax.csv', low_memory=False, header=0, delimiter=",")


In [3]:
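num_rows = df_onehot.shape[0]
num_col = df_onehot.shape[1]
dataset = df_onehot.values
# Note: X spans every column (so it includes the target 'action_taken') and Y is column 0.
# Neither is used below; the later cells select features and labels by column name.
X = dataset[:, 0:num_col]
Y = dataset[:, 0]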

In [4]:
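# Features are every column except the target; drop by keyword (positional axis was removed in pandas 2.0)
x_minmax = np.array(df_onehot.drop(columns=['action_taken']))
y_minmax = np.array(df_onehot['action_taken'])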

X_train, X_test, Y_train, Y_test = train_test_split(x_minmax, y_minmax, test_size=0.33, random_state=42)

model = GaussianNB().fit(X_train, Y_train)

accuracy = model.score(X_test,Y_test)
print(accuracy)

precision, recall, fscore, support = score(Y_test, model.predict(X_test),average="macro")
print(precision)
print(recall)

0.752138815784
0.782811118364
0.816534013458
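
The three numbers printed above are, in order, accuracy, macro-averaged precision, and macro-averaged recall. As a rough sketch of what `average="macro"` computes, the snippet below takes the unweighted mean of the per-class scores. The function name and toy labels are hypothetical, and treating an undefined ratio as 0.0 is an assumption about edge-case handling.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Assumption: an undefined ratio (empty denominator) counts as 0.0
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy labels (hypothetical): four samples of class 1, two of class 0
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(macro_p, macro_r)
```

Because every class carries equal weight, a minority class that is predicted poorly pulls the macro score down more than it would pull down plain accuracy.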


In [5]:
df = pd.read_csv('data/ny_hmda_2015_normalize.csv', low_memory=False, header=0, delimiter=",")
x_normalize = np.array(df.drop(columns=['action_taken']))
y_normalize = np.array(df['action_taken'])

X_train, X_test, Y_train, Y_test = train_test_split(x_normalize, y_normalize, test_size=0.3, random_state=42)

model =  GaussianNB().fit(X_train, Y_train)
acc=model.score(X_test, Y_test)
print(acc)

precision, recall, fscore, support = score(Y_test, model.predict(X_test),average="macro")
print(precision)
print(recall)

0.752414678412
0.783178182425
0.816595710126


In [6]:
df = pd.read_csv('data/ny_hmda_2015_robust.csv', low_memory=False, header=0, delimiter=",")
x_robust = np.array(df.drop(columns=['action_taken']))
y_robust = np.array(df['action_taken'])

X_train, X_test, Y_train, Y_test = train_test_split(x_robust, y_robust, test_size=0.3, random_state=42)

model =  GaussianNB().fit(X_train, Y_train)
acc=model.score(X_test, Y_test)
print(acc)

precision, recall, fscore, support = score(Y_test, model.predict(X_test),average="macro")
print(precision)
print(recall)

0.752414678412
0.783178182425
0.816595710126
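
The normalized and robust-scaled runs report identical scores. One plausible explanation, assuming both files were produced by per-feature linear rescaling (the preprocessing is not shown here), is that Gaussian NB predictions do not change when each feature is shifted and scaled independently. The min-max run used `test_size=0.33` rather than `0.3`, so it was evaluated on a different split. The sketch below is a from-scratch Gaussian NB with hypothetical toy data. It omits the small variance smoothing term that scikit-learn's `GaussianNB` adds, so it illustrates the idea rather than reproducing the library.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        params[label] = (n / len(X), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Pick the class maximizing log prior plus summed Gaussian log-densities."""
    def log_posterior(label):
        prior, means, variances = params[label]
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(params, key=log_posterior)


def rescale(X, factors, offsets):
    """Apply x -> a * x + b independently to each feature column."""
    return [[a * v + b for v, a, b in zip(row, factors, offsets)] for row in X]


# Toy data (hypothetical): two features, two classes
X = [[1.0, 20.0], [1.5, 22.0], [2.0, 19.0],
     [6.0, 40.0], [6.5, 43.0], [7.0, 41.0]]
y = [0, 0, 0, 1, 1, 1]
X_new = [[1.8, 21.0], [6.2, 39.0], [4.0, 30.0]]

raw_model = fit_gaussian_nb(X, y)
raw_pred = [predict_gaussian_nb(raw_model, r) for r in X_new]

# Rescale every feature with its own factor and offset, then refit and predict
factors, offsets = [0.1, 3.0], [-5.0, 2.0]
scaled_model = fit_gaussian_nb(rescale(X, factors, offsets), y)
scaled_pred = [predict_gaussian_nb(scaled_model, r)
               for r in rescale(X_new, factors, offsets)]

print(raw_pred == scaled_pred)
```

Rescaling a feature by a factor `a` shifts every class's log-density for that feature by the same constant, `-log|a|`, so the ranking of classes is unchanged.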


In [7]:
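# Persist the most recently fitted model (trained on the robust-scaled data);
# the context manager makes sure the file is flushed and closed
with open('models/gaussian_nb_model.p', 'wb') as f:
    cPickle.dump(model, f)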