# Transparent Gaussian Naive Bayes

## Overview

This project is an educational tool for me and anyone else who wants to understand how a Gaussian Naive Bayes classifier is implemented in code. My goal is to document each step of the algorithm clearly, including the underlying math wherever possible. I will also keep the code as simple and readable as possible, sacrificing performance and robustness where necessary.

### Dataset

The dataset I will be using is the [Credit Approval Dataset](http://archive.ics.uci.edu/ml/datasets/Credit+Approval) from the UCI Machine Learning Repository. It has a good mix of continuous and categorical attributes, and a binary label.

### Data Exploration

Now I will load the data into a pandas DataFrame and view its first rows and summary statistics.

In [29]:
import pandas as pd
import numpy as np

# Column names: attributes A1..A15 followed by the label column.
names = ["A" + str(i) for i in range(1, 16)]
names.append("approve?")

dtype = {'A1': str,
         'A2': np.float32,
         'A3': np.float32,
         'A4': str,
         'A5': str,
         'A6': str,
         'A7': str,
         'A8': np.float32,
         'A9': str,
         'A10': str,
         'A11': np.float32,
         'A12': str,
         'A13': str,
         'A14': np.float32,
         'A15': np.float32,
         'approve?': str}

data = pd.read_csv("./data.csv", header=None, names=names, dtype=dtype, na_values=['?'])

print(data.head())
print(data.describe())

  A1         A2     A3 A4 A5 A6 A7    A8 A9 A10  A11 A12 A13    A14    A15  \
0  b  30.830000  0.000  u  g  w  v  1.25  t   t  1.0   f   g  202.0    0.0   
1  a  58.669998  4.460  u  g  q  h  3.04  t   t  6.0   f   g   43.0  560.0   
2  a  24.500000  0.500  u  g  q  h  1.50  t   f  0.0   f   g  280.0  824.0   
3  b  27.830000  1.540  u  g  w  v  3.75  t   t  5.0   t   g  100.0    3.0   
4  b  20.170000  5.625  u  g  w  v  1.71  t   f  0.0   f   s  120.0    0.0   

  approve?  
0        +  
1        +  
2        +  
3        +  
4        +  
               A2          A3          A8         A11          A14  \
count  678.000000  690.000000  690.000000  690.000000   677.000000   
mean    31.568169    4.758725    2.223407    2.400000   184.014771   
std     11.957860    4.978165    3.346511    4.862927   173.806808   
min     13.750000    0.000000    0.000000    0.000000     0.000000   
25%     22.602500    1.000000    0.165000    0.000000    75.000000   
50%     28.460000    2.750000    
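
The `count` row above is lower for `A2` and `A14` than for the other columns because `na_values=['?']` turns the dataset's `?` placeholders into NaN, which `describe()` skips. Here is a minimal sketch of that behavior on a made-up three-row CSV (the rows are illustrative, not taken from the dataset):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical mini-CSV: "?" marks a missing value, as in the real file.
raw = "b,30.83\na,?\n?,24.50\n"

toy = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["A1", "A2"],
    dtype={"A1": str, "A2": np.float32},
    na_values=["?"],
)

# Each column has exactly one "?" that was parsed as NaN.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Only the first row has no missing value, so it is the only one dropna() keeps.
complete_rows = toy.dropna()
print(len(complete_rows))
```

Running `data.isna().sum()` on the real frame would show the same kind of per-column tally.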

### Data Preprocessing

To preprocess the data, we first drop the rows that contain NaN values, then separate the labels from the features. We then one-hot encode the categorical columns and normalize every column using min-max scaling.

In [30]:
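# Drop every row that contains at least one missing value.
data.dropna(axis=0, inplace=True)

# Separate the label column from the features.
y = data['approve?']
X = data.drop('approve?', axis=1)

# One-hot encode the categorical columns. dtype=float keeps the dummies
# numeric (newer pandas versions default to bool) so they can be scaled.
X = pd.get_dummies(X, dtype=float)

# Min-max scale every column to the range [0, 1].
X = (X - X.min()) / (X.max() - X.min())

X.head()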

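|   | A2 | A3 | A8 | A11 | A14 | A15 | A1_a | A1_b | A4_l | A4_u | ... | A7_z | A9_f | A9_t | A10_f | A10_t | A12_f | A12_t | A13_g | A13_p | A13_s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.271111 | 0.0 | 0.04386 | 0.014925 | 0.101 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.713016 | 0.159286 | 0.106667 | 0.089552 | 0.0215 | 0.0056 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.170635 | 0.017857 | 0.052632 | 0.0 | 0.14 | 0.00824 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.223492 | 0.055 | 0.131579 | 0.074627 | 0.05 | 3e-05 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.101905 | 0.200893 | 0.06 | 0.0 | 0.06 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |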
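
To make the two transformations concrete, here is a small sketch on a made-up frame (the values are hypothetical, chosen only for illustration). One-hot encoding replaces a categorical column with one indicator column per category, and min-max scaling maps each column to [0, 1] via `(x - min) / (max - min)`.

```python
import pandas as pd

# Hypothetical toy data: one continuous and one categorical column.
toy = pd.DataFrame({"A2": [20.0, 30.0, 40.0], "A4": ["u", "l", "u"]})

# One-hot encode: A4 becomes the indicator columns A4_l and A4_u.
encoded = pd.get_dummies(toy, dtype=float)

# Min-max scale every column to [0, 1].
scaled = (encoded - encoded.min()) / (encoded.max() - encoded.min())
print(scaled)
```

The indicator columns already span 0 to 1, so scaling leaves them unchanged. A column whose values are all identical would produce NaN from 0/0, so this sketch assumes every column takes at least two distinct values.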


### Splitting the data

Now we split the data into training and testing sets, using a fixed random state so the results are reproducible. Because we're only comparing our model against the benchmark on the same split, we don't need to go as far as implementing k-fold cross-validation.

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
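
To illustrate what a seeded hold-out split does, here is a minimal NumPy sketch. The helper `holdout_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows that shuffling the row indices with a fixed seed yields the same partition on every run.

```python
import numpy as np


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_rows)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # Reserve the first n_test shuffled indices for the test set.
    n_test = int(np.ceil(test_size * n_rows))
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = holdout_split(10)
train_again, test_again = holdout_split(10)

# Same seed, same partition.
print(np.array_equal(test_idx, test_again))
# The two index sets are disjoint and together cover all rows.
print(sorted(np.concatenate([train_idx, test_idx]).tolist()))
```

Changing `seed` gives a different partition, which is why a fixed value is needed when two models are to be compared on identical test rows.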

## Benchmark Model

We will use scikit-learn's `GaussianNB` as the benchmark for our from-scratch model.

In [34]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

model = GaussianNB()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, pred) * 100))

Accuracy: 79.39%
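
For reference, the standard Gaussian Naive Bayes decision rule can be written as follows. The symbols are notation introduced here, not names from the code above: $C_k$ is a class, $x_i$ is the $i$-th of $n$ features, and $\mu_{ik}$ and $\sigma_{ik}^2$ are the mean and variance of feature $i$ among the training rows of class $C_k$.

```latex
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),
\qquad
p(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}}
\exp\!\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)
```

In practice the product is usually evaluated as a sum of logarithms, which avoids numerical underflow when many small densities are multiplied together.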
