# Transparent Naive Bayes

## Overview

This project is an educational tool for me and anyone else who wants to understand how Naive Bayes classifiers are implemented in code. My goal is to document each step of the algorithm clearly, including as much of the underlying math as possible. I will also keep the code as simple and readable as possible, sacrificing performance and robustness where necessary.

## Bernoulli Naive Bayes

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)

X = np.random.binomial(1, 0.5, size=(1000, 4))
y = []

noise_factor = 0.1

for x in X:
    activation = 1 * x[0] + 2 * x[1] + 2 * x[2] + 3 * x[3]
    output = 1 if activation >= 5 else 0
    
    # Randomly flip output according to noise_factor
    if np.random.random() < noise_factor:
        output = 1 - output
    
    y.append(output)
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
from sklearn.naive_bayes import BernoulliNB as SKLearnBernoulliNB
from sklearn.metrics import accuracy_score
from bernoulli_nb import BernoulliNB

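def evaluate_model(model_class, X_train, X_test, y_train, y_test, params=None, name=None):
    # Build a fresh model (params=None avoids a shared mutable default), fit it, and predict
    model = model_class(**(params or {}))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # Fall back to the class name when no display name is given
    if name is None:
        name = model.__class__.__name__
    print("Accuracy score of {}: {:.2f}%".format(name, accuracy_score(y_test, pred) * 100))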

evaluate_model(SKLearnBernoulliNB, X_train, X_test, y_train, y_test, {'alpha': 0, 'binarize': None}, 'Benchmark')
evaluate_model(BernoulliNB, X_train, X_test, y_train, y_test)

Accuracy score of Benchmark: 92.50%
Accuracy score of BernoulliNB: 92.50%
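
The `bernoulli_nb` module itself is not reproduced here, so the following is only a minimal sketch of the idea behind a Bernoulli Naive Bayes classifier, not the project's actual code. The class name `SketchBernoulliNB`, the toy arrays, and the default `alpha=1.0` (additive smoothing) are my own assumptions for illustration. Each class gets a log prior plus, for every binary feature, the log-probability of the observed 0 or 1, and the class with the highest total wins.

```python
import numpy as np


class SketchBernoulliNB:
    """Illustrative Bernoulli Naive Bayes for 0/1 features (hypothetical name)."""

    def __init__(self, alpha=1.0):
        # alpha is the additive smoothing strength; the default of 1.0 is an assumption.
        self.alpha = alpha

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(c): fraction of training rows that belong to each class
        self.class_log_prior_ = np.log(
            np.array([(y == c).mean() for c in self.classes_])
        )
        # P(x_j = 1 | c), smoothed so that no estimate is exactly 0 or 1
        self.feature_prob_ = np.array([
            (X[y == c].sum(axis=0) + self.alpha) / ((y == c).sum() + 2 * self.alpha)
            for c in self.classes_
        ])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        log_p = np.log(self.feature_prob_)
        log_not_p = np.log(1.0 - self.feature_prob_)
        # log P(c) + sum_j [x_j * log p_cj + (1 - x_j) * log(1 - p_cj)]
        scores = self.class_log_prior_ + X @ log_p.T + (1.0 - X) @ log_not_p.T
        return self.classes_[np.argmax(scores, axis=1)]


# Toy data (made up): the label simply copies the first feature.
X_bern_toy = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y_bern_toy = np.array([1, 1, 0, 0, 1, 0])
bern_pred = SketchBernoulliNB().fit(X_bern_toy, y_bern_toy).predict(X_bern_toy)
print(bern_pred)
```

With `alpha=0` the estimates are raw frequencies, and a feature that is always 0 or always 1 within a class would make one of the logarithms infinite. That is why the sketch smooths by default.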




### Dataset

The dataset I will be using is the [Credit Approval Dataset](http://archive.ics.uci.edu/ml/datasets/Credit+Approval) from the UCI Machine Learning Repository. It has a good mix of continuous and categorical attributes, and a binary label.

### Data Exploration

Now I will load the data into a pandas dataframe and view its statistics.

In [None]:
import pandas as pd
import numpy as np

# Columns A1..A15 are the attributes; the last column is the label
names = ["A" + str(i) for i in range(1, 16)] + ["approve?"]

dtype = {'A1': str,
         'A2': np.float32,
         'A3': np.float32,
         'A4': str,
         'A5': str,
         'A6': str,
         'A7': str,
         'A8': np.float32,
         'A9': str,
         'A10': str,
         'A11': np.float32,
         'A12': str,
         'A13': str,
         'A14': np.float32,
         'A15': np.float32,
         'approve?': str}

data = pd.read_csv("./data.csv", header=None, names=names, dtype=dtype, na_values=['?'])

print(data.head())
print(data.describe())
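
As a small illustration of what `na_values=['?']` does, the sketch below parses two made-up rows (not taken from the real file) in which a `?` stands for a missing value. The column subset and the values are assumptions chosen only to keep the example short.

```python
import io

import pandas as pd

# Two made-up rows in comma-separated form; '?' marks a missing value.
sample = io.StringIO("b,30.5,1.0,+\n?,41.0,2.5,-\n")
demo = pd.read_csv(
    sample,
    header=None,
    names=["A1", "A2", "A3", "approve?"],
    na_values=["?"],
)

# Count how many cells were turned into NaN
n_missing = int(demo.isna().sum().sum())
print(demo)
print("missing cells:", n_missing)
```

Without `na_values`, the `?` would be read as an ordinary string, and a later `dropna` would leave that row in place.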

### Data Preprocessing

To preprocess the data, we'll first drop the rows with NaN values, then separate the labels from the features. We then one-hot encode the categorical columns and rescale every column to the range [0, 1] using min-max scaling.

In [None]:
data.dropna(axis=0, inplace=True)

y = data['approve?']
X = data.drop('approve?', axis=1)

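# One-hot encode the categorical columns; ask for floats so the arithmetic below works on every column
X = pd.get_dummies(X, dtype=float)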

X = (X - X.min()) / (X.max() - X.min())

X.head()
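
To make the two transformations concrete, here is a minimal sketch on a made-up three-row frame. The column names `income` and `kind` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical frame with one continuous and one categorical column
toy = pd.DataFrame({"income": [10.0, 20.0, 40.0], "kind": ["a", "b", "a"]})

# One-hot encoding replaces 'kind' with one indicator column per category
encoded = pd.get_dummies(toy, dtype=float)

# Min-max scaling maps each column's minimum to 0 and its maximum to 1
scaled = (encoded - encoded.min()) / (encoded.max() - encoded.min())
print(scaled)
```

The indicator columns already lie in [0, 1], so scaling leaves them unchanged. A column with a single constant value would produce NaN here, because its maximum and minimum coincide.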

### Splitting the data

Now we split the data into training and testing sets, holding out 20% of the rows for testing. We fix the random state so the results are reproducible. Because we're only comparing our model against the benchmark, we don't need to go to the extent of implementing k-fold cross-validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
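
The effect of fixing `random_state` can be checked with a short sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical names): calling `train_test_split` twice with the same seed should return the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
print("train rows:", len(split_a[0]), "test rows:", len(split_a[1]))
```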

## Benchmark Model

We will use scikit-learn's GaussianNB as the benchmark to test our from-scratch model against.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

model = GaussianNB()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, pred) * 100))
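
For orientation, this is a minimal sketch of what a from-scratch Gaussian Naive Bayes classifier could look like. It is not the project's implementation, and it is not how scikit-learn structures its own class. The name `SketchGaussianNB`, the fixed `var_floor` constant, and the toy data are assumptions for illustration. Each feature is modelled with one normal distribution per class, and the per-feature log densities are summed and added to the class log prior.

```python
import numpy as np


class SketchGaussianNB:
    """Illustrative Gaussian Naive Bayes (hypothetical name)."""

    def __init__(self, var_floor=1e-9):
        # var_floor is an assumed small constant that keeps every variance above zero.
        self.var_floor = var_floor

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(c): fraction of training rows in each class
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Per-class mean and variance of every feature, shape (n_classes, n_features)
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + self.var_floor
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # diff has shape (n_samples, n_classes, n_features)
        diff = X[:, None, :] - self.mean_[None, :, :]
        # Sum of log N(x_j | mean_cj, var_cj) over the features j
        log_lik = -0.5 * (np.log(2.0 * np.pi * self.var_) + diff ** 2 / self.var_).sum(axis=2)
        return self.classes_[np.argmax(self.log_prior_ + log_lik, axis=1)]


# Toy data (made up): two well-separated clusters labelled like the dataset's '+' and '-'
X_gauss_toy = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y_gauss_toy = np.array(["-", "-", "+", "+"])
gauss_pred = SketchGaussianNB().fit(X_gauss_toy, y_gauss_toy).predict(X_gauss_toy)
print(gauss_pred)
```

Working in log space avoids the numerical underflow that multiplying many small densities would cause.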