# Testcases for Naive Bayes

Test cases that check whether our *mixed naive Bayes* implementation (`MixedNB`) works as expected on categorical, numeric, and mixed data.


## Overview

* Setup
* Amazon Dataset (Categorical, High Dimensional, Multiclass)
* Custom Dataset (Numeric & Categorical, Small)
* Custom Dataset (Categorical, Small)
* Wine Dataset (Numeric, Multiclass)
* Cross Validation with Pipeline


## Setup

In [1]:
import pandas as pd
import numpy as np
from naive_bayes import MixedNB
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine

In [2]:
SEED = 42

## Amazon Dataset (Categorical, High Dimensional, Multiclass)

In [3]:
df_amazon = pd.read_csv("../data/amazon_review.csv")
X_amazon = df_amazon.drop(columns=["ID", "Class"])
y_amazon = df_amazon["Class"]
# Square-root transform of the features; both models below are fitted on this version
X_amazon_sqrt = np.sqrt(X_amazon)

**Sklearn**

In [4]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB(var_smoothing=0.01)
scores = cross_validate(model, X_amazon_sqrt, y_amazon)
print(f"avg. accuracy:   {scores['test_score'].mean(): 2.4f}")
print(f"avg. fit time:   {scores['fit_time'].mean(): 2.4f} s")
print(f"avg. score time: {scores['score_time'].mean(): 2.4f} s")

avg. accuracy:    0.6440
avg. fit time:    0.1009 s
avg. score time:  0.3526 s


**Our Approach**

In [5]:
# Every feature is flagged as numeric (mask entry False), mirroring the GaussianNB run above
mask = [False] * X_amazon.shape[1]
model = MixedNB(categorical_feature_mask=mask, laplace_smoothing=1, var_smoothing=0.01)
scores = cross_validate(model, X_amazon_sqrt, y_amazon)
print(f"avg. accuracy:   {scores['test_score'].mean(): 2.4f}")
print(f"avg. fit time:   {scores['fit_time'].mean(): 2.4f} s")
print(f"avg. score time: {scores['score_time'].mean(): 2.4f} s")

avg. accuracy:    0.6440
avg. fit time:    0.4303 s
avg. score time:  2.8464 s


Both implementations reach the same average accuracy (0.6440), but sklearn is considerably faster: roughly 4x for fitting (0.10 s vs. 0.43 s) and roughly 8x for scoring (0.35 s vs. 2.85 s).
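Identical accuracy is what one would expect if, with every feature flagged as numeric, the mixed model reduces to plain Gaussian naive Bayes. The sketch below illustrates that reduction with a hand-rolled Gaussian naive Bayes in NumPy, compared against `GaussianNB` on the wine data. It is not the `MixedNB` code: the function name `gaussian_nb_predict` is hypothetical, and whether `MixedNB` uses exactly these formulas is an assumption.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.naive_bayes import GaussianNB


def gaussian_nb_predict(X_train, y_train, X_test, var_smoothing=0.01):
    """Hand-rolled Gaussian naive Bayes (illustrative, not the MixedNB code)."""
    classes = np.unique(y_train)
    # Smoothing rule used by scikit-learn: a fraction of the largest feature variance
    epsilon = var_smoothing * X_train.var(axis=0).max()
    log_joint = np.empty((X_test.shape[0], classes.size))
    for k, c in enumerate(classes):
        X_c = X_train[y_train == c]
        mean = X_c.mean(axis=0)
        var = X_c.var(axis=0) + epsilon
        log_prior = np.log(X_c.shape[0] / X_train.shape[0])
        # Sum of per-feature Gaussian log-densities (the "naive" independence assumption)
        log_lik = -0.5 * (np.log(2.0 * np.pi * var) + (X_test - mean) ** 2 / var).sum(axis=1)
        log_joint[:, k] = log_prior + log_lik
    return classes[log_joint.argmax(axis=1)]


X_wine, y_wine = load_wine(return_X_y=True)
manual_pred = gaussian_nb_predict(X_wine, y_wine, X_wine)
sklearn_pred = GaussianNB(var_smoothing=0.01).fit(X_wine, y_wine).predict(X_wine)
agreement = (manual_pred == sklearn_pred).mean()
print(f"agreement with GaussianNB: {agreement:.4f}")
```

If the two agree, the remaining difference between the implementations is runtime only, which is consistent with the timings above.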

## Custom Dataset (Numeric & Categorical, Small)

In [6]:
df_cat_num = pd.read_csv("../data/naive-bayes_example_cat+num.csv",sep=";")
df_cat_num.head()

| | Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 0 | Sunny | 35 | 62 | False | No |
| 1 | Sunny | 32 | 73 | True | No |
| 2 | Overcast | 30 | 68 | False | Yes |
| 3 | Rainy | 20 | 65 | False | Yes |
| 4 | Rainy | 12 | 35 | False | Yes |


In [7]:
X = df_cat_num.drop(columns=["Play"])
y = df_cat_num["Play"]

In [8]:
# Outlook and Windy are categorical (True); Temp and Humidity are numeric (False)
model = MixedNB(categorical_feature_mask=[True, False, False, True], laplace_smoothing=1, var_smoothing=1e-09)
model.fit(X, y)

In [9]:
X_test = pd.DataFrame([
    ["Overcast",15,50, False],
    ["Rainy", 19, 58, True]
])
y_pred = model.predict(X_test)
y_pred

0    Yes
1     No
dtype: object
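To make the role of the feature mask concrete, here is a minimal sketch of how a mixed naive Bayes score can be assembled: Laplace-smoothed frequencies for categorical features and Gaussian log-densities for numeric ones. It uses only the five rows shown by `df_cat_num.head()`, not the full file. The helper `mixed_log_joint`, the names `demo_mask` and `head_rows`, and the additive variance floor `eps` are assumptions for illustration and may differ from what `MixedNB` does internally.

```python
import numpy as np
import pandas as pd

# Only the five rows shown by df_cat_num.head(); the full file has more rows
head_rows = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy"],
    "Temp": [35, 32, 30, 20, 12],
    "Humidity": [62, 73, 68, 65, 35],
    "Windy": [False, True, False, False, False],
    "Play": ["No", "No", "Yes", "Yes", "Yes"],
})
features = ["Outlook", "Temp", "Humidity", "Windy"]
demo_mask = [True, False, False, True]  # True = categorical, False = numeric


def mixed_log_joint(train, row, label, alpha=1.0, eps=1e-9):
    """Log prior plus per-feature log-likelihoods for one class (hypothetical helper)."""
    in_class = train[train["Play"] == label]
    score = np.log(len(in_class) / len(train))
    for name, is_categorical in zip(features, demo_mask):
        value = row[name]
        if is_categorical:
            # Laplace-smoothed relative frequency of the value within the class
            n_categories = train[name].nunique()
            count = (in_class[name] == value).sum()
            score += np.log((count + alpha) / (len(in_class) + alpha * n_categories))
        else:
            # Gaussian log-density with a small additive variance floor (assumption)
            mean = in_class[name].mean()
            var = in_class[name].var(ddof=0) + eps
            score += -0.5 * (np.log(2.0 * np.pi * var) + (value - mean) ** 2 / var)
    return score


row = {"Outlook": "Overcast", "Temp": 15, "Humidity": 50, "Windy": False}
log_joint = {label: mixed_log_joint(head_rows, row, label) for label in ["No", "Yes"]}
prediction = max(log_joint, key=log_joint.get)
print(log_joint, prediction)
```

Working in log space avoids underflow when many small probabilities are multiplied. The predicted class is the one with the largest log-joint score.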

## Custom Dataset (Categorical, Small)

In [10]:
df_cat = pd.read_csv("../data/naive-bayes_example_cat.csv",sep=";")
df_cat.head()

| | Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | False | No |
| 1 | Sunny | Hot | High | True | No |
| 2 | Overcast | Hot | High | False | Yes |
| 3 | Rainy | Mild | High | False | Yes |
| 4 | Rainy | Cool | Normal | False | Yes |


In [11]:
X = df_cat.drop(columns=["Play"])
y = df_cat["Play"]

In [12]:
# All four features are categorical
model = MixedNB(categorical_feature_mask=[True, True, True, True], laplace_smoothing=1, var_smoothing=1)
model.fit(X, y)

In [13]:
X_test = pd.DataFrame([
    ["Overcast","Mild","Normal",False],
    ["Rainy", "Hot", "High", False]
])
y_pred = model.predict(X_test)
y_pred

0    Yes
1    Yes
dtype: object
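The `laplace_smoothing` parameter matters for purely categorical data because a category that never occurs together with a class would otherwise receive probability zero and veto that class outright. The following sketch works through the arithmetic for `Outlook`, using only the five rows shown by `df_cat.head()`. The helper name `smoothed_prob` is hypothetical, and the formula is the standard additive smoothing rule, assumed (not confirmed) to match `MixedNB`.

```python
from collections import Counter

# Outlook values per class, taken from the five rows shown by df_cat.head()
outlook_by_class = {
    "No": ["Sunny", "Sunny"],
    "Yes": ["Overcast", "Rainy", "Rainy"],
}
categories = ["Sunny", "Overcast", "Rainy"]


def smoothed_prob(values, value, alpha, n_categories):
    """P(value | class) with additive (Laplace) smoothing."""
    return (Counter(values)[value] + alpha) / (len(values) + alpha * n_categories)


# "Overcast" never occurs with class "No" in these rows
p_unsmoothed = smoothed_prob(outlook_by_class["No"], "Overcast", 0, len(categories))
p_smoothed = smoothed_prob(outlook_by_class["No"], "Overcast", 1, len(categories))
print(p_unsmoothed, p_smoothed)  # 0.0 0.2
```

With `alpha=1` the estimate becomes (0 + 1) / (2 + 3) = 0.2 instead of 0, so the remaining features can still influence the prediction.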

## Wine Dataset (Numeric, Multiclass)

In [14]:
data_wine = load_wine()
X = data_wine.data
y = data_wine.target

In [15]:
# All wine features are numeric
mask = [False] * X.shape[1]
model = MixedNB(categorical_feature_mask=mask, laplace_smoothing=1, var_smoothing=1)
scores = cross_validate(model, X, y)
print(f"avg. accuracy:   {scores['test_score'].mean(): 2.4f}")
print(f"avg. fit time:   {scores['fit_time'].mean(): 2.4f} s")
print(f"avg. score time: {scores['score_time'].mean(): 2.4f} s")

avg. accuracy:    0.6463
avg. fit time:    0.0043 s
avg. score time:  0.0164 s
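An average accuracy of 0.6463 is low for this dataset. One plausible cause is the large `var_smoothing=1`. If the smoothing term is scaled by the largest feature variance, as it is in sklearn's `GaussianNB`, it swamps the variances of all smaller-scale features. Whether `MixedNB` scales it the same way is an assumption. The sketch below uses `GaussianNB` as a stand-in to compare the two settings.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X_wine, y_wine = load_wine(return_X_y=True)

acc_by_smoothing = {}
for vs in [1.0, 1e-9]:
    # GaussianNB is a stand-in here; MixedNB may behave differently
    acc = cross_val_score(GaussianNB(var_smoothing=vs), X_wine, y_wine).mean()
    acc_by_smoothing[vs] = acc
    print(f"var_smoothing={vs:g}: avg. accuracy {acc:.4f}")
```

If the same pattern holds for `MixedNB`, a much smaller `var_smoothing` would be the more meaningful setting for this test case.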


## Cross Validation with Pipeline

In [16]:
df_cat_num = pd.read_csv("../data/naive-bayes_example_cat+num.csv",sep=";")
df_cat_num.head()

| | Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 0 | Sunny | 35 | 62 | False | No |
| 1 | Sunny | 32 | 73 | True | No |
| 2 | Overcast | 30 | 68 | False | Yes |
| 3 | Rainy | 20 | 65 | False | Yes |
| 4 | Rainy | 12 | 35 | False | Yes |


In [17]:
X = df_cat_num.drop(columns=["Play"])
y = df_cat_num["Play"]

In [18]:
pipe = Pipeline([
    # var_smoothing=1 is only a placeholder; the grid below overrides it
    ('m', MixedNB(categorical_feature_mask=[True, False, False, True], var_smoothing=1))
])
parameters = {
    "m__laplace_smoothing": [0.5, 1, 1.5],
    "m__var_smoothing": [10e-9, 10e-6, 10e-3]  # i.e. 1e-8, 1e-5, 1e-2
}
kf = KFold(n_splits=3)
grid = GridSearchCV(pipe, parameters, cv=kf)
grid.fit(X, y)

In [19]:
grid.best_score_, grid.best_params_

(0.7166666666666667, {'m__laplace_smoothing': 0.5, 'm__var_smoothing': 1e-08})
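The `cv_results_` attribute of a fitted grid search is a dictionary of arrays, which is hard to read in raw form. One option is to load it into a DataFrame and keep only the columns of interest. The sketch below does this with `GaussianNB` on the wine data as a stand-in for `MixedNB`. The names `demo_pipe`, `demo_grid`, and `summary` are hypothetical. The folds are shuffled because the wine samples are ordered by class.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

X_demo, y_demo = load_wine(return_X_y=True)
demo_pipe = Pipeline([("m", GaussianNB())])  # stand-in for MixedNB
demo_grid = GridSearchCV(
    demo_pipe,
    {"m__var_smoothing": [1e-9, 1e-6, 1e-3]},
    # Shuffle with a fixed seed so every fold contains all classes reproducibly
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_grid.cv_results_)
columns = ["param_m__var_smoothing", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

The same two lines (`pd.DataFrame(...)` plus a column selection) should apply unchanged to the `grid` object fitted above.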

In [20]:
grid.cv_results_

{'mean_fit_time': array([0.01324368, 0.01218168, 0.01308354, 0.01118882, 0.01019375,
        0.01051799, 0.01035921, 0.01051776, 0.01453479]),
 'std_fit_time': array([1.77418177e-03, 1.69584060e-03, 2.94762088e-03, 4.74816744e-04,
        4.78720535e-04, 2.97360213e-07, 2.28551813e-04, 7.37000982e-07,
        6.40081922e-03]),
 'mean_score_time': array([0.00400853, 0.00467881, 0.00350269, 0.00350285, 0.0035096 ,
        0.00417336, 0.00350269, 0.00334231, 0.00450269]),
 'std_score_time': array([0.00121864, 0.00062723, 0.00040414, 0.00040853, 0.00040388,
        0.00024473, 0.00040862, 0.00062193, 0.00040843]),
 'param_m__laplace_smoothing': masked_array(data=[0.5, 0.5, 0.5, 1, 1, 1, 1.5, 1.5, 1.5],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'param_m__var_smoothing': masked_array(data=[1e-08, 1e-05, 0.01, 1e-08, 1e-05, 0.01, 1e-08, 1e-05,
                    0.01],
         