# COMP47750 Machine Learning Assignment
# Gaussian Naive Bayes
A reimplementation of the `sklearn` Gaussian Naive Bayes classifier. 

1. Provide a Python class `MyGaussianNB` that implements Gaussian Naive Bayes. 
The API specification for sklearn classifiers is here: https://scikit-learn.org/stable/developers/develop.html 
You should implement the `fit` and `predict` methods; there is no need to implement `predict_proba`. 


In [3]:
import numpy as np
import pandas as pd
import math
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.base import BaseEstimator, ClassifierMixin
from collections import Counter
from sklearn.metrics import accuracy_score

## My GaussianNB
A reimplementation of the Gaussian Naive Bayes classifier.
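
For reference, the decision rule that `predict_single` below appears to implement can be written as follows. The notation is an assumption made here for illustration: $P(c)$ is the class prior stored in `priors`, and $\mu_{c,i}$ and $\sigma^2_{c,i}$ are the per-class, per-feature mean and variance stored in `mus` and `sig_sqs`.

$$
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2_{c,i}}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma^2_{c,i}}\right)
$$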

In [4]:
class MyGaussianNB(BaseEstimator, ClassifierMixin):          
    def fit(self, Xt, yt):
        self.var_smoothing = 1e-9   # zero variance will cause division by zero errors.
        self.Xt = Xt
        self.yt = yt
        self.n_feat = Xt.shape[1]
        self.mus = {}
        self.sig_sqs = {}
        self.priors = {}
        
        c_dict = Counter(self.yt)
        
        for c in c_dict.keys():
            self.mus[c] = np.zeros(self.n_feat) # where the means will be stored
            self.sig_sqs[c] = np.zeros(self.n_feat) # where the variances will be stored
            self.priors[c] = c_dict[c]/Xt.shape[0]
            
            mask = self.yt == c
            X_tr_c = self.Xt[mask, :] # the rows for this class label
            
            for f in range(self.n_feat):
                self.mus[c][f] = np.mean(X_tr_c[:,f])
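                # Add the smoothing term to the variance itself; adding it to the
                # data before calling np.var would leave the variance unchanged.
                self.sig_sqs[c][f] = np.var(X_tr_c[:,f]) + self.var_smoothing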
        #print(self.mus)
        #print(self.sig_sqs)
        
        return self
    
    # Predict, for each test sample, the class with the largest posterior score.
    def predict(self, Xtes):
        #print("Predicting MGNB")
        self.Xtes = Xtes
         
        res_list = []
        for sample in Xtes:
            res_list.append(self.predict_single(sample))
            
        return np.array(res_list)
    
    def predict_single(self, x_single):
        probs = {}
        for c in self.priors.keys():   # for each of the class labels
            probs[c] = self.priors[c]
            for i, f in enumerate(x_single):
                t1 = 1/math.sqrt(2*math.pi*self.sig_sqs[c][i])
                num = (f - self.mus[c][i])**2
                den = 2*self.sig_sqs[c][i]
                pxi_y = t1 * math.exp(-num/den)
                probs[c] = probs[c] * pxi_y
                #print(t1, num, den, pxi_y)
                #print(probs)
            #print(c, self.priors[c])
        return max(probs, key=probs.get) # Return the key with the largest value
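
The product of per-feature densities used in `predict_single` can underflow to zero when there are many features, which turns the final `max` into a tie. A common remedy is to sum log densities instead. The sketch below illustrates the idea; the function name `log_posterior_scores` and the toy parameters are assumptions made for this example and are not part of the class above.

```python
import numpy as np

def log_posterior_scores(x, priors, mus, sig_sqs):
    """Return {class: log prior + sum of log Gaussian densities} for one sample."""
    scores = {}
    for c in priors:
        # Log of the Gaussian density, evaluated for every feature at once.
        log_lik = (-0.5 * np.log(2.0 * np.pi * sig_sqs[c])
                   - (x - mus[c]) ** 2 / (2.0 * sig_sqs[c]))
        scores[c] = np.log(priors[c]) + log_lik.sum()
    return scores

# Toy parameters (made up for illustration): two classes, 2000 features.
n_feat = 2000
priors = {"a": 0.5, "b": 0.5}
mus = {"a": np.zeros(n_feat), "b": np.ones(n_feat)}
sig_sqs = {"a": np.ones(n_feat), "b": np.ones(n_feat)}
x = np.full(n_feat, 0.9)  # every feature is closer to the mean of class "b"

# The direct product of densities underflows to 0.0 for this many features.
dens_b = (np.exp(-(x - mus["b"]) ** 2 / (2.0 * sig_sqs["b"]))
          / np.sqrt(2.0 * np.pi * sig_sqs["b"]))
naive_b = priors["b"] * np.prod(dens_b)

# The log-space scores stay finite and can still be ranked.
scores = log_posterior_scores(x, priors, mus, sig_sqs)
predicted = max(scores, key=scores.get)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product, so the prediction is unchanged whenever the product does not underflow.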
    

## Testing
2. Test the performance of your implementation against the `GaussianNB` implementation in `scikit-learn`. You should use a range of datasets for this testing.   
Four datasets are used for testing, each evaluated on a hold-out set:
 - **penguins**: check that mean and variance estimates are the same, check that predictions are the same. 
 - **diabetes**: check that predictions are the same.
 - **glassV2**: test that predictions are the same. 
 - **bike_sharing**: test that predictions are the same. 

The main component of the testing is to check the fidelity of this implementation to the Gaussian Naive Bayes implementation in `scikit-learn`.  
The `fidelity_tests` function compares the two sets of predictions across multiple runs (different hold-out splits). It uses `accuracy_score` to do the comparison.
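
When `accuracy_score` is given two arrays of predicted labels rather than predictions and ground truth, the result is the fraction of samples on which the two classifiers agree. The sketch below shows the equivalent computation in plain NumPy; the helper name `agreement` and the label arrays are made up for illustration.

```python
import numpy as np

def agreement(pred_a, pred_b):
    """Fraction of samples on which two prediction arrays agree."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    return float(np.mean(pred_a == pred_b))

# Hypothetical predictions from two classifiers on four test samples.
sk_pred = np.array(["Adelie", "Gentoo", "Gentoo", "Chinstrap"])
my_pred = np.array(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

score = agreement(sk_pred, my_pred)  # 3 of the 4 labels match
```

A value of 1.0 means every prediction matches, regardless of whether those predictions are correct.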

In [5]:
def fidelity_tests(X, y, nreps=10):
    for rs in range(1, nreps + 1):
        # Split the X passed in (not the global X_raw), with a new seed per run.
        X_tr_raw, X_ts_raw, y_train, y_test = train_test_split(
            X, y, random_state=rs, test_size=1/2)
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_tr_raw)
        X_test = scaler.transform(X_ts_raw)
        gnb = GaussianNB()
        mgnb = MyGaussianNB()
        mgnb.fit(X_train,y_train)
        gnb.fit(X_train,y_train)
        ascore = accuracy_score(gnb.predict(X_test),mgnb.predict(X_test)) 
        gacc = accuracy_score(gnb.predict(X_test),y_test)
        macc = accuracy_score(mgnb.predict(X_test),y_test)
        print ("Run: %d Score: %.2f SK acc: %.2f My acc: %.2f" % (rs, ascore, gacc, macc))

### Penguins

In [37]:
penguins = pd.read_csv('penguins_af.csv', index_col = 0)
print(penguins.shape)
penguins.head()

(333, 8)


|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |


In [38]:
y = penguins.pop('species')
# penguins = pd.get_dummies(penguins)

In [39]:
penguins.values[0]

array(['Torgersen', 39.1, 18.7, 181.0, 3750.0, 'male', 2007], dtype=object)

In [31]:

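# One-hot encode the categorical columns (island, sex) so that StandardScaler
# receives numeric data; this gives the 10 columns shown in the output below.
X_raw = pd.get_dummies(penguins).values.astype(float)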
X_tr_raw, X_ts_raw, y_train, y_test = train_test_split(X_raw, y, random_state=2, test_size=1/2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_tr_raw)
X_test = scaler.transform(X_ts_raw)
max_k = X_train.shape[1]
X_train.shape, X_test.shape

((166, 10), (167, 10))

In [32]:
X_train[0]

array([ 0.4451256 ,  0.67719924, -0.72189828, -0.8968907 , -1.36135792,
       -1.        ,  1.26243812, -0.35951593,  1.04941213, -1.04941213])

**Model Parameters**  
The per-class variances and means are the same as those of `GaussianNB`:

In [None]:
gnb = GaussianNB()
gnb.fit(X_train,y_train)
mgnb = MyGaussianNB()
mgnb.fit(X_train,y_train)

In [None]:
gnb.var_

In [None]:
mgnb.sig_sqs

In [None]:
gnb.theta_

In [None]:
mgnb.mus
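
Comparing the two sets of parameters by eye is error-prone, since `GaussianNB` stores them as arrays with one row per class while `MyGaussianNB` stores them in dictionaries keyed by class label. The sketch below shows one way to compare them numerically; the helper `stack_params` and the toy values are assumptions made for illustration.

```python
import numpy as np

def stack_params(params, classes):
    """Stack a {class: 1-D array} dict into an (n_classes, n_features) array."""
    return np.vstack([params[c] for c in classes])

# Toy stand-ins: `classes` and `theta` play the roles of gnb.classes_ and gnb.theta_.
classes = np.array(["Adelie", "Gentoo"])
theta = np.array([[0.1, -0.2],
                  [1.5, 0.3]])
# Dictionary in the style of mgnb.mus, deliberately in a different key order.
mus = {"Gentoo": np.array([1.5, 0.3]), "Adelie": np.array([0.1, -0.2])}

means_match = np.allclose(stack_params(mus, classes), theta)
```

With the fitted models, the same check would read `np.allclose(stack_params(mgnb.mus, gnb.classes_), gnb.theta_)`, and likewise for `mgnb.sig_sqs` against `gnb.var_`.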

Accuracy scores are the same:

In [None]:
gnb.score(X_test, y_test)

In [None]:
mgnb.score(X_test, y_test)

### Fidelity tests

Look at the predicted labels for the first 10 test samples:

In [None]:
mgnb.predict(X_test[:10])

In [None]:
gnb.predict(X_test[:10])

Run multiple tests

In [None]:
fidelity_tests(X_raw, y)

Finally, we use `accuracy_score` to compare all predictions on the test set.   
A score of 1.0 indicates perfect agreement.

### Diabetes dataset  
Test that the predictions are the same on a holdout set. 

In [None]:
diabetes = pd.read_csv('diabetes.csv') #, index_col = 0)
print(diabetes.shape)
diabetes.head()

In [None]:
y = diabetes.pop('neg_pos').values
X_raw = diabetes.values

In [None]:
fidelity_tests(X_raw, y, nreps = 5)

### Glass Dataset
Test that the predictions are the same on a holdout set. 

In [None]:
glass = pd.read_csv('glassV2.csv') #, index_col = 0)
print(glass.shape)
glass.head()

In [None]:
glass_orig = pd.read_csv('glassV2.csv')

In [None]:
glass_orig['Type'].value_counts()

In [None]:
glass = glass_orig[glass_orig['Type'].isin([1,2])]

In [None]:
y = glass.pop('Type')
X_raw = glass.values

In [None]:
fidelity_tests(X_raw, y, nreps = 10)

One mismatch here. 
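
To see which test samples the two classifiers disagree on, the prediction arrays can be compared element-wise. This is a minimal sketch; the helper name `disagreement_indices` and the label arrays are hypothetical and do not come from the glass data.

```python
import numpy as np

def disagreement_indices(pred_a, pred_b):
    """Return the positions at which two prediction arrays differ."""
    return np.flatnonzero(np.asarray(pred_a) != np.asarray(pred_b))

# Hypothetical predictions for five test samples with class labels 1 and 2.
sk_pred = np.array([1, 2, 2, 1, 1])
my_pred = np.array([1, 2, 1, 1, 1])

idx = disagreement_indices(sk_pred, my_pred)
```

The returned indices can then be used to inspect the corresponding rows of `X_test`, for example to check whether the two class scores are nearly tied for those samples.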

### Bike Sharing

In [None]:
bikes_df = pd.read_csv('bike_sharing.csv')
bikes_df.head()
bikes_df['usage'] = 'Low'
bikes_df.loc[bikes_df['count'] > 4500, 'usage'] = 'High'

In [None]:
bikes_df['usage'].value_counts()

In [None]:
y = bikes_df.pop('usage').values
# Drop the identifier, date and count-related columns before building X.
bikes_df = bikes_df.drop(columns=['casual', 'registered', 'instant', 'dteday', 'count'])
X_raw = bikes_df.values
X_raw.shape, y.shape

In [None]:
fidelity_tests(X_raw, y, nreps = 10)