To be done in groups of 2 students.

You have two choices:

1) Choose a data set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php) and a Machine Learning algorithm. Implement the algorithm and train it on your chosen data set to achieve the best performance on unseen data.
2) Choose a task appropriate for machine learning and an algorithm to learn it. Implement the algorithm and train it for the task to achieve the best performance.

We chose the first option. We chose the data set "Spambase Data Set" from the UCI Machine Learning Repository and the algorithm "Naive Bayes". The data set can be found here: https://archive.ics.uci.edu/dataset/94/spambase

1. Import libraries

In [2]:
import numpy as np
import pandas as pd

2. Load data

In [3]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 



3. Preprocessing

In [4]:
# convert to numpy arrays
X = np.array(X)
y = np.array(y)

# reshape y to 1d array
y = y.ravel()


4. Implement Naive Bayes
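
For reference, the decision rule that a Gaussian Naive Bayes classifier applies can be written as below. The notation is an assumption introduced here, not taken from the notebook: d is the number of features, c ranges over the classes, and mu_{jc} and sigma_{jc} are the mean and standard deviation of feature j within class c.

```latex
\hat{y} = \arg\max_{c} \; P(y = c) \prod_{j=1}^{d}
  \frac{1}{\sqrt{2\pi}\,\sigma_{jc}}
  \exp\!\left( -\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2} \right)
```

The prior P(y = c) corresponds to calculate_prior, the per-class statistics to precompute_stats, and each factor of the product to calculate_likelihood_gaussian_optimized.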

In [5]:
def calculate_prior(df, Y):
    classes = sorted(list(df[Y].unique()))
    prior = []
    for i in classes:
        prior.append(len(df[df[Y]==i])/len(df))
    return prior

def precompute_stats(df, features, Y):
    stats = {}
    for feat in features:
        groups = df.groupby(Y)[feat]
        stats[feat] = groups.agg(['mean', 'std'])
    return stats

def calculate_likelihood_gaussian_optimized(feat_val, mean, std):
    p_x_given_y = (1 / (np.sqrt(2 * np.pi) * std)) *  np.exp(-((feat_val-mean)**2 / (2 * std**2 )))
    return p_x_given_y

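def naive_bayes_gaussian(df, X, Y):
    # df: training data with the label column Y last; X: rows to classify
    features = df.columns[:-1]
    # sort the labels so they line up with the sorted priors from calculate_prior
    labels = sorted(df[Y].unique())
    prior = calculate_prior(df, Y)
    stats = precompute_stats(df, features, Y)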

    Y_pred = []
    for x in X:
        likelihood = np.ones(len(labels))
        for feat, val in zip(features, x):
            for i, label in enumerate(labels):
                mean, std = stats[feat].loc[label]
                likelihood[i] *= calculate_likelihood_gaussian_optimized(val, mean, std)
        post_prob = likelihood * prior
        Y_pred.append(labels[np.argmax(post_prob)])

    return np.array(Y_pred)
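
Multiplying many small densities, as naive_bayes_gaussian does, can underflow to 0.0 in floating point, at which point argmax can no longer tell the classes apart. A common remedy is to add log densities instead. The following is a minimal sketch using only the standard library; the toy numbers (1000 features, the priors, the class means) and the helper names are hypothetical and chosen only to make the underflow visible, they are not Spambase values.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian density, same formula as the notebook's likelihood function."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2)) / (math.sqrt(2.0 * math.pi) * std)

def gaussian_log_pdf(x, mean, std):
    """Logarithm of the Gaussian density, computed without exponentiating."""
    return -math.log(math.sqrt(2.0 * math.pi) * std) - ((x - mean) ** 2) / (2.0 * std ** 2)

# Hypothetical toy setup: 1000 features, all equal to 1.5, unit std everywhere.
n_features = 1000
x = [1.5] * n_features
priors = [0.6, 0.4]        # class 0, class 1
class_means = [0.0, 2.0]   # every feature shares its class mean
std = 1.0

# Product form: multiply densities, as in naive_bayes_gaussian.
products = []
for prior, mean in zip(priors, class_means):
    p = prior
    for value in x:
        p *= gaussian_pdf(value, mean, std)
    products.append(p)

# Log form: add log densities to the log prior.
log_scores = []
for prior, mean in zip(priors, class_means):
    s = math.log(prior)
    for value in x:
        s += gaussian_log_pdf(value, mean, std)
    log_scores.append(s)

predicted_from_logs = log_scores.index(max(log_scores))
print(products)             # both products underflow to 0.0
print(predicted_from_logs)  # 1, the class whose mean is closer to 1.5
```

With the product form both classes score 0.0, so argmax would fall back to class 0; the log scores stay finite and pick class 1. Whether the 57 Spambase features actually trigger underflow depends on the data, so this is a precaution rather than a confirmed problem in the notebook.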


5. Train and test the model
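
The split in this section is meant to leave spam and non-spam in both sets. One way to guarantee that is the stratify argument of train_test_split, which keeps the class ratio the same in train and test. The sketch below uses hypothetical toy labels (60 non-spam, 40 spam) rather than Spambase, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60 non-spam (0) and 40 spam (1).
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_spam_ratio = y_tr.mean()
test_spam_ratio = y_te.mean()
print(train_spam_ratio, test_spam_ratio)
```

Applying the same argument to the notebook's split would change which rows land in the test set, so the reported numbers would no longer match exactly.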

In [8]:
# split data into train and test sets (random 80/20 split; passing
# stratify=data['spam'] would keep the spam ratio equal in both sets)
from sklearn.model_selection import train_test_split
# transform spambase to pandas dataframe
data = pd.DataFrame(X)
data['spam'] = y
train, test = train_test_split(data, test_size=0.2, random_state=42)

# separate features and labels for the train and test sets
X_test = test.iloc[:,:-1].values
y_test = test.iloc[:,-1].values
X_train = train.iloc[:,:-1].values
y_train = train.iloc[:,-1].values

predictions = naive_bayes_gaussian(train, X_test, 'spam')

# evaluate model
# other metrics: precision, recall, f1-score, etc.
print("Naive Bayes classification report")
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))


# compare with sklearn's implementation
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictions = gnb.predict(X_test)
print("------------------------------------------")
print("Sklearn's Naive Bayes classification report")
print(classification_report(y_test, predictions))



Naive Bayes classification report
              precision    recall  f1-score   support

           0       0.94      0.74      0.83       531
           1       0.72      0.94      0.82       390

    accuracy                           0.82       921
   macro avg       0.83      0.84      0.82       921
weighted avg       0.85      0.82      0.82       921

------------------------------------------
Sklearn's Naive Bayes classification report
              precision    recall  f1-score   support

           0       0.95      0.73      0.82       531
           1       0.72      0.95      0.82       390

    accuracy                           0.82       921
   macro avg       0.83      0.84      0.82       921
weighted avg       0.85      0.82      0.82       921
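
As a sanity check on how the columns of the report relate to each other, the f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values of the first report above; because the inputs are already rounded to two decimals, agreement is only expected after rounding. The function name is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the first report (rounded values).
report = {0: (0.94, 0.74), 1: (0.72, 0.94)}

f1 = {label: f1_from(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(f1.values()) / len(f1)

for label, value in f1.items():
    print(label, round(value, 2))
print("macro avg", round(macro_f1, 2))
```

The recomputed values round to 0.83 and 0.82 for the two classes and 0.82 for the macro average, matching the f1-score column of the report.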

