<h1 style="text-align: center">Naive Bayes algorithm</h1>

<p style="text-align: center">Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with <strong>strong (naive) independence assumptions between the features</strong>. It is easy to build and particularly useful for very large data sets. Along with other linear classifiers, Naive Bayes is a good baseline classifier to which more sophisticated methods can be compared.
<br>
<br>
The Naive Bayes classifier is highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. In particular, the parameters of the model are estimated using a "closed-form" expression, which takes linear time, rather than iterative approximation as used for many other types of classifiers. This makes it particularly useful for very large data sets. The model is also relatively robust, requiring only a small number of training data to estimate the necessary parameters.
</p>

In [155]:
import numpy as np

class NaiveBayes:
    def __init__(self, stop_words=[]):
        self.labels = []
        self.likelihoods = {}
        self.priors = {}
        self.stop_words = []

    def _preprocess_text_(self, text):
        """
        Preprocess a text by removing punctuation, numbers, and stop words and returning a list of words

        Args:
            text (float): text to preprocess

        Returns: list: List of preprocessed words
        """ 
        text = str(text)
        text = text.encode('ascii', errors='replace').decode('ascii')
        text = text.lower()
        text = ''.join([c for c in text if c not in '0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'])
        text = ' '.join([word for word in text.split() if word not in self.stop_words])

        return text.split()
    
    def _likelihoods_ (self, X, y):
        '''
        Calculate the likelihoods of each word in each class
        
        Args:
            X (list): List of texts
            y (list): List of labels
        
        Returns: dict: Dictionary of likehoods for each class'''
        # Initialize the dictionaries for the words and their counts and the likelihoods
        dicts = {c: {} for c in self.labels}
        likelihoods = {c: {} for c in self.labels}

        # Fill the dictionaries with the words and their counts
        for i in range(len(X)):
            words = self._preprocess_text_(X[i])
            for word in words:
                if word in dicts[y[i]]:
                    dicts[y[i]][word] += 1
                else:
                    dicts[y[i]][word] = 1
        
        # Calculate the likelihoods of each word in each class
        for c in self.labels:
            for word in dicts[c]:
                likelihoods[c][word] = dicts[c][word] / sum(dicts[c].values())

        return likelihoods

    def fit(self, X, y):
        self.labels = np.unique(y)

        self.priors = {c: np.sum(c == y) /len(y) for c in self.labels}

        self.likelihoods = self._likelihoods_(X, y)

    def predict(self, text):
        # Preprocess the text
        words = self._preprocess_text_(text)
        # Initialize the scores for each class
        scores = {c: self.priors[c] for c in self.labels}
        
        for c in self.labels:
            for word in words:
                # If the word is in the likelihoods for the class
                if word in self.likelihoods[c]:
                    # Multiply the class score by the likelihoods of the word in the class
                    scores[c] = scores[c] * self.likelihoods[c][word]
                # If the word is NOT in the likelihood for the class
                else:
                    # Set the class score to 0
                    scores[c] = 0

        # Return the class with the highest score
        return max(scores, key=scores.get)
    
    def evaluate(self, X, y):
        # To evaluate the model, we use accuracy metric (correct predictions / total predictions)
        correct = 0
        for i in range(len(X)):
            if self.predict(X[i]) == y[i]:
                correct += 1
        return correct / len(y)

    def print(self):
        for c in self.labels:
            print(f'Class: {c}')
            print(f'Prior: {self.priors[c]}')
            print(f'Likelihoods: {np.array(self.likelihoods[c])}')

In [156]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('../../data/NaiveBayes/SMSSpamCollection.csv', sep='\t', names=['label', 'message'])
stop_words = pd.read_csv('../../data/NaiveBayes/stop_words_EN.txt', delimiter='t', on_bad_lines='skip')
stop_words = np.array(stop_words)

X = np.array(data.message)
y = np.array(data.label)

In [157]:
n_tests = 50 # Number of tests to run
accuracies = [] # List to store the accuracies

model = NaiveBayes(stop_words=stop_words)

for i in range(n_tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    model.fit(X_train, y_train)
    accuracies.append(model.evaluate(X_test, y_test))

print(f'Average accuracies: {np.mean(accuracies)}') # Average score is the mean of the scores => formula: sum(x) / n
print(f'Standard deviation: {np.std(accuracies)}') # Standard deviation is a measure of how spread out numbers are => formula: sqrt(sum((x - mean(x))**2) / n)
print(f'Minimum accuracies: {np.min(accuracies)}')
print(f'Maximum accuracies: {np.max(accuracies)}')

Average accuracies: 0.9294708520179372
Standard deviation: 0.007950782665996496
Minimum accuracies: 0.9139013452914798
Maximum accuracies: 0.9461883408071748


# Acknowledgements
This [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) is public available at the [UCI Machine Learning Repository](https://archive-beta.ics.uci.edu/dataset/228/sms+spam+collection) for research. 

Tiago A. Almeida and José María Gómez Hidalgo\
Department of Computer Science\
Federal University of Sao Carlos (UFSCar)
