# Naive Bayes algorithm

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features. It is easy to build and particularly useful for very large data sets. Along with other linear classifiers, Naive Bayes is a good baseline classifier to which more sophisticated methods can be compared.

The Naive Bayes classifier is highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. In particular, the parameters of the model are estimated using a "closed-form" expression, which takes linear time, rather than iterative approximation as used for many other types of classifiers. This makes it particularly useful for very large data sets. The model is also relatively robust, requiring only a small number of training data to estimate the necessary parameters.

To classify a new object we'll use the following Naive Bayes formula:


$P(C_k|x_1, x_2, ..., x_n) = \frac{P(x_1, x_2, ..., x_n|C_k)P(C_k)}{P(x_1, x_2, ..., x_n)}$

where:

- $P(C_k|x_1, x_2, ..., x_n): \text{posterior probability - probability of class given features}$
- $C_k: \text{class}$
- $x_i: \text{feature}$
- $P(x_i|C_k): \text{likelihood - probability of feature given class}$
- $P(C_k): \text{prior probability - probability of class}$
- $P(x_i): \text{marginal probability - probability of feature (it can be ignored)}$

<img src="../../../assets/naivebayes_example.png" style="display: block; margin-left: auto; margin-right: auto">

## Example
We have 2 classes: spam and not spam (ham) and 2 features: contains word "free" and contains word "money", we want to classify new email, which class it belongs to (spam or ham) and what is the probability of this class.

|      | Spam | Ham  | Total|
|------|------|------| -----|
| Free | 2    | 20   | 22   |
| Money| 5    | 10   | 15   |
| Total| 10   | 90   | 100  |

1) $Priors:\;P(Spam) = \frac{10}{100} = 0.1,\;P(Ham) = \frac{90}{100} = 0.9$
2) $Likelihood-money:\;P(Free|Spam) = \frac{2}{10} = 0.2,\;P(Free|Ham) = \frac{20}{90} = 0.22$
3) $Likelihood-free:\;P(Money|Spam) = \frac{5}{10} = 0.5,\;P(Money|Ham) = \frac{10}{90} = 0.11$
4) $Posterior:\;P(Spam|Free, Money) = P(Free, Money|Spam)P(Spam) = 0.2 \cdot 0.5 \cdot 0.1 = 0.01$
5) $Posterior:\;P(Ham|Free, Money) = P(Free, Money|Ham)P(Ham) = 0.22 \cdot 0.11 \cdot 0.9 = 0.02$
6) $P(Ham|Free, Money) > P(Spam|Free, Money)$ (so we classify new email as Ham with probability 0.02)

## Implementation

In [4]:
import numpy as np

class NaiveBayes:
    '''
    Naive Bayes classifier for text classification
    
    Attributes:
        labels (list): List of unique labels
        likelihoods (dict): Dictionary of likelihoods for each class
        priors (dict): Dictionary of priors for each class
        stop_words (list): List of stop words to remove from the text
    
    Methods:
        _preprocess_text(text: str) -> list: Preprocess a text by removing punctuation, numbers, and stop words and returning a list of words
        _likelihoods(X: np.ndarray, y: np.ndarray) -> dict: Calculate the likelihoods of each word in each class
        fit(X: np.ndarray, y: np.ndarray) -> None: Fit the model to the data
        predict(text: str) -> str: Predict the class of a text
        evaluate(X: list, y: list) -> float: Evaluate the model on a list of texts and labels
        show_word_probabilities(word: str) -> None: Print the likelihoods of a word in each class
    '''
    def __init__(self, stop_words=[]):
        self.labels = []
        self.likelihoods = {}
        self.priors = {}
        self.stop_words = stop_words

    def _preprocess_text(self, text: str) -> list:
        """
        Preprocess a text by removing punctuation, numbers, and stop words and returning a list of words

        Args:
            text (float): text to preprocess

        Returns: list: List of preprocessed words
        """ 
        text = str(text)
        text = text.encode('ascii', errors='replace').decode('ascii')
        text = text.lower()
        text = ''.join([c for c in text if c not in '0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'])
        text = ' '.join([word for word in text.split() if word not in self.stop_words])

        return text.split()
    
    def _likelihoods(self, X: np.ndarray, y: np.ndarray) -> dict:
        '''
        Calculate the likelihoods of each word in each class
        
        Args:
            X (np.ndarray): List of texts
            y (np.ndarray): List of labels
        
        Returns: dict: Dictionary of likehoods for each class'''
        # Initialize the dictionaries for the words and their counts and the likelihoods
        dicts = {c: {} for c in self.labels}
        likelihoods = {c: {} for c in self.labels}

        # Fill the dictionaries with the words and their counts
        for i in range(len(X)):
            words = self._preprocess_text_(X[i])
            for word in words:
                if word in dicts[y[i]]:
                    dicts[y[i]][word] += 1
                else:
                    dicts[y[i]][word] = 1
        
        # Calculate the likelihoods of each word in each class
        for c in self.labels:
            for word in dicts[c]:
                likelihoods[c][word] = dicts[c][word] / sum(dicts[c].values())

        return likelihoods

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        '''
        Fit the model to the data

        Args:
            X (np.ndarray): List of texts
            y (np.ndarray): List of labels
        '''
        # Get the unique labels
        self.labels = np.unique(y)

        # Calculate the priors and likelihoods
        self.priors = {c: np.sum(c == y) /len(y) for c in self.labels}
        self.likelihoods = self._likelihoods_(X, y)


    def predict(self, text: str) -> str:
        '''
        Predict the class of a text

        Args:
            text (str): Text to predict the class of

        Returns: str: Predicted class
        '''
        # Preprocess the text
        words = self._preprocess_text_(text)
        # Initialize the scores for each class
        scores = {c: self.priors[c] for c in self.labels}
        
        for c in self.labels:
            for word in words:
                # If the word is in the likelihoods for the class
                if word in self.likelihoods[c]:
                    # Multiply the class score by the likelihoods of the word in the class
                    scores[c] = scores[c] * self.likelihoods[c][word]
                # If the word is NOT in the likelihood for the class
                else:
                    # Set the class score to 0
                    scores[c] = 0

        # Return the class with the highest score
        return max(scores, key=scores.get)
    
    def evaluate(self, X: list, y: list) -> float:
        '''
        Evaluate the model on a list of texts and labels
        
        Args:
            X (list): List of texts
            y (list): List of labels
        
        Returns: float: Accuracy of the model
        '''
        correct = 0
        for i in range(len(X)):
            if self.predict(X[i]) == y[i]:
                correct += 1
        return correct / len(y)
    
    def show_word_probabilities(self, word: str) -> None:
        '''
        Print the likelihoods of a word in each class

        Args:
            word (str): Word to print the likelihoods of
        '''
        probabilities = {}

        print(f'Word: {word}')
        for c in self.labels:
            print(f'P({word}|{c}) = {self.likelihoods[c][word]} * {self.priors[c]} / {sum(self.priors.values())} = {self.likelihoods[c][word] * self.priors[c] / sum(self.priors.values())}')
            probabilities[c] = self.likelihoods[c][word] * self.priors[c] / sum(self.priors.values())

        print(f'Prediction: {max(probabilities, key=probabilities.get)}')

    def __str__(self) -> str:
        for c in self.labels:
            print(f'Class: {c}')
            print(f'Prior: {self.priors[c]}')
            print(f'Likelihoods: {self.likelihoods[c]}')

        return ''


## Dataset

We will use the [SMS Spam Collection Dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning Repository. This dataset contains 5574 SMS messages that have been collected for mobile phone spam research. The collection is composed by just one text file, where each line has the correct class followed by the raw message. We will use this dataset to train a Naive Bayes classifier that can classify new messages as spam or ham.

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('../../../data/SMSSpamCollection.csv', sep='\t', names=['label', 'message'])
stop_words = pd.read_csv('../../../data/stop_words_EN.txt', delimiter='t', on_bad_lines='skip')
stop_words = np.array(stop_words)

X = np.array(data.message)
y = np.array(data.label)

## Train and Test

In [6]:
n_tests = 50 # Number of tests to run
accuracies = [] # List to store the accuracies

model = NaiveBayes(stop_words=stop_words)

for i in range(n_tests):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    model.fit(X_train, y_train)
    accuracies.append(model.evaluate(X_test, y_test))

print(f'Average accuracies: {np.mean(accuracies)}') # Average score is the mean of the scores => formula: sum(x) / n
print(f'Standard deviation: {np.std(accuracies)}')  # Standard deviation is a measure of how spread out numbers are => formula: sqrt(sum((x - mean(x))**2) / n)
print(f'Minimum accuracies: {np.min(accuracies)}')
print(f'Maximum accuracies: {np.max(accuracies)}')

Average accuracies: 0.9294708520179372
Standard deviation: 0.007950782665996496
Minimum accuracies: 0.9139013452914798
Maximum accuracies: 0.9461883408071748


## Different words and their frequency in spam and ham messages (probability and class)

In [8]:
words = ['free', 'call', 'text', 'you', 'stop', 'reply', 'mobile', 'service']

for word in words:
    model.show_word_probabilities(word)
    print('---')

Word: free
P(free|ham) = 0.0009218150349913462 * 0.866502131478573 / 1.0 = 0.0007987546926489968
P(free|spam) = 0.014084507042253521 * 0.13349786852142698 / 1.0 = 0.001880251669315873
Prediction: spam
---
Word: call
P(call|ham) = 0.003612009933027316 * 0.866502131478573 / 1.0 = 0.003129814305889947
P(call|spam) = 0.02207627789207153 * 0.13349786852142698 / 1.0 = 0.00294713604347825
Prediction: ham
---
Word: text
P(text|ham) = 0.0010158777936639326 * 0.866502131478573 / 1.0 = 0.0008802602735315477
P(text|spam) = 0.008308276626048425 * 0.13349786852142698 / 1.0 = 0.0011091372206638575
Prediction: spam
---
Word: you
P(you|ham) = 0.027673263601474905 * 0.866502131478573 / 1.0 = 0.02397894189564642
P(you|spam) = 0.019069473017882577 * 0.13349786852142698 / 1.0 = 0.0025457340017141874
Prediction: ham
---
Word: stop
P(stop|ham) = 0.0006208142072390699 * 0.866502131478573 / 1.0 = 0.0005379368338248346
P(stop|spam) = 0.007675265073587593 * 0.13349786852142698 / 1.0 = 0.001024631527660897
Predic

## Acknowledgements
This [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) is public available at the [UCI Machine Learning Repository](https://archive-beta.ics.uci.edu/dataset/228/sms+spam+collection) for research. 

Tiago A. Almeida and José María Gómez Hidalgo\
Department of Computer Science\
Federal University of Sao Carlos (UFSCar)
