# Naive Bayes - Email Spam Classification

**Objective**: Classify each email as not spam (0) or spam (1).

**Outline**:

1. **Constructing the Model** 
    - 1.1 Multinomial Naive Bayes Model (NB)
2. **Data Preprocessing** 
    - 2.1 Binary Conversion
    - 2.2 Data Splitting
3. **Model Evaluation**

Dataset from https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

In [288]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

SEED = 42 # set the random seed to 42 for reproducibility

## 1. Constructing the Model

### 1.1 Multinomial Naive Bayes Model (NB)

**Terminology in Bayes' Theorem**: P(Y|X) = P(X|Y) P(Y) / P(X)

- P(Y|X) -> **posterior**
- P(X|Y) -> **likelihood**
- P(Y) -> **prior**
- P(X) -> **marginal** (the same constant for every class, so it can be dropped when comparing posteriors across classes)
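
To illustrate why the marginal can be ignored at prediction time, here is a minimal sketch with made-up numbers for a single word. The priors and likelihoods below are assumptions chosen for illustration, not values from the dataset.

```python
# Hypothetical values for one word (assumed, not taken from emails.csv)
prior = {0: 0.7, 1: 0.3}         # P(Y)
likelihood = {0: 0.05, 1: 0.40}  # P(X | Y): the word appears, given the class

# Unnormalized scores: P(X | Y) * P(Y)
joint = {k: likelihood[k] * prior[k] for k in prior}

# Marginal P(X): sum of the joint over all classes
marginal = sum(joint.values())

# Normalized posterior P(Y | X)
posterior = {k: joint[k] / marginal for k in joint}

# Dividing every score by the same P(X) does not change which class wins
best_unnormalized = max(joint, key=joint.get)
best_normalized = max(posterior, key=posterior.get)
print(posterior, best_unnormalized, best_normalized)
```

Both the unnormalized and normalized scores pick the same class, which is why the model below only needs the likelihood and the prior.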

**Model Outline**:
  1. **\_\_init\_\_** -> initialize the hyperparameter
  2. **fit** -> fit the model to X_train and y_train
  3. **predict** -> return the predicted class for each sample
  4. **_compute_class_likelihood_prior** -> compute the log prior for each class and the log likelihood for each class and word
  5. **_compute_class_posterior** -> compute the log posterior (up to a constant) for each class, given one sample

In [289]:
class MultinomialNaiveBayes:

  # public:
  def __init__(self, alpha_smoothing_parameter):


    """
    One hyperparameter in NB:
    1. alpha_smoothing_parameter -> to control the value of Laplace smoothing to avoid zero counts

    """


    self.alpha_smoothing_parameter = alpha_smoothing_parameter
    self.trained = False

  def fit(self, X, y):


    """
    fit/ train:
    1. initialize the feature matrix X and target vector y
    2. calculate the "likelihood + prior" for each class and word

    # Bayes' rule uses "likelihood * prior"; taking logs turns the product into a sum

    """

    
    self._initialize(X, y)
    self._compute_class_likelihood_prior()

    self.trained = True

  def predict(self, X):

    """
    predict:
    1. return the class/ index of the maximum posterior for each row and store in y_pred
    2. return y_pred 

    """

    y_pred = []

    if self.trained:
      for row in range(len(X)):
        y_pred.append(np.argmax(self._compute_class_posterior(X[row, :])))

    return np.array(y_pred)


  # private:
  def _initialize(self, X, y):
    self.m_samples, self.n_features = X.shape
    self.classes = [0,1] # binary classification

    if isinstance(X, pd.DataFrame) or isinstance(X, pd.Series):
      X = X.to_numpy()
    self.X = X

    if isinstance(y, pd.Series):
      y = y.to_numpy()
    self.y = y.ravel() # the model takes in 1D array instead of column vectors

  def _compute_class_likelihood_prior(self):


    """
    Variables:
    1. self.class_prior -> stores the log prior for each class in a dict
    dimension: {k classes: one log prior each}

    2. self.class_likelihood -> stores the log likelihood for each class and word in a dict
    dimension: {k classes: n features}

    so a double for loop is needed to calculate the class likelihood

    """


    self.class_prior = {}
    self.class_likelihood = {}

    for k in self.classes:
      self.class_prior[k] = np.log(len(self.y[self.y==k])/ self.m_samples)
      
      likelihood_each_class = [] 

      for j in range(self.n_features):

        # laplace smoothing
        numerator = np.sum(self.X[self.y==k, j]) + self.alpha_smoothing_parameter # row for y's condition then column for each word x 
        denominator = len(self.y[self.y==k]) + self.alpha_smoothing_parameter

        likelihood = np.log(numerator/ denominator)
        likelihood_each_class.append(likelihood)
      
      self.class_likelihood[k] = likelihood_each_class
      
  def _compute_class_posterior(self, x): # per row (sample)


    """ 
    Variable:
    1. class_posterior -> stores "log likelihood + log prior" for each class,
    which equals log P(Y|X) up to a constant that is the same for every class
    dimension: [k classes]

    # uses a list instead of a dict (unlike the likelihood and prior) because it is more convenient for the np.argmax() operation in predict()

    """


    class_posterior = []

    for k in self.classes:
      log_likelihood = 0.0 # running sum of log likelihoods
      log_prior = self.class_prior[k]

      for j in range(self.n_features):
        # only works for binary values; a word adds its log likelihood only if x contains it
        if x[j]:
          log_likelihood += self.class_likelihood[k][j]
      
      class_posterior.append(log_likelihood + log_prior)

    return np.array(class_posterior) 

## 2. Data Preprocessing

In [290]:
df = pd.read_csv("emails.csv")
df

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,Email 5168,2,2,2,3,0,0,32,0,0,...,0,0,0,0,0,0,0,0,0,0
5168,Email 5169,35,27,11,2,6,5,151,4,3,...,0,0,0,0,0,0,0,1,0,0
5169,Email 5170,0,0,1,1,0,0,11,0,0,...,0,0,0,0,0,0,0,0,0,1
5170,Email 5171,2,7,1,0,2,1,28,2,0,...,0,0,0,0,0,0,0,1,0,1


### 2.1 Binary Conversion

Convert each word count into a binary indicator: 0 if the word is absent from the email, 1 if it appears at least once.

In [291]:
df.drop(columns=['Email No.'], inplace=True) 

df = df.apply(lambda x: np.where(x == 0, 0, 1))
df


Unnamed: 0,the,to,ect,and,for,of,a,you,hou,in,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,1,0,0
2,0,0,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,1,1,0,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,1,1,1,0,1,1,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,1,1,1,1,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5168,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,1,0,0
5169,0,0,1,1,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,1
5170,1,1,1,0,1,1,1,1,0,1,...,0,0,0,0,0,0,0,1,0,1
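
As a quick sanity check of the binary conversion, the sketch below applies the same rule to a toy table. The column names and counts are made up for illustration; only the `np.where` rule comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy word-count table (made-up values, loosely mirroring the layout of emails.csv)
toy = pd.DataFrame({"the": [0, 8, 2], "free": [3, 0, 0], "Prediction": [1, 0, 0]})

# Same rule as the notebook: 0 stays 0, anything else becomes 1
via_where = toy.apply(lambda col: np.where(col == 0, 0, 1))

# Equivalent comparison-based form
via_compare = (toy != 0).astype(int)

print(via_compare)
```

The `Prediction` column is already 0/1, so the conversion leaves it unchanged.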


### 2.2 Data Splitting 

Split the dataset into training (70%) and testing (30%) sets.

In [292]:
X = df.drop(columns=['Prediction']).to_numpy()
y = df['Prediction']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

## 3. Model Evaluation

A simple performance comparison between my model and sklearn's model using the accuracy score.

In [300]:
parameters = [1, 1e-9, 1e-12, 0]

for parameter in parameters:
  clf = MultinomialNaiveBayes(parameter) # initialize
  clf.fit(X_train, y_train) # train
  y_pred = clf.predict(X_test) # predict

  my_model_score = accuracy_score(y_test, y_pred)
  print(f"The accuracy score of my model is {my_model_score} with alpha = {parameter}")

The accuracy score of my model is 0.8827319587628866 with alpha = 1
The accuracy score of my model is 0.9439432989690721 with alpha = 1e-09
The accuracy score of my model is 0.946520618556701 with alpha = 1e-12


The accuracy score of my model is 0.9400773195876289 with alpha = 0


In [294]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB() # initialize
clf.fit(X_train, y_train) # train
y_pred = clf.predict(X_test) # predict

sklearn_model_score = accuracy_score(y_test, y_pred)
print(f"The accuracy score of sklearn's model is {sklearn_model_score}")

The accuracy score of sklearn's model is 0.9439432989690721


As shown above, my model (0.9465) slightly outperforms sklearn's model (0.9439), by about 0.003, when the smoothing parameter is very close to 0 but not exactly 0 (alpha = 1e-12). Note that sklearn's `MultinomialNB` runs with its default alpha = 1.0 here, so the two models are not compared at the same smoothing level. A smaller alpha has less impact on the estimated likelihood of a word that never appears in a class, while alpha = 0 (no Laplace smoothing) gives log(0) = -inf for such words, which triggers a divide-by-zero warning in `np.log`.

In other words, changing the smoothing parameter matters most for features that have zero (or near-zero) frequency in a class.
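
The effect of alpha can be seen with a small sketch that reuses the smoothing formula from `_compute_class_likelihood_prior`. The class size and word frequencies below are hypothetical, and `smoothed_log_likelihood` is a helper introduced only for this illustration.

```python
import math

def smoothed_log_likelihood(word_docs, class_docs, alpha):
    """Log of (word_docs + alpha) / (class_docs + alpha), the formula used in the class above."""
    return math.log((word_docs + alpha) / (class_docs + alpha))

class_docs = 1000  # hypothetical number of training emails in one class

for alpha in (1, 1e-9, 1e-12):
    unseen = smoothed_log_likelihood(0, class_docs, alpha)    # word never seen in this class
    common = smoothed_log_likelihood(400, class_docs, alpha)  # word seen in 400 of the emails
    print(f"alpha={alpha:g}: unseen={unseen:.2f}, common={common:.4f}")
```

Under these assumptions, the log likelihood of the unseen word drops sharply as alpha shrinks, while the value for the common word barely moves.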

Additionally, despite the assumptions of conditional independence (ignoring the order in which words appear) and binary conversion (ignoring the word counts), both models perform very well.

Nonetheless, my model's runtime is significantly longer because it uses Python for loops to calculate the posterior for each sample and feature.
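
One possible way to remove the loops is to store the log likelihoods as a matrix and score all samples with a single matrix product. The sketch below is an assumption about how that could look, not part of the model above: `predict_vectorized` and the toy arrays are hypothetical, and it assumes alpha > 0 so that no log likelihood is -inf (otherwise 0 * -inf would produce NaN).

```python
import numpy as np

def predict_vectorized(X, log_likelihood, log_prior):
    """
    X: (m_samples, n_features) binary matrix
    log_likelihood: (k_classes, n_features) array of log P(word | class)
    log_prior: (k_classes,) array of log P(class)
    """
    # Each row of scores holds "log likelihood + log prior" for every class
    scores = X @ log_likelihood.T + log_prior  # shape (m_samples, k_classes)
    return np.argmax(scores, axis=1)

# Tiny made-up example: 2 classes, 3 words
log_likelihood = np.log(np.array([[0.5, 0.1, 0.4],
                                  [0.1, 0.6, 0.3]]))
log_prior = np.log(np.array([0.7, 0.3]))
X = np.array([[1, 0, 0],
              [0, 1, 1]])

y_vec = predict_vectorized(X, log_likelihood, log_prior)

# Loop-based version for comparison: add a word's log likelihood only if it is present
y_loop = np.array([
    np.argmax([
        log_prior[k] + sum(log_likelihood[k][j] for j in range(X.shape[1]) if x[j])
        for k in range(len(log_prior))
    ])
    for x in X
])
print(y_vec, y_loop)
```

With the fitted model, the matrix could be built from the `class_likelihood` dict, for example with `np.array([clf.class_likelihood[k] for k in clf.classes])`.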