# CA02: Spam Email Detection Using Naive Bayes Classification Algorithm

<i>
Luis Otero
<br>
BSAN 6070
<br>
January 30, 2024
<br>
https://github.com/otero106/BSAN6070
</i>

## Project Overview

In this Python project, we use a supervised machine learning algorithm called Naive Bayes to build a model that determines whether an email is spam. The algorithm is based on Bayes' Theorem, a formula for the probability of an event given that another event has already occurred. The "naive" part refers to the assumption that the features in a dataset are conditionally independent of one another given the class. For this project, we import two folders, each containing a mix of spam and non-spam emails: one for training the model and one for testing its accuracy. We first fit the model on the training folder so that it learns to distinguish spam from non-spam based on the words in each email. We then apply the same model to the testing folder to measure how accurately it categorizes emails as spam or non-spam.
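
As a brief sketch of the math behind the classifier, Bayes' Theorem combined with the conditional-independence assumption can be written as follows. The symbols $C$ (the class, spam or non-spam) and $w_1, \dots, w_n$ (the word features of one email) are notation introduced here for illustration and do not appear in the notebook code.

```latex
P(C \mid w_1, \dots, w_n)
  = \frac{P(C)\, P(w_1, \dots, w_n \mid C)}{P(w_1, \dots, w_n)}
\quad\Longrightarrow\quad
P(C \mid w_1, \dots, w_n) \;\propto\; P(C) \prod_{i=1}^{n} P(w_i \mid C)
```

The predicted class is the one with the larger right-hand side. In the Gaussian variant, each $P(w_i \mid C)$ is modeled as a normal density with a per-class mean and variance.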

## Python Implementation

### Part 1: Loading Libraries and Accessing Google Drive

Before running the code in this project, we need to import several libraries. The os module lets us interact with the file system of the machine the program runs on. NumPy is a Python library for working with arrays and mathematical functions. Counter is a dict subclass in the collections module, used to count hashable objects. From scikit-learn, we import the Gaussian Naive Bayes classifier and the metrics module to assess model performance. Finally, we import the Google Colab drive module to access the folders and files stored in the user's Google Drive.

In [1]:
#importing libraries to pre-process emails
import os
import numpy as np
from collections import Counter

#import naive bayes and metrics from scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

#import google drive
from google.colab import drive

In [2]:
#accessing local files that are in google drive
drive.mount('/content/drive')

Mounted at /content/drive


### Part 2: Writing Functions to Process Emails

The next step is to define functions that process every email in the train-mails and test-mails folders. The function make_Dictionary counts the words in all emails of a folder, removes tokens that are not purely alphabetic or are only one character long, and keeps the 3000 most common words. The function extract_features then produces a matrix of word frequencies, with one row per email and one column per dictionary word, along with a label for each email. When extracting features, we ignore the subject line and use only the email content (the third line of each file) and the file name, which tells us whether the email is spam.

In [3]:
#function to make a dictionary of each word in email and their frequency
def make_Dictionary(root_dir):
  all_words = [] #empty list to store all words
  emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)] #list of all files in folder

  #iterate through each and every email and turn every sentence into list of words
  for mail in emails:
    with open(mail) as m:
      for line in m:
        words = line.split()
        all_words += words #add list of words to list all_words

  #make a dictionary of all words and their count
  dictionary = Counter(all_words)
  list_to_remove = list(dictionary) #make list of all keys (words) in dictionary

  #iterate through list and remove unnecessary words
  for item in list_to_remove:
    if not item.isalpha():
      del dictionary[item] #if word is not purely alphabetic, remove from dictionary
    elif len(item) == 1:
      del dictionary[item] #if word is only one character, remove from dictionary

  #turn dictionary into a list of the 3000 most common words and their count
  dictionary = dictionary.most_common(3000)
  return dictionary #output of function is dictionary
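
To illustrate the counting and filtering steps in isolation, here is a minimal sketch that applies the same rules to a few in-memory strings instead of files. The sample lines and the names `sample_lines`, `counts`, `filtered`, and `top_words` are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of a few emails
sample_lines = [
    "free money now !!!",
    "meeting at 10 am tomorrow",
    "free offer , claim your free prize",
]

# Count every whitespace-separated token
counts = Counter(word for line in sample_lines for word in line.split())

# Keep only purely alphabetic tokens that are longer than one character
filtered = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

# most_common(n) returns a list of (word, count) tuples, not a dict
top_words = filtered.most_common(2)
print(top_words)
```

Note that `most_common` returns a list of tuples, which is why the feature-extraction step has to compare against the first element of each entry.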


In [4]:
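#function to build a word-count features matrix and spam labels from the emails in a folder
#relies on the global variable dictionary created by make_Dictionary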
def extract_features(mail_dir):

  #make a list of all emails(files) in folder and make matrices of zeroes
  files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
  features_matrix = np.zeros((len(files),3000)) #size of matrix is no. of files and 3000
  train_labels = np.zeros(len(files)) #size of matrix is no. of files

  #variables to keep track of spam emails and documents
  count = 1
  docID = 0

  #iterate through every line in every email
  for fil in files:
    with open(fil) as fi:

      #if line is third line, turn it into a list of words and iterate thru list
      #skip subject line and go to email content
      for i, line in enumerate(fi):
        if i ==2:
          words = line.split()
          for word in words:
            wordID = 0 #variable to keep track which word it is

            #loop thru dictionary of words created from previous function and
            #check if word is in dictionary
            for i, d in enumerate(dictionary):
              if d[0] == word:
                wordID = i #if yes, set word ID to index
                features_matrix[docID,wordID] = words.count(word) #set count of word in matrix

      #default label is 0 (non-spam); split file path into tokens
      train_labels[docID] = 0
      filepathTokens = fil.split('/')
      lastToken = filepathTokens[len(filepathTokens)-1] #get last token in list (the file name)

      #mark email as spam if the file name starts with spmsg
      if lastToken.startswith("spmsg"):
        train_labels[docID] = 1
        count = count + 1 #increase spam count by 1
      docID = docID + 1 #increase document ID by 1

  #output features matrix and labels
  return features_matrix, train_labels
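
The function above scans the whole dictionary for every word in an email. As an alternative sketch (not the approach used in this notebook), the dictionary can be turned into a word-to-column lookup once, so each word is resolved in a single step. The helper names `count_vector` and `label_from_path`, the small `dictionary` list, and the example file name are all hypothetical.

```python
import os

# Hypothetical dictionary in the same shape that make_Dictionary returns:
# a list of (word, count) tuples
dictionary = [("free", 3), ("money", 2), ("meeting", 1)]

# Map each word to its column index once, so each lookup is a dict access
word_index = {word: i for i, (word, _) in enumerate(dictionary)}

def count_vector(line, word_index):
    """Return a list of word counts for one line of text."""
    vector = [0] * len(word_index)
    for word in line.split():
        col = word_index.get(word)
        if col is not None:
            vector[col] += 1
    return vector

def label_from_path(path):
    """Return 1 if the file name starts with 'spmsg' (spam), else 0."""
    return 1 if os.path.basename(path).startswith("spmsg") else 0

print(count_vector("free money free prize", word_index))
print(label_from_path("train-mails/spmsg_example.txt"))
```

Incrementing a counter per occurrence yields the same counts as calling `words.count(word)` for each word, and words that are not in the dictionary are simply skipped.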

### Part 3: Load Training and Test Emails

Next, we specify the paths of the folders that contain the data. In this case, the folders are located on my Google Drive, so I use their associated file paths. If you are running this notebook in a Colab environment under your own account, change the paths accordingly. Refer to the README located in the GitHub folder of this assignment.

In [5]:
#enter the file paths of your "train-mails" and "test-mails" folders located on your Google Drive
train_path = '/content/drive/MyDrive/Colab Notebooks/BSAN6070/Computer_Assignments/CA02/train-mails'
test_path = '/content/drive/MyDrive/Colab Notebooks/BSAN6070/Computer_Assignments/CA02/test-mails'
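
Because the paths differ from account to account, it can help to confirm that both folders exist before processing any emails. The following is a minimal sketch; the helper name `missing_dirs` and the placeholder paths are assumptions made for this example and are not part of the original notebook.

```python
import os

def missing_dirs(*paths):
    """Return the subset of paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]

# Example call with placeholder paths; substitute your own train_path and test_path
print(missing_dirs("/path/to/train-mails", "/path/to/test-mails"))
```

An empty list means every folder was found; any path that is printed should be corrected before continuing.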

### Part 4: Cleaning All Emails

We now use our functions to process all the data in the train and test folders. With the outputs, we can jump right into using a Naive Bayes model.

In [7]:
#using the make_Dictionary function on training data, store all words from training data in dictionary
dictionary = make_Dictionary(train_path)

#make features matrix and labels on both training and test data
print("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = extract_features(train_path)
test_features_matrix, test_labels = extract_features(test_path)

reading and processing emails from TRAIN and TEST folders


### Part 5: Training and Testing the Model

The last step of this project is to create our Naive Bayes model using the scikit-learn package and train the model using the email training data we have cleaned for this purpose. We then make predictions on the prepared testing data and retrieve the accuracy of our model.

In [8]:
#training the model using Naive Bayes algorithm
print("Training Model using Gaussian Naive Bayes algorithm")
gnb = GaussianNB()
gnb.fit(features_matrix, labels)
print("Training completed")

#making predictions on testing set
print("Testing trained model to predict Test Data labels")
test_pred = gnb.predict(test_features_matrix)

#comparing actual response values (test_labels) with predicted response values (test_pred)
#testing accuracy of model
print("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:",
      metrics.accuracy_score(test_labels, test_pred))

Training Model using Gaussian Naive Bayes algorithm
Training completed
Testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels: 0.9615384615384616
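
The accuracy score is the fraction of test emails whose predicted label matches the true label. A minimal pure-Python sketch of that calculation is shown below; the function name `accuracy` and the two short label lists are hypothetical and are not taken from the test data.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical labels for illustration only (1 = spam, 0 = non-spam)
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 0]
print(accuracy(true_labels, predicted_labels))  # 4 of 5 predictions match
```

The score reported above, 0.9615..., is numerically equal to 25/26, which corresponds to roughly one misclassified email in every 26. The notebook does not state the size of the test set, so the exact number of errors is not shown here.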
