# Spam Email Detection Using the Naive Bayes Classification Algorithm

In this project, a model is trained on a set of emails labelled as either spam or non-spam. The training set contains 702 emails, divided equally between the two categories.

Next, the model is tested on 260 emails. It predicts the category of each email, and the predictions are compared with the known correct classifications to measure accuracy.

There are two folders: train-mails and test-mails. The emails in train-mails are used to train the model; those in test-mails are used to test its accuracy.

Each email's first line is the subject; the content starts from the third line.
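As a minimal sketch of this layout, the snippet below pulls the subject and the body out of an email supplied as text. The helper name `split_email` and the sample message are hypothetical and made up for illustration; they are not part of the notebook or the data set.

```python
import io

def split_email(file_obj):
    """Return (subject, body) from a file-like object laid out as described above."""
    lines = file_obj.read().splitlines()
    subject = lines[0] if len(lines) > 0 else ""   # first line: subject
    body = lines[2] if len(lines) > 2 else ""      # third line: content
    return subject, body

# Hypothetical sample email, not taken from the data set
sample = io.StringIO("Subject: meeting notes\n\nplease find the notes attached\n")
subject, body = split_email(sample)
print(subject)
print(body)
```

The same index (2, counting from zero) is what the feature extraction code later in the notebook uses to locate the body.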

If you open either the train-mails or the test-mails folder, you will see file names in two patterns:

* *number-numbermsg[number].txt : example 3-1msg1.txt* (non spam emails)

* *spmsg[number].txt : example spmsga162.txt* (spam emails).

In [1]:
# Importing modules
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from google.colab import drive

# Mounting google drive to link directories
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Linking directories
train_data = '/content/drive/MyDrive/MSBA_Colab_2020/ML_Algorithms/CA02/Data/train-mails'
test_data = '/content/drive/MyDrive/MSBA_Colab_2020/ML_Algorithms/CA02/Data/test-mails'

## 1. Cleaning and Preparing Data

First, the data is cleaned and prepared for the model by removing unneeded words, expressions and symbols from the text.

### Building a Dictionary

This function builds a dictionary of the 3000 most common words across all email content. It also removes every token that is not purely alphabetic and every single-character token.
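The filtering and ranking steps can be illustrated on a handful of tokens. This is only a sketch: the token list is invented, and it keeps the top 2 words instead of 3000 for brevity.

```python
from collections import Counter

# Hypothetical tokens standing in for the words of all training emails
tokens = ["free", "offer", "a", "free", "2020", "meeting", "free", "offer", "!"]

counts = Counter(tokens)
# Keep only purely alphabetic tokens longer than one character
filtered = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

top_words = filtered.most_common(2)   # the notebook keeps 3000; 2 is used here for brevity
print(top_words)
```

Note that `most_common` returns a list of `(word, count)` tuples, not a dictionary, which is why the later code reads each word as the first element of a tuple.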

In [3]:
def create_dict(root_folder):                                                         # root_folder = folder where all email txt files are stored
  all_email_words = []                                                                # collects every word from every email
  all_emails = [os.path.join(root_folder, file) for file in os.listdir(root_folder)]  # list of paths: root_folder/file
  for email in all_emails:
    with open(email) as e:                                                            # opens the file; it is closed automatically after the block
      for text in e:                                                                  # iterates over the lines of the email
        email_words = text.split()                                                    # splits the line into a list of individual words
        all_email_words += email_words                                                # appends the words of this line to all_email_words
  words_dict = Counter(all_email_words)                                               # maps each word to its number of occurrences
  list_to_remove = list(words_dict)                                                   # copy of the keys, so entries can be deleted while looping
  for word in list_to_remove:
    if not word.isalpha():                                                            # word contains non-alphabetic characters (digits, punctuation)
      del words_dict[word]                                                            # removes the non-alphabetic word from words_dict
    elif len(word) == 1:                                                              # word is a single character
      del words_dict[word]                                                            # removes the single-character word from words_dict
  words_dict = words_dict.most_common(3000)                                           # list of (word, count) tuples for the 3000 most common words
  return words_dict                                                                   # returns that list

## 2. Building the Functions

In this section, we create the functions that extract the features and labels fed into the model to make predictions.

### Extracting features and corresponding label frequency matrix

A word frequency matrix and train labels are created. 

**Frequency matrix.** The 3000 columns represent features, in this case the 3000 most commonly used words that were previously identified. There are as many rows as the number of emails, each row representing one email. The values indicate how many times a word occurs in each email.
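A toy version of such a matrix, assuming a three-word vocabulary and two invented emails, might look like the sketch below. It uses plain Python lists rather than the NumPy array used in the notebook, purely to stay dependency-free.

```python
# Hypothetical stand-in for the 3000 most common words
vocabulary = ["free", "offer", "meeting"]

# Two invented email bodies, already split into words
emails = [
    "free offer free prize".split(),
    "meeting agenda for monday meeting".split(),
]

# One row per email, one column per vocabulary word
frequency_matrix = [[body.count(word) for word in vocabulary] for body in emails]
for row in frequency_matrix:
    print(row)
```

Words that are not in the vocabulary (such as "prize" here) simply do not contribute to any column.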

**Train labels.** A one-dimensional array whose length equals the number of emails. Each entry is 0 if the email is non-spam and 1 if it is spam.

The function derives each label from the email's file name, using the naming convention described above: files whose names start with "spmsg" are labelled as spam, and all others as non-spam.
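The labelling rule can be sketched in isolation as follows. The helper name `label_from_path` is hypothetical; the two example file names come from the naming patterns listed earlier.

```python
import os

def label_from_path(path):
    """Return 1 for spam, 0 for non-spam, based only on the file name."""
    return 1 if os.path.basename(path).startswith("spmsg") else 0

paths = ["train-mails/3-1msg1.txt", "train-mails/spmsga162.txt"]
labels = [label_from_path(p) for p in paths]
print(labels)
```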

In [4]:
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10],[11, 12, 13, 14, 15]])
print(arr)
print(arr[2,3])

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
14


In [5]:
def get_features(main_folder):
  # Relies on the global words_dict: the list of (word, count) tuples returned by create_dict()
  email_files = [os.path.join(main_folder, fil) for fil in os.listdir(main_folder)]     # list of paths of all email files: main_folder/file
  frequency_matrix = np.zeros((len(email_files), 3000))                                 # 2D array of zeros; rows = emails, columns = 3000 most common words
  train_labels = np.zeros(len(email_files))                                             # 1D array of zeros; length = number of emails (0 = non-spam, 1 = spam)
  spam_count = 0                                                                        # number of spam emails seen so far
  email_ID = 0                                                                          # row index; 0 = first email
  for f in email_files:
    with open(f) as fi:                                                                 # opens the file; it is closed automatically after the block
      for line_no, text in enumerate(fi):                                               # enumerate() yields the line index and the line text
        if line_no == 2:                                                                # index 2 = third line, which holds the email body
          email_body = text.split()                                                     # splits the body into a list of individual words
          for word in email_body:
            for word_ID, w in enumerate(words_dict):                                    # word_ID = column index; w = (word, count) tuple
              if w[0] == word:                                                          # the body word is one of the 3000 most common words
                frequency_matrix[email_ID, word_ID] = email_body.count(word)            # number of occurrences of the word in this email
    file_name = os.path.basename(f)                                                     # last path component = email file name
    if file_name.startswith("spmsg"):                                                   # file names starting with "spmsg" mark spam
      train_labels[email_ID] = 1                                                        # label the email as spam
      spam_count = spam_count + 1                                                       # adds 1 each time a spam email is found
    email_ID = email_ID + 1                                                             # moves on to the next email (next row)
  return frequency_matrix, train_labels

## 3. Training the Model and Predicting Results

In this section, we choose a method to train the model and use the previously defined functions to extract the input data.

### sklearn Naive Bayes

sklearn provides several Naive Bayes variants for model training, including:

1.   **Gaussian**
2.   Multinomial
3.   Bernoulli

This model uses the Gaussian variant, which assumes that, within each class, every feature follows a normal distribution.
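To make the normality assumption concrete, here is a minimal pure-Python sketch of the idea for a single feature: estimate a mean and variance per class, then pick the class with the highest log prior plus Gaussian log density. It is not sklearn's implementation; the word counts are invented, and the small `eps` added to the variance is an assumption made here to avoid division by zero.

```python
import math

def fit_gaussian(values, eps=1e-9):
    """Return (mean, variance) of a list of numbers; eps avoids zero variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values) + eps
    return mean, var

def log_gaussian(x, mean, var):
    """Log of the normal density N(x | mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical counts of one word in non-spam (0) and spam (1) emails
counts_by_class = {0: [0, 0, 1, 0], 1: [2, 3, 4, 3]}
total = sum(len(v) for v in counts_by_class.values())

def predict(x):
    """Return the class with the highest log prior + log likelihood for count x."""
    scores = {}
    for label, values in counts_by_class.items():
        mean, var = fit_gaussian(values)
        prior = len(values) / total
        scores[label] = math.log(prior) + log_gaussian(x, mean, var)
    return max(scores, key=scores.get)

print(predict(0), predict(3))
```

With many features, as in the 3000-column frequency matrix, the "naive" independence assumption means the per-feature log densities are summed before the classes are compared.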

In [6]:
words_dict = create_dict(train_data)                                    # specifies that dictionary is using train data
print ("reading and processing emails from TRAIN and TEST folders")
frequency_matrix, labels = get_features(train_data)                     # specifies that frequency_matrix and labels are derived from train data
test_frequency_matrix, test_labels = get_features(test_data)            # stores test data frequency_matrix and labels into separate variables

model = GaussianNB()                                                    # specifies model type

print ("Training Model using Gaussian Naive Bayes algorithm .....")
model.fit(frequency_matrix, labels)                                     # trains model using train data 
print ("Training completed")        
print ("testing trained model to predict Test Data labels")
predicted_labels = model.predict(test_frequency_matrix)                 # model predicts labels based on test frequency matrix

reading and processing emails from TRAIN and TEST folders
Training Model using Gaussian Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels


## 4. Evaluation

Lastly, we evaluate the model's performance based on the calculated accuracy score.

### Accuracy score

The accuracy score for the predicted labels is calculated. It is the fraction of correct predictions, in this case the share of test emails correctly classified as spam or non-spam.
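The calculation itself is simple enough to sketch by hand. The function below is an illustrative stand-in, not sklearn's `accuracy_score`, and the label lists are invented.

```python
def accuracy(true_labels, predicted):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return correct / len(true_labels)

# Hypothetical labels, for illustration only
true_labels = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 1]
print(accuracy(true_labels, predicted))
```

Here four of the five predictions match, so the score is 0.8.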

In [7]:
print ("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:")
print (accuracy_score(test_labels, predicted_labels))                   # accuracy score = how close predicted labels are to test labels

Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9653846153846154


### Result

The returned accuracy score is 0.9654, meaning that the model correctly classified 96.54% of the 260 test emails (251 of 260). This suggests the model performs well on this test set.