# Spam Email Detection Using the Naive Bayes Classification Algorithm

This is an email spam classifier that uses the Naive Bayes supervised machine learning algorithm.

In [55]:
# import packages
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [57]:
# mount drive to import folders
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [58]:
# import folders
train_mails = '/content/drive/My Drive/MSBA_Colab_2020/ML_Algorithms/CA02/Data/train-mails'
test_mails = '/content/drive/My Drive/MSBA_Colab_2020/ML_Algorithms/CA02/Data/test-mails'

## Cleaning and Preparing the data

This function builds a dictionary of the 3,000 most common words across all emails in a folder. It first collects every whitespace-separated token from every file. It then drops tokens that are not purely alphabetic, as well as single-character tokens. Finally it keeps only the 3,000 most frequent words and returns them as a list of (word, count) pairs.

In [56]:
def create_Dictionary(root_dir): # directory containing the email text files
  all_words = [] # collects every token from every file in root_dir
  emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
  for email in emails:
    with open(email) as e: # open each email file for reading
      for line in e:
        words = line.split() # split the line into whitespace-separated tokens
        all_words += words # append this line's tokens to all_words
  words_dictionary = Counter(all_words) # count how many times each token occurs
  remove_list = list(words_dictionary) # snapshot of the keys, so entries can be deleted while looping

  for item in remove_list:
    if not item.isalpha(): # token contains something other than letters
      del words_dictionary[item]
    elif len(item) == 1: # single-character token
      del words_dictionary[item]
  words_dictionary = words_dictionary.most_common(3000) # list of (word, count) tuples for the 3,000 most common words
  return words_dictionary
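
To make the filtering step concrete, here is a minimal sketch on a handful of made-up tokens. The token list and the cutoff of 2 are illustrative assumptions, not taken from the dataset. It shows that `Counter.most_common` returns a list of `(word, count)` tuples rather than a dict, which is why later code reads the word with `d[0]`.

```python
from collections import Counter

# Hypothetical tokens standing in for the words read from the email files.
tokens = ["free", "free", "offer", "a", "2020", "offer", "free", "hi!", "meeting"]

counts = Counter(tokens)

# Keep purely alphabetic tokens longer than one character, as create_Dictionary does.
filtered = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

# most_common returns a list of (word, count) tuples, highest count first.
top_words = filtered.most_common(2)
print(top_words)  # [('free', 3), ('offer', 2)]
```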

## Extracting features and corresponding label matrix

This function builds the feature matrix, which has 3,000 columns (one per dictionary word) and one row per email file, and fills it with word counts. It also inspects each file name and labels the email as spam or not spam based on the naming convention, producing the label column. The same function is used for both the training and the testing datasets, and it returns the feature matrix together with the labels.

In [60]:
def derive_features(mail_master):
  mail_files = [os.path.join(mail_master,mf) for mf in os.listdir(mail_master)]
  features_matrix = np.zeros((len(mail_files),3000)) # one row per email, one column per dictionary word
  train_labels = np.zeros(len(mail_files)) # 0 = not spam (default), 1 = spam
  docID = 0
  for fil in mail_files:
    with open(fil) as mf:
      for i, line in enumerate(mf):
        if i == 2: # only the third line of each file is used
          words = line.split()
          for word in words:
            for wordID, d in enumerate(words_dictionary): # words_dictionary is the global list of (word, count) tuples
              if d[0] == word:
                features_matrix[docID,wordID] = words.count(word) # occurrences of this dictionary word in the line
    if os.path.basename(fil).startswith("spmsg"): # file names starting with 'spmsg' are spam
      train_labels[docID] = 1
    docID = docID + 1
  return features_matrix, train_labels
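
The nested loop above scans the whole word list for every word in the message. One possible alternative, sketched here with plain lists and a hypothetical three-word vocabulary, is to precompute a word-to-column lookup so that each word costs a single dictionary access. The helper names `count_vector` and `label_from_path`, and the sample file names, are assumptions made for illustration; they are not part of the notebook.

```python
import os
from collections import Counter

# Hypothetical vocabulary in the same shape create_Dictionary returns.
vocabulary = [("free", 3), ("offer", 2), ("meeting", 1)]
# Map each word to its column index once, instead of scanning the list per word.
column_of = {word: idx for idx, (word, _) in enumerate(vocabulary)}

def count_vector(text):
    """Return one feature row: occurrences of each vocabulary word in text."""
    row = [0] * len(column_of)
    for word, n in Counter(text.split()).items():
        if word in column_of:
            row[column_of[word]] = n
    return row

def label_from_path(path):
    """Return 1 for file names starting with 'spmsg' (spam), else 0."""
    return 1 if os.path.basename(path).startswith("spmsg") else 0

print(count_vector("free offer free today"))       # [2, 1, 0]
print(label_from_path("train-mails/spmsga1.txt"))  # 1
print(label_from_path("train-mails/message.txt"))  # 0
```

Words outside the vocabulary, such as "today" here, are simply ignored, which matches the behavior of the original loop.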

## Training and predicting with sklearn Naive Bayes

This section is the main program. It calls the two functions above to build the dictionary and the feature matrices, trains the model on the training dataset with `model.fit`, and then predicts labels for the test dataset with the trained model. The accuracy of those predictions is printed in the final section.

In [63]:
words_dictionary = create_Dictionary(train_mails) # builds the word list from the training emails only

print ("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = derive_features(train_mails) # feature matrix and labels for the training emails
test_features_matrix, test_labels = derive_features(test_mails) # feature matrix and labels for the test emails

model = GaussianNB() # Gaussian Naive Bayes classifier

print ("Training Model using Gaussian Naive Bayes algorithm .....")
model.fit(features_matrix, labels) # fits the model to the training data
print ("Training completed")
print ("testing trained model to predict Test Data labels")
predicted_labels = model.predict(test_features_matrix) # predicts a label for each test email

reading and processing emails from TRAIN and TEST folders
Training Model using Gaussian Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
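
As a rough illustration of the fit-then-predict pattern used above, the following sketch trains `GaussianNB` on a tiny made-up count matrix. The data, the two-column layout, and the names `toy_model`, `X_train`, and `X_new` are assumptions chosen so the outcome is easy to check by eye; they say nothing about how the model behaves on the real emails.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up counts for two vocabulary words; each row is one email.
X_train = np.array([[3, 0], [4, 0], [0, 3], [0, 4]])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

toy_model = GaussianNB()
toy_model.fit(X_train, y_train)  # estimates a per-class mean and variance for each column

# Two unseen rows: one heavy on the first word, one heavy on the second.
X_new = np.array([[5, 0], [0, 5]])
toy_predictions = toy_model.predict(X_new)
print(toy_predictions)  # [1 0]
```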


## Accuracy Score

In [62]:
print ("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:")
print (accuracy_score(test_labels, predicted_labels))

Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9653846153846154
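
With its default settings, `accuracy_score` returns the fraction of samples whose predicted label matches the true label. A minimal hand-rolled sketch of that calculation is shown below. The `accuracy` helper and the sample labels are illustrative assumptions, not the notebook's data.

```python
def accuracy(true_labels, predicted):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted):
        raise ValueError("label sequences must have the same length")
    matches = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return matches / len(true_labels)

# Hypothetical labels: four of the five predictions are correct.
sample_true = [1, 0, 1, 1, 0]
sample_pred = [1, 0, 0, 1, 0]
print(accuracy(sample_true, sample_pred))  # 0.8
```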
