# CA02: Spam E-mail Detection Using the Naive Bayes Classification Algorithm

In this exercise we train a model on a set of emails labelled as either Spam or Not Spam. The training set contains 702 emails, divided equally between the spam and non-spam categories. We then test the model on 260 emails: the model predicts the category of each email, and we compare its predictions with the known correct labels to measure accuracy.

# Importing the necessary libraries 

In [0]:
#importing necessary modules 
#to create, change or  move directory
import os
import numpy as np 

#A Counter is a dict subclass for counting hashable objects
from collections import Counter 

#GaussianNB is the Gaussian Naive Bayes classifier; it assumes that each feature follows a normal distribution within each class
from sklearn.naive_bayes import GaussianNB 

#accuracy_score returns the fraction of predictions that match the true labels
from sklearn.metrics import accuracy_score 


### Setting the Current Directory Path

Mounting Google Drive

In [16]:
#mount the Google Drive folder so that the email files can be read
from google.colab import drive
drive.mount('/content/drive')
root_path = '/content/drive/My Drive/MSBA_Colab_2020/ML Algorithms/Naive_Bayes'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Step 1 - Cleaning and Preparing the Data 



The code below removes non-essential words and stores the most common remaining words as a list, using for loops in a function named make_Dictionary.

The functions used in the code are described below.

*   Counter() - A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
*   .most_common() - Returns a list of the n most common elements and their counts, from the most common to the least.
*   os.listdir() - Returns a list of the items contained in the given path.
*   os.path.join() - Joins two or more path components. It is used to build the path to each email in the given Google Drive folder.
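As a quick illustration of the helpers listed above, here is a minimal sketch on made-up data. The token string and the file name `example.txt` are hypothetical and are not part of the email dataset.

```python
import os
from collections import Counter

# Hypothetical tokens standing in for the words of a few emails
tokens = "free offer free money call now free call".split()

# Count how often each token occurs
counts = Counter(tokens)

# The two most frequent tokens, most common first
top_two = counts.most_common(2)
print(top_two)  # [('free', 3), ('call', 2)]

# Build a path from a folder name and a (hypothetical) file name
example_path = os.path.join("train-mails", "example.txt")
print(example_path)
```

`os.listdir()` is left out because it needs a real folder on disk; in the notebook it supplies the file names that `os.path.join()` turns into full paths.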





In [0]:
#This function builds a dictionary of the 3000 most common words across all the email content.
#First it counts every word and symbol in the emails.
#Then it removes all non-alphabetic tokens and any single-character tokens.
#Finally it keeps only the 3000 most common words.
#It returns them as a list of (word, count) pairs.

#create a function named 'make_Dictionary' to store the 3000 most common words from the emails

def make_Dictionary(root_dir):
  #instantiate an empty list
  word_list = [] 
  #full paths of the emails contained in the folder at the given path
  emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
  for mail in emails:
    #open the mail files one by one
    with open(mail) as m: 
      #iterate through the lines of the opened mail
      for line in m: 
        words = line.split() #split the line into words
        word_list += words #add them to the list
  word_dictionary = Counter(word_list) #count how often each word occurs
  word_to_remove = list(word_dictionary) #copy of the keys, so entries can be deleted while looping

  for item in word_to_remove:
    if not item.isalpha(): #delete tokens that are not purely alphabetic (numbers and symbols)
      del word_dictionary[item] 
    elif len(item) == 1: #delete one-letter words 
      del word_dictionary[item]
  #most_common returns a list of the n most common words and their counts, from the most common to the least
  word_dictionary = word_dictionary.most_common(3000)
  return word_dictionary
  
#The returned list can be inspected with print; it contains the words and their frequencies in decreasing order
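The cleaning step can also be tried on a handful of tokens without reading any files. The sketch below is an illustrative variant, not the notebook's function: it filters with a dict comprehension instead of deleting keys in a loop, and the token list is made up.

```python
from collections import Counter

# Hypothetical raw tokens, including a number, a symbol-laden word and a single letter
raw_tokens = ["free", "a", "2020", "offer!", "free", "money"]
raw_counts = Counter(raw_tokens)

# Keep only purely alphabetic tokens that are longer than one character
cleaned = Counter({
    word: n for word, n in raw_counts.items()
    if word.isalpha() and len(word) > 1
})

# Same final step as in make_Dictionary: a list of (word, count) pairs
toy_dictionary = cleaned.most_common(3000)
print(toy_dictionary)  # [('free', 2), ('money', 1)]
```

Note that `most_common(3000)` simply returns every entry when fewer than 3000 words remain.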

# Step 2 - Extracting Features and the Corresponding Label Matrix

With the help of the dictionary we create a word-frequency (feature) matrix and a label vector.

The Python code below generates a feature matrix whose rows correspond to the 702 files of the training set and whose columns correspond to the 3000 words of the dictionary.
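To make the layout of the feature matrix concrete, here is a minimal sketch on made-up data. The three-word dictionary and the two mail bodies are hypothetical, and the sketch looks words up through a word-to-column dict, which is an assumption of this example rather than the approach used in the notebook's function.

```python
import numpy as np

# Hypothetical miniature dictionary in the (word, count) format returned by most_common()
toy_dictionary = [("free", 3), ("call", 2), ("money", 1)]

# Map each dictionary word to its column index for constant-time lookups
word_index = {word: col for col, (word, _) in enumerate(toy_dictionary)}

# Hypothetical mail bodies, one string per mail
toy_mails = ["free money free", "call me now"]

# One row per mail, one column per dictionary word
toy_matrix = np.zeros((len(toy_mails), len(toy_dictionary)))
for row, body in enumerate(toy_mails):
    for word in body.split():
        col = word_index.get(word)
        if col is not None:  # ignore words that are not in the dictionary
            toy_matrix[row, col] += 1

print(toy_matrix)
```

Each cell holds how many times the dictionary word of that column occurs in the mail of that row, so the first mail yields the counts 2, 0, 1 and the second 0, 1, 0.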

In [0]:
#This function extracts the feature columns and populates their values.
#The feature matrix has 3000 columns, and the number of rows equals the number of email files.
#The function also inspects the file name of each email and decides whether it is spam based on the naming convention.
#Based on this, the function also creates the label column.

In [0]:
#This function is used for both the training and the testing dataset; it returns the feature matrix and the label column

def extract_features(mail_dir):
  mailfolder = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
  features_matrix = np.zeros((len(mailfolder),3000))
  train_labels = np.zeros(len(mailfolder))
  count = 1 #counter for spam files
  mailID = 0 #row index of the current mail
  for mail in mailfolder: #iterates through each mail file
    with open(mail) as fi: #open the mail file
      for i, line in enumerate(fi): #iterate through each line of the email
        if i == 2: #only the 3rd line is used, because it holds the body text
          words = line.split() #split the line into words
          for word in words: #iterates through each word
            #search the dictionary of the 3000 most common words (relies on the global word_dictionary)
            for wordID, d in enumerate(word_dictionary):
              if d[0] == word: #if the word is in the dictionary, record its count
                #mailID -> row (one per mail); wordID -> column (one per dictionary word)
                features_matrix[mailID,wordID] = words.count(word)

      #label the mail as spam if its file name starts with "spmsg"
      train_labels[mailID] = 0
      lastToken = os.path.basename(mail) #file name without the folder path
      if lastToken.startswith("spmsg"):
        train_labels[mailID] = 1 #mark the mail as spam
        count = count + 1 #count of spam files
      mailID = mailID + 1
  return features_matrix, train_labels #feature matrix (one row per mail, 3000 columns) and its labels
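The labelling rule in the function above depends only on the file name. A minimal sketch of that rule in isolation follows; the helper `label_from_filename` and both example paths are hypothetical, chosen only to show one name that starts with `spmsg` and one that does not.

```python
import os

def label_from_filename(path):
    """Return 1 (spam) if the file name starts with 'spmsg', else 0."""
    return 1 if os.path.basename(path).startswith("spmsg") else 0

# Hypothetical paths for illustration only
example_paths = ["train-mails/spmsga1.txt", "train-mails/3-1msg1.txt"]
example_labels = [label_from_filename(p) for p in example_paths]
print(example_labels)  # [1, 0]
```

Using `os.path.basename` keeps the check on the file name itself, so a folder whose name happens to contain `spmsg` does not affect the label.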




# Training and predicting with sklearn Naive Bayes

The dataset used here is split into a training set of 702 mails and a test set of 260 mails.
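For intuition about what a Gaussian Naive Bayes classifier computes, here is a from-scratch sketch on a tiny made-up dataset. It is a simplified illustration and not scikit-learn's implementation: the function names are hypothetical, and the fixed constant `eps` is an assumption of this sketch, whereas `GaussianNB` applies its own variance smoothing (`var_smoothing`).

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the prior, per-feature mean and per-feature variance of each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (
            len(Xc) / len(X),      # prior P(class)
            Xc.mean(axis=0),       # mean of each feature within the class
            Xc.var(axis=0) + eps,  # variance of each feature; eps avoids division by zero
        )
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log prior plus Gaussian log-likelihood."""
    classes = sorted(params)
    scores = []
    for c in classes:
        prior, mean, var = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(np.array(scores), axis=0)]

# Hypothetical two-feature training data: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

toy_params = fit_gaussian_nb(X_train, y_train)
toy_pred = predict_gaussian_nb(toy_params, np.array([[0.5, 0.5], [5.5, 5.5]]))
print(toy_pred)
```

Each feature is treated as independent given the class, which is the "naive" assumption; the per-feature log-likelihoods are therefore simply summed.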



In [0]:
#This section is the main program; it calls the two functions defined above.
#First it trains the model with model.fit on the training dataset.
#Then it scores the test dataset by running the trained model on it.
#Finally it prints the model performance as an accuracy score.




In [19]:
#Give the data path train-mails and test-mails and store in Training and Testing variables respectively
TRAIN_DIR = '/content/drive/My Drive/MSBA_Colab_2020/ML Algorithms/CA02/Naive_Bayes/train-mails'
TEST_DIR = '/content/drive/My Drive/MSBA_Colab_2020/ML Algorithms/CA02/Naive_Bayes/test-mails'

# Create a dictionary of words with their frequencies
# (named word_dictionary because extract_features reads this global variable)
word_dictionary = make_Dictionary(TRAIN_DIR)

# Prepare feature vectors per training mail and its labels
print ("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = extract_features(TRAIN_DIR) #feature matrix for training data
test_features_matrix, test_labels = extract_features(TEST_DIR) # feature matrix testing data

#using Gaussian Naive Bayes Algorithm
model = GaussianNB()

#Training Naive bayes classifier
print ("Training Model using Gaussian Naive Bayes algorithm .....")
model.fit(features_matrix, labels)
print ("Training completed")

#Test the unseen mails for Spam
print ("testing trained model to predict Test Data labels")
predicted_labels = model.predict(test_features_matrix)
print ("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:")

#print the accuracy of the model
# The accuracy score is the fraction of correct predictions
print (accuracy_score(test_labels, predicted_labels))






reading and processing emails from TRAIN and TEST folders
Training Model using Gaussian Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9615384615384616
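The accuracy score is the fraction of test labels that the model predicted correctly. Assuming all 260 test mails were scored, the printed value corresponds to 250 correct predictions and 10 errors. The sketch below checks that arithmetic and shows the same calculation on made-up labels; the `accuracy` helper is a hypothetical stand-in for `accuracy_score`.

```python
def accuracy(true_labels, predicted):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted))
    return matches / len(true_labels)

# Value printed by the notebook and the size of the test set
reported = 0.9615384615384616
n_test = 260

# Number of correct predictions implied by the reported score
n_correct = round(reported * n_test)
print(n_correct, n_test - n_correct)  # 250 10

# Hypothetical labels: three of four predictions match
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
print(accuracy(toy_true, toy_pred))  # 0.75
```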


References

[SPAM Email filtering](https://www.kdnuggets.com/2017/03/email-spam-filtering-an-implementation-with-python-and-scikit-learn.html)

[OS module in Python](https://pythonprogramming.net/python-3-os-module/)

[Enumerate Function](https://www.geeksforgeeks.org/enumerate-in-python/)