In [35]:
# Import packages
import os   # for reading files
import numpy as np    
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [36]:
# Quickly check what a text file looks like
with open(os.path.join("./test-mails", "8-899msg1.txt"), 'r') as f:    # os.path.join uses the separator of the running OS
    print(f.read())

Subject: book : phonetic / speech production

shigeru kiritanus , hajime hirose hiroya fujisakus ( editor ) speech production language honor osamu fujimura 1997 . 23 x 15 , 5 cm . x , 302 page . cloth dm 188 , - / approx . us $ 134 . 0 isbn 3-11 - 6847 - 0 speech research 13 mouton de gruyter * berlin * york osamu fujimura renown interest competence wide variety subject rang physics , physiology phonetics linguistics artificial intelligence . through fusion discipline show us human speech language relate physical physiological process phonetics abstract , higher-level linguistic structure . reflect osama fujimura 's long-stand interest , chapter volume provide wide perspective various aspect speech production ( physical , physiological , syntactic , information theoretic ) relationship structure speech language . content 1 background * manfr r . schroeder , speech : physicist remember * 2 larygeal function speech * minoru hirano , kiminorus sato keiichiro yukizane , male - female diffe

This function builds a dictionary of the 3000 most common words across all the emails. It first counts every token (words and symbols) found on the third line of each file, which holds the message body. It then removes every token that is not purely alphabetic, as well as every single-character token. Finally, it keeps only the 3000 most frequent words and returns them as a list of (word, count) tuples.


In [37]:
def make_Dictionary(root_dir):
    all_words = []    # collect every word from every email
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]    # list of email file paths
    for mail in emails:    # loop through all emails in the list
        with open(mail) as m:    # open the email
            for i, line in enumerate(m):    # loop through each line
                if i == 2:    # the 3rd line holds the message body
                    words = line.split()    # split the body into words
                    all_words += words    # append the words to the list
    dictionary = Counter(all_words)    # Counter maps each word to the number of times it appears

    list_to_remove = list(dictionary)    # snapshot of the keys, so entries can be deleted while looping

    for item in list_to_remove:
        if not item.isalpha():    # .isalpha() is True only if all characters are letters
            del dictionary[item]    # remove tokens that contain non-letter characters
        elif len(item) == 1:    # check for one-letter words
            del dictionary[item]    # remove all one-letter words
    dictionary = dictionary.most_common(3000)    # .most_common() is a method of the Counter dict subclass
    return dictionary    # a list of (word, count) tuples
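
To make the return value concrete, here is a minimal sketch of the same counting and filtering steps on a small in-memory list. The tokens list is made up for illustration and does not come from the dataset; the filter mirrors the two conditions checked in make_Dictionary.

```python
from collections import Counter

# Hypothetical toy tokens standing in for the words read from the emails.
tokens = ["order", "order", "mail", "a", "1997", "mail", "order", "$", "higher-level"]

counts = Counter(tokens)

# Keep only purely alphabetic tokens longer than one character,
# the same two conditions make_Dictionary checks.
filtered = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

top_words = filtered.most_common(2)    # list of (word, count) tuples, most frequent first
print(top_words)    # [('order', 3), ('mail', 2)]
```

Because the result is a list of tuples rather than a dict, a word's position in the list can serve as its column number in the feature matrix.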

This function builds the feature matrix: one row per email file and 3000 columns, one per dictionary word, each holding the number of times that word occurs in the email. It also inspects each file name and labels the email as spam (1) or not spam (0) based on the naming convention: spam file names start with spmsg. The same function is used for both the training and the testing datasets, and it returns the feature matrix together with the label vector. Note that it reads the global variable dictionary, so make_Dictionary must be run first.
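
The naming convention can be checked on its own. The sketch below assumes, as the code in this notebook does, that spam file names start with spmsg. The helper name is_spam and the file name spmsg-example.txt are hypothetical.

```python
import os

SPAM_PREFIX = "spmsg"    # prefix used by spam file names in this dataset

def is_spam(path):
    """Hypothetical helper: label an email as spam from its file name alone."""
    file_name = os.path.basename(path)    # drop the directory part of the path
    return file_name.startswith(SPAM_PREFIX)

# The first file name appears earlier in this notebook; the second is made up.
ham_path = os.path.join("test-mails", "8-899msg1.txt")
spam_path = os.path.join("test-mails", "spmsg-example.txt")

print(is_spam(ham_path), is_spam(spam_path))    # False True
```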

In [38]:
spam = 'spmsg'
def extract_features(mail_dir):
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]    # get file path
    features_matrix = np.zeros((len(files), 3000))    # creating an empty matrix
    train_labels = np.zeros(len(files))
    count = 1    # incremented for each spam message; not part of the return value
    docID = 0    # row index of the current email
    for fil in files:    # for path in file_paths, looping through each files from folder
        with open(fil) as fi:
            for i, line in enumerate(fi):    # loop through each line
                if i == 2:    # start at the 3rd line which is text
                    words = line.split()    # split on white space
                    for word in words:    # loop through each word in a sentence
                        wordID = 0
                        for j, d in enumerate(dictionary):    # d is a (word, count) tuple; j avoids shadowing the line index i
                            if d[0] == word:    # d[0] is the word in the tuple, d[1] is the count of that word
                                wordID = j    # on a match, j is the column number in features_matrix
                                features_matrix[docID, wordID] = words.count(word)   # fill the matrix with the count of this word in the mail
            train_labels[docID] = 0    # assume every message is non-spam (label 0) until the file name says otherwise
# An Alternative
#             try :
#                 fil.index(spam)
#                 train_labels[docID] = 1
#             except ValueError:
#                 print("Not found!")       
            lastToken = os.path.basename(fil)    # file name without the directory; uses the path separator of the running OS
            if lastToken.startswith(spam):    # all spam messages start with 'spmsg'
                train_labels[docID] = 1    # spam messages are labeled as 1
                count = count + 1    # add 1 to count when there is a match
            docID = docID + 1    # move on to the next message and add 1 to docID
    return features_matrix, train_labels
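
The innermost loop above scans all 3000 dictionary entries for every word. A possible alternative, sketched here under the assumption that dictionary is the list of (word, count) tuples returned by make_Dictionary, is to map each word to its column number once and look it up directly. The helper count_row is hypothetical, and the three-word dictionary reuses the first three entries printed by this notebook as a stand-in for the full list.

```python
from collections import Counter

# Three-entry stand-in for the list returned by make_Dictionary.
dictionary = [("order", 1414), ("address", 1293), ("report", 1216)]

# Build the word -> column lookup once, instead of scanning the list per word.
word_index = {word: col for col, (word, _) in enumerate(dictionary)}

def count_row(words, word_index):
    """Hypothetical helper: one feature row of word counts for a single email."""
    row = [0] * len(word_index)
    for word, n in Counter(words).items():
        col = word_index.get(word)    # None when the word is not in the dictionary
        if col is not None:
            row[col] = n
    return row

row = count_row("order report order today".split(), word_index)
print(row)    # [2, 0, 1]
```

A row built this way could be assigned with features_matrix[docID] = row, since NumPy accepts a list of matching length.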

This section is the main program, which calls the two functions above. It first builds the dictionary and the feature matrices, then trains the model by calling model.fit on the training dataset. After that, it predicts labels for the test dataset with the trained model. Finally, it prints the model's performance as an accuracy score.

In [39]:
# set path for files
TRAIN_DIR = './train-mails'
TEST_DIR = './test-mails'

# create dictionary
dictionary = make_Dictionary(TRAIN_DIR)
print(dictionary[0:10]) # check dictionary

print('reading and processing emails from TRAIN and TEST folders...')
features_matrix, labels = extract_features(TRAIN_DIR)
test_features_matrix, test_labels = extract_features(TEST_DIR)

model = GaussianNB()

[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('our', 987), ('list', 935)]
reading and processing emails from TRAIN and TEST folders...


In [40]:
print('Training Model using Gaussian Naive Bayes algorithm......')
model.fit(features_matrix, labels)
print('Training completed')
print('testing trained model to predict Test Data labels')
predicted_labels = model.predict(test_features_matrix)
print('Completed classification of the Test Data ... now printing Accuracy Score by comparing predicted labels with the test labels:')
print(accuracy_score(test_labels, predicted_labels))

Training Model using Gaussian Naive Bayes algorithm......
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data ... now printing Accuracy Score by comparing predicted labels with the test labels:
0.9615384615384616
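
For reference, with its default settings accuracy_score returns the fraction of predictions that match the true labels. The sketch below reproduces that calculation in plain Python on made-up labels, which are not the labels from this dataset.

```python
# Hypothetical labels: 1 = spam, 0 = not spam.
true_labels = [0, 0, 1, 1, 0]
predicted = [0, 0, 1, 0, 0]

# Count the positions where the prediction matches the true label.
matches = sum(1 for t, p in zip(true_labels, predicted) if t == p)
accuracy = matches / len(true_labels)

print(accuracy)    # 0.8
```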
