The features that will be used are the words that can be found in the each document in the entire corpus. In order to extract all the features from the corpus, we first need to extract all the words in the corpus, the words will serve as a feature

The following are the libraries that will be used in this tutorial

In [1]:
#imports here
import re
import os
import json
import pandas
import numpy as np
from pprint import pprint

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import csv 

from numpy import genfromtxt


Create a class that will hold the Data about each Corpus. 
In the paper, they used 4 different corpus

1. Bare : Original Email Data
2. Lemm : Words in the emails are lemmatized
3. Stop : Stops words are removed from the emails
4. Lemm_stop : Words are lemmatized and stop words are removed from the emails

In [2]:
# create a class for the the Corpus data, it will store the total number of emails in the corpus, 
# along with the total number of spam and legit emails
class CorpusData: 
    corpusName = ""
    totalEmailCtr = 0
    spamEmailCtr = 0
    legitEmailCtr = 0

    def __init__(self, corpusName, totalEmailCtr, spamEmailCtr, legitEmailCtr):
        self.corpusName = corpusName
        self.totalEmailCtr = totalEmailCtr
        self.spamEmailCtr = spamEmailCtr
        self.legitEmailCtr = legitEmailCtr

The function `extractWords`, extracts the words in a textfle given its filepath as its parameter

`words =  [i for i in words if len(i) >= 4]` filters out the words that are less than 4 letters

`words = list(set(words))` Creates a set from the list of words extracted, and duplicate words are removed


The function `writeToFile`, write to text file given the filepath, and the data to write as its parameters

In [3]:
def extractWords(filepath):
    file = open(filepath, 'r')
    # .lower() returns a version with all upper case characters replaced with lower case characters.
    text = file.read().lower()
    file.close()
    # replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
    text = re.sub('[^a-z]+', " ", text)
    words = list(text.split())
    
     # remove duplicate words in the list
    words = list(set(words))
    # removes words that are less than 4 letters/characters
    words =  [i for i in words if len(i) >= 4] 
    return words;

# this functions write the list of tokens in the tokenWords parameter on the filepath passed on filename parameter 
def writeToFile(filename, tokenWords):
    f = open(filename, 'w')
    json.dump(tokenWords, f)
    f.close()

The code below traverses all of the documents that is under a given directory, we have 4 corpus, so we need to do this 4 times, one for each corpus. 

In [4]:
def preprocessCorpus(corp):    
    # this will hold all the words in the corpus
    wordList = []
    #this regular expression will match all text files with this pattern (Files that has this pattern are the Legitimate Emails)
    pattern = re.compile("\d+-\d+msg\d+.txt")

    legitEmailCTR = 0
    spamCTR = 0
    counter = 0  
    spamEmails = []
    legitEmails = []
    rootdir = "Emails/"
    rootdir = rootdir + corp

    #for each subdirectory in a corpus (folders - part 1 - 10)
    for subdir, dirs, files in os.walk(rootdir):
        #for each file in a folder
        for file in files:   
            tempList = []
            filepath =  subdir + os.sep + file
            #words are extracted from a file
            tempList = extractWords(filepath)
            #extracted words are added to the word list
            wordList.extend(tempList)

            #increment total email counter for this corpus
            counter = counter +1
            # create a string of all the words in a document (instead of a list)
            joinedStr = ' '.join(tempList)
            # if file is a legitimate email
            if pattern.match(file):    
                #add string/file to legitimate emails list
                legitEmails.append(joinedStr)
                #increment legitimate email counter
                legitEmailCTR =  legitEmailCTR + 1
            else: 
                #add string/file to spam emails list
                spamEmails.append(joinedStr)
                #increment spam email counter
                spamCTR = spamCTR + 1
    #Update word list to remove all duplicates words in the corpus (cause, a word might occur in multiple documents/emails)     
    wordList = list(set(wordList))     

    #tokens/words extracted from the emails will be stored in a text file so it can be used for further analyzation
    pd1 = pandas.DataFrame(wordList)
    pd1.to_csv("Features/"+corp+"/"+corp+"_words.csv",  header=False,  index=False)
    pd2 = pandas.DataFrame(legitEmails)
    pd2.to_csv("Features/"+corp+"/"+corp+"_legitEmails.csv",  header=False,  index=False)
    pd3 = pandas.DataFrame(spamEmails)
    pd3.to_csv("Features/"+corp+"/"+corp+"_spamEmails.csv",  header=False,  index=False)

    #create a class that will hold all relevant information about the Corpus    
    x = CorpusData(corp, counter, spamCTR, legitEmailCTR)

    return x

Create a 2D matrix of term occurrence, x = terms, y = documents, 1 = term occurred in document then 0 otherwise, 

We will now populate the term document matrix, we can have two approaches in doing this. 

1. Build our own term document matrix populator by traversing each document in a corpus, and compare each word in a document to the term list, to get its index and mark it as 1 (term occurred in document)

The first approach will have the following steps:

1. iterate over all the documents/emails in the corpus
2. Each document/email extract the words
3. Compare each word in the document to the term list
4. tdm[term][document] = 1

**MAJOR DOWNSIDE**: this will take so much time to finish executing

In [47]:
def approach1():
    #create document list
    documentList = corpusDataList[0].legitEmails

    #initialize document counter to 0 (first document)
    docCtr = 0
    print ("Starting")
    #for each document
    for doc in documentList:   
    #     print ("---DOC---",doc)
        #for each word in a document
        for word in doc:
            # if word is in the wordlist
    #         print ("---WORD---",word)
            if  word in wordList: 
                #get index of word in the word list
                b=wordList.index(word)
                print ("index:", b)
                #update tdm to 1 (1: word occured in document)
                tdm[b][docCtr] = 1
        #update document counter + 1, for the next document
        docCtr = docCtr + 1

    print ("done")
    print (tdm)

2: or we can use a sklearn function called count vectorizer, example of what `CountVectorizer` will do

given a list of Vocabulary (in our case, Term/Word List) it will create a term document matrix (array) with all of the terms mapped on the documents 

A sample of this library is shown in the next cell. 
1. *vocab*:  is the list of all the terms that will be used to tag the documents
2. *doc*: The list of documents

In [6]:
vocab = ['hot', 'cold', 'old']
#load all the term list here
cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=vocab)

doc = ['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']
#load all the documents here
cv.fit_transform(doc).toarray()

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]])

Applying that to our data

In [5]:
def approach2(corp):    
    #load the documents here
    rootDIR = "Features/"
    corpus = corp+"/"+corp
    spamMail = "_spamEmails.csv"
    legitMail = "_legitEmails.csv"
    

    #create a document list
    spamDocumentList = []
    legitDocumentList = []

    file_name_spam = rootDIR+corpus+spamMail
    file_name_legit = rootDIR+corpus+legitMail

    with open(file_name_spam, 'r') as f:  
        reader = csv.reader(f)   
        #for each row in the file, append it the document list
        for row in reader:
            for doc in row:
                spamDocumentList.append(doc)

    print ("The File that we are using for the documents is:", file_name_spam)
    print ("There are: ", len(spamDocumentList), "Spam Documents")

    with open(file_name_legit, 'r') as f:  
        reader = csv.reader(f)   
        #for each row in the file, append it the document list
        for row in reader:
            for doc in row:
                legitDocumentList.append(doc)

    print ("The File that we are using for the documents is:", file_name_legit)
    print ("There are: ", len(legitDocumentList), "Legitimate Documents")
    print ("done")

After we have, the terms and the documents loaded from the file. We can now start creating the term document matrix, and the term document matrix to file


term document matrix (tdm)
1. rows : documents
2. columns : terms/words

In [6]:
def createTDM(corp):   
    
    #load the vocabulary /word list/ term list from file
    wordList = open("Features/"+corp+"/"+corp+"_words.csv").read().splitlines()
    print ("There are: ", len(wordList), " Words")
    
    #use term list here
    cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=wordList)
    
    spamDocumentList = open("Features/"+corp+"/"+corp+"_spamEmails.csv").read().splitlines()
    legitDocumentList = open("Features/"+corp+"/"+corp+"_legitEmails.csv").read().splitlines()
    
    # -------------------- SPAM -----------------------
    #use documents list here
    tdmSpam = cv.fit_transform(spamDocumentList).toarray()
    print ("There are: ", tdmSpam.size, " cells in the term document matrix")

    #create a numpy array
    aSpam = np.asarray(tdmSpam)
    #save the Term Document Matrix to CSV (so we can access it later)
    np.savetxt("Features/"+corp+"/spamTDM.csv", aSpam, delimiter=",")


    # -------------------- LEGITIMATE -----------------------
    #use documents list here
    tdmLegit = cv.fit_transform(legitDocumentList).toarray()
    print ("There are: ", tdmLegit.size, " cells in the term document matrix")

    #create a numpy array
    aLegit= np.asarray(tdmLegit)
    #save the Term Document Matrix to CSV (so we can access it later)
    np.savetxt("Features/"+corp+"/legitTDM.csv", aLegit, delimiter=",")
    print("done")

Once we have the term document matrix, we can now start computing for the Mutual Information of each term in the corpus

\begin{align}
MI(X;C) =\sum_{x \epsilon {0,1}, c \epsilon {spam,legitimate}}^{ } P(X=x, C=c) \cdot log \frac{P(X=x, C=c)}{P(X=x)\cdot P(C=c)}
\end{align}


*P(X=x, C=c)* : Total number of documents the word `x` occurred in documents that are class `c`

*P(X=x)* : Total number of documents the word `x` occurred in the corpus

*P(C=c)* : Total number of documents the are class  `c` in the corpus

In [7]:
def processSpamTDM(corp): 
    #load the spamTDM from file
    spamTDM = genfromtxt("Features/"+corp+"/spamTDM.csv", delimiter=',')
    #prints the size of the matrix, (number of documents, number of terms)
    print("Spam TDM : ", spamTDM.shape)

    spamTermOccurCount = (spamTDM != 0).sum(0)
    print ("done")

    #create a numpy array
    spamTermCount = np.asarray(spamTermOccurCount)
    #save the Term Document Matrix to CSV (so we can access it later)
    np.savetxt("Features/"+corp+"/spamTermCount.csv", spamTermCount, delimiter=",")
    print("File saved")

In [8]:
def processLegitTDM(corp):
    #load the legitTDM from file
    legitTDM = genfromtxt("Features/"+corp+"/legitTDM.csv", delimiter=',')
    #prints the size of the matrix, (number of documents, number of terms)
    print("Legit TDM : ", legitTDM.shape)
    legitTermOccurCount = (legitTDM != 0).sum(0)
    print ("done")


    #create a numpy array
    legitTermCount = np.asarray(legitTermOccurCount)
    #save the Term Document Matrix to CSV (so we can access it later)
    np.savetxt("Features/"+corp+"/legitTermCount.csv", legitTermCount, delimiter=",")
    print("File Saved")

After all that pre-processing, now we can *finally* compute the mutual information score of the term on the entire corpus

In [9]:
import math

#A = P(X=x|C=c)
#P(X=x|C=c) = (Total number of times x occurred in documents that are c)/(Total number of documents that are c)
#B = P(X=x) = (Total number of times x occurred in the corpus)/(Total number of word occurrence in the corpus)
#C = P(C=c) = (Total number of documents that are c in the corpus)/(Total number of documents in the corpus)
      
    
def computeMI(termSpamCount, termLegitCount, totalWordOccurrence, totalLegitCount, totalSpamCount, totalDoc):
    totalTermCount = termSpamCount + termLegitCount
    
    ASpam = (termSpamCount/totalSpamCount)
    ALegit = (termLegitCount/totalLegitCount)
    B = totalTermCount/totalWordOccurrence
    CSpam = totalSpamCount/totalDoc
    CLegit = totalLegitCount/totalDoc
    
    try:
        try:
            insideLog = (ASpam / (B*CSpam))
        except ZeroDivisionError:
            insideLog = 0
        classSpam = ASpam * math.log10(insideLog)
    except ValueError:
        classSpam = 0
    
    try:
        try:
            insideLog = (ALegit / (B*CLegit))
        except ZeroDivisionError:
            insideLog = 0
        classLegit = ALegit * math.log10(insideLog) 
    except ValueError:
        classLegit = 0
        
    return (classSpam + classLegit) 
    

def createMI(corp, totalLegitCount, totalSpamCount):
    totalDoc  = totalLegitCount + totalSpamCount
    
    legitTermCountArr = genfromtxt('Features/'+corp+'/legitTermCount.csv', delimiter=',')
    spamTermCountArr = genfromtxt('Features/'+corp+'/spamTermCount.csv', delimiter=',')
    mi = []

    vfunc = np.vectorize(computeMI)
    totalWordOccurrence = np.sum(spamTermCountArr) + np.sum(legitTermCountArr) 
    mi = vfunc(spamTermCountArr, legitTermCountArr, totalWordOccurrence, totalLegitCount, totalSpamCount, totalDoc)

    wordList = open("Features/"+corp+"/"+corp+"_words.csv").read().splitlines()
    print (corp, len(wordList), len(mi))
    
    termMI_List = pandas.DataFrame(
        {'Term': wordList,
         'MI': mi
        })

    pprint (termMI_List[:5])
    sortedList =  termMI_List.sort_values('MI',ascending = False)
    pprint (sortedList[:5])

    #save the Term MI to CSV (so we can access it later)
    sortedList.to_csv("Features/"+corp+"/"+corp+"termMI.csv")
    print("File Saved")

After that is all set, we can now start generating the data that will be used

In [10]:
# corpus list 
corpus = ["bare", "lemm","lemm_stop", "stop"]

#create a list that will hold CorpusData Object
corpusDataList = []

#you can execute the steps per corpus continuously, but since i have limited resources, I'll do it one at a time. 
#step1
for corp in corpus:
    corpusDataList.append(preprocessCorpus(corp))
    print("Done: preprocessCorpus ", corp)
print ("done")

Done: preprocessCorpus  bare
Done: preprocessCorpus  lemm
Done: preprocessCorpus  lemm_stop
Done: preprocessCorpus  stop
done


In [11]:
# corpus list 
corpus = ["bare", "lemm","lemm_stop", "stop"]

#step2
for corp in corpus:
    approach2(corp)
    print("Done: approach2 ", corp)
print ("Done")

The File that we are using for the documents is: Features/bare/bare_spamEmails.csv
There are:  304 Spam Documents
The File that we are using for the documents is: Features/bare/bare_legitEmails.csv
There are:  2211 Legitimate Documents
done
Done: approach2  bare
The File that we are using for the documents is: Features/lemm/lemm_spamEmails.csv
There are:  452 Spam Documents
The File that we are using for the documents is: Features/lemm/lemm_legitEmails.csv
There are:  2324 Legitimate Documents
done
Done: approach2  lemm
The File that we are using for the documents is: Features/lemm_stop/lemm_stop_spamEmails.csv
There are:  281 Spam Documents
The File that we are using for the documents is: Features/lemm_stop/lemm_stop_legitEmails.csv
There are:  2409 Legitimate Documents
done
Done: approach2  lemm_stop
The File that we are using for the documents is: Features/stop/stop_spamEmails.csv
There are:  481 Spam Documents
The File that we are using for the documents is: Features/stop/stop_legi

In [12]:
# corpus list 
corpus = ["bare", "lemm","lemm_stop", "stop"]

#step2
for corp in corpus:
    createTDM(corp)
    print("Done: createTDM ",corp)
print("Done")

There are:  47661  Words
There are:  14488944  cells in the term document matrix
There are:  105378471  cells in the term document matrix
done
Done: createTDM  bare
There are:  44842  Words
There are:  20268584  cells in the term document matrix
There are:  104212808  cells in the term document matrix
done
Done: createTDM  lemm
There are:  44228  Words
There are:  12428068  cells in the term document matrix
There are:  106545252  cells in the term document matrix
done
Done: createTDM  lemm_stop
There are:  44749  Words
There are:  21524269  cells in the term document matrix
There are:  83233140  cells in the term document matrix
done
Done: createTDM  stop
Done


In [13]:
# corpus list 
corpus = ["bare", "lemm","lemm_stop", "stop"]

for corp in corpus:    
    processSpamTDM(corp)
    print("Done: processSpamTDM ",corp)
    processLegitTDM(corp)
    print("Done: processLegitTDM ",corp)

Spam TDM :  (304, 47661)
done
File saved
Done: processSpamTDM  bare
Legit TDM :  (2211, 47661)
done
File Saved
Done: processLegitTDM  bare
Spam TDM :  (452, 44842)
done
File saved
Done: processSpamTDM  lemm
Legit TDM :  (2324, 44842)
done
File Saved
Done: processLegitTDM  lemm
Spam TDM :  (281, 44228)
done
File saved
Done: processSpamTDM  lemm_stop
Legit TDM :  (2409, 44228)
done
File Saved
Done: processLegitTDM  lemm_stop
Spam TDM :  (481, 44749)
done
File saved
Done: processSpamTDM  stop
Legit TDM :  (1860, 44749)
done
File Saved
Done: processLegitTDM  stop


i will populate corpusDataList MANUALLY for now, because i already know how many are the legit emails and spam emails are per corpus but in reality you could:
1. Execute preprocessing (but why)
2. Create a new function that counts how many legit and spam emails are in the corpus

In [14]:
#create a list that will hold CorpusData Object
corpusDataList = []

bare = CorpusData("bare", 2515, 304, 452)
lemm = CorpusData("lemm", 2776, 452, 2324)
lemm_stop = CorpusData("lemm_stop", 2609, 281, 2409)
stop = CorpusData("stop", 2341, 481, 1860)

corpusDataList.append(bare)
corpusDataList.append(lemm)
corpusDataList.append(lemm_stop)
corpusDataList.append(stop)

for corp in corpusDataList:  
    createMI(corp.corpusName, corp.legitEmailCtr, corp.spamEmailCtr)
    print("Done: createMI ",corp.corpusName)

bare 47661 47661
         MI          Term
0  0.014192    succession
1  0.290927         india
2  0.023367         loser
3  0.007096  proscriptive
4  0.007096    bonvillain
              MI     Term
5354   18.049254  subject
40233  11.429276     this
21546  11.349580     with
8553   10.438905     from
8126   10.261849     that
File Saved
Done: createMI  bare
lemm 44842 44842
         MI          Term
0  0.002032    succession
1  0.001016         india
2  0.025039         loser
3  0.001016  proscriptive
4  0.001016         celao
             MI     Term
4967   5.279136  subject
37863  3.734168     this
4194   3.511598     have
20222  3.460249     with
7943   3.244249     from
File Saved
Done: createMI  lemm
lemm_stop 44228 44228
         MI          Term
0  0.001883    succession
1  0.000941         india
2  0.043492         loser
3  0.000941  proscriptive
4  0.000941         celao
             MI         Term
4896   5.372569      subject
24169  2.778434         mail
14508  2.502427    

Check the features folder if you want to see the generated files, this is the end of the feature extraction tutorial. 