# Topic Modelling of the Enron corpus, using LDA (Latent Dirichlet Allocation)

We will look at the "sent" directory of each of the 150 Enron employees in the corpus. We need to import the data and then clean it up. The posts [here](https://rforwork.info/2013/11/03/a-rather-nosy-topic-model-analysis-of-the-enron-email-corpus/) and [here](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) proved to be very useful. Also see http://www.colorado.edu/ics/sites/default/files/attached-files/01-11_0.pdf

In [1]:
# We use the following magic commands to time the cells in the notebook
%install_ext https://raw.github.com/cpcloud/ipython-autotime/master/autotime.py
%load_ext autotime

from os import listdir, chdir
import re

Installed autotime.py. To use it, type:
  %load_ext autotime




We are going to place all the emails of each user into one large list. In order to utilise the LDA algorithm we require there to be multiple documents. The obvious question that arises is whether to consider each email as a separate document, or to consider the collection of each user's emails as a separate document. For example:

Suppose person $A$ has emails $A_1$, $A_2$, $A_3$ and person $B$ has emails $B_1$ and $B_2$. Then we can create a list that is either L = [$A_1$, $A_2$, $A_3$, $B_1$, $B_2$] or L = [$A_1A_2A_3$, $B_1B_2$]. For now, all the emails are going to be treated as separate documents.
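
The two options can be sketched on toy data. The dictionary `emails_by_user` and its contents are made up for illustration and are not part of the Enron loading code.

```python
# Hypothetical toy data: user name -> list of that user's email bodies
emails_by_user = {
    "A": ["a1 text", "a2 text", "a3 text"],
    "B": ["b1 text", "b2 text"],
}

# Option 1: every email is its own document
docs_per_email = [email for emails in emails_by_user.values() for email in emails]

# Option 2: all of a user's emails are concatenated into one document
docs_per_user = [" ".join(emails) for emails in emails_by_user.values()]

print(len(docs_per_email))  # 5
print(len(docs_per_user))   # 2
```

With the first option LDA sees many short documents; with the second it sees one long document per user.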

Once the LDA algorithm has been implemented, we want to be able to list all the documents that fall under a given category.

We now set up the regular expressions to remove the 'clutter' from the emails.
(Note, they are purposefully long to avoid successive searches through large data)

In [2]:
# Defining regular expressions 

re1 = re.compile('(Message-ID(.*?\n)*X-FileName.*?\n)|'
                 '(To:(.*?\n)*?Subject.*?\n)|'
                 '(< (Message-ID(.*?\n)*.*?X-FileName.*?\n))')
re2 = re.compile('<|'
                 '>|'
                 '(---(.*?\n)?.*?---)|'
                 '(\*\*[.*?\s]\*\*)|'
                 '(.*?:(\s|(.*?\s)|))|'
                 '(\(\d+\))|'
                 '(\s.*?\..*?\s)|'
                 '(\s.*?\_.*?\s)|'
                 '(\s.*?\-.*?\s)|'
                 '(\s.*\/.*?\s)|'
                 '(\s.*@.*?\s)|'
                 '([\d\-\(\)\\\/\#\=]+(\s|\.))|'
                 '(\n.*?\s)|\d')
re3 = re.compile('\\\'')
re4 = re.compile('( . )|\s+')
#re5 = re.compile('( \S{1,3} )|( com )|( can )') # Some problem characters


time: 18.2 ms
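
To see what this kind of header stripping does, here is a minimal sketch on a made-up message. The pattern `header_re` is a simplified stand-in for `re1`, not the expression used above, and the sample email is invented.

```python
import re

# Simplified stand-in for re1: drop everything from 'Message-ID:' up to the end
# of the 'X-FileName:' line. re.DOTALL lets '.' match across line breaks.
header_re = re.compile(r'Message-ID:.*?X-FileName:.*?\n', re.DOTALL)

# Made-up email in the general shape of an Enron message
sample = (
    "Message-ID: <123.example>\n"
    "From: someone@enron.com\n"
    "Subject: Draft\n"
    "X-FileName: someone.nsf\n"
    "\n"
    "Please review the attached draft.\n"
)

body = header_re.sub(' ', sample).strip()
print(body)  # Please review the attached draft.
```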


We build the basic list of documents, filtering according to our regular expressions.

In [3]:
docs = []

chdir('/home/peter/Downloads/enron')
# For each user we extract all the emails in their 'sent' directory

names = [i for i in listdir()]
for name in names:
    sent = '/home/peter/Downloads/enron/' + str(name) + '/sent'   
    try: 
        chdir(sent)     
        for email in listdir():
            text = open(email,'r').read()
            # Regular expressions are used below to remove 'clutter'
            text = re.sub(re1,' ',text)
            text = re.sub(re2,' ',text)
            text = re.sub(re3,'',text)
            text = re.sub(re4,' ',text)
            #text = re.sub(re5,' ',text)
            docs.append(text)
            
    except Exception:  # e.g. this user has no 'sent' directory
        pass

time: 44.1 s


We can make use of either a) stemming or b) lemmatizing to find word roots. See [here](http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization) for a more detailed explanation of the two. Right below, the stemmer is implemented, while two cells below, the lemmatizer is implemented. Make sure to choose which one to use before proceeding to constructing the document-term matrix.

The stemmer generally cuts off the suffixes of words according to a fixed set of rules. Thus a word like 'facilitate' is shortened to a truncated stem such as 'facilit', which can be confusing and requires that the words be 're-built' before they are displayed. The lemmatizer also uses fixed rules for words of a certain form, but it has the advantage of comparing words to a dictionary.

In general, the lemmatizer is the preferred choice.
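
The difference between the two approaches can be illustrated with a toy sketch. Neither function below is the Porter algorithm or the WordNet lemmatizer; the suffix list and the small dictionary are hypothetical and chosen only to show rule-based stripping versus dictionary lookup.

```python
# Toy illustration only: not the Porter algorithm and not WordNet.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule set
LEMMA_DICT = {"ran": "run", "mice": "mouse", "studies": "study"}  # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ["studies", "mice", "trading"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies studi study
# mice mice mouse
# trading trad trading
```

The stem 'studi' is not a word, whereas the dictionary lookup returns 'study'; on the other hand the lookup leaves 'trading' alone because it is not in the toy dictionary.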

### Using the stemmer:

In [None]:
# To build the dictionary of word counts
from collections import defaultdict
d = defaultdict(int)

# We now employ the techniques outlined in the second link at the top
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# Create the English stop word set and the stemmer once, outside the loop
en_stop = set(get_stop_words('en'))
p_stemmer = PorterStemmer()

texts = []

for doc in docs:
    # Tokenization
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # Remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # Stem each token
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    texts.append(stemmed_tokens)

    # Count how often each stemmed word occurs across all documents
    for word in stemmed_tokens:
        d[word] += 1

### Using the lemmatizer (consider using this instead of the stemmer):

In [4]:
# To build the dictionary of word counts
from collections import defaultdict
d = defaultdict(int)

# We now employ the techniques outlined in the second link at the top
from stop_words import get_stop_words
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# Create the English stop word set and the lemmatizer once, outside the loop
en_stop = set(get_stop_words('en'))
wordnet_lemmatizer = WordNetLemmatizer()

texts = []

for doc in docs:
    # Tokenization
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # Remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # Lemmatize each token
    lemmatized_tokens = [wordnet_lemmatizer.lemmatize(i) for i in stopped_tokens]

    texts.append(lemmatized_tokens)

    # Count how often each lemmatized word occurs across all documents
    for word in lemmatized_tokens:
        d[word] += 1

time: 34.6 s


In [5]:
# Saving the tokenised texts and the word counts to JSON files
import json

chdir('/home/peter/Topic_Modelling/LDA/')
with open('texts_raw.jsn','w') as f:
    json.dump(texts,f)
    
with open('d.jsn','w') as f:
    json.dump(d,f)

time: 2.22 s


In [6]:
import json
chdir('/home/peter/Topic_Modelling/LDA/')
# Loading the tokenised texts
with open('texts_raw.jsn','r') as f:
    texts = json.load(f)
    
# Loading the Python dictionary of word counts (not to be confused with the gensim dictionary)
with open('d.jsn','r') as f:
    d = json.load(f)

time: 583 ms


We now want to remove the words that clutter our documents. We have a dictionary `d` that counts the number of times each word is present across all the approximately 57000 documents. Using it, we remove every word whose count is at least 20% of the number of documents, as well as every word that occurs 4 times or fewer.

To further enhance the quality of the text we analyse, the loop below also removes all words of length 1 or 2.

In [7]:
num_docs = len(texts)
temp_texts = texts
texts= []
upper_lim = int(0.20*num_docs)

for doc in temp_texts:
    temp_doc = []
    for word in doc:
        if 4 < d[word] < upper_lim and len(word) > 2:
            temp_doc.append(word)
    texts.append(temp_doc)

time: 1.36 s
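
Note that `d` holds total token counts, so a word repeated many times in one email counts many times. If a filter on the number of documents containing a word is wanted instead, a document-frequency count could be sketched as follows; `toy_texts` is made-up data, not the Enron corpus.

```python
from collections import Counter

# Hypothetical tokenised documents
toy_texts = [
    ["gas", "gas", "gas", "price"],
    ["price", "contract"],
    ["contract", "gas"],
]

# Total occurrences: every token counts (this is the kind of count d holds)
total_counts = Counter(word for doc in toy_texts for word in doc)

# Document frequency: each word counts at most once per document
doc_freq = Counter(word for doc in toy_texts for word in set(doc))

print(total_counts["gas"], doc_freq["gas"])  # 4 2
```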


In [None]:
import json
chdir('/home/peter/Topic_Modelling/LDA/')

# We save the new 'refined' (filtered) texts, not the unfiltered temp_texts
with open('texts.jsn','w') as f:
    json.dump(texts,f)

In [36]:
import json
chdir('/home/peter/Topic_Modelling/LDA/')

# Loading the texts file
with open('texts.jsn', 'r') as f:
    texts = json.load(f)

time: 414 ms


Below, we construct the document-term matrix, after which the fairly lengthy process of constructing the model takes place. Thus far the running time appears to grow linearly with the number of passes: a single pass takes just over a minute, whereas 5 passes take roughly 5.5 minutes.

The model was run for 350 passes and took 316 minutes to execute.
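
As a rough check on the running time, using only the figures quoted above, the cost per pass can be computed directly:

```python
# Timings quoted in the text: number of passes -> minutes
timings = {5: 5.5, 350: 316}

per_pass = {passes: round(minutes / passes, 2) for passes, minutes in timings.items()}
print(per_pass)  # {5: 1.1, 350: 0.9}
```

Both runs come out at about one minute per pass, which is consistent with roughly linear growth.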

In [37]:
# Constructing a document-term matrix

from gensim import corpora, models

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]


time: 6.35 s
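
The bag-of-words representation built here can be illustrated in plain Python. This is only a sketch of the idea: the helper `to_bow` and the toy data are hypothetical, and the integer ids that gensim assigns to words may differ from the ones below.

```python
from collections import Counter

toy_texts = [["gas", "price", "gas"], ["price", "contract"]]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in toy_texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return the document as a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_texts]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document is reduced to word counts, so word order is discarded before the model is trained.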


In [None]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=20, id2word = dictionary, passes=350)

We save the dictionary, the corpus and the LDA model so that the results can be reanalysed later. See the folder called LDAdata_results, which is the directory used in the cells below.

To load the files again:

`ldamodel = models.LdaModel.load('ldamodel')` and `dictionary = corpora.Dictionary.load('dictionary')`


In [None]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Saving the dictionary
dictionary.save('dictionary')

# Saving the corpus    
with open('corpus.jsn','w') as f:
    json.dump(corpus,f)    

# Saving the ldamodel
ldamodel.save('ldamodel')

In [None]:
chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Load dictionary
dictionary = corpora.Dictionary.load('dictionary')

In [38]:
chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Load ldamodel
ldamodel = models.LdaModel.load('ldamodel') 

time: 83.3 ms


In [None]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Load corpus
with open('corpus.jsn','r') as f:
    corpus = json.load(f)

We now print the top words for each of the topics. Note that, even though considerable emphasis has been placed on the construction of the regular expressions, some 'junk text' may still be present.

In [39]:
num_topics = 20
num_words = 10

List = ldamodel.print_topics(num_topics, num_words)
Topic_words =[]
for i in range(0,len(List)):
    word_list = re.sub(r'(.\....\*)|(\+ .\....\*)', '',List[i][1])
    temp = [word for word in word_list.split()]
    Topic_words.append(temp)
    print('Topic ' + str(i) + ': ' + '\n' + str(word_list))
    print('\n' + '-'*100 + '\n')

Topic 0: 
john ect future member broker brent click nymex board jason

----------------------------------------------------------------------------------------------------

Topic 1: 
contract party agreement language may transaction issue term credit payment

----------------------------------------------------------------------------------------------------

Topic 2: 
power california state energy market said utility price electricity cost

----------------------------------------------------------------------------------------------------

Topic 3: 
just get think going one dont day see good time

----------------------------------------------------------------------------------------------------

Topic 4: 
city new university houston school student producer san class administration

----------------------------------------------------------------------------------------------------

Topic 5: 
information need also project access employee process provide like issue

-----------------
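
The regular expression in the cell above relies on each weight having exactly the shape `0.012*`. A slightly more tolerant way to pull the words out of a topic string is sketched below. Both sample strings are made up; the quoted variant is an assumption about how newer gensim releases format `print_topics` output and should be checked against the installed version.

```python
import re

# Hypothetical topic strings, without and with quotes around the words
old_style = '0.012*john + 0.010*ect + 0.009*future'
new_style = '0.012*"john" + 0.010*"ect" + 0.009*"future"'

# Capture whatever follows each '*', ignoring optional surrounding quotes
word_re = re.compile(r'\*"?([^\s"+]+)"?')

print(word_re.findall(old_style))  # ['john', 'ect', 'future']
print(word_re.findall(new_style))  # ['john', 'ect', 'future']
```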

In [None]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Saving the list of words
with open('topic_words.jsn','w') as f:
    json.dump(Topic_words,f)

In [34]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

with open('topic_words.jsn','r') as f:
    Topic_words = json.load(f)

time: 8.44 ms


We will now proceed to visualise the data above by using the [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html) package.

In [40]:
import warnings
warnings.filterwarnings('ignore')

import pyLDAvis.gensim

lda_visualise = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(lda_visualise)

IndexError: index 19749 is out of bounds for axis 1 with size 19749

time: 27.1 s


We now consider, for a given document, the topics to which its words belong.

In [27]:
# Set topic colours (Assigned randomly)
import random

topic_colour_gen = []

for i in range(0,num_topics):
    r = lambda: random.randint(0,255)
    topic_colour_gen.append((i,'#%02X%02X%02X' % (r(),r(),r())))
    
topic_colours = dict(topic_colour_gen)

time: 11.5 ms


The function below runs through a document of the user's choice and matches topic words within the document, highlighting them. 

In [28]:
from collections import defaultdict
import re

doc = ''

def read_doc(doc):
    chdir('/home/peter/Topic_Modelling/LDA')
    doc = open(str(doc),'r').read()
    Topics = defaultdict(int)
    for word in doc.split():
        word_edit = word.lower()
        try:
            word_edit = tokenizer.tokenize(word_edit)[0]
        except:
            pass
        word_edit = wordnet_lemmatizer.lemmatize(word_edit)
        try:
            topic = ldamodel.get_term_topics(word_edit)[0][0] 
            Topics[topic] += 1
            doc = doc.replace(' ' + word + ' ', " <font color='" + str(topic_colours[topic]) + "'>" + word + "</font> ")
        except:
            pass
    doc = re.sub(r'\n','<br>',doc)
    
    Topic_info = []
    num_topics = 0
    for topic in Topics:
        num_topics += Topics[topic]
        Topic_info.append([topic,Topics[topic]]) # Append the topic and the number of words in the document from that topic
    for item in Topic_info:
        item.append(round(item[1]/num_topics*100))
    for item in Topic_info:
        print('Topic ' + str(item[0]) + ': ' + str(item[2]) + '% ' + str(Topic_words[item[0]] ))
    return doc


time: 48.1 ms
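
`read_doc` replaces words only when they are surrounded by spaces, so words next to punctuation or at the start of a line are not highlighted. A tokenising, single-pass alternative is sketched below. The lookup `demo_word_topic` and the colours in `demo_colours` are hypothetical stand-ins for the model's term-topic assignment and for `topic_colours`.

```python
import html
import re

# Hypothetical inputs: a word -> topic lookup and a colour per topic
demo_word_topic = {"contract": 1, "gas": 9}
demo_colours = {1: "#1F77B4", 9: "#D62728"}

def highlight(text):
    """Wrap every known topic word in a coloured <span>; escape everything else."""
    pieces = []
    # re.split with a capturing group keeps both the words and the separators
    for piece in re.split(r'(\w+)', text):
        topic = demo_word_topic.get(piece.lower())
        escaped = html.escape(piece)
        if topic is None:
            pieces.append(escaped)
        else:
            pieces.append('<span style="color:{}">{}</span>'.format(demo_colours[topic], escaped))
    return ''.join(pieces).replace('\n', '<br>')

print(highlight("The gas contract.\nThanks"))
```

Escaping the text before it is embedded in HTML also prevents stray `<` or `&` characters in an email from breaking the rendered output.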


In [30]:
# Example from http://jakevdp.github.io/blog/2013/06/01/ipython-notebook-javascript-python-communication/ adapted for IPython 2.0
# Add an input form similar to what we saw above

# Document: <input type="text" id="doc_input" size="5" height="2" value=""><br>
# <button onclick="exec_code()">Execute</button>

#Input the document we want to read
doc = '135.'

from IPython.display import HTML
from math import pi, sin

input_form = """
<div style="background-color:white; border:solid black; width:1100px; padding:20px;">
<p>"""+read_doc(doc)+"""</p>
</div>
"""

# javascript = """
# <script type="text/Javascript">
#     function exec_code(){
#         var var_name = document.getElementById('doc_input').value;
#         var command = "doc" + " = " + "read_doc" + "(" + "'" + var_name + "'" + ")";
#         console.log("Executing Command: " + command);      
#         var kernel = IPython.notebook.kernel;
#         text_to_print = kernel.execute(command);
#     }
# </script>
# """
 
HTML(input_form) # + javascript)

Topic 3: 2% ['just', 'get', 'think', 'going', 'one', 'dont', 'day', 'see', 'good', 'time']
Topic 1: 15% ['contract', 'party', 'agreement', 'language', 'may', 'transaction', 'issue', 'term', 'credit', 'payment']
Topic 18: 7% ['know', 'let', 'get', 'jeff', 'need', 'want', 'like', 'thanks', 'call', 'think']
Topic 19: 2% ['message', 'intended', 'information', 'email', 'communication', 'may', 'received', 'use', 'recipient', 'error']
Topic 5: 5% ['information', 'need', 'also', 'project', 'access', 'employee', 'process', 'provide', 'like', 'issue']
Topic 6: 41% ['agreement', 'attached', 'draft', 'copy', 'document', 'master', 'need', 'change', 'letter', 'form']
Topic 9: 5% ['gas', 'company', 'energy', 'trading', 'power', 'natural', 'product', 'trade', 'financial', 'pipeline']
Topic 12: 2% ['chris', 'gas', 'ben', 'book', 'daily', 'report', 'volume', 'thanks', 'forwarded', 'need']
Topic 13: 7% ['business', 'mark', 'group', 'risk', 'management', 'market', 'new', 'service', 'global', 'trading']
To

time: 16.8 ms
