# Topic Modelling of the Enron corpus, using LDA (Latent Dirichlet Allocation)

We will look at the "sent" directory of each of the 150 employees of Enron. We need to import the data and, in turn, clean it up. The analyses [here](https://rforwork.info/2013/11/03/a-rather-nosy-topic-model-analysis-of-the-enron-email-corpus/) and [here](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) proved to be very useful. Also see http://www.colorado.edu/ics/sites/default/files/attached-files/01-11_0.pdf 

In [1]:
# We use the following magic commands to time the cells in the notebook
%install_ext https://raw.github.com/cpcloud/ipython-autotime/master/autotime.py
%load_ext autotime

from os import listdir, chdir
import re

Installed autotime.py. To use it, type:
  %load_ext autotime




We are going to place all the emails of each user into one large list. In order to utilise the LDA algorithm we require multiple documents. The obvious question is whether to treat each email as a separate document, or to treat the collection of each user's emails as a single document. For example:

Suppose person $A$ has emails $A_1$, $A_2$, $A_3$ and person $B$ has emails $B_1$ and $B_2$. Then we can create a list that is L = [$A_1$, $A_2$, $A_3$, $B_1$, $B_2$] or L = [$A_1A_2A_3$, $B_1B_2$]. For now, all the emails are going to be treated as separate documents. 
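
The two layouts can be sketched with placeholder strings. The names `emails_by_person`, `per_email` and `per_person` are illustrative only and do not appear in the rest of the notebook.

```python
# Hypothetical toy data: two people with three and two emails respectively
emails_by_person = {
    "A": ["A1 text", "A2 text", "A3 text"],
    "B": ["B1 text", "B2 text"],
}

# Option 1: every email is its own document
per_email = [email for emails in emails_by_person.values() for email in emails]

# Option 2: all of a person's emails are joined into one document
per_person = [" ".join(emails) for emails in emails_by_person.values()]

print(len(per_email))   # number of documents when each email stands alone
print(len(per_person))  # number of documents when emails are grouped per person
```

With the first layout LDA sees many short documents; with the second it sees a few long ones.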

Once the LDA algorithm has been implemented, we want to be able to list all the documents that fall under a given category. 

We now set up the regular expressions to remove the 'clutter' from the emails.
(Note: they are purposefully long to avoid successive searches through large data.)

An alternate set of regular expressions is also included. These are separated and thus take longer to iterate. 
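
As a rough illustration of the trade-off, the sketch below removes two kinds of delimited blocks either with two separate patterns (two passes over the text) or with one combined pattern using alternation (one pass). The sample text and the simplified patterns are assumptions made for this example; they are not the notebook's expressions.

```python
import re

# Hypothetical sample text; the patterns are simplified stand-ins
sample = "Hello ----- forwarded ----- world ***** note ***** done"

# Separate patterns: one pass over the text per pattern
dashes = re.compile(r"-----.*?-----", re.DOTALL)
stars = re.compile(r"\*\*\*\*\*.*?\*\*\*\*\*", re.DOTALL)
separate = stars.sub(" ", dashes.sub(" ", sample))

# Combined pattern: a single pass using alternation
combined_re = re.compile(r"-----.*?-----|\*\*\*\*\*.*?\*\*\*\*\*", re.DOTALL)
combined = combined_re.sub(" ", sample)

print(combined.split())
print(combined == separate)
```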

In [2]:
# Defining regular expressions 

re0 = re.compile(r'>')
re1 = re.compile(r'(Message-ID(.*?\n)*X-FileName.*?\n)|'
                 r'(To:(.*?\n)*?Subject.*?\n)|'
                 r'(< (Message-ID(.*?\n)*.*?X-FileName.*?\n))')
re2 = re.compile(r'(.+)@(.+)') # Remove email addresses
re3 = re.compile(r'\s(-----)(.*?)(-----)\s', re.DOTALL)
re4 = re.compile(r'''\s(\*\*\*\*\*)(.*?)(\*\*\*\*\*)\s''', re.DOTALL)
re5 = re.compile(r'\s(_____)(.*?)(_____)\s', re.DOTALL)
re6 = re.compile(r'\n( )*-.*')
re7 = re.compile(r'\n( )*\d.*')
re8 = re.compile(r'(\n( )*[\w]+($|( )*\n))|(\n( )*(\w)+(\s)+(\w)+(( )*\n)|$)|(\n( )*(\w)+(\s)+(\w)+(\s)+(\w)+(( )*\n)|$)')
re9 = re.compile(r'.*orwarded.*')
re10 = re.compile(r'From.*|Sent.*|cc.*|Subject.*|Embedded.*|http.*|\w+\.\w+|.*\d\d/\d\d/\d\d\d\d.*')
re11 = re.compile(r' [\d:;,.]+ ')



time: 18.9 ms


We now build a list of strings, each string being an email (document). Each document is filtered according to the regular expressions above. We also build a dictionary, docs_num_dict, that maps each user's index to a pair holding the user's name and the list of that user's filtered emails.

In [3]:
from collections import defaultdict

docs = []
docs_num_dict = [] # Stores email sender's name and number

chdir('/home/peter/Downloads/enron')
# For each user we extract all the emails in their 'sent' folder

names = [i for i in listdir()]
m = 0
for name in names:
    sent = '/home/peter/Downloads/enron/' + str(name) + '/sent'   
    try: 
        chdir(sent)
        d = []
        for email in listdir():          
            text = open(email,'r').read()
            # Regular expressions are used below to remove 'clutter'
            text = re.sub(re0, ' ', text)
            text = re.sub(re1, ' ', text)
            text = re.sub(re2, ' ', text)
            text = re.sub(re3, ' ', text)
            text = re.sub(re4, ' ', text)
            text = re.sub(re5, ' ', text)
            text = re.sub(re6, ' ', text)
            text = re.sub(re7, ' ', text)
            text = re.sub(re8, ' ', text)
            text = re.sub(re9, ' ', text)
            text = re.sub(re10, ' ', text)
            text = re.sub(re11, ' ', text)
            docs.append(text)
            d.append(text)
        docs_num_dict.append((m,[name,d]))
        m += 1
    except:
        pass
    
docs_num_dict = dict(docs_num_dict)

time: 2min 12s


We can make use of either a) stemming or b) lemmatizing to find word roots. See [here](http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization) for a more detailed explanation of the two. Right below, the lemmatizer is implemented. 

The stemmer generally cuts off suffixes of words according to some set rules. Thus words like 'facilitate' are shortened to a truncated stem such as 'faci'. This can be confusing and requires that the words are 're-built' before being displayed. The lemmatizer also uses set rules for words of a certain form, but it has the advantage of comparing words to a dictionary.

In general, the lemmatizer is preferred. 
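
The difference can be seen in a deliberately tiny toy version of each approach. The suffix list and the lookup dictionary below are hypothetical and far simpler than what real stemmers and lemmatizers (such as those in NLTK) use; the sketch only shows that rule-based stripping can produce non-words, while a dictionary lookup returns real words.

```python
# Toy illustration only: real stemmers and lemmatizers use far richer rules and vocabularies
SUFFIXES = ("ies", "es", "s", "ing", "ed")                          # hypothetical rule set
LEMMA_DICT = {"studies": "study", "was": "be", "mice": "mouse"}     # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ["studies", "mice", "was"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```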

While creating a new 'texts' variable that stores the filtered documents, we also update docs_num_dict so that each email is replaced by its tokens after the tokenize, stop-word removal and lemmatize steps.

### Using the lemmatizer (consider using this instead of the stemmer):

In [4]:
# To build the dictionary
from collections import defaultdict
d = defaultdict(int)

# We now employ the techniques as outlined in the second link at the top - see **
from stop_words import get_stop_words
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

texts = []

for i in range(0,len(docs_num_dict.items())):
    new_docs_num_dict_1 = []
    for doc in docs_num_dict[i][1]:
        # Tokenization
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)

        # Removing stop words

        # create English stop words list
        en_stop = get_stop_words('en')

        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]

        # Lemmatizing

        # Create wordnet_lemmatizer of class WordNetLemmatizer
        wordnet_lemmatizer = WordNetLemmatizer()

        # lemmatize token
        lemmatized_tokens = [wordnet_lemmatizer.lemmatize(i) for i in stopped_tokens]

        texts.append(lemmatized_tokens)
        new_docs_num_dict_1.append(lemmatized_tokens)

        # We now build the dictionary
        for word in lemmatized_tokens:
            d[word] += 1  
    docs_num_dict[i][1] = new_docs_num_dict_1


time: 53.1 s


The texts list, as well as the dictionary d (which counts the total number of times a given word is used in the corpus), is saved.

In [5]:
import json

chdir('/home/peter/Topic_Modelling/LDA/')

# Save the texts file as texts_raw (will be edited again below)
with open('texts_raw.jsn','w') as f:
    json.dump(texts,f)
f.close()

# Save the dictionary d
with open('d.jsn','w') as f:
    json.dump(d,f)
f.close()

time: 3.52 s


In [6]:
import json

chdir('/home/peter/Topic_Modelling/LDA/')

# Loading the raw texts file
with open('texts_raw.jsn','r') as f:
    texts = json.load(f)
f.close()
    
# Loading the dictionary d 
with open('d.jsn','r') as f:
    d = json.load(f)
f.close()

time: 610 ms


We now build the dictionary of dictionaries, docs_name_dict. It associates with each employee's name a dictionary that stores all the words used by that person, as well as the number of times they used each of these words. 
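
The shape of such a nested dictionary can be sketched on toy data. The `tokenised` input and the name `word_counts_by_person` are hypothetical stand-ins for docs_num_dict and docs_name_dict.

```python
from collections import defaultdict

# Hypothetical tokenised emails per person (stand-in for docs_num_dict)
tokenised = {
    "allen-p": [["gas", "price"], ["gas", "market"]],
    "dickson-s": [["contract", "gas"]],
}

# Outer key: person; inner dictionary: word -> number of uses by that person
word_counts_by_person = {}
for person, documents in tokenised.items():
    counts = defaultdict(int)
    for document in documents:
        for word in document:
            counts[word] += 1
    word_counts_by_person[person] = counts

print(dict(word_counts_by_person["allen-p"]))
```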

In [7]:
from collections import defaultdict
docs_name_dict = []

for i in range(0,len(docs_num_dict.items())):
    temp_dict = defaultdict(int)
    for j in docs_num_dict[i][1]:
        for k in j:
            temp_dict[k] += 1
    # Append the temporary dictionary to docs_name_dict
    docs_name_dict.append((docs_num_dict[i][0],temp_dict)) 
docs_name_dict = dict(docs_name_dict)

time: 969 ms


We now want to remove the words from our documents that cause clutter. We will remove all the very common words as well as all the very rare ones. The dictionary d counts the number of times a word is present across all the roughly 57,000 documents, and the loop below uses this total count as a proxy for document frequency: a word is kept only if its count is greater than 4 and less than 20% of the number of documents. 

To further enhance the quality of the text we analyse, the loops below remove all words of length 1 or 2. 
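
A minimal sketch of this kind of count-based filter on toy data is given below. The thresholds `lower`, `upper` and `min_len` are illustrative values chosen for the tiny example and are not the ones used in the notebook.

```python
from collections import Counter

# Hypothetical tokenised documents
toy_texts = [["gas", "price", "ok"], ["gas", "market"], ["gas", "price"], ["gas", "pipeline"]]

# Total number of occurrences of each word across all documents
counts = Counter(word for doc in toy_texts for word in doc)
lower, upper, min_len = 1, 4, 3   # illustrative thresholds, not the notebook's

# Keep a word only if its count lies strictly between the bounds and it is long enough
filtered = [
    [w for w in doc if lower < counts[w] < upper and len(w) >= min_len]
    for doc in toy_texts
]
print(filtered)
```

Here 'gas' is dropped for being too common and 'ok', 'market' and 'pipeline' for being too rare, so only 'price' survives.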

In [8]:
num_docs = len(texts)
temp_texts = texts
texts= []
upper_lim = int(0.20*num_docs)

for doc in temp_texts:
    temp_doc = []
    for word in doc:
        # If the word is in the required interval, we add it to a NEW texts variable
        if 4 < d[word] < upper_lim and len(word) > 2:
            temp_doc.append(word)
        # If the word is not in the required interval, 
        # we lower the count of the word in the docs_name_dict dictionary
        else:
            for group in docs_name_dict.items():
                person = group[0]
                if word in docs_name_dict[person]:
                    if docs_name_dict[person][word] > 1:
                        docs_name_dict[person][word] -= 1
                    else:
                        del docs_name_dict[person][word]
    texts.append(temp_doc)

time: 21.6 s


We proceed to save the refined texts file and the dictionary, docs_name_dict.

In [9]:
import json
chdir('/home/peter/Topic_Modelling/LDA/')

# We save the new 'refined' texts file
with open('texts.jsn','w') as f:
    json.dump(texts,f)
f.close()

time: 2.66 s


In [10]:
import pickle
chdir('/home/peter/Topic_Modelling/LDA/')

# We save the docs_name_dict global person, word-count dictionary
pickle.dump( docs_name_dict , open( "docs_name_dict.p", "wb" ) )

time: 99.8 ms


In [11]:
import json
chdir('/home/peter/Topic_Modelling/LDA/')

# Loading the texts file
with open('texts.jsn', 'r') as f:
    texts = json.load(f)
f.close()

time: 522 ms


In [12]:
import pickle
chdir('/home/peter/Topic_Modelling/LDA/')

# Loading the docs_name_dict dictionary
docs_name_dict = pickle.load( open( "docs_name_dict.p", "rb" ) )

time: 90.2 ms


Below, we construct the document-term matrix, after which the fairly lengthy process of constructing the model takes place. Thus far the run time seems to scale linearly with the number of passes: with a single pass, the model takes just upward of a minute to execute, whereas for 5 passes it takes roughly 5.5 minutes.

The model was run for 350 passes and took 316 minutes to execute.
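
If the run time really is linear in the number of passes, two timings are enough for a rough extrapolation. The sketch below fits a straight line through the two measurements mentioned above, treating "just upward of a minute" as 1.0 minute, which is an assumption. Actual run times depend on the machine and its load, so the result is only a ballpark figure.

```python
# Two (passes, minutes) measurements; the 1.0 is an assumed value for "just upward of a minute"
p1, t1 = 1, 1.0
p2, t2 = 5, 5.5

# Straight line through the two points: minutes = slope * passes + intercept
slope = (t2 - t1) / (p2 - p1)
intercept = t1 - slope * p1

def estimate_minutes(passes):
    """Linear extrapolation of the run time for a given number of passes."""
    return slope * passes + intercept

print(round(estimate_minutes(350), 1), "minutes (rough estimate)")
```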

In [13]:
# Constructing a document-term matrix

from gensim import corpora, models

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]


time: 9.23 s


In [14]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=20, id2word = dictionary, passes=350)

time: 7h 13min 41s


We save both the LDA data as well as the results. We can reanalyse later. See the folder called LDAdata.

To load the files again:

ldamodel = models.LdaModel.load('ldamodel') and dictionary = corpora.Dictionary.load('dictionary')


In [15]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Saving the dictionary
dictionary.save('dictionary')

# Saving the corpus    
with open('corpus.jsn','w') as f:
    json.dump(corpus,f)    
f.close()

# Saving the ldamodel
ldamodel.save('ldamodel')

time: 14.1 s


In [16]:
from gensim import corpora

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Load dictionary
dictionary = corpora.Dictionary.load('dictionary')

time: 16.5 ms


In [17]:
from gensim import models

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Load ldamodel
ldamodel = models.LdaModel.load('ldamodel') 

time: 66.5 ms


In [18]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Load corpus
with open('corpus.jsn','r') as f:
    corpus = json.load(f)
f.close()

time: 4.29 s


We now print the words for each of the given topics. Note that even though considerable emphasis has been placed on the construction of the regular expressions, some 'junk text' may still be present.
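
Each topic is reported as a single string of weighted words, so the words have to be pulled out of that string. The sketch below assumes a string of the form `weight*"word" + weight*"word"`; the sample string and its weights are made up for illustration, and the exact format may differ between gensim versions.

```python
import re

# Assumed format of one topic string; the weights here are invented for the example
topic_string = '0.031*"california" + 0.024*"state" + 0.019*"utility"'

# Capture (weight, word) pairs; the quotes around the word are treated as optional
pairs = re.findall(r'([\d.]+)\*"?(\w+)"?', topic_string)
words = [word for _, word in pairs]
weights = [float(weight) for weight, _ in pairs]

print(words)
print(weights)
```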

In [19]:
num_topics = 20
num_words = 10

List = ldamodel.print_topics(num_topics, num_words)
Topic_words =[]
for i in range(0,len(List)):
    word_list = re.sub(r'(.\....\*)|(\+ .\....\*)', '',List[i][1])
    temp = [word for word in word_list.split()]
    Topic_words.append(temp)
    print('Topic ' + str(i) + ': ' + '\n' + str(word_list))
    print('\n' + '-'*100 + '\n')

Topic 0: 
california state said utility energy price market electricity davis rate

----------------------------------------------------------------------------------------------------

Topic 1: 
way web site houston address center hotel member click city

----------------------------------------------------------------------------------------------------

Topic 2: 
received content date type george com mail version man gov

----------------------------------------------------------------------------------------------------

Topic 3: 
713 north america corp houston texas fax phone 853 646

----------------------------------------------------------------------------------------------------

Topic 4: 
game love saturday night friend year school god life little

----------------------------------------------------------------------------------------------------

Topic 5: 
year say now even fact meter without vote many point

----------------------------------------------------------------

Each topic's list of words created above is sorted below from the longest word to the shortest, with ties broken alphabetically, before being saved.
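
The ordering relies on a tuple sort key. A small sketch on a hypothetical word list shows how it behaves:

```python
words = ["state", "california", "rate", "said", "energy"]

# Negative length puts longer words first; the word itself breaks ties alphabetically
words.sort(key=lambda s: (-len(s), s))
print(words)
```

Because tuples compare element by element, 'rate' and 'said' (both of length 4) are ordered alphabetically relative to each other.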

In [43]:
for i in range(0,len(Topic_words)):
    temp = Topic_words[i]
    sort_key = lambda s: (-len(s), s)
    temp.sort(key = sort_key)
    print(temp)
    Topic_words[i] = temp

['electricity', 'california', 'utility', 'energy', 'market', 'davis', 'price', 'state', 'rate', 'said']
['address', 'houston', 'center', 'member', 'click', 'hotel', 'city', 'site', 'way', 'web']
['received', 'content', 'version', 'george', 'date', 'mail', 'type', 'com', 'gov', 'man']
['america', 'houston', 'north', 'phone', 'texas', 'corp', '646', '713', '853', 'fax']
['saturday', 'friend', 'little', 'school', 'night', 'game', 'life', 'love', 'year', 'god']
['without', 'meter', 'point', 'even', 'fact', 'many', 'vote', 'year', 'now', 'say']
['confidential', 'information', 'recipient', 'intended', 'received', 'message', 'email', 'error', 'copy', 'mail']
['arbitration', 'approval', 'facility', 'auction', 'request', 'brazil', 'permit', 'shall', 'unit', 'bid']
['counterparty', 'transaction', 'trading', 'credit', 'master', 'legal', 'trade', 'isda', 'sara', 'swap']
['trading', 'future', 'market', 'option', 'price', 'share', 'stock', 'value', 'week', 'year']
['capacity', 'contract', 'delivery'

In [45]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

# Saving the list of words
with open('topic_words.jsn','w') as f:
    json.dump(Topic_words,f)
f.close()

time: 7.75 ms


We also want to export the list of words to a CSV file so that we can use the data in our D3 visualisation.

In [42]:
import json

chdir('/home/peter/Topic_Modelling/LDA/LDAdata_results')

with open('topic_words.jsn','r') as f:
    Topic_words = json.load(f)
f.close()

time: 11 ms


We will now proceed to visualise the data above by using the [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html) package.

In [47]:
import warnings
warnings.filterwarnings('ignore')

import pyLDAvis.gensim

lda_visualise = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(lda_visualise)

time: 54.4 s


We use the colour palette called Tableau_20, which contains 20 different colours. We assign these to separate topics.

In [24]:
from palettable.tableau import Tableau_20

topic_colour_gen = []
for i in range(0,num_topics):
    topic_colour_gen.append((i, Tableau_20.hex_colors[i]))
    
topic_colours = dict(topic_colour_gen)


time: 59.6 ms


The function below runs through a document of the user's choice and matches topic words within the document, highlighting them. 
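
The core idea, wrapping matched words in a coloured HTML `<span>`, can be sketched in isolation. The `word_colours` mapping and the `highlight` helper are hypothetical; the notebook's own version looks topics up through the LDA model instead of a fixed mapping.

```python
import html

# Hypothetical word-to-colour mapping; the notebook takes its colours from topic_colours
word_colours = {"gas": "#1f77b4", "contract": "#ff7f0e"}

def highlight(text):
    """Wrap each known word in a coloured <span>, leaving other words untouched."""
    out = []
    for token in text.split():
        colour = word_colours.get(token.lower())
        safe = html.escape(token)  # avoid injecting raw markup from the document
        if colour:
            out.append('<span style="background-color: {}">{}</span>'.format(colour, safe))
        else:
            out.append(safe)
    return " ".join(out)

print(highlight("The gas contract expires"))
```

Building the output token by token avoids the pitfall of repeated string replacement, where a word that occurs more than once could be wrapped twice.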

In [25]:
from nltk.stem.wordnet import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
from collections import defaultdict
import re

doc = ''

def match_words(word):
    word_edit = word.lower()
    try:
        word_edit = tokenizer.tokenize(word_edit)[0]
    except:
        pass
    return wordnet_lemmatizer.lemmatize(word_edit)
    
def build_html_colour(word, topic):
    #return " <font color=" + str(topic_colours[topic]) + "'>" + word + "</font> "
    return ' <span style="background-color: ' + str(topic_colours[topic])  +'">' + word + '</span>'

def read_doc(doc):
    chdir('/home/peter/Topic_Modelling/LDA/text_files')
    doc = open(str(doc),'r').read()
    
    # Variables so recalculation is not necessary
    doc_split = doc.split()
    
    # Build dictionary of topic's distribution for a given document
    num_topics_weight = 0
    Topics = defaultdict(int)
    for word in doc_split:
        word_edit = match_words(word)
        try:
            word_topics = ldamodel.get_term_topics(word_edit)
            if word_topics:
                for topic in word_topics:
                    Topics[topic[0]] += topic[1]
                    num_topics_weight += topic[1]            
        except:
            pass
    # Find topic info
    # Append Topic, number of words in document from given topic and doc percentage of topic
    Topic_info = []
    for topic in Topics:
        Topic_info.append([topic, Topics[topic], round((Topics[topic]/num_topics_weight)*100)]) 
    
    # Topic info for three most prevalent topics for a given document
    Topic_info_top3 = []
    Topic_info_copy = []
    for i in Topic_info:
        Topic_info_copy.append(i)
    
    for i in range(0,3):
        # 'top' is used instead of 'max' to avoid shadowing the built-in function
        top = Topic_info_copy[0]
        for topic in Topic_info_copy:
            if topic[2] > top[2]:
                top = topic
        Topic_info_top3.append(top)
        Topic_info_copy.remove(top)
        
    
    # Format the document according to topics
    for word in doc_split:
        word_edit = match_words(word)
        try:
            topic = ldamodel.get_term_topics(word_edit)[0][0]
            if (topic == Topic_info_top3[0][0]) or (topic == Topic_info_top3[1][0]) or (topic == Topic_info_top3[2][0]):
                doc = doc.replace( ' ' + word + '', build_html_colour(word,topic))
                #doc = doc.replace( '' + word + ' ', build_html_colour(word,topic))
        except:
            pass
    doc = re.sub(r'\n','<br>',doc)
    
    Output = []
    for item in Topic_info_top3:
        colour = build_html_colour('Topic ' + str(item[0]), item[0])
        topic_info = colour + ': ' + str(item[2]) + '% ' + str(Topic_words[item[0]])
        Output.append(topic_info)
    return Output, doc


time: 71.1 ms


HTML is used to add colour to the printed text. See [here](https://jakevdp.github.io/blog/2013/06/01/ipython-notebook-javascript-python-communication/) for more information.

In [26]:
# Example from http://jakevdp.github.io/blog/2013/06/01/ipython-notebook-javascript-python-communication/ adapted for IPython 2.0

#Input the document we want to read

doc = 'dickson-s_3.'

from IPython.display import HTML

# Call read_doc once and reuse its results instead of re-reading the document four times
topic_lines, doc_html = read_doc(doc)

input_form = """
<div style="background-color:white; border:solid black; width:1100px; padding:20px;">
<p>""" + topic_lines[0] + """</p>
<p>""" + topic_lines[1] + """</p>
<p>""" + topic_lines[2] + """</p>
<p>""" + doc_html + """</p>
</div>
"""

HTML(input_form) # + javascript)

time: 68.3 ms


We now also have a method to see which topics are prevalent for a given person.

Below, we create two functions, namely get_person_topics and get_topic_persons.

get_person_topics takes a specific person as a string and returns a dictionary with a ratio value (out of 1) for each of the 20 topics. This indicates the prevalence of each topic for the given person.

get_topic_persons takes a topic as an integer and returns a dictionary with a ratio value (out of 1) for all the employees. This indicates how strongly each employee is associated with the specific topic. 
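
The scaling step behind both functions is a simple normalisation: sum the topic weights, then divide by the grand total so the values add up to 1. The sketch below uses a hypothetical `term_topics` lookup table with invented weights in place of the model.

```python
from collections import defaultdict

# Hypothetical (topic_id, weight) pairs per word, standing in for the model's term-topic lookup
term_topics = {
    "gas": [(3, 0.02), (9, 0.01)],
    "price": [(9, 0.03)],
    "hotel": [(1, 0.04)],
}

def topic_ratios(words):
    """Sum topic weights over the words and scale them so they add up to 1."""
    totals = defaultdict(float)
    for word in words:
        for topic_id, weight in term_topics.get(word, []):
            totals[topic_id] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no known words, so there is nothing to scale
    return {topic_id: value / grand_total for topic_id, value in totals.items()}

ratios = topic_ratios(["gas", "price", "hotel"])
print(ratios)
```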

In [27]:
from collections import defaultdict

def get_person_topics(person):
    person_topics = defaultdict(int)
    total = 0
    for word in docs_name_dict[person]:
        try:
            term_topics = ldamodel.get_term_topics(word)
            if term_topics:
                for topic_tuple in term_topics:
                    person_topics[topic_tuple[0]] += topic_tuple[1]
                    total += topic_tuple[1]
        except:
            pass
        
    # Scale the values so that they sum to 1
    for topic in person_topics:
        person_topics[topic] = person_topics[topic]/total
    return person_topics

def get_topic_persons(topic):
    specific_topic_persons = defaultdict(int)
    
    total = 0
    for person in docs_name_dict:
        person_topics = get_person_topics(person)
        person_value = person_topics[topic]
        specific_topic_persons[person] += person_value
        total += person_value
    
    
    # Scale the numbers in the dictionary to ratios that sum to 1
    for person in docs_name_dict:
        specific_topic_persons[person] = specific_topic_persons[person]/total
        
    return specific_topic_persons
                

time: 21.1 ms


We now see which person falls under a given topic the 'most' as well as which topic falls under a given person the 'most'.

In [28]:
# Finding top person for a given topic

topic_person = get_topic_persons(10)
maximum_person = max(topic_person.keys(), key=(lambda key: topic_person[key]))
print(maximum_person, '{0:.2%}'.format(topic_person[maximum_person]))

ybarbo-p 3.92%
time: 7.31 s


In [29]:
# Finding top topic for a given person

person_topic = get_person_topics('allen-p')
maximum_topic = max(person_topic.keys(), key=(lambda key: person_topic[key]))
print(maximum_topic, '{0:.2%}'.format(person_topic[maximum_topic]))

3 20.01%
time: 97 ms


We now make use of matplotlib to plot the above data. 

In [30]:
def get_tot_words_person(person):
    n = 0
    for word in docs_name_dict[person]:
        n += docs_name_dict[person][word]
    return n

time: 1.81 ms


We make a data structure to export as a CSV file.
The data fields are:

$
\begin{array}{|c|c|c|c|c|c|c|c|}\hline
\text{Person Name} & \text{id} & \text{tot words} & \text{Top Topic} & \text{Top Topic ratio} & \text{Second Topic} & \text{Second Topic ratio} & \ldots \\ \hline
\text{Dickson, S} & \text{dickson-s} & \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\ \hline
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \hline
\end{array}
$
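
One way to flatten a person's topic ratios into such a row is sketched below. The helper `build_row` and the sample values (including the word total of 1200) are hypothetical and only illustrate the layout: topics are listed from most to least prevalent, each followed by its ratio, and topics the person never touched are appended with a ratio of 0.

```python
def build_row(display_name, person_id, total_words, ratios, num_topics):
    """Flatten a {topic: ratio} mapping into one row, most prevalent topic first."""
    row = [display_name, person_id, total_words]
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    for topic, ratio in ranked:
        row.extend([topic, ratio])
    # Topics that do not occur for this person are appended with a ratio of 0
    for topic in range(num_topics):
        if topic not in ratios:
            row.extend([topic, 0])
    return row

# Invented sample values for illustration
row = build_row("Dickson, S", "dickson-s", 1200, {2: 0.25, 0: 0.75}, num_topics=4)
print(row)
```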

In [31]:
Data = []
list_of_names = []
list_of_names_dup = []
for name in docs_name_dict:
    list_of_names.append(name.capitalize().replace('-',', '))
    list_of_names_dup.append(name)
list_of_names.sort()
list_of_names_dup.sort()

for i in range(0,len(list_of_names)):
    name = list_of_names[i][0:-1]
    first_name = list_of_names[i][-1].capitalize()
    list_of_names[i] = name + first_name
    Data.append([name+first_name,list_of_names_dup[i],get_tot_words_person(list_of_names_dup[i])])

time: 65.3 ms


In [32]:
for data in Data:
    name = data[1]
    person_topics = get_person_topics(name)
    person_topics = [(v, k) for k, v in person_topics.items()]
    person_topics.sort()
    person_topics.reverse()
    for tuples in person_topics:
        data.append(tuples[1])
        data.append(tuples[0])
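    # Pad with any topics this person never touched (ratio 0).
    # Checking against the set of topic ids avoids false matches with
    # tot_words or ratio values that happen to equal a topic number.
    present_topics = {topic for _, topic in person_topics}
    for num in range(0,20):
        if num not in present_topics:
            data.append(num)
            data.append(0)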
    

time: 6.17 s


In [33]:
Data = [['Employee', 'id', 'tot_words', 'A', 'Ap', 'B', 'Bp', 'C', 'Cp'
         , 'D', 'Dp', 'E', 'Ep', 'F', 'Fp', 'G', 'Gp', 'H', 'Hp',
         'I', 'Ip', 'J', 'Jp', 'K', 'Kp', 'L', 'Lp', 'M', 'Mp', 'N', 
         'Np', 'O', 'Op', 'P', 'Pp', 'Q', 'Qp', 'R', 'Rp', 'S', 'Sp', 'T', 'Tp']] + Data

time: 2.76 ms


In [34]:
import csv

# newline="" lets the csv module control line endings itself
with open("bubbles_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(Data)

time: 18.5 ms
