# LDA analysis of 'sent' subset of Enron emails

The full Enron corpus was downloaded, and an attempt was made to filter through the Inboxes of each of the 150 employees. A clear problem emereged, namely, that the emails were vastly inconsistent. The following [PDF](http://www.colorado.edu/ics/sites/default/files/attached-files/01-11_0.pdf) was found that shed light on analysing the Corpus. The observation was made that the 'Sent' directory would contain much 'cleaner' data, since a person would be less likely to forward junk-mail. In addition to making this useful observation, the writer pointed to a 'filtered' verison of the sent mails in the corpus. Thus, rather than doing the work of filtering through the corpus, the project will progress directly to LDA techniques. Once a rustic model is set up, one can revisit the regural expressions to see if generalisations are possible. 

** Second link [here](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html)

In [45]:
from os import listdir, chdir
import re

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

Let us first import the data into a list 

In [47]:
docs = []
chdir('/home/peter/Downloads/enronsent')
for file in listdir():
    if file.startswith('enron'):
        text = open(file).read()
        
        text = re.sub(r'[\w\.-]+@[\w\.-]+','',text) # Remove emails
        text = re.sub(r'[\*\\\/\_\=\"-\$(...)(~~~)(---)]+', '' ,text) # Remove misc 
        text = re.sub(r'(\d+th)|(\d+)', "",text) # Remove arbitrary numbers
        text = re.sub(r'\(.\)','',text) # Remove possibile multiplicity in words
        text = re.sub('\\\'', "",text) # NO IDEA HOW TO FILTER THIS OUT!
        text = re.sub(r'-----Original Message-----', ' ', text) ###
        text = re.sub(r'(will|can|please|can|know|thank)', ' ', text) #######
        text = re.sub(r'\s+',' ',text) # Remove newline and whitespace 
        text = re.sub(r'omni.* ',' ', text)###3
        text = re.sub(r'( . )|( .\: )',' ', text) # Remove single characters
        docs.append(text)


In [None]:
# TESTING docs DATA
print(len(docs))
docs[0][1000000:1040000]


In [48]:
# We now employ the techniques as outline in the second link at the top - see **
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer

texts = []

for doc in docs:
    # Tokenization
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)
    
    # Removing stop words

    # create English stop words list
    en_stop = get_stop_words('en')

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    
    # Stemming 

    # Create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()

    # stem token
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    texts.append(stemmed_tokens)

In [49]:
# Constructing a document-term matrix

from gensim import corpora, models

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]


In [50]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=7, id2word = dictionary, passes=10)


In [51]:
# Save the LDA model
import pickle
with open('lda_model_array_strict_version.pkl', 'wb') as f:
    pickle.dump(ldamodel, f, pickle.HIGHEST_PROTOCOL)


NameError: name 'pickle' is not defined

In [52]:
num_topics = 7
num_words = 9

List = ldamodel.print_topics(num_topics, num_words)
for i in range(0,len(List)):
    print('Topic ' + str(i) + ': ' + str(re.sub(r'(.\....\*)|(\+ .\....\*)', '',List[i][1])))

Topic 0: carol thank need fax pm enron pleas get agreement
Topic 1: thank need pleas ga get let enron deal work
Topic 2: thank enron get go need time like pleas work
Topic 3: thank enron pleas email vinc work need attach pm
Topic 4: thank deal go get let just need im pm
Topic 5: power messag thank enron state get california call energi
Topic 6: pm cndjohn forneyoudhouodect omniupdatedbi omnicalendarentryid omniorgt omnicalendarentri omniappointmenttyp omnienddatetim


In [33]:
len(ldamodel.print_topics())

7