# LDA analysis of 'sent' subset of Enron emails

The full Enron corpus was downloaded, and an attempt was made to filter through the inboxes of each of the 150 employees. A clear problem emerged: the emails were highly inconsistent in format. The following [PDF](http://www.colorado.edu/ics/sites/default/files/attached-files/01-11_0.pdf) shed light on analysing the corpus. It observes that the 'Sent' directory contains much 'cleaner' data, since a person is unlikely to send or forward junk mail. The writer also points to a 'filtered' version of the sent mails in the corpus. Thus, rather than filtering the full corpus by hand, the project proceeds directly to LDA techniques on that subset. Once a rudimentary model is set up, one can revisit the regular expressions to see whether generalisations are possible.

** Second link [here](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html)

In [2]:
from os import listdir, chdir
import re

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

Let us first import the data into a list 

In [3]:
docs = []
chdir('/home/peter/Downloads/enronsent')
for filename in listdir():
    if filename.startswith('enron'):
        # Read the whole file; the context manager closes it afterwards
        with open(filename) as f:
            text = f.read()

        text = re.sub(r'[\w\.-]+@[\w\.-]+', '', text)  # Remove email addresses
        text = re.sub(r'\(.\)', '', text)  # Remove optional plural markers such as "(s)" while the parentheses still exist
        text = re.sub(r'[*\\/_="#$()+,.~-]+', '', text)  # Remove misc punctuation: * \ / _ = " # $ ( ) + , . ~ -
        text = re.sub(r'(\d+th)|(\d+)', '', text)  # Remove numbers and ordinals such as "5th"
        text = text.replace("'", '')  # Remove apostrophes
        text = re.sub(r'\s+', ' ', text)  # Collapse newlines and runs of whitespace
        text = re.sub(r'( . )|( .\: )', ' ', text)  # Remove single characters
        docs.append(text)
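
To make the effect of these substitutions concrete, the sketch below applies a subset of them, in the same order, to one short string. The helper name `clean_text` and the sample message are made up for illustration and are not taken from the corpus; the punctuation step lists its characters explicitly.

```python
import re

def clean_text(text):
    """Apply a subset of the clean-up substitutions to one string (illustrative helper)."""
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '', text)    # e-mail addresses
    text = re.sub(r'[*\\/_="#$()+,.~-]+', '', text)  # misc punctuation and rule characters
    text = re.sub(r'(\d+th)|(\d+)', '', text)        # ordinals and other numbers
    text = re.sub(r'\s+', ' ', text)                 # collapse whitespace and newlines
    return text

# Made-up sample text, not taken from the Enron data
sample = "Mail jane.doe@example.com by the 5th, re: contract 1234 ---\nThanks."
print(clean_text(sample))  # Mail by the re: contract Thanks
```

Note that colons and question marks survive these steps, which matches the `follows:` and `that?` visible in the cleaned output.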


In [6]:
# TESTING docs DATA
print(len(docs))
docs[0][0:30000]


45


'because MG is by no means credit risk by any definition The only other document relating to the metal that would like to have is evidence that Aluminum of Siberia did pay Unimetal for the metal in question Could you lay hands on that? will attempt to send you draft of the proposed letter to the Bad Guys later this evening in the hope that we can discuss on Tuesday Regards Mark Pleased to recap CURRENT situation as follows: On Friday July Judge Garbis District of Maryland ordered that the attachment of our metal be vacated ie released but that ANY proceeds from any sales be paid to: Bank of America Light Street Baltimore MD Account name: Miles and Stockbridge PC Account number: Escrow We intend to challenge this condition immediately hopefully Tuesday Wednesday July Judge Garbis appears to be on our side but since this was Friday afternoon judgement he has allowed Base Metals little more time to try and prove we were involved in some fraud The Judge told their lawyers that he could see

In [7]:
# We now employ the techniques as outlined in the second link at the top - see **
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer

# Create the English stop-word set and the Porter stemmer once, outside the loop
en_stop = set(get_stop_words('en'))
p_stemmer = PorterStemmer()

texts = []

for doc in docs:
    # Tokenization
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # Remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # Stem each remaining token
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    texts.append(stemmed_tokens)
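
As a rough illustration of the first two stages of this loop, the sketch below tokenizes and filters one sentence using only the standard library. It rests on two assumptions: `re.findall(r'\w+', ...)` stands in for the `RegexpTokenizer`, and `DEMO_STOP_WORDS` is a tiny hand-written list rather than the one returned by `get_stop_words('en')`. Stemming is left out because it needs NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the real English list)
DEMO_STOP_WORDS = {"the", "of", "to", "that", "is", "a", "we", "our"}

def tokenize_and_filter(text, stop_words=DEMO_STOP_WORDS):
    """Lower-case the text, split it into runs of word characters, drop stop words."""
    tokens = re.findall(r'\w+', text.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(tokenize_and_filter("We intend to challenge this condition immediately"))
# ['intend', 'challenge', 'this', 'condition', 'immediately']
```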

In [8]:
# Constructing a document-term matrix

from gensim import corpora, models

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]
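
For readers unfamiliar with the bag-of-words format, here is a minimal pure-Python sketch of the idea behind `Dictionary` and `doc2bow`: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The helper names `build_vocabulary` and `to_bow` and the toy documents are illustrative assumptions; gensim's own id assignment and internals may differ.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Toy stemmed documents, made up for illustration
toy_texts = [["metal", "judg", "metal"], ["judg", "bank"]]
vocab = build_vocabulary(toy_texts)
toy_corpus = [to_bow(tokens, vocab) for tokens in toy_texts]

print(vocab)       # {'metal': 0, 'judg': 1, 'bank': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The sparse pair format keeps memory use proportional to the number of distinct tokens per document rather than to the full vocabulary size.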


In [None]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=150)


In [124]:
# Save the LDA model
import pickle

with open('lda_model.pkl', 'wb') as f:
    pickle.dump(ldamodel, f, pickle.HIGHEST_PROTOCOL)
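
Since training with 150 passes can take a while, it may help to reload the pickled model in a later session instead of retraining. The sketch below shows the round trip with a plain dictionary standing in for the trained model, so that it runs without gensim; the helper names `save_object` and `load_object` and the temporary path are assumptions made for the example.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Pickle obj to path using the highest available protocol."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_object(path):
    """Load a previously pickled object from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# A plain dict stands in for the trained LDA model in this illustration
stand_in = {"num_topics": 5, "passes": 150}
path = os.path.join(tempfile.mkdtemp(), "lda_model.pkl")
save_object(stand_in, path)
restored = load_object(path)
print(restored == stand_in)  # True
```

In the notebook itself, `load_object('lda_model.pkl')` would return the saved model, provided gensim is importable in that session. gensim models also offer their own `save` and `load` methods, which could be used instead of pickle.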


In [None]:
# Print the nine highest-weighted words of each of the five topics
topics = ldamodel.print_topics(num_topics=5, num_words=9)
for topic in topics:
    print(topic)
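
Each printed item is a `(topic_id, string)` pair whose string lists weighted words joined by `+`. If the words and weights are needed as data, the string can be split with a regular expression, as in the sketch below. The helper `parse_topic` is hypothetical, and the example string uses invented words and weights purely to show the format; it is not output from this model.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    # The quotes around the word are treated as optional
    pairs = re.findall(r'([0-9.]+)\*"?([^"\s+]+)"?', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Invented example string, for illustration only
example = '0.020*"enron" + 0.015*"pleas" + 0.010*"thank"'
print(parse_topic(example))
# [('enron', 0.02), ('pleas', 0.015), ('thank', 0.01)]
```

Alternatively, the model's `show_topic` method returns word and probability pairs directly, which avoids parsing strings altogether.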