# Corpus

This part of the work is based on two tutorial take from [here](https://radimrehurek.com/gensim/tut1.html) and [here](https://radimrehurek.com/gensim/tut2.html#available-transformations).



In [807]:
import pandas as pd                                     
import numpy as np                                      
import os 

import matplotlib.pyplot as plt

from datetime import datetime

%matplotlib inline
import seaborn as sns                                   # For pretty plots
import gensim

from os import path
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from PIL import Image
from gensim import corpora, models, similarities
from collections import defaultdict

### Clean data

We decide tu use only 'extractedBodyText' form the email beacause this flield is susceptible to contain more information about topics.

In [808]:
emails = pd.read_csv("hillary-clinton-emails/Emails.csv")
len(emails)

7945

We have 7945 emails but some of them doesn't have extractedBody, so we decide to drop them. Then we have 6742 emails left.

In [809]:
emails_text = emails['ExtractedBodyText'].dropna()
len(emails_text)

6742

For cleaning emails we first transfom all characters into lowercase and we transform all word u.s.a on usa and all u.s on us. After that we can remplace juste all punctuation signs by an espace. 

In [810]:
emails_text = emails_text.apply(lambda x: x.lower())
emails_text = emails_text.apply(lambda x: x\
                                .replace('u.s.a','usa')\
                                .replace('u.s','us'))
emails_text = emails_text.apply(lambda x: x\
                                .replace(':',' ')\
                                .replace('—',' ')\
                                .replace('-',' ')\
                                .replace('.',' ')\
                                .replace(',',' ')\
                                .replace('.',' ')\
                                .replace('<',' ')\
                                .replace('>',' ')\
                                .replace('=',' ')\
                                .replace('•',' ')\
                                .replace("\\",' ')\
                                .replace('\n', ' ')\
                                .replace('^',' ')\
                                .replace('\\',' ')\
                                .replace('?',' ')\
                                .replace('\'',' ')\
                               )
#emails_text = emails_text.apply(lambda x: x.replace('—',' '))


Because your emails_text is of type *pandas.core.series.Series* we want to convert it into list :

In [811]:
documents = emails_text.tolist()
type(documents)

list

We check that we still have 6742 emails

In [812]:
len(documents)

6742

The next step is to remove all common words. For that we create a basic stoplist from a STOPWORDS list. By iteration we have added some common words that appeared in the results. 

In [813]:
stoplist = set(STOPWORDS)
stoplist.add('pls')
stoplist.add('yes')
stoplist.add('call')
stoplist.add('pm')
stoplist.add('no')
stoplist.add('com')
stoplist.add('doc')
stoplist.add('docx')
stoplist.add('pdf')
stoplist.add('mr')
stoplist.add('mrs')
stoplist.add('call')
stoplist.add('need')
stoplist.add('one')
stoplist.add('two')
stoplist.add('fyi')
stoplist.add('00')
stoplist.add('will')
stoplist.add('know')
stoplist.add('re')
stoplist.add('ok')
stoplist.add('also')
stoplist.add('see')
stoplist.add('us')
stoplist.add('good')
stoplist.add('thx')
stoplist.add('new')
stoplist.add('go')
stoplist.add('you')
stoplist.add('now')
stoplist.add('done')
stoplist.add('yet')
stoplist.add('wrote')
stoplist.add('etc')
stoplist.add('back')
stoplist.add('today')
stoplist.add('said')
stoplist.add('many')
stoplist.add('already')
stoplist.add('want')


for i in range(10,100) :
    stoplist.add(i)



Now we uste the stoplist to create a list of word that are not in stoplist and that are bigger than 1 (this is inspired by the code of [tutorial of radimrehurek website](https://radimrehurek.com/gensim/tut1.html)

In [814]:
# remove list contains on stoplist and tokenize
texts = [[word for word in document.lower().split() if word not in stoplist and len(word) > 1]
         for document in documents]

For each word referenced we count how many time its appear (again code inspired of [tutorial of radimrehurek website](https://radimrehurek.com/gensim/tut1.html))

In [815]:
# counts the number of appartion of a word
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
        
# remove word that apprears only once
texts = [[token for token in text if frequency[token] > 1]
      for text in texts]        

We use now a dictionary to map the word with integer ids 

In [819]:
dictionary = corpora.Dictionary(texts)

In [820]:
print(dictionary)

Dictionary(17663 unique tokens: ['145', 'pouch', 'capabilities', 'rooms', 'mirage']...)


### Create Vector list

Now that we have a dictionary that contains uniques word, we can create your vector. The vector contains how many times each word appears in each documents (in your case one document is one email)

In [822]:
corpus = [dictionary.doc2bow(text) for text in texts]

### Apply transformation
We can now apply your Latent Dirichlet Allocation transformation. 

#### 5 topics

In [827]:
model_5 = models.LdaModel(corpus, id2word=dictionary, num_topics=5)


In [828]:
model_5.print_topics(5)

[(0,
  '0.004*"obama" + 0.004*"government" + 0.004*"president" + 0.004*"american" + 0.003*"people" + 0.003*"state" + 0.003*"states" + 0.003*"united" + 0.003*"time" + 0.003*"political"'),
 (1,
  '0.016*"2010" + 0.012*"gov" + 0.005*"hrod17@clintonemail" + 0.005*"fw" + 0.004*"cheryl" + 0.004*"may" + 0.004*"sunday" + 0.004*"2009" + 0.004*"party" + 0.004*"10"'),
 (2,
  '0.005*"un" + 0.004*"obama" + 0.004*"4(d)" + 0.003*"women" + 0.003*"b1" + 0.003*"people" + 0.003*"2010" + 0.003*"haitian" + 0.003*"israel" + 0.003*"work"'),
 (3,
  '0.018*"secretary" + 0.017*"office" + 0.014*"30" + 0.012*"meeting" + 0.010*"10" + 0.010*"15" + 0.009*"room" + 0.009*"state" + 0.008*"department" + 0.007*"time"'),
 (4,
  '0.009*"state" + 0.006*"2010" + 0.004*"percent" + 0.004*"secretary" + 0.003*"may" + 0.003*"work" + 0.003*"department" + 0.003*"house" + 0.003*"tomorrow" + 0.003*"lona"')]

#### 20 topics

In [830]:
model_20 = models.LdaModel(corpus, id2word=dictionary, num_topics=20)

In [831]:
model_20.print_topics(20)

[(0,
  '0.008*"obama" + 0.006*"president" + 0.004*"policy" + 0.004*"even" + 0.003*"government" + 0.003*"public" + 0.003*"time" + 0.003*"much" + 0.003*"diplomacy" + 0.003*"part"'),
 (1,
  '0.013*"qddr" + 0.012*"bibi" + 0.012*"work" + 0.011*"settlements" + 0.011*"prince" + 0.005*"settlement" + 0.005*"speech" + 0.005*"ashton" + 0.004*"korea" + 0.004*"north"'),
 (2,
  '0.030*"b6" + 0.021*"print" + 0.016*"part" + 0.013*"release" + 0.008*"pis" + 0.008*"thanks" + 0.007*"2010" + 0.006*"sbwhoeop" + 0.006*"ll" + 0.006*"may"'),
 (3,
  '0.008*"book" + 0.006*"clips" + 0.005*"time" + 0.005*"tomorrow" + 0.004*"woodward" + 0.003*"cables" + 0.003*"love" + 0.003*"senate" + 0.003*"2010" + 0.003*"obama"'),
 (4,
  '0.006*"party" + 0.005*"senate" + 0.004*"clinton" + 0.004*"obama" + 0.004*"secretary" + 0.004*"people" + 0.004*"house" + 0.004*"iraq" + 0.004*"treaty" + 0.004*"ed"'),
 (5,
  '0.011*"4(d)" + 0.010*"b1" + 0.008*"don" + 0.007*"4(b)" + 0.006*"want" + 0.005*"let" + 0.005*"going" + 0.005*"think" + 0.00

#### 35 topics

In [832]:
model_35 = models.LdaModel(corpus, id2word=dictionary, num_topics=35)

In [833]:
model_35.print_topics(35)

[(0,
  '0.023*"tomorrow" + 0.022*"hikers" + 0.017*"sbwhoeop" + 0.014*"negotiating" + 0.014*"note" + 0.012*"b6" + 0.011*"give" + 0.010*"read" + 0.010*"sunday" + 0.010*"copy"'),
 (1,
  '0.049*"richards" + 0.028*"bloomberg" + 0.021*"mayor" + 0.012*"lona" + 0.010*"haven" + 0.008*"high" + 0.008*"images" + 0.008*"valmoro" + 0.008*"assistant" + 0.008*"tonight"'),
 (2,
  '0.036*"nuclear" + 0.009*"treaty" + 0.007*"pakistan" + 0.007*"trucks" + 0.007*"state" + 0.007*"states" + 0.006*"weapons" + 0.006*"china" + 0.006*"chinese" + 0.005*"indian"'),
 (3,
  '0.036*"state" + 0.021*"department" + 0.020*"secretary" + 0.012*"force" + 0.012*"office" + 0.009*"647" + 0.009*"air" + 0.009*"andrews" + 0.008*"assistant" + 0.007*"conference"'),
 (4,
  '0.012*"bill" + 0.010*"women" + 0.006*"voices" + 0.006*"camps" + 0.006*"holiday" + 0.005*"state" + 0.005*"obama" + 0.004*"first" + 0.004*"many" + 0.004*"camp"'),
 (5,
  '0.012*"mod" + 0.011*"miliband" + 0.008*"play" + 0.008*"gordon" + 0.008*"letter" + 0.007*"ed" + 0

#### 50 topics

In [834]:
model_50 = models.LdaModel(corpus, id2word=dictionary, num_topics=50)

In [835]:
model_50.print_topics(50)

[(0,
  '0.018*"israel" + 0.017*"party" + 0.013*"israeli" + 0.011*"settlements" + 0.010*"obama" + 0.007*"jewish" + 0.006*"settlement" + 0.006*"government" + 0.006*"talks" + 0.006*"afghan"'),
 (1,
  '0.016*"gov" + 0.012*"qddr" + 0.009*"huma" + 0.009*"tomorrow" + 0.009*"development" + 0.008*"cheryl" + 0.008*"talked" + 0.008*"abedin" + 0.007*"planning" + 0.006*"pih"'),
 (2,
  '0.010*"obama" + 0.009*"david" + 0.008*"labour" + 0.008*"party" + 0.008*"percent" + 0.007*"president" + 0.005*"republicans" + 0.005*"campaign" + 0.005*"americans" + 0.005*"tax"'),
 (3,
  '0.046*"richards" + 0.021*"book" + 0.017*"secure" + 0.014*"books" + 0.014*"assume" + 0.014*"sounds" + 0.012*"rolling" + 0.007*"cost" + 0.007*"kerry" + 0.007*"hoping"'),
 (4,
  '0.074*"2010" + 0.031*"b6" + 0.027*"sullivan" + 0.023*"gov" + 0.018*"sullivanjj@state" + 0.018*"kurdistan" + 0.016*"hrod17@clintonemail" + 0.015*"jacob" + 0.010*"part" + 0.010*"sunday"'),
 (5,
  '0.019*"holbrooke" + 0.017*"mod" + 0.013*"via" + 0.011*"blackberry"

In [823]:
len(texts)

6742

In [824]:
#texts

In [793]:
dictionary = corpora.Dictionary(texts)
dictionary.save("part3/deerwester-email.dict")

In [794]:
print(dictionary)

Dictionary(17663 unique tokens: ['145', 'pouch', 'capabilities', 'rooms', 'mirage']...)


In [795]:
corpus = [dictionary.doc2bow(text) for text in texts]

In [796]:
#print(corpus)

In [797]:
corpora.MmCorpus.serialize("part3/deerwester-email.mm", corpus)

In [798]:
dictionary = corpora.Dictionary.load("part3/deerwester-email.dict")


In [799]:
model = models.LdaModel(corpus, id2word=dictionary, num_topics=50)

In [800]:
model.print_topics(50)

[(0,
  '0.014*"gov" + 0.013*"plane" + 0.013*"huma" + 0.012*"bill" + 0.012*"traveling" + 0.011*"latin" + 0.011*"test" + 0.009*"climate" + 0.009*"friday" + 0.008*"abedin"'),
 (1,
  '0.020*"percent" + 0.014*"2010" + 0.011*"voters" + 0.009*"palin" + 0.009*"obama" + 0.007*"poll" + 0.007*"republican" + 0.006*"unfavorable" + 0.006*"favorable" + 0.006*"opinion"'),
 (2,
  '0.026*"border" + 0.015*"holiday" + 0.012*"mother" + 0.011*"witnesses" + 0.009*"covered" + 0.009*"sarkozy" + 0.008*"residents" + 0.007*"taab" + 0.007*"condolence" + 0.006*"report"'),
 (3,
  '0.033*"sent" + 0.026*"via" + 0.026*"blackberry" + 0.016*"trying" + 0.014*"mail" + 0.013*"document" + 0.013*"morning" + 0.012*"sometime" + 0.011*"w/" + 0.011*"deadline"'),
 (4,
  '0.036*"israeli" + 0.016*"israel" + 0.013*"arab" + 0.009*"settlements" + 0.008*"jerusalem" + 0.008*"sanctions" + 0.006*"jewish" + 0.005*"march" + 0.005*"security" + 0.005*"civilians"'),
 (5,
  '0.025*"holbrooke" + 0.013*"list" + 0.012*"trip" + 0.011*"traffic" + 0.0

In [801]:
print(model)

LdaModel(num_terms=17663, num_topics=50, decay=0.5, chunksize=2000)


In [802]:
#print(corpus)