## 1. Word Cloud
Generate a word cloud from the raw data and another from the processed data, then compare the two.

In [1]:
import pandas as pd
import nltk
from nltk.book import FreqDist
from nltk.corpus import stopwords
import os
from os import path
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from helpers import *
from gensim import models, corpora
import pyLDAvis.gensim
%load_ext autoreload
%autoreload 2
%matplotlib inline


# Topic Modelling
We try to find the important topics in the email database using LDA.

We start by building a list of the emails concatenated with their subjects, because the subject may also contain useful keywords.

In [78]:
emails_df = pd.read_csv('hillary-clinton-emails/Emails.csv')
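# join subject and body with a space so the last subject word and the first body word do not merge;
# rows with a missing subject or body become NaN here and are dropped below
emails_df['email'] = emails_df['MetadataSubject'] + ' ' + emails_df['ExtractedBodyText']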

# list of emails with subjects
email_list = emails_df['email'].dropna().reset_index(drop=True)

We preprocess the text as we did in the first part, so that we have a meaningful bag of words to work with, free of stopwords and other uninformative tokens. We also retrospectively remove the word 'state', because it shows up in every topic.

In [93]:
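# build the stopword set once; 'state' is added because it shows up in every topic
stop_set = set(hillary_stopwords) | set(email_stopwords) | {'state'}

email_text = []
for text in email_list:
    tokens = preprocess_pipeline(text)
    # filter into a new list: calling remove() on a list while iterating over it skips elements
    email_text.append([word for word in tokens if word not in stop_set])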
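
The filtering step is easy to get subtly wrong, so here is a minimal, self-contained sketch on toy tokens. The function name `remove_stopwords` and the toy data are illustrative assumptions, not part of `helpers`. It contrasts removing items from a list while looping over it, which can leave stopwords behind, with building a new filtered list.

```python
def remove_stopwords(tokens, stopwords):
    """Return a new list without the stopwords; the input list is left untouched."""
    stop_set = set(stopwords)  # set membership tests are O(1) on average
    return [tok for tok in tokens if tok not in stop_set]


# Toy tokens (hypothetical); note the two consecutive stopwords.
tokens = ["state", "state", "secretary", "call"]

# In-place removal while iterating: after the first 'state' is removed, the
# second one slides into the slot the loop has already visited, so it is
# never inspected and survives.
buggy = list(tokens)
for tok in buggy:
    if tok == "state":
        buggy.remove(tok)

# Building a new list visits every token exactly once.
clean = remove_stopwords(tokens, ["state"])

print(buggy)
print(clean)
```

The comprehension also avoids the repeated linear scans that `list.remove` performs, which matters when the loop runs over every token of every email.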

We build the dictionary and the corpus in the format that gensim expects for LDA topic modelling.


In [94]:
dictionary = corpora.Dictionary(email_text)
corpus = [dictionary.doc2bow(text) for text in email_text]
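
To make the corpus format concrete, the following pure-Python sketch mimics the idea behind the dictionary and the bag-of-words conversion on two toy documents. It is a stand-in for illustration only: `build_vocabulary` and `to_bow` are hypothetical helpers, and the integer ids that gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Toy documents (hypothetical), already tokenized and cleaned.
toy_docs = [["meeting", "obama", "meeting"], ["obama", "israel"]]
toy_vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Each document becomes a sparse list of (token id, count) pairs, so word order is discarded and only the frequencies are passed to the topic model.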

Finally, we run the topic modelling for different numbers of topics and explore the results using pyLDAvis. We look for a number of topics at which the clusters are disjoint and meaningful, while keeping the number of topics small and covering as many of the words as possible. We start with just 2 topics and gradually go up to 30. Below, we give a general analysis of the trend as the number of topics increases. The cell shows only the run with 7 topics; the results for the other values are not shown here.

In [102]:
topics = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics = 7,  passes = 10)
email_data = pyLDAvis.gensim.prepare(topics, corpus, dictionary)
pyLDAvis.display(email_data)

With only 2 topics, the topics are well separated, but one of them consists almost solely of numbers while the other contains common political terms. The two topics seem to amount to 'relevant' and 'not relevant'.
With 3 topics, the topics are still well separated, but they are quite general and it is difficult to assign names to them. One of the clusters again seems to consist of just irrelevant terms.
The topics are slightly clearer with 4 topics. One of them seems to be about the existing government, with terms like 'obama', 'govern', 'american' and 'president', and another about the elections, with terms like 'democrat', 'republican', etc.
With 5 and 6 topics, one cluster lies completely inside another, so neither value looks like a good candidate.
7 topics seems to hit the sweet spot, with each cluster looking meaningful and being almost disjoint from the others. We conclude that the best number of topics seems to be 7. The topics are broadly 'secretary office', 'obama', 'israel/palestine', 'work', 'diplomacy', 'government' and one which is difficult to put a name to. Increasing the number of topics beyond this seems to lead to a lot of overlap between the topics, even though the topics become more specific.
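
The judgment of how disjoint the clusters are was made by eye. As a rough numeric complement, one could compare the top words of each topic pairwise. The sketch below is an assumption about how such a check might look, not the method used in this notebook and not the distance that pyLDAvis plots: `jaccard` and `mean_topic_overlap` are hypothetical helpers and the word lists are toy data. In the notebook, the word lists could presumably be taken from the trained model, for example from the words returned by `topics.show_topic(i)`.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Jaccard similarity of two word lists: |intersection| / |union|."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_topic_overlap(topic_words):
    """Average Jaccard similarity over all pairs of topics (0.0 means fully disjoint)."""
    pairs = list(combinations(topic_words, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy top-word lists (hypothetical); topics 0 and 2 share two words.
toy_topics = [
    ["obama", "president", "govern"],
    ["israel", "palestine", "peace"],
    ["obama", "govern", "american"],
]
overlap = mean_topic_overlap(toy_topics)
print(round(overlap, 3))
```

A lower value for a given number of topics would suggest more distinct topics, but it says nothing about whether the topics are meaningful, so it can only supplement the visual inspection.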