# Homework 05 - Topic modeling

_Goal_ :

**We want to run topic modeling over the corpus to "discover" the main topics of the emails.**

_Tools_ :

**The tools used are :**

* pandas
* [gensim](https://radimrehurek.com/gensim/index.html)

_Contents_ :

* [1 - Loading data](#1---Loading-data)
* [2 - Topic modeling](#2---Topic-modeling)
* [3 - Tweaking the number of topics](#3---Tweaking-the-number-of-topics)

---

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# For DataFrame pretty-printing
from IPython.display import display

%matplotlib inline
sns.set_context('notebook')
%config InlineBackend.figure_format = 'retina'

# 1 - Loading data

First, we load the `Emails.csv` file into a pandas DataFrame.

In [3]:
emails = pd.read_csv('hillary-clinton-emails/Emails.csv')
emails.head()

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\nFriday, March 11,...",B6\nUNCLASSIFIED\nU.S. Department of State\nCa...


# 2 - Topic modeling

Here, we use the `ExtractedBodyText` column as our raw corpus, dropping the emails that have no extracted body.

In [4]:
raw_texts = emails.ExtractedBodyText.dropna()
raw_texts[:5]

1    B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...
2                                                  Thx
4    H <hrod17@clintonemail.com>\nFriday, March 11,...
5    Pis print.\n-•-...-^\nH < hrod17@clintonernail...
7    H <hrod17@clintonemail.corn>\nFriday, March 11...
Name: ExtractedBodyText, dtype: object

Since this data is raw, we first have to pre-process it so that our results reflect meaningful topics. To do so, we simply apply the same steps as in the first assignment (tokenization, stop-word removal and stemming).

In [5]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Tokenize each email body
tokens = [nltk.word_tokenize(text) for text in raw_texts]

# English stop words (NLTK's list is all lowercase) plus punctuation tokens
stop_words = set(stopwords.words('english'))
punctuation = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '-', '--', '...', '•', "''", '""', '``', '@', '<', '>', "'s"]
stop_words.update(punctuation)

# Drop stop words and single-character tokens (tokens still have their original case here)
filtered_tokens = [[t for t in tok if (t not in stop_words and len(t)>1)] for tok in tokens]

# Stem the remaining tokens with the Porter stemmer
ps = PorterStemmer()
stemmed_tokens = [[ps.stem(t) for t in tok] for tok in filtered_tokens]

# Lowercase the stemmed tokens
texts = [[t.lower() for t in tok] for tok in stemmed_tokens]
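
One detail of this pipeline is worth illustrating: NLTK's English stop-word list is all lowercase, so a filter applied before lowercasing lets capitalized stop words such as "The" through. The sketch below shows the effect on a toy sentence. The three-word stop set and the whitespace tokenizer are stand-ins chosen for illustration, not the NLTK resources used above.

```python
# Toy stand-ins (assumptions): a tiny stop-word set and whitespace tokenization
toy_stop_words = {"the", "is", "a"}
toy_tokens = "The meeting is in the State Department".split()

# Order used above: filter first, lowercase afterwards
filter_then_lower = [t.lower() for t in toy_tokens if t not in toy_stop_words]

# Alternative order: lowercase first, then filter
lowered = [t.lower() for t in toy_tokens]
lower_then_filter = [t for t in lowered if t not in toy_stop_words]

print(filter_then_lower)   # still contains 'the' (from the capitalized "The")
print(lower_then_filter)   # 'the' is gone
```

This is a likely reason why `the` shows up among the top words of several topics in the results; lowercasing before filtering would probably remove it.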

In [6]:
texts[1]

['thx']

The previous output shows that some emails are very short and carry little signal for topic modeling (for example, a body that is just the word "thx"). We therefore keep only the emails that contain more than a fixed number of tokens after pre-processing.

*NB: Another possibility would be to consider **threads** (i.e. emails and their replies) as documents instead of single emails. We do **not** follow this approach here, since linking emails together is rather time-consuming.*
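
For illustration only, a crude approximation of threads is to group emails whose subject lines match once reply and forward prefixes are stripped. The sketch below assumes the subjects come from a column such as `MetadataSubject`; the helper names, the prefix pattern and the toy data are all hypothetical, and this is not the approach used in this notebook.

```python
import re
from collections import defaultdict

# Hypothetical prefix pattern: leading "RE:", "FW:" or "FWD:" markers, possibly repeated
PREFIX_RE = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", flags=re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply/forward prefixes, then normalize case and whitespace."""
    stripped = PREFIX_RE.sub("", subject)
    return " ".join(stripped.lower().split())

def group_by_subject(subjects, bodies):
    """Concatenate the bodies of emails that share a normalized subject."""
    grouped = defaultdict(list)
    for subject, body in zip(subjects, bodies):
        grouped[normalize_subject(subject)].append(body)
    return {key: "\n".join(parts) for key, parts in grouped.items()}

# Toy data (made up for the example)
toy_subjects = ["Schedule", "RE: Schedule", "Fw: RE: schedule", "Trip notes"]
toy_bodies = ["See attached.", "Thx", "Pls print.", "Arriving at 9."]

toy_threads = group_by_subject(toy_subjects, toy_bodies)
for subject, text in toy_threads.items():
    print(repr(subject), "->", repr(text))
```

Subject matching is only a heuristic: unrelated emails can share a subject, and replies with edited subjects would be missed.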

In [7]:
print("Before removing short emails :", len(texts), "emails")

threshold = 5
long_texts = [text for text in texts if len(text) > threshold]

print("After removing short emails :", len(long_texts), "emails")

Before removing short emails : 6742 emails
After removing short emails : 4256 emails


Now we can build a dictionary that maps each distinct token of the filtered corpus to an integer id. Then, we build the actual corpus that will be used for topic modeling: it is a list with one entry per document, where each entry is a list of (word id, number of occurrences in the document) pairs, with ids taken from the dictionary.

In [8]:
dictionary = corpora.Dictionary(long_texts)
corpus = [dictionary.doc2bow(text) for text in long_texts]
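
To make this representation concrete, here is a pure-Python sketch that produces the same kind of structure on two toy documents. It is not gensim's implementation: the function names are made up for the example, and the ids gensim assigns to tokens may differ.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to an integer id (assigned in sorted order here)."""
    vocabulary = sorted({token for doc in documents for token in doc})
    return {token: idx for idx, token in enumerate(vocabulary)}

def to_bag_of_words(document, token2id):
    """Return sorted (token_id, count) pairs for tokens present in the vocabulary."""
    counts = Counter(token2id[t] for t in document if t in token2id)
    return sorted(counts.items())

# Toy documents (made up for the example)
toy_docs = [["meet", "offic", "meet"], ["call", "offic"]]

token2id = build_vocabulary(toy_docs)
toy_corpus = [to_bag_of_words(doc, token2id) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

Word order inside each document is discarded; only the counts remain, which is the input LDA expects.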

Before running LDA, let's just define a helper function to print the results in a nicer way.

In [9]:
# Helper function to print topics in a nicer way
def print_lda_topics(lda_model, numtopics, with_probabilities=True):
    for topic_id, topic_words in lda_model.show_topics(num_topics=numtopics, formatted=False):
        words = list(map(lambda x: x[0], topic_words))
        probabilities = list(map(lambda x: round(x[1],4), topic_words))
        if with_probabilities:
            print("Topic", topic_id)
            display(pd.DataFrame([probabilities], columns=words))
        else:
            print("Topic", topic_id, end=' : ')
            print(' | '.join(words))

Finally, we can run LDA and print the resulting topics. We choose to focus on 5 topics for this first example.

In [10]:
lda = LdaModel(corpus, num_topics=5, id2word=dictionary)

print_lda_topics(lda, 5)

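Topic 0

| | pm | offic | secretari | depart | state | room | meet | arriv | rout | confer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0362 | 0.0184 | 0.0184 | 0.0179 | 0.0134 | 0.0111 | 0.0106 | 0.0067 | 0.0065 | 0.006 |

Topic 1

| | see | n't | the | would | work | know | call | state | go | also |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0059 | 0.0057 | 0.0055 | 0.0052 | 0.0052 | 0.0049 | 0.0048 | 0.0048 | 0.0048 | 0.0046 |

Topic 2

| | 2010 | the | state | pm | obama | new | would | american | clintonemail.com | said |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0086 | 0.0075 | 0.0064 | 0.0057 | 0.0054 | 0.005 | 0.0048 | 0.0041 | 0.004 | 0.004 |

Topic 3

| | the | state | would | new | american | presid | diplomaci | obama | support | nation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0125 | 0.0052 | 0.005 | 0.0042 | 0.0041 | 0.0039 | 0.0038 | 0.0037 | 0.0037 | 0.0036 |

Topic 4

| | pm | 1.4 | secretari | meet | offic | call | 2010 | work | time | b1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0112 | 0.0083 | 0.0082 | 0.0063 | 0.0052 | 0.005 | 0.0046 | 0.0044 | 0.0042 | 0.0041 |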


From this first run, the topics are hard to tell apart: frequent tokens such as `the`, `state`, `would` and `pm` appear among the top words of several topics. In the following section, we try different values for the number of topics and discuss the results.
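
The overlap between topics can be made concrete by comparing their top-10 word lists. The sketch below copies the five lists from the output above and computes, as one possible measure, the Jaccard similarity between each pair of topics. The choice of Jaccard similarity and the "at least 3 topics" cutoff are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Top-10 words per topic, copied from the output above
top_words = {
    0: ["pm", "offic", "secretari", "depart", "state", "room", "meet", "arriv", "rout", "confer"],
    1: ["see", "n't", "the", "would", "work", "know", "call", "state", "go", "also"],
    2: ["2010", "the", "state", "pm", "obama", "new", "would", "american", "clintonemail.com", "said"],
    3: ["the", "state", "would", "new", "american", "presid", "diplomaci", "obama", "support", "nation"],
    4: ["pm", "1.4", "secretari", "meet", "offic", "call", "2010", "work", "time", "b1"],
}

def jaccard(a, b):
    """Jaccard similarity of two word lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Number of topics in which each word appears
topic_counts = Counter(word for words in top_words.values() for word in set(words))
shared = {word: n for word, n in topic_counts.items() if n >= 3}
print("Words shared by at least 3 topics:", shared)

# Pairwise overlap, most similar pairs first
overlaps = {(i, j): jaccard(top_words[i], top_words[j]) for i, j in combinations(top_words, 2)}
for pair, score in sorted(overlaps.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

With these lists, topics 2 and 3 overlap the most, while topics 3 and 4 share no top word.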

# 3 - Tweaking the number of topics

Here, we explore how the results change with the number of topics given to LDA. To keep the output compact, we do not print the probabilities.

In [11]:
num_topics = [10, 20, 30]

for n in num_topics:
    # Print parameter choice
    print("\t\tnum_topics =",n)
    
    # Run LDA
    lda = LdaModel(corpus, num_topics=n, id2word=dictionary)
    # Print results
    print_lda_topics(lda, n, with_probabilities=False)
    
    # Print separator
    print("===============================================")

		num_topics = 10
Topic 0 : state | diplomaci | the | u.s. | new | unit | fco | depart | forc | secretari
Topic 1 : 2010 | pm | re | b6 | bloomberg | am | state.gov | know | happi | want
Topic 2 : the | state | new | would | work | need | presid | make | n't | u.s.
Topic 3 : the | american | palin | n't | israel | would | in | like | 2010 | said
Topic 4 : call | the | work | email | today | pleas | state.gov | need | like | if
Topic 5 : pm | secretari | offic | depart | meet | room | state | arriv | rout | confer
Topic 6 : n't | call | state | talk | u.s. | would | want | the | bibi | he
Topic 7 : the | would | new | obama | support | govern | state | american | presid | said
Topic 8 : 1.4 | pm | get | b1 | call | b6 | today | see | want | 2010
Topic 9 : labour | the | pm | 2010 | parti | new | ed | would | miliband | david
		num_topics = 20
Topic 0 : bloomberg | go | said | get | know | last | 'm | would | the | need
Topic 1 : state | the | new | govern | log | in | da | greek | beaut

**Discussion on the results :**

As before, the output is somewhat hard to read, because short "words" were not excluded and Porter stemming truncates words. Still, some topics stand out, such as *Topic 11* for `num_topics = 20`, which is clearly about the Israeli–Palestinian conflict, or *Topic 2* for the same parameter, which seems to be about diplomacy.

However, most topics seem irrelevant. Better pre-processing would probably help, since a lot of short "meaningless" words remain, as well as common "communication vocabulary".
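
One possible form of such extra pre-processing is to drop tokens by length and by document frequency. The sketch below is a pure-Python illustration on toy documents; the function name and all three thresholds are made up for the example and have not been tuned on the email corpus. On a real gensim dictionary, `Dictionary.filter_extremes` (with its `no_below` and `no_above` parameters) offers similar frequency-based filtering.

```python
from collections import Counter

def filter_by_document_frequency(documents, min_docs=2, max_fraction=0.5, min_length=3):
    """Keep tokens that are long enough and neither too rare nor too common.

    The thresholds are illustrative defaults, not tuned values.
    """
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = max_fraction * len(documents)
    keep = {
        token for token, df in doc_freq.items()
        if min_docs <= df <= max_docs and len(token) >= min_length
    }
    return [[t for t in doc if t in keep] for doc in documents]

# Toy documents (made up for the example)
toy_docs = [
    ["the", "pm", "israel", "talk"],
    ["the", "pm", "labour", "parti"],
    ["the", "israel", "talk", "bibi"],
    ["the", "labour", "parti", "vote"],
]

filtered_docs = filter_by_document_frequency(toy_docs)
print(filtered_docs)
```

In this toy case, `the` is dropped for appearing in every document, `pm` for being too short, and `bibi` and `vote` for appearing only once.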