# Word clouds

Generate a word cloud based on the raw corpus; I recommend using the Python word_cloud library. With the help of nltk (already available in your Anaconda environment), implement a standard text pre-processing pipeline (e.g., tokenization, stopword removal, stemming) and generate a new word cloud. Briefly discuss the pros and cons (if any) of the two word clouds you generated.

In [1]:
import os
from datetime import datetime
from os import path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud

%matplotlib inline

# Getting the text

We start by reading the mails into a DataFrame to get an idea of how the data is structured.

In [28]:
df = pd.read_csv("hillary-clinton-emails/Emails.csv")
df.iloc[:, :8].head(3)

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00


In [29]:
df.iloc[:, 8:13].head(3)

Unnamed: 0,MetadataPdfLink,MetadataCaseNumber,MetadataDocumentClass,ExtractedSubject,ExtractedTo
0,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,HRC_Email_296,FW: Wow,
1,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,HRC_Email_296,,
2,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,HRC_Email_296,Re: Chris Stevens,B6


In [30]:
df.iloc[:, 13:18].head(3)

Unnamed: 0,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber
0,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545
1,,,,F-2015-04841,C05739546
2,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547


In [31]:
df.iloc[:, 18:].head(3)

Unnamed: 0,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...


We observe that there are a lot of metadata columns.

For our word cloud, we are interested in the text written in Hillary Clinton's mails. We will therefore focus on the subjects and bodies of the mails. However, as the first row shows, some extracted body texts are empty. We quantify this to see whether it can cause problems.

In [32]:
print("Total number of mails", df.shape[0])
print("Number of mails without extracted body text:", df[df["ExtractedBodyText"].isnull()].shape[0])
print("Number of mails without raw text:", df[df["RawText"].isnull()].shape[0])

Total number of mails 7945
Number of mails without extracted body text: 1203
Number of mails without raw text: 0


About **15.1%** (1203/7945) of the mails have no extracted body text. We decide to work with the extracted text that is available rather than re-extracting the missing bodies from the raw text. The word cloud only gives an idea of the words used; it is not a strict process, so we can afford to lose some data.
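
The share of mails without an extracted body can be recomputed from the counts printed above. This is only a small arithmetic check; the variable names are illustrative and the two counts are copied by hand from the cell output.

```python
# Counts copied from the output printed above
total_mails = 7945
missing_body = 1203

# Fraction of mails with no extracted body text
missing_share = missing_body / total_mails

print(f"Without extracted body text: {missing_share:.1%}")
print(f"With extracted body text: {1 - missing_share:.1%}")
```

On the DataFrame itself, `df["ExtractedBodyText"].isnull().mean()` should give the same fraction.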

In [33]:
# Concatenate all bodies, separated by spaces so that words from adjacent mails do not merge
# (missing bodies are skipped by str.cat)
body = df["ExtractedBodyText"].str.cat(sep=" ").replace("\n", " ")
# Concatenate all subjects, keeping only the text after the last ":" (drops RE:, FW:, FVV:, etc.)
# dropna() avoids turning missing subjects into the literal string "nan"
subjects = df["ExtractedSubject"].dropna().str.split(":").str[-1]
subjects = subjects.str.cat(sep=" ").replace("\n", " ")
# Join all texts
text = body + " " + subjects
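
Keeping only the text after the last ":" also truncates subjects that contain a colon of their own. A possible alternative, sketched here under assumptions, is to strip only known reply/forward markers with a regular expression. The function name `strip_prefixes` and the prefix list are hypothetical and would need to be adapted to the prefixes actually present in the data.

```python
import re

# Prefixes to strip; "fvv" is assumed to be an OCR misreading of "FW".
PREFIX_RE = re.compile(r"^\s*(?:(?:re|fw|fwd|fvv)\s*:\s*)+", re.IGNORECASE)


def strip_prefixes(subject):
    """Remove leading reply/forward markers, keeping the rest of the subject intact."""
    return PREFIX_RE.sub("", subject).strip()


print(strip_prefixes("FW: Wow"))
print(strip_prefixes("RE: FW: Meeting: agenda"))
```

With pandas, this could be applied as `df["ExtractedSubject"].dropna().map(strip_prefixes)` before concatenating the subjects.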

In [34]:
mask = np.array(Image.open("images/hillary.png"))

# Generate a word cloud image from the raw text; the empty stopword set keeps every word
wordcloud = WordCloud(background_color="black", max_words=1000, mask=mask, stopwords=set())
wordcloud.generate(text)

# Store to file
wordcloud.to_file("images/cloud_brute.png")

<wordcloud.wordcloud.WordCloud at 0x7fd9d20b2780>

This gives us the following result:

<img src="images/cloud_brute.png" alt="Drawing" style="width: 500px;"/>

In [35]:
from wordcloud import STOPWORDS
stopwords = set(STOPWORDS)

# Generate a word cloud image
wordcloud = WordCloud(background_color="black", max_words=1000, mask=mask,
               stopwords=stopwords)

wordcloud.generate(text)

# store to file
wordcloud.to_file("images/cloud_standard_stop_words.png")

<wordcloud.wordcloud.WordCloud at 0x7fd9d2095438>

This gives us the following result:

<img src="images/cloud_standard_stop_words.png" alt="Drawing" style="width: 500px;"/>
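
To see why filtering stopwords changes the picture, it can help to compare word frequencies with and without the filter. The sketch below uses only the standard library; the sample sentence and the tiny stopword set are made up for illustration and are not taken from the corpus or from `wordcloud.STOPWORDS`.

```python
from collections import Counter

# Hypothetical sample text and stopword subset, for illustration only
sample = "the call is at the office and the meeting is on the call sheet"
demo_stopwords = {"the", "is", "at", "and", "on"}

tokens = sample.lower().split()

# Frequencies on the raw tokens: function words dominate
raw_counts = Counter(tokens)

# Frequencies after dropping stopwords: content words come first
filtered_counts = Counter(t for t in tokens if t not in demo_stopwords)

print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Since a word cloud sizes words by frequency, the same effect shows up as large function words in the unfiltered image.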

In [36]:
from nltk.stem import SnowballStemmer

# Stem every whitespace-separated token with the English Snowball stemmer
stemmer = SnowballStemmer("english")
stemmed_text = ' '.join([stemmer.stem(t) for t in text.split()])
# Generate a word cloud image
wordcloud = WordCloud(background_color="black", max_words=1000, mask=mask,
               stopwords=stopwords)

wordcloud.generate(stemmed_text)

# store to file
wordcloud.to_file("images/cloud_stemmed.png")

<wordcloud.wordcloud.WordCloud at 0x7fd9d2099080>

This gives us the following result:

<img src="images/cloud_stemmed.png" alt="Drawing" style="width: 500px;"/>
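
The pre-processing steps named in the task (tokenization, stopword removal, stemming) can also be chained explicitly with nltk. The following is a minimal sketch, not the pipeline used for the images above: the `preprocess` helper and the small `DEMO_STOPWORDS` set are hypothetical, and a regular-expression tokenizer is chosen here because it needs no additional nltk data downloads.

```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical stopword subset for illustration; the notebook uses wordcloud.STOPWORDS
DEMO_STOPWORDS = {"the", "are", "and", "to", "of", "a"}

tokenizer = RegexpTokenizer(r"[a-z]+")  # keep runs of lowercase letters only
stemmer = SnowballStemmer("english")


def preprocess(raw_text, stopwords=DEMO_STOPWORDS):
    """Lowercase, tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = tokenizer.tokenize(raw_text.lower())
    kept = [t for t in tokens if t not in stopwords]
    return [stemmer.stem(t) for t in kept]


print(preprocess("The meetings are running, and the calls are scheduled."))
```

Stopwords are removed before stemming in this sketch because stemmed forms may no longer match the entries of a stopword list. The resulting tokens can be joined with `" ".join(...)` and passed to `wordcloud.generate`.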