# ADA / Applied Data Analysis
<h2 style="color:#a8a8a8">Homework 5 - Taming text<br>
Aimée Montero, Alfonso Peterssen, Cyriaque Brousse</h2>

## Background
In this homework you will explore a relatively large corpus of emails publicly released during the
[Hillary Clinton email controversy](https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy).
You can find the corpus in the `hillary-clinton-emails` directory of this repository, while more detailed information 
about the [schema is available here](https://www.kaggle.com/kaggle/hillary-clinton-emails).

## Assignment
1. Generate a word cloud based on the raw corpus -- I recommend using the [Python word_cloud library](https://github.com/amueller/word_cloud).
With the help of `nltk` (already available in your Anaconda environment), implement a standard text pre-processing 
pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and
cons (if any) of the two word clouds you generated.<br><br>

2. Find all the mentions of world countries in the whole corpus, using the `pycountry` utility (*HINT*: remember that
there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.)
Perform sentiment analysis on every email message using the demo methods in the `nltk.sentiment.util` module. Aggregate 
the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level)
that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo
methods from the sentiment analysis module -- can you find substantial differences?<br><br>

3. Using the `models.ldamodel` module from the [gensim library](https://radimrehurek.com/gensim/index.html), run topic
modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle for the parameter which
returns topics that you consider to be meaningful at first sight.<br><br>

4. *BONUS*: build the communication graph (unweighted and undirected) among the different email senders and recipients
using the `NetworkX` library. Find communities in this graph with the `community.best_partition(G)` method from the 
[community detection module](http://perso.crans.org/aynaud/communities/index.html). Print the most frequent 20 words used
by the email authors of each community. Do these word lists look similar to what you've produced at step 3 with LDA?
Can you identify clear discussion topics for each community? Discuss briefly the obtained results.

## Part 1 - Word clouds

Let's first import the required libraries for the homework:

In [None]:
from os import path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Let's take a look at our data:

In [None]:
data = pd.read_csv("./hillary-clinton-emails/Emails.csv")
data.head(5)

We keep only the two fields that contain text: `ExtractedSubject` and `ExtractedBodyText`.<br>
We drop the `NA` values in both fields, since they contain no text.

In [None]:
raw_mails = ' '.join(list(data.ExtractedSubject.dropna())
                   + list(data.ExtractedBodyText.dropna()))
raw_mails[:130]

### First wordcloud attempt

We first generate a word cloud from the raw text, without any preprocessing:

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud().generate(raw_mails)

In [None]:
plt.subplots(figsize=(10,15))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

A first problem is that the cloud contains many noise tokens that carry little information, such as "Fw", "Re", "pm" and "am".<br>
We also notice common words like "new", "call" and "one" that carry little information either. Such uninformative tokens, called **stopwords**, need to be removed.

### Tokenization

For now, `raw_mails` is just one long string in which all the email subjects and bodies are concatenated:

In [None]:
print(type(raw_mails), len(raw_mails))

What we want is to tokenize this string into separate words. This is done with a **tokenizer**:

In [None]:
from nltk import word_tokenize

In [None]:
tokens = word_tokenize(raw_mails)
tokens[:10]
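
To make the idea concrete without any downloaded NLTK data, here is a minimal regex-based sketch of tokenization. It is a simplification and not what `word_tokenize` does internally; the helper name `simple_tokenize` and the sample string are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper: a rough stand-in for nltk.word_tokenize,
    which treats contractions and abbreviations differently.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character at a time.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Fw: Call with the Secretary at 5 pm."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Punctuation marks come out as separate tokens, which is why a later step has to filter them out.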

### Lowercasing

We convert all the tokens to lowercase, with the notable exception of `US` (the country), which would otherwise become indistinguishable from `us` (the pronoun). We also leave `U.S.` untouched.

In [None]:
# Keep 'US' and 'U.S.' as they are; lowercase every other token.
lowercase_tokens = [w if w in ('US', 'U.S.') else w.lower() for w in tokens]
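
Whether a token is kept or dropped depends on where the condition sits in the comprehension: a conditional expression at the front maps every token, whereas a trailing `if` filters tokens out. The sketch below shows the mapping variant on a made-up token list; `KEEP_CASE` and `lowercase_except` are illustrative names, not part of NLTK.

```python
# Hypothetical set of tokens whose case carries meaning.
KEEP_CASE = {"US", "U.S."}

def lowercase_except(tokens, keep=KEEP_CASE):
    """Lowercase every token unless it is listed in `keep`."""
    return [tok if tok in keep else tok.lower() for tok in tokens]

sample_tokens = ["The", "US", "told", "us", "about", "U.S.", "Policy"]
lowered = lowercase_except(sample_tokens)
print(lowered)
```

The output has the same length as the input, so no mention of the country is lost.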

### Punctuation and stopwords

To remove the stopwords, we use a **stopword dictionary** provided by the NLTK API.<br>
Additionally, we remove the punctuation marks.

In [None]:
from string import punctuation
from nltk.corpus import stopwords

In [None]:
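custom_sw = ['pm', 'am', 're', 'fw', 'fvv', '…', 'n\'t']
# `punctuation` is a str, so it cannot be concatenated to a list: build a set instead.
stop = set(stopwords.words('english')) | set(punctuation) | set(custom_sw)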

In [None]:
filtered_tokens = list(filter(lambda w: w not in stop, lowercase_tokens))
filtered_tokens[:10]
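
The filtering step can be tried on a small scale without the NLTK corpus. In this sketch the stopword list is a tiny hand-picked stand-in for `stopwords.words('english')`, and all variable names are illustrative. It builds the stop collection as a set, since `string.punctuation` is a plain string of characters and membership tests on a set are fast.

```python
from string import punctuation

# Tiny hand-picked list, standing in for stopwords.words('english').
toy_stopwords = ["the", "a", "to", "of", "and"]
# Corpus-specific noise tokens (illustrative).
toy_custom = ["pm", "am", "re", "fw"]

# set(punctuation) yields one element per punctuation character.
stop_set = set(toy_stopwords) | set(punctuation) | set(toy_custom)

tokens = ["fw", ":", "call", "to", "the", "secretary", "at", "5", "pm", "."]
kept = [tok for tok in tokens if tok not in stop_set]
print(kept)
```

Here `at` survives only because the toy list is incomplete; the real NLTK list is much longer.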

### Stemming

We now map each token to its stem, so that inflected forms of a word are counted together. For example:
- `reading => read`
- `reads => read`
- `read => read`

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in filtered_tokens]
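
To illustrate how rule-based suffix stripping behaves, here is a deliberately naive sketch. It is not the Porter algorithm, which applies many more rules and conditions; `naive_stem` and its three suffixes are assumptions chosen only for this example.

```python
def naive_stem(word):
    """Strip a few common English suffixes (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # Rewriting a final "y" as "i" is the kind of rule that yields non-words.
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"
    return word

words = ["reading", "reads", "read", "called", "secretary"]
toy_stems = {w: naive_stem(w) for w in words}
print(toy_stems)
```

Because the rules only look at word endings, the result of stemming is not guaranteed to be a dictionary word.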

In [None]:
wordcloud = WordCloud().generate(' '.join(stems))
plt.subplots(figsize=(10,15))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

One problem now is that some stems are not actual words. For example, `secretary` was stemmed to `secretari`, which is not an English word.

### Lemmatization

Like stemming, lemmatization reduces each token to a more general form. The difference is that it maps the token to its dictionary form (its lemma), which is an actual word, rather than to a truncated stem.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]

In [None]:
wordcloud = WordCloud(max_font_size=40).generate(' '.join(lemmas))
plt.subplots(figsize=(10,15))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

This result is more satisfying: the tokens are now actual dictionary words (`secretary` is no longer truncated to `secretari`).
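
A word cloud is ultimately a picture of token frequencies, so the effect of each preprocessing step can also be checked numerically. The sketch below uses `collections.Counter` on two made-up token lists; `top_tokens` is a hypothetical helper. It shows how merging inflected forms changes the counts that drive word sizes.

```python
from collections import Counter

def top_tokens(tokens, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# Illustrative data: the same text before and after normalization.
raw = ["state", "states", "call", "calls", "call", "state"]
merged = ["state", "state", "call", "call", "call", "state"]

raw_counts = dict(top_tokens(raw))
merged_counts = dict(top_tokens(merged))
print(raw_counts)
print(merged_counts)
```

A frequency dictionary of this kind can also be passed to `WordCloud.generate_from_frequencies` instead of joining the tokens back into a string.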

## Part 2 - Countries

In [None]:
import pycountry
import difflib

In [None]:
pycountry.countries

In [None]:
country_names = [ x.name.lower() for x in pycountry.countries]

In [None]:
country_codes = [[x.alpha_2.lower(), x.alpha_3.lower()] for x in pycountry.countries]
country_codes = [item for country_code_sublist in country_codes for item in country_code_sublist]

In [None]:
# Map every surface form (name, 2-letter code, 3-letter code) to the lowercase country name.
country_dic_list = [ {x.name.lower()    :x.name.lower(),
                      x.alpha_2.lower() :x.name.lower(), 
                      x.alpha_3.lower() :x.name.lower()} for x in pycountry.countries ]

country_dic = {k: v for dic in country_dic_list for k, v in dic.items()}
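
The same lookup can be sketched on a hand-written list, which makes its behavior easy to inspect. The `Country` tuple below is a stand-in for the objects yielded by `pycountry.countries` (it only mimics their `name`, `alpha_2` and `alpha_3` attributes), and `build_lookup` is an illustrative helper.

```python
from collections import namedtuple

# Stand-in for pycountry entries, with real ISO 3166 codes.
Country = namedtuple("Country", ["name", "alpha_2", "alpha_3"])
toy_countries = [
    Country("Switzerland", "CH", "CHE"),
    Country("France", "FR", "FRA"),
    Country("India", "IN", "IND"),
]

def build_lookup(countries):
    """Map each lowercase surface form to the lowercase country name."""
    lookup = {}
    for c in countries:
        canonical = c.name.lower()
        for form in (c.name, c.alpha_2, c.alpha_3):
            lookup[form.lower()] = canonical
    return lookup

lookup = build_lookup(toy_countries)
print(lookup)
```

Note that lowercasing the codes makes some of them collide with ordinary words: `in` now maps to India. Such collisions are harmless only if those words were already removed as stopwords.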

In [None]:
# Share of tokens that match a country name or code.
test = [item for item in lemmas if item in country_dic]
len(test) / len(lemmas)

In [None]:
# Replace every country surface form with its canonical name; leave other tokens unchanged.
countries = [country_dic.get(item, item) for item in lemmas]
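
Once surface forms are mapped to canonical names, counting the mentions per country is a small step. This is a minimal sketch on made-up tokens, assuming a lookup dictionary shaped like `country_dic`; `toy_lookup` and the token list are illustrative only.

```python
from collections import Counter

# Illustrative lookup: surface form -> canonical country name.
toy_lookup = {
    "switzerland": "switzerland",
    "ch": "switzerland",
    "france": "france",
    "fr": "france",
}

tokens = ["meeting", "switzerland", "ch", "visit", "france", "call"]

# Count canonical names only for tokens that are known surface forms.
mentions = Counter(toy_lookup[tok] for tok in tokens if tok in toy_lookup)
# Fraction of tokens that refer to a country.
share = sum(mentions.values()) / len(tokens)
print(mentions, share)
```

Counts of this kind, computed per email rather than over the whole corpus, are what the sentiment aggregation by country would need.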