# Sentiment analysis

Find all the mentions of world countries in the whole corpus, using the pycountry utility (HINT: remember that there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.) Perform sentiment analysis on every email message using the demo methods in the nltk.sentiment.util module. Aggregate the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level) that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo methods from the sentiment analysis module -- can you find substantial differences?


In [325]:
import pandas as pd                                     
import numpy as np                                      
import os                         


import matplotlib.pyplot as plt

from datetime import datetime

%matplotlib inline
import seaborn as sns

import pycountry

# Create the country list

The first step is to build a list of all countries and, for each country, to collect as many different surface forms as possible (e.g. Switzerland, CH, Swiss Confederation).
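To illustrate why several surface forms matter, here is a minimal sketch of a matcher, assuming a hypothetical `FORMS` dictionary (not part of the notebook) that holds the names and codes of one country. Names are matched case-insensitively, while short codes are matched case-sensitively so that a code such as "CH" does not fire on ordinary lowercase text.

```python
import re

# Hypothetical surface forms for one country; the real lists are built later in the notebook.
FORMS = {"names": ["Switzerland"], "codes": ["CH", "CHE"]}

def mentions_country(text, forms):
    """Return True if a name (case-insensitive) or a code (case-sensitive) occurs as a whole word."""
    for name in forms["names"]:
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            return True
    for code in forms["codes"]:
        if re.search(r"\b" + re.escape(code) + r"\b", text):
            return True
    return False

print(mentions_country("Meeting in switzerland next week", FORMS))  # True
print(mentions_country("Flight to CH confirmed", FORMS))            # True
print(mentions_country("Let's chat about the check", FORMS))        # False
```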

## Getting list from pycountry

We use the pycountry utility to create a DataFrame holding every name it provides for each country.

In [368]:
countries = []
no_official = 0
tot = 0
for country in pycountry.countries:
    tot += 1
    official_name = None
    try:
        official_name = country.official_name
    except AttributeError:
        # In the pycountry version used here, a missing official name raises AttributeError
        no_official += 1
        
    countries.append([country.alpha_2, country.alpha_3, country.name, official_name])
    
countries_df = pd.DataFrame(countries)
countries_df.columns = ["Alpha2", "Alpha3", "English Name", "Official Name"]

print(no_official, "countries without official name, on a total of", tot, "countries")
countries_df.head()

76 countries without official name, on a total of 249 countries


Unnamed: 0,Alpha2,Alpha3,English Name,Official Name
0,AW,ABW,Aruba,
1,AF,AFG,Afghanistan,Islamic Republic of Afghanistan
2,AO,AGO,Angola,Republic of Angola
3,AI,AIA,Anguilla,
4,AX,ALA,Åland Islands,


## Data Cleaning

We observe that some country names, such as Åland Islands, contain accents. We therefore strip all accents from the names (and we will do the same for the email content).

In [369]:
countries_df[countries_df["Alpha2"] == "AX"]

Unnamed: 0,Alpha2,Alpha3,English Name,Official Name
4,AX,ALA,Åland Islands,


In [370]:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
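As a quick check of what the function above does, the sketch below repeats it and applies it to a few accented names. NFKD normalization splits an accented character into a base character plus combining marks, and the combining marks are then dropped. The sample strings are illustrative only.

```python
import unicodedata

def remove_accents(input_str):
    # NFKD splits each accented character into a base character plus combining marks
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    # Dropping the combining marks leaves only the base characters
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

print(remove_accents("Åland Islands"))  # Aland Islands
print(remove_accents("Côte d'Ivoire"))  # Cote d'Ivoire
print(remove_accents("Curaçao"))        # Curacao
```

Letters that have no Unicode decomposition, such as "ø", are left unchanged by this approach.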

In [371]:
# Strip accents column by column; assigning whole columns avoids chained-indexing writes
for column in countries_df.columns:
    countries_df[column] = countries_df[column].map(
        lambda value: remove_accents(value) if isinstance(value, str) else value
    )
countries_df.head()

Unnamed: 0,Alpha2,Alpha3,English Name,Official Name
0,AW,ABW,Aruba,
1,AF,AFG,Afghanistan,Islamic Republic of Afghanistan
2,AO,AGO,Angola,Republic of Angola
3,AI,AIA,Anguilla,
4,AX,ALA,Aland Islands,


In [372]:
countries_df[countries_df["Alpha2"] == "AX"]

Unnamed: 0,Alpha2,Alpha3,English Name,Official Name
4,AX,ALA,Aland Islands,


## Official Names Evaluation

We observe that many countries have no official name in pycountry. We want to evaluate the official names we did obtain in order to decide whether to keep them and, if so, how to handle them.

We first build a DataFrame containing only the countries that have an official name.

In [373]:
countries_with_official_names = countries_df[countries_df["Official Name"].notnull()]
print("Number of countries with official names", len(countries_with_official_names))
countries_with_official_names.head()

Number of countries with official names 173


Unnamed: 0,Alpha2,Alpha3,English Name,Official Name
1,AF,AFG,Afghanistan,Islamic Republic of Afghanistan
2,AO,AGO,Angola,Republic of Angola
5,AL,ALB,Albania,Republic of Albania
6,AD,AND,Andorra,Principality of Andorra
8,AR,ARG,Argentina,Argentine Republic


We expect most official names to consist of the simple name plus a prefix such as "Republic of", "State of", etc.

Some official names may nevertheless carry different information, so we list the countries whose official name does not contain the simple English name.

In [374]:
contained_filter = countries_with_official_names.apply(lambda x: x['English Name'].lower() not in x['Official Name'].lower(), axis=1)
print("Number of official name with new information", len(countries_with_official_names[contained_filter]), "/", len(countries_with_official_names))
countries_with_official_names[contained_filter]

Number of official name with new information 23 / 173


Unnamed: 0,Alpha2,Alpha3,English Name,Official Name
8,AR,ARG,Argentina,Argentine Republic
31,BO,BOL,"Bolivia, Plurinational State of",Plurinational State of Bolivia
41,CH,CHE,Switzerland,Swiss Confederation
58,CZ,CZE,Czechia,Czech Republic
75,FR,FRA,France,French Republic
77,FM,FSM,"Micronesia, Federated States of",Federated States of Micronesia
89,GR,GRC,Greece,Hellenic Republic
107,IR,IRN,"Iran, Islamic Republic of",Islamic Republic of Iran
111,IT,ITA,Italy,Italian Republic
118,KG,KGZ,Kyrgyzstan,Kyrgyz Republic


We observe that most of these official names in fact contain the adjective form of the country name. We now have to choose whether to keep this new information or discard it.

If we treated a country adjective in Hillary Clinton's emails as a reference to the country, we could introduce a large bias. For example, if she mentions "swiss cheese" or a "swiss knife" in an email, the email would be associated with Switzerland even though its content says nothing about the sentiment toward the country.

We therefore decide to **discard the adjectives** and use only the simple English name of each country.

*PS: Some official names contain interesting information, such as **Hellenic Republic for Greece**, which is a completely different expression. We nevertheless discard them, because we think these expressions are rarely used and are unlikely to appear in the emails to denote the countries.*
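The bias described above can be sketched with a made-up sentence. The `mentions` helper is hypothetical and only illustrates the reasoning: matching the adjective produces a hit on text that has nothing to do with the country, while matching the simple name does not.

```python
import re

def mentions(term, text):
    """Case-insensitive whole-word search for `term` in `text`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Made-up email text, not taken from the corpus
email = "Please bring a swiss knife and some swiss cheese to the picnic."

print(mentions("Swiss", email))        # True: a false positive for Switzerland
print(mentions("Switzerland", email))  # False
```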

In [375]:
countries_df = countries_df.drop(columns=["Official Name"])

## Countries Names Analysis

We now analyse the simple names of the countries in order to know how to handle them and how to use their words to associate emails with countries.

We start by looking at the composed (multi-word) names.

In [376]:
composed_filter = countries_df.apply(lambda x: len(x['English Name'].split(" ")) > 1, axis=1)
print("Number of composed names", len(countries_df[composed_filter]), "/", len(countries_df))
countries_df[composed_filter]

Number of composed names 80 / 249


Unnamed: 0,Alpha2,Alpha3,English Name
4,AX,ALA,Aland Islands
7,AE,ARE,United Arab Emirates
10,AS,ASM,American Samoa
12,TF,ATF,French Southern Territories
13,AG,ATG,Antigua and Barbuda
20,BQ,BES,"Bonaire, Sint Eustatius and Saba"
21,BF,BFA,Burkina Faso
26,BA,BIH,Bosnia and Herzegovina
27,BL,BLM,Saint Barthelemy
31,BO,BOL,"Bolivia, Plurinational State of"


We notice several things that need to be handled in the country names:
* Some names contain a comma
* Some names contain parentheses
* Some names contain "Republic"

### Names with comma

Some names contain a comma. We take a look at them in order to handle them properly.

In [377]:
comma_index = countries_df["English Name"].str.contains(",")
countries_df[comma_index]

Unnamed: 0,Alpha2,Alpha3,English Name
20,BQ,BES,"Bonaire, Sint Eustatius and Saba"
31,BO,BOL,"Bolivia, Plurinational State of"
46,CD,COD,"Congo, The Democratic Republic of the"
77,FM,FSM,"Micronesia, Federated States of"
107,IR,IRN,"Iran, Islamic Republic of"
122,KR,KOR,"Korea, Republic of"
139,MD,MDA,"Moldova, Republic of"
144,MK,MKD,"Macedonia, Republic of"
181,KP,PRK,"Korea, Democratic People's Republic of"
184,PS,PSE,"Palestine, State of"


We observe that, except for "Bonaire, Sint Eustatius and Saba", the text after the comma is only a qualifier describing the form of state (e.g. "Republic of"). We can simply keep the first part and discard the rest.

*We consider that "Bonaire, Sint Eustatius and Saba" can be discarded, and therefore do not treat it differently.*

In [378]:
countries_df["English Name"] = countries_df["English Name"].map(lambda x: x.split(",")[0])
countries_df[comma_index].head()

Unnamed: 0,Alpha2,Alpha3,English Name
20,BQ,BES,Bonaire
31,BO,BOL,Bolivia
46,CD,COD,Congo
77,FM,FSM,Micronesia
107,IR,IRN,Iran


### Names with parentheses

We decide to remove whatever is inside parentheses in the country names.

In [379]:
import re
par_index = countries_df["English Name"].apply(lambda x: "(" in str(x))
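# Drop "(...)" groups together with the preceding whitespace, then trim, so no stray spaces remain
countries_df["English Name"] = countries_df["English Name"].map(lambda x: re.sub(r'\s*\([^)]*\)', '', x).strip())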
countries_df[par_index].head()

Unnamed: 0,Alpha2,Alpha3,English Name
40,CC,CCK,Cocos Islands
74,FK,FLK,Falkland Islands
136,MF,MAF,Saint Martin
212,SX,SXM,Sint Maarten
236,VA,VAT,Holy See
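Removing a parenthesised group can leave a doubled or trailing space behind, which would later break exact string matching. The sketch below is a standalone illustration with a hypothetical helper `strip_parentheses`; the raw names are assumed to be the forms pycountry provided at the time this notebook was written.

```python
import re

def strip_parentheses(name):
    # Remove each "(...)" group together with the whitespace before it
    without = re.sub(r"\s*\([^)]*\)", "", name)
    # Collapse any leftover runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", without).strip()

raw_names = [
    "Cocos (Keeling) Islands",
    "Falkland Islands (Malvinas)",
    "Holy See (Vatican City State)",
]
for raw in raw_names:
    print(repr(strip_parentheses(raw)))

# For comparison, removing only the parentheses keeps both surrounding spaces
print(repr(re.sub(r"\([^)]*\)", "", "Cocos (Keeling) Islands")))
```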


## Countries with the same name

Unfortunately for us, some countries share the same name, or part of it.

* North Korea and South Korea
* Republic of Congo and Democratic Republic of Congo
* Ireland and Northern Ireland
* US Virgin Islands and UK Virgin Islands
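Such collisions can also be found mechanically by grouping the alpha-2 codes by cleaned name. The following is a minimal sketch on a handful of (code, name) pairs copied from the tables in this notebook; the variable names are illustrative only.

```python
from collections import defaultdict

# A few (alpha-2, cleaned name) pairs copied from the tables in this notebook
cleaned = [
    ("KR", "Korea"), ("KP", "Korea"),
    ("CD", "Congo"), ("CG", "Congo"),
    ("VG", "Virgin Islands"), ("VI", "Virgin Islands"),
    ("IE", "Ireland"),
]

# Group the codes by name
codes_by_name = defaultdict(list)
for code, name in cleaned:
    codes_by_name[name].append(code)

# Keep only the names shared by more than one country
ambiguous = {name: codes for name, codes in codes_by_name.items() if len(codes) > 1}
print(ambiguous)
```

On the full DataFrame, a similar check could be done with a group-by on the "English Name" column.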


### Koreas

In [380]:
countries_df[countries_df["English Name"].str.contains("Korea")]

Unnamed: 0,Alpha2,Alpha3,English Name
122,KR,KOR,Korea
181,KP,PRK,Korea


In the case of the Koreas, we see that after our cleaning both countries have exactly the same name. We therefore replace them with their most commonly used names: North Korea and South Korea.

In [381]:
# Use a single .loc[row, column] call so the assignment writes to countries_df itself
countries_df.loc[122, "English Name"] = "South Korea"
countries_df.loc[181, "English Name"] = "North Korea"
countries_df[countries_df["English Name"].str.contains("Korea")]

Unnamed: 0,Alpha2,Alpha3,English Name
122,KR,KOR,South Korea
181,KP,PRK,North Korea


### Congos

Unfortunately, unlike the Koreas, we expect to have no way of distinguishing the Republic of the Congo from the Democratic Republic of the Congo, since people usually call both simply "Congo". We therefore leave both as "Congo" and hope that the name does not appear too often in the emails. If it does, we will try to handle them in a smarter way later on.

In [382]:
countries_df[countries_df["English Name"].str.contains("Congo")]

Unnamed: 0,Alpha2,Alpha3,English Name
46,CD,COD,Congo
47,CG,COG,Congo


### Virgin Islands

As with the Congos, we leave both as "Virgin Islands" and expect no trouble from them.

In [383]:
countries_df[countries_df["English Name"].str.contains("Virgin")]

Unnamed: 0,Alpha2,Alpha3,English Name
239,VG,VGB,Virgin Islands
240,VI,VIR,Virgin Islands


## The curious case of the British Isles

We all know it, the Brits are a little bit special. The following map shows that the administration of the British Isles is somewhat complex.

Since the countries of these isles (England, Ireland, Wales, Northern Ireland, Scotland, etc.) have a significant impact on world politics, we want to be sure to detect them properly in the emails.

<img src="images/British_Isles_terms.gif" alt="Drawing" style="width: 420px;"/>

In [384]:
countries_df[countries_df["Alpha2"] == "UK"]

Unnamed: 0,Alpha2,Alpha3,English Name


In [385]:
countries_df[countries_df["Alpha2"] == "GB"]

Unnamed: 0,Alpha2,Alpha3,English Name
79,GB,GBR,United Kingdom


In [386]:
pycountry.countries.get(alpha_2 = "GB")

Country(alpha_2='GB', alpha_3='GBR', name='United Kingdom', numeric='826', official_name='United Kingdom of Great Britain and Northern Ireland')

We notice that pycountry does not use the alpha-2 code "UK" for the United Kingdom but "GB", even though the country really is the United Kingdom, comprising England, Wales, Scotland and Northern Ireland.
We will need to add keywords manually if we want to be sure to detect all mentions of these countries in the emails.

In [387]:
countries_df[countries_df["English Name"].str.contains("Ireland")]

Unnamed: 0,Alpha2,Alpha3,English Name
106,IE,IRL,Ireland


We observe that only the Republic of Ireland is associated with the word "Ireland".
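One possible way to handle manual keywords is sketched below. The `KEYWORDS` table is hypothetical (it is not the notebook's final keyword list) and maps each keyword to an alpha-2 code. Because `finditer` returns non-overlapping matches from left to right, "Northern Ireland" is consumed as a whole and the "Ireland" inside it is not counted again. Sorting the alternatives longest-first is a precaution for keywords that are prefixes of others, such as "Guinea" and "Guinea-Bissau".

```python
import re

# Hypothetical keyword table: the extra UK keywords are illustrative only
KEYWORDS = {
    "United Kingdom": "GB", "UK": "GB", "Great Britain": "GB",
    "England": "GB", "Scotland": "GB", "Wales": "GB", "Northern Ireland": "GB",
    "Ireland": "IE",
    "Guinea": "GN", "Guinea-Bissau": "GW",
}

# Longest keywords first, so a keyword is tried before any of its own prefixes
alternatives = sorted(KEYWORDS, key=len, reverse=True)
PATTERN = re.compile(r"\b(" + "|".join(re.escape(k) for k in alternatives) + r")\b")

def find_countries(text):
    """Return the alpha-2 code of every keyword found, in order of appearance."""
    return [KEYWORDS[m.group(1)] for m in PATTERN.finditer(text)]

print(find_countries("Talks in Northern Ireland, then a visit to Ireland and the UK."))
print(find_countries("Guinea-Bissau and Guinea"))
```

The codes are matched case-sensitively here, so the lowercase word "uk" or "wales" inside other text would not be counted.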

## Other weird stuff

We still have 69 composed names, and some of them are probably not clean. We look for names that should not be composed and replace them with the most common simple name (e.g. "Russian Federation" is replaced by "Russia", the most common way of referring to it).

In [388]:
composed_filter = countries_df.apply(lambda x: len(x['English Name'].split(" ")) > 1, axis=1)
print("Number of composed names", len(countries_df[composed_filter]), "/", len(countries_df))

Number of composed names 69 / 249


In [389]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Republic")]

Unnamed: 0,Alpha2,Alpha3,English Name
38,CF,CAF,Central African Republic
63,DO,DOM,Dominican Republic
124,LA,LAO,Lao People's Democratic Republic
214,SY,SYR,Syrian Arab Republic


In [390]:
# We choose to keep Central African Republic and Dominican Republic
countries_df.loc[124, "English Name"] = "Laos"
countries_df.loc[214, "English Name"] = "Syria"

In [391]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Kingdom")]

Unnamed: 0,Alpha2,Alpha3,English Name
79,GB,GBR,United Kingdom


In [392]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Federation")]

Unnamed: 0,Alpha2,Alpha3,English Name
189,RU,RUS,Russian Federation


In [393]:
countries_df.loc[189, "English Name"] = "Russia"

In [410]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Arab")]

Unnamed: 0,Alpha2,Alpha3,English Name
7,AE,ARE,United Arab Emirates
191,SA,SAU,Saudi Arabia


In [414]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Fr")]

Unnamed: 0,Alpha2,Alpha3,English Name
12,TF,ATF,French Southern Territories
93,GF,GUF,French Guiana
185,PF,PYF,French Polynesia


In [412]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Amer")]

Unnamed: 0,Alpha2,Alpha3,English Name
10,AS,ASM,American Samoa


In [413]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Brit")]

Unnamed: 0,Alpha2,Alpha3,English Name
105,IO,IOT,British Indian Ocean Territory


In [417]:
countries_df[composed_filter][countries_df[composed_filter]["English Name"].str.contains("Iv")]

Unnamed: 0,Alpha2,Alpha3,English Name
44,CI,CIV,Cote d'Ivoire


In [418]:
countries_df.loc[44, "English Name"] = "Ivory Coast"

In [420]:
composed_filter = countries_df.apply(lambda x: len(x['English Name'].split(" ")) > 1, axis=1)
print("Number of composed names", len(countries_df[composed_filter]), "/", len(countries_df))

Number of composed names 66 / 249


We now consider the remaining composed country names correct. This means we assume that we can search for the complete composed name in the emails, and will not have to use each part of the name separately.

E.g. we will look for occurrences of "United Kingdom" as a whole, and will not count an occurrence of "Kingdom" or "United" alone as a reference to this country.
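Searching for a complete composed name can be sketched as follows. The helper `phrase_pattern` is hypothetical: it builds a case-insensitive, whole-phrase pattern and, as an added assumption, lets the words be separated by any whitespace, since a name may be wrapped across two lines in an email.

```python
import re

def phrase_pattern(name):
    """Compile a whole-phrase, case-insensitive pattern; words may be separated by any whitespace."""
    words = [re.escape(word) for word in name.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", flags=re.IGNORECASE)

uk = phrase_pattern("United Kingdom")

# The full name is found even when it is wrapped across two lines
print(bool(uk.search("a visit to the United\nKingdom in May")))       # True
# The individual words on their own do not count as a mention
print(bool(uk.search("the kingdom was united under one ruler")))      # False
```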

# Conversion into dictionary

## Adding keywords