# An analysis of the American political scene from a media point of view

## Context

In the 21st century, media coverage is a crucial factor for political figures. By counting how many times a politician is quoted in media outlets (in our case, the New York Times), we get a rough measure of how much attention the media pays to that politician. In our analysis, we study how the number of citations of some of the most important American politicians has evolved over the last few years, and we compare this evolution with the most important events in their careers to see whether there is any correlation or possible causation. We then refine the analysis by distinguishing the speakers (those who quoted a given politician) by religion, nationality and political party, in order to get a finer-grained view of the effects. Finally, we compare our results with Google Trends data to see whether conventional media outlets capture online interest well.


## The data

We are provided with a compressed `.bz2` JSON file containing one row per quote.
Each row has the following fields:

 - `quoteID`: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 - `quotation`: Text of the longest encountered original form of the quotation
 - `date`: Earliest occurrence date of any version of the quotation
 - `phase`: Corresponding phase of the data in which the quotation first occurred (A-E)
 - `probas`: Array representing the probabilities of each speaker having uttered the quotation.
      The probabilities across different occurrences of the same quotation are summed for
      each distinct candidate speaker and then normalized
      - `proba`: Probability for a given speaker
      - `speaker`: Most frequent surface form for a given speaker in the articles where the quotation occurred
 - `speaker`: Selected most likely speaker. This matches the first speaker entry in `probas`
 - `qids`: Wikidata IDs of all aliases that match the selected speaker
 - `numOccurrences`: Number of times this quotation occurs in the articles
 - `urls`: List of links to the original articles containing the quotation 
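
A file of this kind can be read one quote at a time without decompressing it to disk. The following is a minimal sketch, assuming the file holds one JSON object per line; the helper name `iter_quotes` and the sample file are hypothetical, and the standard `json` module is used here in place of `ujson`.

```python
import bz2
import json
import os
import tempfile

def iter_quotes(path):
    """Yield one dict per non-empty line of a bz2-compressed JSON-lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

# Tiny stand-in file so the sketch runs without the real dump (hypothetical rows).
sample_path = os.path.join(tempfile.mkdtemp(), "quotes-sample.json.bz2")
rows = [
    {"quoteID": "2015-01-01-000001", "speaker": "A", "numOccurrences": 2},
    {"quoteID": "2015-01-02-000002", "speaker": "B", "numOccurrences": 1},
]
with bz2.open(sample_path, "wt", encoding="utf-8") as handle:
    for row in rows:
        handle.write(json.dumps(row) + "\n")

loaded = list(iter_quotes(sample_path))
```

Because the generator yields rows lazily, a filter (for example on `speaker`) can be applied while reading, so the full dataset never has to fit in memory.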

In [3]:
# Imports we may need
import seaborn as sns
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
import ujson as json
import bz2

### Load Quotes and Speaker Attributes

Load quotes related to Donald Trump.

In [None]:
df_Trump = pd.read_csv('df_Trump_cleaned.csv')

df_Trump

Load the parquet dataframe with attributes of each author

In [4]:
speaker_attributes_updated = pd.read_parquet("data/speaker_attributes_updated.parquet")


In [10]:
speaker_attributes_updated

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Great Britain, United States of America]",[male],1395141751,,W000178,"[politician, military officer, farmer, cartogr...",[independent politician],,Q23,George Washington,"[1792 United States presidential election, 178...",item,[Episcopal Church]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[United Kingdom],[male],1395737157,[White British],,"[playwright, screenwriter, novelist, children'...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Belgium],[male],1380367296,,,"[writer, lawyer, librarian, information scient...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[United States of America],[male],1395142029,,,"[politician, motivational speaker, autobiograp...",[Republican Party],,Q207,George W. Bush,"[2000 United States presidential election, 200...",item,"[United Methodist Church, Episcopal Church, Me..."
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Spain],[male],1391704596,,,[painter],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9055976,[Barker Howard],,[United States of America],[male],1397399351,,,[politician],,,Q106406560,Barker B. Howard,,item,
9055977,[Charles Macomber],,[United States of America],[male],1397399471,,,[politician],,,Q106406571,Charles H. Macomber,,item,
9055978,,[+1848-04-01T00:00:00Z],,[female],1397399751,,,,,,Q106406588,Dina David,,item,
9055979,,[+1899-03-18T00:00:00Z],,[female],1397399799,,,,,,Q106406593,Irma Dexinger,,item,


### Merge Chunk with Speaker Attributes

For an arbitrary chunk, add a column with the index of the row corresponding to the speaker.

In [117]:
chunk = next(chunks_YYYY[0])

chunk

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
80000,2015-08-23-067323,We're hearing anecdotally that there are more ...,,[],2015-08-23 15:18:46,1,"[[None, 0.7867], [Deb Miller, 0.2133]]",[http://gazette.com/teller-county-tourism-indu...,E
80001,2015-01-06-009207,caught in a web of rather irritating circumsta...,Frank Davies,"[Q16728410, Q5486161]",2015-01-06 19:55:54,2,"[[Frank Davies, 0.7485], [None, 0.2345], [Alha...",[http://www.myjoyonline.com/news/2015/January-...,E
80002,2015-02-20-099750,We're honored and humbled to lead a great prog...,Chris Holtmann,[Q16227417],2015-02-20 16:46:46,1,"[[Chris Holtmann, 0.6839], [None, 0.3012], [Br...",[http://wthr.com/story/28159357/butlers-holtma...,E
80003,2015-03-26-011760,"causing hurt by means of poison, etc. with int...",Girish Bapat,[Q18206155],2015-03-26 09:00:09,1,"[[Girish Bapat, 0.5421], [None, 0.2372], [Amit...",[http://www.downtoearth.org.in/content/maharsh...,E
80004,2015-08-24-106561,"We're hoping a couple of weeks of rest, get on...",Jason Motte,[Q2351140],2015-08-24 13:41:58,14,"[[Jason Motte, 0.6073], [None, 0.3423], [Joe M...",[http://jg-tc.com/news/state-and-regional/cubs...,E
...,...,...,...,...,...,...,...,...,...
89995,2015-01-01-025574,"Sam is a big, big part of our team,",Benoit Groulx,"[Q16223641, Q2896540]",2015-01-01 04:24:01,1,"[[Benoit Groulx, 0.8004], [None, 0.1871], [Sam...",[http://sabres.buffalonews.com/2014/12/31/sabr...,E
89996,2015-03-03-045146,It might be residential properties along the r...,Jeff Avery,[Q16150478],2015-03-03 19:10:20,1,"[[Jeff Avery, 0.8468], [None, 0.0948], [Orland...",[http://saultstar.com/2015/03/03/avery-examine...,E
89997,2015-03-11-068188,"Sam was a tough guy,",David Silverman,"[Q29514507, Q5239795, Q919608]",2015-03-11 16:05:16,1,"[[David Silverman, 0.8316], [None, 0.1684]]",[http://www.washingtonpost.com/news/comic-riff...,E
89998,2015-03-02-039728,"It obviously will be a challenge,",William Porterfield,[Q3530649],2015-03-02 06:04:56,5,"[[William Porterfield, 0.8407], [None, 0.1355]...",[http://www.espncricinfo.com/icc-cricket-world...,E


Add the index of the matching `speaker_attributes_updated` row to each quote

In [162]:
def search_author_index(QID, authors):
    res = authors[authors["id"] == QID]
    if res.empty:
        return -1
    return res.index[0]

In [163]:
def assign_value(chunk, index, key, QID, authors):
    chunk.at[index, key] = search_author_index(QID, authors)

In [164]:
def add_author_id(chunk, authors):
    import concurrent.futures
    # Add the column once, defaulting to -1 (no matching author).
    if "authorId" not in chunk.columns:
        chunk.insert(chunk.shape[1], "authorId", -1)
    # Look up the first QID of each quote in a thread pool.
    with concurrent.futures.ThreadPoolExecutor(30) as executor:
        for index, row in chunk.iterrows():
            executor.submit(assign_value, chunk, index, "authorId", row["qids"][0], authors)
    # Keep only the quotes whose speaker was found.
    return chunk[chunk["authorId"] != -1]

In [169]:
import time

start = time.time()

processed_chunk = add_author_id(filtered_chunk[:100], speaker_attributes_updated)

end = time.time()

print((end - start))

processed_chunk

22.410353660583496


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,authorId
80002,2015-02-20-099750,We're honored and humbled to lead a great prog...,Chris Holtmann,[Q16227417],2015-02-20 16:46:46,1,"[[Chris Holtmann, 0.6839], [None, 0.3012], [Br...",[http://wthr.com/story/28159357/butlers-holtma...,E,4829064
80003,2015-03-26-011760,"causing hurt by means of poison, etc. with int...",Girish Bapat,[Q18206155],2015-03-26 09:00:09,1,"[[Girish Bapat, 0.5421], [None, 0.2372], [Amit...",[http://www.downtoearth.org.in/content/maharsh...,E,2594464
80004,2015-08-24-106561,"We're hoping a couple of weeks of rest, get on...",Jason Motte,[Q2351140],2015-08-24 13:41:58,14,"[[Jason Motte, 0.6073], [None, 0.3423], [Joe M...",[http://jg-tc.com/news/state-and-regional/cubs...,E,7996970
80010,2015-11-12-140570,"We're hoping to raise $30,000.",Kim King,[Q16730868],2015-11-12 22:37:00,1,"[[Kim King, 0.9275], [None, 0.0725]]",[http://bramptonguardian.com/community-story/6...,E,4840480
80013,2015-04-27-006966,Certainly as they hire staff and they get peop...,Andy McGuire,[Q29446253],2015-04-27 18:26:00,1,"[[Andy McGuire, 0.9314], [None, 0.0543], [Hill...",[http://nationaljournal.com/2016-elections/iow...,E,2692534
...,...,...,...,...,...,...,...,...,...,...
80248,2015-02-06-100508,We've had some good series wins away against E...,Brendon McCullum,[Q3345051],2015-02-06 09:10:05,1,"[[Brendon McCullum, 0.9622], [None, 0.0378]]",[http://www.stuff.co.nz/sport/cricket/65894255...,E,2361390
80250,2015-09-18-121224,We've had some great matches with Leith in rec...,Raymond Carr,[Q3179517],2015-09-18 15:59:18,1,"[[Raymond Carr, 0.9036], [None, 0.0964]]",[http://www.edinburghnews.scotsman.com/sport/f...,E,90693
80251,2015-11-05-019111,Crouching Tiger: What China's Militarism Means...,Peter Navarro,[Q7176052],2015-11-05 20:29:06,79,"[[Peter Navarro, 0.5457], [None, 0.3198], [Don...",[http://scpr.org/programs/airtalk/2015/11/05/4...,E,192795
80252,2015-03-01-055128,We've had three unbelievable games that went t...,Quinn Cook,[Q7272263],2015-03-01 03:47:55,1,"[[Quinn Cook, 0.9418], [None, 0.0582]]",[http://acc.blogs.starnewsonline.com/46675/duk...,E,1327475
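
The run above needs about 22 seconds for 100 quotes, most likely because `search_author_index` scans the whole attribute table (about 9 million rows) for every quote. A possible alternative, sketched here under the assumption that `qids` holds Python lists (or arrays) of Wikidata IDs, builds the id-to-row lookup once and maps all quotes in a single pass. The function name `add_author_id_fast` and the toy tables are hypothetical.

```python
import pandas as pd

def add_author_id_fast(chunk, authors):
    """Hypothetical vectorized variant of add_author_id."""
    # Map each Wikidata id to the label of its first row in the attribute table.
    id_to_index = pd.Series(authors.index, index=authors["id"])
    id_to_index = id_to_index[~id_to_index.index.duplicated(keep="first")]
    # First QID of each quote, or None when the list is empty.
    first_qid = chunk["qids"].map(lambda q: q[0] if len(q) > 0 else None)
    out = chunk.copy()
    # Unknown or missing QIDs become -1, as in the original function.
    out["authorId"] = first_qid.map(id_to_index).fillna(-1).astype(int)
    return out[out["authorId"] != -1]

# Toy data standing in for the real tables.
authors = pd.DataFrame({"id": ["Q23", "Q42"],
                        "label": ["George Washington", "Douglas Adams"]})
quotes = pd.DataFrame({"quotation": ["a", "b", "c"],
                       "qids": [["Q42"], [], ["Q999"]]})
result = add_author_id_fast(quotes, authors)
```

Unlike the threaded version, this sketch also tolerates quotes with an empty `qids` list, which it simply drops.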


### Random Sampling

Draw a random sample from the data: the sample should follow approximately the same distribution as the full dataset, while staying manageable given our computational constraints.

In [170]:
def random_sample(dataset, sample_size):
    return dataset.sample(n=sample_size)
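
Since `random_sample` draws a different subset on every run, results may be hard to reproduce. A small sketch of a seeded variant follows; the name `random_sample_seeded` and the default seed are assumptions, not part of the original notebook.

```python
import pandas as pd

def random_sample_seeded(dataset, sample_size, seed=0):
    # Hypothetical variant of random_sample: a fixed seed makes the draw repeatable,
    # and the size is capped so that small datasets do not raise an error.
    return dataset.sample(n=min(sample_size, len(dataset)), random_state=seed)

toy = pd.DataFrame({"x": range(100)})
first = random_sample_seeded(toy, 10)
second = random_sample_seeded(toy, 10)
```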

### After 

In [None]:
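# `squeeze=True` is no longer accepted by recent pandas versions; `.squeeze("columns")` is the equivalent call
df_Trump = pd.read_csv('df_Trump_cleaned.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")
df_Trump.index = df_Trump.index.map(lambda x: str(x)[:-7])  # drop the 7-character "-NNNNNN" suffix of quoteID to keep only the YYYY-MM-DD date
df_Trump.to_csv('df_Trumps_with_dates.csv')  # checkpoint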

In [None]:
df_Trump = pd.read_csv('df_Trumps_with_dates.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")

In [None]:
df_Trump.head(10)  # as we can see, the index now holds the dates

In [None]:
# we plot the Trump timeseries of the number of quotes over the past few years
plt.rcParams["figure.figsize"] = (18,6)
ax = df_Trump.plot()
plt.title('Time series of the number of occurrences of quotes referring to Donald Trump')
plt.xlabel('time')
plt.show()

The time series is quite interesting: it shows several peaks, including a large one at the end of 2017. In milestone 3 we will try to link these peaks with the political events in his career.
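
As a first step toward linking peaks with events, the busiest days can be listed programmatically. The sketch below uses toy dates and assumes one row per quote indexed by date; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in: one entry per quote, indexed by the quote's date.
dates = pd.to_datetime(["2017-12-01", "2017-12-01", "2017-12-01",
                        "2017-12-02", "2017-12-05", "2017-12-05"])
toy = pd.Series(1, index=dates)

# Number of quotes per calendar day (days without quotes count as 0).
daily_quotes = toy.resample("D").sum()

# The two days with the most quotes, in decreasing order.
peaks = daily_quotes.nlargest(2)
```

The dates in `peaks.index` can then be checked by hand against a timeline of political events.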

### We could filter the dataset to keep only the quotes from the two months before and after the election of November 2016, in order to see whether this event corresponds to a larger number of quotes referring to Trump

In [None]:
df_Trump_2_months_before = df_Trump[('2016-11-09'>=df_Trump.index) & (df_Trump.index >= '2016-09-01')] 

In [None]:
df_Trump_2_months_before

In [None]:
# we plot the Trump timeseries of the number of quotes over the past two months before the elections
plt.rcParams["figure.figsize"] = (12,6)
ax = df_Trump_2_months_before.plot()
plt.title('Time series of the number of occurrences of quotes referring to Donald Trump two months before the elections')
plt.xlabel('time')
plt.show()

In [None]:
df_Trump_2_months_after = df_Trump[('2016-11-09'<=df_Trump.index) & (df_Trump.index <= '2017-01-01')] 

In [None]:
df_Trump_2_months_after

In [None]:
# we plot the Trump timeseries of the number of quotes over the two months after the elections
plt.rcParams["figure.figsize"] = (12,6)
ax = df_Trump_2_months_after.plot()
plt.title('Time series of the number of occurrences of quotes referring to Donald Trump two months after the elections')
plt.xlabel('time')
plt.show()
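
The two windows plotted above can also be compared numerically. The following sketch, on toy data, sums `numOccurrences` per day and compares the daily mean before and after the election. It assumes one row per quote indexed by date, and it uses 2016-11-08 (election day) as the boundary, whereas the cells above split on 2016-11-09.

```python
import pandas as pd

# Toy stand-in for df_Trump: one row per quote, indexed by date.
toy = pd.DataFrame(
    {"numOccurrences": [3, 1, 2, 5, 4]},
    index=pd.to_datetime(["2016-11-01", "2016-11-01", "2016-11-08",
                          "2016-11-10", "2016-11-10"]),
)

# Total occurrences per calendar day (days without quotes count as 0).
daily = toy["numOccurrences"].resample("D").sum()

election_day = pd.Timestamp("2016-11-08")  # assumption: split on election day itself
mean_before = daily[daily.index < election_day].mean()
mean_after = daily[daily.index > election_day].mean()
```

A clearly higher `mean_after` than `mean_before` on the real data would support the idea that the election increased coverage, although this comparison alone does not establish causation.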

In [None]:
def major_speakers(df, politician):
    # Ten speakers with the most quotes referring to the politician.
    top_speakers = df['speaker'].value_counts()[:10]
    print('the people who are speaking the most about '+politician+' are\n', top_speakers.index.tolist())
    top_speakers.plot(kind='bar', logy=True)
    plt.title('people who are speaking the most about '+politician)
    plt.ylabel('number of quotes')
    plt.show()

In [None]:
major_speakers(df_Trump, 'Trump')

As we can see, Hillary Clinton and Joe Biden are in the list. This is expected, since they were Trump's opponents in the 2016 and 2020 presidential elections and likely referred to him often in order to challenge his positions.

In [None]:
del df_Trump # we don't want to store it in memory

### We do the same for Clinton

In [None]:
df_Clinton = pd.read_csv('df_Clinton_cleaned.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")

In [None]:
df_Clinton.index = df_Clinton.index.map(lambda x: str(x)[:-7])  # drop the 7-character "-NNNNNN" suffix of quoteID to keep only the YYYY-MM-DD date
df_Clinton.to_csv('df_Clinton_with_dates.csv')  # checkpoint

In [None]:
import matplotlib.ticker as plticker

plt.rcParams["figure.figsize"] = (18,6)
ax = df_Clinton.plot()
plt.title('Time series of the number of occurrences of quotes referring to Hillary Clinton')
plt.xlabel('time')
plt.show()

As we can see, one of the peaks corresponds to the elections of November 2016.

### It could be interesting to know who quotes Clinton the most, so we look at the top speakers

In [None]:
major_speakers(df_Clinton, 'Clinton')

As we can see, Trump quotes Clinton a lot. This could be related to the fact that a large part of his 2016 campaign was based on discrediting his opponent (Clinton).