# An analysis of the American political scene from a media perspective

## Context

In the 21st century, media coverage is a crucial factor for political figures. By counting how often a politician is cited in quotes published by media outlets (in our case the New York Times), we obtain a rough measure of how much attention the media pays to that politician. In this analysis we study how the number of citations of some of the most important American politicians has evolved over the last few years, and we compare this evolution with the major events in their careers to see whether the two are correlated and whether a causal link is plausible. We then break the speakers (the people who quoted a given politician) down by religion, nationality and political party to get a finer-grained view of these effects. Finally, we compare our results with Google Trends data to see how well conventional media outlets capture online interest.


## The data

We are provided with two `.bz2`-compressed JSON files.

The first one, `quotes-YYYY.json.bz2`, contains a `.json` file in which each row holds the information related to a specific quote.
This `.json` has the following fields:

 - `quoteID`: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 - `quotation`: Text of the longest encountered original form of the quotation
 - `date`: Earliest occurrence date of any version of the quotation
 - `phase`: Corresponding phase of the data in which the quotation first occurred (A-E)
 - `probas`: Array representing the probabilities of each speaker having uttered the quotation.
      The probabilities across different occurrences of the same quotation are summed for
      each distinct candidate speaker and then normalized
      - `proba`: Probability for a given speaker
      - `speaker`: Most frequent surface form for a given speaker in the articles where the quotation occurred
 - `speaker`: Selected most likely speaker. This matches the first speaker entry in `probas`
 - `qids`: Wikidata IDs of all aliases that match the selected speaker
 - `numOccurrences`: Number of times this quotation occurs in the articles
 - `urls`: List of links to the original articles containing the quotation 
 
The second one, `quotebank-YYYY.json.bz2`, contains a `.json` file in which each row holds the information related to an entire article.
This `.json` has the following fields:

 - `articleID`: Primary key
 - `articleLength`: Length of the article in PTB tokens
 - `date`: Publication date of the article
 - `phase`: Corresponding phase in which the article appeared (A-E)
 - `title`: Title of the article
 - `url`: Link to the original article
 - `names`: List of all extracted speakers that occur in the article
      - `name`: Surface form of the first occurrence of each speaker in the article
      - `ids`: List of Wikidata IDs that have `name` as a possible alias
      - `offsets`: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
 - `quotations`: List of all the quotations that appear in the article
      - `quoteID`: Foreign key of the quotation (from the quotation-centric dataset)
      - `quotation`: Text of the quotation as it occurs in this article
      - `quotationOffset`: Index where the quotation starts in the article
      - `leftContext`: Text in the left context window of the quotation (used for the attribution)
      - `rightContext`: Text in the right context window (used for the attribution)
      - `globalProbas`: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
      - `globalTopSpeaker`: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID` 
      - `localProbas`: Array representing the probabilities of each speaker having said the quote *given this article context*.
           - `proba`: Probability for a given speaker
           - `speaker`: Name of the speaker as it first occurs in this article
      - `localTopSpeaker`: Selected speaker. Same name as the first entry in `localProbas`
      - `numOccurrences`: Number of times this quotation occurs in any article  
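
The quote-centric file described above can be streamed without decompressing it to disk. The following is a minimal sketch, assuming the file stores one JSON object per line; the helper name `iter_quotes`, the keyword filter and the tiny sample file are hypothetical and only illustrate the idea (the standard `json` module is used here in place of `ujson`).

```python
import bz2
import json
import os
import tempfile


def iter_quotes(path, keyword):
    """Yield quote records whose text mentions `keyword` (case-insensitive)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            if keyword.lower() in record["quotation"].lower():
                yield record


# Tiny hypothetical sample standing in for a real quotes-YYYY.json.bz2 file.
sample_rows = [
    {"quoteID": "2016-07-26-000001", "quotation": "Clinton made history tonight.", "numOccurrences": 3},
    {"quoteID": "2016-07-26-000002", "quotation": "The weather was fine.", "numOccurrences": 1},
]
sample_path = os.path.join(tempfile.mkdtemp(), "quotes-sample.json.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as handle:
    for row in sample_rows:
        handle.write(json.dumps(row) + "\n")

matches = list(iter_quotes(sample_path, "clinton"))
print([m["quoteID"] for m in matches])
```

Reading line by line keeps memory usage low, which matters because the real yearly files are large.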

In [2]:
# Imports we may need
import bz2

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import scipy as sc
import scipy.stats as stats
import seaborn as sns
import ujson as json
from IPython.display import display, HTML
from scipy.stats import ttest_ind

### Load Quotes and Speaker Attributes

Load quotes related to Donald Trump.

In [3]:
df_Trump = pd.read_pickle("data/df_Trump_cleaned.pkl")

df_Trump

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
12,2018-07-16-000103,[ Ensuring ] the orchestrating and timing of M...,Corey Lewandowski,[Q20740735],2018-07-16 14:05:34,2,"[[Corey Lewandowski, 0.7179], [None, 0.2754], ...",[http://www.theweek.co.uk/95082/donald-trump-s...,E
66,2018-05-09-001003,300-plus years of them cold shoulders... Obama...,Charlamagne Tha God,[Q16203002],2018-05-09 11:00:00,1,"[[Charlamagne Tha God, 0.4806], [None, 0.2924]...",[https://www.portlandmercury.com/music/2018/05...,E
212,2018-08-02-002115,A politically connected contractor made a $500...,,[],2018-08-02 06:24:59,1,"[[None, 0.7942], [President Trump, 0.1406], [K...",[https://sunlightfoundation.com/2018/08/02/tod...,E
264,2018-08-19-001084,A Wider Danger: Trump's Troubling Attacks on J...,,[],2018-08-19 02:05:39,3,"[[None, 0.9052], [Queen Elizabeth II, 0.0948]]","[http://www.vnews.com/Forum-Aug-19-19549635, h...",E
333,2018-09-26-004202,"After the commercial, she says, `They told me ...",,[],2018-09-26 00:00:00,2,"[[None, 0.4595], [Tom Arnold, 0.3248], [Julie ...",[http://feeds.foxnews.com/~r/foxnews/entertain...,E
...,...,...,...,...,...,...,...,...,...
5243994,2020-02-05-103219,Trump offends and disrespects the Venezuelan p...,Jorge Arreaza,[Q6623799],2020-02-05 00:00:00,11,"[[Jorge Arreaza, 0.9164], [None, 0.0726], [Pre...",[https://www.rawstory.com/2020/02/imwithfred-t...,E
5243995,2020-02-05-103235,"Trump survived, but he is the most unpopular p...",,[],2020-02-05 23:11:42,3,"[[None, 0.8786], [Donald Trump, 0.1214]]",[https://www.wellsvilledaily.com/zz/news/20200...,E
5243996,2020-03-13-071475,"Trump tried to mitigate the issue, saying it i...",Hassan Nasrallah,[Q181182],2020-03-13 22:15:06,1,"[[Hassan Nasrallah, 0.922], [None, 0.0741], [P...",[http://israelnationalnews.com/News/News.aspx/...,E
5243997,2020-03-15-037086,Trump's do-over approach -- he unlocked $50 bi...,Newt Gingrich,[Q182788],2020-03-15 00:00:00,40,"[[Newt Gingrich, 0.5146], [None, 0.3958], [Don...",[http://uspolitics.einnews.com/article/5120893...,E


Load the parquet dataframe with attributes of each author

In [4]:
speaker_attributes_updated = pd.read_parquet("data/speaker_attributes_updated.parquet")


In [5]:
speaker_attributes_updated

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Great Britain, United States of America]",[male],1395141751,,W000178,"[politician, military officer, farmer, cartogr...",[independent politician],,Q23,George Washington,"[1792 United States presidential election, 178...",item,[Episcopal Church]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[United Kingdom],[male],1395737157,[White British],,"[playwright, screenwriter, novelist, children'...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Belgium],[male],1380367296,,,"[writer, lawyer, librarian, information scient...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[United States of America],[male],1395142029,,,"[politician, motivational speaker, autobiograp...",[Republican Party],,Q207,George W. Bush,"[2000 United States presidential election, 200...",item,"[United Methodist Church, Episcopal Church, Me..."
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Spain],[male],1391704596,,,[painter],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9055976,[Barker Howard],,[United States of America],[male],1397399351,,,[politician],,,Q106406560,Barker B. Howard,,item,
9055977,[Charles Macomber],,[United States of America],[male],1397399471,,,[politician],,,Q106406571,Charles H. Macomber,,item,
9055978,,[+1848-04-01T00:00:00Z],,[female],1397399751,,,,,,Q106406588,Dina David,,item,
9055979,,[+1899-03-18T00:00:00Z],,[female],1397399799,,,,,,Q106406593,Irma Dexinger,,item,


### Plot time series

We plot the time series of the number of occurrences of quotes referring to Hillary Clinton and compare it with Google Trends to check whether the two curves look similar.

In [6]:
# Load the file containing the occurrence dates of each quote
Clinton_dataframe = pd.read_pickle("data/Clinton_with_dates.pkl")

In [179]:
Clinton_dataframe

Unnamed: 0,quoteID,quotation,speaker,qids,numOccurrences,probas,urls,phase,date
26,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,[Q359442],1,"[[Bernie Sanders, 0.5395], [None, 0.3128], [Hi...",[http://examiner.com/article/bernie-sanders-sl...,E,[2015-10-25 14:12:35]
888,2015-09-16-006359,And I'm just pointing out the absurd on both s...,Kathleen Madigan,[Q6376814],1,"[[Kathleen Madigan, 0.8025], [None, 0.1975]]",[http://northjersey.com/arts-and-entertainment...,E,[2015-09-16 05:44:37]
6930,2015-10-22-051493,If [ Democratic presidential candidate former ...,Marco Rubio,[Q324546],1,"[[Marco Rubio, 0.93], [None, 0.0685], [Hillary...",[http://breitbart.com/video/2015/10/22/rubio-i...,E,[2015-10-22 20:04:16]
7374,2015-12-31-032035,I'm electable. I was elected in a purple state...,Jeb Bush,[Q221997],7,"[[Jeb Bush, 0.8392], [None, 0.0925], [Hillary ...",[http://www.postandcourier.com/article/2015123...,E,[2015-12-31 03:29:00]
9855,2015-11-12-104266,The incentive to invent episodes of discrimina...,Glenn Reynolds,[Q4392664],2,"[[Glenn Reynolds, 0.3454], [Ed Driscoll, 0.322...","[http://pjmedia.com/instapundit/218734/, http:...",E,[2015-11-12 00:00:00]
...,...,...,...,...,...,...,...,...,...
5222109,2020-03-06-025712,"I think that would have been a mistake, becaus...",Jennifer Palmieri,[Q18209402],1,"[[Jennifer Palmieri, 0.9117], [None, 0.056], [...",[https://www.rollingstone.com/politics/politic...,E,[2020-03-06 14:38:07]
5231803,2020-01-18-006266,"Chief Justice Rehnquist, when he presided over...",Dick Durbin,[Q434804],1,"[[Dick Durbin, 0.898], [None, 0.079], [Charlie...",[https://www.washingtonexaminer.com/news/impea...,E,[2020-01-18 05:01:08]
5235860,2020-01-06-061256,The main difference between Lindsey and his De...,David Woodard,[Q1177254],6,"[[David Woodard, 0.7544], [None, 0.1797], [Lin...",[http://chron.com/entertainment/article/How-Li...,E,[2020-01-06 00:00:00]
5235869,2020-04-09-052373,The model of Obama asking Bush and Clinton to ...,Bill Haslam,[Q862186],1,"[[Bill Haslam, 0.905], [None, 0.0837], [Barack...",[http://www.nytimes.com/2020/04/09/us/politics...,E,[2020-04-09 23:04:21]


In [184]:
# Load the file containing the occurrence dates of each quote
Trump_dataframe = pd.read_pickle("data/Trump_with_dates.pkl")

In [185]:
Trump_dataframe

Unnamed: 0,quoteID,quotation,speaker,qids,numOccurrences,probas,urls,phase,date
12,2018-07-16-000103,[ Ensuring ] the orchestrating and timing of M...,Corey Lewandowski,[Q20740735],2,"[[Corey Lewandowski, 0.7179], [None, 0.2754], ...",[http://www.theweek.co.uk/95082/donald-trump-s...,E,"[2018-07-17 06:00:00, 2018-07-16 14:05:34]"
66,2018-05-09-001003,300-plus years of them cold shoulders... Obama...,Charlamagne Tha God,[Q16203002],1,"[[Charlamagne Tha God, 0.4806], [None, 0.2924]...",[https://www.portlandmercury.com/music/2018/05...,E,[2018-05-09 11:00:00]
366,2018-01-07-002036,All I can say is it's not a hoax. The Russians...,Lindsey Graham,[Q22212],1,"[[Lindsey Graham, 0.5251], [None, 0.2936], [Ch...",[http://postandcourier.com/politics/would-lind...,E,[2018-01-07 15:40:00]
1024,2018-01-16-011608,being too nonchalant about Mr. Trump's rants.,Floyd Abrams,[Q3365171],3,"[[Floyd Abrams, 0.7512], [None, 0.2421], [Hill...",[http://www.washingtontimes.com/news/2018/jan/...,E,"[2018-01-16 20:06:13, 2018-01-17 05:01:43]"
1091,2018-09-24-011280,Brett Kavanaugh is poised to join Neil Gorsuch...,Raul Labrador,[Q555393],1,"[[Raul Labrador, 0.6471], [None, 0.2436], [Pre...",[http://www.spokesman.com/stories/2018/sep/25/...,E,[2018-09-24 23:21:10]
...,...,...,...,...,...,...,...,...,...
5243971,2020-04-05-029136,To say that I'm infuriated with the recent act...,Dwight Ball,[Q5318112],1,"[[Dwight Ball, 0.6293], [None, 0.3336], [Justi...",[https://www.cbc.ca/news/politics/trudeau-will...,E,[2020-04-05 23:11:52]
5243994,2020-02-05-103219,Trump offends and disrespects the Venezuelan p...,Jorge Arreaza,[Q6623799],11,"[[Jorge Arreaza, 0.9164], [None, 0.0726], [Pre...",[https://www.rawstory.com/2020/02/imwithfred-t...,E,"[2020-02-05 19:09:04, 2020-02-05 19:25:17, 202..."
5243996,2020-03-13-071475,"Trump tried to mitigate the issue, saying it i...",Hassan Nasrallah,[Q181182],1,"[[Hassan Nasrallah, 0.922], [None, 0.0741], [P...",[http://israelnationalnews.com/News/News.aspx/...,E,[2020-03-13 22:15:06]
5243997,2020-03-15-037086,Trump's do-over approach -- he unlocked $50 bi...,Newt Gingrich,[Q182788],40,"[[Newt Gingrich, 0.5146], [None, 0.3958], [Don...",[http://uspolitics.einnews.com/article/5120893...,E,"[2020-03-15 00:00:00, 2020-03-15 00:00:00, 202..."


In [186]:
Politician_dataframes=[Clinton_dataframe,Trump_dataframe]

In [205]:
# Merge all dates into a single flat list to make plotting easier
list_Clinton_dates = [item for sublist in Clinton_dataframe["date"].values for item in sublist]

# Build a DataFrame whose index holds the dates
times_series_Clinton = pd.DataFrame(index = list_Clinton_dates)
# Keep only the calendar date: drop the trailing " HH:MM:SS" (9 characters) from each timestamp
times_series_Clinton.index = times_series_Clinton.index.map(lambda x: str(x)[:-9])

In [188]:
politician_list= ["Clinton","Trump"]
Politicians_time_series = []

# Repeat the same flattening for every politician
for i in range(len(politician_list)):
    list_dates = [item for sublist in Politician_dataframes[i]["date"].values for item in sublist]
    
    Politicians_time_series.append(pd.DataFrame(index = list_dates))
    # Keep only the calendar date (drop the time of day)
    Politicians_time_series[i].index = Politicians_time_series[i].index.map(lambda x: str(x)[:-9])
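
The flattening and per-day counting done in these cells can also be expressed with pandas built-ins. This is a hedged sketch on a hypothetical miniature frame (the `quotes` data below is made up), assuming the `date` column holds a list of timestamps per quote as in the `*_with_dates` files.

```python
import pandas as pd

# Hypothetical miniature of a "*_with_dates" frame: one row per quote,
# with a list of timestamps (one per occurrence) in the "date" column.
quotes = pd.DataFrame({
    "quoteID": ["q1", "q2", "q3"],
    "date": [
        [pd.Timestamp("2016-07-26 10:00:00"), pd.Timestamp("2016-07-27 08:30:00")],
        [pd.Timestamp("2016-07-26 23:59:59")],
        [pd.Timestamp("2016-07-28 00:00:00")],
    ],
})

# One row per occurrence, then drop the time of day and count per calendar day.
occurrences = pd.to_datetime(quotes["date"].explode())
daily_counts = occurrences.dt.normalize().value_counts().sort_index()
print(daily_counts)
```

Keeping the index as real timestamps, rather than strings, also makes later date filtering and resampling easier.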

In [279]:
Politicians_time_series_plot = []
for i in range(len(politician_list)):
    #Plot time series
    Politicians_time_series[i]["numOccurrences"] = 1
    Politicians_time_series_plot = Politicians_time_series[i].groupby(by = Politicians_time_series[i].index).count()
    
    # create figure
    fig = go.Figure()

    # Add the daily counts as a scatter trace
    fig.add_trace(go.Scatter(x = Politicians_time_series_plot.index, y=Politicians_time_series_plot["numOccurrences"]))


    fig.update_layout(
        title = "Timeseries of the number of occurrences of quotes referring to "+politician_list[i],
        xaxis_title = "time",
        xaxis=dict(
            rangeslider = dict(visible = True),
                type ="date"
        )
    )
    
    fig.show()

Plot the time series of quotes referring to Hillary Clinton

In [227]:
# Maximum number of occurrences in a single day
for i in range(len(politician_list)):
    print("Maximum number of occurrences of",politician_list[i],"in NYT quotes:", Politicians_time_series[i].groupby(by = Politicians_time_series[i].index).count().max().iloc[0],"\n")

Maximum number of occurrences of Clinton in NYT quotes: 2104 

Maximum number of occurrences of Trump in NYT quotes: 3101 



As we can see, the maximum number of daily occurrences for Clinton is 2104, which corresponds to the peak seen in the plot. Let us see which date it is associated with.

In [243]:
# Date associated with the maximum number of occurrences
for i in range(len(politician_list)):
    print("Date which maximum number of occurrences of",politician_list[i],"in NYT quotes happened:", Politicians_time_series[i].groupby(by = Politicians_time_series[i].index).count().idxmax().iloc[0],"\n")

Date which maximum number of occurrences of Clinton in NYT quotes happened: 2016-07-26 

Date which maximum number of occurrences of Trump in NYT quotes happened: 2019-03-25 



The peak falls on 26 July 2016. Hillary Clinton was formally nominated as the Democratic presidential candidate at the party's national convention on that day, so the peak most likely reflects the media's interest in the newly confirmed candidate.

### Timeseries before and after the election

In this section we plot the time series before and after the election, considering the 3 months before and the 3 months after election day. Since the 2016 United States presidential election took place on 8 November 2016, we consider the period from 8 August 2016 to 8 February 2017.
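
The window boundaries can be derived from election day instead of being typed by hand. A minimal sketch, assuming a date-indexed series of daily counts (the `daily_counts` data here is a hypothetical placeholder):

```python
import pandas as pd

election_day = pd.Timestamp("2016-11-08")
window_start = election_day - pd.DateOffset(months=3)
window_end = election_day + pd.DateOffset(months=3)

# Hypothetical daily counts indexed by date, used only to illustrate the filter.
daily_counts = pd.Series(
    1, index=pd.date_range("2016-07-01", "2017-03-31", freq="D"), name="numOccurrences"
)

# Strict inequalities mirror the comparison used in the notebook.
in_window = daily_counts[(daily_counts.index > window_start) & (daily_counts.index < window_end)]
print(window_start.date(), window_end.date(), len(in_window))
```

Because the inequalities are strict, the two boundary days themselves are excluded from the window.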

In [275]:
timer_series_BA_election=[]
for i in range(len(politician_list)):
    timer_series_BA_election.append(pd.DataFrame())
    timer_series_BA_election[i]["date"] = Politicians_time_series[i].index
    # Dates are ISO-formatted strings (YYYY-MM-DD), so string comparison follows chronological order
    timer_series_BA_election[i] = timer_series_BA_election[i][(timer_series_BA_election[i]["date"] > '2016-08-08') & (timer_series_BA_election[i]["date"] < '2017-02-08')]

    # Rebuild a frame indexed by date, with one row per occurrence
    timer_series_BA_election[i] = pd.DataFrame(index = list(timer_series_BA_election[i]["date"]))

Plot the time series of quotes referring to Hillary Clinton during the election period

In [282]:
for i in range (len(politician_list)):
    #Plot time series
    timer_series_BA_election[i]["numOccurrences"] = 1
    times_series_BA_election_plot = timer_series_BA_election[i].groupby(by = timer_series_BA_election[i].index).count()
    
    # create figure
    fig = go.Figure()

    # Add the daily counts as a scatter trace
    fig.add_trace(go.Scatter(x = times_series_BA_election_plot.index, y=times_series_BA_election_plot["numOccurrences"], name= "Number of occurrences in NYT"))

    fig.add_trace(go.Scatter(x=["2016-11-08", "2016-11-08"], y=[0,700], mode="lines", name="Election day"))
    fig.update_layout(
        title = "Timeseries of the number of occurrences of quotes referring to "+politician_list[i]+" during the election period",
        xaxis_title = "time",
        xaxis=dict(
            rangeslider = dict(visible = True),
                type ="date"
        )
    )

    fig.show()


As we can see, there is a sharp drop in the quotes referring to Clinton after the election, whereas their number is very high before it. The period before the election is the campaign and debate season, so many speakers are probably referring to her for that reason. After the election, by contrast, we observe a large decrease in her media presence, plausibly due to the election result.

### Compare with google trends

In [286]:
from plotly.subplots import make_subplots
google_trend_list = []
for i in range(len(politician_list)):
    # Load the Google Trends time series of searches for this politician
    google_trend_list.append(pd.read_csv('data/'+politician_list[i]+'_google_trends.csv'))
    
    google_trend_list[i].set_index("Week", inplace = True)
    google_trend_list[i].index.name = None
    
    # Min-max normalize both series to [0, 1]
    google_trends_norm = (google_trend_list[i]-google_trend_list[i].min()) / (google_trend_list[i].max()-google_trend_list[i].min())
    
    ny_times_data = timer_series_BA_election[i].groupby(by = timer_series_BA_election[i].index).count()
    ny_times_data_norm = (ny_times_data-ny_times_data.min()) / (ny_times_data.max()-ny_times_data.min())

    
    google_trends_trace = go.Scatter(
        x=google_trends_norm.index,  # weekly Google Trends dates, not the daily NYT dates
        y=google_trends_norm["Searchs"],
        name='Normalized number of searches in Google',
        yaxis='y1'
    )
    
    ny_quotes_trace = go.Scatter(
        x=ny_times_data_norm.index,
        y=ny_times_data_norm["numOccurrences"],
        name='Normalized number of occurrences in NYT',
        yaxis='y2'
    )
    V_trace = go.Scatter(x=["2016-11-08", "2016-11-08"], y=[0,1], mode="lines", name="Election day")
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    fig.add_trace(google_trends_trace)
    fig.add_trace(ny_quotes_trace,secondary_y=True)
    fig.add_trace(V_trace)

    fig.update_layout(
        title = "Timeseries of the number of times that "+politician_list[i]+" was searched",
        xaxis_title = "time",
        xaxis=dict(
            rangeslider = dict(visible = True),
                type ="date"
        )
    )

    fig.show()
    break  # only the first politician (Clinton) is compared with Google Trends

Note: both the New York Times data and the Google Trends data are min-max normalized to the range [0, 1] so that they can be compared meaningfully.
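
Since the quote counts are daily and the Google Trends file has a `Week` column, one option is to aggregate the quotes to weekly totals before normalizing. The following is a sketch under stated assumptions: the `daily` series is made up, and weeks are assumed to run from Sunday to Saturday, which may not match the actual export.

```python
import pandas as pd


def min_max(series):
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())


# Hypothetical daily quote counts (two full weeks, starting on a Sunday).
daily = pd.Series(
    [5, 7, 6, 9, 12, 4, 3, 20, 25, 30, 22, 18, 10, 8],
    index=pd.date_range("2016-10-30", periods=14, freq="D"),
)

# Sum daily counts into Sunday-to-Saturday weeks; each bin is labelled by its Saturday.
weekly = daily.resample("W-SAT").sum()
weekly_norm = min_max(weekly)
print(weekly)
print(weekly_norm)
```

Note that `resample` labels each week by its last day, so the labels may need shifting before they line up with week-start dates in a Google Trends file.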

As we can see in the Google Trends plot and in the time series of the number of occurrences of quotes related to Hillary Clinton, both show a spike in 2016. This was the year of the United States presidential election, in which Hillary Clinton was the Democratic Party's nominee running against Donald Trump.

Comparing the spikes of the two plots, we see that the number of occurrences of quotes referring to Hillary Clinton peaks around July, while Google Trends peaks around November. A possible explanation for this gap is that people often google a topic because they have read something about it, so it is plausible that the spike in quotes on a topic precedes the spike in Google searches. In addition, people probably googled Clinton mostly around election day, which attracted global interest, while newspapers had started covering her earlier (during the campaign and debate period).
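
One way to probe the "quotes lead searches" hypothesis is to look for the time shift that maximizes the correlation between the two series. The sketch below uses synthetic data in which the delay is known by construction; the helper `best_lag` and both series are hypothetical, and a high lagged correlation would not by itself establish causation.

```python
import numpy as np
import pandas as pd


def best_lag(leading, following, max_lag):
    """Return the shift (in periods) of `leading` that maximizes its Pearson correlation with `following`."""
    correlations = {
        lag: following.corr(leading.shift(lag)) for lag in range(0, max_lag + 1)
    }
    return max(correlations, key=correlations.get), correlations


# Synthetic weekly series: `searches` is `quotes` delayed by 3 weeks (an assumption for illustration).
weeks = pd.date_range("2016-01-03", periods=40, freq="W")
rng = np.random.default_rng(0)
quotes = pd.Series(rng.random(40), index=weeks)
searches = quotes.shift(3)

lag, correlations = best_lag(quotes, searches, max_lag=8)
print(lag)
```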

### Relation between attribute and speakers

Our objective here is to find out whether there is any relation between a specific attribute and the speakers who refer to a given politician.

To carry out this type of analysis, we first load the file in which the authorID has already been merged. This column contains the index into the speaker attributes file. We then merge it with the speaker attributes file (this is very fast, since the expensive merge has already been carried out).

In [75]:
#Load Clinton's speakers attributes file
Clinton_attributes = pd.read_pickle("data/Clinton_with_attributes.pkl")

In [76]:
Clinton_attributes.qids = Clinton_attributes.qids.transform(lambda x: x[0])
Clinton_with_att = pd.merge(Clinton_attributes, speaker_attributes_updated, left_on = 'qids', right_on = 'id' ) 

In [77]:
Trump_attributes = pd.read_pickle("data/Trump_with_attributes.pkl")

In [78]:
Trump_attributes.qids = Trump_attributes.qids.transform(lambda x: x[0])
Trump_with_att = pd.merge(Trump_attributes, speaker_attributes_updated, left_on = 'qids', right_on = 'id' ) 

### Relation between age and speakers 

Our objective here is to analyse the age distribution of Clinton's and Trump's speakers.

We compute the age of each speaker from their date of birth and then filter out outliers, which we define as ages above 100 years.
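
The age computation can be sketched on a few rows. This is only an illustration: the helper `age_from_wikidata` and the constants are hypothetical names, and the birth dates are taken from the speaker attributes preview shown earlier.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # year used in the notebook to compute ages
MAX_AGE = 100          # ages above this are treated as outliers


def age_from_wikidata(date_of_birth):
    """Return the age in years from a list like ['+1946-07-06T00:00:00Z']."""
    birth_year = int(date_of_birth[0][1:5])  # skip the leading sign, keep the 4-digit year
    return REFERENCE_YEAR - birth_year


births = pd.Series(
    [["+1946-07-06T00:00:00Z"], ["+1732-02-22T00:00:00Z"], None],
    index=["George W. Bush", "George Washington", "Charles Macomber"],
)

# Drop speakers without a birth date, compute ages, then remove outliers.
ages = births.dropna().map(age_from_wikidata)
ages = ages[ages <= MAX_AGE]
print(ages)
```

Only the year is used, so ages can be off by one depending on the birthday, which is negligible for a distribution-level comparison.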

In [79]:
age_serie_Clinton = Clinton_with_att.groupby(by = 'speaker').date_of_birth.head(1)
age_serie_Clinton = age_serie_Clinton.dropna().transform(lambda x: 2021 - int(x[0][1:5]))
age_serie_Clinton = age_serie_Clinton[age_serie_Clinton <=100]
age_serie_Clinton = age_serie_Clinton.rename("age")

In [80]:
fig = px.histogram(age_serie_Clinton, color_discrete_sequence=['indianred'], marginal="box")
fig.update_layout(title_text='Distribution of the ages of Clinton speakers') # title of plot
fig.show(renderer = 'svg')

As we can see, most of the speakers are fairly old. What about the age distribution of Trump's speakers? We repeat the same analysis for them.

In [81]:
age_serie_Trump = Trump_with_att.groupby(by = 'speaker').date_of_birth.head(1)
age_serie_Trump = age_serie_Trump.dropna().transform(lambda x: 2021 - int(x[0][1:5]))
age_serie_Trump = age_serie_Trump[age_serie_Trump <=100]
age_serie_Trump = age_serie_Trump.rename("age")

In [82]:
fig = px.histogram(age_serie_Trump, color_discrete_sequence=['royalblue'], marginal="box")
fig.update_layout(title_text='Distribution of the ages of Trump speakers') # title of plot
fig.show(renderer = 'svg')

As we can see, the distribution looks roughly normal, centred around 60 years, apart from a peak at 21 years and some noise. It is interesting to compare this distribution with Clinton's on the same plot.

In [83]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(x = age_serie_Trump, name = 'Trump'))
fig.add_trace(go.Histogram(x = age_serie_Clinton, name = 'Clinton'))

# Overlay both histograms
fig.update_layout(barmode='overlay', title_text='Distribution of the ages of Trump and Clinton speakers together')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show(renderer = 'svg')

In [84]:
import plotly.figure_factory as ff

# Add histogram data
x1 = age_serie_Trump
x2 = age_serie_Clinton

# Group data together
hist_data = [x1, x2]

group_labels = ['Trump', 'Clinton']
colors = ['royalblue', 'indianred']
# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels, colors = colors)
fig.update_layout(
    autosize=False,
    width=1000,
    height=800,)

fig.show()

From the distributions it seems that Clinton's speakers are in general slightly older than Trump's speakers. We compute the mean for both groups.

In [85]:
print('the mean of the age for Trump speakers is ', age_serie_Trump.mean())
print('the mean of the age for Clinton speakers is ', age_serie_Clinton.mean())

the mean of the age for Trump speakers is  58.19625944131399
the mean of the age for Clinton speakers is  59.473767466385446


Is the difference in age statistically significant? First of all, we visualize the means with confidence intervals.

In [86]:
def bootstrap_CI(data, nbr_draws):
    means = np.zeros(nbr_draws)
    data = np.array(data)

    for n in range(nbr_draws):
        indices = np.random.randint(0, len(data), len(data))
        data_tmp = data[indices] 
        means[n] = np.nanmean(data_tmp)

    return [np.nanpercentile(means, 2.5),np.nanpercentile(means, 97.5)]

In [87]:
# A single bootstrap run per group; the symmetric error bar is the half-width of the 95% CI
error_Trump = np.diff(bootstrap_CI(age_serie_Trump, 1000))[0] / 2
error_Clinton = np.diff(bootstrap_CI(age_serie_Clinton, 1000))[0] / 2

In [88]:
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Trump',
    x=['Trump'], y=[age_serie_Trump.mean()],
    error_y=dict(type='data', array=[error_Trump]),
    marker=dict(color='royalblue')
))
fig.add_trace(go.Bar(
    name='Clinton',
    x=['Clinton'], y=[age_serie_Clinton.mean()],
    error_y=dict(type='data', array=[error_Clinton]),
    marker=dict(color='indianred')
))
fig.update_layout(barmode='group')
fig.show(renderer = 'svg')

In [89]:
# Welch's t-test (equal_var=False); the result is not named `stats` to avoid shadowing the imported module
ttest_result = ttest_ind(age_serie_Trump, age_serie_Clinton, equal_var = False)

print('the p value for the t test with null hypothesis stating that the two populations have the same mean is ', ttest_result.pvalue)

the p value for the t test with null hypothesis stating that the two populations have the same mean is  2.308340322052276e-05


With a p-value below 0.05 (our chosen significance level), we reject the null hypothesis that the two populations have the same mean. Therefore Clinton's speakers are older, and the difference is statistically significant, although the difference in means is small (about 1.3 years).
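
With samples this large, even a small difference in means can produce a very small p-value, so it helps to report the size of the difference next to the test. A minimal sketch on synthetic data (the two age samples below are randomly generated assumptions, not the notebook's data):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Synthetic ages (assumption): two large groups with slightly different means.
ages_a = rng.normal(loc=58.0, scale=14.0, size=10000)
ages_b = rng.normal(loc=59.5, scale=14.0, size=10000)

result = ttest_ind(ages_a, ages_b, equal_var=False)  # Welch's t-test
mean_difference = ages_b.mean() - ages_a.mean()

ALPHA = 0.05  # significance level
print(f"difference in means: {mean_difference:.2f} years")
print(f"p-value: {result.pvalue:.3g}, reject H0: {result.pvalue < ALPHA}")
```

Statistical significance here says the difference is unlikely to be due to chance alone; whether a gap of about one year matters in practice is a separate question.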

### Relation between ethnic group and speakers

Our objective here is to understand which ethnic groups Trump's and Clinton's speakers belong to, and whether there is any relevant difference between the two.

In [90]:
Clinton_ethnics = Clinton_with_att.groupby(by='id').first().explode('ethnic_group').reset_index(drop=True)['ethnic_group'].dropna()

In [91]:
major_ethnicities_Clinton = Clinton_ethnics.groupby(Clinton_ethnics).count().sort_values(ascending = False)[0:8]
major_ethnicities_Clinton

ethnic_group
African Americans    211
Jewish people         30
American Jews         23
Italian American      16
Armenian American     13
Irish Americans       10
Irish people           7
Indian American        6
Name: ethnic_group, dtype: int64

In [92]:
x = major_ethnicities_Clinton.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Clinton',
    x=x, y=[major_ethnicities_Clinton[el] for el in x]
))
fig.update_layout(barmode='group', title = 'Ethnicities of Clinton speakers')
fig.update_traces(marker=dict(color="Indianred"))
fig.show(renderer="svg")

As we can see from the plot, African Americans are by far the largest group among the speakers with a recorded ethnic group, but speakers of other ethnicities have also quoted Clinton.

It is interesting to analyse how this differs from the ethnicities of Trump's speakers.

In [93]:
Trump_ethnics = Trump_with_att.groupby(by='id').first().explode('ethnic_group').reset_index(drop=True)['ethnic_group'].dropna()
major_ethnicities_Trump = Trump_ethnics.groupby(Trump_ethnics).count().sort_values(ascending = False)[0:8]
major_ethnicities_Trump

ethnic_group
African Americans    338
Jewish people         56
American Jews         34
Armenian American     17
Italian American      13
Irish people          11
English people        10
Yoruba people         10
Name: ethnic_group, dtype: int64

In [94]:
x = major_ethnicities_Trump.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Trump',
    x=x, y=[major_ethnicities_Trump[el] for el in x]
))
fig.update_layout(barmode='group', title='Ethnicities of Trump speakers')
fig.update_traces(marker=dict(color="royalblue"))
fig.show(renderer='svg')

In [95]:
x0 = Trump_ethnics[Trump_ethnics.isin(major_ethnicities_Trump.index.tolist())]
x1 = Clinton_ethnics[Clinton_ethnics.isin(major_ethnicities_Clinton.index.tolist())]

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=x0,
    histnorm='percent',
    name='Trump', # name used in legend and hover labels,
    marker_color='royalblue'
))
fig.add_trace(go.Histogram(
    x=x1,
    histnorm='percent',
    name='Clinton',
    marker_color='indianred'
))

fig.update_layout(
    title_text='Major ethnicities comparison', # title of plot
    xaxis_title_text='Ethnic group', # xaxis label
    yaxis_title_text='Percent', # yaxis label (histnorm='percent')
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)

fig.show()

As we can see, there are some differences between the major ethnicities of the two groups. For example, Italian Americans make up a higher percentage of the speakers referring to Clinton than of those referring to Trump. On the other hand, Jewish people account for a larger share of the speakers referring to Trump than of those referring to Clinton.
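
The percentages behind this comparison can be recomputed from the two top-8 counts printed earlier. The sketch below copies those counts by hand and computes each group's share within its own top-8 list, which is assumed to match what the percent-normalized histogram displays.

```python
import pandas as pd

# Counts copied from the two top-8 outputs above.
clinton_counts = pd.Series({
    "African Americans": 211, "Jewish people": 30, "American Jews": 23,
    "Italian American": 16, "Armenian American": 13, "Irish Americans": 10,
    "Irish people": 7, "Indian American": 6,
})
trump_counts = pd.Series({
    "African Americans": 338, "Jewish people": 56, "American Jews": 34,
    "Armenian American": 17, "Italian American": 13, "Irish people": 11,
    "English people": 10, "Yoruba people": 10,
})

# Share of each ethnic group within its own top-8 list, in percent.
shares = pd.DataFrame({
    "Clinton (%)": 100 * clinton_counts / clinton_counts.sum(),
    "Trump (%)": 100 * trump_counts / trump_counts.sum(),
}).round(1)
print(shares)
```

Groups that appear in only one of the two lists show up as missing values in the other column.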

### Relation between sex and speakers

In [96]:
Clinton_genders = Clinton_with_att.groupby(by='id').first()['gender'].dropna().transform(lambda x: x[0])
Clinton_genders = Clinton_genders[Clinton_genders.isin(['male', 'female'])]

In [97]:
fig = px.histogram(Clinton_genders, color_discrete_sequence=['indianred'])
fig.update_layout(title_text='Distribution of the sex of Clinton speakers') # title of plot
fig.show(renderer = 'svg')

What is the male/female ratio?

In [98]:
male = Clinton_genders[Clinton_genders == 'male']
male_size = male.size
female_size = Clinton_genders.size - male.size
print('the ratio number of males/number of females for Clinton speakers is ', male_size/female_size)

the ratio number of males/number of females for Clinton speakers is  2.804247460757156


In [99]:
Trump_genders = Trump_with_att.groupby(by='id').first()['gender'].dropna().transform(lambda x: x[0])
Trump_genders = Trump_genders[Trump_genders.isin(['male', 'female'])]

In [100]:
fig = px.histogram(Trump_genders)
fig.update_layout(title_text='Distribution of the sex of Trump speakers') # title of plot
fig.update_traces(marker=dict(color="royalblue"))
fig.show(renderer='svg')

In [101]:
male = Trump_genders[Trump_genders == 'male']
male_size = male.size
female_size = Trump_genders.size - male.size
print('the ratio number of males/number of females for Trump speakers is ', male_size/female_size)

the ratio number of males/number of females for Trump speakers is  3.2282913165266107


In [102]:
x0 = Trump_genders
x1 = Clinton_genders

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=x0,
    histnorm='percent',
    name='Trump', # name used in legend and hover labels,
    marker_color='royalblue'
))
fig.add_trace(go.Histogram(
    x=x1,
    histnorm='percent',
    name='Clinton',
    marker_color='indianred'
))

fig.update_layout(
    title_text='Gender comparison', # title of plot
    xaxis_title_text='Gender', # xaxis label
    yaxis_title_text='Percent', # yaxis label (histnorm='percent')
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)

fig.show()

There is a notable difference between the two ratios: men appear to account for a larger share of the speakers for Trump than for Clinton. How can we show that this difference in ratios is statistically significant?

We add confidence intervals to the values of the ratios.

In [103]:
def bootstrap_CI_ratios(data, nbr_draws):
    ratios = np.zeros(nbr_draws)
    data = np.array(data)

    for n in range(nbr_draws):
        indices = np.random.randint(0, len(data), len(data))
        data_tmp = data[indices] 
        male = data_tmp[data_tmp=='male']
        male_size = male.size
        female_size = data_tmp.size - male.size
        ratios[n] = male_size/female_size
    
    return [np.nanpercentile(ratios, 2.5),np.nanpercentile(ratios, 97.5)]

In [104]:
confidence_interval_Clinton = bootstrap_CI_ratios(Clinton_genders, 1000)
confidence_interval_Trump = bootstrap_CI_ratios(Trump_genders, 1000)
print('the confidence intervals for Clinton ratio of gender is', confidence_interval_Clinton)
print('the confidence intervals for Trump ratio of gender is',confidence_interval_Trump)

the confidence intervals for Clinton ratio of gender is [2.620307179084419, 3.011782268611049]
the confidence intervals for Trump ratio of gender is [3.0778928410625843, 3.3987372510927636]


Since the two 95% confidence intervals do not overlap, we can conclude that the difference between the ratios is statistically significant.
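As a complementary check, the difference between the two ratios can also be bootstrapped directly: if the resulting interval excludes 0, the difference is significant at the corresponding level. The following is only a minimal sketch of that idea. The function names `male_female_ratio` and `bootstrap_ratio_difference` and the toy arrays are hypothetical and are not part of the notebook; in practice one would pass `Trump_genders` and `Clinton_genders` instead.

```python
import numpy as np


def male_female_ratio(genders):
    """Return (#male / #female) for an array of 'male'/'female' labels."""
    genders = np.asarray(genders)
    n_male = np.count_nonzero(genders == "male")
    n_female = genders.size - n_male
    # Assumes at least one 'female' label in the sample.
    return n_male / n_female


def bootstrap_ratio_difference(genders_a, genders_b, n_draws=1000, seed=0):
    """95% bootstrap CI for ratio(genders_a) - ratio(genders_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(genders_a)
    b = np.asarray(genders_b)
    diffs = np.empty(n_draws)
    for i in range(n_draws):
        # Resample each group independently, with replacement.
        sample_a = rng.choice(a, size=a.size, replace=True)
        sample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = male_female_ratio(sample_a) - male_female_ratio(sample_b)
    return np.percentile(diffs, [2.5, 97.5])


# Toy data for illustration only (not the notebook's data).
toy_a = np.array(["male"] * 3000 + ["female"] * 1000)  # ratio 3
toy_b = np.array(["male"] * 2000 + ["female"] * 1000)  # ratio 2
low, high = bootstrap_ratio_difference(toy_a, toy_b)
print("95% CI for the difference in ratios:", low, high)
```

Testing the difference directly is slightly less conservative than checking whether two separate intervals overlap.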

### Relation between nationality and speakers

In [105]:
Clinton_nationalities = Clinton_with_att.groupby(by='id').first().explode('nationality').reset_index(drop=True)['nationality'].dropna()

In [106]:
major_nationalities_Clinton = Clinton_nationalities.groupby(Clinton_nationalities).count().sort_values(ascending=False)[0:10]
print(major_nationalities_Clinton.to_string())

nationality
United States of America    2928
United Kingdom               187
Canada                        75
Australia                     65
Israel                        30
New Zealand                   26
India                         25
Russia                        23
France                        22
Germany                       22


In [107]:
# Plot the counts of the ten most frequent nationalities computed above
x = major_nationalities_Clinton.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Clinton',
    x=x, y=[major_nationalities_Clinton[el] for el in x]
))
fig.update_layout(barmode='group', title = 'Nationalities of Clinton speakers')
fig.update_traces(marker=dict(color="Indianred"))
fig.update_yaxes(type = 'log')
fig.show()

As we can see, the majority of the speakers are from the United States, which is exactly what we would expect. However, several other nationalities are also quite frequent. What about Trump?

In [108]:
Trump_nationalities = Trump_with_att.groupby(by='id').first().explode('nationality').reset_index(drop=True)['nationality'].dropna()

In [109]:
major_nationalities_Trump = Trump_nationalities.groupby(Trump_nationalities).count().sort_values(ascending=False)[0:10]
print(major_nationalities_Trump.to_string())

nationality
United States of America    5083
United Kingdom               667
Canada                       258
Australia                    171
Israel                       138
India                        120
Germany                      104
France                        84
Russia                        72
Ireland                       68


In [110]:
# Plot the counts of the ten most frequent nationalities computed above
x = major_nationalities_Trump.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Trump',
    x=x, y=[major_nationalities_Trump[el] for el in x]
))
fig.update_layout(barmode='group', title = 'Nationalities of Trump speakers')
fig.update_traces(marker=dict(color="royalblue"))
fig.update_yaxes(type = 'log')
fig.show()

Although most of the major speaker nationalities are the same, there are some differences between the two distributions. We will visualize the two distributions together in a single plot.

In [111]:
x0 = Trump_nationalities[Trump_nationalities.isin(major_nationalities_Trump.index.tolist())]
x1 = Clinton_nationalities[Clinton_nationalities.isin(major_nationalities_Clinton.index.tolist())]

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=x0,
    histnorm='percent',
    name='Trump', # name used in legend and hover labels,
    marker_color='royalblue'
))
fig.add_trace(go.Histogram(
    x=x1,
    histnorm='percent',
    name='Clinton',
    marker_color='indianred'
))

fig.update_layout(
    title_text='Major nationalities comparison', # title of plot
    xaxis_title_text='Nationality', # xaxis label
    yaxis_title_text='Percent', # yaxis label (histnorm='percent')
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.update_yaxes(type = 'log')

fig.show()

The distributions of the major nationalities differ. Indeed, speakers with American nationality make up a larger share for Clinton than for Trump, whereas speakers from European countries (at least those represented in the plot) are more likely to speak about Trump than about Clinton.
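To make this comparison concrete, the shares can be computed from the top-10 counts printed above. The sketch below simply re-enters those printed counts by hand; the variable names (`clinton_counts`, `trump_counts`, `shares`) are illustrative and not part of the notebook, and the percentages are relative to the ten major nationalities only, as in the plot.

```python
import pandas as pd

# Top-10 nationality counts copied from the printed outputs above.
clinton_counts = pd.Series({
    "United States of America": 2928, "United Kingdom": 187, "Canada": 75,
    "Australia": 65, "Israel": 30, "New Zealand": 26, "India": 25,
    "Russia": 23, "France": 22, "Germany": 22,
})
trump_counts = pd.Series({
    "United States of America": 5083, "United Kingdom": 667, "Canada": 258,
    "Australia": 171, "Israel": 138, "India": 120, "Germany": 104,
    "France": 84, "Russia": 72, "Ireland": 68,
})

# Percentage of each nationality within each politician's top 10.
# Nationalities missing from one top 10 appear as NaN for that politician.
shares = pd.DataFrame({
    "Clinton_%": 100 * clinton_counts / clinton_counts.sum(),
    "Trump_%": 100 * trump_counts / trump_counts.sum(),
})
shares["difference"] = shares["Trump_%"] - shares["Clinton_%"]

print(shares.sort_values("difference").round(2))
```

A negative `difference` marks a nationality that is relatively more frequent among Clinton's speakers, a positive one a nationality that is relatively more frequent among Trump's.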

### Relation between academic degree and speakers

Our objective here is to determine whether there is any relation between the academic degree of the speakers and the politician they speak about.

As an example, let's take Clinton's speakers as our sample.

In [112]:
df_Clinton_AD = get_dataFrame_of_attribute(speaker_attributes_updated,Clinton_atributes,"academic_degree")

NameError: name 'get_dataFrame_of_attribute' is not defined

Our final goal in this section is to plot a histogram, so for a clearer visualization we will only consider academic degrees with more than 5 occurrences.

In [None]:
Clinton_AD_values, Clinton_AD_plot_values = get_dataframe_for_plot(df_Clinton_AD,"academic_degree",5)
Clinton_AD_values
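The helper `get_dataframe_for_plot` is not defined in this part of the notebook, so its exact behaviour is unknown. The filtering step described above (keeping only values with more than 5 occurrences) could be sketched as follows. The function name `frequent_value_counts` and the toy data are hypothetical and only illustrate the idea; this is not the project's actual implementation.

```python
import pandas as pd


def frequent_value_counts(df, column, min_count):
    """Count the values of `column` and keep those occurring more than `min_count` times.

    Hypothetical helper: returns a DataFrame with the columns
    [column, "counts"], sorted by decreasing count.
    """
    counts = (
        df[column]
        .value_counts()            # sorted by decreasing frequency
        .rename_axis(column)
        .reset_index(name="counts")
    )
    return counts[counts["counts"] > min_count].reset_index(drop=True)


# Toy data for illustration only (not the notebook's data).
toy = pd.DataFrame({
    "academic_degree": ["Bachelor of Arts"] * 8
    + ["Juris Doctor"] * 6
    + ["Doctor of Philosophy"] * 2
})
print(frequent_value_counts(toy, "academic_degree", 5))
```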

Plot the academic degree distribution of Clinton's speakers

In [None]:
# Bar plot of the main academic degrees of the speakers
AD_Clinton_hist = sns.barplot(x="counts", y="academic_degree", data=Clinton_AD_plot_values)
plt.show()

The majority of the speakers have a Bachelor of Arts.

### Conclusion

In the notebook Project2 we demonstrated the feasibility of the project, using the files created by the notebooks in the folder data_transformation_notebooks to show some of the analyses we intend to carry out in Milestone 3 (plotting time series, comparing them with Google Trends, trying to identify events from the time series, ...). We did this for Clinton, although the same analysis could be repeated for Trump since we have Trump's data too. In Milestone 3 we plan to redo the analysis in a more structured way, probably following a chronological order to tell the story of the American political scene from 2015 to 2020. We will probably include more politicians, such as Biden and Trump, and try to identify which events influenced their media visibility the most. Finally, as explained in the README and as shown for Clinton, we plan to analyse the features of the speakers and compare them across politicians, in order to understand whether there are noticeable differences that we can explain.