# An analysis of the American political scene from a media perspective

## Context

In the 21st century, media coverage is a crucial factor for political figures. By counting how often a politician is mentioned in quotes reported by media outlets (in our case, the New York Times), we obtain a rough measure of how much attention the media pays to that politician. In our analysis we study how the number of quotes mentioning some of the most important American politicians evolved over the last few years, and we compare this evolution with the most important events in their careers to see whether there is any correlation or plausible causal link. We then refine the analysis by distinguishing the speakers (the people who quoted a given politician) by religion, nationality and political party, in order to get a more fine-grained view of these effects. Finally, we compare our results with Google Trends data to see whether conventional media outlets capture online interest well.


## The data

We are provided with two `.bz2`-compressed JSON files.

The first one, `quotes-YYYY.json.bz2`, contains a `.json` file in which each row holds the information related to a specific quote.
This `.json` has the following fields:

 - `quoteID`: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 - `quotation`: Text of the longest encountered original form of the quotation
 - `date`: Earliest occurrence date of any version of the quotation
 - `phase`: Corresponding phase of the data in which the quotation first occurred (A-E)
 - `probas`: Array representing the probabilities of each speaker having uttered the quotation.
      The probabilities across different occurrences of the same quotation are summed for
      each distinct candidate speaker and then normalized
      - `proba`: Probability for a given speaker
      - `speaker`: Most frequent surface form for a given speaker in the articles where the quotation occurred
 - `speaker`: Selected most likely speaker. This matches the first speaker entry in `probas`
 - `qids`: Wikidata IDs of all aliases that match the selected speaker
 - `numOccurrences`: Number of times this quotation occurs in the articles
 - `urls`: List of links to the original articles containing the quotation 
 
The second one, `quotebank-YYYY.json.bz2`, contains a `.json` file in which each row holds the information related to an entire article.
This `.json` has the following fields:

 - `articleID`: Primary key
 - `articleLength`: Length of the article in PTB tokens
 - `date`: Publication date of the article
 - `phase`: Corresponding phase in which the article appeared (A-E)
 - `title`: Title of the article
 - `url`: Link to the original article
 - `names`: List of all extracted speakers that occur in the article
      - `name`: Surface form of the first occurrence of each speaker in the article
      - `ids`: List of Wikidata IDs that have `name` as a possible alias
      - `offsets`: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
 - `quotations`: List of all the quotations that appear in the article
      - `quoteID`: Foreign key of the quotation (from the quotation-centric dataset)
      - `quotation`: Text of the quotation as it occurs in this article
      - `quotationOffset`: Index where the quotation starts in the article
      - `leftContext`: Text in the left context window of the quotation (used for the attribution)
      - `rightContext`: Text in the right context window (used for the attribution)
      - `globalProbas`: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
      - `globalTopSpeaker`: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID` 
      - `localProbas`: Array representing the probabilities of each speaker having said the quote *given this article context*.
           - `proba`: Probability for a given speaker
           - `speaker`: Name of the speaker as it first occurs in this article
      - `localTopSpeaker`: Selected speaker. Same name as the first entry in `localProbas`
      - `numOccurrences`: Number of times this quotation occurs in any article  
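
The quote-centric file described above could be streamed row by row instead of being loaded at once. The following is only a minimal sketch, assuming that each line of the decompressed file is one JSON object; the helper name `iter_quotes`, the keyword filter and the toy file are illustrative assumptions, not part of the original pipeline.

```python
import bz2
import json
import os
import tempfile


def iter_quotes(path, keyword):
    """Yield rows of a line-delimited .json.bz2 file whose quotation mentions `keyword`."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            row = json.loads(line)
            if keyword.lower() in row["quotation"].lower():
                yield row


# Toy file standing in for quotes-YYYY.json.bz2 (hypothetical content).
toy_rows = [
    {"quoteID": "2015-01-01-000001", "quotation": "Trump said so.", "speaker": "A"},
    {"quoteID": "2015-01-01-000002", "quotation": "Nothing relevant.", "speaker": "B"},
]
toy_path = os.path.join(tempfile.mkdtemp(), "quotes-toy.json.bz2")
with bz2.open(toy_path, "wt", encoding="utf-8") as handle:
    for row in toy_rows:
        handle.write(json.dumps(row) + "\n")

# Keep only the rows whose quotation mentions the keyword.
matches = list(iter_quotes(toy_path, "trump"))
print([row["quoteID"] for row in matches])
```

Streaming keeps memory usage low, which matters because the full yearly files are large; the filtered rows can then be collected into a dataframe.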

In [1]:
# Imports we may need
import datetime as dt
import seaborn as sns
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import plotly.express as px
import plotly.graph_objects as go
import scipy.stats as stats
import pandas as pd
import numpy as np
import ujson as json
import bz2
from plotly.subplots import make_subplots

import scipy as sc
from scipy.stats import ttest_ind
from scipy.signal import find_peaks
import plotly.io as pio

import statsmodels.formula.api as smf

### Load Files

Load quotes related to Donald Trump.

In [2]:
df_Trump = pd.read_pickle("data/df_Trump_timeseries.pkl")
df_Trump = df_Trump[df_Trump.qids.transform(lambda x : len(x)>0)]

df_Trump

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,authorId
1521,2015-12-22-031341,"I promise, I won't talk about Trump again,",Jeb Bush,[Q221997],"[2015-12-22 20:43:59, 2015-12-22 23:16:41, 201...",10,"[[Jeb Bush, 0.7816], [None, 0.1677], [Donald T...",[http://www.politico.com/story/2015/12/jeb-bus...,E,7935110.0
3257,2015-07-21-047379,I'm sure the Republicans are enjoying Mr. Trum...,President Barack Obama,[Q76],"[2015-07-21 23:16:30, 2015-07-21 22:54:29, 201...",205,"[[President Barack Obama, 0.6523], [None, 0.19...",[http://azdailysun.com/entertainment/televisio...,E,7926268.0
3819,2015-07-22-051864,"it in particular thrives on theater, which Tru...",Frank Bruni,[Q1443006],[2015-07-22 13:33:11],1,"[[Frank Bruni, 0.8479], [None, 0.1246], [Donal...",[http://www.adweek.com/tvnewser/bob-kerrey-on-...,E,2313189.0
4261,2015-07-14-074352,it was appalling to hear Donald Trump describe...,Hillary Clinton,[Q6294],[2015-07-14 08:40:17],1,"[[Hillary Clinton, 0.8129], [None, 0.1175], [B...",[http://www.bloomberg.com/politics/articles/20...,E,1135081.0
4521,2015-11-30-060853,It's a coalition meeting. Some of these pastor...,Katrina Pierson,[Q22121130],[2015-11-30 14:25:50],1,"[[Katrina Pierson, 0.4136], [None, 0.3634], [D...",[http://www.politico.com/story/2015/11/trump-b...,E,1513626.0
...,...,...,...,...,...,...,...,...,...,...
5243993,2020-01-18-047371,Trump is the first president to be impeached w...,Steve Holland,"[Q1371916, Q55765510, Q7612868, Q7612869, Q761...",2020-01-18 15:30:39,1,"[[Steve Holland, 0.5318], [None, 0.4271], [Pre...",[http://canadafreepress.com/article/hear-ye-he...,E,
5243994,2020-02-05-103219,Trump offends and disrespects the Venezuelan p...,Jorge Arreaza,[Q6623799],2020-02-05 00:00:00,11,"[[Jorge Arreaza, 0.9164], [None, 0.0726], [Pre...",[https://www.rawstory.com/2020/02/imwithfred-t...,E,
5243996,2020-03-13-071475,"Trump tried to mitigate the issue, saying it i...",Hassan Nasrallah,[Q181182],2020-03-13 22:15:06,1,"[[Hassan Nasrallah, 0.922], [None, 0.0741], [P...",[http://israelnationalnews.com/News/News.aspx/...,E,
5243997,2020-03-15-037086,Trump's do-over approach -- he unlocked $50 bi...,Newt Gingrich,[Q182788],2020-03-15 00:00:00,40,"[[Newt Gingrich, 0.5146], [None, 0.3958], [Don...",[http://uspolitics.einnews.com/article/5120893...,E,


In [3]:
# Load the file containing the occurrence dates of each quote (Clinton)
df_Clinton = pd.read_pickle("data/df_Clinton_timeseries.pkl")
df_Clinton

Unnamed: 0,quoteID,quotation,speaker,qids,numOccurrences,probas,urls,phase,date
26,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,[Q359442],1,"[[Bernie Sanders, 0.5395], [None, 0.3128], [Hi...",[http://examiner.com/article/bernie-sanders-sl...,E,[2015-10-25 14:12:35]
888,2015-09-16-006359,And I'm just pointing out the absurd on both s...,Kathleen Madigan,[Q6376814],1,"[[Kathleen Madigan, 0.8025], [None, 0.1975]]",[http://northjersey.com/arts-and-entertainment...,E,[2015-09-16 05:44:37]
6930,2015-10-22-051493,If [ Democratic presidential candidate former ...,Marco Rubio,[Q324546],1,"[[Marco Rubio, 0.93], [None, 0.0685], [Hillary...",[http://breitbart.com/video/2015/10/22/rubio-i...,E,[2015-10-22 20:04:16]
7374,2015-12-31-032035,I'm electable. I was elected in a purple state...,Jeb Bush,[Q221997],7,"[[Jeb Bush, 0.8392], [None, 0.0925], [Hillary ...",[http://www.postandcourier.com/article/2015123...,E,[2015-12-31 03:29:00]
9855,2015-11-12-104266,The incentive to invent episodes of discrimina...,Glenn Reynolds,[Q4392664],2,"[[Glenn Reynolds, 0.3454], [Ed Driscoll, 0.322...","[http://pjmedia.com/instapundit/218734/, http:...",E,[2015-11-12 00:00:00]
...,...,...,...,...,...,...,...,...,...
5222109,2020-03-06-025712,"I think that would have been a mistake, becaus...",Jennifer Palmieri,[Q18209402],1,"[[Jennifer Palmieri, 0.9117], [None, 0.056], [...",[https://www.rollingstone.com/politics/politic...,E,[2020-03-06 14:38:07]
5231803,2020-01-18-006266,"Chief Justice Rehnquist, when he presided over...",Dick Durbin,[Q434804],1,"[[Dick Durbin, 0.898], [None, 0.079], [Charlie...",[https://www.washingtonexaminer.com/news/impea...,E,[2020-01-18 05:01:08]
5235860,2020-01-06-061256,The main difference between Lindsey and his De...,David Woodard,[Q1177254],6,"[[David Woodard, 0.7544], [None, 0.1797], [Lin...",[http://chron.com/entertainment/article/How-Li...,E,[2020-01-06 00:00:00]
5235869,2020-04-09-052373,The model of Obama asking Bush and Clinton to ...,Bill Haslam,[Q862186],1,"[[Bill Haslam, 0.905], [None, 0.0837], [Barack...",[http://www.nytimes.com/2020/04/09/us/politics...,E,[2020-04-09 23:04:21]


Load the parquet dataframe with attributes of each author

In [4]:
speaker_attributes_updated = pd.read_parquet("data/speaker_attributes_updated.parquet")


In [5]:
speaker_attributes_updated

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Great Britain, United States of America]",[male],1395141751,,W000178,"[politician, military officer, farmer, cartogr...",[independent politician],,Q23,George Washington,"[1792 United States presidential election, 178...",item,[Episcopal Church]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[United Kingdom],[male],1395737157,[White British],,"[playwright, screenwriter, novelist, children'...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Belgium],[male],1380367296,,,"[writer, lawyer, librarian, information scient...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[United States of America],[male],1395142029,,,"[politician, motivational speaker, autobiograp...",[Republican Party],,Q207,George W. Bush,"[2000 United States presidential election, 200...",item,"[United Methodist Church, Episcopal Church, Me..."
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Spain],[male],1391704596,,,[painter],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9055976,[Barker Howard],,[United States of America],[male],1397399351,,,[politician],,,Q106406560,Barker B. Howard,,item,
9055977,[Charles Macomber],,[United States of America],[male],1397399471,,,[politician],,,Q106406571,Charles H. Macomber,,item,
9055978,,[+1848-04-01T00:00:00Z],,[female],1397399751,,,,,,Q106406588,Dina David,,item,
9055979,,[+1899-03-18T00:00:00Z],,[female],1397399799,,,,,,Q106406593,Irma Dexinger,,item,


### Plot time series

We plot the time series of the number of occurrences of quotes referring to Donald Trump and Hillary Clinton, and compare them with Google Trends to check whether the two sources look similar.

### We create the data for the time series

In [4]:
Trump_dates_by_day = df_Trump.date.explode().transform(lambda x: str(x)[:-9]).groupby(by = df_Trump.date.explode().transform(lambda x: str(x)[:-9])).count()
Trump_dates_by_month = df_Trump.date.explode().transform(lambda x: str(x)[:-12]).groupby(by = df_Trump.date.explode().transform(lambda x: str(x)[:-12])).count()
Trump_dates_by_day

date
2015-01-02      3
2015-01-04      1
2015-01-07      4
2015-01-08      1
2015-01-10      1
             ... 
2020-04-12     61
2020-04-13    198
2020-04-14    182
2020-04-15    274
2020-04-16    141
Name: date, Length: 1861, dtype: int64

In [5]:
Clinton_dates_by_day = df_Clinton.date.explode().transform(lambda x: str(x)[:-9]).groupby(by = df_Clinton.date.explode().transform(lambda x: str(x)[:-9])).count()
Clinton_dates_by_month = df_Clinton.date.explode().transform(lambda x: str(x)[:-12]).groupby(by = df_Clinton.date.explode().transform(lambda x: str(x)[:-12])).count()
Clinton_dates_by_day

date
2015-01-01     9
2015-01-02    10
2015-01-03     3
2015-01-04     7
2015-01-05    13
              ..
2020-04-11     1
2020-04-13     5
2020-04-14     4
2020-04-15     6
2020-04-16     9
Name: date, Length: 1864, dtype: int64
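
The daily and monthly counts above rely on slicing the string form of each timestamp. As a hedged alternative, the same counts could be obtained by parsing the dates explicitly. The sketch below uses a toy frame with hypothetical values; it keeps string labels (`YYYY-MM-DD` and `YYYY-MM`) so that the string comparisons used later to select the years still apply.

```python
import pandas as pd

# Toy frame mimicking the `date` column: each row holds a list of occurrence timestamps.
toy = pd.DataFrame({
    "date": [
        [pd.Timestamp("2015-12-22 20:43:59"), pd.Timestamp("2015-12-22 23:16:41")],
        [pd.Timestamp("2015-12-23 08:00:00")],
        [pd.Timestamp("2016-01-02 10:00:00")],
    ]
})

# One row per occurrence, parsed as real datetimes.
occurrences = pd.to_datetime(toy["date"].explode())

# Count occurrences per calendar day and per calendar month.
by_day = occurrences.dt.strftime("%Y-%m-%d").value_counts().sort_index()
by_month = occurrences.dt.strftime("%Y-%m").value_counts().sort_index()

print(by_day)
print(by_month)
```

Formatting with `strftime` does not depend on the exact length of the timestamp string, so it would keep working if some timestamps carried fractional seconds or a time zone.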

We select the years of interest.

In [6]:
Trump_dates_by_day = Trump_dates_by_day[(Trump_dates_by_day.index<'2018-01-01') & (Trump_dates_by_day.index >='2015-01-01')]
Trump_dates_by_month = Trump_dates_by_month[(Trump_dates_by_month.index<'2018-01-01') & (Trump_dates_by_month.index >='2015-01-01')]
Clinton_dates_by_day = Clinton_dates_by_day[(Clinton_dates_by_day.index<'2018-01-01') & (Clinton_dates_by_day.index >='2015-01-01')]
Clinton_dates_by_month = Clinton_dates_by_month[(Clinton_dates_by_month.index<'2018-01-01') & (Clinton_dates_by_month.index >='2015-01-01')]

### General timeseries

In [18]:
# create figure
fig = go.Figure()

# Add one line trace per politician
fig.add_trace(go.Scatter(x = Trump_dates_by_month.index, y = Trump_dates_by_month, name ='Trump'))
fig.add_trace(go.Scatter(x = Clinton_dates_by_month.index, y = Clinton_dates_by_month, name = 'Clinton'))

fig.update_layout(
    title = "Number of quotes referring to Trump and Clinton over time",
    xaxis_title = "time",
    xaxis=dict(
    rangeslider = dict(visible = True),
    type ="date"
    )
)
    
fig.show()
pio.write_html(fig, file='General_timeseries_comparison.html', auto_open=True)

![Clinton_Trump](Images/Clinton_Trump.png)

From the plot we can extract useful information about Trump's "popularity" in newspaper quotes. Until June 2015 very few people were speaking about him; from then on, the number of quotes referring to him grew drastically. The number of quotes peaks in the summer of 2016 (July to September), and it is also very high from March to September 2017. After this period his "popularity" in newspapers slowly decreased.

The plot is equally informative about Clinton's "popularity". Unlike Trump, Clinton was already mentioned in a modest number of quotes in early 2015, but it was only in 2016 (the presidential election year) that she stood out. After that, the number of quotes referring to her decreased drastically.

We now analyze, year by year, which events coincide with peaks in the number of quotes.

## Year 2015

#### Trump

In [8]:
def create_labels_of_events(peaks):
    labels=[]
    for date in peaks:
        labels.append(date[-2:]+'/'+date[-5:-3])
    return labels

In [9]:
def plot_year_timeseries(df, politician, year, politician_peaks):
    series = df[(year +'-01-01' <= df.index) & (df.index < str(int(year)+1)+'-01-01')]

    fig = go.Figure()

    fig.add_trace(go.Scatter(
    x = series.index,
    y = series ,
    mode='lines',
    name=politician
    ))
    
    labels = create_labels_of_events(politician_peaks)
        
    fig.add_trace(go.Scatter(
    x = politician_peaks,
    y = [series[peak] for peak in politician_peaks],
    mode='markers+text',
    text = labels,
    textposition="top center",
    textfont=dict(size= 15, color='black', family='Arial, sans-serif'),
    marker=dict(
        size=8,
        color='red',
        symbol='cross'
    ),
    name='Detected Peaks'
    ))
    if year == "2016":
        fig.add_trace(go.Scatter(x=["2016-11-08", "2016-11-08"], y=[0,series.max()], mode="lines", name="Election day"))
    fig.update_layout(title_text= politician + " timeseries of occurrences in " + year)

    fig.show()
    return series

In [10]:
Trump_2015 = plot_year_timeseries(Trump_dates_by_day, 'Trump', '2015', ["2015-09-17","2015-12-08"])

![Trump_2015](Images/Trump_2015.png)

As the plot shows, before July 2015 very few quotes in the newspapers mentioned Trump. From July 2015 onward we see a progressive increase in the quotes referring to him, which suggests that his popularity increased over that period. Indeed, he announced his candidacy for president of the United States on 16 June 2015 (small peak in the plot), and the summer of 2015 has been called "the summer of Trump" because his popularity and support grew considerably in that period. The peak detection shows two peaks in 2015. The first, on 17 September 2015, falls one day after the second Republican debate. The second, on 8 December 2015, falls one day after Trump called for a ban on all Muslims entering the US.
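
The peak dates passed to `plot_year_timeseries` are hard-coded, and the detection step itself is not shown in this notebook. One way such peaks could be found is with `scipy.signal.find_peaks`, which the notebook imports. This is only a sketch on toy data: the counts are hypothetical, and the `prominence` and `distance` thresholds are illustrative assumptions, not the values used for the analysis.

```python
import pandas as pd
from scipy.signal import find_peaks

# Toy daily counts (hypothetical values) indexed by date strings, like the daily series above.
dates = pd.date_range("2015-09-10", periods=10, freq="D").strftime("%Y-%m-%d")
counts = pd.Series([5, 6, 4, 7, 5, 6, 40, 90, 30, 8], index=dates)

# `prominence` and `distance` are illustrative thresholds, not the authors' settings.
peak_positions, _ = find_peaks(counts.to_numpy(), prominence=50, distance=3)

# Map the integer positions back to date labels.
peak_dates = list(counts.index[peak_positions])
print(peak_dates)
```

Filtering on prominence rather than on raw height keeps only spikes that stand out from their surroundings, which helps when the baseline level of quotes changes over the year.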

It is interesting to compare this plot with the corresponding Google Trends one.

In [11]:
def google_trend_comparison(series, politician, year):
    
    # Load the Google Trends time series (number of searches) for the given politician and year
    google_trend= pd.read_csv('data/'+ politician+ '_trends_' + year +'.csv')
    
    google_trend = google_trend.iloc[1:,:]
    google_trend = pd.to_numeric(google_trend.iloc[:,0])

    google_trends_norm = (google_trend-google_trend.min()) / (google_trend.max()-google_trend.min())

    # we transform in a weekly time series to compare better with google trends
    series.index = pd.to_datetime(series.index)
    series = series.groupby(pd.Grouper(freq='W')).sum()


    series_norm = (series-series.min()) / (series.max()-series.min())

    
    google_trends_trace = go.Scatter(
    x=google_trends_norm.index,
    y=google_trends_norm.values,
    name='Normalized number of searches in Google',
    yaxis='y1'
    )
    
    ny_quotes_trace = go.Scatter(
    x=series_norm.index,
    y=series_norm.values,
    name='Normalized number of occurrences in NYT',
    yaxis='y2'
    )
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    fig.add_trace(google_trends_trace)
    fig.add_trace(ny_quotes_trace,secondary_y=True)
    if year == "2016":
        fig.add_trace(go.Scatter(x=["2016-11-08", "2016-11-08"], y=[0,series_norm.max()], mode="lines", name="Election day"))
    fig.update_layout(
        title = "Timeseries of the number of times that "+ politician +" was searched/quoted",
        xaxis_title = "time",
        xaxis=dict(
        rangeslider = dict(visible = True),
                type ="date"
        )
    )

    fig.show()
    # Save the figure before returning the weekly series
    pio.write_html(fig, file='both_trends_'+ year + '.html', auto_open=True)
    return series

In [12]:
Trump_2015_weekly = google_trend_comparison(Trump_2015, 'Trump', '2015')

![Trump_Trends_2015](Images/Trump_Trends_2015.png)

As we can observe, the match is close. The peaks of the Google Trends time series fall in the same weeks as the peaks in media quotes, and the two time series behave very similarly overall. The main difference lies in the period between June and November 2015, when the word "Trump" was searched on the web proportionally more often than it appeared in media quotes.
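
The visual match described above could also be quantified. The sketch below aggregates toy daily counts into weekly totals, applies the same min-max normalization as `google_trend_comparison`, and computes a Pearson correlation with `scipy.stats.pearsonr`. The correlation step is an addition for illustration and is not part of the original analysis; all values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def min_max(series):
    """Rescale a series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())


# Toy daily quote counts (hypothetical values) over four full Monday-to-Sunday weeks.
days = pd.date_range("2015-06-01", periods=28, freq="D")
daily_counts = pd.Series(np.arange(28), index=days)

# Weekly totals, using the same weekly grouping as in `google_trend_comparison`.
weekly_quotes = daily_counts.groupby(pd.Grouper(freq="W")).sum()

# Toy weekly search-interest values (hypothetical), aligned on the same weeks.
weekly_searches = pd.Series([10, 35, 55, 100], index=weekly_quotes.index)

# Correlation between the two normalized weekly series.
r, p_value = pearsonr(min_max(weekly_quotes), min_max(weekly_searches))
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
```

Min-max scaling does not change the Pearson coefficient, so the normalization matters for plotting the two curves on comparable axes rather than for the correlation itself.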

#### Clinton

In [13]:
Clinton_2015 = plot_year_timeseries(Clinton_dates_by_day, 'Clinton', '2015',["2015-08-07","2015-08-12","2015-09-17","2015-11-11"])

![Clinton_2015](Images/Clinton_2015.png)

First of all, the number of occurrences of the word "Clinton" in quotes reported by the media is remarkably smaller than for Trump. That said, unlike Trump, Clinton already appeared frequently in speakers' quotes at the beginning of the year. Four peaks were detected by the peak-finding algorithm. The first, on 7 August, follows the first Republican debate; the second, on 12 August, corresponds to the email scandal; the third, on 17 September, is shared with Trump and follows the second Republican debate; finally, the one on 11 November corresponds to Hillary Clinton's well-known speech against terrorism.

Google trends comparison

In [14]:
Clinton_2015_weekly = google_trend_comparison(Clinton_2015, 'Clinton', '2015')

![Clinton_Trends_2015](Images/Clinton_Trends_2015.png)

In this case too the match is remarkable: the two lines seem to behave the same way overall.

Comparison Trump/Clinton in 2015

In [15]:
def next_weekday(d, weekday):
    days_ahead = weekday - d.weekday()
    if days_ahead <= 0: # Target day already happened this week
        days_ahead += 7
    return d + dt.timedelta(days_ahead)
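
In the comparison function of this notebook, `next_weekday` is called with `weekday = -1`. With that argument the arithmetic returns the Sunday on or after the given date, which is the label that `pd.Grouper(freq='W')` assigns to the weekly bin containing it. The sketch below checks this on a few dates and compares it with an equivalent pandas expression; `week_end_sunday` is a hypothetical helper introduced only for illustration.

```python
import datetime as dt

import pandas as pd


def next_weekday(d, weekday):
    """Same arithmetic as above: roll `d` forward to the requested weekday."""
    days_ahead = weekday - d.weekday()
    if days_ahead <= 0:
        days_ahead += 7
    return d + dt.timedelta(days_ahead)


def week_end_sunday(date_string):
    """Sunday that closes the Monday-to-Sunday week containing `date_string`."""
    return pd.Timestamp(date_string).to_period("W-SUN").end_time.normalize()


# A Thursday, a Tuesday and a Sunday.
for day in ["2015-09-17", "2015-12-08", "2015-09-20"]:
    via_arithmetic = next_weekday(pd.Timestamp(day), -1)
    via_period = week_end_sunday(day)
    print(day, via_arithmetic.date(), via_period.date())
```

The period-based version states the intent (end of the week ending on Sunday) directly, whereas the `-1` argument relies on the `+= 7` correction always being applied.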

In [19]:
def comparison_Trump_Cliton(year, Clinton_series, Trump_series, Clinton_peaks, Trump_peaks):
    
    
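    # next_weekday(..., -1) returns the Sunday on or after each peak date,
    # i.e. the label of the weekly bin (pd.Grouper(freq='W')) that contains it
    peak_indices_Trump = [str(next_weekday(pd.Timestamp(date),-1).date()) for date in Trump_peaks]
    peak_indices_Clinton = [str(next_weekday(pd.Timestamp(date),-1).date()) for date in Clinton_peaks]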
    
    labels_Trump = create_labels_of_events(Trump_peaks)
    labels_Clinton = create_labels_of_events(Clinton_peaks)
        
    fig = go.Figure()

    fig.add_trace(go.Scatter(
    x = Clinton_series.index,
    y = Clinton_series ,
    mode='lines',
    name='Clinton '+ year
    ))
    fig.add_trace(go.Scatter(
    x = Trump_series.index,
    y = Trump_series ,
    mode='lines',
    name='Trump '+ year
    ))
    fig.add_trace(go.Scatter(
    x = peak_indices_Trump,
    y = [Trump_series.at[peak] for peak in peak_indices_Trump],
    mode='markers+text',
    text = labels_Trump,
    textposition="top center",
    textfont=dict(size= 15, color='black', family='Arial, sans-serif'),
    marker=dict(
        size=8,
        color='red',
        symbol='cross'
    ),
    name='Detected Peaks Trump'
    ))
    fig.add_trace(go.Scatter(
    x = peak_indices_Clinton,
    y = [Clinton_series.at[peak] for peak in peak_indices_Clinton],
    mode='markers+text',
    text = labels_Clinton,
    textposition="top center",
    textfont=dict(size= 15, color='black', family='Arial, sans-serif'),
    marker=dict(
        size=8,
        color='blue',
        symbol='cross'
    ),
    name='Detected Peaks Clinton'
    ))
    if year == "2016":
        fig.add_trace(go.Scatter(x=["2016-11-08", "2016-11-08"], y=[0,Trump_series.max()], mode="lines", name="Election day"))
    fig.update_layout(title_text="Clinton and Trump timeseries of occurrences in " +year)

    fig.show()
    pio.write_html(fig, file='both_timeseries_'+ year + '.html', auto_open=True)

In [20]:
comparison_Trump_Cliton('2015', Clinton_2015_weekly, Trump_2015_weekly,["2015-08-07","2015-08-12","2015-09-17","2015-11-11"],["2015-09-17","2015-12-08"])

![Clinton_Trump_2015](Images/Clinton_Trump_2015.png)

## Year 2016

#### Trump

In [21]:
Trump_2016 = plot_year_timeseries(Trump_dates_by_day, 'Trump', '2016',["2016-05-02","2016-05-31","2016-07-28","2016-09-27"])

![Trump_2016](Images/Trump_2016.png)

In 2016 the distribution of quotes related to Trump is quite varied: periods with many quotes alternate with periods with few. We observe a substantial number of quotes from the beginning to the end of February, from mid-April to the beginning of June, from the end of June to the beginning of October, and from the end of November to the new year. This year is marked by the US presidential election of 8 November 2016, in which Donald Trump defeated his opponent Hillary Clinton. At first sight it seems strange that in November we have only a few quotes related to Trump despite the election, but this is probably because in that period his name appeared in many articles while few speakers mentioned him directly in their quotes.

We now try to understand why the number of quotes alternates in this way and to which events the peaks correspond. The peak of 2 May corresponds to Trump accusing China of "raping" the US with its trade policy. The peak of 31 May corresponds to the press conference in which he answered questions about his fund-raising for charities that benefit military veterans and scolded journalists. The biggest peak of the year was reached on 28 July 2016, one day after Trump invited Russian hackers to find the 30,000 deleted emails; he was referring to the private server that Clinton used as Secretary of State, from which emails deemed "personal" were deleted and not turned over to State Department investigators. Another big peak detected by the algorithm is that of 27 September 2016, the day after the first presidential debate, in which Trump notably lost his composure.

In [22]:
Trump_2016_weekly = google_trend_comparison(Trump_2016, 'Trump', '2016')

![Trump_Trends_2016](Images/Trump_Trends_2016.png)

This year the peaks, as well as the overall behaviour of the two plots, differ. We can explain this difference by the particular event that took place this year: the presidential election. While the main peak of Google Trends falls in the week of the election (November 2016), the peak in our media distribution comes before the election. The two data sources measure different things: in our case the majority of the quotes about Trump belong to the pre-election period, since many people mentioned Trump to criticize his political ideas, while the majority of Google users searched for the word "Trump" only during or shortly after the election. One possible explanation is the use of Google to follow the live results of the election.

#### Clinton

In [23]:
Clinton_2016 = plot_year_timeseries(Clinton_dates_by_day, 'Clinton', '2016',["2016-07-26","2016-09-27"])

![Clinton_2016](Images/Clinton_2016.png)

We observe an effect similar to the one seen in Trump's 2016 plot. Here too we do not observe many quotes during the election period, while we observe a large number of quotes from the beginning of July to the end of September, i.e. mainly in the pre-election period. The reasons are the same as those given for Trump. We now try to identify which events cause the peaks in the time series. The biggest peak (26 July 2016) is the day on which the Democrats officially nominated Hillary Clinton as their standard-bearer in the presidential contest at the Democratic National Convention in Philadelphia, sealing her position as the first female nominee of a major party in US history. Another peak is that of 27 September, which once again follows the first presidential debate.

In [24]:
Clinton_2016_weekly = google_trend_comparison(Clinton_2016, 'Clinton', '2016')

![Clinton_Trends_2016](Images/Clinton_Trends_2016.png)

The same discussion as for the Trump vs. Google Trends plot applies here.

In [25]:
comparison_Trump_Cliton('2016', Clinton_2016_weekly, Trump_2016_weekly,["2016-07-26","2016-09-27"],["2016-05-02","2016-05-31","2016-07-28","2016-09-27"])

![Clinton_Trump_2016](Images/Clinton_Trump_2016.png)

As we can see, the two plots behave in the same way, but Trump has a higher number of quotes overall.

## Year 2017

#### Trump

In [26]:
Trump_2017 = plot_year_timeseries(Trump_dates_by_day, 'Trump', '2017',["2017-02-01","2017-04-07","2017-05-10"])

![Trump_2017](Images/Trump_2017.png)

In 2017 the distribution of quotes related to Trump shows many spikes and a higher mean value than in the previous years. A consistently large number of quotes can be observed throughout the year, reflecting the variety of controversies involving Trump in his first year as president. Let's take a look at some of these controversial events. On 21 January, one day after he was sworn in as the 45th president of the United States, the Women’s March took place. This worldwide protest was prompted by the fact that several of Trump's statements were considered by many to be anti-women or offensive to women. Shortly after, on 27 January, Trump signed the first travel ban executive order, halting the admission of Syrian refugees and barring citizens of seven countries for 90 days. These events, plus other controversies in his first two weeks in office, led to the first peak (1 February) in the time series. About two months later (7 April), we observe a peak that corresponds to a missile strike on a Syrian airfield ordered by Trump. On 9 May Trump fired the director of the FBI, James Comey, abruptly terminating the top official leading a criminal investigation into whether Mr. Trump’s advisers colluded with the Russian government to steer the outcome of the 2016 presidential election. One day later (10 May), Trump met with the Russian Foreign Minister and the Russian Ambassador and was accused of revealing highly classified information, criticizing the FBI investigation and discrediting Comey. This last event matches the highest peak in the 2017 time series.

In [27]:
Trump_weekly_2017 = google_trend_comparison(Trump_2017, 'Trump', '2017')

![Trump_Trends_2017](Images/Trump_Trends_2017.png)

The peaks and the behaviour of these two plots differ a lot. We observe a substantial peak in Google Trends on 22 January, one day after the Women’s March protest, and we can explain the main differences between the two normalized plots by this event. The march aimed to advocate legislation and policies regarding human rights and other issues, including women's rights, immigration reform, reproductive rights, the environment, LGBTQ rights, racial equality, freedom of religion and workers' rights: topics that are mainstream in digital and social media, leading to a maximum in web searches.

#### Clinton

In [28]:
Clinton_2017 = plot_year_timeseries(Clinton_dates_by_day, 'Clinton', '2017',["2017-05-03","2017-05-10"])

![Clinton_2017](Images/Clinton_2017.png)

In 2017, the year after Clinton lost the presidential election, we observe a significant decrease in the quotes referring to Hillary Clinton. This decrease in media popularity can be explained by the simple fact that the main event boosting her popularity (the presidential election) was over; having lost the election and not being involved in many controversies, she simply became less relevant for the media. Nevertheless, we can still observe some peaks in the time series. The most relevant ones are 3 May, which corresponds to Hillary Clinton's appearance as featured speaker at the Women for Women International event, where among other topics she blamed Russian interference in the US election for her loss, and 10 May, the day on which the meeting between Trump and the Russian Foreign Minister and Ambassador took place.

In [29]:
Clinton_weekly_2017 = google_trend_comparison(Clinton_2017, 'Clinton', '2017')

![Clinton_Trends_2017](Images/Clinton_Trends_2017.png)

We observe an effect similar to the one seen in Trump's 2017 plot. Indeed, we observe a significant peak in Clinton's Google Trends plot in roughly the same week as Trump's, driven by the intense promotion of the Women’s March, a protest involving the same issues that Hillary Clinton focused on in her presidential campaign (i.e. women's rights, LGBTQ rights, improving healthcare, etc.). In fact, the march borrowed one of Clinton's most famous slogans, “Women’s Rights are Human Rights and Human Rights are Women’s Rights”, from her 1995 speech on women’s rights in Beijing.

In [30]:
comparison_Trump_Cliton('2017', Clinton_weekly_2017, Trump_weekly_2017,["2017-05-03","2017-05-10"] ,["2017-02-01","2017-04-07","2017-05-10"])

![Clinton_Trump_2017](Images/Clinton_Trump_2017.png)

### Relation between attribute and speakers

Our objective here is to determine whether there is any relation between a specific attribute and the speakers who quote a certain politician.

To carry out this type of analysis, we first load the file in which the `authorId` has already been merged. This column contains the index of the speaker attributes file. We then merge it with the speaker attributes file (this is very fast since the expensive merge has already been carried out).

In [31]:
#Load Clinton's speakers attributes file
Clinton_attributes = pd.read_pickle("data/Clinton_with_attributes.pkl")

In [32]:
Clinton_attributes.qids = Clinton_attributes.qids.transform(lambda x: x[0])
Clinton_with_att = pd.merge(Clinton_attributes, speaker_attributes_updated, left_on = 'qids', right_on = 'id' ) 

In [33]:
Trump_attributes = pd.read_pickle("data/Trump_with_attributes.pkl")

In [34]:
Trump_attributes = Trump_attributes[Trump_attributes.qids.transform(lambda x : len(x)>0)]
Trump_attributes.qids = Trump_attributes.qids.transform(lambda x: x[0])
Trump_with_att = pd.merge(Trump_attributes, speaker_attributes_updated, left_on = 'qids', right_on = 'id' ) 

### Relation between age and speakers 

Our objective here is to analyse the age distribution of Clinton's and Trump's speakers.

We compute the age of each speaker from their date of birth and then filter out the outliers, which we define as ages above 100 years.

In [35]:
age_serie_Clinton = Clinton_with_att.groupby(by = 'speaker').date_of_birth.head(1)
age_serie_Clinton = age_serie_Clinton.dropna().transform(lambda x: 2021 - int(x[0][1:5]))
age_serie_Clinton = age_serie_Clinton[age_serie_Clinton <=100]
age_serie_Clinton = age_serie_Clinton.rename("age")
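
The age computation above relies on slicing the year out of the date string. A slightly more defensive version can be sketched as follows; it assumes Wikidata-style strings such as `+1947-10-26T00:00:00Z` (a sign character followed by a four-digit year), and the helper name `age_from_birth_dates` is hypothetical, not part of the notebook.

```python
REFERENCE_YEAR = 2021  # same reference year as in the cells above


def age_from_birth_dates(dates, reference_year=REFERENCE_YEAR, max_age=100):
    """Return an age in years, or None if the value is missing or implausible.

    `dates` is expected to be a list-like of date strings; only the first
    entry is used, as in the notebook.
    """
    try:
        first = dates[0]
        sign = first[0]
        year = int(first[1:5])
    except (TypeError, IndexError, ValueError):
        # None, NaN, an empty list or a malformed string
        return None
    if sign == '-':
        # A BCE date cannot belong to a speaker quoted in recent news
        return None
    age = reference_year - year
    return age if 0 < age <= max_age else None


print(age_from_birth_dates(['+1947-10-26T00:00:00Z']))
```

Applied with something like `Clinton_with_att.date_of_birth.map(age_from_birth_dates)`, this would fold the missing-value handling and the outlier filter into a single step.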

In [None]:
fig = px.histogram(age_serie_Clinton, color_discrete_sequence=['indianred'], marginal="box")
fig.update_layout(title_text='Distribution of the ages of Clinton speakers') # title of plot
fig.show(renderer='svg')

![Clinton_Age](Images/Clinton_Age.png)

As we can see, most of the speakers are quite old. What about the age distribution of Trump's speakers? We repeat the same procedure for them.

In [37]:
age_serie_Trump = Trump_with_att.groupby(by = 'speaker').date_of_birth.head(1)
age_serie_Trump = age_serie_Trump.dropna().transform(lambda x: 2021 - int(x[0][1:5]))
age_serie_Trump = age_serie_Trump[age_serie_Trump <=100]
age_serie_Trump = age_serie_Trump.rename("age")

In [None]:
fig = px.histogram(age_serie_Trump, color_discrete_sequence=['royalblue'], marginal="box")
fig.update_layout(title_text='Distribution of the ages of Trump speakers') # title of plot
fig.show(renderer='svg')

![Trump_Age](Images/Trump_Age.png)

As we can see, the distribution resembles a normal distribution centred around 60 years, apart from a peak at 21 years and some noise. It is interesting to compare this distribution with Clinton's on the same plot.

In [39]:
age_serie_speaker_distribution = speaker_attributes_updated.date_of_birth
age_serie_speaker_distribution = age_serie_speaker_distribution.dropna().transform(lambda x: 2021 - int(x[0][1:5]))
age_serie_speaker_distribution = age_serie_speaker_distribution[((age_serie_speaker_distribution <=100) & (age_serie_speaker_distribution > 0))]
age_serie_speaker_distribution = age_serie_speaker_distribution.rename("age")

In [None]:
import plotly.figure_factory as ff

# Add histogram data
x1 = age_serie_Trump
x2 = age_serie_Clinton
x3 = age_serie_speaker_distribution.sample(10000, replace=True)

# Group data together
hist_data = [x1, x2, x3]

group_labels = ['Trump', 'Clinton', 'Authors Distribution']
colors = ['royalblue', 'indianred', 'lightslategrey']
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, colors = colors)
fig.update_layout(
    autosize=False,
    width=1000,
    height=800,)

fig.show()

![Clinton_Trump_Age](Images/Clinton_Trump_Age.png)

From the distributions it seems that Clinton's speakers are in general slightly older than Trump's speakers, so we will compute the mean for both groups. It is also notable that both age distributions deviate from the general author distribution; one possible reason is that older people are more prone to talk about politics, since it is a subject that requires some knowledge of the history of the state.

In [41]:
print('the mean of the age for Trump speakers is ', age_serie_Trump.mean())
print('the mean of the age for Clinton speakers is ', age_serie_Clinton.mean())

the mean of the age for Trump speakers is  58.19625944131399
the mean of the age for Clinton speakers is  59.40873533246415


Is the difference in age statistically significant? First of all, we visualize the means with confidence intervals.

In [42]:
def bootstrap_CI(data, nbr_draws):
    means = np.zeros(nbr_draws)
    data = np.array(data)

    for n in range(nbr_draws):
        indices = np.random.randint(0, len(data), len(data))
        data_tmp = data[indices] 
        means[n] = np.nanmean(data_tmp)

    return [np.nanpercentile(means, 2.5),np.nanpercentile(means, 97.5)]

In [43]:
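ci_Trump, ci_Clinton = bootstrap_CI(age_serie_Trump, 1000), bootstrap_CI(age_serie_Clinton, 1000)
# error_y is drawn as mean ± value, so pass the half-width of each interval (one bootstrap run per group)
error_Trump = (ci_Trump[1] - ci_Trump[0]) / 2
error_Clinton = (ci_Clinton[1] - ci_Clinton[0]) / 2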

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Trump',
    x=['Trump'], y=[age_serie_Trump.mean()],
    error_y=dict(type='data', array=[error_Trump]),
    marker=dict(color='royalblue')
))
fig.add_trace(go.Bar(
    name='Clinton',
    x=['Clinton'], y=[age_serie_Clinton.mean()],
    error_y=dict(type='data', array=[error_Clinton]),
    marker=dict(color='indianred')
))
fig.update_layout(barmode='group')
fig.show(renderer='svg')

![Age_Ratio](Images/Age_Ratio.png)

In [45]:
stats = sc.stats.ttest_ind(age_serie_Trump, age_serie_Clinton, equal_var=False)  # Welch's t-test (unequal variances)

print('the p value for the t test with null hypothesis stating that the two populations have the same mean is ', stats.pvalue)

the p value for the t test with null hypothesis stating that the two populations have the same mean is  8.770378545499955e-07


With a p-value smaller than 0.05 (our chosen significance level), we reject the null hypothesis that the two populations have the same mean. Therefore Clinton's speakers are older on average (by about 1.2 years), and the difference is statistically significant.
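
As a cross-check on the t-test, the bootstrap idea used above for each group can also be applied directly to the difference of the two means. The following is only a sketch on synthetic ages: the function name `bootstrap_mean_difference_ci` and the generated data are illustrative assumptions, not the notebook's results.

```python
import numpy as np


def bootstrap_mean_difference_ci(a, b, nbr_draws=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled independently with replacement.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(nbr_draws)
    for n in range(nbr_draws):
        sample_a = rng.choice(a, size=a.size, replace=True)
        sample_b = rng.choice(b, size=b.size, replace=True)
        diffs[n] = sample_a.mean() - sample_b.mean()
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# Synthetic ages, for illustration only (not the notebook's data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=56.0, scale=15.0, size=2000)
group_b = rng.normal(loc=60.0, scale=15.0, size=2000)

lower, upper = bootstrap_mean_difference_ci(group_a, group_b)
print(f"95% CI for the difference in means: [{lower:.2f}, {upper:.2f}]")
```

If the resulting interval excludes zero, the difference in means is significant at the chosen level, which is the same question the t-test answers under different assumptions.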

### Relation between ethnic group and speakers

Our objective here is to understand which ethnic groups Trump's and Clinton's speakers belong to, and whether there is any relevant difference between the two.

In [46]:
Clinton_ethnics = Clinton_with_att.groupby(by='id').first().explode('ethnic_group').reset_index(drop=True)['ethnic_group'].dropna()

In [47]:
ethnicities_Clinton = Clinton_ethnics.groupby(Clinton_ethnics).count().sort_values(ascending = False)
major_ethnicities_Clinton = ethnicities_Clinton.iloc[:8].copy()  # copy before adding the "Others" entry
major_ethnicities_Clinton["Others"] = ethnicities_Clinton.iloc[8:].sum()

In [None]:
x = major_ethnicities_Clinton.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Clinton',
    x=x, y=[major_ethnicities_Clinton[el] for el in x]
))
fig.update_layout(barmode='group', title = 'Ethnicities of Clinton speakers')
fig.update_traces(marker=dict(color="Indianred"))
fig.show(renderer="svg")

![Clinton_Eth](Images/Clinton_Eth.png)

As we can see from the plot, African Americans form the largest named group among the speakers with a recorded ethnic group, but speakers of other ethnicities have also quoted Clinton.

It is interesting to analyse how this differs from the ethnicities of Trump's speakers.

In [49]:
Trump_ethnics = Trump_with_att.groupby(by='id').first().explode('ethnic_group').reset_index(drop=True)['ethnic_group'].dropna()
ethnicities_Trump = Trump_ethnics.groupby(Trump_ethnics).count().sort_values(ascending = False)
major_ethnicities_Trump = ethnicities_Trump.iloc[:8].copy()  # copy before adding the "Others" entry
major_ethnicities_Trump["Others"] = ethnicities_Trump.iloc[8:].sum()
major_ethnicities_Trump

ethnic_group
African Americans    338
Jewish people         56
American Jews         34
Armenian American     17
Italian American      13
Irish people          11
English people        10
Yoruba people         10
Others               238
Name: ethnic_group, dtype: int64

In [None]:
x = major_ethnicities_Trump.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Trump',
    x=x, y=[major_ethnicities_Trump[el] for el in x]
))
fig.update_layout(barmode='group', title='Ethnicities of Trump speakers')
fig.update_traces(marker=dict(color="royalblue"))
fig.show(renderer='svg')


![Trump_Eth](Images/Trump_Eth.png)

In [51]:
Authors_distribution_ethnics = speaker_attributes_updated.ethnic_group.dropna().transform(lambda x: x[0])
major_ethnicities_speaker_distribution = Authors_distribution_ethnics
major_ethnicities_speaker_distribution = major_ethnicities_speaker_distribution.value_counts().sort_values(ascending = False)[0:8]
major_ethnicities_speaker_distribution 

Han Chinese people    37677
African Americans     19021
Armenians             11418
Greeks                 5266
Albanians              4051
Bulgarians             3620
Czechs                 2984
Ukrainians             2901
Name: ethnic_group, dtype: int64

In [None]:
x0 = Trump_ethnics[Trump_ethnics.isin(major_ethnicities_Trump.index.tolist())]
x1 = Clinton_ethnics[Clinton_ethnics.isin(major_ethnicities_Clinton.index.tolist())]
x2 = Authors_distribution_ethnics[Authors_distribution_ethnics.isin(major_ethnicities_speaker_distribution.index.tolist())]

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=x0,
    histnorm='percent',
    name='Trump', # name used in legend and hover labels,
    marker_color='royalblue'
))
fig.add_trace(go.Histogram(
    x=x1,
    histnorm='percent',
    name='Clinton',
    marker_color='indianred'
))
fig.add_trace(go.Histogram(
    x=x2,
    histnorm='percent',
    name='Authors Distribution',
    marker_color='lightslategrey'
))

fig.update_layout(
    title_text='Major ethnicities comparison', # title of plot
    xaxis_title_text='Value', # xaxis label
    yaxis_title_text='Percent', # yaxis label (histnorm='percent')
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)

fig.show()

![Clinton_Trump_Eth](Images/Clinton_Trump_Eth.png)

In [None]:
Pie_major_ethnicities_Clinton = major_ethnicities_Clinton.copy().add(major_ethnicities_Trump*0, fill_value=0)
Pie_major_ethnicities_Clinton = Pie_major_ethnicities_Clinton.copy().add(major_ethnicities_speaker_distribution*0, fill_value=0)
Pie_major_ethnicities_Trump = major_ethnicities_Trump.copy().add(Pie_major_ethnicities_Clinton*0, fill_value=0)
Pie_major_ethnicities_General = major_ethnicities_speaker_distribution.copy().add(Pie_major_ethnicities_Clinton*0, fill_value=0)

Pie_General = Pie_major_ethnicities_General.index.tolist()

fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'} ]],subplot_titles=['Clinton', 'Trump', 'Speaker Distribution'])

fig.add_trace(go.Pie(
    name='Clinton',
    labels=Pie_General, values=[Pie_major_ethnicities_Clinton[el] for el in Pie_General],textinfo='label+percent', insidetextorientation='radial'
),1,1)

fig.add_trace(go.Pie(
    name='Trump',
    labels=Pie_General, values=[Pie_major_ethnicities_Trump[el] for el in Pie_General],textinfo='label+percent', insidetextorientation='radial'
),1,2)

fig.add_trace(go.Pie(
    name='General',
    labels=Pie_General, values=[Pie_major_ethnicities_General[el] for el in Pie_General],textinfo='label+percent', insidetextorientation='radial'
),1,3)

fig.update_traces(textposition='inside')
fig.update_layout(title_text='Top Ethnicity Speakers')
fig.show()
pio.write_html(fig, file='Top_Ethnicity_Speakers_Pie_Chart'+'.html', auto_open=True)

![Eth_Pie_Chart](Images/Eth_Pie_Chart.png)

As we can see, there are some differences between the major ethnicities of the two groups: for example, there is a higher percentage of African Americans speaking about Trump, while there is a higher percentage of Irish Americans, American Jews, Jewish people, Italian Americans and Armenian Americans speaking about Clinton. One can also see that some ethnic groups, such as Han Chinese people and Armenians, are not represented, even though they account for a substantial number of authors in the original author distribution.

### Relation between gender and speakers

In [54]:
Clinton_genders = Clinton_with_att.groupby(by='id').first()['gender'].dropna().transform(lambda x: x[0])
Clinton_genders = Clinton_genders[Clinton_genders.isin(['male', 'female'])]

In [None]:
fig = px.histogram(Clinton_genders, color_discrete_sequence=['indianred'])
fig.update_layout(title_text='Distribution of the gender of Clinton speakers') # title of plot
fig.show(renderer='svg')

![Clinton_gender](Images/Clinton_gender.png)

What is the male-to-female ratio?

In [56]:
male = Clinton_genders[Clinton_genders == 'male']
male_size = male.size
female_size = Clinton_genders.size - male.size
print('the ratio number of males/number of females for Clinton speakers is ', male_size/female_size)

the ratio number of males/number of females for Clinton speakers is  3.483306836248013


In [57]:
Trump_genders = Trump_with_att.groupby(by='id').first()['gender'].dropna().transform(lambda x: x[0])
Trump_genders = Trump_genders[Trump_genders.isin(['male', 'female'])]

In [None]:
fig = px.histogram(Trump_genders)
fig.update_layout(title_text='Distribution of the gender of Trump speakers') # title of plot
fig.update_traces(marker=dict(color="royalblue"))
fig.show(renderer='svg')

![Trump_gender](Images/Trump_gender.png)

In [59]:
male = Trump_genders[Trump_genders == 'male']
male_size = male.size
female_size = Trump_genders.size - male.size
print('the ratio number of males/number of females for Trump speakers is ', male_size/female_size)

the ratio number of males/number of females for Trump speakers is  3.2282913165266107


In [60]:
Author_distribution_genders = speaker_attributes_updated.gender.dropna().transform(lambda x: x[0])
Author_distribution_genders = Author_distribution_genders[Author_distribution_genders.isin(['male', 'female'])]

In [None]:
x0 = Trump_genders
x1 = Clinton_genders
x2 = Author_distribution_genders

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=x0,
    histnorm='percent',
    name='Trump', # name used in legend and hover labels,
    marker_color='royalblue'
))
fig.add_trace(go.Histogram(
    x=x1,
    histnorm='percent',
    name='Clinton',
    marker_color='indianred'
))
fig.add_trace(go.Histogram(
    x=x2,
    histnorm='percent',
    name='Author distribution',
    marker_color='lightslategrey'
))

fig.update_layout(
    title_text='Sampled Results', # title of plot
    xaxis_title_text='Value', # xaxis label
    yaxis_title_text='Percent', # yaxis label (histnorm='percent')
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)

fig.show()

![Clinton_Trump_gender](Images/Clinton_Trump_gender.png)

The difference is small. Most of the speakers in both groups are men, and the ratios computed above (about 3.48 for Clinton and 3.23 for Trump) indicate a slightly higher proportion of men among the people whose quotes refer to Clinton. Is this difference statistically significant? We should also add that it is the Clinton data that deviates most from the overall author distribution.

We add confidence intervals to the values of the ratios.

In [62]:
def bootstrap_CI_ratios(data, nbr_draws):
    ratios = np.zeros(nbr_draws)
    data = np.array(data)

    for n in range(nbr_draws):
        indices = np.random.randint(0, len(data), len(data))
        data_tmp = data[indices] 
        male = data_tmp[data_tmp=='male']
        male_size = male.size
        female_size = data_tmp.size - male.size
        ratios[n] = male_size/female_size
    
    return [np.nanpercentile(ratios, 2.5),np.nanpercentile(ratios, 97.5)]

In [63]:
confidence_interval_Clinton = bootstrap_CI_ratios(Clinton_genders, 1000)
confidence_interval_Trump = bootstrap_CI_ratios(Trump_genders, 1000)
print('the confidence intervals for Clinton ratio of gender is', confidence_interval_Clinton)
print('the confidence intervals for Trump ratio of gender is',confidence_interval_Trump)

the confidence intervals for Clinton ratio of gender is [3.311926605504587, 3.6792682770053524]
the confidence intervals for Trump ratio of gender is [3.0722549379670037, 3.398790685704657]


The two confidence intervals overlap slightly (between roughly 3.31 and 3.40), so from these intervals alone we cannot conclude that the difference in ratios is statistically significant.
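
Comparing two separate intervals is only an indirect check. A more direct alternative is a two-proportion z-test on the share of male speakers in each group (a male/female ratio `r` corresponds to a male share of `r / (1 + r)`). The sketch below uses hypothetical counts, since the notebook prints only the ratios; the function name `two_proportion_z_test` is also an assumption.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: male speakers and total speakers in each group
z, p_value = two_proportion_z_test(780, 1000, 760, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

With the real data, the inputs would be the number of `'male'` entries and the total size of `Clinton_genders` and `Trump_genders`.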

### Relation between nationality and speakers

In [64]:
Clinton_nationalities = Clinton_with_att.groupby(by='id').first().explode('nationality').reset_index(drop=True)['nationality'].dropna()

In [65]:
nationalities_Clinton = Clinton_nationalities.groupby(Clinton_nationalities).count().sort_values(ascending=False)

major_nationalities_Clinton = nationalities_Clinton.iloc[:10].copy()  # copy before adding the "Others" entry
major_nationalities_Clinton["Others"] = nationalities_Clinton.iloc[10:].sum()

print(major_nationalities_Clinton.to_string())

nationality
United States of America    5256
United Kingdom               502
Canada                       214
Australia                    181
Germany                       92
Israel                        64
India                         63
Russia                        61
Mexico                        59
New Zealand                   55
Others                       890


In [None]:
x = major_nationalities_Clinton.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Clinton',
    x=x, y=[major_nationalities_Clinton[el] for el in x]
))
fig.update_layout(barmode='group', title = 'Nationalities of Clinton speakers')
fig.update_traces(marker=dict(color="Indianred"))
fig.update_yaxes(type = 'log')
fig.show()

![Clinton_Nat](Images/Clinton_Nat.png)

As we can see, the majority of the speakers are from the United States, which is exactly what we would expect. However, several other nationalities are also quite frequent among the speakers. What about Trump?

In [67]:
Trump_nationalities = Trump_with_att.groupby(by='id').first().explode('nationality').reset_index(drop=True)['nationality'].dropna()

In [68]:
nationalities_Trump = Trump_nationalities.groupby(Trump_nationalities).count().sort_values(ascending=False)

major_nationalities_Trump = nationalities_Trump.iloc[:10].copy()  # copy before adding the "Others" entry
major_nationalities_Trump["Others"] = nationalities_Trump.iloc[10:].sum()

print(major_nationalities_Trump.to_string())

nationality
United States of America    5083
United Kingdom               667
Canada                       258
Australia                    171
Israel                       138
India                        120
Germany                      104
France                        84
Russia                        72
Ireland                       68
Others                      1231


In [None]:
x = major_nationalities_Trump.index.tolist()
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Trump',
    x=x, y=[major_nationalities_Trump[el] for el in x]
))
fig.update_layout(barmode='group', title = 'Nationalities of Trump speakers')
fig.update_traces(marker=dict(color="royalblue"))
fig.update_yaxes(type = 'log')
fig.show()

![Trump_Nat](Images/Trump_Nat.png)

Although most of the major nationalities of the speakers are the same, there are some differences between the two distributions. We will visualize the two distributions together in a single plot.

In [70]:
Authors_distribution_nationalities = speaker_attributes_updated.nationality.dropna().transform(lambda x: x[0])
major_nationalities_speaker_distribution = Authors_distribution_nationalities
major_nationalities_speaker_distribution = major_nationalities_speaker_distribution.value_counts().sort_values(ascending = False)[0:8]
major_nationalities_speaker_distribution

United States of America    430539
France                      261891
Germany                     249958
Japan                       177541
United Kingdom              162522
Spain                       125939
Ming dynasty                125906
Italy                        95149
Name: nationality, dtype: int64

In [None]:
x0 = Trump_nationalities[Trump_nationalities.isin(major_nationalities_Trump.index.tolist())]
x1 = Clinton_nationalities[Clinton_nationalities.isin(major_nationalities_Clinton.index.tolist())]
x2 = Authors_distribution_nationalities[Authors_distribution_nationalities.isin(major_nationalities_speaker_distribution.index.tolist())]

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=x0,
    histnorm='percent',
    name='Trump', # name used in legend and hover labels,
    marker_color='royalblue'
))
fig.add_trace(go.Histogram(
    x=x1,
    histnorm='percent',
    name='Clinton',
    marker_color='indianred'
))

fig.add_trace(go.Histogram(
    x=x2,
    histnorm='percent',
    name='Authors distribution',
    marker_color='lightslategrey'
))

fig.update_layout(
    title_text='Major nationalities comparison', # title of plot
    xaxis_title_text='Value', # xaxis label
    yaxis_title_text='percentage', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.update_yaxes(type = 'log')

fig.show()

![Clinton_Trump_Nat](Images/Clinton_Trump_Nat.png)

In [None]:
Pie_major_nationalities_Clinton = major_nationalities_Clinton.copy().add(major_nationalities_Trump*0, fill_value=0)
Pie_major_nationalities_Clinton = Pie_major_nationalities_Clinton.copy().add(major_nationalities_speaker_distribution*0, fill_value=0)
Pie_major_nationalities_Trump = major_nationalities_Trump.copy().add(Pie_major_nationalities_Clinton*0, fill_value=0)
Pie_major_nationalities_General = major_nationalities_speaker_distribution.copy().add(Pie_major_nationalities_Clinton*0, fill_value=0)

Pie_General = Pie_major_nationalities_General.index.tolist()

fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'} ]],subplot_titles=['Clinton', 'Trump', 'Speaker Distribution'])

fig.add_trace(go.Pie(
    name='Clinton',
    labels=Pie_General, values=[Pie_major_nationalities_Clinton[el] for el in Pie_General],textinfo='label+percent', insidetextorientation='radial'
),1,1)

fig.add_trace(go.Pie(
    name='Trump',
    labels=Pie_General, values=[Pie_major_nationalities_Trump[el] for el in Pie_General],textinfo='label+percent', insidetextorientation='radial'
),1,2)

fig.add_trace(go.Pie(
    name='General',
    labels=Pie_General, values=[Pie_major_nationalities_General[el] for el in Pie_General],textinfo='label+percent', insidetextorientation='radial'
),1,3)

fig.update_traces(textposition='inside')
fig.update_layout(title_text='Top Nationality Speakers')
fig.show()
pio.write_html(fig, file='Top_Nationality_Speakers_Pie_Chart'+'.html', auto_open=True)

![Nat_Pie_Chart](Images/Nat_Pie_Chart.png)

There are differences between the distributions of the major nationalities of the people whose quotes refer to Trump and to Clinton. Indeed, we can see that there are more people with American nationality speaking about Clinton than about Trump, while people from the European Union (at least from the representative countries shown in the plot) are more likely to speak about Trump than about Clinton. People from Canada and Israel also speak more about Trump than about Clinton, while the printed counts for Australia are slightly higher for Clinton. This suggests that Trump is better known outside the United States of America, while Clinton seems to be more popular in her country of origin. Additionally, when we consider the original author distribution, it is clear that some countries such as Spain, Italy and France are not represented as much.
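
Whether the two nationality distributions differ by more than chance can be checked with a Pearson chi-square test on the counts printed above. The sketch below is dependency-free and makes two assumptions: nationalities that do not appear in both top-10 lists are merged into "Other", and each count is treated as an independent observation (which is only approximately true, since a speaker with several nationalities is counted once per nationality).

```python
# Counts copied from the two printed tables; categories missing from either
# top-10 list are merged into "Other".
clinton = {"United States of America": 5256, "United Kingdom": 502, "Canada": 214,
           "Australia": 181, "Germany": 92, "Israel": 64, "India": 63,
           "Russia": 61, "Other": 59 + 55 + 890}
trump = {"United States of America": 5083, "United Kingdom": 667, "Canada": 258,
         "Australia": 171, "Germany": 104, "Israel": 138, "India": 120,
         "Russia": 72, "Other": 84 + 68 + 1231}


def chi_square_statistic(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    categories = list(row_a)
    total_a, total_b = sum(row_a.values()), sum(row_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for category in categories:
        column_total = row_a[category] + row_b[category]
        for observed, row_total in ((row_a[category], total_a),
                                    (row_b[category], total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(categories) - 1


statistic, dof = chi_square_statistic(clinton, trump)
# Chi-square critical value for 8 degrees of freedom at the 5% level
CRITICAL_VALUE_DF8_5PCT = 15.507
print(f"chi2 = {statistic:.1f} with {dof} degrees of freedom")
print("differs at the 5% level:", statistic > CRITICAL_VALUE_DF8_5PCT)
```

A statistic above the critical value indicates that the two groups do not share the same nationality distribution, although it does not say which countries drive the difference.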

### Machine learning model to interpret the data

Since we have observed meaningful differences in the distributions of some of the speaker attributes, it would be interesting to build a supervised machine learning model to help us interpret these differences. For example, we could concatenate the `Clinton_with_att` and `Trump_with_att` dataframes into a single dataframe, labelling the rows that came from `Clinton_with_att` with zero and the rows that came from `Trump_with_att` with one. Our task would then be to train an easily interpretable model (such as logistic regression) to classify each quote (each row) as zero or one based only on some of the speaker attribute values. We do not expect the model to fit the data very well, since the speaker attributes are quite general, but regardless of the quality of the fit we could use it to understand whether a speaker with a given attribute is more likely to speak about Clinton or about Trump.
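
To illustrate how such a model would be read, here is a minimal sketch on synthetic data. It fits a logistic regression with plain NumPy gradient descent, used here only as a self-contained stand-in for a library implementation. The function `fit_logistic_regression`, the attribute names and the data are all hypothetical and are not the notebook's actual pipeline.

```python
import numpy as np


def fit_logistic_regression(X, y, learning_rate=0.5, n_iterations=5000):
    """Fit weights and intercept by batch gradient descent on the mean log-loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.zeros(X.shape[1])
    intercept = 0.0
    for _ in range(n_iterations):
        probabilities = 1.0 / (1.0 + np.exp(-(X @ weights + intercept)))
        error = probabilities - y
        weights -= learning_rate * (X.T @ error) / len(y)
        intercept -= learning_rate * error.mean()
    return weights, intercept


# Synthetic data, for illustration only. Each row is a speaker with two binary
# (one-hot style) attributes; label 1 stands for one politician, 0 for the other.
# attribute_a is associated with label 1, attribute_b carries no information.
rows = []
for attribute_a, n_label_1, n_label_0 in ((1, 80, 20), (0, 20, 80)):
    for label, count in ((1, n_label_1), (0, n_label_0)):
        for i in range(count):
            rows.append((attribute_a, i % 2, label))
data = np.array(rows)
X, y = data[:, :2], data[:, 2]

weights, intercept = fit_logistic_regression(X, y)
odds_ratios = np.exp(weights)
print("coefficients:", weights.round(2))
print("odds ratios:", odds_ratios.round(2))
```

A positive coefficient (odds ratio above one) means that speakers with that attribute are more likely to fall in the class labelled one, while a coefficient near zero means the attribute does not help to separate the two groups. On the real data a library implementation, such as scikit-learn's `LogisticRegression`, would be the more usual choice.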

First of all, we preprocess the data.

In [73]:
Clinton_with_att.columns

Index(['quoteID', 'quotation', 'speaker', 'qids', 'numOccurrences', 'probas',
       'urls', 'phase', 'date', 'authorId', 'aliases', 'date_of_birth',
       'nationality', 'gender', 'lastrevid', 'ethnic_group',
       'US_congress_bio_ID', 'occupation', 'party', 'academic_degree', 'id',
       'label', 'candidacy', 'type', 'religion'],
      dtype='object')

In [74]:
# we concatenate the data

Clinton_with_att['target_label'] = 0
Trump_with_att['target_label'] = 1
ml_df = pd.concat([Clinton_with_att, Trump_with_att])
ml_df

Unnamed: 0,quoteID,quotation,speaker,qids,numOccurrences,probas,urls,phase,date,authorId,...,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion,target_label
0,2016-09-28-013218,"Better deal how? What exactly is your idea, Do...",Elizabeth Warren,Q434706,2,"[[Elizabeth Warren, 0.5615], [None, 0.3314], [...",[http://www.politico.com/story/2016/09/warren-...,E,[2016-09-29 18:37:33],4544778,...,W000817,"[jurist, politician, teacher, university teach...",[Democratic Party],"[bachelor's degree, Juris Doctor]",Q434706,Elizabeth Warren,[2018 United States Senate election in Massach...,item,[Methodism],0
1,2016-05-25-007919,"Anytime someone calls out @realDonaldTrump, he...",Elizabeth Warren,Q434706,8,"[[Elizabeth Warren, 0.8315], [None, 0.1426], [...",[http://4029tv.com/politics/donald-trump-has-a...,E,"[2016-05-25 05:47:42, 2016-05-25 19:33:06]",4544778,...,W000817,"[jurist, politician, teacher, university teach...",[Democratic Party],"[bachelor's degree, Juris Doctor]",Q434706,Elizabeth Warren,[2018 United States Senate election in Massach...,item,[Methodism],0
2,2016-05-14-023842,I think that Donald Trump is a truly dangerous...,Elizabeth Warren,Q434706,7,"[[Elizabeth Warren, 0.5091], [None, 0.4092], [...",[http://www.bostonglobe.com/metro/2016/05/14/g...,E,"[2016-05-16 13:35:58, 2016-05-15 03:41:45, 201...",4544778,...,W000817,"[jurist, politician, teacher, university teach...",[Democratic Party],"[bachelor's degree, Juris Doctor]",Q434706,Elizabeth Warren,[2018 United States Senate election in Massach...,item,[Methodism],0
3,2016-09-21-023843,even a one percent chance of Donald Trump winn...,Elizabeth Warren,Q434706,2,"[[Elizabeth Warren, 0.632], [None, 0.2327], [D...",[http://nbcnews.com/politics/2016-election/lid...,E,[2016-09-21 21:54:45],4544778,...,W000817,"[jurist, politician, teacher, university teach...",[Democratic Party],"[bachelor's degree, Juris Doctor]",Q434706,Elizabeth Warren,[2018 United States Senate election in Massach...,item,[Methodism],0
4,2016-05-07-041051,Republicans waited way too long to stand up an...,Elizabeth Warren,Q434706,1,"[[Elizabeth Warren, 0.7691], [None, 0.1233], [...",[http://www.bostonglobe.com/news/nation/2016/0...,E,[2016-05-07 16:03:00],4544778,...,W000817,"[jurist, politician, teacher, university teach...",[Democratic Party],"[bachelor's degree, Juris Doctor]",Q434706,Elizabeth Warren,[2018 United States Senate election in Massach...,item,[Methodism],0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56148,2020-04-16-015658,"However, (President Donald) Trump has aggressi...",Peter Hotez,Q7174738,1,"[[Peter Hotez, 0.8022], [None, 0.1978]]",[https://www.sunnewsonline.com/covid-19-world-...,E,[2020-04-16 00:54:10],6985288,...,,"[researcher, pediatrician]",,,Q7174738,Peter J. Hotez,,item,,1
56149,2020-03-18-028719,I was taken aback when Trump said: `We are doi...,Muqtada Al-Sadr,Q216826,1,"[[Muqtada Al-Sadr, 0.4177], [President Donald ...",[https://mickhartley.typepad.com/blog/2020/03/...,E,[2020-03-18 18:23:09],6801425,...,,[politician],[Sadrist Movement],,Q216826,Muqtada al-Sadr,,item,[Shia Islam],1
56150,2020-01-14-018173,for assurances from Trump that there will be n...,Jamie Shea,Q930267,1,"[[Jamie Shea, 0.7682], [None, 0.215], [Preside...",[http://www.aljazeera.com/indepth/features/nat...,E,[2020-01-14 17:09:00],31534,...,,"[diplomat, politician]",,,Q930267,Jamie Shea,,item,,1
56151,2020-02-24-073035,We welcome Trump in our country. I don't think...,Ashwani Mahajan,Q38459319,1,"[[Ashwani Mahajan, 0.5342], [None, 0.3725], [P...",[http://aninews.in/news/national/general-news/...,E,[2020-02-24 11:49:19],3844195,...,,[economist],,,Q38459319,Ashwani Mahajan,,item,,1


We keep only the columns that we want to use to interpret the speaker attributes.

In [75]:
ml_df = ml_df[['date_of_birth', 'nationality', 'gender', 'ethnic_group', 'occupation', 'religion', 'party', 'target_label', 'academic_degree', 'qids']]

In [76]:
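ml_df = ml_df[ml_df['date_of_birth'].notna()].copy()  # work on a copy so the assignment below does not trigger SettingWithCopyWarning
ml_df['date_of_birth'] = ml_df['date_of_birth'].transform(lambda x: 2021 - int(x[0][1:5]))
ml_df.rename(columns = {'date_of_birth': 'age'}, inplace = True)
# one row per speaker; a speaker present in both groups keeps the first label encountered
ml_df = ml_df.groupby(by='qids').first()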
ml_df

| qids | age | nationality | gender | ethnic_group | occupation | religion | party | target_label | academic_degree |
|---|---|---|---|---|---|---|---|---|---|
| Q1000592 | 33 | [United Kingdom, Ireland] | [male] | | [boxer] | [Eastern Orthodoxy] | | 1 | |
| Q10068 | 37 | [United States of America] | [female] | | [alpine skier, television presenter] | | | 1 | |
| Q100749 | 53 | [Germany] | [male] | | [economist, university teacher] | | | 1 | [doctorate, habilitation] |
| Q1017117 | 52 | [United States of America] | [male] | | [singer, musician, composer, guitarist] | | | 0 | |
| Q102124 | 72 | [United States of America] | [female] | | [film actor, voice actor, television actor, st... | | | 0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Q993472 | 52 | [France] | [male] | | [politician, diplomat] | | [Union for a Popular Movement, The Republicans... | 1 | [aggregation of modern literature] |
| Q99484 | 60 | [Germany] | [male] | | [journalist, writer] | | | 0 | [Diplom, Master of Arts] |
| Q99695 | 78 | [Germany] | [male] | | [journalist, historian, non-fiction writer] | | | 0 | |
| Q99710 | 95 | [Germany] | [male] | | [stage actor, film actor, television actor] | | | 1 | |


Most of the attributes included are categorical, which means that we should one-hot encode the features. Let us see how many different values each column contains.

In [77]:
for column in ml_df.columns:
    print('The number of different values for the column', column, 'is ', ml_df[column].explode().unique().shape[0])

The number of different values for the column age is  129
The number of different values for the column nationality is  189
The number of different values for the column gender is  11
The number of different values for the column ethnic_group is  189
The number of different values for the column occupation is  1086
The number of different values for the column religion is  130
The number of different values for the column party is  607
The number of different values for the column target_label is  2
The number of different values for the column academic_degree is  71
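
With this many distinct values, a naive one-hot encoding would produce a very wide table. One way to keep the encoding manageable, mirroring the "top values plus Others" grouping used for the plots earlier, is sketched below. It assumes each cell is a plain Python list (or `None` when missing); the function name `multi_hot_encode` and the toy data are hypothetical.

```python
from collections import Counter


def multi_hot_encode(values_per_row, top_k=3, other_label="Other"):
    """Encode list-valued categorical data as 0/1 columns.

    Only the `top_k` most frequent values get their own column; all remaining
    values share a single `other_label` column. Missing rows become all zeros.
    """
    counts = Counter(v for values in values_per_row if values for v in values)
    kept = [value for value, _ in counts.most_common(top_k)]
    columns = kept + [other_label]
    encoded = []
    for values in values_per_row:
        present = set(values or [])
        row = [int(value in present) for value in kept]
        # Flag the row if it contains at least one value outside the kept set
        row.append(int(any(value not in kept for value in present)))
        encoded.append(row)
    return columns, encoded


# Toy occupation lists, for illustration only
occupations = [["politician", "lawyer"], ["politician"], ["journalist"],
               None, ["actor", "politician"], ["journalist", "writer"]]
columns, encoded = multi_hot_encode(occupations, top_k=2)
print(columns)
for row in encoded:
    print(row)
```

The number of columns is then bounded by `top_k + 1` per attribute, at the cost of losing the distinction between the rarer values.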


As we can see, many columns have a huge number of distinct values. If we one-hot encoded all the features, the resulting dataframe would therefore have a very high feature dimensionality relative to the number of rows. In addition, each column could contain many missing values, so let us quantify how many missing values each column has.

In [78]:
for column in ml_df.columns:
    print('The number of missing values', column, 'is ', ml_df[column].isna().sum())

print('Over a total number of rows equal to ', ml_df.shape[0])

The number of missing values age is  0
The number of missing values nationality is  1436
The number of missing values gender is  10
The number of missing values ethnic_group is  11558
The number of missing values occupation is  372
The number of missing values religion is  10940
The number of missing values party is  8266
The number of missing values target_label is  0
The number of missing values academic_degree is  12195
Over a total number of rows equal to  12519
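
To compare the columns more easily, the counts above can be expressed as shares of the total number of rows. The sketch below only re-uses the numbers printed above; nothing is recomputed from the data.

```python
# Missing-value counts and row total copied from the output above.
n_rows = 12519
missing = {
    "age": 0, "nationality": 1436, "gender": 10, "ethnic_group": 11558,
    "occupation": 372, "religion": 10940, "party": 8266, "academic_degree": 12195,
}

# Share of rows with a missing value, per column.
missing_share = {col: count / n_rows for col, count in missing.items()}

# Print the columns from most to least affected.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:16s} {share:6.1%}")
```

Seen this way, `academic_degree`, `ethnic_group` and `religion` are each missing for well over 80% of the rows, while `gender` and `occupation` are almost complete.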


We need to decide how to treat missing values. We will handle each column separately, since some have few missing values and others have a very large number.

* For the column age there are no missing values and the column is numerical, so we don't need to do anything.

* For the column nationality we have a substantial number of missing values (1436 of 12519 rows) and many distinct values. We will reduce the number of distinct values by grouping nationalities by continent.

#### Nationality

In [79]:
ml_df['nationality'].explode().unique()

array(['United Kingdom', 'Ireland', 'United States of America', 'Germany',
       'India', 'Spain', 'France', 'Brazil', 'Iran', 'Australia',
       'Canada', 'Malaysia', 'Kingdom of the Netherlands', 'Nigeria',
       'Italy', 'Colombia', "People's Republic of China", 'Austria',
       'Denmark', 'Sweden', 'Russia', 'Israel', 'State of Palestine',
       'East Germany', 'Zimbabwe', 'Vietnam',
       'Ukrainian Soviet Socialist Republic', 'Ukraine', 'Indonesia',
       'Egypt', 'Taiwan', 'Cuba', 'Greece', 'Japan', 'Panama', 'Mexico',
       None, 'Switzerland', 'Romania', 'Morocco', 'Poland',
       'South Africa', 'Dominican Republic', 'Finland', 'Norway',
       'Czech Republic', 'Turkey', 'Venezuela', 'Soviet Union', 'Rwanda',
       'Estonia', 'Jamaica', 'South Korea', 'Ghana', 'Saudi Arabia',
       'Iraq', 'Kingdom of Italy', 'Azerbaijan', 'Puerto Rico', 'Haiti',
       'Hungary', 'Belgium', 'England', 'statelessness', 'New Zealand',
       'Pakistan', 'Portugal', 'Latin Americans

In [80]:
import pycountry_convert as pc
def map_to_cont(list_country_names):
    continents = {
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Australia',
    'AF': 'Africa',
    'EU': 'Europe'
    }
    resulting_list = []
    for country_name in list_country_names:
        try:
            country_code = pc.country_name_to_country_alpha2(country_name, cn_name_format="default")
            continent_name = pc.country_alpha2_to_continent_code(country_code)
            resulting_list.append(continents[continent_name])
        except KeyError:
            resulting_list.append(country_name)

    return resulting_list
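
The fallback behaviour of `map_to_cont` can be illustrated without the external package. The sketch below replaces the `pycountry_convert` lookups with a tiny hand-written table (`TOY_CONTINENTS`, a hypothetical stand-in that is not part of the notebook) and keeps the same idea: names that cannot be resolved are passed through unchanged so that they can be labelled manually later.

```python
# Hypothetical stand-in for the pycountry_convert lookups; for illustration only.
TOY_CONTINENTS = {
    "Germany": "Europe",
    "France": "Europe",
    "United States of America": "North America",
    "India": "Asia",
}

def map_to_cont_sketch(country_names):
    """Map each name to a continent, keeping unresolved names as they are."""
    result = []
    for name in country_names:
        try:
            result.append(TOY_CONTINENTS[name])
        except KeyError:
            # Historical or non-standard names fall through untouched.
            result.append(name)
    return result

print(map_to_cont_sketch(["Germany", "East Germany", "India"]))
```

Here `East Germany` survives as a raw name, which mirrors the leftover values that still have to be handled after the automatic mapping.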

In [81]:
ml_df['nationality'] = ml_df['nationality'].apply(lambda x: map_to_cont(x) if type(x)==np.ndarray else x)
ml_df

qids,age,nationality,gender,ethnic_group,occupation,religion,party,target_label,academic_degree
Q1000592,33,"[Europe, Europe]",[male],,[boxer],[Eastern Orthodoxy],,1,
Q10068,37,[North America],[female],,"[alpine skier, television presenter]",,,1,
Q100749,53,[Europe],[male],,"[economist, university teacher]",,,1,"[doctorate, habilitation]"
Q1017117,52,[North America],[male],,"[singer, musician, composer, guitarist]",,,0,
Q102124,72,[North America],[female],,"[film actor, voice actor, television actor, st...",,,0,
...,...,...,...,...,...,...,...,...,...
Q993472,52,[Europe],[male],,"[politician, diplomat]",,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature]
Q99484,60,[Europe],[male],,"[journalist, writer]",,,0,"[Diplom, Master of Arts]"
Q99695,78,[Europe],[male],,"[journalist, historian, non-fiction writer]",,,0,
Q99710,95,[Europe],[male],,"[stage actor, film actor, television actor]",,,1,


We keep only the unique values in each nationality list.

In [82]:
ml_df['nationality'] = ml_df['nationality'].apply(lambda x: np.unique(x) if type(x)==list else x)
ml_df

qids,age,nationality,gender,ethnic_group,occupation,religion,party,target_label,academic_degree
Q1000592,33,[Europe],[male],,[boxer],[Eastern Orthodoxy],,1,
Q10068,37,[North America],[female],,"[alpine skier, television presenter]",,,1,
Q100749,53,[Europe],[male],,"[economist, university teacher]",,,1,"[doctorate, habilitation]"
Q1017117,52,[North America],[male],,"[singer, musician, composer, guitarist]",,,0,
Q102124,72,[North America],[female],,"[film actor, voice actor, television actor, st...",,,0,
...,...,...,...,...,...,...,...,...,...
Q993472,52,[Europe],[male],,"[politician, diplomat]",,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature]
Q99484,60,[Europe],[male],,"[journalist, writer]",,,0,"[Diplom, Master of Arts]"
Q99695,78,[Europe],[male],,"[journalist, historian, non-fiction writer]",,,0,
Q99710,95,[Europe],[male],,"[stage actor, film actor, television actor]",,,1,


Some nationality values are still not mapped to a continent. Let us see which ones.

In [83]:
country_list = list(ml_df.nationality.explode().unique())
country_list

['Europe',
 'North America',
 'Asia',
 'South America',
 'Australia',
 'Africa',
 'State of Palestine',
 'East Germany',
 'Ukrainian Soviet Socialist Republic',
 None,
 'Soviet Union',
 'Kingdom of Italy',
 'England',
 'statelessness',
 'Latin Americans',
 'Jozef Israëls',
 'United Kingdom of Great Britain and Ireland',
 'Irish Free State',
 'West Germany',
 'Canadians',
 'Wales',
 'Cherokee Nation',
 'Scotland',
 'The Bahamas',
 'British India',
 'The Gambia',
 'Czechoslovakia',
 "People's Republic of Bulgaria",
 'Socialist Federal Republic of Yugoslavia',
 'British Hong Kong',
 'Federal Republic of Yugoslavia',
 'Serbia and Montenegro',
 'Imperial State of Iran',
 'Czechoslovak Republic',
 'Czechoslovak Socialist Republic',
 'Australians',
 'Dominion of India',
 'Republic of China (1912–1949)',
 'Second Syrian Republic',
 "Ba'athist Iraq",
 'Iraqi Republic (1958–68)',
 'Kingdom of Iraq',
 'Flor',
 'Democratic Republic of Afghanistan',
 'Islamic State of Afghanistan',
 'Kingdom of Afg

We will label them manually.

In [84]:
# Keywords are matched as substrings of each remaining nationality value.
asian_keywords = ["Asia", "Palestine", "India", "Hong Kong", "Iran", "China",
                  "Afghanistan", "Syria", "Iraq", "Arab", "Palestinian", "Syrian"]
european_keywords = ["Europe", "Wales", "Kosovo", "Scotland", "England", "Cyprus",
                     "Russia", "Vatican", "Serbia", "Bulgaria", "Germany", "Soviet",
                     "Czechoslovakia", "Czechoslovak", "Weimar Republic", "Ireland",
                     "Yugoslavia", "Irish", "Jozef Israëls"]

asian_countries = [x for x in country_list if any(k in str(x) for k in asian_keywords)]
european_countries = [x for x in country_list if any(k in str(x) for k in european_keywords)]

# The remaining values are listed explicitly.
african_countries = ['Kingdom of Egypt', 'Republic of Egypt (1953–1958)', 'The Gambia', 'Africa']
south_american_countries = ['Latin Americans', 'Mexicali', 'South America']
north_american_countries = ['Cherokee Nation', 'The Bahamas', 'Flor', 'Canadians', 'Americans', 'North America']
australian_countries = ['Australia', 'Australians']
none_countries = [None, 'statelessness']
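
The keyword filters above rely on substring matching, which is compact but can match more than intended. The sketch below shows the mechanism on a few names taken from the list printed earlier; `matches_any` is a hypothetical helper, not part of the notebook.

```python
def matches_any(name, keywords):
    """Return True if any keyword occurs as a substring of the name."""
    return any(keyword in str(name) for keyword in keywords)

names = ["British India", "West Germany", "Irish Free State", "The Gambia", None]
asian_keywords = ["India", "Iran", "China"]
european_keywords = ["Germany", "Irish"]

asian = [n for n in names if matches_any(n, asian_keywords)]
european = [n for n in names if matches_any(n, european_keywords)]

print(asian)
print(european)
```

Because `str(None)` is `'None'`, missing entries never raise an error here. A short keyword would also match any longer name that happens to contain it, so the resulting lists are worth inspecting by eye.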

Construct a mapping structure

In [85]:
asian_dic = dict(zip(asian_countries, ['Asia']*len(asian_countries)))
euro_dic = dict(zip(european_countries, ['Europe']*len(european_countries)))
africa_dic = dict(zip(african_countries, ['Africa']*len(african_countries)))
south_am_dic = dict(zip(south_american_countries, ['South_America']*len(south_american_countries)))
north_am_dic = dict(zip(north_american_countries, ['North_America']*len(north_american_countries)))
aust_dic = dict(zip(australian_countries, ['Australia']*len(australian_countries)))
none_dic = dict(zip(none_countries, [None]*len(none_countries)))

dic_nat_cont = {**asian_dic ,**euro_dic ,**africa_dic ,**south_am_dic ,**north_am_dic ,**aust_dic ,**none_dic}

Create a mapping function

In [86]:
def map_to_cont2(list_country_names):
    cont_list = []
    for country in list_country_names:
        cont_list.append(dic_nat_cont.get(country))
    return cont_list
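
Because `map_to_cont2` uses `dict.get`, any value that is absent from `dic_nat_cont` is silently mapped to `None` and is later counted as missing. A minimal sketch with a toy dictionary (`toy_map` and the input values are assumptions made for illustration) shows this behaviour and one way to list the values that were lost:

```python
toy_map = {"Europe": "Europe", "East Germany": "Europe", "statelessness": None}

def map_with_get(values, mapping):
    """Look up each value; unmapped values become None."""
    return [mapping.get(value) for value in values]

inputs = ["East Germany", "Atlantis", "statelessness"]
mapped = map_with_get(inputs, toy_map)
print(mapped)

# Listing the inputs that have no key at all makes silent losses visible.
unmapped = [value for value in inputs if value not in toy_map]
print(unmapped)
```

In the output, an intentionally empty mapping (`statelessness`) and an overlooked value (`Atlantis`) look the same, which is why the explicit `unmapped` check can be useful.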

Run the mapping for the dataset

In [87]:
ml_df['nationality'] = ml_df['nationality'].apply(lambda x: map_to_cont2(x) if type(x)==np.ndarray else x)
ml_df['nationality'] = ml_df['nationality'].apply(lambda x: x[0] if type(x)==list else x)
ml_df

qids,age,nationality,gender,ethnic_group,occupation,religion,party,target_label,academic_degree
Q1000592,33,Europe,[male],,[boxer],[Eastern Orthodoxy],,1,
Q10068,37,North_America,[female],,"[alpine skier, television presenter]",,,1,
Q100749,53,Europe,[male],,"[economist, university teacher]",,,1,"[doctorate, habilitation]"
Q1017117,52,North_America,[male],,"[singer, musician, composer, guitarist]",,,0,
Q102124,72,North_America,[female],,"[film actor, voice actor, television actor, st...",,,0,
...,...,...,...,...,...,...,...,...,...
Q993472,52,Europe,[male],,"[politician, diplomat]",,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature]
Q99484,60,Europe,[male],,"[journalist, writer]",,,0,"[Diplom, Master of Arts]"
Q99695,78,Europe,[male],,"[journalist, historian, non-fiction writer]",,,0,
Q99710,95,Europe,[male],,"[stage actor, film actor, television actor]",,,1,


Now we one-hot encode the column and add a dedicated indicator column for missing values. We chose this over imputing with the median or the mode because the number of missing values is large, so replacing them with the mode would bias the model. We also considered KNN or another machine learning technique to impute the missing values, but the other features do not seem strongly related to nationality.

In [88]:
ml_df = pd.get_dummies(ml_df, columns = ['nationality'], dummy_na = True)
ml_df

qids,age,gender,ethnic_group,occupation,religion,party,target_label,academic_degree,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,nationality_nan
Q1000592,33,[male],,[boxer],[Eastern Orthodoxy],,1,,0,0,0,1,0,0,0
Q10068,37,[female],,"[alpine skier, television presenter]",,,1,,0,0,0,0,1,0,0
Q100749,53,[male],,"[economist, university teacher]",,,1,"[doctorate, habilitation]",0,0,0,1,0,0,0
Q1017117,52,[male],,"[singer, musician, composer, guitarist]",,,0,,0,0,0,0,1,0,0
Q102124,72,[female],,"[film actor, voice actor, television actor, st...",,,0,,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,[male],,"[politician, diplomat]",,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature],0,0,0,1,0,0,0
Q99484,60,[male],,"[journalist, writer]",,,0,"[Diplom, Master of Arts]",0,0,0,1,0,0,0
Q99695,78,[male],,"[journalist, historian, non-fiction writer]",,,0,,0,0,0,1,0,0,0
Q99710,95,[male],,"[stage actor, film actor, television actor]",,,1,,0,0,0,1,0,0,0
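
The effect of `dummy_na = True` is easiest to see on a toy column. The sketch below uses made-up values and assumes only `pandas` and `numpy`; it shows that the missing entry gets its own indicator column, so every row still has exactly one active indicator.

```python
import numpy as np
import pandas as pd

# Toy data: three known continents and one missing value.
toy = pd.DataFrame({"nationality": ["Europe", "Asia", np.nan, "Europe"]})

# dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(toy, columns=["nationality"], dummy_na=True)
print(list(encoded.columns))

# Each row has exactly one active indicator, including the row with NaN.
print(encoded.astype(int).sum(axis=1).tolist())
```

Depending on the pandas version the indicators are stored as integers or booleans, which is why the sketch casts to `int` before summing.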


#### Gender

* For the column gender we have few missing values and few distinct values. We will replace the missing values with the mode and then one-hot encode the column.

In [89]:
ml_df.gender = ml_df.gender.fillna(value = ml_df.gender.mode()[0][0])
ml_df.gender = ml_df.gender.apply(lambda x: x[0])

In [90]:
ml_df.gender.unique()

array(['male', 'female', 'm', 'cisgender male', 'transgender female',
       'non-binary', 'genderfluid', 'cisgender female',
       'transgender male', 'genderqueer', 'shemale'], dtype=object)
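
The `'m'` entry in this output is most likely a side effect of the order of operations in the fill step: if missing entries are first filled with the string `'male'` and `x[0]` is then applied to every row, the filled rows keep only the first character of that string. The sketch below reproduces this on toy data and shows one possible reordering (unwrap first, then fill). It is an illustration under these assumptions, not the notebook's code.

```python
import pandas as pd

# Toy column: single-element lists plus one missing entry.
toy = pd.Series([["male"], ["female"], None, ["male"]], dtype=object)

# Fill with a plain string first, then index every entry:
# the filled row is a string, so x[0] returns its first character.
filled_first = toy.fillna("male").apply(lambda x: x[0])
print(filled_first.tolist())

# Alternative order: unwrap the lists first, then fill with the mode.
unwrapped = toy.apply(lambda x: x[0] if isinstance(x, list) else x)
fixed = unwrapped.fillna(unwrapped.mode()[0])
print(fixed.tolist())
```

With the second ordering the imputed rows carry the full label and would not be removed by a later filter on `male` and `female`.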

For the sake of simplicity we keep only the rows whose gender is male or female. This drops 49 rows (12519 before, 12470 after), including those labelled `'m'`.

In [91]:
ml_df = ml_df[ml_df.gender.isin(['male', 'female'])]
ml_df = pd.get_dummies(ml_df, columns = ['gender'])
ml_df

qids,age,ethnic_group,occupation,religion,party,target_label,academic_degree,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,nationality_nan,gender_female,gender_male
Q1000592,33,,[boxer],[Eastern Orthodoxy],,1,,0,0,0,1,0,0,0,0,1
Q10068,37,,"[alpine skier, television presenter]",,,1,,0,0,0,0,1,0,0,1,0
Q100749,53,,"[economist, university teacher]",,,1,"[doctorate, habilitation]",0,0,0,1,0,0,0,0,1
Q1017117,52,,"[singer, musician, composer, guitarist]",,,0,,0,0,0,0,1,0,0,0,1
Q102124,72,,"[film actor, voice actor, television actor, st...",,,0,,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,,"[politician, diplomat]",,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature],0,0,0,1,0,0,0,0,1
Q99484,60,,"[journalist, writer]",,,0,"[Diplom, Master of Arts]",0,0,0,1,0,0,0,0,1
Q99695,78,,"[journalist, historian, non-fiction writer]",,,0,,0,0,0,1,0,0,0,0,1
Q99710,95,,"[stage actor, film actor, television actor]",,,1,,0,0,0,1,0,0,0,0,1


#### Ethnic Group

* The column ethnic group has a very large number of missing values and many distinct values. Since the model already includes the speakers' nationality, which we consider more important to analyze, we simply drop this column.

In [92]:
ml_df.drop(columns = ['ethnic_group'], inplace = True)

#### Occupation

* A similar argument applies to the column 'occupation': it has too many distinct values to one-hot encode, so we simply drop it.

In [93]:
ml_df.drop(columns = ['occupation'], inplace = True)

#### Religion

* For religion we will encode the most common religions and add a column religion_nan that flags missing values.

In [94]:
# Here are the most popular religions
ml_df['religion'].explode().groupby(ml_df['religion'].explode()).count().sort_values(ascending=False)

religion
Catholic Church                 208
Islam                           168
Judaism                         167
atheism                         165
Catholicism                     133
                               ... 
Crusaders to Save the Nation      1
Hindu                             1
Church of God                     1
Shinto                            1
spiritualism                      1
Name: religion, Length: 128, dtype: int64

In [95]:
ml_df.religion = ml_df['religion'].apply(lambda x: x[0] if type(x)==np.ndarray else x) # keep the first religion and drop the list container
ml_df['religion'].str.lower() # preview only: the result is not assigned, and the matching below is case-insensitive

qids
Q1000592    eastern orthodoxy
Q10068                   None
Q100749                  None
Q1017117                 None
Q102124                  None
                  ...        
Q993472                  None
Q99484                   None
Q99695                   None
Q99710                   None
Q998489                  None
Name: religion, Length: 12470, dtype: object

In [96]:
ml_df.loc[ml_df.religion.str.contains('christ|cath|anglican|anabapt|calv|luth|latin|orth|evangel|protest|church|bapt|methodism|presbyt|mormon|unitarian universalism|Pentecostalism', 
        case = False, na = False),'religion'] = 'Cristhianity'
ml_df.loc[ml_df.religion.str.contains('islam|muslim', case = False, na = False),
          'religion'] = 'Islam'
ml_df.loc[ml_df.religion.str.contains('secular|nonreligious|agnost|atheis|irreligion|nontheism', case = False, na = False),
          'religion'] = 'Atheist'
ml_df.loc[ml_df.religion.str.contains('hindu', case = False, na = False),
          'religion'] = 'Hinduism'
ml_df.loc[ml_df.religion.str.contains('buddh', case = False, na = False),
          'religion'] = 'Buddhism'
ml_df.loc[ml_df.religion.str.contains('juda', case = False, na = False),
          'religion'] = 'Judaism'
ml_df.loc[~ml_df.religion.str.contains('Cristhianity|Islam|Hinduism|Buddhism|Judaism|Atheist', case = False, na = False) & (~ml_df.religion.isnull()),
          'religion'] = 'Other'
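
The grouping above behaves like an ordered list of rules in which the first matching pattern decides the group. The pure-Python sketch below illustrates this with a shortened set of patterns; `RULES` and `group_religion` are hypothetical names, the labels are copied from the notebook (including its spelling), and the last input is a made-up value chosen to show how a broad pattern can capture an unintended entry.

```python
import re

# Ordered (pattern, label) rules; a shortened version of the patterns above.
RULES = [
    (r"christ|cath|anglican|luth|orth|protest|church|bapt", "Cristhianity"),
    (r"islam|muslim", "Islam"),
    (r"secular|agnost|atheis|irreligion", "Atheist"),
    (r"hindu", "Hinduism"),
    (r"buddh", "Buddhism"),
    (r"juda", "Judaism"),
]

def group_religion(value):
    """Return the first matching group, 'Other' if none match, None if missing."""
    if value is None:
        return None
    for pattern, label in RULES:
        if re.search(pattern, value, flags=re.IGNORECASE):
            return label
    return "Other"

for raw in ["Catholic Church", "Eastern Orthodoxy", "atheism", "Shinto", None,
            "Orthodox Judaism"]:
    print(raw, "->", group_religion(raw))
```

The last line shows that a value containing `orth` is assigned to the first group before the `juda` rule is ever tried, so the order and breadth of the patterns matter.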

In [97]:
ml_df.religion.unique()

array(['Cristhianity', None, 'Hinduism', 'Islam', 'Atheist', 'Other',
       'Buddhism', 'Judaism'], dtype=object)

In [98]:
# Here are the most popular religions
ml_df['religion'].groupby(ml_df['religion']).count().sort_values(ascending=False)

religion
Cristhianity    900
Islam           213
Atheist         192
Judaism         162
Hinduism         54
Other            31
Buddhism         20
Name: religion, dtype: int64

In [99]:
ml_df

qids,age,religion,party,target_label,academic_degree,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,nationality_nan,gender_female,gender_male
Q1000592,33,Cristhianity,,1,,0,0,0,1,0,0,0,0,1
Q10068,37,,,1,,0,0,0,0,1,0,0,1,0
Q100749,53,,,1,"[doctorate, habilitation]",0,0,0,1,0,0,0,0,1
Q1017117,52,,,0,,0,0,0,0,1,0,0,0,1
Q102124,72,,,0,,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature],0,0,0,1,0,0,0,0,1
Q99484,60,,,0,"[Diplom, Master of Arts]",0,0,0,1,0,0,0,0,1
Q99695,78,,,0,,0,0,0,1,0,0,0,0,1
Q99710,95,,,1,,0,0,0,1,0,0,0,0,1


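We keep only the rows whose religion falls into one of the grouped categories above or is missing. After the grouping every row satisfies this condition, so the filter acts as a sanity check and the dataframe keeps its 12470 rows.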

In [100]:
ml_df = ml_df[ml_df['religion'].isin(np.append(ml_df['religion'].groupby(ml_df['religion']).count().sort_values(ascending=False).index, None))]
ml_df

qids,age,religion,party,target_label,academic_degree,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,nationality_nan,gender_female,gender_male
Q1000592,33,Cristhianity,,1,,0,0,0,1,0,0,0,0,1
Q10068,37,,,1,,0,0,0,0,1,0,0,1,0
Q100749,53,,,1,"[doctorate, habilitation]",0,0,0,1,0,0,0,0,1
Q1017117,52,,,0,,0,0,0,0,1,0,0,0,1
Q102124,72,,,0,,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature],0,0,0,1,0,0,0,0,1
Q99484,60,,,0,"[Diplom, Master of Arts]",0,0,0,1,0,0,0,0,1
Q99695,78,,,0,,0,0,0,1,0,0,0,0,1
Q99710,95,,,1,,0,0,0,1,0,0,0,0,1


We can now encode the main religions.

In [101]:
ml_df = pd.get_dummies(ml_df, columns = ['religion'], dummy_na = True)
ml_df

qids,age,party,target_label,academic_degree,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,...,gender_female,gender_male,religion_Atheist,religion_Buddhism,religion_Cristhianity,religion_Hinduism,religion_Islam,religion_Judaism,religion_Other,religion_nan
Q1000592,33,,1,,0,0,0,1,0,0,...,0,1,0,0,1,0,0,0,0,0
Q10068,37,,1,,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1
Q100749,53,,1,"[doctorate, habilitation]",0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
Q1017117,52,,0,,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,1
Q102124,72,,0,,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,"[Union for a Popular Movement, The Republicans...",1,[aggregation of modern literature],0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
Q99484,60,,0,"[Diplom, Master of Arts]",0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
Q99695,78,,0,,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
Q99710,95,,1,,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1


#### Political Party

In [102]:
ml_df['party'].explode().groupby(ml_df['party'].explode()).count().sort_values(ascending=False)[0:10]

party
Republican Party                           1633
Democratic Party                           1457
Labour Party                                 77
Conservative Party                           66
independent politician                       57
Likud                                        31
Minnesota Democratic–Farmer–Labor Party      30
Bharatiya Janata Party                       28
Liberal Party of Australia                   27
Australian Labor Party                       25
Name: party, dtype: int64

We will encode this column as Democratic, Republican, or Other. Speakers with no recorded party also fall into Other.

In [103]:
ml_df.party = ml_df['party'].apply(lambda x : x[0] if type(x)== np.ndarray else x)
ml_df.loc[ml_df.party=='Democratic Party','party'] = 'Democratic'
ml_df.loc[ml_df.party=='Republican Party','party'] = 'Republican'
ml_df.loc[~ml_df.party.isin(['Democratic', 'Republican']), 'party'] = 'Other'
ml_df = pd.get_dummies(ml_df, columns = ['party'])

In [104]:
ml_df

qids,age,target_label,academic_degree,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,nationality_nan,...,religion_Buddhism,religion_Cristhianity,religion_Hinduism,religion_Islam,religion_Judaism,religion_Other,religion_nan,party_Democratic,party_Other,party_Republican
Q1000592,33,1,,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
Q10068,37,1,,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0
Q100749,53,1,"[doctorate, habilitation]",0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
Q1017117,52,0,,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0
Q102124,72,0,,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,1,[aggregation of modern literature],0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
Q99484,60,0,"[Diplom, Master of Arts]",0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
Q99695,78,0,,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
Q99710,95,1,,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0


#### Academic Degree

* For academic degree the idea is to group the values into four classes (bachelor, master, phd, other) and then one-hot encode them.

In [105]:
ml_df.academic_degree = ml_df.academic_degree.apply(lambda x : x[0] if type(x)== np.ndarray else x)
ml_df.loc[ml_df.academic_degree.str.contains('doctorate|doctor|phd|doktor|professor|candidate|cand', 
        case = False, na = False),'academic_degree'] = 'phd'
ml_df.loc[ml_df.academic_degree.str.contains('bachelor', case = False, na = False),
          'academic_degree'] = 'bachelor'
ml_df.loc[ml_df.academic_degree.str.contains('master|laurea magistrale|magister artium|magister', case = False, na = False),
          'academic_degree'] = 'master'


In [106]:
ml_df['academic_degree'].groupby(ml_df['academic_degree']).count().sort_values(ascending=False)

academic_degree
phd                                   220
bachelor                               73
master                                 17
Diplom-Volkswirt                        2
archivist palaeographer                 2
Diplom                                  1
Diploma of Business Administration      1
aggregation of modern literature        1
jurist                                  1
laurea                                  1
political science                       1
psychology                              1
Name: academic_degree, dtype: int64

In [107]:
ml_df.loc[~ml_df.academic_degree.isin(['phd', 'bachelor', 'master']), 'academic_degree'] = 'Other'
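
Because only the first list element is kept, a speaker whose degrees are listed as `[Diplom, Master of Arts]` ends up in `Other`. One possible alternative, sketched here under the assumption that the highest degree is the most informative, scans the whole list. `DEGREE_RULES`, `classify_degree` and `highest_degree` are hypothetical names, and the patterns are a reduced version of the ones used above.

```python
import re

# Keyword rules, ordered from the highest to the lowest degree.
DEGREE_RULES = [
    ("phd", r"doctor|phd|doktor"),
    ("master", r"master|magister|laurea magistrale"),
    ("bachelor", r"bachelor"),
]

def classify_degree(name):
    """Map a single degree name to phd / master / bachelor / Other."""
    for label, pattern in DEGREE_RULES:
        if re.search(pattern, name, flags=re.IGNORECASE):
            return label
    return "Other"

def highest_degree(degrees):
    """Return the highest-ranked class found in a list of degree names."""
    if not degrees:
        return "Other"  # missing values are grouped with Other, as above
    order = ["phd", "master", "bachelor", "Other"]
    classes = {classify_degree(degree) for degree in degrees}
    return next(label for label in order if label in classes)

print(highest_degree(["Diplom", "Master of Arts"]))
print(highest_degree(["doctorate", "habilitation"]))
print(highest_degree(None))
```

Whether this is preferable depends on the question being asked; it changes which rows count as `master` or `phd`, so any model fitted afterwards would need to be re-run.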

In [108]:
ml_df = pd.get_dummies(ml_df, columns = ['academic_degree'])

In [109]:
ml_df

qids,age,target_label,nationality_Africa,nationality_Asia,nationality_Australia,nationality_Europe,nationality_North_America,nationality_South_America,nationality_nan,gender_female,...,religion_Judaism,religion_Other,religion_nan,party_Democratic,party_Other,party_Republican,academic_degree_Other,academic_degree_bachelor,academic_degree_master,academic_degree_phd
Q1000592,33,1,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
Q10068,37,1,0,0,0,0,1,0,0,1,...,0,0,1,0,1,0,1,0,0,0
Q100749,53,1,0,0,0,1,0,0,0,0,...,0,0,1,0,1,0,0,0,0,1
Q1017117,52,0,0,0,0,0,1,0,0,0,...,0,0,1,0,1,0,1,0,0,0
Q102124,72,0,0,0,0,0,1,0,0,1,...,0,0,1,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q993472,52,1,0,0,0,1,0,0,0,0,...,0,0,1,0,1,0,1,0,0,0
Q99484,60,0,0,0,0,1,0,0,0,0,...,0,0,1,0,1,0,1,0,0,0
Q99695,78,0,0,0,0,1,0,0,0,0,...,0,0,1,0,1,0,1,0,0,0
Q99710,95,1,0,0,0,1,0,0,0,0,...,0,0,1,0,1,0,1,0,0,0


We now fit the model and analyze the results.

In [110]:
from sklearn.preprocessing import StandardScaler
ml_df.loc[:, ml_df.columns != 'target_label'] = StandardScaler().fit_transform(ml_df.loc[:, ml_df.columns != 'target_label'])
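
`StandardScaler` replaces each value `x` in a column by `(x - mean) / std`, using the population standard deviation. For a 0/1 indicator this produces a column that still takes only two distinct values, now centred at zero with unit variance. A small pure-Python sketch on made-up values:

```python
from statistics import mean, pstdev

indicator = [1, 0, 0, 1, 0]   # toy 0/1 dummy column

mu = mean(indicator)          # column mean
sigma = pstdev(indicator)     # population standard deviation
scaled = [(x - mu) / sigma for x in indicator]

# The scaled column still has exactly two distinct values.
print(sorted(set(round(v, 4) for v in scaled)))
```

Scaling therefore changes the units of the indicator coefficients but not the information the indicators carry.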

In [111]:
formula = 'target_label' + ' ~ ' + ' + '.join([col for col in ml_df.columns if not col=='target_label'])

In [112]:
formula

'target_label ~ age + nationality_Africa + nationality_Asia + nationality_Australia + nationality_Europe + nationality_North_America + nationality_South_America + nationality_nan + gender_female + gender_male + religion_Atheist + religion_Buddhism + religion_Cristhianity + religion_Hinduism + religion_Islam + religion_Judaism + religion_Other + religion_nan + party_Democratic + party_Other + party_Republican + academic_degree_Other + academic_degree_bachelor + academic_degree_master + academic_degree_phd'

In [113]:
import statsmodels.formula.api as smf

# Each indicator is wrapped in C(...) so that it is treated as categorical.
logit_formula = (
    'target_label ~ age + C(nationality_Africa) + C(nationality_Asia) + C(nationality_Australia) '
    '+ C(nationality_Europe) + C(nationality_North_America) + C(nationality_South_America) '
    '+ C(nationality_nan) + C(gender_female) + C(gender_male) '
    '+ C(religion_Atheist) + C(religion_Buddhism) + C(religion_Cristhianity) + C(religion_Hinduism) '
    '+ C(religion_Islam) + C(religion_Judaism) + C(religion_Other) + C(religion_nan) '
    '+ C(party_Democratic) + C(party_Other) + C(party_Republican) '
    '+ C(academic_degree_Other) + C(academic_degree_bachelor) + C(academic_degree_master) + C(academic_degree_phd)'
)
mod = smf.logit(formula=logit_formula, data=ml_df)
res = mod.fit_regularized()
print(res.summary())

Optimization terminated successfully.    (Exit mode 0)
            Current function value: 0.6535194384156819
            Iterations: 130
            Function evaluations: 130
            Gradient evaluations: 130
                           Logit Regression Results                           
Dep. Variable:           target_label   No. Observations:                12470
Model:                          Logit   Df Residuals:                    12444
Method:                           MLE   Df Model:                           25
Date:                Sun, 12 Dec 2021   Pseudo R-squ.:                 0.03143
Time:                        22:23:55   Log-Likelihood:                -8149.4
converged:                       True   LL-Null:                       -8413.9
Covariance Type:            nonrobust   LLR p-value:                 7.581e-96
                                                                       coef    std err          z      P>|z|      [0.025      0.975]
---------------------

Ideally we would analyse the coefficients of the logistic regression classifier. However, because of the large number of missing values, we cannot make reliable claims about the coefficients or about how useful each feature is as a predictor.

In [114]:
import statsmodels.formula.api as smf

mod = smf.ols(formula= formula, data=ml_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:           target_label   R-squared:                       0.040
Model:                            OLS   Adj. R-squared:                  0.038
Method:                 Least Squares   F-statistic:                     25.96
Date:                Sun, 12 Dec 2021   Prob (F-statistic):           3.67e-95
Time:                        22:23:55   Log-Likelihood:                -8563.3
No. Observations:               12470   AIC:                         1.717e+04
Df Residuals:                   12449   BIC:                         1.732e+04
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept             
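
The summary reports `Df Model: 20` although the formula lists 25 predictors. A plausible explanation (an interpretation, not something the output states) is that each of the five one-hot groups contains a complete set of indicators, so within each group one column is determined by the intercept and the others, leaving 25 - 5 = 20 independent predictors. The sketch below shows the dependency on made-up data and repeats the count.

```python
# Toy one-hot group with a complete set of indicators (values are made up).
gender_female = [0, 1, 0, 1, 0]
gender_male = [1, 0, 1, 0, 1]
intercept = [1] * len(gender_female)

# Every row sums to 1, so intercept == gender_female + gender_male:
# the three columns carry only two independent pieces of information.
redundant = all(f + m == c for f, m, c in zip(gender_female, gender_male, intercept))
print(redundant)

# Counting predictors in the formula above: one numeric column plus five groups.
group_sizes = {"nationality": 7, "gender": 2, "religion": 8, "party": 3, "academic_degree": 4}
n_predictors = 1 + sum(group_sizes.values())
n_independent = n_predictors - len(group_sizes)
print(n_predictors, n_independent)
```

Standardizing the columns does not remove this dependency, since a shifted and rescaled indicator is still determined by the intercept and the other indicators of its group. Individual coefficients within a group should therefore be read with care.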

Even though linear regression is not typically used for classification, we look at the signs of the coefficients to see whether a feature points towards Trump or Clinton. People from Europe and Asia are more likely to talk about Trump (positive coefficient), and we can state this with high confidence given the low p-value (around 0). Curiously, both Democrats and Republicans speak more about Clinton (negative sign), whereas people from other political parties speak more about Trump (again with a p-value around 0). Finally, Christians tend to mention Clinton more than Trump, since the coefficient for religion_Cristhianity is negative with a small p-value (around 0).

### Conclusion

In the notebook Project2 we provide a detailed analysis of the US political landscape from 2015 to 2017. Throughout the notebook we look closely at the characteristics of the speakers who talked about the two key political figures of that period, Donald Trump and Hillary Clinton. By comparing these two groups we can reason about the different bases around the two politicians, and therefore about the underlying differences between the Democratic and Republican bases. Finally, we build a model to predict whether someone is talking about Trump or Clinton; its results give insight into whether a person's characteristics, such as age or gender, can serve as predictors for this task. To sum up, our project's goal was to provide a detailed, data-centric view of the popularity and polarization of these two political figures.