# Milestone 2

---
By the end of this milestone we expect to be able to:
1. Load data from a JSON file
2. Filter the dataset
3. Do a sentiment analysis of the quotes
4. Do a sensitivity analysis
5. Create meaningful visualizations to analyze the data
---


## Context

In this project, we are going to analyze a data set of quotes from the New York Times in order to cluster the authors of the quotes by common opinions.

Oftentimes, we would like to know someone's opinion on a topic for which we have no quote from them. In that case, if we have a cluster made of people with similar opinions on a variety of topics, we can try to infer their opinion on the missing topic from the opinion of the majority of their cluster. This process is similar to the k-nearest neighbors (KNN) algorithm.
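
The majority-based inference described above can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `infer_opinion` and the opinion labels are made up, and a cluster is reduced to a plain list of the opinions its members hold on the missing topic (`None` when unknown).

```python
from collections import Counter

def infer_opinion(cluster_opinions):
    """Return the most common known opinion in a cluster, or None if nobody has one."""
    known = [opinion for opinion in cluster_opinions if opinion is not None]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]

# Hypothetical cluster: opinions of the other members on the missing topic
cluster = ["positive", "negative", "positive", None, "positive"]
print(infer_opinion(cluster))
```

Unlike KNN proper, this sketch does not weight members by distance; every member of the cluster counts equally.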

The data set is presented as an aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version of the data is a minimal - but complete - list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.



## The data

We are provided with a bz2-compressed JSON file (`.json.bz2`) containing one row per quote.
Each row has the following fields:

 - `quoteID`: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 - `quotation`: Text of the longest encountered original form of the quotation
 - `date`: Earliest occurrence date of any version of the quotation
 - `phase`: Corresponding phase of the data in which the quotation first occurred (A-E)
 - `probas`: Array representing the probabilities of each speaker having uttered the quotation.
      The probabilities across different occurrences of the same quotation are summed for
      each distinct candidate speaker and then normalized
      - `proba`: Probability for a given speaker
      - `speaker`: Most frequent surface form for a given speaker in the articles where the quotation occurred
 - `speaker`: Selected most likely speaker. This matches the first speaker entry in `probas`
 - `qids`: Wikidata IDs of all aliases that match the selected speaker
 - `numOccurrences`: Number of times this quotation occurs in the articles
 - `urls`: List of links to the original articles containing the quotation 
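
To make the field layout concrete, here is a small sketch that parses one record and picks the most likely speaker from `probas`. The record itself is made up (including the `Q0` identifier and the example URL), and we assume that `probas` is a list of `[speaker, probability]` pairs, which is how it appears in the outputs later in this notebook.

```python
import json

# Made-up record that follows the field layout listed above
line = json.dumps({
    "quoteID": "2016-01-01-000001",
    "quotation": "An example quotation.",
    "speaker": "Jane Doe",
    "qids": ["Q0"],
    "date": "2016-01-01 00:00:00",
    "numOccurrences": 2,
    "probas": [["Jane Doe", 0.71], ["None", 0.29]],
    "urls": ["http://example.com/article"],
    "phase": "E",
})

record = json.loads(line)

# float() keeps the comparison valid whether probabilities are stored as numbers or strings
top_speaker, top_proba = max(record["probas"], key=lambda pair: float(pair[1]))
print(top_speaker, top_proba)
```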

### Imports 

In [1]:
# Imports we may need
import seaborn as sns
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
import ujson as json
import bz2
import pyarrow.parquet as pq

from wikidata.client import Client
import wikipedia

import urllib.request

client = Client()

### Load the quotes files

We load the data of each year in chunks. Since we can't load the entire set into memory at once, we use a "Divide and Conquer" strategy: we read the first 10000 quotes, work on them, save the results, and then fetch the next 10000 quotes.


In [2]:
#read YYYY quotes file

chunks_2008 = pd.read_json("data/quotes-2008.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2009 = pd.read_json("data/quotes-2009.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2010 = pd.read_json("data/quotes-2010.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2011 = pd.read_json("data/quotes-2011.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2012 = pd.read_json("data/quotes-2012.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2013 = pd.read_json("data/quotes-2013.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2014 = pd.read_json("data/quotes-2014.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2015 = pd.read_json("data/quotes-2015.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2016 = pd.read_json("data/quotes-2016.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2017 = pd.read_json("data/quotes-2017.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2018 = pd.read_json("data/quotes-2018.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2019 = pd.read_json("data/quotes-2019.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2020 = pd.read_json("data/quotes-2020.json.bz2",lines=True, chunksize = 10000, compression = "bz2")

chunks_YYYY = [chunks_2008, chunks_2009, chunks_2010, chunks_2011, chunks_2012, chunks_2013, chunks_2014, chunks_2015, chunks_2016, chunks_2017, chunks_2018, chunks_2019, chunks_2020]

years_list= ["2008","2009","2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018","2019", "2020"]
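
The chunk-by-chunk strategy described above can be illustrated without the real data. The sketch below uses only the standard library instead of pandas: `iter_chunks` is a hypothetical helper that yields lists of parsed records, and the tiny demo file is created on the fly. `pd.read_json(..., lines=True, chunksize=...)` plays the same role in this notebook, but yields DataFrames.

```python
import bz2
import json
import os
import tempfile
from itertools import islice

def iter_chunks(path, chunksize):
    """Yield lists of at most `chunksize` parsed records from a .json.bz2 lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        while True:
            lines = list(islice(handle, chunksize))
            if not lines:
                break
            yield [json.loads(line) for line in lines]

# Build a tiny throwaway file so the sketch runs without the real data
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "quotes-demo.json.bz2")
with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
    for n in range(5):
        handle.write(json.dumps({"quoteID": f"demo-{n}", "numOccurrences": n}) + "\n")

# Process one chunk at a time and keep only a small running result
total_occurrences = 0
chunk_sizes = []
for chunk in iter_chunks(demo_path, chunksize=2):
    chunk_sizes.append(len(chunk))
    total_occurrences += sum(row["numOccurrences"] for row in chunk)

print(chunk_sizes, total_occurrences)
```

Only one chunk is held in memory at a time; what survives between iterations is the small aggregate, not the rows themselves.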

Load the extra dataframe with attributes of each author

In [3]:
parquet_frame = pd.read_parquet("data/speaker_attributes.parquet")

QID_columns = ["nationality", "gender", "ethnic_group", "occupation", "party", "candidacy", "religion"]

In [4]:
parquet_frame

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,Q23,George Washington,"[Q698073, Q697949]",item,[Q682443]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9055976,[Barker Howard],,[Q30],[Q6581097],1397399351,,,[Q82955],,,Q106406560,Barker B. Howard,,item,
9055977,[Charles Macomber],,[Q30],[Q6581097],1397399471,,,[Q82955],,,Q106406571,Charles H. Macomber,,item,
9055978,,[+1848-04-01T00:00:00Z],,[Q6581072],1397399751,,,,,,Q106406588,Dina David,,item,
9055979,,[+1899-03-18T00:00:00Z],,[Q6581072],1397399799,,,,,,Q106406593,Irma Dexinger,,item,


Function to get the English name of a given QID

In [5]:
# Given QID gets the english name for it
def get_QID_value(QID):
    entity = client.get(QID, load = True)
    entity_DataFrame = pd.DataFrame.from_dict(entity.data)
    return entity_DataFrame["labels"]["en"]["value"]
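
Since each lookup goes over the network and the same QIDs (for example gender or nationality) appear in many rows, caching the results may save a lot of requests. The sketch below shows the idea with `functools.lru_cache`; `fetch_label` is a stand-in for `get_QID_value` with two hard-coded labels taken from the output shown further down, so that the example runs offline.

```python
from functools import lru_cache

calls = []  # records every simulated "network" request

def fetch_label(qid):
    """Stand-in for a network lookup such as get_QID_value (hard-coded data)."""
    calls.append(qid)
    return {"Q30": "United States of America", "Q6581097": "male"}[qid]

@lru_cache(maxsize=None)
def cached_label(qid):
    """Return the label for a QID, querying fetch_label only on the first request."""
    return fetch_label(qid)

labels = [cached_label(qid) for qid in ["Q30", "Q6581097", "Q30", "Q30"]]
print(labels)
print(calls)
```

Four labels are returned, but only the first request for each distinct QID reaches `fetch_label`.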

Function that replaces the QIDs in a row with their English names

In [6]:
# Given a row, change QID values to the english names

def replace_QID_by_value(df, row, QID_columns):
    for column in QID_columns:
        QID_list = df.at[row, column]
        # Columns without any QID hold None and are left untouched
        if QID_list is not None:
            df.at[row, column] = [get_QID_value(QID) for QID in QID_list]
    return df
    

In [17]:
df = replace_QID_by_value(parquet_frame, 0, QID_columns)

In [18]:
df

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Great Britain, United States of America]",[male],1395141751,,W000178,"[politician, military officer, farmer, cartogr...",[independent politician],,Q23,George Washington,"[1792 United States presidential election, 178...",item,[Episcopal Church]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9055976,[Barker Howard],,[Q30],[Q6581097],1397399351,,,[Q82955],,,Q106406560,Barker B. Howard,,item,
9055977,[Charles Macomber],,[Q30],[Q6581097],1397399471,,,[Q82955],,,Q106406571,Charles H. Macomber,,item,
9055978,,[+1848-04-01T00:00:00Z],,[Q6581072],1397399751,,,,,,Q106406588,Dina David,,item,
9055979,,[+1899-03-18T00:00:00Z],,[Q6581072],1397399799,,,,,,Q106406593,Irma Dexinger,,item,


### Functions to do some tests on the dataset

Quotes from a given month of a given year. At first it seemed a good idea to split and analyse the quotes by month. However, the date shown in the `date` column appears to be the last date on which the quote was mentioned (e.g. if the same quote was mentioned in April and in December, the date will be the December one). As a result, the quotes are concentrated in the last months of the year, so the data is not as evenly distributed across months as first expected.

In [24]:
def Quotes_Year_month(year, month):
    chunk_month = []
    chunks_Year = pd.read_json("data/quotes-" + year + ".json.bz2",lines=True, chunksize = 10000, compression = "bz2")
    for chunk in chunks_Year:
        chunk_month.append(chunk[pd.DatetimeIndex(chunk['date']).month == month])
    return chunk_month

In [26]:
chunk_month = Quotes_Year_month("2016", 9)

In [31]:
chunk_month_DataFrame = pd.concat(chunk_month)
chunk_month_DataFrame

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
1,2008-09-15-000037,"13th watered heavily up to the ridge, then roc...",gerry byrne,"[Q3104325, Q959780]",2008-09-15 16:29:27,1,"[[gerry byrne, 0.4065], [howard clark, 0.3118]...",[http://belfasttelegraph.co.uk/sport/golf/amer...,A
3,2008-09-16-000159,80 per cent towards the basis of an agreement.,mario cossio,[Q9029123],2008-09-16 01:12:16,2,"[[mario cossio, 0.4073], [evo morales, 0.247],...",[http://channelnewsasia.com/stories/afp_world/...,A
4,2008-09-03-000269,"a challenge for the u.s. and the world,",,[],2008-09-03 19:10:55,2,"[[None, 0.5239], [mikheil saakashvili, 0.4761]]",[http://warisboring.com/?p=1339],A
5,2008-09-22-000262,a child's head is the heaviest part of his bod...,michael turner,"[Q1372347, Q1372443, Q15969753, Q1929622, Q208...",2008-09-22 14:01:20,1,"[[michael turner, 0.5583], [None, 0.4417]]",[http://cnn.com/2008/HEALTH/family/09/22/accid...,A
8,2008-09-25-000330,a core taking over our education system?,thomas muthee,[Q7792644],2008-09-25 20:39:19,1,"[[thomas muthee, 0.5289], [None, 0.4711]]",[http://woai.com/content/blogs/wiccan/story.as...,A
...,...,...,...,...,...,...,...,...,...
4641312,2008-09-16-071984,you never want to go to somebody else's place ...,jim scherr,[Q15820748],2008-09-16 05:17:56,1,"[[jim scherr, 0.431], [peter ueberroth, 0.2946...",[http://gazette.com/sports/usoc_40594___articl...,A
4641317,2008-09-26-069663,you shouldn't look at [ your balance ] every d...,carmen wong ulrich,[Q5043616],2008-09-26 22:04:25,1,"[[carmen wong ulrich, 0.5583], [None, 0.4417]]",[http://usnews.com/blogs/new-money/2008/9/26/o...,A
4641320,2008-09-30-070962,you will be an accountant for the rest of your...,you long,"[Q45476726, Q45678948]",2008-09-30 00:15:04,1,"[[you long, 0.8172], [None, 0.1828]]",[http://messagefromthemuse.typepad.com/message...,A
4641321,2008-09-16-072287,you'd like to think if you're going to make a ...,mike nolan,"[Q3249544, Q6848203]",2008-09-16 14:44:20,1,"[[mike nolan, 0.2222], [shawntae spencer, 0.20...",[http://pressdemocrat.com/article/20080915/spo...,A
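
One way to check how quotes are spread over the months is to count timestamps per month. The sketch below does this on a handful of made-up timestamps written in the same format as the `date` column; `quotes_per_month` is a hypothetical helper, and the numbers say nothing about the real data.

```python
from collections import Counter
from datetime import datetime

def quotes_per_month(dates):
    """Count how many timestamps fall in each calendar month (1-12)."""
    months = (datetime.strptime(d, "%Y-%m-%d %H:%M:%S").month for d in dates)
    return Counter(months)

# Made-up timestamps in the same format as the `date` column
sample_dates = [
    "2008-09-15 16:29:27",
    "2008-09-16 01:12:16",
    "2008-12-03 19:10:55",
    "2008-04-22 14:01:20",
]

counts = quotes_per_month(sample_dates)
for month in sorted(counts):
    print(month, counts[month])
```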


Function that returns the requested chunk after applying a given function to it, together with an iterator positioned at the chunk that follows

In [8]:
def get_chunk_and_transform(year, chunk_number, chunksize, f):
    iterator = pd.read_json("data/quotes-" + year + ".json.bz2", lines=True, chunksize=chunksize, compression="bz2")
    # Skip the chunks before the requested one (chunk_number is 1-based; 0 also returns the first chunk)
    for _ in range(chunk_number - 1):
        next(iterator)
    # Apply f to the requested chunk; the iterator is left at the following chunk
    specific_chunk = f(next(iterator))
    return specific_chunk, iterator

In [9]:
chunk, iterator = get_chunk_and_transform("2016", 0, 10000, lambda x : x)
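
The skipping logic of `get_chunk_and_transform` can also be written with `itertools.islice`. The sketch below shows it on a plain iterator of lists rather than on the pandas reader; `take_chunk` is a hypothetical name, and the position is treated as 1-based, with 0 also returning the first chunk, as in the function above.

```python
from itertools import islice

def take_chunk(iterator, chunk_number):
    """Skip ahead and return the chunk at 1-based position `chunk_number`."""
    skipped = max(chunk_number - 1, 0)
    # islice consumes `skipped` items from the iterator, then yields the next one
    return next(islice(iterator, skipped, None))

# Stand-in for an iterator of chunks
chunks = iter([["a", "b"], ["c", "d"], ["e", "f"]])
second = take_chunk(chunks, 2)
following = next(chunks)  # the iterator continues right after the returned chunk
print(second, following)
```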

In [17]:
chunk

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2016-12-26-000040,[ ] and Chris [ Jones ] were in there a lot an...,Andy Reid,"[Q2622812, Q27830815, Q470738, Q4761219]",2016-12-26 20:05:00,1,"[[Andy Reid, 0.9432], [None, 0.0541], [Trevor ...",[http://www.kcchiefs.com/news/article-2/How-a-...,E
1,2016-07-31-000006,[ And ] I don't know if we have enough time to...,Mike Howe,[Q6847325],2016-07-31 08:22:12,2,"[[Mike Howe, 0.7118], [None, 0.2882]]",[http://www.peninsuladailynews.com/apps/pbcs.d...,E
2,2016-09-06-000292,... I feel like I was champion long before I l...,,[],2016-09-06 20:54:45,2,"[[None, 0.6877], [John Waters, 0.3123]]",[http://onlineathens.com/breaking-news/2016-09...,E
3,2016-07-11-000226,[ I ] mmigration has been and continues to be ...,Hillary Clinton,[Q6294],2016-07-11 17:26:06,1,"[[Hillary Clinton, 0.9025], [None, 0.0975]]",[http://www.breitbart.com/tech/2016/07/11/hill...,E
4,2016-05-26-000371,[ It is ] the process of understanding what ki...,Bruce Maxwell,[Q26129591],2016-05-26 15:21:37,1,"[[Bruce Maxwell, 0.8178], [None, 0.1822]]",[http://www.scout.com/mlb/athletics/story/1673...,E
...,...,...,...,...,...,...,...,...,...
9995,2016-08-06-031503,It's a great way to take students out into the...,Greg Schneider,[Q5606223],2016-08-06 11:00:00,1,"[[Greg Schneider, 0.7712], [None, 0.2288]]",[http://greensburgdailynews.com/news/local_new...,E
9996,2016-09-01-065324,"It's a happy accident, to me, that nothing has...",Adam Brody,[Q294372],2016-09-01 15:12:37,1,"[[Adam Brody, 0.4867], [None, 0.438], [Leighto...",[http://www.huffingtonpost.com/2016/09/01/adam...,E
9997,2016-07-27-070749,It's a joyous occasion and I'd like to think a...,Bill Miller,"[Q14951130, Q16187209, Q16732604, Q19325928, Q...",2016-07-27 02:43:37,1,"[[Bill Miller, 0.784], [None, 0.216]]",[http://wabi.tv/2016/07/26/22nd-annual-old-fas...,E
9998,2016-08-15-052447,It's a little bit better than yesterday in som...,,[],2016-08-15 11:00:00,1,"[[None, 0.6434], [Larry Richard, 0.3566]]",[http://iberianet.com/news/still-rescues-in-th...,E


Some QIDs no longer have an associated Wikidata page, so the lookup raises a 404 error. In those cases we need to catch the error with:

In [12]:
try:
    print(get_QID_value("Q12969354"))
except urllib.error.HTTPError as exception:
    print()
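
When many QIDs are resolved in a loop, the same `try`/`except` can be wrapped in a small helper that returns a default value instead of stopping. This is a sketch under assumptions: `safe_lookup` and `fake_lookup` are hypothetical names, and `fake_lookup` simulates the 404 locally so that the example runs without network access.

```python
import urllib.error

def safe_lookup(lookup, qid, default=None):
    """Call `lookup(qid)` and return `default` if the server answers with an HTTP error."""
    try:
        return lookup(qid)
    except urllib.error.HTTPError:
        return default

def fake_lookup(qid):
    """Stand-in for get_QID_value: knows one QID and raises a 404 for the rest."""
    known = {"Q42": "Douglas Adams"}
    if qid not in known:
        raise urllib.error.HTTPError(url="", code=404, msg="Not Found", hdrs=None, fp=None)
    return known[qid]

found = safe_lookup(fake_lookup, "Q42")
missing = safe_lookup(fake_lookup, "Q12969354")
print(found, missing)
```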




Some speakers are mentioned in the dataframe under different names (e.g. barack obama, president obama, president barack obama). Since this makes the dataset impractical to analyse, the next two functions resolve this ambiguity.

In [21]:
def get_qids_based_on_aliases(parquet_frame):
    qids_list = []
    aux_list = []
    i = 0
    aliases_column = parquet_frame["aliases"]
    for aliases_list in aliases_column:
        if aliases_list is not None:
            if aliases_list.size > 1:
                aux_list.append(parquet_frame.at[i,"id"])
                aux_list.append(parquet_frame.at[i,"label"])
                qids_list.append(aux_list)
                aux_list = []
        i = i + 1
    return qids_list

In [22]:
qids_list = get_qids_based_on_aliases(parquet_frame)

In [35]:
get_QID_value("Q14951130")

'Bill Miller'

In [36]:
for qids in qids_list:
    if qids[1] == "Bill Miller":
        print(qids)

['Q4910198', 'Bill Miller']
['Q16187209', 'Bill Miller']


In [24]:
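def Filter_aliases(chunk, ids_list):
    qids_column = chunk["qids"]
    # Assumes the chunk has a contiguous RangeIndex, as chunks from read_json do
    i = chunk.index.start
    for qids in qids_column:
        for ids in ids_list:
            # ids is an [id, label] pair; use the label as the canonical speaker name
            if ids[0] in qids:
                chunk.at[i,"speaker"] = ids[1]
        i = i + 1
    return chunk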



In [25]:
new_chunk = Filter_aliases(chunk, qids_list)

In [34]:
new_chunk.at[9997, "qids"]

['Q14951130',
 'Q16187209',
 'Q16732604',
 'Q19325928',
 'Q28421889',
 'Q30507980',
 'Q43379343',
 'Q438279',
 'Q4910185',
 'Q4910193',
 'Q4910195',
 'Q860960',
 'Q862326']
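
`Filter_aliases` compares every quote against every `[id, label]` pair, which becomes slow when the list of pairs is long. A possible alternative, sketched below with hypothetical helper names, is to build a dictionary from QID to label once and then look up each quote's QIDs directly. Note one difference: here the first matching QID wins, whereas in `Filter_aliases` the last matching pair overwrites the earlier ones.

```python
def build_label_index(ids_list):
    """Map each QID to its label so that a lookup is a single dictionary access."""
    return {qid: label for qid, label in ids_list}

def canonical_speaker(qids, label_index, fallback):
    """Return the label of the first QID found in the index, else the fallback name."""
    for qid in qids:
        if qid in label_index:
            return label_index[qid]
    return fallback

# The two pairs printed above for "Bill Miller"
ids_list = [["Q4910198", "Bill Miller"], ["Q16187209", "Bill Miller"]]
label_index = build_label_index(ids_list)

print(canonical_speaker(["Q14951130", "Q16187209"], label_index, "bill miller"))
print(canonical_speaker([], label_index, "unknown speaker"))
```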

The histogram below is based on a 10% sample of the data set. For a cleaner view of the main speakers, only speakers with more than 500 quotes are plotted.

In [77]:
# Count occurrences of speakers
i = 0
chunk_sample_list = []
for chunk in chunks_2008:
    chunk_sample = chunk.sample(frac = 0.1)
    chunk_sample_list.append(chunk_sample["speaker"])
    i = i+ 1
    print(i)
    
speakers_counts = pd.concat(chunk_sample_list)
speakers_counts = speakers_counts.value_counts()
speakers = speakers_counts[speakers_counts > 500]
speakers = pd.DataFrame(speakers)
ax = sns.barplot(x="speaker", y=speakers.index, data=speakers)
plt.show()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28


KeyboardInterrupt: 
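
The counting step can also be done incrementally, so that only a running tally is kept instead of a list of sampled columns. The sketch below uses `collections.Counter` on made-up chunks of speaker names; `count_speakers` is a hypothetical helper, and it omits the 10% sampling used in the cell above.

```python
from collections import Counter

def count_speakers(chunks, min_count):
    """Tally speakers over all chunks and keep those with more than `min_count` quotes."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk)
    return {speaker: n for speaker, n in counts.items() if n > min_count}

# Made-up chunks, each a list of speaker names
chunks = [["a", "b", "a"], ["a", "c", "b"]]
frequent = count_speakers(chunks, min_count=1)
print(frequent)
```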