# Milestone 2

---
By the end of this milestone we expect to be able to:
1. Load data from a JSON file
2. Filter the dataset
3. Do a sentiment analysis of the quotes
4. Do a sensitivity analysis
5. Create meaningful visualizations to analyze the data
---


## Context

In this project, we are going to analyze a data set of quotes from the New York Times in order to cluster the authors of the quotes by common opinions.

Often we would like to know someone's opinion on a topic for which we have no quote from them. In that case, if we have a cluster of people with similar opinions on a variety of topics, we can try to infer the missing opinion from the majority opinion of that person's cluster. This is similar in spirit to the k-nearest neighbors (KNN) algorithm.
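
As a rough illustration of the majority-vote idea, here is a minimal sketch. The function name `infer_opinion`, the dictionaries, and the toy people and opinions are all hypothetical; the clustering itself is assumed to have been done already.

```python
from collections import Counter

def infer_opinion(person, topic, clusters, opinions):
    """Guess `person`'s opinion on `topic` from the majority of their cluster.

    clusters: dict mapping person -> cluster label
    opinions: dict mapping (person, topic) -> opinion label
    Returns None when nobody else in the cluster has a known opinion.
    """
    label = clusters[person]
    votes = Counter(
        opinions[(other, topic)]
        for other, other_label in clusters.items()
        if other != person and other_label == label and (other, topic) in opinions
    )
    if not votes:
        return None
    # most_common(1) returns a list with the single most frequent (opinion, count) pair
    return votes.most_common(1)[0][0]

# Made-up toy data, not taken from the dataset
clusters = {"alice": 0, "bob": 0, "carol": 0, "dave": 1}
opinions = {
    ("bob", "taxes"): "positive",
    ("carol", "taxes"): "positive",
    ("dave", "taxes"): "negative",
}
print(infer_opinion("alice", "taxes", clusters, opinions))
```

Unlike KNN, which votes among the k closest points, this sketch votes among all members of a precomputed cluster.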

The data set is presented as an aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version of the data is a minimal - but complete - list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.



## The data

We are provided with bz2-compressed JSON files (`.json.bz2`) containing one quote per row.
Each row has the following fields:

 - `quoteID`: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 - `quotation`: Text of the longest encountered original form of the quotation
 - `date`: Earliest occurrence date of any version of the quotation
 - `phase`: Corresponding phase of the data in which the quotation first occurred (A-E)
 - `probas`: Array representing the probabilities of each speaker having uttered the quotation.
      The probabilities across different occurrences of the same quotation are summed for
      each distinct candidate speaker and then normalized
      - `proba`: Probability for a given speaker
      - `speaker`: Most frequent surface form for a given speaker in the articles where the quotation occurred
 - `speaker`: Selected most likely speaker. This matches the first speaker entry in `probas`
 - `qids`: Wikidata IDs of all aliases that match the selected speaker
 - `numOccurrences`: Number of times this quotation occurs in the articles
 - `urls`: List of links to the original articles containing the quotation 
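
To make the field list more concrete, here is a small sketch that parses one row with the standard library. The record below is hypothetical, and it assumes `probas` is stored as a list of `[speaker, proba]` pairs, as the DataFrame output later in this notebook suggests; the helper names `parse_quote_id` and `top_candidate` are made up for illustration.

```python
import json
from datetime import date

def parse_quote_id(quote_id):
    """Split 'YYYY-MM-DD-NNNNNN' into a date and the running integer."""
    year, month, day, number = quote_id.split("-")
    return date(int(year), int(month), int(day)), int(number)

def top_candidate(record):
    """Return the (speaker, probability) pair with the highest probability."""
    # float() accepts both numbers and numeric strings
    pairs = ((name, float(p)) for name, p in record["probas"])
    return max(pairs, key=lambda pair: pair[1])

# Hypothetical record that follows the field list above; all values are made up
line = json.dumps({
    "quoteID": "2008-09-16-000001",
    "quotation": "an example quotation",
    "speaker": "speaker a",
    "probas": [["speaker a", "0.6"], ["None", "0.4"]],
    "numOccurrences": 2,
})

record = json.loads(line)
print(parse_quote_id(record["quoteID"]))
print(top_candidate(record), record["speaker"])
```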

### Imports 

In [2]:
# Imports we may need
import seaborn as sns
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
import ujson as json
import bz2

### Load the quotes files

Here I loaded each year's data individually and in chunks, because the full set does not fit in memory. The plan is a "divide and conquer" approach: read the first 10,000 quotes, work on them, save the results, and then fetch the next 10,000. One more thing: I don't know if we are supposed to download the files, since they are quite large, but I decided to download them to try them out.


In [3]:
#read YYYY quotes file

chunks_2008 = pd.read_json("data/quotes-2008.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2009 = pd.read_json("data/quotes-2009.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2010 = pd.read_json("data/quotes-2010.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2011 = pd.read_json("data/quotes-2011.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2012 = pd.read_json("data/quotes-2012.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2013 = pd.read_json("data/quotes-2013.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2014 = pd.read_json("data/quotes-2014.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2015 = pd.read_json("data/quotes-2015.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2016 = pd.read_json("data/quotes-2016.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2017 = pd.read_json("data/quotes-2017.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2018 = pd.read_json("data/quotes-2018.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2019 = pd.read_json("data/quotes-2019.json.bz2",lines=True, chunksize = 10000, compression = "bz2")
chunks_2020 = pd.read_json("data/quotes-2020.json.bz2",lines=True, chunksize = 10000, compression = "bz2")

chunks_YYYY = [chunks_2008, chunks_2009, chunks_2010, chunks_2011, chunks_2012, chunks_2013, chunks_2014, chunks_2015, chunks_2016, chunks_2017, chunks_2018, chunks_2019, chunks_2020]

years_list= ["2008","2009","2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018","2019", "2020"]
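
For comparison, the same chunk-by-chunk idea can be sketched with only the standard library, which may help when checking what the pandas readers are doing. This is an assumption-laden sketch: `iter_chunks` is a hypothetical helper, and the tiny demo file is made up so the code runs without the real data.

```python
import bz2
import json
import os
import tempfile
from collections import Counter
from itertools import islice

def iter_chunks(path, chunksize):
    """Yield lists of at most `chunksize` parsed records from a .json.bz2 file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        while True:
            # islice pulls the next `chunksize` lines without loading the whole file
            chunk = [json.loads(line) for line in islice(handle, chunksize)]
            if not chunk:
                return
            yield chunk

# Build a tiny made-up file so the sketch runs without the real data
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "quotes-demo.json.bz2")
with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
    for i in range(25):
        handle.write(json.dumps({"speaker": "speaker %d" % (i % 3)}) + "\n")

# "Divide and conquer": update a small running result chunk by chunk
speaker_counts = Counter()
chunk_sizes = []
for chunk in iter_chunks(demo_path, chunksize=10):
    chunk_sizes.append(len(chunk))
    speaker_counts.update(record["speaker"] for record in chunk)

print(chunk_sizes)
print(speaker_counts)
```

Only one chunk and the running `Counter` are held in memory at any time, which is the property we rely on for the full files.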

### Functions to do some tests on the dataset

First I was curious how many quotes each file had, so I wrote a loop that counts the number of 10,000-quote chunks in each year's file.

In [75]:
# Count the number of 10,000-quote partitions in each year's file
counts = {}
for year in years_list:
    counts[year] = 0
    for chunk in pd.read_json("data/quotes-" + year + ".json.bz2", lines=True, chunksize=10000, compression="bz2"):
        counts[year] += 1
    print("Year:", year, "with", counts[year], "partitions of up to 10,000 quotes")

Then I was curious to know how many quotes each month had in a given year. At first I thought this would be a good way to divide and analyse the quotes by month, but after running the function and re-reading the description of `date`, I realized it probably wasn't a very useful idea. The `date` column holds a single date per unique quote (the field list above describes it as the earliest occurrence), so a quote mentioned in both April and December is counted in only one of those months. For 2008 we only end up with quotes in the last months of the year, so the data is not evenly distributed by month as I first expected.

In [24]:
def Quotes_Year_month(year, month):
    # Collect, chunk by chunk, the rows of the given year whose date falls in `month`
    chunk_month = []
    chunks_Year = pd.read_json("data/quotes-" + year + ".json.bz2", lines=True, chunksize=10000, compression="bz2")
    for chunk in chunks_Year:
        chunk_month.append(chunk[pd.DatetimeIndex(chunk['date']).month == month])
    return chunk_month

In [26]:
chunk_month = Quotes_Year_month("2008", 9)

In [31]:
chunk_month_DataFrame = pd.concat(chunk_month)
chunk_month_DataFrame

| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2008-09-15-000037 | 13th watered heavily up to the ridge, then roc... | gerry byrne | [Q3104325, Q959780] | 2008-09-15 16:29:27 | 1 | [[gerry byrne, 0.4065], [howard clark, 0.3118]... | [http://belfasttelegraph.co.uk/sport/golf/amer... | A |
| 3 | 2008-09-16-000159 | 80 per cent towards the basis of an agreement. | mario cossio | [Q9029123] | 2008-09-16 01:12:16 | 2 | [[mario cossio, 0.4073], [evo morales, 0.247],... | [http://channelnewsasia.com/stories/afp_world/... | A |
| 4 | 2008-09-03-000269 | a challenge for the u.s. and the world, | | [] | 2008-09-03 19:10:55 | 2 | [[None, 0.5239], [mikheil saakashvili, 0.4761]] | [http://warisboring.com/?p=1339] | A |
| 5 | 2008-09-22-000262 | a child's head is the heaviest part of his bod... | michael turner | [Q1372347, Q1372443, Q15969753, Q1929622, Q208... | 2008-09-22 14:01:20 | 1 | [[michael turner, 0.5583], [None, 0.4417]] | [http://cnn.com/2008/HEALTH/family/09/22/accid... | A |
| 8 | 2008-09-25-000330 | a core taking over our education system? | thomas muthee | [Q7792644] | 2008-09-25 20:39:19 | 1 | [[thomas muthee, 0.5289], [None, 0.4711]] | [http://woai.com/content/blogs/wiccan/story.as... | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4641312 | 2008-09-16-071984 | you never want to go to somebody else's place ... | jim scherr | [Q15820748] | 2008-09-16 05:17:56 | 1 | [[jim scherr, 0.431], [peter ueberroth, 0.2946... | [http://gazette.com/sports/usoc_40594___articl... | A |
| 4641317 | 2008-09-26-069663 | you shouldn't look at [ your balance ] every d... | carmen wong ulrich | [Q5043616] | 2008-09-26 22:04:25 | 1 | [[carmen wong ulrich, 0.5583], [None, 0.4417]] | [http://usnews.com/blogs/new-money/2008/9/26/o... | A |
| 4641320 | 2008-09-30-070962 | you will be an accountant for the rest of your... | you long | [Q45476726, Q45678948] | 2008-09-30 00:15:04 | 1 | [[you long, 0.8172], [None, 0.1828]] | [http://messagefromthemuse.typepad.com/message... | A |
| 4641321 | 2008-09-16-072287 | you'd like to think if you're going to make a ... | mike nolan | [Q3249544, Q6848203] | 2008-09-16 14:44:20 | 1 | [[mike nolan, 0.2222], [shawntae spencer, 0.20... | [http://pressdemocrat.com/article/20080915/spo... | A |
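
To check how quotes spread over the months without keeping every matching row, one option is to keep only a per-month count. The sketch below assumes timestamps formatted like the `date` column shown above; `month_counts` is a hypothetical helper and the timestamps are illustrative, not a real sample of the data.

```python
from collections import Counter
from datetime import datetime

def month_counts(dates):
    """Count how many timestamps fall in each calendar month (1-12)."""
    counts = Counter(datetime.strptime(d, "%Y-%m-%d %H:%M:%S").month for d in dates)
    # Report every month, including those with no quotes at all
    return {month: counts.get(month, 0) for month in range(1, 13)}

# Illustrative timestamps in the same format as the `date` column
sample_dates = [
    "2008-09-15 16:29:27",
    "2008-09-16 01:12:16",
    "2008-10-03 19:10:55",
    "2008-12-22 14:01:20",
]
print(month_counts(sample_dates))
```

Months with a count of zero make an uneven distribution visible at a glance.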


Function to fetch the chunk at a given position (skipping the chunks before it) and apply a transformation to it:

In [87]:
def get_chunk_and_transform(year, chunk_number, chunksize, f):
    # Open the year's file lazily, skip `chunk_number` chunks, then apply `f` to the next one
    iterator = pd.read_json("data/quotes-" + year + ".json.bz2", lines=True, chunksize=chunksize, compression="bz2")
    for _ in range(chunk_number):
        next(iterator)
    chunk_partition = next(iterator)
    specific_chunk = f(chunk_partition)
    return specific_chunk
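
The skip-then-take pattern can also be written with `itertools.islice`, which works on any iterator. In this sketch `nth_chunk` and `fake_chunks` are hypothetical names, and the generator only stands in for the pandas reader; I have not checked this against the real files.

```python
from itertools import islice

def nth_chunk(chunks, chunk_number):
    """Return the chunk at 0-based position `chunk_number`, or None if the iterator is too short."""
    return next(islice(chunks, chunk_number, chunk_number + 1), None)

def fake_chunks():
    # Stand-in for the pandas reader: yields three small "chunks"
    yield ["a", "b"]
    yield ["c", "d"]
    yield ["e"]

print(nth_chunk(fake_chunks(), 1))
print(nth_chunk(fake_chunks(), 5))
```

Returning `None` instead of raising `StopIteration` makes it easier to detect that a file has fewer chunks than expected. Either way, the skipped chunks still have to be read and decompressed, because a bz2 stream is read sequentially.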

In [88]:
chunks_2008

<pandas.io.json._json.JsonReader at 0x7f798ca807f0>

In [92]:
Obama = get_chunk_and_transform("2008", 0, 5, lambda x: x["speaker"])

Obama

45


Then I wanted to plot a histogram of the occurrences of the main speakers in a given year (I tried 2008). Since there are so many quotes, I based the histogram on a 10% sample of the data set, and to get a cleaner view of the main speakers I only plotted bars for speakers with more than 500 quotes in the sample.

In [77]:
# Count occurrences of speakers in a 10% sample of each chunk
i = 0
chunk_sample_list = []
for chunk in chunks_2008:
    chunk_sample = chunk.sample(frac=0.1)
    chunk_sample_list.append(chunk_sample["speaker"])
    i = i + 1
    print(i)  # progress: number of chunks processed so far

speakers_counts = pd.concat(chunk_sample_list).value_counts()
# Keep only speakers with more than 500 sampled quotes
speakers = speakers_counts[speakers_counts > 500]
# Pass values and index directly so the plot does not depend on the column name
ax = sns.barplot(x=speakers.values, y=speakers.index)
plt.show()

1
2
...
28


KeyboardInterrupt: 
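
Because the counts come from a 10% sample, a cutoff of 500 sampled quotes corresponds to roughly 5,000 quotes in the full data. The sketch below shows that scaling on made-up data; `sampled_speaker_counts` and `scale_to_full` are hypothetical helpers, and it samples the whole list at once rather than chunk by chunk.

```python
import random
from collections import Counter

def sampled_speaker_counts(speakers, frac, seed=0):
    """Count speakers in a random sample holding a fraction `frac` of the input."""
    rng = random.Random(seed)
    sample_size = int(len(speakers) * frac)
    return Counter(rng.sample(speakers, sample_size))

def scale_to_full(sample_counts, frac):
    """Rough estimate of full-data counts: divide each sampled count by `frac`."""
    return {speaker: count / frac for speaker, count in sample_counts.items()}

# Made-up speaker column: 8,000 quotes by "a" and 2,000 by "b"
speakers = ["a"] * 8000 + ["b"] * 2000
counts = sampled_speaker_counts(speakers, frac=0.1)
estimates = scale_to_full(counts, frac=0.1)
print(counts)
print(estimates)
```

The estimates are only approximate, since each run of the sampling draws different rows unless a seed is fixed.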