# Map dates

### Motivation

The idea explained in the README requires, for each quote, the dates of all its occurrences. These dates are not present in the standard quote-centric dataset (which only gives the date of the first occurrence of each quote), so we use the article-centric Quotebank dataset to recover all the dates corresponding to the filtered quotes. This is the aim of this notebook.

The procedure consists of the following steps:
- We filter the Quotebank (article-centric) dataset chunk by chunk, keeping only the rows of the articles that contain at least one quote with the word Trump or Clinton.
- We then read the file created in the notebook filter_quotes, which contains all the quotes (from the quote-centric data) containing the word Trump. For the sake of simplicity we call this dataframe df_Trump (as already done in the Filter quotes notebook).
- For each of these quotes, we look for its quoteID in the filtered Quotebank (article-centric) dataset and we extract the dates of the articles in which the quoteID appears.
- We replace the date column of the dataframe df_Trump with the list of dates on which the quotation appears.
- We deal with the quotes whose list of dates is empty.
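The steps above can be sketched on toy data in plain Python. The records below are made up: only the field names (`quotations`, `quoteID`, `quotation`, `date`) follow the dataset, and the variable names are hypothetical. This is a minimal illustration of the idea, not the code used in the rest of the notebook.

```python
from collections import defaultdict
from datetime import datetime

# Toy article-centric records (made-up data, Quotebank-like field names).
articles = [
    {"date": datetime(2018, 7, 16, 14, 5), "quotations": [
        {"quoteID": "2018-07-16-000103", "quotation": "Trump said so."},
        {"quoteID": "2018-07-16-000999", "quotation": "Nothing relevant."}]},
    {"date": datetime(2018, 7, 17, 6, 0), "quotations": [
        {"quoteID": "2018-07-16-000103", "quotation": "Trump said so."}]},
]

# Toy quote-centric rows: the second quote never appears in the articles above.
quotes = [{"quoteID": "2018-07-16-000103"}, {"quoteID": "2018-05-09-001003"}]

# Map every matching quoteID to the dates of the articles that contain it.
dates_by_quote = defaultdict(list)
for article in articles:
    for q in article["quotations"]:
        if "Trump" in q["quotation"] or "Clinton" in q["quotation"]:
            dates_by_quote[q["quoteID"]].append(article["date"])

# Attach the list of dates to each quote; it stays empty when nothing was found.
for row in quotes:
    row["dates"] = dates_by_quote.get(row["quoteID"], [])
```

The second toy quote ends up with an empty list, which is the case the last step of the procedure has to handle.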

In [17]:
import pandas as pd

### Initial Mapping

This mapping stores the article dates in a very compact way, which in turn makes it possible to merge the quotes with their respective dates.

In [18]:
def filter_dataframe(year):
    """Filter one year of the article-centric Quotebank dataset.

    Keeps only the articles with at least one quotation containing the word Trump or Clinton
    (see filter_chunk) and returns them as a single dataframe.
    """
    import concurrent.futures

    list_df = []
    path = "../data/quotebank-" + year + ".json.bz2"  # Quotebank article-centric dataset
    with pd.read_json(path, lines=True, chunksize=10000, compression='bz2') as df_reader:
        # leaving the executor's with-block waits for all the submitted tasks to finish
        with concurrent.futures.ThreadPoolExecutor(30) as executor:
            for chunk in df_reader:  # we read chunk by chunk in order not to store everything in memory
                executor.submit(filter_chunk, list_df, chunk)

    df = pd.concat(list_df)  # we concatenate the dataframes together to obtain a unique one
    return df
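One caveat of calling `executor.submit` without keeping the returned futures is that an exception raised inside a worker is stored in its future and never surfaces. A possible variant, sketched below on toy chunks, has each worker return its filtered chunk and collects the results with `future.result()`, which re-raises worker errors. The worker `keep_non_empty` and the toy data are hypothetical and only stand in for `filter_chunk` and the real chunks.

```python
import concurrent.futures

import pandas as pd


def keep_non_empty(chunk):
    """Hypothetical worker: keep the rows whose list of quoteIDs is not empty."""
    return chunk[chunk["quotations"].str.len() > 0]


# Toy chunks standing in for the chunks produced by pd.read_json(..., chunksize=...).
chunks = [
    pd.DataFrame({"quotations": [["2015-08-05-025201"], []]}),
    pd.DataFrame({"quotations": [[], ["2015-08-06-081642"]]}, index=[2, 3]),
]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(keep_non_empty, chunk) for chunk in chunks]
    # result() re-raises any exception from the worker;
    # iterating over `futures` keeps the submission order of the chunks.
    filtered = [future.result() for future in futures]

df = pd.concat(filtered)
```

Returning the chunk instead of appending to a shared list also keeps the rows in file order, whatever order the threads finish in.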

In [19]:
def filtering_func(lst):
    """ we keep just the quoteIDs of the quotations containing the word Trump or Clinton"""
    final_lst = []
    for el in lst:
        if (("Trump" in el["quotation"]) or ("Clinton" in el["quotation"])):
            final_lst.append(el["quoteID"])
    return final_lst

def filter_chunk(list_df, chunk):
    chunk.drop(columns=["articleID", "phase", "title", "url", "articleLength", "names"], inplace=True) # we drop to save space in memory
    chunk.quotations = chunk.quotations.transform(lambda x: filtering_func(x))
    chunk = chunk[chunk["quotations"].str.len() > 0]
    list_df.append(chunk)  #we append it to the list of dataframes

Get the quoteID_date mapping for each year, and save the result

In [None]:
for i in range(2015, 2021):
    quoteID_date = filter_dataframe(str(i))
    quoteID_date.to_pickle('../data/{}_quoteID_date.pkl'.format(i))  # saved where the next cell reads it from

In [30]:
quoteID_cleaned = {str(year): pd.read_pickle('../data/{}_quoteID_date.pkl'.format(year)) for year in range(2015, 2021)}
quoteID_cleaned

{'2015':                                                  quotations  \
 147       [2015-08-05-025201, 2015-08-05-055692, 2015-08...   
 185                  [2015-08-06-081642, 2015-08-06-061444]   
 243                  [2015-08-07-047067, 2015-08-07-046176]   
 257                                     [2015-08-07-100597]   
 667       [2015-08-19-099364, 2015-08-19-121637, 2015-08...   
 ...                                                     ...   
 10144325                                [2015-05-12-050394]   
 10144709                                [2015-07-15-021355]   
 10144886                                [2015-07-23-075404]   
 10144944                                [2015-07-27-056746]   
 10144971                                [2015-07-28-102216]   
 
                         date  
 147      2015-08-05 22:33:14  
 185      2015-08-06 17:46:00  
 243      2015-08-07 17:20:35  
 257      2015-08-07 20:46:40  
 667      2015-08-19 09:49:00  
 ...                      ... 

### Add dates to Trump and Clinton datasets

Update the Trump and Clinton quotes so that each quote carries the dates of all its occurrences, and not just the first date.

In [31]:
# clinton_cleaned = pd.read_pickle('../data/df_Clinton_cleaned.pkl')
trump_cleaned = pd.read_pickle('../data/df_Trump_cleaned.pkl')

# clinton_cleaned
trump_cleaned

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
12,2018-07-16-000103,[ Ensuring ] the orchestrating and timing of M...,Corey Lewandowski,[Q20740735],2018-07-16 14:05:34,2,"[[Corey Lewandowski, 0.7179], [None, 0.2754], ...",[http://www.theweek.co.uk/95082/donald-trump-s...,E
66,2018-05-09-001003,300-plus years of them cold shoulders... Obama...,Charlamagne Tha God,[Q16203002],2018-05-09 11:00:00,1,"[[Charlamagne Tha God, 0.4806], [None, 0.2924]...",[https://www.portlandmercury.com/music/2018/05...,E
212,2018-08-02-002115,A politically connected contractor made a $500...,,[],2018-08-02 06:24:59,1,"[[None, 0.7942], [President Trump, 0.1406], [K...",[https://sunlightfoundation.com/2018/08/02/tod...,E
264,2018-08-19-001084,A Wider Danger: Trump's Troubling Attacks on J...,,[],2018-08-19 02:05:39,3,"[[None, 0.9052], [Queen Elizabeth II, 0.0948]]","[http://www.vnews.com/Forum-Aug-19-19549635, h...",E
333,2018-09-26-004202,"After the commercial, she says, `They told me ...",,[],2018-09-26 00:00:00,2,"[[None, 0.4595], [Tom Arnold, 0.3248], [Julie ...",[http://feeds.foxnews.com/~r/foxnews/entertain...,E
...,...,...,...,...,...,...,...,...,...
5243994,2020-02-05-103219,Trump offends and disrespects the Venezuelan p...,Jorge Arreaza,[Q6623799],2020-02-05 00:00:00,11,"[[Jorge Arreaza, 0.9164], [None, 0.0726], [Pre...",[https://www.rawstory.com/2020/02/imwithfred-t...,E
5243995,2020-02-05-103235,"Trump survived, but he is the most unpopular p...",,[],2020-02-05 23:11:42,3,"[[None, 0.8786], [Donald Trump, 0.1214]]",[https://www.wellsvilledaily.com/zz/news/20200...,E
5243996,2020-03-13-071475,"Trump tried to mitigate the issue, saying it i...",Hassan Nasrallah,[Q181182],2020-03-13 22:15:06,1,"[[Hassan Nasrallah, 0.922], [None, 0.0741], [P...",[http://israelnationalnews.com/News/News.aspx/...,E
5243997,2020-03-15-037086,Trump's do-over approach -- he unlocked $50 bi...,Newt Gingrich,[Q182788],2020-03-15 00:00:00,40,"[[Newt Gingrich, 0.5146], [None, 0.3958], [Don...",[http://uspolitics.einnews.com/article/5120893...,E


In [32]:
def filter_chunk_by_number_of_qids(initial_chunk, qid_number):
    """we select the quotes whose speaker has exactly qid_number qids (called below with 1, i.e. a single qid)"""
    return initial_chunk[initial_chunk["qids"].str.len() == qid_number]

In [33]:
# clinton_cleaned= filter_chunk_by_number_of_qids(clinton_cleaned, 1)
trump_cleaned= filter_chunk_by_number_of_qids(trump_cleaned, 1)

# clinton_cleaned
trump_cleaned

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
12,2018-07-16-000103,[ Ensuring ] the orchestrating and timing of M...,Corey Lewandowski,[Q20740735],2018-07-16 14:05:34,2,"[[Corey Lewandowski, 0.7179], [None, 0.2754], ...",[http://www.theweek.co.uk/95082/donald-trump-s...,E
66,2018-05-09-001003,300-plus years of them cold shoulders... Obama...,Charlamagne Tha God,[Q16203002],2018-05-09 11:00:00,1,"[[Charlamagne Tha God, 0.4806], [None, 0.2924]...",[https://www.portlandmercury.com/music/2018/05...,E
366,2018-01-07-002036,All I can say is it's not a hoax. The Russians...,Lindsey Graham,[Q22212],2018-01-07 15:40:00,1,"[[Lindsey Graham, 0.5251], [None, 0.2936], [Ch...",[http://postandcourier.com/politics/would-lind...,E
1024,2018-01-16-011608,being too nonchalant about Mr. Trump's rants.,Floyd Abrams,[Q3365171],2018-01-16 20:06:13,3,"[[Floyd Abrams, 0.7512], [None, 0.2421], [Hill...",[http://www.washingtontimes.com/news/2018/jan/...,E
1091,2018-09-24-011280,Brett Kavanaugh is poised to join Neil Gorsuch...,Raul Labrador,[Q555393],2018-09-24 23:21:10,1,"[[Raul Labrador, 0.6471], [None, 0.2436], [Pre...",[http://www.spokesman.com/stories/2018/sep/25/...,E
...,...,...,...,...,...,...,...,...,...
5243971,2020-04-05-029136,To say that I'm infuriated with the recent act...,Dwight Ball,[Q5318112],2020-04-05 23:11:52,1,"[[Dwight Ball, 0.6293], [None, 0.3336], [Justi...",[https://www.cbc.ca/news/politics/trudeau-will...,E
5243994,2020-02-05-103219,Trump offends and disrespects the Venezuelan p...,Jorge Arreaza,[Q6623799],2020-02-05 00:00:00,11,"[[Jorge Arreaza, 0.9164], [None, 0.0726], [Pre...",[https://www.rawstory.com/2020/02/imwithfred-t...,E
5243996,2020-03-13-071475,"Trump tried to mitigate the issue, saying it i...",Hassan Nasrallah,[Q181182],2020-03-13 22:15:06,1,"[[Hassan Nasrallah, 0.922], [None, 0.0741], [P...",[http://israelnationalnews.com/News/News.aspx/...,E
5243997,2020-03-15-037086,Trump's do-over approach -- he unlocked $50 bi...,Newt Gingrich,[Q182788],2020-03-15 00:00:00,40,"[[Newt Gingrich, 0.5146], [None, 0.3958], [Don...",[http://uspolitics.einnews.com/article/5120893...,E


In [45]:
def years_to_try(year):
    """ outputs the list of years (as strings) from the year received in the parameters up to 2020 included """
    year_lst = [str(year)]
    for i in range(2020 - year):
        year_lst.append(str(year + i + 1))
    return year_lst

In [74]:
def refactor_row(row):
    """ refactor row to have all the dates found for every quote occurrence, instead of just the earliest date"""
    if row.numOccurrences > 1:
        quoteID = row.quoteID
        earliest_date = row.date
        occurrences_lst = []
        for year in years_to_try(earliest_date.year):
            dataframe = quoteID_cleaned[year]
            # we look for the quoteID of the quote in the quoteID cleaned and we add the dates of the matching articles to the list
            occurrences = dataframe[dataframe["quotations"].transform(lambda x: quoteID in x)].date.to_list()
            occurrences_lst += occurrences
            if len(occurrences_lst) == row.numOccurrences:
                break
        return occurrences_lst
    else:
        return [row.date] #if in the quotes data there is a single occurrence then we already have the date in the date column
        

In [78]:
def update_quotes(quotes, quoteID_cleaned):
    """ return quotes with all the dates found for every occurrence, instead of having just the earliest date"""
    # note: refactor_row looks the dates up in the global quoteID_cleaned dictionary
    new_quotes = quotes.copy()
    new_quotes["date"] = new_quotes.apply(refactor_row, axis=1)
    return new_quotes

Get updated quotes with dates for Clinton and Trump.

In [84]:
# merged_quotes_clinton = update_quotes(clinton_cleaned, quoteID_cleaned)

merged_quotes_trump = update_quotes(trump_cleaned[:100], quoteID_cleaned)

# merged_quotes_clinton
merged_quotes_trump

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
12,2018-07-16-000103,[ Ensuring ] the orchestrating and timing of M...,Corey Lewandowski,[Q20740735],"[2018-07-17 06:00:00, 2018-07-16 14:05:34]",2,"[[Corey Lewandowski, 0.7179], [None, 0.2754], ...",[http://www.theweek.co.uk/95082/donald-trump-s...,E
66,2018-05-09-001003,300-plus years of them cold shoulders... Obama...,Charlamagne Tha God,[Q16203002],[2018-05-09 11:00:00],1,"[[Charlamagne Tha God, 0.4806], [None, 0.2924]...",[https://www.portlandmercury.com/music/2018/05...,E
366,2018-01-07-002036,All I can say is it's not a hoax. The Russians...,Lindsey Graham,[Q22212],[2018-01-07 15:40:00],1,"[[Lindsey Graham, 0.5251], [None, 0.2936], [Ch...",[http://postandcourier.com/politics/would-lind...,E
1024,2018-01-16-011608,being too nonchalant about Mr. Trump's rants.,Floyd Abrams,[Q3365171],"[2018-01-16 20:06:13, 2018-01-17 05:01:43]",3,"[[Floyd Abrams, 0.7512], [None, 0.2421], [Hill...",[http://www.washingtontimes.com/news/2018/jan/...,E
1091,2018-09-24-011280,Brett Kavanaugh is poised to join Neil Gorsuch...,Raul Labrador,[Q555393],[2018-09-24 23:21:10],1,"[[Raul Labrador, 0.6471], [None, 0.2436], [Pre...",[http://www.spokesman.com/stories/2018/sep/25/...,E
...,...,...,...,...,...,...,...,...,...
24124,2018-12-21-099728,"Today, at the direction of President Donald J....",Sonny Perdue,[Q525362],"[2018-12-21 16:06:41, 2018-12-21 17:06:00, 201...",4,"[[Sonny Perdue, 0.9133], [None, 0.0842], [pres...",[http://qz.com/1504507/trumps-snap-food-stamps...,E
24194,2018-11-17-059213,"Trump deserves his choice, just like Obama des...",Megyn Kelly,[Q293260],[2018-11-17 01:12:35],1,"[[Megyn Kelly, 0.7817], [None, 0.1646], [Presi...",[http://dailycaller.com/2018/07/10/megyn-kelly...,E
24198,2018-10-11-130960,"Trump is exploiting Kanye West, who has admitt...",Clay Cane,[Q27063054],[2018-10-11 20:03:38],1,"[[Clay Cane, 0.6559], [None, 0.2107], [Kanye W...",[http://www.theweek.co.uk/97071/kanye-west-at-...,E
24202,2018-12-20-103345,Trump wanted $5 billion for a pointless border...,Seth Meyers,[Q14536],"[2018-12-20 16:30:03, 2018-12-20 15:02:47]",2,"[[Seth Meyers, 0.7962], [None, 0.187], [Sarah ...",[http://www.breitbart.com/politics/2018/12/20/...,E


Unfortunately we cannot display the full result of the cell above, since we split the work: one of us ran the cell for Trump and the other for Clinton, because the cell took a long time to execute. The output shown above covers only the first 100 rows of the Trump quotes.
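Since the row-by-row lookup is slow, one possible alternative (not the method used in this notebook) is to build the quoteID-to-dates table once with `explode` and `groupby`, and then look every quote up with `map`. The sketch below uses made-up toy data. To apply it to the real data, the yearly mappings in `quoteID_cleaned` would presumably have to be concatenated first, which is an assumption we have not tested.

```python
import pandas as pd

# Toy stand-in for the quoteID-to-date mapping (made-up values).
mapping = pd.DataFrame({
    "quotations": [["2018-07-16-000103", "2018-07-16-000999"], ["2018-07-16-000103"]],
    "date": pd.to_datetime(["2018-07-16 14:05:34", "2018-07-17 06:00:00"]),
})

# Toy stand-in for the quote-centric dataframe.
quotes = pd.DataFrame({"quoteID": ["2018-07-16-000103", "2018-05-09-001003"]})

# One row per (quoteID, article date) pair, then one list of dates per quoteID.
dates_per_quote = (
    mapping.explode("quotations")
    .groupby("quotations")["date"]
    .agg(list)
)

# Look every quote up once; quotes absent from the mapping get an empty list.
quotes["date"] = quotes["quoteID"].map(dates_per_quote).apply(
    lambda dates: dates if isinstance(dates, list) else []
)
```

Unlike `refactor_row`, this sketch collects every date found for a quoteID and does not stop once `numOccurrences` dates have been gathered, nor does it reuse the original date for quotes with a single occurrence.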

In [12]:
merged_quotes_clinton.to_pickle('../data/Clinton_with_dates.pkl')
merged_quotes_trump.to_pickle('../data/Trump_with_dates.pkl')

### Dealing with missing values

After cleaning and merging all the files, we notice that some lists of dates are empty. This happens because certain quotes are missing from the article-centric files, as explained in the notebook problems to report. To deal with these missing values, we fill them with a single date: the one contained in the quoteID. We use a single date even when the number of occurrences is greater than one, because a quote usually appears in several articles on the same day (or at most a few days later in most cases). This is of course an approximation, but we preferred filling the missing values to dropping the corresponding rows.
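As a small illustration of this fallback, the sketch below fills the empty lists of a toy dataframe with the date encoded in the first 10 characters of the quoteID. The data is made up and the list comprehension is only one possible way to write the fill. Since the quoteID carries no time of day, the filled dates fall at midnight.

```python
import pandas as pd

# Toy dataframe: the first quote has no date, the second one already has one.
df = pd.DataFrame({
    "quoteID": ["2015-11-12-104266", "2015-10-25-000242"],
    "date": [[], [pd.Timestamp("2015-10-25 14:12:35")]],
})

# Keep non-empty lists as they are; otherwise use the YYYY-MM-DD prefix of the quoteID.
df["date"] = [
    dates if len(dates) > 0 else [pd.to_datetime(quote_id[:10])]
    for quote_id, dates in zip(df["quoteID"], df["date"])
]
```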

In [60]:
def grab_date_from_quoteID(author_name, df):
    """this function is used to insert the date (taken from the first 10 characters of the quoteID) when the list of dates created above is empty"""
    empty = df["date"].str.len() == 0  # rows whose list of dates is empty
    df.loc[empty, "date"] = pd.to_datetime(df.loc[empty, "quoteID"].str.slice(stop=10)).apply(lambda x: [x])
    return df

In [61]:
Clinton_with_dates = grab_date_from_quoteID("Clinton", Clinton_merged)
Trump_with_dates = grab_date_from_quoteID("Trump", Trump_merged)

In [62]:
Clinton_with_dates

Unnamed: 0,quoteID,quotation,speaker,qids,numOccurrences,probas,urls,phase,date
26,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,[Q359442],1,"[[Bernie Sanders, 0.5395], [None, 0.3128], [Hi...",[http://examiner.com/article/bernie-sanders-sl...,E,[2015-10-25 14:12:35]
888,2015-09-16-006359,And I'm just pointing out the absurd on both s...,Kathleen Madigan,[Q6376814],1,"[[Kathleen Madigan, 0.8025], [None, 0.1975]]",[http://northjersey.com/arts-and-entertainment...,E,[2015-09-16 05:44:37]
6930,2015-10-22-051493,If [ Democratic presidential candidate former ...,Marco Rubio,[Q324546],1,"[[Marco Rubio, 0.93], [None, 0.0685], [Hillary...",[http://breitbart.com/video/2015/10/22/rubio-i...,E,[2015-10-22 20:04:16]
7374,2015-12-31-032035,I'm electable. I was elected in a purple state...,Jeb Bush,[Q221997],7,"[[Jeb Bush, 0.8392], [None, 0.0925], [Hillary ...",[http://www.postandcourier.com/article/2015123...,E,[2015-12-31 03:29:00]
9855,2015-11-12-104266,The incentive to invent episodes of discrimina...,Glenn Reynolds,[Q4392664],2,"[[Glenn Reynolds, 0.3454], [Ed Driscoll, 0.322...","[http://pjmedia.com/instapundit/218734/, http:...",E,[2015-11-12 00:00:00]
...,...,...,...,...,...,...,...,...,...
5222109,2020-03-06-025712,"I think that would have been a mistake, becaus...",Jennifer Palmieri,[Q18209402],1,"[[Jennifer Palmieri, 0.9117], [None, 0.056], [...",[https://www.rollingstone.com/politics/politic...,E,[2020-03-06 14:38:07]
5231803,2020-01-18-006266,"Chief Justice Rehnquist, when he presided over...",Dick Durbin,[Q434804],1,"[[Dick Durbin, 0.898], [None, 0.079], [Charlie...",[https://www.washingtonexaminer.com/news/impea...,E,[2020-01-18 05:01:08]
5235860,2020-01-06-061256,The main difference between Lindsey and his De...,David Woodard,[Q1177254],6,"[[David Woodard, 0.7544], [None, 0.1797], [Lin...",[http://chron.com/entertainment/article/How-Li...,E,[2020-01-06 00:00:00]
5235869,2020-04-09-052373,The model of Obama asking Bush and Clinton to ...,Bill Haslam,[Q862186],1,"[[Bill Haslam, 0.905], [None, 0.0837], [Barack...",[http://www.nytimes.com/2020/04/09/us/politics...,E,[2020-04-09 23:04:21]


Save dataframes of Clinton and Trump

In [64]:
Clinton_with_dates.to_pickle('../data/Clinton_with_dates.pkl')
Trump_with_dates.to_pickle('../data/Trump_with_dates.pkl')