# Map dates

### Motivation

The idea explained in the README requires, for each quote, the dates of all its occurrences. The standard (quote-centric) Quotes dataset only provides the date of the first occurrence of each quote, so we use the Quotebank article-centric dataset to recover all the dates corresponding to the filtered quotes. This is the aim of this notebook.

The procedure consists of the following steps:
- We filter the Quotebank (article-centric) dataset chunk by chunk, keeping the rows for the articles that contain at least one quote with the words Trump or Clinton.
- We then read the file created in the notebook filter_quotes containing all the quotes (from the quote-centric data) containing the word Trump. For simplicity we call this dataframe df_Trump (it was already built in the Filter quotes notebook). The same is done for Clinton.
- For each of these quotes, we look up its quoteID in the filtered Quotebank (article-centric) dataset and extract the dates of the articles in which the quoteID appears.
- We create a new column in the dataframe df_Trump containing the list of dates on which the quotation appears.
- We handle the quotes whose list of dates is empty.

In [1]:
import pandas as pd

### Initial Mapping

This mapping stores the article date information very compactly (for each article, its date and the quoteIDs of the matching quotes it contains), which in turn makes it possible to merge the quotes with their respective dates.

In [2]:
import concurrent.futures

def filter_dataframe(year):
    """Filter the article-centric Quotebank file of a given year, keeping the articles
    that contain a quote mentioning Trump or Clinton (free-text search)."""
    list_df = []
    # we read the Quotebank article-centric dataset chunk by chunk in order not to store everything in memory
    with pd.read_json("../data/quotebank-" + year + ".json.bz2", lines=True, chunksize=10000, compression="bz2") as df_reader:
        executor = concurrent.futures.ThreadPoolExecutor(30)
        for chunk in df_reader:
            # filter_chunk (defined in the next cell) appends the filtered chunk to list_df
            executor.submit(filter_chunk, list_df, chunk)
        executor.shutdown()  # wait until all the submitted chunks have been processed

    df = pd.concat(list_df)  # we concatenate the dataframes together to obtain a unique one
    return df
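
A note on the thread pool: `executor.submit` returns a future, and an exception raised inside the submitted function stays hidden unless the result of that future is requested. The sketch below runs on small in-memory chunks rather than on the Quotebank file, and shows one possible way to collect the results through the futures so that errors surface and the chunk order is preserved. `keep_matching_rows` and the toy data are hypothetical stand-ins, not part of the notebook.

```python
import concurrent.futures

import pandas as pd


def keep_matching_rows(chunk):
    """Hypothetical stand-in for filter_chunk: keep rows with a non-empty list."""
    return chunk[chunk["quotations"].str.len() > 0]


# Toy chunks standing in for the chunks yielded by pd.read_json(..., chunksize=...)
chunks = [
    pd.DataFrame({"date": ["2015-01-01", "2015-01-02"], "quotations": [["q1"], []]}),
    pd.DataFrame({"date": ["2015-01-03"], "quotations": [["q2", "q3"]]}),
]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(keep_matching_rows, chunk) for chunk in chunks]
    # .result() re-raises any exception from the worker; iterating over the
    # futures in submission order keeps the chunks in their original order
    filtered = [future.result() for future in futures]

df = pd.concat(filtered, ignore_index=True)
print(df)
```

Returning the filtered chunk instead of appending to a shared list is a design choice of this sketch; the notebook's version appends inside the worker.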

In [3]:
def filtering_func(lst):
    """Return the quoteIDs of the quotations containing the word Trump or Clinton."""
    final_lst = []
    for el in lst:
        if ("Trump" in el["quotation"]) or ("Clinton" in el["quotation"]):
            final_lst.append(el["quoteID"])
    return final_lst

def filter_chunk(list_df, chunk):
    """Reduce a chunk to the articles with matching quotes and append it to list_df."""
    chunk.drop(columns=["articleID", "phase", "title", "url", "articleLength", "names"], inplace=True)  # we drop unused columns to save memory
    chunk.quotations = chunk.quotations.transform(filtering_func)  # keep only the quoteIDs of the matching quotes
    chunk = chunk[chunk["quotations"].str.len() > 0]  # discard the articles without any matching quote
    list_df.append(chunk)  # we append it to the list of dataframes
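
To illustrate what the filter produces, here is a minimal sketch on a toy chunk. Only the fields used by the code above (`date`, and `quotations` with `quoteID` and `quotation`) are included, and all values are invented. `keep_trump_clinton_ids` is a hypothetical name that re-implements `filtering_func` so that the example is self-contained.

```python
import pandas as pd


def keep_trump_clinton_ids(quotations):
    """Same logic as filtering_func: quoteIDs of quotations mentioning Trump or Clinton."""
    return [
        q["quoteID"]
        for q in quotations
        if "Trump" in q["quotation"] or "Clinton" in q["quotation"]
    ]


# Invented article-centric rows (one row per article)
toy_chunk = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-03-01", "2016-03-02"]),
        "quotations": [
            [
                {"quoteID": "2016-03-01-000001", "quotation": "Trump said so"},
                {"quoteID": "2016-03-01-000002", "quotation": "Nice weather"},
            ],
            [{"quoteID": "2016-03-02-000003", "quotation": "Nothing relevant"}],
        ],
    }
)

# Replace each list of quotation records with the list of matching quoteIDs
toy_chunk["quotations"] = toy_chunk["quotations"].apply(keep_trump_clinton_ids)
# Keep only the articles with at least one matching quote
mapping = toy_chunk[toy_chunk["quotations"].str.len() > 0]
print(mapping)
```

Only the first article survives, and its `quotations` cell is reduced to the single matching quoteID.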

Get the quoteID_date mapping for each year, and save the result

In [None]:
for i in range(2015, 2021):
    quoteID_date = filter_dataframe(str(i))
    quoteID_date.to_pickle('../data/{}_quoteID_date.pkl'.format(i))  # same folder the next cell reads from

In [None]:
quoteID_cleaned = {str(year): pd.read_pickle('../data/{}_quoteID_date.pkl'.format(year)) for year in range(2015, 2021)}
quoteID_cleaned

### Add dates to Trump and Clinton datasets

Update the Trump and Clinton quotes so that each one carries the dates of all its occurrences, and not just the first date.

In [4]:
clinton_cleaned = pd.read_pickle('../data/df_Clinton_cleaned.pkl')
trump_cleaned = pd.read_pickle('../data/df_Trump_cleaned.pkl')

clinton_cleaned
trump_cleaned

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
26,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,[Q359442],2015-10-25 14:12:35,1,"[[Bernie Sanders, 0.5395], [None, 0.3128], [Hi...",[http://examiner.com/article/bernie-sanders-sl...,E
215,2015-07-30-032957,"I am supporting Hillary Clinton for president,",,[],2015-07-30 04:08:03,1,"[[None, 0.5004], [Miro Weinberger, 0.4696], [J...",[http://www.sevendaysvt.com/OffMessage/archive...,E
888,2015-09-16-006359,And I'm just pointing out the absurd on both s...,Kathleen Madigan,[Q6376814],2015-09-16 05:44:37,1,"[[Kathleen Madigan, 0.8025], [None, 0.1975]]",[http://northjersey.com/arts-and-entertainment...,E
2324,2015-08-15-008350,Clinton's College Hypocrisy Tour Rolls On,,[],2015-08-15 23:24:29,1,"[[None, 0.8041], [Marco Rubio, 0.1016], [Hilla...",[http://miamiherald.typepad.com/nakedpolitics/...,E
2809,2015-08-25-044501,If Vice President Joe Biden decides to jump in...,,[],2015-08-25 12:05:31,2,"[[None, 0.6475], [Joe Biden, 0.29], [Ryan Grim...",[http://feeds.huffingtonpost.com/c/35496/f/677...,E
...,...,...,...,...,...,...,...,...,...
5232184,2020-02-08-016054,"I had done a lot of presidential debates, but ...",Chris Wallace,"[Q21256789, Q2964884, Q2964885, Q37624906, Q51...",2020-02-08 04:40:01,1,"[[Chris Wallace, 0.9267], [None, 0.0702], [Hil...",[https://www.thewrap.com/fox-news-chris-wallac...,E
5235860,2020-01-06-061256,The main difference between Lindsey and his De...,David Woodard,[Q1177254],2020-01-06 12:00:30,6,"[[David Woodard, 0.7544], [None, 0.1797], [Lin...",[http://chron.com/entertainment/article/How-Li...,E
5235869,2020-04-09-052373,The model of Obama asking Bush and Clinton to ...,Bill Haslam,[Q862186],2020-04-09 23:04:21,1,"[[Bill Haslam, 0.905], [None, 0.0837], [Barack...",[http://www.nytimes.com/2020/04/09/us/politics...,E
5237423,2020-01-22-037250,I think the fact is that Mitch keeps telling y...,Patrick Leahy,[Q59315],2020-01-22 02:53:15,10,"[[Patrick Leahy, 0.8213], [None, 0.148], [Roge...",[http://wicz.com/story/41593610/inside-the-rep...,E


In [5]:
def filter_chunk_by_number_of_qids(initial_chunk, qid_number):
    """we select the quotes whose speaker has exactly qid_number qids (a single qid when qid_number is 1)"""
    return initial_chunk[initial_chunk["qids"].str.len() == qid_number]

In [7]:
clinton_cleaned = filter_chunk_by_number_of_qids(clinton_cleaned, 1)
trump_cleaned = filter_chunk_by_number_of_qids(trump_cleaned, 1)

clinton_cleaned
trump_cleaned

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
26,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,[Q359442],2015-10-25 14:12:35,1,"[[Bernie Sanders, 0.5395], [None, 0.3128], [Hi...",[http://examiner.com/article/bernie-sanders-sl...,E
888,2015-09-16-006359,And I'm just pointing out the absurd on both s...,Kathleen Madigan,[Q6376814],2015-09-16 05:44:37,1,"[[Kathleen Madigan, 0.8025], [None, 0.1975]]",[http://northjersey.com/arts-and-entertainment...,E
6930,2015-10-22-051493,If [ Democratic presidential candidate former ...,Marco Rubio,[Q324546],2015-10-22 20:04:16,1,"[[Marco Rubio, 0.93], [None, 0.0685], [Hillary...",[http://breitbart.com/video/2015/10/22/rubio-i...,E
7374,2015-12-31-032035,I'm electable. I was elected in a purple state...,Jeb Bush,[Q221997],2015-12-31 03:29:00,7,"[[Jeb Bush, 0.8392], [None, 0.0925], [Hillary ...",[http://www.postandcourier.com/article/2015123...,E
9855,2015-11-12-104266,The incentive to invent episodes of discrimina...,Glenn Reynolds,[Q4392664],2015-11-12 21:40:00,2,"[[Glenn Reynolds, 0.3454], [Ed Driscoll, 0.322...","[http://pjmedia.com/instapundit/218734/, http:...",E
...,...,...,...,...,...,...,...,...,...
5222109,2020-03-06-025712,"I think that would have been a mistake, becaus...",Jennifer Palmieri,[Q18209402],2020-03-06 14:38:07,1,"[[Jennifer Palmieri, 0.9117], [None, 0.056], [...",[https://www.rollingstone.com/politics/politic...,E
5231803,2020-01-18-006266,"Chief Justice Rehnquist, when he presided over...",Dick Durbin,[Q434804],2020-01-18 05:01:08,1,"[[Dick Durbin, 0.898], [None, 0.079], [Charlie...",[https://www.washingtonexaminer.com/news/impea...,E
5235860,2020-01-06-061256,The main difference between Lindsey and his De...,David Woodard,[Q1177254],2020-01-06 12:00:30,6,"[[David Woodard, 0.7544], [None, 0.1797], [Lin...",[http://chron.com/entertainment/article/How-Li...,E
5235869,2020-04-09-052373,The model of Obama asking Bush and Clinton to ...,Bill Haslam,[Q862186],2020-04-09 23:04:21,1,"[[Bill Haslam, 0.905], [None, 0.0837], [Barack...",[http://www.nytimes.com/2020/04/09/us/politics...,E


In [8]:
def years_to_try(year):
    """Return the years from `year` up to 2020 (inclusive) as strings."""
    return [str(y) for y in range(year, 2021)]

In [5]:
def update_quotes(quotes, quoteID_cleaned):
    """Replace the single date of each quote with the list of dates of all its occurrences."""
    new_quotes = quotes.copy()
    new_quotes = new_quotes.drop(columns=["date"])
    new_quotes["date"] = "no_date"
    count = 0
    for index, row in quotes.iterrows():
        if row.numOccurrences > 1:
            quoteID = row.quoteID
            earliest_date = row.date
            occurrences_lst = []
            # occurrences can only appear from the year of the first occurrence onwards
            for year in years_to_try(earliest_date.year):
                dataframe = quoteID_cleaned[year]
                # we look for the quoteID of the quote in the quoteID cleaned and we add the dates to the list
                occurrences = dataframe[dataframe["quotations"].transform(lambda x: quoteID in x)].date.to_list()
                occurrences_lst += occurrences
                # we stop as soon as all the occurrences have been found
                if len(occurrences_lst) == row.numOccurrences:
                    break
            new_quotes.at[index, "date"] = occurrences_lst
        else:
            # if in the quotes data there is a single occurrence then we already have the date in the date column
            new_quotes.at[index, "date"] = [row.date]
        count += 1
        if count % 100 == 0:
            print(count)  # progress indicator
    return new_quotes
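
The lookup above scans every article row once per quote, which may explain the long running time. A possible alternative, sketched here on toy data and not what the notebook actually ran, is to explode the lists of quoteIDs once and group the article dates by quoteID. All names and values in the sketch are invented for illustration.

```python
import pandas as pd

# Toy stand-in for one year's quoteID_date mapping (one row per article)
articles = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-03-01", "2016-03-02", "2016-03-05"]),
        "quotations": [["a", "b"], ["a"], ["b", "c"]],
    }
)

# One row per (article, quoteID) pair, then the list of article dates per quoteID
dates_by_quote = (
    articles.explode("quotations")
    .groupby("quotations")["date"]
    .apply(list)
)

# Toy stand-in for the quote-centric dataframe
quotes = pd.DataFrame({"quoteID": ["a", "c", "z"]})

# A quoteID that is absent from the article data gets an empty list
quotes["date"] = [dates_by_quote.get(qid, []) for qid in quotes["quoteID"]]
print(quotes)
```

With the real data, the yearly mappings would first need to be concatenated, and the single-occurrence shortcut of `update_quotes` would still apply.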

Get updated quotes with dates for Clinton and Trump.

In [None]:
merged_quotes_clinton = update_quotes(clinton_cleaned, quoteID_cleaned)
merged_quotes_trump = update_quotes(trump_cleaned, quoteID_cleaned)


merged_quotes_clinton
merged_quotes_trump

Unfortunately, we cannot display the result of the cell above because we split the work: one of us ran the cell for Trump and the other for Clinton, since the cell took a long time to execute.

In [12]:
merged_quotes_clinton.to_pickle('../data/Clinton_with_dates.pkl')
merged_quotes_trump.to_pickle('../data/Trump_with_dates.pkl')

### Dealing with missing values

We notice some missing dates after cleaning and merging all the files. This happens because certain quotes are missing from the article files, as explained in the notebook problems to report. To deal with these missing values, we fill each of them with a single date (the date contained in the quoteID). We fill with a single date even when the number of occurrences is greater than one, because a quote usually appears several times in different articles on the same day (or at most a few days later in most cases). This is of course an approximation, but we considered it better to fill the missing values than to drop the corresponding rows.

In [60]:
def grab_date_from_quoteID(author_name, df):
    """Insert the date taken from the quoteID when the list of dates created above is empty.
    The author_name argument is not used."""
    empty = df["date"].str.len() == 0  # rows whose list of dates is empty
    # the first 10 characters of the quoteID are its date (YYYY-MM-DD)
    df.loc[empty, "date"] = pd.to_datetime(df.loc[empty, "quoteID"].str.slice(stop=10)).apply(lambda x: [x])
    return df
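
As a small illustration of this fallback, the sketch below fills an empty list of dates with the date encoded in the first 10 characters of the quoteID. It uses a plain list comprehension rather than the `.loc` assignment above, and the toy rows are invented (the quoteIDs follow the format of the sample rows shown earlier).

```python
import pandas as pd

# Toy rows: the second quote has an empty list of dates
toy = pd.DataFrame(
    {
        "quoteID": ["2015-10-25-000242", "2015-11-12-104266"],
        "date": [[pd.Timestamp("2015-10-25 14:12:35")], []],
    }
)

# Keep a non-empty list as it is; otherwise fall back to the quoteID date
toy["date"] = [
    dates if len(dates) > 0 else [pd.Timestamp(quote_id[:10])]
    for quote_id, dates in zip(toy["quoteID"], toy["date"])
]
print(toy)
```

The filled date has no time of day (midnight), which is how filled rows can be told apart from rows with a recovered article date.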

In [61]:
Clinton_with_dates = grab_date_from_quoteID("Clinton", merged_quotes_clinton)
Trump_with_dates = grab_date_from_quoteID("Trump", merged_quotes_trump)

In [62]:
Clinton_with_dates

Unnamed: 0,quoteID,quotation,speaker,qids,numOccurrences,probas,urls,phase,date
26,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,[Q359442],1,"[[Bernie Sanders, 0.5395], [None, 0.3128], [Hi...",[http://examiner.com/article/bernie-sanders-sl...,E,[2015-10-25 14:12:35]
888,2015-09-16-006359,And I'm just pointing out the absurd on both s...,Kathleen Madigan,[Q6376814],1,"[[Kathleen Madigan, 0.8025], [None, 0.1975]]",[http://northjersey.com/arts-and-entertainment...,E,[2015-09-16 05:44:37]
6930,2015-10-22-051493,If [ Democratic presidential candidate former ...,Marco Rubio,[Q324546],1,"[[Marco Rubio, 0.93], [None, 0.0685], [Hillary...",[http://breitbart.com/video/2015/10/22/rubio-i...,E,[2015-10-22 20:04:16]
7374,2015-12-31-032035,I'm electable. I was elected in a purple state...,Jeb Bush,[Q221997],7,"[[Jeb Bush, 0.8392], [None, 0.0925], [Hillary ...",[http://www.postandcourier.com/article/2015123...,E,[2015-12-31 03:29:00]
9855,2015-11-12-104266,The incentive to invent episodes of discrimina...,Glenn Reynolds,[Q4392664],2,"[[Glenn Reynolds, 0.3454], [Ed Driscoll, 0.322...","[http://pjmedia.com/instapundit/218734/, http:...",E,[2015-11-12 00:00:00]
...,...,...,...,...,...,...,...,...,...
5222109,2020-03-06-025712,"I think that would have been a mistake, becaus...",Jennifer Palmieri,[Q18209402],1,"[[Jennifer Palmieri, 0.9117], [None, 0.056], [...",[https://www.rollingstone.com/politics/politic...,E,[2020-03-06 14:38:07]
5231803,2020-01-18-006266,"Chief Justice Rehnquist, when he presided over...",Dick Durbin,[Q434804],1,"[[Dick Durbin, 0.898], [None, 0.079], [Charlie...",[https://www.washingtonexaminer.com/news/impea...,E,[2020-01-18 05:01:08]
5235860,2020-01-06-061256,The main difference between Lindsey and his De...,David Woodard,[Q1177254],6,"[[David Woodard, 0.7544], [None, 0.1797], [Lin...",[http://chron.com/entertainment/article/How-Li...,E,[2020-01-06 00:00:00]
5235869,2020-04-09-052373,The model of Obama asking Bush and Clinton to ...,Bill Haslam,[Q862186],1,"[[Bill Haslam, 0.905], [None, 0.0837], [Barack...",[http://www.nytimes.com/2020/04/09/us/politics...,E,[2020-04-09 23:04:21]


Save dataframes of Clinton and Trump

In [64]:
Clinton_with_dates.to_pickle('../data/Clinton_with_dates.pkl')
Trump_with_dates.to_pickle('../data/Trump_with_dates.pkl')