# **LOAD DATA - Do people with different ideologies speak differently?**
*ADA Project Milestone P2*

# Mounting the Google Drive

It is possible to mount your Google Drive to Colab if you need additional storage or need to use files stored on it. To do so, run the following code cell (click the play button or use the keyboard shortcut 'Command/Ctrl+Enter'):

In [None]:
from google.colab import drive
drive.mount('/content/drive')

 

1.   After running the cell, a URL will appear.

2.   Following this URL redirects you to a page where you choose the Google Drive account to mount.

3.   You will then be asked to grant Google Drive Stream permission to access the chosen Google account.

4.   After granting access, you will be given an authorization code.

5.   Copy the authorization code into the dedicated textbox in Colab, under the '*Enter your authorization code:*' prompt.

After entering the authorization code, you should get the message '*Mounted at /content/drive*'.

The path to the files on the mounted Drive will then be '/content/drive/MyDrive/'. Opening the Files tab (left sidebar, folder icon) should also show the accessible files.
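
As a quick sanity check, the mounted path can also be inspected from code. The snippet below is a minimal sketch: `list_drive_files` is a hypothetical helper (not part of Colab), and it simply returns an empty list when the given folder does not exist, for example when the Drive has not been mounted yet.

```python
import os

# Hypothetical helper: list the entries under the mounted Drive folder
def list_drive_files(root="/content/drive/MyDrive"):
  # An unmounted Drive (or a wrong path) yields an empty list instead of an error
  if not os.path.isdir(root):
    return []
  return sorted(os.listdir(root))

print(list_drive_files())
```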

# Required libraries

In [None]:
!pip install pandas==1.0.5

In [None]:
# Imports
import bz2
import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

# Load/filter/merge quotebank and wikidata samples
In this section, 5 main operations are performed:
- **load and filter Wikidata** from a parquet file into a data frame. The filtering follows our initial goal, so it is based on political parties;
- **load the Quotebank dataset** from a json bz2 file chunk by chunk, a technique chosen to cope with the size of the input file;
- **filter Quotebank** entries;
- **merge Quotebank and Wikidata** on the QID;
- **store the merged df for each year** in an additional parquet file (initially a json bz2 file format was chosen, but handling those files required more memory than Colab makes available).

In [None]:
# Load and filter wikidata for our purpose
def load_filter_wikidata_df(path_to_file):
  # load from file
  columns = ["id", "gender", "occupation", "party"]
  df_wikidata = pd.read_parquet(path_to_file, columns=columns)

  # Filter
  # Remove rows without a party
  df_wikidata_parties = df_wikidata.dropna(subset=['party'])

  # Select only rows with either republican or democrats party
  QID_republicans = "Q29468"
  QID_democrats = "Q29552"
  df_wikidata_filtered = df_wikidata_parties[df_wikidata_parties.apply(lambda x: (QID_republicans in x['party']) or (QID_democrats in x['party']) , axis=1)]

  return df_wikidata_filtered
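
To make the party filter concrete, here is a small sketch on a toy table. The speaker ids and the extra party `Q999` are placeholders invented for illustration; only the two party QIDs come from the cell above. The toy `party` column holds Python lists, which is an assumption about how the real column behaves.

```python
import pandas as pd

QID_REPUBLICANS = "Q29468"
QID_DEMOCRATS = "Q29552"

# Toy speaker table (placeholder ids, not real Wikidata rows)
toy_wikidata = pd.DataFrame({
  "id": ["Q1", "Q2", "Q3", "Q4"],
  "party": [["Q29468"], ["Q29552", "Q999"], None, ["Q999"]],
})

# Drop speakers without a party, then keep those listing either party QID
with_party = toy_wikidata.dropna(subset=["party"])
is_rep_or_dem = with_party["party"].apply(
  lambda parties: QID_REPUBLICANS in parties or QID_DEMOCRATS in parties
)
toy_filtered = with_party[is_rep_or_dem]
print(toy_filtered["id"].tolist())  # ['Q1', 'Q2']
```

A speaker with several parties is kept as soon as one of them matches, which mirrors the `or` condition used in `load_filter_wikidata_df`.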

In [None]:
# Perform the operations needed on quotebank dataset (load/filter/merge with wikidata)
def handle_quotebank_df(input_file, chunk_size, df_wikidata, n_chunks=0):
  # n_chunks limits how many chunks are processed (0 = process the whole file)
  curr_chunk = 0
  chunks = []

  # read input file by chunks (as the whole file can't fit into memory)
  reader = pd.read_json(input_file, lines=True, compression='bz2', chunksize=chunk_size)
  for chunk in reader:
    if n_chunks and curr_chunk == n_chunks:
      break
    curr_chunk += 1
    # keep only quotes whose speaker is known (the most probable speaker is not "None")
    chunk = chunk[chunk['speaker'] != 'None'][['quoteID', 'quotation', 'speaker', 'qids', 'probas']]

    # apply filter to single chunk
    chunk = filter_quotebank_df(chunk)

    # merge single chunk with wikidata to reduce the dataset even further so that it fits in RAM
    chunk = merge_quotebank_wikidata_df(chunk, df_wikidata)

    chunks.append(chunk)

  # concatenate once at the end instead of growing the DataFrame inside the loop
  if not chunks:
    return pd.DataFrame()
  return pd.concat(chunks, ignore_index=True)
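
The chunked reading pattern can be tried on a tiny file before launching the full run. The sketch below writes a few made-up rows to a temporary bz2-compressed JSON-lines file (the file name and rows are placeholders) and reads them back two rows at a time with `pd.read_json`, dropping the rows whose speaker is `'None'`.

```python
import bz2
import json
import os
import tempfile

import pandas as pd

# Write a tiny JSON-lines file compressed with bz2 (made-up rows)
rows = [{"quoteID": f"q{i}", "speaker": "None" if i % 2 else "A"} for i in range(5)]
toy_path = os.path.join(tempfile.mkdtemp(), "toy-quotes.json.bz2")
with bz2.open(toy_path, "wt") as f:
  for row in rows:
    f.write(json.dumps(row) + "\n")

# With chunksize set, read_json yields one DataFrame per chunk
chunk_lengths = []
kept = []
reader = pd.read_json(toy_path, lines=True, compression="bz2", chunksize=2)
for chunk in reader:
  chunk_lengths.append(len(chunk))
  kept.append(chunk[chunk["speaker"] != "None"])

toy_df = pd.concat(kept, ignore_index=True)
print(chunk_lengths)                # [2, 2, 1]
print(toy_df["quoteID"].tolist())   # ['q0', 'q2', 'q4']
```

Only one chunk is held in memory at a time, so the peak memory depends on `chunksize` and on how much of each chunk survives the filtering, not on the size of the input file.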
        

In [None]:
# Filter out ambiguous entries in quotebank
def filter_quotebank_df(df):
  # remove entries whose speaker maps to more than one QID, because we cannot be sure who the speaker is (same name, different QIDs)
  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
  df_filtered = df[df['qids'].apply(len) == 1].copy()

  # each entry now has exactly one QID, so replace the one-element list with its single value
  df_filtered['qids'] = df_filtered['qids'].apply(lambda x: x[0])

  return df_filtered
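
A toy illustration of this disambiguation step, using invented quote and speaker ids: the quote attributed to a name shared by two QIDs is dropped, and the remaining one-element lists are unwrapped into plain strings.

```python
import pandas as pd

# Toy quotes (placeholder ids); 'q1' is attributed to a name shared by two QIDs
toy_quotes = pd.DataFrame({
  "quoteID": ["q0", "q1", "q2"],
  "qids": [["Q10"], ["Q20", "Q21"], ["Q30"]],
})

# Keep rows with exactly one candidate QID, then unwrap the list
unambiguous = toy_quotes[toy_quotes["qids"].apply(len) == 1].copy()
unambiguous["qids"] = unambiguous["qids"].apply(lambda qids: qids[0])
print(unambiguous.to_dict("records"))
# [{'quoteID': 'q0', 'qids': 'Q10'}, {'quoteID': 'q2', 'qids': 'Q30'}]
```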

In [None]:
# Merge quotebank and wikidata entries on QID
def merge_quotebank_wikidata_df(df_quotebank, df_wikidata):
  #merge quotebank data with wikidata 
  df_merged = df_quotebank.merge(right=df_wikidata, how='inner', left_on='qids', right_on='id')

  #drop the id column because we already have the qid
  df_merged = df_merged.drop(labels='id', axis=1)
  
  return df_merged
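
The effect of the inner merge can be seen on two toy tables (all ids below are placeholders apart from the two party QIDs): only quotes whose QID appears in the speaker table survive, and speakers without quotes are dropped as well.

```python
import pandas as pd

# Toy inputs: one QID string per quote, and a small speaker table
toy_quotes = pd.DataFrame({
  "quoteID": ["q0", "q1", "q2"],
  "qids": ["Q10", "Q20", "Q30"],
})
toy_speakers = pd.DataFrame({
  "id": ["Q10", "Q30", "Q40"],
  "party": [["Q29468"], ["Q29552"], ["Q29552"]],
})

# Inner join on the QID, then drop the duplicated key column
toy_merged = toy_quotes.merge(
  toy_speakers, how="inner", left_on="qids", right_on="id"
).drop(columns="id")
print(toy_merged.columns.tolist())      # ['quoteID', 'qids', 'party']
print(toy_merged["quoteID"].tolist())   # ['q0', 'q2']
```

Because the speaker table has already been restricted to the two parties, this join also acts as the party filter for the quotes.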

In [None]:
# Store the dataframe in a parquet file
def store_df(path_to_file, df):
  # Write the whole dataframe to a parquet file at the given path
  df.to_parquet(path_to_file)

Actual data cleaning and preprocessing is done here. The final dataframe for each year is saved in an additional .parquet file.<br>
*(Note that it takes around 3 hours to run the following cell)*

In [None]:
chunk_size = 100000
# n_chunks = 10

path_to_parquet = '/content/drive/MyDrive/Project datasets/speaker_attributes.parquet'
df_wikidata = load_filter_wikidata_df(path_to_parquet)

# process each year separately, from 2020 down to 2015
for year in range(2020, 2014, -1):
  input_file = f'/content/drive/MyDrive/Quotebank/quotes-{year}.json.bz2'
  output_file = f'/content/drive/MyDrive/Quotebank_Repub_Dem/new-quotes-{year}-repub-dem.parquet'
  df = handle_quotebank_df(input_file, chunk_size, df_wikidata)
  store_df(output_file, df)
  # release the memory held by this year's dataframe before loading the next one
  df = []
