# Calculating prescient ideas using BERT: A Tutorial

Author: Paul Vicinanza

E-mail: pvicinanza@gmail.com

The tutorial provides introductory code to:

1. Finetune custom BERT models over a temporally split corpus
2. Compute prescience using the finetuned BERT models
3. Analyze prescience to understand what the model deems prescient

#### A note

This tutorial is designed for social scientists with little computational background. As such, all of the code runs in a Jupyter notebook and is heavily commented for ease of implementation.

#### Citation
Paul Vicinanza, Amir Goldberg, Sameer B Srivastava, A deep-learning model of prescient ideas demonstrates that they emerge from the periphery, *PNAS Nexus*, Volume 2, Issue 1, January 2023, pgac275, https://doi.org/10.1093/pnasnexus/pgac275

In [None]:
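import numpy as np
import pandas as pd
import os
import swifter
import torch
from tqdm.auto import tqdm
# Explicit imports of the Hugging Face classes used in the cells below
from transformers import (BertForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
tqdm.pandas()

# Import custom functions 
from bert_finetune_utils import *
from read_politics_helpers import readCongress, splitSents
from prescience_helpers import *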

## Downloading the data

For this tutorial, we will be using the bound and daily editions of the United States Congressional Record. This dataset contains speeches made on the House and Senate floor by U.S. federal politicians. Download and unzip it from: https://stacks.stanford.edu/file/druid:md374tz9962/hein-bound.zip

For this example, we'll be using speeches from the 87th congress (1961-1963) and the 107th congress (2001-2003).


#### Citation

Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

In [None]:
def readCongress(file):
    '''
    Read in and process congressional data
    @param file (str) - File of congressional speeches to process

    @return df (DataFrame) - Dataframe holding political speeches

    @dependencies splitSents
    ''' 
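    # Read one speech per line (recent pandas versions reject sep='\n' in read_csv)
    with open(file, encoding='latin-1') as f:
        next(f)   # Skip the header row, as read_csv did by default
        df = [x.rstrip('\n').split('|')[:2] for x in f if x.strip()]   # Split on | - text after second | is dropped - extremely rare and inconsequential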
    df = pd.DataFrame(df, columns=['speech_id', 'speech'])

    # Split dataframe on sentences
    df = df[df['speech'] != ''] # Drop empty strings
    df['speech'] = df['speech'].swifter.allow_dask_on_strings().apply(lambda x : splitSents(x))

    # Expand dataframe so that each sentence is a unique row
    df = expandDF(df, 'speech')

    return df

In [None]:
data_path = "INSERT PATH TO SAVED DATA"
cong_87 = readCongress(os.path.join(data_path, 'speeches_087.txt'))
cong_107 = readCongress(os.path.join(data_path, 'speeches_107.txt'))

### Examining the text data

Using readCongress, we have read the speech data and split each speech act into separate sentences, which form the basic input for BERT. Now we are ready to finetune.
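
To make the reshaping concrete, here is a minimal sketch on toy data. It assumes the expansion step behaves like pandas' `explode`; the tutorial's own `expandDF` helper may be implemented differently, and the speech IDs and sentences below are made up.

```python
import pandas as pd

# Toy stand-in for the output of splitSents: each speech is a list of sentences.
toy = pd.DataFrame({
    "speech_id": ["id_1", "id_2"],
    "speech": [["Mr. President, I rise today.", "I yield the floor."],
               ["The bill is passed."]],
})

# explode() gives every sentence its own row and repeats the speech_id.
sentences = toy.explode("speech").reset_index(drop=True)
print(sentences)
```

Each row of `sentences` now holds one sentence together with the ID of the speech it came from, which is the shape the later steps expect.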

In [None]:
cong_107.head()

# 1. Finetuning BERT

For this finetuning example we'll use the 107th congress

In [None]:
df = cong_107
model_name = 'bert_base_uncased_107th_congress'

### Picking the BERT model

We use the default BERT base uncased model, but this approach also works with a custom BERT model, or even with a custom BERT vocabulary paired with a pre-trained model.

In [None]:
bert_model = 'bert-base-uncased'
vocab = bert_model  # Path of vocabulary for BERT tokenizer - If using default set to bert_model
model = BertForMaskedLM.from_pretrained(bert_model)

### Building training data

We convert the data into an iterable object and remove:

1. Very short sentences (too short to be useful in computing prescience)
2. Very long sentences (likely errors in the sentence parser), which slow down training

We use the BertDataset object to do so.
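
The length filter can be sketched in plain Python. This is an illustration only: the function name `filter_by_length` is hypothetical, it counts whitespace-separated words rather than BERT word pieces, and it treats both bounds as inclusive. `BertDataset` may count and compare differently.

```python
def filter_by_length(sentences, min_len=12, max_len=102):
    """Keep sentences whose whitespace token count lies in [min_len, max_len]."""
    kept = []
    for sent in sentences:
        n_tokens = len(sent.split())  # assumption: words, not WordPiece tokens
        if min_len <= n_tokens <= max_len:
            kept.append(sent)
    return kept

examples = [
    "I yield.",                  # 2 tokens: dropped as too short
    " ".join(["word"] * 20),     # 20 tokens: kept
    " ".join(["word"] * 500),    # 500 tokens: dropped as too long
]
kept = filter_by_length(examples)
print(len(kept))
```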

In [None]:
# Construct data object for training
data = BertDataset(df['speech'], tokenizer_vocab_path=vocab,
                   min_doc_len=12, max_doc_len=102)

### Finetuning

Now we're ready to finetune. Below are the hyperparameters I used when finetuning BERT to the politics data. 

Note that I *do not* split the data into train and test sets; I simply finetune for an extended amount of time (approximately one day of training on a 2080 Ti). When finetuning over small datasets, however, a held-out split may be necessary to detect and prevent overfitting.
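
If a held-out set is wanted, for instance on a small corpus, one option is a random sentence-level split. The sketch below uses toy data, and the 10% fraction and the random seed are arbitrary choices rather than values from the tutorial.

```python
import pandas as pd

# Toy sentence table standing in for the sentence-level DataFrame.
toy = pd.DataFrame({"speech": [f"sentence {i}" for i in range(100)]})

# Hold out 10% of sentences for evaluation; the rest is used for training.
eval_df = toy.sample(frac=0.1, random_state=0)
train_df = toy.drop(eval_df.index)

print(len(train_df), len(eval_df))
```

Each half could then be wrapped in its own dataset object, with the held-out one passed to the `Trainer` through its `eval_dataset` argument.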

In [None]:
# Establish GPU usage to accelerate training
torch.cuda.set_device(0)
torch.cuda.get_device_name(0)

In [None]:
training_args = TrainingArguments(
    output_dir='./output/{}'.format(model_name),
    overwrite_output_dir=True,
    num_train_epochs=8,
    logging_steps=1000,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=32,
    save_steps=30000,
    save_total_limit=10,
    do_train=True,
    seed=102093,
    fp16=True)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=data.tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=data
)

In [None]:
trainer.train()
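
As a rough guide to what these arguments imply, the sketch below computes the effective batch size and an approximate number of optimizer steps. The helper name `optimizer_steps` and the corpus size are made up for illustration, a single GPU is assumed, and the `Trainer` may round the step count slightly differently.

```python
import math

def optimizer_steps(n_sentences, per_device_batch=32, grad_accum=2,
                    n_devices=1, epochs=8):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_sentences / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# n_sentences is a made-up corpus size for illustration only.
effective_batch, total_steps = optimizer_steps(n_sentences=1_000_000)
print(effective_batch, total_steps)
```

Comparing the total step count with `save_steps` gives an estimate of how many checkpoints a run will write.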

# 2. Computing Sentence Perplexity

The second step is to compute sentence perplexity using multiple models. Restart the notebook kernel to clear memory, then import all the packages again.

In [None]:
import numpy as np
import pandas as pd
import os
import swifter
import torch
from tqdm.auto import tqdm
tqdm.pandas()

# Import custom functions 
from bert_finetune_utils import *
from read_politics_helpers import readCongress, splitSents
from prescience_helpers import *

In [None]:
# Establish GPU usage to accelerate inference
torch.cuda.set_device(0)
torch.cuda.get_device_name(0)

### We'll compute sentence perplexity using speeches from the 87th congress

In [None]:
data_path = "INSERT PATH TO SAVED DATA"
cong_87 = readCongress(os.path.join(data_path, 'speeches_087.txt'))

### And the finetuned models of the 87th and 107th congresses

Note: Provide the same path as the output directory from finetuning, _NOT_ the specific save state. The code automatically selects the last save state from finetuning.
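
For intuition, picking the last save state can be sketched with the standard library. The Hugging Face `Trainer` writes checkpoints into subdirectories named `checkpoint-<step>`; the function `latest_checkpoint` below is a hypothetical illustration, and the tutorial's helper code may select the checkpoint differently.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the largest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    # Compare steps as integers so 120000 ranks above 60000.
    return os.path.join(output_dir, max(steps)[1])

# Demonstrate on a throwaway directory with fake checkpoint folders.
demo_dir = tempfile.mkdtemp()
for step in (30000, 60000, 120000):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
latest = latest_checkpoint(demo_dir)
print(os.path.basename(latest))
```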

In [None]:
model_path = './output'  # Parent of the output_dir folders created during finetuning
models = [os.path.join(model_path, 'bert_base_uncased_87th_congress'),
          os.path.join(model_path, 'bert_base_uncased_107th_congress')]

### The workhorse function here is computePerplexitiesForPrescience

View its docstring. For illustrative purposes, we'll set compute_sent_perp to False. This returns word-level perplexities instead of sentence-level perplexities, which is useful for exploring _which_ words contribute to prescience but is significantly more memory intensive.
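
To show how word-level and sentence-level values relate, the sketch below applies the standard definition of perplexity: the exponential of the mean negative log-probability of the tokens. The probabilities are made up, and `computePerplexitiesForPrescience` may aggregate its word-level output differently.

```python
import math

def sentence_perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities a model might assign to each masked token in turn.
token_probs = [0.5, 0.25, 0.5, 0.25]
ppl = sentence_perplexity(token_probs)
print(round(ppl, 4))
```

A sentence whose tokens are all predicted with probability 1 has a perplexity of 1; less predictable tokens push the value up.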

In [None]:
computePerplexitiesForPrescience?

In [None]:
cong_87 = computePerplexitiesForPrescience(cong_87, 'speech', model_names=models,
                                           compute_sent_perp=False)
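
Once each sentence has a perplexity under both models, the two can be compared. The sketch below assumes prescience is scored as the relative drop in perplexity from the contemporaneous model to the later one; consult the paper and `prescience_helpers` for the exact definition. The column names and perplexity values are hypothetical.

```python
import pandas as pd

# Hypothetical column names and made-up perplexities for three sentences.
toy = pd.DataFrame({
    "ppl_87": [40.0, 10.0, 25.0],    # perplexity under the 87th-congress model
    "ppl_107": [20.0, 10.0, 50.0],   # perplexity under the 107th-congress model
})

# Positive values: the later model finds the sentence less surprising.
toy["prescience"] = (toy["ppl_87"] - toy["ppl_107"]) / toy["ppl_87"]
print(toy)
```

Under this scoring, a sentence that is surprising to the model of its own time but unsurprising to the later model gets a high value.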