The goal of this notebook is to explore the attention patterns of transformer-based language models on document-level event detection.
We use the DocEE dataset and the RoBERTa-base model.

First, we explore the pretrained model. Loading it involves:
    - loading the tokenizer
    - loading the config file (from_pretrained does this implicitly when no config is passed)
    - loading the model

In [1]:
import torch
from transformers import RobertaModel, RobertaTokenizerFast

# Fall back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name_or_path = "roberta-base"
cache_dir = "../pretrained_models"

tokenizer = RobertaTokenizerFast.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path,
    cache_dir=cache_dir
)

# Use RobertaModel: RobertaPreTrainedModel is only the abstract base class,
# defines no layers, and therefore loads none of the checkpoint weights.
# cache_dir has to be passed by keyword.
model = RobertaModel.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir
).to(device)

print(model)


Next, we load the dataset. We use the training split for exploration
and keep only the first 10 examples, which is enough for inspecting attention patterns.

In [2]:
import pandas as pd

keep_only = 10
# nrows stops parsing after the first keep_only rows instead of reading the whole file.
train_df = pd.read_csv("../data/docee/train_all.csv", nrows=keep_only)
train_df.head()

Unnamed: 0.1,Unnamed: 0,title,text,event_type,arguments,date,metadata
0,0,Vietnam reelects conservative Nguyễn Phú Trọng...,Vietnam's Communist Party Wednesday re-elected...,Government Job change - Election,"[{'start': 0, 'end': 24, 'type': 'Candidates a...",January 2016,"['(AP via ABC News)', '(Channel NewsAsia)']"
1,1,At least 42 people are killed in a bus crash i...,Another 43 people were injured when the bus ca...,Road Crash,"[{'start': 8, 'end': 29, 'type': 'Casualties a...",October 2006,['(BBC)']
2,2,At least 27 migrants die in a shipwreck in the...,At least 27 migrants have died off the Turkish...,Shipwreck,"[{'start': 0, 'end': 29, 'type': 'Casualties a...",February 2016,"['(ANSAmed)', '(Leadership)', '(news.com.au)',..."
3,3,Colten Treu faces charges of vehicular homicid...,"Colten Treu, 21, and his roommate both told au...",Road Crash,"[{'start': 183, 'end': 207, 'type': 'Number of...",November 2018,"['(KSTP)', '(Oxygen)']"
4,4,"Hours after the announcement, Morales resigns ...",Bolivian President Evo Morales has resigned af...,Government Job change - Resignation_Dismissal,"[{'start': 0, 'end': 17, 'type': 'Position', '...",November 2019,"['(BBC News)', '(The Guardian)']"


In [3]:
from src.data import DoceeDataset

train_dataset = DoceeDataset(train_df, tokenizer=tokenizer)
train_dataset.inspect()

 ===text=== 
Field type: <class 'list'>
Field length : 10
Type of element: <class 'str'>
First element = Vietnam's Communist Party Wednesday re-elected its 71-year-old chief for a second term, an expected outcome that sees the conservative pro-China ideologue cementing his hold on power.
The party's congress elected Nguyen Phu Trong (pronounced Noo-yen Foo Chong) to a 19-member Politburo, the all-powerful body that handles the day-to-day affairs of the government and the party. In a subsequent vote, he was immediately chosen as the general-secretary, the de facto No. 1 leader of the country.
The announcement was made on the official Vietnam News Agency's website.
Officials said Deputy Prime Minister Nguyen Xuan Phuc was also elected to the Politburo, and he is now expected to become the prime minister. He will replace Nguyen Tan Dung, who had had led economic reforms over the last 10 years and had harbored ambitions for the top job. His challenge, however, was snuffed by Trong's suppor

In [4]:
example_arg = train_dataset.arguments[0]
print(example_arg)

[Argument(start=0, end=24, type='Candidates and their parties', text="Vietnam's Communist Party"), Argument(start=26, end=34, type='Date', text='Wednesday'), Argument(start=213, end=228, type='Candidates and their parties', text='Nguyen Phu Trong'), Argument(start=604, end=619, type='Candidates and their parties', text='Nguyen Xuan Phuc'), Argument(start=2657, end=2663, type='Location', text='Vietnam')]


At least for this example, arguments is a list with one entry per annotated argument. In the raw CSV each entry is a dictionary; the dataset exposes it as an Argument object. Each argument has the following fields:
    - start -- character index in text where the argument begins (0-based)
    - end   -- character index of the argument's last character (inclusive): "Vietnam's Communist Party" is 25 characters long and spans 0 to 24
    - type  -- the argument type, e.g. 'Date'
    - text  -- the text span labeled as the argument
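
Whether start and end are character indices can be checked directly against the printed example. The sketch below is a minimal, self-contained illustration, not the project's code: the Argument dataclass is a hypothetical stand-in for the class in src.data, and it assumes the raw CSV entries carry the same four keys (start, end, type, text) that the printed Argument objects show. It parses a stringified list with ast.literal_eval and tests whether slicing the document with an inclusive end reproduces each argument's text.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Argument:
    """Hypothetical stand-in for the Argument class used by DoceeDataset."""
    start: int
    end: int
    type: str
    text: str


def parse_arguments(raw: str) -> List[Argument]:
    """Turn a stringified list of dicts (as stored in the CSV) into Argument objects."""
    return [Argument(**entry) for entry in ast.literal_eval(raw)]


def span_matches(document: str, arg: Argument) -> bool:
    """True if document[start..end] (end inclusive) equals the labeled text."""
    return document[arg.start : arg.end + 1] == arg.text


# First two arguments of the example above, written as they would appear in the
# CSV (assumption: the 'text' key is present in the raw entries).
raw_arguments = (
    "[{'start': 0, 'end': 24, 'type': 'Candidates and their parties', "
    "'text': \"Vietnam's Communist Party\"}, "
    "{'start': 26, 'end': 34, 'type': 'Date', 'text': 'Wednesday'}]"
)
document = "Vietnam's Communist Party Wednesday re-elected its 71-year-old chief"

arguments = parse_arguments(raw_arguments)
print([span_matches(document, arg) for arg in arguments])  # [True, True]
```

Both spans match only when the slice is taken up to end + 1, which suggests that start and end are character offsets and that end is inclusive.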

In [5]:
first_arg = example_arg[0]
print(first_arg)

Argument(start=0, end=24, type='Candidates and their parties', text="Vietnam's Communist Party")
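
To look at the attention an argument receives, its character span first has to be mapped to token positions. The following is a hedged sketch of one way to do that. Fast tokenizers in transformers can return per-token character offsets (return_offsets_mapping=True) as end-exclusive (start, end) pairs, with (0, 0) for special tokens. The helper name token_indices_for_span is hypothetical, and the offsets below are hand-written for illustration; they are not real RoBERTa output.

```python
from typing import List, Sequence, Tuple


def token_indices_for_span(
    offsets: Sequence[Tuple[int, int]], start: int, end_inclusive: int
) -> List[int]:
    """Indices of tokens whose character range overlaps [start, end_inclusive].

    offsets holds end-exclusive (start, end) pairs, one per token.
    Zero-width entries such as (0, 0) for special tokens are skipped.
    """
    end = end_inclusive + 1  # convert the inclusive end to an exclusive one
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < tok_end and tok_start < end and tok_end > start
    ]


# Illustrative offsets for "Vietnam's Communist Party Wednesday re-elected",
# framed by two special tokens. A real tokenizer would split differently.
example_offsets = [
    (0, 0),    # special token
    (0, 7),    # Vietnam
    (7, 9),    # 's
    (10, 19),  # Communist
    (20, 25),  # Party
    (26, 35),  # Wednesday
    (36, 46),  # re-elected
    (0, 0),    # special token
]

party_tokens = token_indices_for_span(example_offsets, start=0, end_inclusive=24)
date_tokens = token_indices_for_span(example_offsets, start=26, end_inclusive=34)
print(party_tokens, date_tokens)
```

The resulting indices can then be used to select rows or columns of an attention matrix for the argument's tokens.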
