In [1]:

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
import datasets
import torch
from datasets import Dataset, load_dataset, Sequence, ClassLabel, Features, Value
import evaluate
from preprocessing import preprocessing
from transformers import DataCollatorForTokenClassification
import numpy as np
import random
%load_ext autoreload
%autoreload 2




# 1. Preprocessing
First, we define some paths and constants.
+ `TRAIN_FILE_NAME`: The name of the file to use for training
+ `MODE`: Either `train` or `test`, depending on whether a model already exists
+ `PER_DS`: The fraction of the dataset (between 0 and 1) that is sampled for training

In [2]:
TRAIN_FILE_NAME = 'en_ewt-up-dev.conllu'
MODE = 'train'
PER_DS = 0.5

## 1.1 Calling external libraries
We call the `preprocessing` function with the `TRAIN_FILE_NAME` file defined above. The file must be located in the `data/raw` path to be recognized. The function goes over the CoNLL-U file and builds a dataframe from it, organizes the argument labels, and repeats the tokens of each sentence once per predicate in that sentence. Consequently, if a sentence has 3 predicates, each of its tokens appears in 3 rows, and the target label of each row says whether the token is part of an argument of that predicate and, if so, which argument.

In [3]:
df = preprocessing(TRAIN_FILE_NAME)
label_list = list(df['label'].unique())
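To make the per-predicate expansion described above concrete, here is a minimal sketch in plain Python. It is not the code of the `preprocessing` module, which is not shown in this notebook; the sentence, the label names and the `"_"` marker for "not an argument" are all hypothetical.

```python
# Hypothetical toy data: one sentence with two predicates and, for each
# predicate, one argument label per token ("_" means "not an argument").
tokens = ["She", "said", "he", "left"]
labels_per_predicate = {
    "said": ["ARG0", "_", "ARG1", "ARG1"],
    "left": ["_", "_", "ARG0", "_"],
}

def expand_per_predicate(tokens, labels_per_predicate):
    """Return one row per (predicate, token) pair."""
    rows = []
    for predicate, labels in labels_per_predicate.items():
        for token_id, (token, label) in enumerate(zip(tokens, labels), start=1):
            rows.append({"token_id": token_id, "token": token,
                         "predicate": predicate, "label": label})
    return rows

rows = expand_per_predicate(tokens, labels_per_predicate)
for row in rows:
    print(row)
```

With 4 tokens and 2 predicates the sketch produces 8 rows: the same tokens twice, each copy carrying the labels for one predicate.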

## 1.2 Aggregation per sentence
We aggregate the dataframe by `sentence_id`, so that each row now represents **one sentence**. Each row contains a list of tokens together with the corresponding lists of lemmas, predicates and labels. We perform this aggregation to simplify the training procedure: since we will do sequence tagging, we want to pass a whole sequence to the model and receive the output for it.

In [4]:
sent_df = df.groupby(['sentence_id']).agg(lambda x: x.tolist()).reset_index()
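The effect of this `groupby(...).agg(list)` step can be sketched without pandas. The rows and column subset below are hypothetical; the point is only that every column other than `sentence_id` becomes a list, in the original row order.

```python
from collections import defaultdict

# Hypothetical token-level rows; only three columns are shown.
rows = [
    {"sentence_id": 0, "token": "She", "label": "ARG0"},
    {"sentence_id": 0, "token": "left", "label": "_"},
    {"sentence_id": 1, "token": "He", "label": "ARG0"},
    {"sentence_id": 1, "token": "stayed", "label": "_"},
    {"sentence_id": 1, "token": "home", "label": "ARGM-LOC"},
]

def aggregate_by_sentence(rows):
    """Collect every column into a list per sentence_id, keeping row order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, value in row.items():
            if column != "sentence_id":
                grouped[row["sentence_id"]][column].append(value)
    # Sort by sentence_id, as pandas groupby does by default.
    return [{"sentence_id": sid, **columns} for sid, columns in sorted(grouped.items())]

sentences = aggregate_by_sentence(rows)
print(sentences[1])
```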

## 1.3 Dataset construction
The Hugging Face `datasets` library provides a convenient wrapper for processing datasets. We therefore convert our data to that format by declaring the type of each column in `features` and then passing our raw data to it.

In [5]:
features = Features({
    'token_id': Sequence(feature=Value('float32')),
    'sentence_num': Sequence(feature=Value('int32')),
    'token': Sequence(feature=Value('string')),
    'lemma': Sequence(feature=Value('string')),
    'upos': Sequence(feature=Value('string')),
    'POS': Sequence(feature=Value('string')),
    'feats': Sequence(feature=Value('string')),
    'head': Sequence(feature=Value('string')),
    'deprel': Sequence(feature=Value('string')),
    'deps': Sequence(feature=Value('string')),
    'misc': Sequence(feature=Value('string')),
    'predicate': Sequence(feature=Value('string')),
    'predicate_token': Sequence(feature=Value('string')),
    'predicate_token_id': Sequence(feature=Value('int32')),
    'sentence_id': Value('int32'),
    'label': Sequence(feature=ClassLabel(names=label_list)),

})

ds = Dataset.from_pandas(sent_df[list(features.keys())], features=features)


## 1.4 Dataset filtering
Here we sample $k$ observations uniformly at random, without replacement, from the dataset, where $k = \lfloor N \alpha \rfloor$, $N$ is the number of rows in the dataset, and $\alpha$ is `PER_DS`, the fraction of the dataset we want to use for training.

In [6]:
ds = ds.select(random.sample(range(len(ds)), int(len(ds)*PER_DS)))
len(ds)

2488
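The cell above draws from the global, unseeded `random` state, so a different subset is selected on every run. A minimal sketch of a reproducible variant follows; the helper name `sample_indices` and the `seed` parameter are assumptions for illustration and do not appear in this notebook.

```python
import random

def sample_indices(n_rows, fraction, seed=0):
    """Pick floor(n_rows * fraction) distinct row indices uniformly at random."""
    k = int(n_rows * fraction)
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(range(n_rows), k)

indices = sample_indices(n_rows=10, fraction=0.5, seed=0)
print(len(indices), sorted(indices))
```

The resulting list could be passed to `ds.select(...)` in the same way as in the cell above.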

## 1.5 Tokenization
For tokenization we use the tokenizer of the `bert-base-uncased` model. This is the standard BERT base model in its uncased variant, which lowercases the text so that words differing only in case share the same tokens.

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
SEP_TOKEN_ID = tokenizer.sep_token_id  # id of the [SEP] special token

### 1.5.1 Actual tokenization and alignment
Here we call a function that returns the dataset in terms of `input_ids` and `attention_mask`. It constructs the input as proposed in **CITE**, in the form `sent [SEP] pred`, giving the model the whole sentence followed by the predicate. It also constructs the corresponding true labels, aligning the word-level labels with the resulting tokens. Tokens for which we do not want to predict a label are marked with $-100$, the default index that PyTorch's cross-entropy loss ignores.

In [8]:
from utils import tokenize_and_align_labels
tokenized_datasets = ds.map(lambda x: tokenize_and_align_labels(tokenizer, x))

Map: 100%|██████████| 2488/2488 [00:01<00:00, 1697.59 examples/s]
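The implementation of `tokenize_and_align_labels` lives in `utils` and is not shown in this notebook. As a rough sketch of the alignment idea only, the function below follows the common convention of labeling the first sub-word of each word and masking special tokens and later sub-words with $-100$. Whether `utils` does exactly this, and how it treats the predicate segment after `[SEP]`, is an assumption; the name `align_labels` and the example are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids, word_labels):
    """Map word-level labels onto tokens.

    word_ids[i] is the index of the word that token i came from, or None
    for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id == previous_word:
            aligned.append(IGNORE_INDEX)          # later sub-word of the same word
        else:
            aligned.append(word_labels[word_id])  # first sub-word carries the label
        previous_word = word_id
    return aligned

# Hypothetical example: "[CLS] she play ##ed [SEP]" with word-level label ids.
word_ids = [None, 0, 1, 1, None]
word_labels = [1, 0]
print(align_labels(word_ids, word_labels))  # [-100, 1, 0, -100, -100]
```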


Here we can look at the labels of the first row of the tokenized dataset.

In [9]:
tokenized_datasets['labels'][0]

[-100,
 0,
 0,
 -100,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 3,
 2,
 0,
 0,
 0,
 1,
 0,
 -100]

# 2. Baseline model training
Now we specify some general information about the model we are going to train and the hyper-parameters we will use.

+ `LR`: Learning rate, which scales the size of each weight update computed from the gradients
+ `EPOCHS`: The number of full passes over the training data
+ `WEIGHT_DECAY`: A regularization coefficient that shrinks the weights slightly at each update
+ `BATCH_SIZE`: The number of examples per device in each batch; gradients are computed over one batch before each weight update. It should not be confused with the **step size**, which is set by `LR`

In [10]:
task = 'SRL'
BATCH_SIZE = 32
model_name = 'bert-base-uncased'
LR = 2e-5
EPOCHS = 3
WEIGHT_DECAY = 0.01
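As a quick check of how `BATCH_SIZE` and `EPOCHS` interact, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation; the dataset size is the `len(ds)` output shown earlier.

```python
import math

# Values taken from this notebook.
n_examples = 2488   # rows left after sampling (output of len(ds))
batch_size = 32     # BATCH_SIZE
epochs = 3          # EPOCHS

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 78 234
```

The total of 234 steps agrees with the total shown in the training progress bar at the end of the notebook.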

In [12]:
from transformers import DataCollatorForTokenClassification


args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    push_to_hub=False,
)

model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))

data_collator = DataCollatorForTokenClassification(tokenizer)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 2.1 Metrics
We also need a way to measure how well the model performs. Here we use `seqeval`, loaded through the Hugging Face `evaluate` library, which computes evaluation metrics for sequence labeling tasks. It returns the overall *precision*, *recall* and *F1* (plus accuracy and per-class scores).

**NOTE**: As a small sanity check, we score the true labels against themselves to make sure the result is $1$.

In [13]:
import warnings
metric = evaluate.load("seqeval")


labels = [label_list[j] for l in ds["label"] for j in l]
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    m = metric.compute(predictions=[labels], references=[labels])
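To show what the sanity check is expected to confirm, here is a simplified sketch of span-level precision, recall and F1 by exact match. It is not `seqeval`'s implementation, which first extracts spans from tag sequences; the function name and the spans below are hypothetical.

```python
def precision_recall_f1(predicted_spans, gold_spans):
    """Score predicted spans against gold spans by exact match."""
    predicted, gold = set(predicted_spans), set(gold_spans)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f1

# Hypothetical spans as (start_token, end_token, label).
gold = [(0, 0, "ARG0"), (2, 3, "ARG1")]
pred = [(0, 0, "ARG0"), (2, 2, "ARG1")]

print(precision_recall_f1(gold, gold))  # (1.0, 1.0, 1.0)
print(precision_recall_f1(pred, gold))  # (0.5, 0.5, 0.5)
```

Scoring the gold spans against themselves gives 1 for all three values, which is the behaviour the test above checks for. A span with the right label but the wrong boundary counts as an error.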

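## 2.2 Model training
Now we get to the actual training. We remove the columns the model does not need from the dataset and pass everything to the `Trainer`. Then we run the training and save the model. Note that `eval_dataset` is the same data as `train_dataset` here, so the per-epoch metrics measure fit to the training data rather than performance on held-out data.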

In [14]:
from utils import compute_metrics
td = tokenized_datasets.remove_columns(ds.column_names)
trainer = Trainer(
    model,
    args,
    train_dataset=td,
    eval_dataset=td,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=lambda x: compute_metrics(x, label_list),
)


In [15]:
trainer.train()
trainer.save_model("bert_model")

 31%|███       | 72/234 [09:43<05:07,  1.90s/it]  

RuntimeError: MPS backend out of memory (MPS allocated: 6.64 GB, other allocations: 2.41 GB, max allowed: 9.07 GB). Tried to allocate 89.42 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
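The run above stops with an out-of-memory error on the MPS backend. One common way to lower peak memory, sketched here as an assumption rather than something tested in this notebook, is to shrink the per-device batch and accumulate gradients over several batches so that the effective batch size stays the same. `per_device_train_batch_size`, `per_device_eval_batch_size` and `gradient_accumulation_steps` are existing `TrainingArguments` parameters; the values 8 and 4 are illustrative.

```python
# Illustrative overrides for TrainingArguments; the values are assumptions.
original_batch_size = 32  # BATCH_SIZE used in this notebook

memory_friendly_overrides = {
    "per_device_train_batch_size": 8,   # fewer examples in memory per step
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 4,   # accumulate 4 small batches per update
}

effective_batch_size = (
    memory_friendly_overrides["per_device_train_batch_size"]
    * memory_friendly_overrides["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 32
```

These keys could replace the two batch-size arguments in the `TrainingArguments(...)` call above. Whether this is enough to fit in the available memory depends on the machine.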