# COLX 561 Lab Assignment 2: Semantic Role Labelling
## Assignment Objectives

In this lab, you will be fine-tuning BERT to predict Semantic Role Labels (SRLs). Components of this lab include:

1. Building a dataset
2. Fine-tuning BERT for semantic role labelling
3. Evaluating and running inference on your model

If you do not have access to a GPU locally, you may want to run this lab on Google Colab with a GPU backend. Go to the 'Runtime' menu and choose 'Change Runtime Type...'; a T4 GPU is usually available for free for at least a few hours. Neither our dataset nor our model is huge, so CPU training is doable (it shouldn't take more than about 30 minutes), but training will be *much* faster on a GPU if you are trying different options.


## Getting Started

Run the code below to access relevant modules (you can add to this as needed).

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [15]:
import sys

# !{sys.executable} -m pip install pandas transformers accelerate tokenizers datasets
# !{sys.executable} -m pip install scikit-learn

In [16]:
import os, time
import pandas as pd
import codecs
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import f1_score, classification_report
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from transformers import AutoTokenizer,AutoModelForTokenClassification, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict

For this lab, we will be working with the OntoNotes v. 5.0 corpus, specifically the data tagged with Semantic Role Labels in the PropBank style using the standard CoNLL format. Download [the data](https://github.ubc.ca/MDS-CL-2022-23/COLX_563_adv-semantics_instructors/tree/master/labs/Data/Lab2) from GitHub, unzip it into a directory outside of your lab repo, and change the path below.

## Tidy Submission
rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the instructions

## Initial Data Processing
In the following cells, we'll be collecting our data and building our data set. **The code has been provided for you. You just need to run the cells.**

First, we'll generate three lists (`train_files`, `dev_files`, and `test_files`) which consist of the paths to all SRL datafiles. 

In [17]:
ontonotes_path = '../Data/Lab2'

In [18]:
train_files = [os.path.join('train',filename) for filename in os.listdir(os.path.join(ontonotes_path , 'train'))]
dev_files = [os.path.join('dev', filename) for filename in os.listdir(os.path.join(ontonotes_path, 'dev'))]
test_files = [os.path.join('test', filename) for filename in os.listdir(os.path.join(ontonotes_path,'test'))]

train_files, dev_files, test_files = sorted(train_files), sorted(dev_files), sorted(test_files)

In [19]:
# Check that we get the correct number of files
assert len(train_files) == 262
print('Success!')

Success!


### Building our dataset

Like with our NER set, we need to generate IOB (**I**nside-**O**utside-**B**eginning) SRL tags for our CoNLL formatted data. But this time, because a sentence can have multiple verbal predicates (with SRL arguments for each predicate), we will potentially create *multiple* IOB tag sequences for a single sentence. 

For your reference, here is an example of CoNLL-formatted data. Columns 9 and 10 contain SRL arguments for the verbs *Shooting* and *turn*, respectively. Parentheses like `(ARG2* ... *)` are used to indicate spans for arguments. For example, `ARG2` for the verb *turn* contains the entire span *Into a Funeral in Southern Gaza Strip*.

```
ontonotes/wb/a2e/00/a2e_0000   0    0  Celebration    NN  (TOP(S(NP*     -          -    (ARGM-LOC*)   (ARG0*
ontonotes/wb/a2e/00/a2e_0000   0    1     Shooting    NN           *) shoot  shoot.02           (V*)        *)
ontonotes/wb/a2e/00/a2e_0000   0    2        Turns   VBZ        (VP*   turn   turn.02             *       (V*)
ontonotes/wb/a2e/00/a2e_0000   0    3      Wedding    NN        (NP*)    -          -             *    (ARG1*)
ontonotes/wb/a2e/00/a2e_0000   0    4         Into    IN        (PP*     -          -             *    (ARG2*
ontonotes/wb/a2e/00/a2e_0000   0    5            a    DT        (NP*     -          -             *         *
ontonotes/wb/a2e/00/a2e_0000   0    6      Funeral    NN           *)    -          -             *         *
ontonotes/wb/a2e/00/a2e_0000   0    7           in    IN        (PP*     -          -             *         *
ontonotes/wb/a2e/00/a2e_0000   0    8     Southern    JJ        (NP*     -          -             *         *
ontonotes/wb/a2e/00/a2e_0000   0    9         Gaza   NNP           *     -          -             *         *
ontonotes/wb/a2e/00/a2e_0000   0   10        Strip   NNP      *))))))    -          -             *         *)

```
To distinguish between predicates, we'll add a binary indicator feature to **every token**. The value is **1** if the token is the targeted verb and **0** otherwise. For example, the first SRL column for the sentence above would generate tokens and tags that look like this (*shooting* is the predicate here, as indicated by `('shooting', 1)`):
```
('celebration', 0) : B-ARGM
('shooting', 1) : B-PRED
('turns', 0) : O
('wedding', 0) : O
('into', 0) : O
('a', 0) : O
('funeral', 0) : O
('in', 0) : O
('southern', 0) : O
('gaza', 0) : O
('strip', 0) : O
```
The second SRL column would generate an output that looks like this. The predicate here is *turns*.
```
('celebration', 0) : B-ARG0
('shooting', 0) : I-ARG0
('turns', 1) : B-PRED
('wedding', 0) : B-ARG1
('into', 0) : B-ARG2
('a', 0) : I-ARG2
('funeral', 0) : I-ARG2
('in', 0) : I-ARG2
('southern', 0) : I-ARG2
('gaza', 0) : I-ARG2
('strip', 0) : I-ARG2
```

**Note:** The 'V' tag has been changed to 'PRED,' and all tags 'ARGM(-XYZ)' have been collapsed into a single 'ARGM' type.
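
To make the bracket-to-IOB conversion concrete, here is a minimal, self-contained sketch that converts a single SRL column into IOB tags. The function name `srl_column_to_iob` is hypothetical, and the logic is a simplified stand-in for the `df2iob()` function provided in this notebook (it does not handle the `C-`, `R-`, or `-DSP` variants). The column values are copied from the example sentence above.

```python
def srl_column_to_iob(column):
    """Convert one CoNLL SRL column (a list of bracket strings) to IOB tags."""
    tags = []
    current = None  # label of the span we are currently inside, if any
    for cell in column:
        if cell.startswith('('):
            label = cell.strip('()*')
            label = 'PRED' if label == 'V' else label
            label = 'ARGM' if label.startswith('ARGM') else label
            tags.append('B-' + label)
            # A cell like '(V*)' opens and closes the span on the same token
            current = None if cell.endswith(')') else label
        elif current is not None:
            tags.append('I-' + current)
            if cell.endswith(')'):
                current = None
        else:
            tags.append('O')
    return tags


# The two SRL columns of the example sentence
shoot_column = ['(ARGM-LOC*)', '(V*)'] + ['*'] * 9
turn_column = ['(ARG0*', '*)', '(V*)', '(ARG1*)', '(ARG2*', '*', '*', '*', '*', '*', '*)']

print(srl_column_to_iob(shoot_column))
print(srl_column_to_iob(turn_column))
```

Each SRL column yields its own tag sequence over the same words, which is why one sentence can produce several training examples.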

The code here reads each sentence as a pandas.DataFrame() and then generates sets of tokens and tags for it.

In [20]:
def df2iob(df):
    '''Generates tokens and tags from a dataframe corresponding to a sentence of the Ontonotes
    corpus tagged for semantic roles. Returns a tuple consisting of a list of lists of
    lowercased tokens and a list of lists of IOB tags; the length of these lists
    corresponds to the number of predicates in the sentence (potentially zero)'''
    word_tokens = df['Word'].values.tolist()
    all_tags = []
    all_tokens = []
    for column in df:
        if column.startswith("SRL"):
            sent_tags = []
            srl = df[column].values.tolist()    
            in_tag = 0
            for tag in srl:
                # Used to collapse tag classes
                tag = tag.replace('C-', '')
                tag = tag.replace('R-','')
                tag = tag.replace('-DSP','')
                if '(' in tag and ')' in tag:
                    curr_tag = tag[1:-2]
                    curr_tag = 'PRED' if curr_tag == 'V' else curr_tag
                    curr_tag = 'ARGM' if 'ARGM' in curr_tag else curr_tag
                    sent_tags.append('B-' + curr_tag)
                elif '(' in tag:
                    curr_tag = tag[1:-1]
                    curr_tag = 'PRED' if curr_tag == 'V' else curr_tag
                    curr_tag = 'ARGM' if 'ARGM' in curr_tag else curr_tag
                    sent_tags.append('B-' + curr_tag)
                    in_tag = 1
                elif ')' in tag:
                    sent_tags.append('I-' + curr_tag)
                    in_tag = 0
                elif in_tag:
                    sent_tags.append('I-' + curr_tag)
                else:
                    sent_tags.append('O')
            word_token_is_verb = [1 if 'PRED' in tag else 0 for tag in sent_tags]
            # sent_token_pairs = [(w.lower(), j) for w,j in zip(word_tokens, word_token_is_verb)]
            # all_tokens.append(sent_token_pairs)
            sent_tokens = [w.lower() for w in word_tokens]
            all_tags.append(sent_tags)
            all_tokens.append(sent_tokens)
    return all_tokens, all_tags


The following function extracts sentences from a file and converts them into Pandas `DataFrame` objects.

In [21]:
column_headers = ['Path', 'SentId', 'WordIdx', 'Word', 'POS', 'Parse', 'Verb', 'VerbSense']

def get_dfs(file):
    '''Gets all dataframes (which correspond to sentences) from a .gold_conll file'''
    dfs = []
    curr_sent = []
    with open(file, encoding='utf-8') as inF:
        for line in inF:
            if line == '\n' and curr_sent:
                num_srl = len(curr_sent[0]) - len(column_headers)
                local_column_headers = column_headers + ['SRL' + str(ii) for ii in range(num_srl)] 
                df = pd.DataFrame(curr_sent, columns=local_column_headers)
                dfs.append(df)
                #Reset for next sentence
                curr_sent = []
            else:
                curr_sent.append(line.strip().split())
    return dfs
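
The sentence-splitting idea behind `get_dfs()` can be sketched without pandas. This is an illustration under assumptions, not the function used in this lab: the name `split_sentences` and the rows in `demo_lines` are made up. Unlike `get_dfs()`, which only emits a sentence when it reaches a blank line, this sketch also keeps a final sentence that is not followed by a blank line.

```python
def split_sentences(lines):
    """Group whitespace-separated rows into sentences, using blank lines as separators."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.split())
        elif current:
            sentences.append(current)
            current = []
    if current:  # keep a final sentence that is not followed by a blank line
        sentences.append(current)
    return sentences


# Made-up rows in the same column layout as the CoNLL example above
demo_lines = [
    "doc 0 0 He PRP * - - (ARG0*)\n",
    "doc 0 1 left VBD * leave leave.01 (V*)\n",
    "\n",
    "doc 1 0 Stop VB * stop stop.01 (V*)\n",
]
sentences = split_sentences(demo_lines)

n_fixed_columns = 8  # Path, SentId, WordIdx, Word, POS, Parse, Verb, VerbSense
print(len(sentences))                           # number of sentences found
print(len(sentences[0][0]) - n_fixed_columns)   # number of SRL columns in the first sentence
```

The number of SRL columns is whatever remains after the eight fixed columns, which is how `get_dfs()` decides how many `SRL` headers to create for each sentence.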

Code to make sure things are working properly. Please have a look at `all_tokens` and `all_tags` returned by `df2iob()`.

In [22]:
check_file = os.path.join(ontonotes_path, dev_files[0])
dfs = get_dfs(check_file)
all_tokens, all_tags = df2iob(dfs[3])
assert len(all_tokens) == 3
assert len(all_tags) == 3
# assert all_tokens[2] == [('the', 0), ('yugoslav', 0), ('election', 0), ('commission', 0), ('claims', 0), ('he', 0), ('did', 0), ('not', 0), ('win', 1), ('more', 0), ('than', 0), ('50', 0), ('%', 0), ('of', 0), ('the', 0), ('vote', 0), ('.', 0)]
assert all_tags[2] == ['O', 'O', 'O', 'O', 'O', 'B-ARG0', 'O', 'B-ARGM', 'B-PRED', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'O']
# assert all_tokens[0] == [('the', 0), ('yugoslav', 0), ('election', 0), ('commission', 0), ('claims', 1), ('he', 0), ('did', 0), ('not', 0), ('win', 0), ('more', 0), ('than', 0), ('50', 0), ('%', 0), ('of', 0), ('the', 0), ('vote', 0), ('.', 0)]
assert all_tags[0] == ['B-ARG0', 'I-ARG0', 'I-ARG0', 'I-ARG0', 'B-PRED', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'O']
print('Success!')

Success!


In [23]:
all_tokens[0]

['the',
 'yugoslav',
 'election',
 'commission',
 'claims',
 'he',
 'did',
 'not',
 'win',
 'more',
 'than',
 '50',
 '%',
 'of',
 'the',
 'vote',
 '.']

In [24]:
all_tags[0]

['B-ARG0',
 'I-ARG0',
 'I-ARG0',
 'I-ARG0',
 'B-PRED',
 'B-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'I-ARG1',
 'O']

## Exercise 1: Building a dataset

### Exercise 1.1
rubric = {accuracy:1}

Using the functions `get_dfs()` and `df2iob()` provided in **Initial Data Processing** above, write a `prepare_SRL_data` function that takes as input a list of .gold_conll files and creates two Python lists, `input_words` and `output_tags`. Each list should contain one element for every sentence in the input files; a sentence with several predicates contributes one element per predicate.

`input_words` should be a list of lists of strings representing the words in the input, as returned by `df2iob()`.

`output_tags` should be a list of lists of IOB tags corresponding to the words in `input_words`.

**HINT:** Check how to use the functions `get_dfs()` and `df2iob()` in the assert cell above.

In [25]:
def prepare_SRL_data(srl_files):
    '''create IOB-data for all the SRL-tagged Ontonotes files in srl_files. 
    
       Input: list of file names
       Output: (input_words, output_tags) 
       
       input_words is a list of lists, one list for every (sentence, predicate) pair in
                    the data. Each list consists of the lowercased words of the sentence.
       output_tags  is a list of lists, one list for every (sentence, predicate) pair in
                    the data. Each list consists of IOB tags.
    '''
    input_words, output_tags = [], []

    for fn in srl_files:
        # Build the full path to the SRL file from the base directory
        srl_file = os.path.join(ontonotes_path, fn)
        # One DataFrame per sentence
        dfs = get_dfs(srl_file)
        # df2iob() returns (all_tokens, all_tags), with one entry per predicate in the sentence
        annotated_sents = [df2iob(df) for df in dfs]
        for sent_words, sent_tags in annotated_sents:
            # += extends the flat lists, so each predicate becomes its own example
            input_words += sent_words
            output_tags += sent_tags


    return input_words, output_tags

In [26]:
input_words, output_tags = prepare_SRL_data(train_files)
assert len(input_words) == 22437
assert len(output_tags) == 22437
print('Success!')

Success!


The full dataset is large, so you might want to run the rest of this notebook with just a smaller slice for now. Feel free to adjust the number in the cell below to your needs. The sliced lists are stored in the new variables `words` and `tags`.

In [27]:
limit = 50 #number of examples to keep; increase this for better results
words = input_words[:limit]
tags = output_tags[:limit]

### Exercise 1.2
rubric={accuracy:3, quality:1}

Next you need to create a dataset, consisting of inputs paired with outputs. Inputs are words, and the outputs are numbers representing semantic role labels, which the model must learn to predict. Before training, the inputs need to be tokenized. You can get a tokenizer that is appropriate for BERT through HuggingFace's `AutoTokenizer` object, by passing the name of a BERT model.

In [28]:
model_name = "google/bert_uncased_L-2_H-128_A-2"
#This model was chosen because it is small. You can experiment with larger models if you like. 
tokenizer = AutoTokenizer.from_pretrained(model_name)

To use the tokenizer, simply call it and pass a list of strings.

In [29]:
#Experiment with the tokenizer and print out a few things before going any further, and be comfortable with the output format
t = tokenizer(['Hello'])
print(t)
t2 = tokenizer(['Hello', 'Goodbye'])
print(t2)
t3 = tokenizer(['Good morning everbody'])
print(t3)
t4 = tokenizer([['Good', 'morning', 'everybody']], is_split_into_words=True)
print(t4)

{'input_ids': [[101, 7592, 102]], 'token_type_ids': [[0, 0, 0]], 'attention_mask': [[1, 1, 1]]}
{'input_ids': [[101, 7592, 102], [101, 9119, 102]], 'token_type_ids': [[0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
{'input_ids': [[101, 2204, 2851, 2412, 23684, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
{'input_ids': [[101, 2204, 2851, 7955, 102]], 'token_type_ids': [[0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1]]}


For the output labels, you can use the following dictionaries to convert semantic role tags to numbers, or the other way around.

In [30]:
full_tag_list = list(set([tag for sent_tags in output_tags for tag in sent_tags]))
tag_to_id = {tag:i for (i,tag) in enumerate(full_tag_list)}
id_to_tag = {i:tag for (i,tag) in enumerate(full_tag_list)}
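
One caveat with `list(set(...))`: the iteration order of a set of strings can differ between Python sessions, so the same tag may receive a different id after a kernel restart. The sketch below shows one way to get a stable mapping by sorting the tags first. It is an optional variation, not a required change, and the `demo_` names and tag sequences are made up so that they do not overwrite the notebook's own variables.

```python
# Made-up tag sequences standing in for output_tags
demo_output_tags = [['B-ARG0', 'I-ARG0', 'B-PRED', 'O'], ['O', 'B-PRED', 'B-ARG1']]

# Sorting the unique tags gives the same tag-to-id mapping on every run
demo_tag_list = sorted({tag for sent_tags in demo_output_tags for tag in sent_tags})
demo_tag_to_id = {tag: i for i, tag in enumerate(demo_tag_list)}
demo_id_to_tag = {i: tag for tag, i in demo_tag_to_id.items()}

# Round trip: tags -> ids -> tags
encoded = [demo_tag_to_id[tag] for tag in demo_output_tags[0]]
decoded = [demo_id_to_tag[i] for i in encoded]
print(demo_tag_list)
print(encoded)
print(decoded)
```

A stable mapping matters mostly if you save a trained model and reload it later, because the model's output indices only make sense with the mapping used during training.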

Now you can put this all together with a Dataset object from HuggingFace for this task. To create a Dataset, you need to convert your data into a list of dictionaries, where each dictionary has two keys:

- `words` is a list of strings representing input words (not yet tokenized)
- `tags` is a list of integers representing semantic role tags

In [31]:
#This is an illustration of the format you're looking for, but of course you should generate it by using a for loop or list comprehension
data_to_tokenize = [              
{'words': ['opposition', 'leaders','claim', 'milosevic', 'rigged','last','week', "'s",'presidential','vote','.'],
 'tags': [0, 0, 0, 5, 11, 1, 4, 4, 4, 4, 0]}, #first sentence
{'words': [], 
 'tags': []}, #second sentence
{'words': [], 
 'tags':[]} #third sentence, etc.
]

data_to_tokenize = [{'words': word, 'tags': [tag_to_id[t] for t in tag]} for (word,tag) in zip(words, tags)]
dataset = Dataset.from_list(data_to_tokenize) 

Now, write a `process_data` function that applies the tokenizer to the dataset. You will need to look up additional arguments that enable the tokenizer to (1) pad the input, (2) truncate the input, (3) allow a maximum of 32 tokens. 
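
As a rough picture of what padding and truncation do to a sequence of token ids, here is a small pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and the real tokenizer also takes care of special tokens such as the final `[SEP]` when truncating, which this sketch ignores. The ids are taken from the tokenizer outputs printed earlier in this notebook.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = input_ids[:max_length]          # truncation: drop anything past max_length
    n_pad = max_length - len(kept)         # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad
    return kept + [pad_id] * n_pad, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2204, 2851, 2412, 23684, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0).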

The tokenizer may split a word into more than one token, so you will also need to write some code to 're-align' the output and map each of the sub-word tokens to the same output label. There are some hints in the code below. 

You can find more information in the Tokenizer documentation [here](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__), and you can also consult the BERT notebook from Lecture 2. 
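
The re-alignment step can be tried out on hand-written `word_ids` before involving the tokenizer. The sketch below contrasts two common strategies: giving the word's label to every sub-word token, or giving it only to the first sub-word token and marking the rest with -100. The function names, the `word_ids` list, and the label ids are all made up for illustration.

```python
IGNORE_INDEX = -100  # the value this notebook uses for positions the model should ignore


def align_all_subwords(labels, word_ids):
    """Every sub-word token inherits the label of the word it came from."""
    return [IGNORE_INDEX if w is None else labels[w] for w in word_ids]


def align_first_subword(labels, word_ids):
    """Only the first sub-word token of each word keeps the label."""
    aligned, previous = [], None
    for w in word_ids:
        if w is None or w == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(labels[w])
        previous = w
    return aligned


# Three words; the second word was split into two sub-word tokens.
# None marks special tokens such as [CLS], [SEP], or padding.
demo_word_ids = [None, 0, 1, 1, 2, None]
demo_labels = [5, 7, 2]

print(align_all_subwords(demo_labels, demo_word_ids))
print(align_first_subword(demo_labels, demo_word_ids))
```

Either strategy produces one label per token, which is what the model needs; they differ only in whether continuation tokens contribute to the training signal.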

In [32]:
def process_data(data):
    #padding='max_length' pads every sentence to exactly 32 tokens, so all batches have the same shape
    tokenized_data = tokenizer(data['words'], is_split_into_words=True, padding='max_length', truncation=True, max_length=32, return_tensors="pt")
    all_labels = []
    #your code here for aligning labels goes below
    #HINTS:
    #for each sentence, you need to find which input word each token came from, so the token can be given that word's tag
    #you can find the word index of every token in sentence N like this:
    # word_ids = tokenized_data.word_ids(batch_index=N)
    #you can pass word_ids to the `align_labels` function below to get back a list of labels for the tokens of that sentence, and append that to all_labels
    for i, label in enumerate(data["tags"]):
        word_ids = tokenized_data.word_ids(batch_index=i)
        aligned_labels = align_labels(label, word_ids)
        all_labels.append(aligned_labels)
    tokenized_data["labels"] = all_labels
    return tokenized_data

def align_labels(labels, word_ids):
    new_labels = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            #This means there's nothing to label, e.g. it's a pad token
            #(-100) is a special value that the model will ignore
            new_labels.append(-100)
        elif word_idx != previous_word_idx:
            new_labels.append(labels[word_idx])
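        else:
            #A later sub-word token of the same word: also ignored, so only the first sub-word token carries the label
            new_labels.append(-100)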
        previous_word_idx = word_idx
    return new_labels

To generate a dataset with your function, run the next cell

In [33]:
tokenized_dataset = dataset.map(process_data, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['words', 'tags', 'token_type_ids']) #These are unnecessary for training

In [34]:
#Examine the output
tokenized_dataset[0]

{'input_ids': [101, 8936, 2078, 2343, 22889, 16429, 13390, 2078, 26038, 17726, 2003, 5432, 1997, 1996, 8465, 1997, 2019, 4559, 3377, 1999, 2019, 9046, 19550, 2602, 1012, 102, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 'labels': [-100, 3, -100, 3, 3, -100, -100, -100, 3, -100, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -100, -100, -100, -100, -100, -100, -100]}

In [35]:
N = len(tokenized_dataset)
assert len(tokenized_dataset['input_ids']) == len(tokenized_dataset['labels']) == len(tokenized_dataset['attention_mask'])
assert all(isinstance(label, int) for label in tokenized_dataset['labels'][0]), 'Labels were not converted to integers'
assert all(len(tokenized_dataset[n]['input_ids']) <= 32 for n in range(N)), 'Truncation failed'
assert any((-100) in tokenized_dataset[n]['labels'] for n in range(N)), 'Label alignment failed'
assert any(0 in tokenized_dataset[n]['input_ids'] for n in range(N)), 'Padding failed'

Finally, split the data into train, validation, and test sets.

In [36]:
train_test = tokenized_dataset.train_test_split(test_size=0.1) # hold out 10% of the data for validation and testing
test_valid = train_test["test"].train_test_split(test_size=0.05) # split the held-out data: 95% for validation, 5% for testing
final_dataset = DatasetDict({
    "train": train_test["train"], #training data
    "eval": test_valid["train"], #for evaluation/validation during the training
    "test": test_valid["test"] #for testing after training is complete
})
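
To see how many examples end up in each split, the arithmetic can be sketched as follows. This assumes the test share is rounded up to a whole number of examples, which is an assumption about the library's rounding rather than something stated in this notebook; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_examples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_examples * test_size)
    return n_examples - n_test, n_test


# First split: 10% of the data is held out
n_train, n_held_out = split_sizes(50, 0.1)
# Second split: 5% of the held-out data becomes the test set, the rest is validation
n_eval, n_test = split_sizes(n_held_out, 0.05)
print(n_train, n_eval, n_test)
```

With only 50 examples the validation and test sets are tiny, so metrics computed on them will be very noisy; a larger `limit` gives more meaningful numbers.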

## Exercise 2: Training BERT

### Exercise 2.1
rubric = {accuracy:1}

The next step is to set up BERT for training. Some code is already provided for you below, but you must make the following changes:

- Modify the TrainingArguments object so the model trains for 5 epochs
- Create a Trainer object, by passing it (1) a model, (2) a tokenizer, (3) a train set, (4) an eval set, (5) training arguments. 

When you run the next cell, you will probably get a warning that BERT needs to be trained. You can ignore it; you will be doing the training shortly.

If you get an error running the next cell because of a failure to load the 'accelerate' library, follow the instructions to `pip install` the correct library, then restart your notebook kernel for this to take effect.

You can find more details about the Trainer object in the [transformers documentation here](https://huggingface.co/docs/transformers/v4.48.0/en/main_classes/trainer#transformers.Trainer), and you can also consult the notebook from Lecture 2.

In [47]:
num_srl_tags = len(full_tag_list) #set this equal to the actual number of output labels in your dataset
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_srl_tags)

#generic method to give average label accuracy during training, no changes required
#note: positions labelled -100 (special tokens, padding, later sub-word tokens) are included in the mean
#and can never match a prediction, so the reported accuracy understates the accuracy on real labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred

    # Convert logits and labels to PyTorch tensors (if not already)
    logits = torch.tensor(logits)
    labels = torch.tensor(labels)

    # Get predictions and ensure they are tensors
    predictions = torch.argmax(logits, dim=-1)

    # Calculate accuracy
    accuracy = (predictions == labels).float().mean()
    return {"accuracy": accuracy.item()}


#update this so that the model trains for 5 epochs
training_args = TrainingArguments(
    output_dir="./results", #you can change this to any directory you like
    evaluation_strategy="epoch", #recent transformers releases rename this argument to eval_strategy
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
    report_to='none',
    # no_cuda=True,  # Disable GPU usage, ensuring that no `accelerate` optimizations are applied
)

#fill in the appropriate arguments here
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=final_dataset['train'],
    eval_dataset=final_dataset['eval'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    )

Some weights of BertForTokenClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Assertions to check your code.

In [48]:
assert len(trainer.train_dataset) == len(final_dataset['train'])
assert len(trainer.eval_dataset) == len(final_dataset['eval'])
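
The `compute_metrics` function above compares predictions with labels at every position, including the positions labelled -100. Since the model never predicts -100, those positions always count as errors. The pure-Python sketch below shows the effect of skipping them. `masked_accuracy` is a hypothetical helper operating on plain lists, and the prediction and label values are made up.

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Accuracy over positions whose label is not ignore_index."""
    correct = total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # special token, padding, or later sub-word token
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# One made-up sentence of four tokens; only the two middle tokens carry real labels
demo_predictions = [[1, 2, 0, 0]]
demo_labels = [[-100, 2, 1, -100]]

unmasked = sum(p == l for p, l in zip(demo_predictions[0], demo_labels[0])) / 4
print(unmasked)
print(masked_accuracy(demo_predictions, demo_labels))
```

The same idea carries over to tensors by building a boolean mask from `labels != -100` and indexing both predictions and labels with it before taking the mean.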

The next step actually trains the model. Depending on the size of the dataset you are using and the number of epochs you chose above, this can take anywhere from minutes to hours. If you want to play more with the model parameters, a GPU is recommended. For the purpose of this lab, you can train the model on a few hundred examples for 5 epochs. The results will not be fantastic, but you can complete the lab on an ordinary laptop.

In [49]:
history = trainer.train()

{'eval_loss': 2.607465982437134, 'eval_accuracy': 0.109375, 'eval_runtime': 0.0079, 'eval_samples_per_second': 505.566, 'eval_steps_per_second': 126.392, 'epoch': 1.0}


{'eval_loss': 2.576277732849121, 'eval_accuracy': 0.15625, 'eval_runtime': 0.0069, 'eval_samples_per_second': 577.529, 'eval_steps_per_second': 144.382, 'epoch': 2.0}


{'eval_loss': 2.5534257888793945, 'eval_accuracy': 0.203125, 'eval_runtime': 0.0085, 'eval_samples_per_second': 473.251, 'eval_steps_per_second': 118.313, 'epoch': 3.0}
{'loss': 2.5855, 'learning_rate': 6.666666666666667e-06, 'epoch': 3.33}


{'eval_loss': 2.5392467975616455, 'eval_accuracy': 0.2265625, 'eval_runtime': 0.0071, 'eval_samples_per_second': 561.863, 'eval_steps_per_second': 140.466, 'epoch': 4.0}


{'eval_loss': 2.533625364303589, 'eval_accuracy': 0.234375, 'eval_runtime': 0.0074, 'eval_samples_per_second': 539.669, 'eval_steps_per_second': 134.917, 'epoch': 5.0}
{'train_runtime': 0.5994, 'train_samples_per_second': 375.384, 'train_steps_per_second': 25.026, 'train_loss': 2.5679574330647785, 'epoch': 5.0}


In [50]:
assert history.metrics['epoch'] >= 5

## Exercise 3: Testing the model

### Exercise 3.1
rubric={accuracy:1}

Now you will test the model in two different ways. 

First, run a full evaluation over the test set created earlier in the notebook. To do this, look up how to use the Trainer object's `.evaluate()` method in the transformers documentation. Save the results in a variable called `test_metrics`.

In [51]:
test_metrics = trainer.evaluate(final_dataset['test'])

In [52]:
print(test_metrics)

{'eval_loss': 2.5009238719940186, 'eval_accuracy': 0.25, 'eval_runtime': 0.017, 'eval_samples_per_second': 58.84, 'eval_steps_per_second': 58.84, 'epoch': 5.0}


In [53]:
assert test_metrics['epoch'] == history.metrics['epoch']

How well did the model do on the test set, compared to the training set? Write a simple if-else block that compares the training and testing loss values, and prints out which one is better.

In [54]:
#a lower loss is better
if history.training_loss < test_metrics['eval_loss']:
    print('training is better')
else:
    print('testing is better')

testing is better


### Exercise 3.2
rubric={accuracy:2}

Lastly, you will run 'inference' on the model, by giving it specific inputs and checking its predictions. To get 1 point on this exercise, write a function that takes a list of strings as input, tokenizes them, and then returns the predicted labels.  To get 2 points on this exercise, the function should also re-align the predictions with full words from the input. Some of this code is already supplied for you.
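
Re-aligning predictions with full words means going from one prediction per token back to one prediction per word. A minimal sketch of that step on hand-written data, keeping the prediction of the first sub-word token of each word, is shown below. The helper name `collapse_to_words`, the `word_ids`, and the tags are all made up for illustration; in the exercise they would come from the tokenizer and the model.

```python
def collapse_to_words(words, word_ids, token_tags):
    """Keep the tag predicted for the first sub-word token of each word."""
    predictions, previous = [], None
    for word_id, tag in zip(word_ids, token_tags):
        # Skip special tokens (None) and later sub-word tokens of the same word
        if word_id is not None and word_id != previous:
            predictions.append((words[word_id], tag))
        previous = word_id
    return predictions


# Three words; the last one is split into two sub-word tokens
demo_words = ['good', 'morning', 'everbody']
demo_word_ids = [None, 0, 1, 2, 2, None]
demo_token_tags = ['O', 'B-ARGM', 'I-ARGM', 'B-ARG0', 'I-ARG0', 'O']

print(collapse_to_words(demo_words, demo_word_ids, demo_token_tags))
```

The result has exactly one `(word, tag)` pair per input word, regardless of how many sub-word tokens the tokenizer produced.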

In [55]:
def predict_labels(sentence):
    model.eval()  # Set model to evaluation mode

    # Tokenize the input sentence, update this appropriately. You can re-use code from the tokenization in Exercise 1.2
    # The inputs are moved to the model's device so this also works after training on a GPU
    tokenized_sentence = tokenizer(sentence, is_split_into_words=True, return_tensors="pt", padding='max_length', truncation=True, max_length=64).to(model.device)

    # Make predictions, no changes needed
    with torch.no_grad():
        outputs = model(**tokenized_sentence)
        logits = outputs.logits 
    predicted_label_ids = logits.argmax(dim=-1).squeeze() # Shape: (sequence_length,)
    predicted_label_ids = [int(id) for id in predicted_label_ids]

    #your code for formatting the return value goes here
    #HINTS:
    #Use the align_labels() function from earlier as a guide
    #You will once again need to loop over token IDs using tokenized_sentence.word_ids(batch_index=0)
    #the batch_index will always be 0 this time, because there's only 1 input sentence to this function

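    word_ids = tokenized_sentence.word_ids(batch_index=0)
    predictions = []
    previous_word_id = None
    for idx, word_id in enumerate(word_ids):
        #Skip special/pad tokens, and keep only the first sub-word token of each word
        if word_id is not None and word_id != previous_word_id:
            label = id_to_tag[predicted_label_ids[idx]]
            predictions.append((sentence[word_id], label))
        previous_word_id = word_id
    return predictions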

sentence = ["I", "like", "swimming", "in", "the", "lake", "sometimes"]
print(predict_labels(sentence))

[('I', 'O'), ('like', 'O'), ('swimming', 'B-ARG3'), ('in', 'O'), ('the', 'O'), ('lake', 'B-ARG4'), ('sometimes', 'I-ARG2')]


In [56]:
prediction = predict_labels(["I", "like", "swimming"])
assert not all(isinstance(id, int) for id in prediction), 'Failed to convert back to label names'
assert isinstance(prediction[0], tuple)
assert prediction[0][1][0] in ['B', 'I', 'O']