# Named Entity Recognition with BERT

In this assignment, we will tackle **named entity recognition using BERT**.

We will use the MIT Movie dataset,
which contains user-generated queries about films.
Each sentence is annotated for the presence of various movie-specific entities such as ACTOR, PLOT, RATING, ...
The tagging is performed using the **IOB** (Inside, Outside, Begin) tagging scheme.

Here are two sample sentences from the dataset (labels and words are separated by tabs, sentences are separated by empty lines).

```
O	list
O	the
B-RATINGS_AVERAGE	five
I-RATINGS_AVERAGE	star
O	rated
O	movies
O	starring
B-ACTOR	mel
I-ACTOR	gibson

O	what
B-GENRE	science
I-GENRE	fiction
O	films
O	have
O	come
O	out
B-YEAR	recently
```
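
To make the IOB scheme concrete, here is a minimal sketch that groups tagged words back into entity spans. The helper name `iob_to_spans` is hypothetical (it is not part of the assignment code), and it assumes that a stray `I-` tag without a matching open entity is simply dropped.

```python
def iob_to_spans(words, tags):
    """Group IOB-tagged words into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_type is not None:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity that is currently open.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes the open entity, if any.
            if current_type is not None:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ["list", "the", "five", "star", "rated", "movies", "starring", "mel", "gibson"]
tags = ["O", "O", "B-RATINGS_AVERAGE", "I-RATINGS_AVERAGE", "O", "O", "O", "B-ACTOR", "I-ACTOR"]
print(iob_to_spans(words, tags))
# [('RATINGS_AVERAGE', 'five star'), ('ACTOR', 'mel gibson')]
```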

# [1] Task and Dataset preprocessing

As usual, let's start by downloading and preprocessing the dataset.

In [1]:
import os
if not os.path.exists("mit_movie.zip"):
    ! wget -O mit_movie.zip http://ailab.uniud.it/wp-content/uploads/2020/09/mit_movie.zip
    ! unzip mit_movie.zip -d .
    ! ls -l


## [1.1] Loading Dataset

Let's first load the datasets and create lists of sentences and their respective labels:
`train_sents`, `train_labels`, `test_sents` and `test_labels`.

In [2]:
from tqdm.auto import tqdm

def get_sents_and_labels_from_tsv(tsv_path):

    with open(tsv_path, "r") as f:
        lines = f.readlines()
    lines = [line.strip() for line in lines]

    output_sents = []
    output_labels = []

    current_tokens = []
    current_labels = []

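    for line in tqdm(lines):
        if line == "":
            # An empty line closes the current sentence (repeated blank lines are skipped)
            if current_tokens:
                output_sents.append(current_tokens)
                output_labels.append(current_labels)
            current_tokens = []
            current_labels = []
        else:
            label, text = line.split("\t")
            current_tokens.append(text) # FILL WITH CODE
            current_labels.append(label) # FILL WITH CODE

    # Keep the last sentence when the file does not end with an empty line
    if current_tokens:
        output_sents.append(current_tokens)
        output_labels.append(current_labels)

    return output_sents, output_labels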


In [3]:
train_sents, train_labels = get_sents_and_labels_from_tsv("engtrain.bio.txt")

print("Number of train samples", len(train_sents))
print()
print("First train sentence and its labels")
print(train_sents[0])
print(train_labels[0])


In [None]:
test_sents, test_labels = get_sents_and_labels_from_tsv("engtest.bio.txt")

print("Number of test samples", len(test_sents))
print()
print("First test sentence and its labels")
print(test_sents[0])
print(test_labels[0])

## [1.2] Looking at the entities
Let's now take a closer look at the kinds of entities present in the datasets: we are going to count the frequency of each entity type in the train and test sets.

In [None]:
from collections import Counter

all_test_labels = []

for sent in test_labels:
    for label in sent:
        label = label if label == "O" else label[2:]
        all_test_labels.append(label)
        
all_train_labels = []

for sent in train_labels:
    for label in sent:
        label = label if label == "O" else label[2:]
        all_train_labels.append(label)

most_common_test = Counter(all_test_labels).most_common()
most_common_train = Counter(all_train_labels).most_common()

matrix = []
for i,j in zip(most_common_train, most_common_test):
    matrix.append([i[0],i[1], j[0], j[1]])

import pandas as pd
most_common_df = pd.DataFrame(matrix,
                              columns=["train_labels", "train_labels_freq",
                                       "test_labels", "test_labels_freq"])
display(most_common_df)

There are a lot of entities in this dataset!

We might want to **remove some of the least frequent labels, or focus on one kind of entity in particular**.

In this case we are not going to remove anything, so the next cell will have no effect. But you can try and add labels to the `labels_to_remove` list and see how that influences the performance of the final classifier.

In [None]:
labels_to_remove = [] #["SONG", "REVIEW", "TRAILER", "CHARACTER"]
for del_label in labels_to_remove:
    
    for i, sent in enumerate(train_labels):
        for j, label in enumerate(sent):
            if label == del_label or label[2:] == del_label:
                train_labels[i][j] = "O"
    
    for i, sent in enumerate(test_labels):
        for j, label in enumerate(sent):
            if label == del_label or label[2:] == del_label:
                test_labels[i][j] = "O"
keep_IB = True
if not keep_IB:
    for i, sent in enumerate(train_labels):
        for j, label in enumerate(sent):
            if "-" in label:
                train_labels[i][j] = label[2:]
    
    for i, sent in enumerate(test_labels):
        for j, label in enumerate(sent):
            if "-" in label:
                test_labels[i][j] = label[2:]

# [2] Pytorch BERT

BERT (Bidirectional Encoder Representations from Transformers), released in late 2018, is a deep pre-trained architecture for language modelling.

You can either use these models to extract high quality language features from your text data, or you can **fine-tune** them on a **specific task** (classification, entity recognition, question answering, etc.) with your own data to produce state of the art predictions.

When a model is **pre-trained**, it is provided with **all of its layers already trained for a specific task** (in this case language modelling) on some large corpus (in this case a dump of Wikipedia and other resources).

We can simply **add an untrained layer of neurons on the end of the model**, and train the new model for our named entity recognition task, leveraging the knowledge acquired during pre-training.


$$
\begin{array}{ccccccc|c}
\\
\boxed{\small{\text{O}}} & \boxed{\small{\text{O}}} & \boxed{\small{\text{O}}} & \boxed{\small{\text{O}}} & \boxed{\small{\text{B-ACT}}} & \boxed{\small{\text{I-ACT}}} & \boxed{\small{\text{O}}}
& \small{\text{output NER labels}}\\
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\ \hline
&&&\textbf{Classifier}&&& & \small{\text{untrained layer}}\\ \hline
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\
\boxed{H_{\small{\text{[CLS]}}}} & \boxed{H_\text{what}} & \boxed{H_\text{movies}} & \boxed{H_\text{star}} & \boxed{H_\text{bruce}} & \boxed{H_\text{willis}} & \boxed{H_{\small{\text{[SEP]}}}} & \small{\text{output embeddings}}\\
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\ \hline
&&&\textbf{Layer 12}&&& & \small{\text{BERT pretrained layer}} \\ \hline
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\ \hline
&&&\textbf{Layer 2}&&& & \small{\text{BERT pretrained layer}} \\ \hline
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\ \hline
&&&\textbf{Layer 1}&&& & \small{\text{BERT pretrained layer}} \\ \hline
\uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow & \uparrow \\
\boxed{\small{\text{[CLS]}}} & \boxed{\text{what}} & \boxed{\text{movies}} & \boxed{\text{star}} & \boxed{\text{bruce}} & \boxed{\text{willis}} & \boxed{\small{\text{[SEP]}}} & \small{\text{input}} \\
\\
\end{array}
$$

Let's install the [transformers](https://github.com/huggingface/transformers) package from Hugging Face which will give us a pytorch interface for working with BERT.

At the moment, the Hugging Face library seems to be the most widely accepted and powerful pytorch interface for working with BERT. It also includes pre-built modifications of these models suited for specific tasks. For example, in this lab we will use `BertForTokenClassification` (but there are also other classes for sequence classification, question answering, next sentence prediction, etc.).

In [None]:
!pip install 'transformers==3.0.0'

In [None]:
import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
set_seed(42)

# [3] Tokenization

## [3.1] About BERT tokenization

To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.

The tokenization must be performed by the tokenizer **included with BERT**, which performs a special kind of tokenization called **WordPiece tokenization**. This makes it possible to map virtually any word to a sequence of known tokens, so the special "out of vocabulary" or "unknown" token is rarely needed.

The cell below will download the tokenizer for us. We'll be using the "uncased" version of BERT (`bert-base-uncased`) in this example, meaning that the model was pre-trained on text that was only lower-case.

When creating the `tokenizer` we need to specify both the kind of BERT model we want to use and the fact that we want the tokenizer to convert every string to lower case (`do_lower_case=True`), otherwise the `tokenizer` will produce tokens that the model will not understand.

In [None]:
import torch
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Let's apply the tokenizer to one sentence just to see the output.


In [None]:
# Print the original sentence.
sent = "I like eating strawberries with my friend Mike."
print('Original: ', sent)

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sent))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)))


As we can see, the tokenizer converts every word to lower case. It also **splits some words into smaller pieces**, called wordpieces: the word `strawberries` becomes the two tokens `straw` and `##berries`, where the `##` prefix marks a piece that continues the previous one. This helps the BERT model to understand new words by breaking them into smaller pieces and analyzing them separately.

The last line shows the numericalization of the tokens: each number is the index of that token in the vocabulary of the model, similarly to what happened in the first bag-of-words approach.
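
The splitting and numericalization steps can be sketched in plain Python. This is only an illustration of greedy longest-match-first splitting under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary is a made-up toy, and the real `BertTokenizer` also handles punctuation, casing and other normalization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not BERT's real one).
toy_vocab = {"[UNK]": 0, "i": 1, "like": 2, "straw": 3, "##berries": 4, "##s": 5, "apple": 6}

tokens = []
for w in "i like strawberries".split():
    tokens.extend(wordpiece_tokenize(w, toy_vocab))
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['i', 'like', 'straw', '##berries']
print(ids)     # [1, 2, 3, 4]
```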

Apart from tokenizing the words and numericalizing them, we also need to use some **special tokens** which were included in the pretraining of BERT.

- **`[SEP]`**: signals to BERT the end of the sentence. It is especially useful in tasks where you need to give two sentences as an input (e.g. entailment).

- **`[CLS]`**: in classification tasks we must prepend the special `[CLS]` token to the beginning of every sentence.
BERT consists of 12 Transformer layers. Each transformer takes in a list of token embeddings, and produces the same number of embeddings on the output (but with the feature values changed).
If we were performing sentence classification then in the output of the final (12th) transformer, *only the first embedding (corresponding to the [CLS] token) would be used by the classifier*.
But we will be performing **token classification**, so we will use the output of all of the tokens.

- **`[PAD]`**: padding token, used to make all sequences have the same length. Padding tokens will also be ignored when calculating the loss of the model.

## [3.2] Tokenizing the dataset

Now that we know more about how BERT tokenizers work, let's tokenize our input texts.

The `transformers` library offers a lot of automated functions to tokenize texts, add special tokens and pad. Unfortunately, we cannot make use of them.

We are dealing with a very special case: we have **one label for each word in the text**, and each word might get split into subtokens by the wordpiece tokenizer of BERT. This is something we need to keep track of by hand: **we need to tokenize the text and the labels at the same time** to make sure that the lists have the same length!

Here is an example tagging FRUITS:

|| | | | | | | |
|--|--------------|-----|-------|--------|---|---|---|
|**Original text and labels**| Strawberries | and | green | apples | . | | |
|| B | O | B | I | O | | |

|| | | | | | | |
|--|--------------|-----|-------|--------|---|---|---|
|**After BERT tokenization** | Straw|##berries | and | green | apple|##s | . |
| | B | I | O | B | I | I | O |

As you can see, "Strawberries" was split into two sub-words, so we had to split its label `B` into two labels: `B` and `I`. The same goes for "apples": its original label `I` was split into two new labels, `I` and `I`.

The function `tokenize_and_preserve_labels` will take care of this.

In [None]:
def tokenize_and_preserve_labels(tokens, labels, bert_tokenizer):
    '''
    Word piece tokenization makes it difficult to match word labels
    back up with individual word pieces. This function tokenizes each
    word one at a time so that it is easier to preserve the correct
    label for each subword. It is, of course, a bit slower in processing
    time, but it will help our model achieve higher accuracy.
    
    See also:
    https://gab41.lab41.org/lessons-learned-fine-tuning-bert-for-named-entity-recognition-4022a53c0d90
    '''

    extended_tokens = []
    extended_labels = []

    for (word, label) in zip(tokens, labels):
        
        # Tokenize the word and count number of subwords the word is broken into
        tokenized_word = bert_tokenizer.tokenize(word)
        n_subwords = len(tokenized_word) # FILL WITH CODE

        # Add the tokenized word to the final tokenized word list
        extended_tokens.extend(tokenized_word) # FILL WITH CODE

        # Add the label to the new list of labels `n_subwords` times
        suffix = ''
        if len(label) > 1:
            suffix = label[1:]  # suffix is the "ACTOR" in "B-ACTOR"

        # if the original label is B -> B I I ...
        if label[0] == 'B':
            extended_labels.extend([f'B{suffix}']+[f'I{suffix}'] * (n_subwords-1)) # FILL WITH CODE
        # if the original label is I -> I I I ...
        # or the original label is O -> O O O ...
        else:
            extended_labels.extend([label]*n_subwords) # FILL WITH CODE
        
    assert(len(extended_labels) == len(extended_tokens))

    CLS = bert_tokenizer.cls_token
    PAD = bert_tokenizer.pad_token
    SEP = bert_tokenizer.sep_token

    # adding special tokens
    extended_tokens = [CLS] + extended_tokens + [SEP]
    extended_labels = ["O"] + extended_labels + ["O"]
    
    return extended_tokens, extended_labels


Great, let's see if this function works by tokenizing the train set.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
train_tokens = []
train_toklabels = []

for sentence, labels in zip(tqdm(train_sents), train_labels):
    toks, toklabels = tokenize_and_preserve_labels(sentence, labels, tokenizer)
    train_tokens.append(toks)
    train_toklabels.append(toklabels)

idx = 7
print("-"*30)
for s,l in zip(train_sents[idx], train_labels[idx]):
    print(f"{s:<15}| {l}")
print("-"*30)
for s,l in zip(train_tokens[idx], train_toklabels[idx]):
    print(f"{s:<15}| {l}")
print("-"*30)

**Expected output:**

```text
------------------------------
do             | O
you            | O
have           | O
any            | O
thrillers      | B-GENRE
directed       | O
by             | O
sofia          | B-DIRECTOR
coppola        | I-DIRECTOR
------------------------------
[CLS]          | O
do             | O
you            | O
have           | O
any            | O
thriller       | B-GENRE
##s            | I-GENRE
directed       | O
by             | O
sofia          | B-DIRECTOR
cop            | I-DIRECTOR
##pol          | I-DIRECTOR
##a            | I-DIRECTOR
[SEP]          | O
------------------------------
```


In [None]:
test_tokens = []
test_toklabels = []

# FILL WITH CODE
for sentence, labels in zip(tqdm(test_sents), test_labels):
    toks, toklabels = tokenize_and_preserve_labels(sentence, labels, tokenizer)
    test_tokens.append(toks)
    test_toklabels.append(toklabels)



idx = 7
print("-"*30)
for s,l in zip(test_sents[idx], test_labels[idx]):
    print(f"{s:<15}| {l}")
print("-"*30)
for s,l in zip(test_tokens[idx], test_toklabels[idx]):
    print(f"{s:<15}| {l}")
print("-"*30)

## [3.3] Padding all texts and labels

We need all input sequences to be of the same length.
Since we have one output label for each token of the sequence, labels will need padding too!
We will pad the sentences with the special `[PAD]` token of the BERT tokenizer. Different tokenizers might use different strings for the special token, so it's always safer to use `tokenizer.pad_token`.
We decide to pad labels with the same padding token.

In [None]:
MAX_SEQ_LEN = 50

def pad_sequence(sequence, max_len, pad_item):
    
    length_of_padding = max_len - len(sequence) # FILL WITH CODE
    
    # If the sequence is too short: pad it
    if length_of_padding >= 0: 
        padding = [pad_item] * length_of_padding # FILL WITH CODE
        out = sequence + padding
    
    # If the sequence is too long: cut it
    elif length_of_padding < 0: 
        out = sequence[:max_len] # FILL WITH CODE
    
    return out
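
Note that cutting a sequence with `sequence[:max_len]` also cuts off its final `[SEP]` token. If that matters for your data, one possible variant is sketched below. The function name `pad_or_truncate` and its `end_item` parameter are hypothetical and are not part of the assignment code.

```python
def pad_or_truncate(sequence, max_len, pad_item, end_item=None):
    """Pad `sequence` to `max_len`, or cut it while keeping a final marker."""
    if len(sequence) <= max_len:
        return sequence + [pad_item] * (max_len - len(sequence))
    out = sequence[:max_len]  # slicing copies, so the input list is untouched
    if end_item is not None and out:
        out[-1] = end_item  # re-insert the end marker that slicing dropped
    return out


tokens = ["[CLS]", "what", "movies", "star", "bruce", "willis", "[SEP]"]
print(pad_or_truncate(tokens, 9, "[PAD]"))
# ['[CLS]', 'what', 'movies', 'star', 'bruce', 'willis', '[SEP]', '[PAD]', '[PAD]']
print(pad_or_truncate(tokens, 5, "[PAD]", end_item="[SEP]"))
# ['[CLS]', 'what', 'movies', 'star', '[SEP]']
```

For the label sequences, the matching call would pass `end_item="O"`, since `[SEP]` is labelled `O`.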

In [None]:
padded_train_tokens = [pad_sequence(t, MAX_SEQ_LEN, tokenizer.pad_token) for t in train_tokens]
padded_train_labels = [pad_sequence(t, MAX_SEQ_LEN, tokenizer.pad_token) for t in train_toklabels]
print(" ".join(padded_train_tokens[0]))
print(" ".join(padded_train_labels[0]))

**Expected output:**
```
[CLS] what movies star bruce willis [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
O O O O B-ACTOR I-ACTOR O [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]```



In [None]:
padded_test_tokens = [pad_sequence(t, MAX_SEQ_LEN, tokenizer.pad_token) for t in test_tokens]
padded_test_labels = [pad_sequence(t, MAX_SEQ_LEN, tokenizer.pad_token) for t in test_toklabels]
print(" ".join(padded_test_tokens[0]))
print(" ".join(padded_test_labels[0]))

**Expected output:**
```
[CLS] are there any good romantic comedies out right now [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
O O O O O B-GENRE I-GENRE O B-YEAR I-YEAR O [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]```


## [3.4] Attention masks

Attention masks are an additional input for the BERT model, used to **distinguish padding tokens from real tokens**.
Inside BERT, the self-attention layers use the mask so that real tokens do not attend to padding positions.

Padding tokens will have a 0 in their mask, while real tokens (including [CLS] and [SEP]) will have a 1.

Example
```
[CLS] are there any good romantic comedies out right now [SEP] [PAD] [PAD] [PAD] [PAD]
1     1   1     1   1    1        1        1   1     1   1     0     0     0     0
```

Attention masks are important: **the loss of the model will be calculated only on the tokens which have a 1 in their mask**. Models will not raise errors or warnings if attention masks are not given as an input; they will instead create a default mask of ones, so padding tokens will be treated as real tokens.
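
The effect of the mask on the loss can be illustrated with a small plain-Python sketch. It is not the library's implementation: `masked_cross_entropy` is a hypothetical helper, and the probabilities below are made-up values.

```python
import math


def masked_cross_entropy(probs, gold, mask):
    """Average negative log-likelihood over positions whose mask is 1."""
    losses = [-math.log(p[g]) for p, g, m in zip(probs, gold, mask) if m == 1]
    if not losses:
        return 0.0  # every position is padding: nothing to average
    return sum(losses) / len(losses)


# Toy predicted distributions over 2 labels for 4 positions (values are made up).
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5]]
gold = [0, 1, -1, -1]  # -1 marks padding labels
mask = [1, 1, 0, 0]    # the last two positions are [PAD]

# Only the first two positions contribute to the average.
print(masked_cross_entropy(probs, gold, mask))
```

With a mask of all ones, the two padding positions would be scored as if they were real tokens, which is exactly the situation the attention mask avoids.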

In [None]:
train_attention_mask = [[0 if t==tokenizer.pad_token else 1 for t in tokens] for tokens in padded_train_tokens]
test_attention_mask = [[0 if t==tokenizer.pad_token else 1 for t in tokens] for tokens in padded_test_tokens] # FILL WITH CODE

for t,m in zip(padded_train_tokens[7],train_attention_mask[7]):
    print(f"{m}  {t}")

# [4] Numericalization and Dataloaders

Now we are ready to turn all the strings (tokens and labels) into numerical data.

## [4.1] Numericalize the tokens

We can easily convert all the tokens to numbers using the `tokenizer`: `tokenizer.convert_tokens_to_ids()` takes as input a list of tokens and returns a list of ints, representing the index of each token in the tokenizer's vocabulary.

In [None]:
padded_train_ids = [tokenizer.convert_tokens_to_ids(t) for t in padded_train_tokens] # FILL WITH CODE
padded_test_ids = [tokenizer.convert_tokens_to_ids(t) for t in padded_test_tokens] # FILL WITH CODE

print(padded_train_ids[0])

## [4.2] Numericalize the labels

This time we are going to need to numericalize the labels too. First of all we will create `label_map` (a "vocabulary" for labels) which will map each label to an int. This map will include a value for the padding label too. We will set the padding label to -1 to make it clearly different from the numericalization of the other real labels.

Let's write a function `create_label_map` to generate a dictionary and map each label to an integer.

In [None]:
def create_label_map(dataset_labels):
    
    all_labels = set()
    
    for sent in dataset_labels:
        for label in sent:
            all_labels.add(label)

    label_map = {label:val for val, label in enumerate(all_labels)} # FILL WITH CODE
    label_map[tokenizer.pad_token] = -1 # FILL WITH CODE
        
    return label_map
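
Because `all_labels` is a set of strings, the order in which labels are enumerated is not guaranteed to be the same from one Python process to the next, so the integers assigned to each label may differ between runs. If you want a reproducible mapping, one option is to sort the labels first. The sketch below uses a hypothetical function name, `create_sorted_label_map`, and takes the padding label as a parameter instead of reading it from the global `tokenizer`. Its numbering will generally not match the ids printed in the expected output of this notebook.

```python
def create_sorted_label_map(dataset_labels, pad_label="[PAD]", pad_value=-1):
    """Map each label to an int in alphabetical order, plus a padding entry."""
    all_labels = sorted({label for sent in dataset_labels for label in sent})
    label_map = {label: idx for idx, label in enumerate(all_labels)}
    label_map[pad_label] = pad_value
    return label_map


toy_labels = [["O", "B-ACTOR", "I-ACTOR"], ["O", "B-GENRE", "O"]]
print(create_sorted_label_map(toy_labels))
# {'B-ACTOR': 0, 'B-GENRE': 1, 'I-ACTOR': 2, 'O': 3, '[PAD]': -1}
```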

In [None]:
label_map = create_label_map(train_labels)
print(len(label_map), "labels")
display(label_map)

**Expected output:**



```
26 labels
{'B-ACTOR': 4,
 'B-CHARACTER': 5,
 'B-DIRECTOR': 21,
 'B-GENRE': 1,
 'B-PLOT': 18,
 'B-RATING': 19,
 'B-RATINGS_AVERAGE': 3,
 'B-REVIEW': 10,
 'B-SONG': 8,
 'B-TITLE': 14,
 'B-TRAILER': 20,
 'B-YEAR': 2,
 'I-ACTOR': 7,
 'I-CHARACTER': 0,
 'I-DIRECTOR': 17,
 'I-GENRE': 16,
 'I-PLOT': 6,
 'I-RATING': 9,
 'I-RATINGS_AVERAGE': 15,
 'I-REVIEW': 24,
 'I-SONG': 22,
 'I-TITLE': 12,
 'I-TRAILER': 13,
 'I-YEAR': 23,
 'O': 11,
 '[PAD]': -1}
```



We can now use this map to convert all the padded labels into their numericalized version!


In [None]:
padded_train_labels_int = [[label_map[l] for l in labels] for labels in padded_train_labels] # FILL WITH CODE
padded_test_labels_int = [[label_map[l] for l in labels] for labels in padded_test_labels] # FILL WITH CODE

print("Actual text and labels\n", train_sents[7], "\n", train_labels[7])
print("\nTokenized text and labels\n", padded_train_tokens[7], "\n", padded_train_labels[7])
print("\nText_ids and numericalized labels:\n", padded_train_ids[7], "\n", padded_train_labels_int[7])

**Expected output:**


```
Actual text and labels
 ['do', 'you', 'have', 'any', 'thrillers', 'directed', 'by', 'sofia', 'coppola'] 
 ['O', 'O', 'O', 'O', 'B-GENRE', 'O', 'O', 'B-DIRECTOR', 'I-DIRECTOR']

Tokenized text and labels
 ['[CLS]', 'do', 'you', 'have', 'any', 'thriller', '##s', 'directed', 'by', 'sofia', 'cop', '##pol', '##a', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] 
 ['O', 'O', 'O', 'O', 'O', 'B-GENRE', 'I-GENRE', 'O', 'O', 'B-DIRECTOR', 'I-DIRECTOR', 'I-DIRECTOR', 'I-DIRECTOR', 'O', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

Text_ids and numericalized labels:
 [101, 2079, 2017, 2031, 2151, 10874, 2015, 2856, 2011, 8755, 8872, 18155, 2050, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 
 [11, 11, 11, 11, 11, 1, 16, 11, 11, 21, 17, 17, 17, 11, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]```



## [4.3] Tensor Datasets and Data Loaders

We can now create our two Tensor Datasets!

We will also add a new field to our Dataset: a **unique id for each sample**. This will be useful later to map all the predictions of the model back to the corresponding sample.
So the fields of our datasets will be, in this order:
1. input ids
2. attention mask
3. labels (numericalized, one for each token)
4. unique id for the text sample

In [None]:
from torch.utils.data import TensorDataset

input_ids_train = torch.LongTensor(padded_train_ids)
attention_masks_train = torch.LongTensor(train_attention_mask)
labels_train = torch.LongTensor(padded_train_labels_int)
sample_ids_train = torch.LongTensor(range(len(input_ids_train)))

train_dataset = TensorDataset(input_ids_train, attention_masks_train, labels_train, sample_ids_train)

input_ids_test = torch.LongTensor(padded_test_ids) # FILL WITH CODE
attention_masks_test = torch.LongTensor(test_attention_mask) # FILL WITH CODE
labels_test = torch.LongTensor(padded_test_labels_int) # FILL WITH CODE
sample_ids_test = torch.LongTensor(range(len(input_ids_test))) # FILL WITH CODE

test_dataset = TensorDataset(input_ids_test, attention_masks_test, labels_test, sample_ids_test) # FILL WITH CODE

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training
batch_size = 32

# Create the DataLoaders for our train and test sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# Create the test_dataloader on the test_dataset
# Use a SequentialSampler instead of a RandomSampler to pull out batches sequentially
test_dataloader = DataLoader(
            test_dataset, # FILL WITH CODE
            sampler = SequentialSampler(test_dataset), # FILL WITH CODE
            batch_size = batch_size # FILL WITH CODE
        )

print("Number of train batches:", len(train_dataloader))
print("Number of test batches: ", len(test_dataloader))

# [5] The actual model!

We can now finally instantiate our BERT model for token classification.

We will use a `BertConfig` to configure some options of the model: what kind of BERT we want to use (the base uncased version) and the number of output labels for our NER task. Note that we count the labels using the length of the `label_map` minus one (we don't need the model to predict the padding label).

In [None]:
from transformers import BertForTokenClassification, AdamW, BertConfig

config = BertConfig.from_pretrained("bert-base-uncased", num_labels = len(label_map)-1)

# Load BertForTokenClassification, the pretrained BERT model with a single
# linear classification layer on top, applied to every token.
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    config = config
)

device_id = "cuda" if torch.cuda.is_available() else "cpu"

# Tell pytorch to run this model on the GPU (or on the CPU if no GPU is available).
model.to(device_id)

With that in mind, let's define the functions to train and test the model: they are pretty similar to the ones you are already used to.

In [None]:
def train_bert_one_epoch(model, dataloader, epoch):
    
    # Reset the total loss for this epoch.
    total_loss = 0
    total_accuracy = 0

    # Put the model into training mode
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(tqdm(dataloader)):

        # Unpack this training batch from our dataloader. 
        # `batch` contains four pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        #   [3]: sample ids
        b_input_ids = batch[0].to(device_id)
        b_input_mask = batch[1].to(device_id)
        b_labels = batch[2].to(device_id)
        
        # Always clear any previously calculated gradients before performing a backward pass
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        loss, logits = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask, 
                            labels=b_labels)

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()
    # Calculate the average loss over all of the batches.
    avg_loss = total_loss / len(dataloader)            
    
    print("  Average loss: {0:.4f}".format(avg_loss))
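
The gradient clipping step used in the training loop can be illustrated in plain Python. This is only a sketch of the idea (rescaling all gradients by a common factor when their overall L2 norm exceeds a threshold). The helper name `clip_by_global_norm` is hypothetical, and the real `torch.nn.utils.clip_grad_norm_` works in place on tensors and differs in its details.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough: leave the values unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# The norm of [3.0, 4.0] is 5.0, so every value is scaled by 1/5.
clipped = clip_by_global_norm([3.0, 4.0], 1.0)
print(clipped)
print(math.sqrt(sum(g * g for g in clipped)))  # norm after clipping, close to 1.0
```

Because every value is multiplied by the same factor, the direction of the update is preserved and only its size is limited.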

During testing we will actually **save the predictions in a dataframe** (a table) to compute some metrics at a later time.

`result_df` is a dataframe with **a line for each test sample** and three columns: one for the `text_ids`, one for the real labels (`gold_labels`) and one for the predictions of the model (`pred_labels`).

It starts as completely empty, and the test function will populate it thanks to the sample ids that we have included in our dataloaders.
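
Once a row holds the gold and predicted label ids, they can be turned back into label names while skipping the padding positions. The sketch below assumes padding labels are stored as -1, as in the label map built earlier. `decode_predictions` and `toy_label_map` are hypothetical names used only for illustration.

```python
def decode_predictions(gold_ids, pred_ids, label_map, pad_value=-1):
    """Turn parallel lists of label ids into (gold, predicted) label-name pairs."""
    id_to_label = {idx: label for label, idx in label_map.items()}
    pairs = []
    for gold, pred in zip(gold_ids, pred_ids):
        if gold == pad_value:
            continue  # padding position: the prediction here is meaningless
        pairs.append((id_to_label[gold], id_to_label[pred]))
    return pairs


toy_label_map = {"O": 0, "B-ACTOR": 1, "I-ACTOR": 2, "[PAD]": -1}
gold = [0, 1, 2, 0, -1, -1]
pred = [0, 1, 0, 0, 0, 2]
print(decode_predictions(gold, pred, toy_label_map))
# [('O', 'O'), ('B-ACTOR', 'B-ACTOR'), ('I-ACTOR', 'O'), ('O', 'O')]
```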

In [None]:
result_df_index = [int(d[-1]) for d in test_dataset]
result_df_columns = ["text_ids", "gold_labels", "pred_labels"]
result_df = pd.DataFrame(index=result_df_index, columns=result_df_columns)
result_df

In [None]:
def test_bert(model, dataloader):

    print("Testing")
    
    # Reset the total loss for this epoch.
    total_loss = 0
    total_accuracy = 0

    # Put the model into testing mode
    model.eval() # FILL WITH CODE

    # For each batch of testing data...
    for step, batch in enumerate(tqdm(dataloader)):

        # Unpack this test batch from our dataloader. 
        # `batch` contains four pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        #   [3]: sample ids
        b_input_ids = batch[0].to(device_id) # FILL WITH CODE
        b_input_mask = batch[1].to(device_id) # FILL WITH CODE
        b_labels = batch[2].to(device_id) # FILL WITH CODE
        sample_ids = batch[3].to(device_id) # FILL WITH CODE

        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Perform a forward pass (evaluate the model on this test batch).
            (loss, logits) = model(b_input_ids,
                                   token_type_ids=None,
                                   attention_mask=b_input_mask,
                                   labels=b_labels) # FILL WITH CODE

        # Accumulate the test loss over all of the batches so that we can
        # calculate the average loss at the end.
        total_loss += loss.item()

        
        # No need to backpropagate on the loss.
        # Instead, take the highest-scoring label for each token as the prediction.
        preds = torch.argmax(logits, dim=-1)
        
        
        for idx_, text_, target_, pred_ in zip(sample_ids, b_input_ids, b_labels, preds):
            idx_ = int(idx_)
            result_df.at[idx_, "text_ids"] = text_.tolist()
            result_df.at[idx_, "gold_labels"] = target_.tolist()
            result_df.at[idx_, "pred_labels"] = pred_.tolist()

    # Calculate the average loss over all of the batches.
    # (Accuracy is not computed here: the metrics are calculated later from `result_df`.)
    avg_loss = total_loss / len(dataloader)

    print("  Average loss: {0:.4f}".format(avg_loss))

Let's create the optimizer and scheduler for the model (this is the default setup for BERT in this library) and train the model for 4 epochs!

In [None]:
BERT_EPOCHS = 4

# Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
optimizer = AdamW(model.parameters(),
                    lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                    eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )

from transformers import get_linear_schedule_with_warmup

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * BERT_EPOCHS

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
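
To get a feeling for what the scheduler does, here is a minimal sketch of a linear schedule with warmup written in plain Python. It only mirrors the idea (linear warmup, then linear decay to zero); the function name `linear_schedule_factor` and the value of `total_steps_example` are made up for illustration and are not part of the library.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup: the factor grows from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay: the factor shrinks from 1 to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps_example = 1000  # hypothetical value, only for this illustration

for step in (0, 250, 500, 750, 1000):
    factor = linear_schedule_factor(step, 0, total_steps_example)
    print(step, base_lr * factor)
```

With `num_warmup_steps = 0`, as in our setup, the learning rate starts at its full value and decreases linearly until it reaches zero at the last training step.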

Training will take approximately 10 minutes.

In [None]:
if torch.cuda.is_available():
    model.to("cuda")

set_seed(42)

print("training")
for epoch in range(BERT_EPOCHS):
    train_bert_one_epoch(model, train_dataloader, epoch) # FILL WITH CODE


Testing the model will populate `result_df` with the predictions!

In [None]:
test_bert(model, test_dataloader) # FILL WITH CODE
display(result_df)

# [6] Evaluation

In this last part we will finally take a look at some NER metrics to understand the performance of the model.

Before starting we will create two new columns in our `result_df`: `gold_no_pad` and `pred_no_pad`. They hold the gold and predicted labels truncated at the first padding position, so that the padding tokens do not distort our evaluation.
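
As a small illustration of the idea, the sketch below truncates a pair of label lists at the first padding position. The helper `strip_padding` and the toy label ids are hypothetical; the only assumption taken from the notebook is that padding is marked with -1 in the gold labels.

```python
def strip_padding(gold, pred, pad_id=-1):
    """Cut both lists at the first padding position found in the gold labels."""
    # If the sequence fills the whole length there is no padding to remove.
    cut = gold.index(pad_id) if pad_id in gold else len(gold)
    return gold[:cut], pred[:cut]


# Toy example: the last two positions are padding.
gold = [0, 3, 4, 0, -1, -1]
pred = [0, 3, 0, 0, 2, 2]

gold_no_pad, pred_no_pad = strip_padding(gold, pred)
print(gold_no_pad)
print(pred_no_pad)
```

Note that the predictions are cut using the gold labels: whatever the model predicts on padding positions is simply discarded.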

In [None]:
result_df["gold_no_pad"] = None
result_df["pred_no_pad"] = None

for idx, row in result_df.iterrows():

    gold = row.gold_labels
    pred = row.pred_labels

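    # Labels are padded with -1; if a sequence has no padding, keep it whole.
    first_pad = gold.index(-1) if -1 in gold else len(gold)
    
    # Use .at to store a list in a single cell, as done in `test_bert`.
    result_df.at[idx, "gold_no_pad"] = gold[:first_pad]
    result_df.at[idx, "pred_no_pad"] = pred[:first_pad]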

result_df

In named entity recognition, metrics can be computed either at **token level** or at **entity level**.

Token-level metrics are usually much more "optimistic" than entity-level ones: they score each token on its own and do not check whether the whole entity has been detected.

On the other hand, entity-level metrics are more difficult to compute and there is no widespread agreement on how to compute them.
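
The difference can be seen on the first sample sentence of the dataset. The sketch below is only an illustration: `extract_entities` is a simplified, hypothetical helper (it ignores `I-` tags that do not follow a matching `B-` tag), and the prediction is invented so that the model misses the last token of the ACTOR entity.

```python
def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans found in a list of IOB tags."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a span at the end
        # Close the open span when the current tag does not continue it.
        if start is not None and tag != "I-" + ent_type:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


gold = ["O", "O", "B-RATINGS_AVERAGE", "I-RATINGS_AVERAGE",
        "O", "O", "O", "B-ACTOR", "I-ACTOR"]
# Invented prediction: the model tags "mel" but misses "gibson".
pred = ["O", "O", "B-RATINGS_AVERAGE", "I-RATINGS_AVERAGE",
        "O", "O", "O", "B-ACTOR", "O"]

# Token level: fraction of tokens with the right tag.
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Entity level (strict): a predicted entity counts only if type and boundaries match.
gold_entities = extract_entities(gold)
pred_entities = extract_entities(pred)
correct = gold_entities & pred_entities
entity_precision = len(correct) / len(pred_entities)
entity_recall = len(correct) / len(gold_entities)

print("token accuracy:   {:.3f}".format(token_accuracy))
print("entity precision: {:.3f}".format(entity_precision))
print("entity recall:    {:.3f}".format(entity_recall))
```

A single wrong token out of nine barely affects the token-level score, while at entity level one of the two entities is lost.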

## [6.1] Token level metrics

Here we will calculate precision, recall and F1 score over all the samples, reported separately for each label.

Note that we are not considering the padding when calculating these metrics: we are dropping all the items for which the gold label is -1 (our chosen padding).
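
To make the computation concrete, here is a plain Python sketch of precision, recall and F1 for a single label, with the padding positions dropped first. The function `token_prf` and the toy label ids are hypothetical; returning 0 when a denominator is empty is a choice made for this sketch.

```python
def token_prf(gold, pred, label, pad_id=-1):
    """Token-level precision, recall and F1 for one label, ignoring padding."""
    # Keep only the positions whose gold label is not padding.
    pairs = [(g, p) for g, p in zip(gold, pred) if g != pad_id]

    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the last two positions are padding, where the model predicts label 1.
gold = [0, 1, 1, 0, 2, -1, -1]
pred = [0, 1, 0, 0, 2, 1, 1]

precision, recall, f1 = token_prf(gold, pred, label=1)
print(precision, recall, f1)
```

If the padding positions were kept, the two predictions of label 1 on padding would count as false positives and lower the precision for that label.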

In [None]:
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

def calc_metrics_TOKEN(gold_list, pred_list):
        
    all_gold = np.array(gold_list).flat
    all_pred = np.array(pred_list).flat
    
    gold_flat = []
    pred_flat = []
    
    for gold, pred in zip(all_gold, all_pred):
        if gold != -1:
            gold_flat.append(gold)
            pred_flat.append(pred)
            
    gold_flat = np.array(gold_flat)
    pred_flat = np.array(pred_flat)
    
    return precision_recall_fscore_support(gold_flat, pred_flat, labels=list(range(len(label_map))))
    

In [None]:
pre, rec, f1, supp = calc_metrics_TOKEN(result_df.gold_labels.tolist(), result_df.pred_labels.tolist())

label_map_ = {v:k for k,v in label_map.items()}

metrics_df = pd.DataFrame({
    "label_id": list(range(len(label_map))),
    "precision": pre,
    "recall": rec,
    "f1": f1,
    "support": supp,
})
metrics_df.index = [label_map_[x] for x in label_map_.keys()]
metrics_df.drop("[PAD]", axis=0, inplace=True)
metrics_df.sort_values(by="support", ascending=False)

## [6.2] Entity level metrics

We are going to calculate the entity level metrics using the metrics defined by the International Workshop on Semantic Evaluation (SemEval).

A brief history of different metrics for NER can be found here: http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/

The author of the post also provides an implementation to evaluate all the mentioned metrics.
We are going to use a slightly modified (improved?) version of his code, located in the file `ner_eval.py` (it was downloaded together with the dataset and you can find it in the file section of colab).

In [None]:
from ner_eval import *

def calc_metrics_ENTITY(gold_list, pred_list):
        
    all_gold = []
    all_pred = []

    for g in gold_list:
        all_gold.append([label_map_[x] for x in g])
    for p in pred_list:
        all_pred.append([label_map_[x] for x in p])

    all_entities = list(label_map.keys())
    all_entities = [e[2:] for e in all_entities if e not in ["O","[PAD]"]]
    
    evaluator = Evaluator(all_gold, all_pred, all_entities)
            
    results, results_agg = evaluator.evaluate()

    return results_agg

res = calc_metrics_ENTITY(result_df.gold_no_pad.tolist(), result_df.pred_no_pad.tolist())

"""
            | check boundaries  |
            | correct           |
            +--------+----------+
            | Y      | N        |
--------+---+--------+----------+
check   | Y | strict | ent_type |
entity  +---+--------+----------+
correct | N | exact  | partial  |
--------+---+--------+----------+

"""
# Collect the "partial" scores for every entity type and compute the F1 score.
new_res = {}
for k in res.keys():
    new_res[k] = res[k]["partial"]
df = pd.DataFrame.from_dict(new_res, orient="index")
df["f1"] = (2*df["precision"]*df["recall"]) / (df["precision"]+df["recall"])
# Prefix every column name (this does not depend on the regex default of str.replace).
df = df.add_prefix("partial ")

# Same for the "strict" scores.
new_res = {}
for k in res.keys():
    new_res[k] = res[k]["strict"]
df2 = pd.DataFrame.from_dict(new_res, orient="index")
df2["f1"] = (2*df2["precision"]*df2["recall"]) / (df2["precision"]+df2["recall"])
df2 = df2.add_prefix("strict ")

df = pd.concat([df, df2], axis=1)
display(df[["strict precision", "strict recall", "strict f1",
            "partial precision", "partial recall", "partial f1"]]\
        .sort_values(by="partial f1", ascending=False))