# Practice 4: Named Entity Recognition

## Introduction

### Formulation of the problem

In this assignment, you will solve the Named Entity Recognition (NER) problem, which, along with text classification, is one of the most common tasks in NLP.

The task is to classify each word/token as either part of a named entity (an entity may span several words/tokens) or not.

For example, suppose we want to extract person names and organization names. Then for the text

     Ian    Goodfellow  works  for  Google  Brain

the model should produce the following tag sequence:

     B-PER  I-PER       O      O    B-ORG   I-ORG

where the prefix *B-* marks the beginning of a named entity, *I-* marks its continuation (a token inside the entity), and *O* marks a token outside any entity. This prefix system (*BIO* tagging) makes it possible to distinguish consecutive named entities of the same type.
There are other tagging schemes, such as [*BILUO*](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)), but in this tutorial we will focus on *BIO*.
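
To make the tagging scheme concrete, here is a minimal sketch of how a BIO tag sequence can be decoded back into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the assignment), and treating a stray *I-* tag as the start of an entity is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    ent_type: Optional[str] = None

    for i, tag in enumerate(tags):
        # The current entity continues only on an I- tag of the same type
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.append((ent_type, start, i))
            start, ent_type = None, None
        # Assumption: an I- tag with no open entity also starts a new entity
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]

    if start is not None:
        spans.append((ent_type, start, len(tags)))
    return spans


# The tag sequence from the example above
print(bio_to_spans(["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]))
# Two adjacent one-token entities of the same type stay separate thanks to B-
print(bio_to_spans(["B-PER", "B-PER"]))
```

The second call shows why the *B-* prefix is needed: without it, two neighbouring entities of the same type would merge into one span.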

We will solve the NER problem on the CoNLL-2003 dataset using recurrent networks and models based on the Transformer architecture.

### Libraries

Main libraries:
  - [PyTorch](https://pytorch.org/)
  - [Transformers](https://github.com/huggingface/transformers)

### Data

The data is stored in an archive, which consists of:

- *train.tsv* - training set. Each line contains: <word / token>, <word / token tag>

- *valid.tsv* - validation set, which can be used to select hyperparameters and measure quality. It has the same structure as train.tsv.

- *test.tsv* - test set, which is used to evaluate the final quality. It has the same structure as train.tsv.

You can download the data here: [link](https://drive.google.com/drive/folders/1OKNrfHsBm1ehbG-yM0R1BGshbscf_eue?usp=drive_link)

In [1]:
# !pip install numpy==1.21.6 scikit-learn==1.0.2 tensorboard==2.9.0 torch==1.12.1 tqdm==4.64.0 transformers==4.21.1


In [2]:
import random
from collections import Counter, defaultdict, namedtuple
from typing import Any, Dict, List, Tuple

import numpy as np

import torch

from tqdm import tqdm, trange


Let's fix the seed for reproducibility of the results (it is advisable to do this **always**!):

In [3]:
def set_global_seed(seed: int) -> None:
    """
    Set global seed for reproducibility.
    """

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


set_global_seed(42)


Let’s initialize the device (CPU / GPU) on which we will work (preferably **GPU**):

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device


'cpu'

Initialize *tensorboard* to log metrics during the training process:

In [5]:
%load_ext tensorboard
%tensorboard --logdir logs


## Part 1. Data preparation (4 points)

First of all, we need to read the data. Let's write a function that takes the path to one of the CoNLL-2003 files and returns two lists:
- a list of lists of words/tokens
- a list of lists of the corresponding tags

P.S. Let's make this function more flexible by adding a boolean parameter that controls whether the data is converted to *lowercase*.

**Exercise. Implement the `read_conll2003` function.** **<font color='red'>(1 point)</font>**

In [6]:
def read_conll2003(
    path: str,
    lower: bool = True,
) -> Tuple[List[List[str]], List[List[str]]]:
    """
    Prepare data in CoNLL-like format.
    """

    token_seq = []
    label_seq = []
    token_buffer = []
    label_buffer = []

    with open(path, "r") as file:
        for line in file:
            if not line.strip():
                if token_buffer:
                    token_seq.append(token_buffer)
                    label_seq.append(label_buffer)
                    token_buffer = []
                    label_buffer = []
                continue
            try:
                token, label = line.strip().split(" ")
            except ValueError:
                # the line does not consist of exactly two space-separated fields
                print("Can't split line: ", line)
                continue
            if lower:
                token = token.lower()
            token_buffer.append(token)
            label_buffer.append(label)

    if token_buffer:
        token_seq.append(token_buffer)
        label_seq.append(label_buffer)

    return token_seq, label_seq


Let's read all three files:

- *train.tsv*
- *valid.tsv*
- *test.tsv*

In [7]:
train_token_seq, train_label_seq = read_conll2003("data/train.txt")
valid_token_seq, valid_label_seq = read_conll2003("data/valid.txt")
test_token_seq, test_label_seq = read_conll2003("data/test.txt")


Look at what we got:

In [8]:
for token, label in zip(train_token_seq[0], train_label_seq[0]):
    print(f"{token}\t{label}")


eu	B-ORG
rejects	O
german	B-MISC
call	O
to	O
boycott	O
british	B-MISC
lamb	O
.	O


In [9]:
for token, label in zip(valid_token_seq[0], valid_label_seq[0]):
    print(f"{token}\t{label}")


cricket	O
-	O
leicestershire	B-ORG
take	O
over	O
at	O
top	O
after	O
innings	O
victory	O
.	O


In [10]:
for token, label in zip(test_token_seq[0], test_label_seq[0]):
    print(f"{token}\t{label}")


soccer	O
-	O
japan	B-LOC
get	O
lucky	O
win	O
,	O
china	B-PER
in	O
surprise	O
defeat	O
.	O


In [11]:
assert len(train_token_seq) == len(
    train_label_seq
), "The lengths of the training token_seq and label_seq do not match, an error in the read_conll2003 function"
assert len(valid_token_seq) == len(
    valid_label_seq
), "The lengths of the validation token_seq and label_seq do not match, an error in the read_conll2003 function"
assert len(test_token_seq) == len(
    test_label_seq
), "The lengths of the test token_seq and label_seq do not match, an error in the read_conll2003 function"

assert train_token_seq[0] == [
    "eu",
    "rejects",
    "german",
    "call",
    "to",
    "boycott",
    "british",
    "lamb",
    ".",
], "Error in training token_seq"
assert train_label_seq[0] == [
    "B-ORG",
    "O",
    "B-MISC",
    "O",
    "O",
    "O",
    "B-MISC",
    "O",
    "O",
], "Error in training label_seq"

assert valid_token_seq[0] == [
    "cricket",
    "-",
    "leicestershire",
    "take",
    "over",
    "at",
    "top",
    "after",
    "innings",
    "victory",
    ".",
], "Error in validation token_seq"
assert valid_label_seq[0] == [
    "O",
    "O",
    "B-ORG",
    "O",
    "O",
    "O",
    "O",
    "O",
    "O",
    "O",
    "O",
], "Error in validation label_seq"

assert test_token_seq[0] == [
    "soccer",
    "-",
    "japan",
    "get",
    "lucky",
    "win",
    ",",
    "china",
    "in",
    "surprise",
    "defeat",
    ".",
], "Error in test token_seq"
assert test_label_seq[0] == [
    "O",
    "O",
    "B-LOC",
    "O",
    "O",
    "O",
    "O",
    "B-PER",
    "O",
    "O",
    "O",
    "O",
], "Error in test label_seq"

print("All tests passed!")


All tests passed!


The CoNLL-2003 dataset is presented in the form of **BIO** tagging, where the label is:
- *B-{label}* - beginning of entity *{label}*
- *I-{label}* - continuation of the entity *{label}*
- *O* - no entity

There are also other sequence tagging methods, such as **BILUO**.

### Preparing dictionaries

To train the neural network, we will use two mappings:
- {**token**}→{**token_idx**}: mapping from a word/token to its row in the *embedding* matrix (starts from 0);
- {**label**}→{**label_idx**}: mapping from a tag to its unique index (starts from 0).

Now we need to implement two functions:
- get_token2idx
- get_label2idx

which will return the corresponding dictionaries.

P.S. token2idx dictionary must also contain special tokens:
- `<PAD>` is a special token for padding, since we are going to train the models in batches
- `<UNK>` is a special token for processing words/tokens that are not in the dictionary (relevant for inference)

Let's assign them to idx 0 and 1 respectively for convenience.
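
As a small illustration of how the `<UNK>` token is used at lookup time, here is a sketch with a toy vocabulary. The names `toy_token2idx` and `encode` are hypothetical and exist only for this example; the real dictionary is built from the training set.

```python
from typing import Dict, List

# Toy vocabulary for illustration only
toy_token2idx: Dict[str, int] = {"<PAD>": 0, "<UNK>": 1, "google": 2, "works": 3}


def encode(tokens: List[str], token2idx: Dict[str, int]) -> List[int]:
    """Map tokens to indices, falling back to <UNK> for unseen tokens."""
    unk_idx = token2idx["<UNK>"]
    return [token2idx.get(token, unk_idx) for token in tokens]


# "wonders" is not in the toy vocabulary, so it is mapped to index 1 (<UNK>)
print(encode(["google", "works", "wonders"], toy_token2idx))  # [2, 3, 1]
```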

P.P.S. You can also add a *min_count* parameter to get_token2idx, so that only words occurring at least that many times are included.

First let's collect:
- token2cnt - a dictionary mapping each unique word / token to its number of occurrences in the training set (it is important to count over the training set only!)
- label_set - a list of unique tags

P.S. You can also use stemming to map different forms of the same word to a single token, but we will skip this step.

**Exercise. Implement the `get_token2idx` and `get_label2idx` functions.** **<font color='red'>(1 point)</font>**

In [12]:
token2cnt = Counter([token for sentence in train_token_seq for token in sentence])


In [13]:
token2cnt.most_common(10)


[('the', 8390),
 ('.', 7374),
 (',', 7290),
 ('of', 3815),
 ('in', 3621),
 ('to', 3424),
 ('a', 3199),
 ('and', 2872),
 ('(', 2861),
 (')', 2861)]

In [14]:
print(f"Number of unique words in the training dataset: {len(token2cnt)}")
print(
    f"Number of words occurring only once in the training dataset: {len([token for token, cnt in token2cnt.items() if cnt == 1])}"
)


Number of unique words in the training dataset: 21010
Number of words occurring only once in the training dataset: 10060


As we can see, many words appear only once in the dataset. The model cannot learn useful representations for them and would only overfit, so let's drop such words when building the vocabulary.
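
Before choosing a frequency cut-off, it can help to check how much of the running text stays covered by the vocabulary. The sketch below does this on a toy counter; `vocabulary_coverage` and `toy_token2cnt` are hypothetical names used only for illustration, and the same function could be applied to `token2cnt`.

```python
from collections import Counter


def vocabulary_coverage(token2cnt: Counter, min_count: int) -> float:
    """Fraction of token occurrences whose token has frequency >= min_count."""
    total = sum(token2cnt.values())
    kept = sum(cnt for cnt in token2cnt.values() if cnt >= min_count)
    return kept / total


# Toy corpus: "the" occurs 3 times, "cat" twice, every other word once
toy_token2cnt = Counter("the cat sat on the mat near the cat".split())

for min_count in (1, 2, 3):
    coverage = vocabulary_coverage(toy_token2cnt, min_count)
    print(f"min_count={min_count}: coverage={coverage:.3f}")
```

Rare words usually make up a large share of the vocabulary but a much smaller share of the occurrences, so the coverage typically drops more slowly than the vocabulary size.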

In [15]:
# use the min_count parameter to cut off words with frequency cnt < min_count


def get_token2idx(
    token2cnt: dict[str, int],
    min_count: int,
) -> dict[str, int]:
    """
    Get mapping from tokens to indices to use with Embedding layer.
    """

    token2idx: dict[str, int] = {}

    token2idx["<PAD>"] = 0
    token2idx["<UNK>"] = 1

    for token, cnt in token2cnt.items():
        if cnt >= min_count:
            token2idx[token] = len(token2idx)

    return token2idx


In [16]:
token2idx = get_token2idx(token2cnt, min_count=2)


In [17]:
# Function for sorting tags so that first there is an O tag,
# then B- tags and only after I- tags (can be set manually)


def sort_labels_func(x: str) -> int:
    if x == "O":
        return 0
    elif x.startswith("B-"):
        return 1
    else:
        return 2


label_set = sorted(
    set(label for sentence in train_label_seq for label in sentence),
    key=lambda x: (sort_labels_func(x), x),
)


In [18]:
label_set


['O', 'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER']

In [19]:
def get_label2idx(label_set: list[str]) -> dict[str, int]:
    """
    Get mapping from labels to indices.
    """

    label2idx: dict[str, int] = {}

    for label in label_set:
        label2idx[label] = len(label2idx)

    return label2idx


In [20]:
label2idx = get_label2idx(label_set)


Let's look at what we got:

In [21]:
for token, idx in list(token2idx.items())[:10]:
    print(f"{token}\t{idx}")


<PAD>	0
<UNK>	1
eu	2
german	3
call	4
to	5
boycott	6
british	7
lamb	8
.	9


In [22]:
for label, idx in label2idx.items():
    print(f"{label}\t{idx}")


O	0
B-LOC	1
B-MISC	2
B-ORG	3
B-PER	4
I-LOC	5
I-MISC	6
I-ORG	7
I-PER	8


In [23]:
assert (
    len(get_token2idx(token2cnt, min_count=1)) == 21012
), "Error in dictionary length, most likely min_count is implemented incorrectly"
assert (
    len(token2idx) == 10952
), "Incorrect token2idx length, most likely min_count is implemented incorrectly"
assert len(label2idx) == 9, "Incorrect label2idx length"

assert list(token2idx.items())[:10] == [
    ("<PAD>", 0),
    ("<UNK>", 1),
    ("eu", 2),
    ("german", 3),
    ("call", 4),
    ("to", 5),
    ("boycott", 6),
    ("british", 7),
    ("lamb", 8),
    (".", 9),
], "Wrong format of token2idx"
assert label2idx == {
    "O": 0,
    "B-LOC": 1,
    "B-MISC": 2,
    "B-ORG": 3,
    "B-PER": 4,
    "I-LOC": 5,
    "I-MISC": 6,
    "I-ORG": 7,
    "I-PER": 8,
}, "Wrong format of label2idx"

print("All tests passed!")


All tests passed!


### Preparing the dataset and loader

Typically, neural networks are trained in batches. This means that each update of the network's weights is based on several sequences. One technical detail is that all sequences within a batch must be padded to the same length.

From the previous practical task, you should know about `Dataset` (`torch.utils.data.Dataset`) - a data structure that stores and can index data for training. The dataset must inherit from the standard PyTorch Dataset class and override the `__len__` and `__getitem__` methods.

The `__getitem__` method must return the indexed sequence and its tags.

**Don't forget** about `<UNK>` special token for unknown words!

Let's write a custom dataset for our task, which will receive as input (the `__init__` method):
- token_seq - list of lists of words/tokens
- label_seq - list of lists of tags
- token2idx
- label2idx

and return from the `__getitem__` method two int64 tensors (`torch.LongTensor`): the indices of the words / tokens in the sample and the indices of the corresponding tags.

**Exercise. Implement the NERDataset class.** **<font color='red'>(1 point)</font>**

In [24]:
class NERDataset(torch.utils.data.Dataset):
    """
    PyTorch Dataset for NER.
    """

    def __init__(
        self,
        token_seq: List[List[str]],
        label_seq: List[List[str]],
        token2idx: Dict[str, int],
        label2idx: Dict[str, int],
    ):
        self.token2idx = token2idx
        self.label2idx = label2idx

        self.token_seq = [
            self.process_tokens(tokens, token2idx) for tokens in token_seq
        ]
        self.label_seq = [
            self.process_labels(labels, label2idx) for labels in label_seq
        ]

    def __len__(self):
        return len(self.token_seq)

    def __getitem__(
        self,
        idx: int,
    ) -> tuple[torch.LongTensor, torch.LongTensor]:
        return torch.LongTensor(self.token_seq[idx]), torch.LongTensor(
            self.label_seq[idx]
        )

    @staticmethod
    def process_tokens(
        tokens: List[str],
        token2idx: Dict[str, int],
        unk: str = "<UNK>",
    ) -> List[int]:
        """
        Transform list of tokens into list of tokens' indices.
        """
        return [token2idx.get(token, token2idx[unk]) for token in tokens]

    @staticmethod
    def process_labels(
        labels: List[str],
        label2idx: Dict[str, int],
    ) -> List[int]:
        """
        Transform list of labels into list of labels' indices.
        """
        return [label2idx.get(label, label2idx["O"]) for label in labels]


Create three datasets:
- *train_dataset*
- *valid_dataset*
- *test_dataset*

In [25]:
train_dataset = NERDataset(
    token_seq=train_token_seq,
    label_seq=train_label_seq,
    token2idx=token2idx,
    label2idx=label2idx,
)
valid_dataset = NERDataset(
    token_seq=valid_token_seq,
    label_seq=valid_label_seq,
    token2idx=token2idx,
    label2idx=label2idx,
)
test_dataset = NERDataset(
    token_seq=test_token_seq,
    label_seq=test_label_seq,
    token2idx=token2idx,
    label2idx=label2idx,
)


Let's look at what we got:

In [26]:
train_dataset[0]


(tensor([2, 1, 3, 4, 5, 6, 7, 8, 9]), tensor([3, 0, 2, 0, 0, 0, 2, 0, 0]))

In [27]:
valid_dataset[0]


(tensor([1737,  571, 1777,  197,  687,  145,  349,  111, 1819, 1558,    9]),
 tensor([0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0]))

In [28]:
test_dataset[0]


(tensor([1516,  571, 1434, 1729, 4893, 2014,   67,  310,  215, 3157, 3139,    9]),
 tensor([0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0]))

In [29]:
assert len(train_dataset) == 14986, "Incorrect train_dataset length"
assert len(valid_dataset) == 3465, "Incorrect valid_dataset length"
assert len(test_dataset) == 3683, "Incorrect test_dataset length"

assert torch.equal(
    train_dataset[0][0], torch.tensor([2, 1, 3, 4, 5, 6, 7, 8, 9])
), "Malformed train_dataset"
assert torch.equal(
    train_dataset[0][1], torch.tensor([3, 0, 2, 0, 0, 0, 2, 0, 0])
), "Malformed train_dataset"

assert torch.equal(
    valid_dataset[0][0],
    torch.tensor([1737, 571, 1777, 197, 687, 145, 349, 111, 1819, 1558, 9]),
), "Malformed valid_dataset"
assert torch.equal(
    valid_dataset[0][1], torch.tensor([0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0])
), "Malformed valid_dataset"

assert torch.equal(
    test_dataset[0][0],
    torch.tensor([1516, 571, 1434, 1729, 4893, 2014, 67, 310, 215, 3157, 3139, 9]),
), "Malformed test_dataset"
assert torch.equal(
    test_dataset[0][1], torch.tensor([0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0])
), "Malformed test_dataset"

print("All tests passed!")


All tests passed!


To pad sequences, we will use the `collate_fn` parameter of the `DataLoader` class.

Given a list of pairs of tensors (sentence and tags), all sequences must be padded to the length of the longest sequence in the batch.

Use the special token `<PAD>` to pad word/token sequences and -1 to pad tag sequences.

**hint**: it is convenient to use the `torch.nn.utils.rnn.pad_sequence` function. Pay attention to the `batch_first` parameter.
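
To show what the two layouts look like, here is a dependency-free sketch of right-padding. It only imitates the behaviour on plain Python lists; `pad_batch` is a hypothetical helper, not a PyTorch function. With `batch_first=True`, `pad_sequence` returns a tensor of shape (batch, max_len), while the default `batch_first=False` gives (max_len, batch).

```python
from typing import List


def pad_batch(sequences: List[List[int]], padding_value: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


batch = [[2, 1, 3, 4], [10, 11]]

# Layout analogous to batch_first=True: one row per sequence
batch_first = pad_batch(batch, padding_value=0)
# Layout analogous to batch_first=False: one row per time step
time_first = [list(column) for column in zip(*batch_first)]

print(batch_first)  # [[2, 1, 3, 4], [10, 11, 0, 0]]
print(time_first)   # [[2, 10], [1, 11], [3, 0], [4, 0]]
```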

`Collator` can be implemented in two ways:
- class with method `__call__`
- function

We will go the first way.

Initialize an instance of the `Collator` class (the `__init__` method) using two parameters:
- id of the `<PAD>` special token for word/token sequences
- padding value for tag sequences (value -1)

The `__call__` method takes a batch as input, namely a list of the tuples returned by the `__getitem__` method of our dataset. In our case, this is a list of tuples of two int64 tensors - `List[Tuple[torch.LongTensor, torch.LongTensor]]`.

As the output we want to get two tensors:
- Indices of words/tokens with padding
- Indices of tags with padding

P.S. A separate padding value for tags (-1) makes it easy to distinguish padded positions from real ones when calculating the loss. You can use the `ignore_index` parameter when initializing the loss.
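
The effect of `ignore_index` can be sketched without PyTorch: positions whose label equals the ignored value are skipped, and the loss is averaged over the remaining positions only. The function `masked_cross_entropy` below is a hypothetical plain-Python illustration of that idea, not the library implementation.

```python
import math
from typing import List


def masked_cross_entropy(
    logits: List[List[float]],
    labels: List[int],
    ignore_index: int = -1,
) -> float:
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for position_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: it does not contribute to the loss
        log_norm = math.log(sum(math.exp(x) for x in position_logits))
        losses.append(log_norm - position_logits[label])
    # Assumption of this sketch: at least one position is not ignored
    return sum(losses) / len(losses)


# One sequence of length 3 with 2 classes; the last position is padding
labels = [0, 1, -1]
loss_a = masked_cross_entropy([[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]], labels)
loss_b = masked_cross_entropy([[2.0, 0.0], [0.0, 2.0], [-9.0, 9.0]], labels)

# The logits at the padded position do not affect the result
print(loss_a, loss_b)
```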

**Exercise. Implement the collator class NERCollator.** **<font color='red'>(1 point)</font>**

In [30]:
class NERCollator:
    """
    Collator that handles variable-size sentences.
    """

    def __init__(
        self,
        token_padding_value: int,
        label_padding_value: int,
    ):
        self.token_padding_value = token_padding_value
        self.label_padding_value = label_padding_value

    def __call__(
        self,
        batch: List[Tuple[torch.LongTensor, torch.LongTensor]],
    ) -> Tuple[torch.LongTensor, torch.LongTensor]:

        tokens, labels = zip(*batch)
        padded_tokens = torch.nn.utils.rnn.pad_sequence(
            tokens, batch_first=True, padding_value=self.token_padding_value
        )
        padded_labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=self.label_padding_value
        )

        return padded_tokens, padded_labels


In [31]:
collator = NERCollator(
    token_padding_value=token2idx["<PAD>"],
    label_padding_value=-1,
)


Now everything is ready to define the loaders.

In [32]:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collator,
)
valid_dataloader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False,  # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False,  # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)


Let's look at what we got:

In [33]:
tokens, labels = next(iter(train_dataloader))

tokens = tokens.to(device)
labels = labels.to(device)


In [34]:
tokens


tensor([[7796, 1162, 2553, 7237, 1342,    0,    0,    0,    0,    0],
        [ 125, 1167,    1,   67, 1349,  489, 1215, 1364, 1365, 1366]])

In [35]:
labels


tensor([[ 3,  0,  3,  7,  0, -1, -1, -1, -1, -1],
        [ 0,  4,  8,  0,  1,  0,  0,  0,  0,  0]])

In [36]:
train_tokens, train_labels = next(
    iter(
        torch.utils.data.DataLoader(
            train_dataset,
            batch_size=2,
            shuffle=False,
            collate_fn=collator,
        )
    )
)
assert torch.equal(
    train_tokens,
    torch.tensor([[2, 1, 3, 4, 5, 6, 7, 8, 9], [10, 11, 0, 0, 0, 0, 0, 0, 0]]),
), "Looks like a bug in the collator"
assert torch.equal(
    train_labels,
    torch.tensor([[3, 0, 2, 0, 0, 0, 2, 0, 0], [4, 8, -1, -1, -1, -1, -1, -1, -1]]),
), "Looks like a bug in the collator"

valid_tokens, valid_labels = next(
    iter(
        torch.utils.data.DataLoader(
            valid_dataset,
            batch_size=2,
            shuffle=False,
            collate_fn=collator,
        )
    )
)
assert torch.equal(
    valid_tokens,
    torch.tensor(
        [
            [1737, 571, 1777, 197, 687, 145, 349, 111, 1819, 1558, 9],
            [248, 10679, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    valid_labels,
    torch.tensor(
        [[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]]
    ),
), "Looks like a bug in the collator"

test_tokens, test_labels = next(
    iter(
        torch.utils.data.DataLoader(
            test_dataset,
            batch_size=2,
            shuffle=False,
            collate_fn=collator,
        )
    )
)
assert torch.equal(
    test_tokens,
    torch.tensor(
        [
            [1516, 571, 1434, 1729, 4893, 2014, 67, 310, 215, 3157, 3139, 9],
            [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    test_labels,
    torch.tensor(
        [
            [0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0],
            [4, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        ]
    ),
), "Looks like a bug in the collator"

print("All tests passed!")


All tests passed!


## Part 2. BiLSTM tagger (6 points)

Define the network architecture using the PyTorch library.

Your architecture at this point should follow the standard tagger:
* Embedding layer at the input
* LSTM (unidirectional or bidirectional) layer for sequence processing
* Dropout (specified separately or built into LSTM) to reduce overfitting
* Linear output layer

To train the network, use an element-wise cross-entropy loss function.

**Please note** that `<PAD>` tokens should not be included in the loss function calculation. It is recommended to use Adam as an optimizer. To obtain prediction values from model outputs, use the `argmax` function.

**Exercise. Implement the BiLSTM model class.** **<font color='red'>(2 points)</font>**

In [37]:
class BiLSTM(torch.nn.Module):
    """
    Bidirectional LSTM architecture.
    """

    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        hidden_size: int,
        num_layers: int,
        dropout: float,
        bidirectional: bool,
        n_classes: int,
    ):
        super().__init__()

        self.embedding = torch.nn.Embedding(num_embeddings, embedding_dim)
        self.rnn = torch.nn.LSTM(
            embedding_dim,
            hidden_size,
            num_layers,
            dropout=dropout,
            bidirectional=bidirectional,
        )
        self.head = torch.nn.Linear(
            hidden_size * (2 if bidirectional else 1), n_classes
        )

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        embed = self.embedding(tokens)

        # Pack the padded batch into a PackedSequence so that the LSTM
        # skips padded positions (padding index 0 is assumed, as in token2idx)
        length = (tokens != 0).sum(dim=1).detach().cpu()
        packed_embed = torch.nn.utils.rnn.pack_padded_sequence(
            embed, length, batch_first=True, enforce_sorted=False
        )

        # Run the LSTM, then unpack the PackedSequence back into a padded tensor
        packed_rnn_output, _ = self.rnn(packed_embed)
        rnn_output, _ = torch.nn.utils.rnn.pad_packed_sequence(
            packed_rnn_output, batch_first=True
        )

        logits = self.head(rnn_output)
        return logits.transpose(1, 2)


In [38]:
model = BiLSTM(
    num_embeddings=len(token2idx),
    embedding_dim=100,
    hidden_size=100,
    num_layers=1,
    dropout=0.0,
    bidirectional=True,
    n_classes=len(label2idx),
).to(device)


In [39]:
model


BiLSTM(
  (embedding): Embedding(10952, 100)
  (rnn): LSTM(100, 100, bidirectional=True)
  (head): Linear(in_features=200, out_features=9, bias=True)
)

In [40]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)


In [41]:
outputs = model(tokens)


In [42]:
assert outputs.shape == torch.Size([2, 9, 10])
assert 2 < criterion(outputs, labels) < 3

print("All tests passed!")


All tests passed!


### Experiments

Run experiments on the data. Adjust parameters based on the validation set without using the test set. Your goal is to configure the network so that the quality of the model according to the F1-macro measure on the validation and test sets is no less than **0.76**.
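
Since the target is stated in terms of F1-macro, it is worth seeing how it differs from accuracy (which equals F1-micro for single-label classification) on imbalanced tags. The sketch below is a plain-Python illustration with hypothetical helper names and toy labels; in the notebook itself the metrics are computed with scikit-learn.

```python
from typing import List


def f1_per_class(y_true: List[int], y_pred: List[int], label: int) -> float:
    """F1 for one class; returns 0.0 when the class is never predicted or present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


def micro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """For single-label multiclass data, micro F1 equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: class 0 (think of the O tag) dominates, the model predicts only 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0] * 10

print(f"micro F1: {micro_f1(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

A model that always predicts the majority tag gets a high micro score but a low macro score, because every class contributes equally to the macro average.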

Draw conclusions about model quality, overfitting, and sensitivity of the architecture to the choice of hyperparameters. Present the results of your experiments in the form of a mini-report (in the same ipython notebook).

In [43]:
# let's create a SummaryWriter for experimenting with BiLSTMModel

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/BiLSTMModel")


**Exercise. Implement a metric calculation function `compute_metrics`.** **<font color='red'>(1 point)</font>**

In [44]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def compute_metrics(
    outputs: torch.Tensor,
    labels: torch.LongTensor,
) -> Dict[str, float]:
    """
    Compute NER metrics.
    """

    metrics = {}

    y_true = labels.detach().cpu().numpy()
    y_pred = outputs.argmax(dim=-2).detach().cpu().numpy()
    assert y_true.shape == y_pred.shape
    mask = y_true != -1
    if not mask.any():
        # every position is padding, so there is nothing to score
        raise ValueError("No labels found")

    y_pred = y_pred[mask]
    y_true = y_true[mask]

    # accuracy
    accuracy = accuracy_score(
        y_true=y_true,
        y_pred=y_pred,
    )

    # precision
    precision_micro = precision_score(
        y_true=y_true,
        y_pred=y_pred,
        average="micro",
        zero_division=0,
    )
    precision_macro = precision_score(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
        zero_division=0,
    )
    precision_weighted = precision_score(
        y_true=y_true,
        y_pred=y_pred,
        average="weighted",
        zero_division=0,
    )

    # recall
    recall_micro = recall_score(
        y_true=y_true,
        y_pred=y_pred,
        average="micro",
        zero_division=0,
    )
    recall_macro = recall_score(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
        zero_division=0,
    )
    recall_weighted = recall_score(
        y_true=y_true,
        y_pred=y_pred,
        average="weighted",
        zero_division=0,
    )

    # f1
    f1_micro = f1_score(
        y_true=y_true,
        y_pred=y_pred,
        average="micro",
        zero_division=0,
    )
    f1_macro = f1_score(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
        zero_division=0,
    )
    f1_weighted = f1_score(
        y_true=y_true,
        y_pred=y_pred,
        average="weighted",
        zero_division=0,
    )

    metrics["accuracy"] = accuracy

    metrics["precision_micro"] = precision_micro
    metrics["precision_macro"] = precision_macro
    metrics["precision_weighted"] = precision_weighted

    metrics["recall_micro"] = recall_micro
    metrics["recall_macro"] = recall_macro
    metrics["recall_weighted"] = recall_weighted

    metrics["f1_micro"] = f1_micro
    metrics["f1_macro"] = f1_macro
    metrics["f1_weighted"] = f1_weighted

    return metrics


**Exercise. Implement the training and testing functions `train_epoch` and `evaluate_epoch`. <font color='red'>(2 points)</font>**

In [45]:
def train_epoch(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One training cycle (loop).
    """

    model.train()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    for i, (tokens, labels) in tqdm(
        enumerate(dataloader),
        total=len(dataloader),
        desc="loop over train batches",
    ):

        tokens, labels = tokens.to(device), labels.to(device)

        # Loss calculation and optimizer step
        outputs = model(tokens)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss.append(loss.item())
        writer.add_scalar(
            "batch loss / train", loss.item(), epoch * len(dataloader) + i
        )

        with torch.no_grad():
            model.eval()
            outputs_inference = model(tokens)
            model.train()

        batch_metrics = compute_metrics(
            outputs=outputs_inference,
            labels=labels,
        )

        for metric_name, metric_value in batch_metrics.items():
            batch_metrics_list[metric_name].append(metric_value)
            writer.add_scalar(
                f"batch {metric_name} / train",
                metric_value,
                epoch * len(dataloader) + i,
            )

    avg_loss = np.mean(epoch_loss)
    print(f"Train loss: {avg_loss}\n")
    writer.add_scalar("loss / train", avg_loss, epoch)

    for metric_name, metric_value_list in batch_metrics_list.items():
        metric_value = np.mean(metric_value_list)
        print(f"Train {metric_name}: {metric_value}\n")
        writer.add_scalar(f"{metric_name} / train", metric_value, epoch)


def evaluate_epoch(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One evaluation cycle (loop).
    """

    model.eval()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    with torch.no_grad():

        for i, (tokens, labels) in tqdm(
            enumerate(dataloader),
            total=len(dataloader),
            desc="loop over test batches",
        ):

            tokens, labels = tokens.to(device), labels.to(device)

            outputs = model(tokens)
            loss = criterion(outputs, labels)

            epoch_loss.append(loss.item())
            writer.add_scalar(
                "batch loss / test", loss.item(), epoch * len(dataloader) + i
            )

            batch_metrics = compute_metrics(
                outputs=outputs,
                labels=labels,
            )

            for metric_name, metric_value in batch_metrics.items():
                batch_metrics_list[metric_name].append(metric_value)
                writer.add_scalar(
                    f"batch {metric_name} / test",
                    metric_value,
                    epoch * len(dataloader) + i,
                )

        avg_loss = np.mean(epoch_loss)
        print(f"Test loss:  {avg_loss}\n")
        writer.add_scalar("loss / test", avg_loss, epoch)

        for metric_name, metric_value_list in batch_metrics_list.items():
            metric_value = np.mean(metric_value_list)
            print(f"Test {metric_name}: {metric_value}\n")
            writer.add_scalar(f"{metric_name} / test", metric_value, epoch)


def train(
    n_epochs: int,
    model: torch.nn.Module,
    train_dataloader: torch.utils.data.DataLoader,
    test_dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
) -> None:
    """
    Training loop.
    """

    for epoch in range(n_epochs):

        print(f"Epoch [{epoch+1} / {n_epochs}]\n")

        train_epoch(
            model=model,
            dataloader=train_dataloader,
            optimizer=optimizer,
            criterion=criterion,
            writer=writer,
            device=device,
            epoch=epoch,
        )
        evaluate_epoch(
            model=model,
            dataloader=test_dataloader,
            criterion=criterion,
            writer=writer,
            device=device,
            epoch=epoch,
        )


**Exercise. Conduct experiments. <font color='red'>(2 points)</font>**

In [46]:
train(
    n_epochs=10,
    model=model,
    train_dataloader=train_dataloader,
    test_dataloader=valid_dataloader,
    optimizer=optimizer,
    criterion=criterion,
    writer=writer,
    device=device,
)


Epoch [1 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:30<00:00, 82.54it/s] 


Train loss: 0.7162200990978796

Train accuracy: 0.8176733508528256

Train precision_micro: 0.8176733508528256

Train precision_macro: 0.36326264703874805

Train precision_weighted: 0.7124099645124161

Train recall_micro: 0.8176733508528256

Train recall_macro: 0.39418130655413613

Train recall_weighted: 0.8176733508528256

Train f1_micro: 0.8176733508528256

Train f1_macro: 0.3709149633193839

Train f1_weighted: 0.753545331917927



loop over test batches: 100%|██████████| 3465/3465 [00:15<00:00, 223.40it/s]


Test loss:  0.5515609950433293

Test accuracy: 0.8484862802261407

Test precision_micro: 0.8484862802261407

Test precision_macro: 0.5873741642628185

Test precision_weighted: 0.7806537956704694

Test recall_micro: 0.8484862802261407

Test recall_macro: 0.6169040982042243

Test recall_weighted: 0.8484862802261407

Test f1_micro: 0.8484862802261407

Test f1_macro: 0.5939688997802355

Test f1_weighted: 0.8049093361256343

Epoch [2 / 10]



loop over train batches: 100%|██████████| 7493/7493 [02:03<00:00, 60.75it/s]


Train loss: 0.4435078318598061

Train accuracy: 0.876660841544516

Train precision_micro: 0.876660841544516

Train precision_macro: 0.5455114702939275

Train precision_weighted: 0.819441102095202

Train recall_micro: 0.876660841544516

Train recall_macro: 0.5444911130007679

Train recall_weighted: 0.876660841544516

Train f1_micro: 0.876660841544516

Train f1_macro: 0.5337393237556199

Train f1_weighted: 0.839749948746517



loop over test batches: 100%|██████████| 3465/3465 [00:17<00:00, 195.97it/s]


Test loss:  0.4067951879139034

Test accuracy: 0.8868973988882053

Test precision_micro: 0.8868973988882053

Test precision_macro: 0.6789348395503537

Test precision_weighted: 0.8500878402756635

Test recall_micro: 0.8868973988882053

Test recall_macro: 0.6874422962054177

Test recall_weighted: 0.8868973988882053

Test f1_micro: 0.8868973988882053

Test f1_macro: 0.6759691519989974

Test f1_weighted: 0.8618642296281379

Epoch [3 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:52<00:00, 66.58it/s]


Train loss: 0.3278581460449707

Train accuracy: 0.906552583303105

Train precision_micro: 0.906552583303105

Train precision_macro: 0.6472102463574351

Train precision_weighted: 0.872109099320353

Train recall_micro: 0.906552583303105

Train recall_macro: 0.6384118236137317

Train recall_weighted: 0.906552583303105

Train f1_micro: 0.906552583303105

Train f1_macro: 0.6318785582276208

Train f1_weighted: 0.8829632242902398



loop over test batches: 100%|██████████| 3465/3465 [00:15<00:00, 217.57it/s]


Test loss:  0.33083469974814694

Test accuracy: 0.9067880209720021

Test precision_micro: 0.9067880209720021

Test precision_macro: 0.7328792957613026

Test precision_weighted: 0.8808532867021561

Test recall_micro: 0.9067880209720021

Test recall_macro: 0.7334567476132695

Test recall_weighted: 0.9067880209720021

Test f1_micro: 0.9067880209720021

Test f1_macro: 0.7260847132919686

Test f1_weighted: 0.8879503388979029

Epoch [4 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:54<00:00, 65.55it/s]


Train loss: 0.2572962213834766

Train accuracy: 0.9263899678622195

Train precision_micro: 0.9263899678622195

Train precision_macro: 0.7184152151733068

Train precision_weighted: 0.9061448842970666

Train recall_micro: 0.9263899678622195

Train recall_macro: 0.7039706963803615

Train recall_weighted: 0.9263899678622195

Train f1_micro: 0.9263899678622195

Train f1_macro: 0.7007813788512709

Train f1_weighted: 0.9107627858773192



loop over test batches: 100%|██████████| 3465/3465 [00:17<00:00, 197.61it/s]


Test loss:  0.2857373454006182

Test accuracy: 0.9182191560833324

Test precision_micro: 0.9182191560833324

Test precision_macro: 0.7665320041012155

Test precision_weighted: 0.9010467711816883

Test recall_micro: 0.9182191560833324

Test recall_macro: 0.764349396726055

Test recall_weighted: 0.9182191560833324

Test f1_micro: 0.9182191560833324

Test f1_macro: 0.7588096984906372

Test f1_weighted: 0.9043682844853509

Epoch [5 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:53<00:00, 65.87it/s]


Train loss: 0.2075319669602178

Train accuracy: 0.9406450361153569

Train precision_micro: 0.9406450361153569

Train precision_macro: 0.7707733034938945

Train precision_weighted: 0.927923618938035

Train recall_micro: 0.9406450361153569

Train recall_macro: 0.7548777546636575

Train recall_weighted: 0.9406450361153569

Train f1_micro: 0.9406450361153569

Train f1_macro: 0.7533023029038125

Train f1_weighted: 0.929723740191504



loop over test batches: 100%|██████████| 3465/3465 [00:18<00:00, 189.58it/s]


Test loss:  0.25292052573202756

Test accuracy: 0.926362983023455

Test precision_micro: 0.926362983023455

Test precision_macro: 0.7872397479925246

Test precision_weighted: 0.9160884497874232

Test recall_micro: 0.926362983023455

Test recall_macro: 0.7829875005132624

Test recall_weighted: 0.926362983023455

Test f1_micro: 0.926362983023455

Test f1_macro: 0.7788640554705487

Test f1_weighted: 0.916517748613874

Epoch [6 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:58<00:00, 63.48it/s]


Train loss: 0.17424432394416436

Train accuracy: 0.9503933381729998

Train precision_micro: 0.9503933381729998

Train precision_macro: 0.8047947986709131

Train precision_weighted: 0.9413942847927006

Train recall_micro: 0.9503933381729998

Train recall_macro: 0.791207130117618

Train recall_weighted: 0.9503933381729998

Train f1_micro: 0.9503933381729998

Train f1_macro: 0.7895527959263787

Train f1_weighted: 0.9420612067723361



loop over test batches: 100%|██████████| 3465/3465 [00:19<00:00, 181.11it/s]


Test loss:  0.23877395098765616

Test accuracy: 0.9315559252610549

Test precision_micro: 0.9315559252610549

Test precision_macro: 0.800342045163772

Test precision_weighted: 0.9208869311874693

Test recall_micro: 0.9315559252610549

Test recall_macro: 0.795810157505128

Test recall_weighted: 0.9315559252610549

Test f1_micro: 0.9315559252610549

Test f1_macro: 0.7919998155509692

Test f1_weighted: 0.9216396989994365

Epoch [7 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:59<00:00, 62.73it/s]


Train loss: 0.14596842856202913

Train accuracy: 0.9592084052619951

Train precision_micro: 0.9592084052619951

Train precision_macro: 0.8375968885024085

Train precision_weighted: 0.9534829533385603

Train recall_micro: 0.9592084052619951

Train recall_macro: 0.8244494276825227

Train recall_weighted: 0.9592084052619951

Train f1_micro: 0.9592084052619951

Train f1_macro: 0.823320147309146

Train f1_weighted: 0.9530442857398648



loop over test batches: 100%|██████████| 3465/3465 [00:16<00:00, 203.89it/s]


Test loss:  0.2236914431752661

Test accuracy: 0.9323579917084855

Test precision_micro: 0.9323579917084855

Test precision_macro: 0.8085444125938779

Test precision_weighted: 0.9303803075684769

Test recall_micro: 0.9323579917084855

Test recall_macro: 0.803509217948642

Test recall_weighted: 0.9323579917084855

Test f1_micro: 0.9323579917084855

Test f1_macro: 0.8001125369828183

Test f1_weighted: 0.9270835370419527

Epoch [8 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:58<00:00, 63.32it/s]


Train loss: 0.12208725538268841

Train accuracy: 0.9659826462486678

Train precision_micro: 0.9659826462486678

Train precision_macro: 0.8594923981058524

Train precision_weighted: 0.962057765448152

Train recall_micro: 0.9659826462486678

Train recall_macro: 0.8484643413265979

Train recall_weighted: 0.9659826462486678

Train f1_micro: 0.9659826462486678

Train f1_macro: 0.847368450970972

Train f1_weighted: 0.9613046239371499



loop over test batches: 100%|██████████| 3465/3465 [00:16<00:00, 208.84it/s]


Test loss:  0.21008095766053592

Test accuracy: 0.9395714688005937

Test precision_micro: 0.9395714688005937

Test precision_macro: 0.8226786473934631

Test precision_weighted: 0.9350778595987816

Test recall_micro: 0.9395714688005937

Test recall_macro: 0.817484810309738

Test recall_weighted: 0.9395714688005937

Test f1_micro: 0.9395714688005937

Test f1_macro: 0.8145535188753923

Test f1_weighted: 0.9333365277903942

Epoch [9 / 10]



loop over train batches: 100%|██████████| 7493/7493 [01:51<00:00, 67.10it/s]


Train loss: 0.10336082677352099

Train accuracy: 0.9724473616893471

Train precision_micro: 0.9724473616893471

Train precision_macro: 0.886097787322051

Train precision_weighted: 0.9695847601000299

Train recall_micro: 0.9724473616893471

Train recall_macro: 0.8770007495884851

Train recall_weighted: 0.9724473616893471

Train f1_micro: 0.9724473616893471

Train f1_macro: 0.8757321405347733

Train f1_weighted: 0.9687166465238459



loop over test batches: 100%|██████████| 3465/3465 [00:17<00:00, 196.91it/s]


Test loss:  0.20637416760202007

Test accuracy: 0.9428552955254539

Test precision_micro: 0.9428552955254539

Test precision_macro: 0.8265979522724355

Test precision_weighted: 0.9390198799411887

Test recall_micro: 0.9428552955254539

Test recall_macro: 0.8205503308953832

Test recall_weighted: 0.9428552955254539

Test f1_micro: 0.9428552955254539

Test f1_macro: 0.8183987993397167

Test f1_weighted: 0.9371073455778494

Epoch [10 / 10]



loop over train batches: 100%|██████████| 7493/7493 [02:12<00:00, 56.36it/s]


Train loss: 0.08607914459147181

Train accuracy: 0.977325744097776

Train precision_micro: 0.977325744097776

Train precision_macro: 0.9018114387276817

Train precision_weighted: 0.9756119817701798

Train recall_micro: 0.977325744097776

Train recall_macro: 0.8935028344875946

Train recall_weighted: 0.977325744097776

Train f1_micro: 0.977325744097776

Train f1_macro: 0.8925548085375432

Train f1_weighted: 0.9745677803847799



loop over test batches: 100%|██████████| 3465/3465 [00:16<00:00, 205.20it/s]

Test loss:  0.20373854027010543

Test accuracy: 0.9434727883928633

Test precision_micro: 0.9434727883928633

Test precision_macro: 0.8327695683872912

Test precision_weighted: 0.9369463995251371

Test recall_micro: 0.9434727883928633

Test recall_macro: 0.8283745643233776

Test recall_weighted: 0.9434727883928633

Test f1_micro: 0.9434727883928633

Test f1_macro: 0.825398992085176

Test f1_weighted: 0.9365874503274739






In [47]:
evaluate_epoch(
    model=model,
    dataloader=test_dataloader,
    criterion=criterion,
    writer=writer,
    device=device,
    epoch=0,
)


loop over test batches: 100%|██████████| 3683/3683 [00:17<00:00, 213.43it/s]

Test loss:  0.3035410697948575

Test accuracy: 0.9146604490199576

Test precision_micro: 0.9146604490199576

Test precision_macro: 0.7909212469654361

Test precision_weighted: 0.9056660738372527

Test recall_micro: 0.9146604490199576

Test recall_macro: 0.7921245104987077

Test recall_weighted: 0.9146604490199576

Test f1_micro: 0.9146604490199576

Test f1_macro: 0.7862576719784516

Test f1_weighted: 0.9056084307890361






I ran only one experiment, since it was already successful. The parameters were:
```
n_epochs=10
lr=1e-4
embedding_dim=100
hidden_size=100
num_layers=1
dropout=0.0
```

## Part 3. Transformers tagger (6 points)

In this part of the task, you will do the same thing with a model based on the Transformer architecture: specifically, you will fine-tune a pre-trained **BERT** model.

This model requires special data preparation, which is where we will start:

The **BERT** model uses a WordPiece tokenizer to break sentences into tokens. A pre-trained version of this tokenizer is available in the `transformers` library through two classes: `BertTokenizer` and `BertTokenizerFast`. You can use either one, but the second is much faster because it is backed by the `tokenizers` library, which is written in Rust.
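
To make the idea of WordPiece splitting concrete, here is a minimal sketch of greedy longest-match-first tokenization of a single word. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and written only for illustration; the real tokenizer uses a vocabulary of tens of thousands of pieces and also handles normalization and punctuation.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            # No piece matches: the whole word is mapped to the unknown token.
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"e", "##u", "reject", "##s", "lamb"}

print(wordpiece_tokenize("eu", toy_vocab))       # ['e', '##u']
print(wordpiece_tokenize("rejects", toy_vocab))  # ['reject', '##s']
print(wordpiece_tokenize("lamb", toy_vocab))     # ['lamb']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

The important consequence for NER is that one word can turn into several tokens, so the word-level tags no longer line up one-to-one with the model inputs.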

Tokenizers can be trained from scratch using your own data corpus, or you can load pre-trained ones. Pre-trained tokenizers typically match a pre-trained model configuration that uses the vocabulary from that tokenizer.

We will use a basic pretrained **BERT** configuration for the model and tokenizer (the code below loads its distilled variant, `distilbert-base-cased`).

P.S. You often have to experiment with models of different architectures, for example **BERT** and **GPT**, so it is convenient to use the `AutoTokenizer` class, which determines from the model name which tokenizer class to instantiate.

In [48]:
from transformers import AutoTokenizer


In [49]:
model_name = "distilbert-base-cased"


Pretrained models and tokenizers are loaded from `huggingface` using the `from_pretrained` constructor.

In this constructor, you can specify either the path to the pretrained tokenizer, or the name of the pretrained configuration, as in our case. `transformers` will load the necessary parameters itself:

In [50]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


### Preparing dictionaries

Compared to recurrent models, there is no longer any need to build a token dictionary: the pretrained tokenizer already comes with one.

But as before, we will need:
- {**label**}→{**label_idx**}: correspondence between tag and unique index (starts from 0);

We have already implemented this mapping in one of the previous parts of the task.
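
As a reminder of what such a mapping looks like, here is a small sketch. The helper `build_label2idx` and the ordering rule (`O` first, the remaining tags sorted alphabetically) are assumptions chosen because they agree with the indices printed in this section; the implementation from the earlier part may differ.

```python
from typing import Dict, List


def build_label2idx(label_seq: List[List[str]]) -> Dict[str, int]:
    """Map each tag to an index: "O" first, the remaining tags in sorted order."""
    unique_labels = {label for labels in label_seq for label in labels}
    ordered = ["O"] + sorted(unique_labels - {"O"})
    return {label: idx for idx, label in enumerate(ordered)}


# Tiny hand-made sample that covers all nine BIO tags.
sample_label_seq = [
    ["B-ORG", "O", "B-MISC", "O"],
    ["B-PER", "I-PER"],
    ["B-LOC", "I-LOC", "I-MISC", "I-ORG"],
]

# A separate name is used so that the notebook's own label2idx is not overwritten.
toy_label2idx = build_label2idx(sample_label_seq)
print(toy_label2idx)
```

With this ordering, `B-ORG` maps to 3 and `B-MISC` to 2, which matches the `[3, 0, 2, ...]` labels of the first training sentence shown later.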

### Preparing the dataset and loader

We also want to train the model in batches, so we will still need `Dataset`, `Collator` and `DataLoader`.

But we cannot reuse those from the previous parts of the task, since the data processing must be done a little differently using a tokenizer.

Let's write a new custom dataset that will receive as input (the `__init__` method):
- token_seq - list of lists of words/tokens
- label_seq - list of lists of tags

and return two lists from the `__getitem__` method:
- a list of strings (`List[str]`) with the words/tokens of the sample
- a list of integers (`List[int]`) with the indices of the corresponding tags

P.S. Unlike the previous custom dataset, here we return two `List`s instead of `torch.LongTensor`s, because the logic for building a padded batch moves to the `Collator`. The tokenizer itself returns an already padded tensor of token indices, while the tag indices must be padded by us, as in the previous dataset.

**Exercise. Implement the TransformersDataset class. <font color='red'>(1 point)</font>**

In [51]:
class TransformersDataset(torch.utils.data.Dataset):
    """
    Transformers Dataset for NER.
    """

    def __init__(
        self,
        token_seq: List[List[str]],
        label_seq: List[List[str]],
    ):
        self.token_seq = token_seq
        self.label_seq = [
            self.process_labels(labels, label2idx) for labels in label_seq
        ]

    def __len__(self):
        return len(self.token_seq)

    def __getitem__(
        self,
        idx: int,
    ) -> Tuple[List[str], List[int]]:
        return self.token_seq[idx], self.label_seq[idx]

    @staticmethod
    def process_labels(
        labels: List[str],
        label2idx: Dict[str, int],
    ) -> List[int]:
        """
        Transform list of labels into list of labels' indices.
        """
        return [label2idx[label] for label in labels]


Create three datasets:
- *train_dataset*
- *valid_dataset*
- *test_dataset*

In [52]:
train_dataset = TransformersDataset(
    token_seq=train_token_seq,
    label_seq=train_label_seq,
)
valid_dataset = TransformersDataset(
    token_seq=valid_token_seq,
    label_seq=valid_label_seq,
)
test_dataset = TransformersDataset(
    token_seq=test_token_seq,
    label_seq=test_label_seq,
)


Let's look at what we got:

In [53]:
train_dataset[0]


(['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'],
 [3, 0, 2, 0, 0, 0, 2, 0, 0])

In [54]:
valid_dataset[0]


(['cricket',
  '-',
  'leicestershire',
  'take',
  'over',
  'at',
  'top',
  'after',
  'innings',
  'victory',
  '.'],
 [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0])

In [55]:
test_dataset[0]


(['soccer',
  '-',
  'japan',
  'get',
  'lucky',
  'win',
  ',',
  'china',
  'in',
  'surprise',
  'defeat',
  '.'],
 [0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0])

In [56]:
assert len(train_dataset) == 14986, "Incorrect train_dataset length"
assert len(valid_dataset) == 3465, "Incorrect valid_dataset length"
assert len(test_dataset) == 3683, "Incorrect test_dataset length"

assert train_dataset[0][0] == [
    "eu",
    "rejects",
    "german",
    "call",
    "to",
    "boycott",
    "british",
    "lamb",
    ".",
], "Malformed train_dataset"
assert train_dataset[0][1] == [3, 0, 2, 0, 0, 0, 2, 0, 0], "Malformed train_dataset"

assert valid_dataset[0][0] == [
    "cricket",
    "-",
    "leicestershire",
    "take",
    "over",
    "at",
    "top",
    "after",
    "innings",
    "victory",
    ".",
], "Malformed valid_dataset"
assert valid_dataset[0][1] == [
    0,
    0,
    3,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
], "Malformed valid_dataset"

assert test_dataset[0][0] == [
    "soccer",
    "-",
    "japan",
    "get",
    "lucky",
    "win",
    ",",
    "china",
    "in",
    "surprise",
    "defeat",
    ".",
], "Malformed test_dataset"
assert test_dataset[0][1] == [
    0,
    0,
    1,
    0,
    0,
    0,
    0,
    4,
    0,
    0,
    0,
    0,
], "Malformed test_dataset"

print("All tests passed!")


All tests passed!


Let's implement a new `Collator`.

The collator will be initialized with 3 arguments:
- tokenizer
- tokenizer parameters in the form of a dictionary (then used as `**kwargs`)
- padding value for tag sequences (-1)

The `__call__` method takes a batch as input, namely a list of whatever the dataset's `__getitem__` method returns. In our case, this is a list of tuples of two lists - `List[Tuple[List[str], List[int]]]`.

At the output we want to get two tensors:
- Padded word/token indexes
- Padded tag indexes
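
The tag padding has to account for sub-tokens: only the first sub-token of each word keeps the word's tag, while special tokens, continuation pieces and padding receive the padding value. The sketch below shows this rule in plain Python. The function `align_labels` and the offsets are hypothetical; the offsets mimic what a tokenizer reports per word when the input is already split into words (each piece's start and end are counted from the beginning of its own word, and special or padding tokens get `(0, 0)`).

```python
from typing import List, Tuple


def align_labels(
    word_labels: List[int],
    offsets: List[Tuple[int, int]],
    padding_value: int = -1,
) -> List[int]:
    """Give each word's label to its first sub-token; all other positions get padding_value."""
    labels_iter = iter(word_labels)
    aligned = []
    for start, end in offsets:
        # First piece of a word: starts at character 0 and is not empty.
        if start == 0 and end != 0:
            aligned.append(next(labels_iter))
        else:
            # Special token, padding, or a "##" continuation piece.
            aligned.append(padding_value)
    return aligned


# Hypothetical offsets for the pieces "[CLS] e ##u rejects [SEP] [PAD]".
toy_offsets = [(0, 0), (0, 1), (1, 2), (0, 7), (0, 0), (0, 0)]
print(align_labels([3, 0], toy_offsets))  # [-1, 3, -1, 0, -1, -1]
```

Positions filled with the padding value are meant to be skipped by the loss, so the model is trained and scored on exactly one prediction per original word.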

**Exercise. Implement the TransformersCollator class. <font color='red'>(2 points)</font>**

In [57]:
from transformers import PreTrainedTokenizer
from transformers.tokenization_utils_base import BatchEncoding


class TransformersCollator:
    """
    Transformers Collator that handles variable-size sentences.
    """

    def __init__(
        self,
        tokenizer: PreTrainedTokenizer,
        tokenizer_kwargs: Dict[str, Any],
        label_padding_value: int,
    ):
        self.tokenizer = tokenizer
        self.tokenizer_kwargs = tokenizer_kwargs

        self.label_padding_value = label_padding_value

    def __call__(
        self,
        batch: List[Tuple[List[str], List[int]]],
    ) -> Tuple[BatchEncoding, torch.LongTensor]:
        tokens, labels = zip(*batch)

        tokens = self.tokenizer(tokens, **self.tokenizer_kwargs)
        labels = self.encode_labels(tokens, labels, self.label_padding_value)

        tokens.pop("offset_mapping")

        return tokens, labels

    @staticmethod
    def encode_labels(
        tokens: BatchEncoding,
        labels: List[List[int]],
        label_padding_value: int,
    ) -> torch.LongTensor:

        encoded_labels = []

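        for doc_labels, doc_offset in zip(labels, tokens.offset_mapping):

            # start with the padding value everywhere, then fill in the real labels
            doc_enc_labels = np.ones(len(doc_offset), dtype=int) * label_padding_value
            arr_offset = np.array(doc_offset)

            # the first sub-token of each word starts at offset 0 and has non-zero length;
            # special and padding tokens have offset (0, 0)
            doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = (
                doc_labels
            )
            encoded_labels.append(doc_enc_labels.tolist())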

        return torch.LongTensor(encoded_labels)


In [58]:
tokenizer_kwargs = {
    "is_split_into_words": True,
    "return_offsets_mapping": True,
    "padding": True,
    "truncation": True,
    "max_length": 512,
    "return_tensors": "pt",
}


In [59]:
collator = TransformersCollator(
    tokenizer=tokenizer,
    tokenizer_kwargs=tokenizer_kwargs,
    label_padding_value=-1,
)


Now you're ready to define the loaders:

In [60]:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collator,
)
valid_dataloader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False,  # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False,  # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)


Let's look at what we got:

In [61]:
tokens, labels = next(iter(train_dataloader))

tokens = tokens.to(device)
labels = labels.to(device)




In [62]:
tokens


{'input_ids': tensor([[  101,   116,   125, 26036,  9349,   176, 25409,   113, 12686, 16468,
          4567,   114,  1194,  1492,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0],
        [  101,  1103,  8382,  1144,  1163,  1115,  1122,  2919,  1106,  1862,
          5306,   119,   127,  3029,  1104,  1103,   170,  9739,  1657,  1137,
           126,   119,   123,  1550,  8754,  1106,  1157,  1560,  5032,  1118,
          1103,  1322,  1104,  1142,  1214,   119,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [63]:
labels


tensor([[-1,  0, -1,  4, -1,  8, -1,  0,  1, -1, -1,  0,  0,  0, -1, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1],
        [-1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0,  0, -1,
          0,  0,  0, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         -1]])

In [64]:
train_tokens, train_labels = next(
    iter(
        torch.utils.data.DataLoader(
            train_dataset,
            batch_size=2,
            shuffle=False,
            collate_fn=collator,
        )
    )
)
assert torch.equal(
    train_tokens["input_ids"],
    torch.tensor(
        [
            [
                101,
                174,
                1358,
                22961,
                176,
                14170,
                1840,
                1106,
                21423,
                9304,
                10721,
                1324,
                2495,
                12913,
                119,
                102,
            ],
            [101, 11109, 1200, 1602, 6715, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    train_tokens["attention_mask"],
    torch.tensor(
        [
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    train_labels,
    torch.tensor(
        [
            [-1, 3, -1, 0, 2, -1, 0, 0, 0, 2, -1, -1, 0, -1, 0, -1],
            [-1, 4, -1, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        ]
    ),
), "Looks like a bug in the collator"

valid_tokens, valid_labels = next(
    iter(
        torch.utils.data.DataLoader(
            valid_dataset,
            batch_size=2,
            shuffle=False,
            collate_fn=collator,
        )
    )
)
assert torch.equal(
    valid_tokens["input_ids"],
    torch.tensor(
        [
            [
                101,
                5428,
                118,
                5837,
                18117,
                5759,
                15189,
                1321,
                1166,
                1120,
                1499,
                1170,
                6687,
                2681,
                119,
                102,
            ],
            [101, 25338, 17996, 1820, 118, 4775, 118, 1476, 102, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    valid_tokens["attention_mask"],
    torch.tensor(
        [
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    valid_labels,
    torch.tensor(
        [
            [-1, 0, 0, 3, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, -1],
            [-1, 1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        ]
    ),
), "Looks like a bug in the collator"

test_tokens, test_labels = next(
    iter(
        torch.utils.data.DataLoader(
            test_dataset,
            batch_size=2,
            shuffle=False,
            collate_fn=collator,
        )
    )
)
assert torch.equal(
    test_tokens["input_ids"],
    torch.tensor(
        [
            [
                101,
                5862,
                118,
                179,
                26519,
                1179,
                1243,
                6918,
                1782,
                117,
                5144,
                1161,
                1107,
                3774,
                3326,
                119,
                102,
            ],
            [101, 9468, 3309, 1306, 19122, 2293, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    test_tokens["attention_mask"],
    torch.tensor(
        [
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
), "Looks like a bug in the collator"
assert torch.equal(
    test_labels,
    torch.tensor(
        [
            [-1, 0, 0, 1, -1, -1, 0, 0, 0, 0, 4, -1, 0, 0, 0, 0, -1],
            [-1, 4, -1, -1, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
        ]
    ),
), "Looks like a bug in the collator"

print("All tests passed!")


All tests passed!




The **transformers** library contains BERT model classes that are already adapted to specific tasks, with the corresponding classification heads. For the NER task we will use the `BertForTokenClassification` class.

By analogy with tokenizers, we can use the `AutoModelForTokenClassification` class, which, based on the name of the model, will determine which class is needed to initialize the model.

In [65]:
from transformers import AutoModelForTokenClassification


In [66]:
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label2idx),
).to(device)


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [67]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)


In [68]:
outputs = model(**tokens)


In [69]:
assert 2 < criterion(outputs["logits"].transpose(1, 2), labels) < 3

print("All tests passed!")


All tests passed!


In [70]:
# let's create a SummaryWriter for experimenting with the Transformer model

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/Transformer")


### Experiments

Run experiments on the data. Adjust parameters based on the validation set without using the test set. Your goal is to configure the network so that the quality of the model according to the F1-macro measure on the validation and test sets is no less than **0.9**.

Draw conclusions about model quality, overfitting, and sensitivity of the architecture to the choice of hyperparameters. Present the results of your experiments in the form of a mini-report (in the same ipython notebook).
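
Because most tokens carry the `O` tag, accuracy (and micro-averaged F1, which equals accuracy for single-label prediction) can look high while F1-macro stays low. The sketch below computes both on a tiny hand-made example. The helper `f1_per_class` is hypothetical and written in plain Python for illustration; it is not the `compute_metrics` function used in this notebook.

```python
from typing import Dict, List


def f1_per_class(y_true: List[int], y_pred: List[int]) -> Dict[int, float]:
    """F1 score of every class that occurs in the targets or the predictions."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Eight "O" tokens (class 0) and two entity tokens (classes 1 and 2);
# the model misses the single token of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

per_class = f1_per_class(y_true, y_pred)
macro_f1 = sum(per_class.values()) / len(per_class)  # unweighted mean over classes
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.900
print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.647
```

One missed rare class costs a third of the macro score here, which is why F1-macro is the stricter target for NER.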

You can use the same train function as before, except that the forward pass becomes `model(**tokens)` instead of `model(tokens)`, and the loss and metrics take `outputs["logits"].transpose(1, 2)` instead of `outputs`.
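
The transposition is needed because the model returns logits of shape `(batch, seq_len, num_labels)`, while `torch.nn.CrossEntropyLoss` expects the class dimension second. The sketch below illustrates this on random tensors. It assumes that the criterion is `CrossEntropyLoss(ignore_index=-1)` and that there are 9 tags; both are assumptions, since the criterion is defined in an earlier part of the notebook.

```python
import math

import torch

# 9 tags is an assumption that matches BIO tagging with four entity types.
batch_size, seq_len, num_labels = 2, 5, 9

torch.manual_seed(0)
logits = torch.randn(batch_size, seq_len, num_labels)  # layout returned by the model
labels = torch.tensor(
    [
        [-1, 3, -1, 0, -1],
        [-1, 4, 8, -1, -1],
    ]
)

# Assumed definition: positions labelled -1 do not contribute to the loss.
criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)

# CrossEntropyLoss expects (batch, num_labels, seq_len) for sequence targets.
loss = criterion(logits.transpose(1, 2), labels)

# Equivalent formulation: flatten batch and sequence dimensions.
flat_loss = criterion(logits.reshape(-1, num_labels), labels.reshape(-1))
print(torch.allclose(loss, flat_loss))

# With uniform logits the loss equals ln(num_labels), about 2.2 for 9 tags.
uniform_loss = criterion(torch.zeros(batch_size, num_labels, seq_len), labels)
print(uniform_loss.item(), math.log(num_labels))
```

Under these assumptions, a freshly initialized classification head produces near-uniform predictions, which is consistent with the earlier check that the initial loss lies between 2 and 3.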

In [75]:
def train_epoch(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One training cycle (loop).
    """

    model.train()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    for i, (tokens, labels) in tqdm(
        enumerate(dataloader),
        total=len(dataloader),
        desc="loop over train batches",
    ):

        tokens, labels = tokens.to(device), labels.to(device)

        # Loss calculation and optimizer step
        outputs = model(**tokens)["logits"].transpose(1, 2)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss.append(loss.item())
        writer.add_scalar(
            "batch loss / train", loss.item(), epoch * len(dataloader) + i
        )

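        # Recompute predictions in eval mode (dropout off) so that
        # the train metrics are comparable with the evaluation metrics
        with torch.no_grad():
            model.eval()
            outputs_inference = model(**tokens)["logits"].transpose(1, 2)
            model.train()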

        batch_metrics = compute_metrics(
            outputs=outputs_inference,
            labels=labels,
        )

        for metric_name, metric_value in batch_metrics.items():
            batch_metrics_list[metric_name].append(metric_value)
            writer.add_scalar(
                f"batch {metric_name} / train",
                metric_value,
                epoch * len(dataloader) + i,
            )

    avg_loss = np.mean(epoch_loss)
    print(f"Train loss: {avg_loss}\n")
    writer.add_scalar("loss / train", avg_loss, epoch)

    for metric_name, metric_value_list in batch_metrics_list.items():
        metric_value = np.mean(metric_value_list)
        print(f"Train {metric_name}: {metric_value}\n")
        writer.add_scalar(f"{metric_name} / train", metric_value, epoch)


def evaluate_epoch(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One evaluation cycle (loop).
    """

    model.eval()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    with torch.no_grad():

        for i, (tokens, labels) in tqdm(
            enumerate(dataloader),
            total=len(dataloader),
            desc="loop over test batches",
        ):

            tokens, labels = tokens.to(device), labels.to(device)

            outputs = model(**tokens)["logits"].transpose(1, 2)
            loss = criterion(outputs, labels)

            epoch_loss.append(loss.item())
            writer.add_scalar(
                "batch loss / test", loss.item(), epoch * len(dataloader) + i
            )

            batch_metrics = compute_metrics(
                outputs=outputs,
                labels=labels,
            )

            for metric_name, metric_value in batch_metrics.items():
                batch_metrics_list[metric_name].append(metric_value)
                writer.add_scalar(
                    f"batch {metric_name} / test",
                    metric_value,
                    epoch * len(dataloader) + i,
                )

        avg_loss = np.mean(epoch_loss)
        print(f"Test loss:  {avg_loss}\n")
        writer.add_scalar("loss / test", avg_loss, epoch)

        for metric_name, metric_value_list in batch_metrics_list.items():
            metric_value = np.mean(metric_value_list)
            print(f"Test {metric_name}: {metric_value}\n")
            writer.add_scalar(f"{metric_name} / test", metric_value, epoch)


def train(
    n_epochs: int,
    model: torch.nn.Module,
    train_dataloader: torch.utils.data.DataLoader,
    test_dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
) -> None:
    """
    Training loop: runs one training epoch followed by one evaluation epoch, `n_epochs` times.
    """

    for epoch in range(n_epochs):

        print(f"Epoch [{epoch+1} / {n_epochs}]\n")

        train_epoch(
            model=model,
            dataloader=train_dataloader,
            optimizer=optimizer,
            criterion=criterion,
            writer=writer,
            device=device,
            epoch=epoch,
        )
        evaluate_epoch(
            model=model,
            dataloader=test_dataloader,
            criterion=criterion,
            writer=writer,
            device=device,
            epoch=epoch,
        )
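
The `.transpose(1, 2)` applied to the logits in both loops can be illustrated in isolation. The sketch below assumes that `criterion` is `torch.nn.CrossEntropyLoss` and that padded positions are labelled `-100`; both are assumptions, and the tensor sizes are hypothetical. It shows that the transposed call gives the same value as flattening all tokens into one long batch.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len, num_classes = 2, 5, 9

# Token-classification logits as a model would return them:
# (batch, seq_len, num_classes).
logits = torch.randn(batch_size, seq_len, num_classes)

# One integer class id per token; -100 marks positions to skip
# (assumed padding convention, not taken from the notebook).
labels = torch.randint(0, num_classes, (batch_size, seq_len))
labels[:, -1] = -100

criterion = torch.nn.CrossEntropyLoss(ignore_index=-100)

# CrossEntropyLoss expects the class dimension second:
# (batch, num_classes, seq_len) against labels of shape (batch, seq_len).
loss = criterion(logits.transpose(1, 2), labels)

# Equivalent formulation: treat every token as a separate sample.
loss_flat = criterion(logits.reshape(-1, num_classes), labels.reshape(-1))

print(loss.item(), loss_flat.item())
```

Both calls average the loss over the non-ignored tokens only, so the two printed values should agree up to floating-point error.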


**Exercise. Conduct experiments.** **<font color='red'>(2 points)</font>**

In [74]:
train(
    n_epochs=10,
    model=model,
    train_dataloader=train_dataloader,
    test_dataloader=valid_dataloader,
    optimizer=optimizer,
    criterion=criterion,
    writer=writer,
    device=device,
)


Epoch [1 / 10]




Train loss: 0.12875038324686075

Train accuracy: 0.9689739496533584

Train precision_micro: 0.9689739496533584

Train precision_macro: 0.8655111137812581

Train precision_weighted: 0.9659449067479068

Train recall_micro: 0.9689739496533584

Train recall_macro: 0.8676007227094886

Train recall_weighted: 0.9689739496533584

Train f1_micro: 0.9689739496533584

Train f1_macro: 0.8616428095269807

Train f1_weighted: 0.9653438051661338




Test loss:  0.06725922858130108

Test accuracy: 0.9819445064113974

Test precision_micro: 0.9819445064113974

Test precision_macro: 0.9358832137504917

Test precision_weighted: 0.9815174614975786

Test recall_micro: 0.9819445064113974

Test recall_macro: 0.9351121050266566

Test recall_weighted: 0.9819445064113974

Test f1_micro: 0.9819445064113974

Test f1_macro: 0.9335589842677765

Test f1_weighted: 0.980556274514616

Epoch [2 / 10]




## Part 4 - Bonus. BiLSTMAttention-tagger (2 points)

Repeat the experiments from Part 2, this time using an improved BiLSTM tagger architecture augmented with an attention mechanism.

**Please note** that you do not need to implement Attention yourself; you can use `torch.nn.MultiheadAttention`.
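
As a starting point, here is a minimal sketch of how `torch.nn.MultiheadAttention` could be called as self-attention over BiLSTM outputs. It is not a full `BiLSTMAttn` solution: the random tensor stands in for real LSTM outputs, and all sizes and the padding pattern are hypothetical.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len, hidden_size, num_heads = 2, 6, 16, 4

# Stand-in for BiLSTM outputs: (batch, seq_len, 2 * hidden_size).
lstm_out = torch.randn(batch_size, seq_len, 2 * hidden_size)

# True marks padding positions that must not be attended to.
padding_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
padding_mask[1, 4:] = True  # pretend the second sentence has only 4 real tokens

attention = torch.nn.MultiheadAttention(
    embed_dim=2 * hidden_size,  # must be divisible by num_heads
    num_heads=num_heads,
    batch_first=True,  # inputs and outputs are (batch, seq_len, embed_dim)
)

# Self-attention: query, key and value are all the LSTM outputs.
attn_out, attn_weights = attention(
    lstm_out, lstm_out, lstm_out, key_padding_mask=padding_mask
)

print(attn_out.shape)      # (batch, seq_len, 2 * hidden_size)
print(attn_weights.shape)  # (batch, seq_len, seq_len), averaged over heads
```

One possible design is to feed `attn_out` (optionally combined with `lstm_out` through a residual connection) into the final linear layer that produces per-token logits. Whether this helps is exactly what the experiments should show.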

Also draw conclusions about model quality, overfitting, and the sensitivity of the architecture to the choice of hyperparameters, and briefly compare the results with the previous architecture. Present the results of your experiments as a mini-report in the same Jupyter notebook.

**Exercise. Implement the model class BiLSTMAttn.** **<font color='red'>(1 point)</font>**

In [None]:
# YOUR CODE HERE


**Exercise. Conduct experiments and beat the metric value from part 2.** **<font color='red'>(1 point)</font>**

P.S. If the quality does not improve, explain why.

In [None]:
# YOUR CODE HERE
