<a target="_blank" href="https://colab.research.google.com/github/PaulLerner/aivancity_nlp/blob/main/pw4_eval_ie.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Installation and imports

Hit `Ctrl+S` to save a copy of the Colab notebook to your drive

Run on Google Colab GPU:
- Connect
- Modify execution
- GPU

![image.png](https://paullerner.github.io/aivancity_nlp/_static/colab_gpu.png)

In [1]:
import torch

from torch import nn

from transformers import AutoTokenizer, AutoModel

In [2]:
assert torch.cuda.is_available(), "Connect to GPU and try again (ask teacher for help)"

# Data

We'll use the [E3C dataset](https://books.openedition.org/aaccademia/pdf/8663) of Named Entity Recognition in clinical texts.

Therefore, we'll use a model pretrained on scientific texts: [SciBert](https://www.aclweb.org/anthology/D19-1371)

## Loading


In [3]:
!wget https://raw.githubusercontent.com/hltfbk/E3C-Corpus/refs/heads/main/preprocessed_data/clinical_entities/layer1/English/test.txt

--2025-03-04 13:11:49--  https://raw.githubusercontent.com/hltfbk/E3C-Corpus/refs/heads/main/preprocessed_data/clinical_entities/layer1/English/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 136768 (134K) [text/plain]
Saving to: ‘test.txt’


2025-03-04 13:11:50 (4.92 MB/s) - ‘test.txt’ saved [136768/136768]



In [4]:
!wget https://raw.githubusercontent.com/hltfbk/E3C-Corpus/refs/heads/main/preprocessed_data/clinical_entities/layer1/English/train.txt

--2025-03-04 13:12:11--  https://raw.githubusercontent.com/hltfbk/E3C-Corpus/refs/heads/main/preprocessed_data/clinical_entities/layer1/English/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103512 (101K) [text/plain]
Saving to: ‘train.txt’


2025-03-04 13:12:11 (4.25 MB/s) - ‘train.txt’ saved [103512/103512]



In [5]:
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_cased", device_map="auto", add_pooling_layer=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/442M [00:00<?, ?B/s]

In [6]:
encoder

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31116, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [7]:
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased", do_lower_case=False)

vocab.txt:   0%|          | 0.00/222k [00:00<?, ?B/s]

In [8]:
LABELS = {'O': 0, 'B-ety': 1, 'I-ety': 2}
i2label = {i: label for label, i in LABELS.items()}

In [9]:
def read_data(path):
    with open(path,"rt") as file:
        lines = file.read().strip().split("\n")
    dataset = []
    tokens, labels = [], []
    for line in lines:
        if line:
            token, label = line.split(" ")
            tokens.append(token)
            labels.append(LABELS[label])
        else:
            dataset.append((tokens, labels))
            tokens, labels = [], []
    dataset.append((tokens, labels))
    return dataset
train_set = read_data("train.txt")

In [10]:
test_set = read_data("test.txt")

In [11]:
print(len(train_set), len(test_set))

669 851


In [12]:
item = train_set[0]
print(" ".join(item[0]))
print(list(zip(item[0], [i2label[label] for label in item[1]])))

A 14 - year old boy with no significant past medical history presented to a small district hospital in southern Sierra Leone with a 4 day history of facial puffiness , peripheral pitting oedema , abdominal pains , and reduced urine output .
[('A', 'O'), ('14', 'O'), ('-', 'O'), ('year', 'O'), ('old', 'O'), ('boy', 'O'), ('with', 'O'), ('no', 'O'), ('significant', 'O'), ('past', 'O'), ('medical', 'O'), ('history', 'O'), ('presented', 'O'), ('to', 'O'), ('a', 'O'), ('small', 'O'), ('district', 'O'), ('hospital', 'O'), ('in', 'O'), ('southern', 'O'), ('Sierra', 'O'), ('Leone', 'O'), ('with', 'O'), ('a', 'O'), ('4', 'O'), ('day', 'O'), ('history', 'O'), ('of', 'O'), ('facial', 'B-ety'), ('puffiness', 'I-ety'), (',', 'O'), ('peripheral', 'O'), ('pitting', 'B-ety'), ('oedema', 'I-ety'), (',', 'O'), ('abdominal', 'O'), ('pains', 'O'), (',', 'O'), ('and', 'O'), ('reduced', 'O'), ('urine', 'O'), ('output', 'O'), ('.', 'O')]


## Train-dev-test split
Notice that the original dataset has no dedicated validation split.

Randomly split the test set in 50% validation and 50% test

In [14]:
from sklearn.model_selection import train_test_split

# Assuming train_set and test_set are already defined
test_set, dev_set = train_test_split(test_set, test_size=0.5, random_state=42)

In [15]:
print(len(train_set), len(dev_set), len(test_set))

669 426 425


## Sequence labels

Because BERT has a subword tokenizer, we need to map the label sequence to the subwords.

Beware to change the label from B to I for subwords that are in the middle of a word.

For example `'pit', '##ting'` should be tagged `'B-ety', 'I-ety'`, not `'B-ety', 'B-ety'`

Also make sure to add BERT special tokens for beginning of sequence and end of sequence. Give them `-100` as label so they are not taken into account when computing the loss.


In [21]:
words=tokenizer.tokenize(" ".join(item[0]))

In [26]:
words

['A',
 '14',
 '-',
 'year',
 'old',
 'boy',
 'with',
 'no',
 'significant',
 'past',
 'medical',
 'history',
 'presented',
 'to',
 'a',
 'small',
 'district',
 'hospital',
 'in',
 'southern',
 'Sie',
 '##rr',
 '##a',
 'Leon',
 '##e',
 'with',
 'a',
 '4',
 'day',
 'history',
 'of',
 'facial',
 'pu',
 '##ffi',
 '##ness',
 ',',
 'peripheral',
 'pit',
 '##ting',
 'o',
 '##edema',
 ',',
 'abdominal',
 'pain',
 '##s',
 ',',
 'and',
 'reduced',
 'urine',
 'output',
 '.']

In [22]:
tokenizer.cls_token

'[CLS]'

In [23]:
tokenizer.sep_token

'[SEP]'

In [37]:
def tokenize_and_align_labels(tokenizer, tokens, labels):
    tokenized_input = []
    aligned_labels = []

    tokenized_input.append((tokenizer.cls_token, None))  # Add [CLS] token
    aligned_labels.append(-100)

    for token, label in zip(tokens, labels):
        sub_tokens = tokenizer.tokenize(token)

        for i, sub_token in enumerate(sub_tokens):
            tokenized_input.append((sub_token, label if i == 0 else f"I-{label[2:]}" if "-" in label else label))
            aligned_labels.append(label if i == 0 else f"I-{label[2:]}" if "-" in label else label)

    tokenized_input.append((tokenizer.sep_token, None))  # Add [SEP] token
    aligned_labels.append(-100)

    return tokenized_input, aligned_labels

# Example usage
tokenized, aligned = tokenize_and_align_labels(tokenizer, item[0], [i2label[label] for label in item[1]])

print(tokenized)

[('[CLS]', None), ('A', 'O'), ('14', 'O'), ('-', 'O'), ('year', 'O'), ('old', 'O'), ('boy', 'O'), ('with', 'O'), ('no', 'O'), ('significant', 'O'), ('past', 'O'), ('medical', 'O'), ('history', 'O'), ('presented', 'O'), ('to', 'O'), ('a', 'O'), ('small', 'O'), ('district', 'O'), ('hospital', 'O'), ('in', 'O'), ('southern', 'O'), ('Sie', 'O'), ('##rr', 'O'), ('##a', 'O'), ('Leon', 'O'), ('##e', 'O'), ('with', 'O'), ('a', 'O'), ('4', 'O'), ('day', 'O'), ('history', 'O'), ('of', 'O'), ('facial', 'B-ety'), ('pu', 'I-ety'), ('##ffi', 'I-ety'), ('##ness', 'I-ety'), (',', 'O'), ('peripheral', 'O'), ('pit', 'B-ety'), ('##ting', 'I-ety'), ('o', 'I-ety'), ('##edema', 'I-ety'), (',', 'O'), ('abdominal', 'O'), ('pain', 'O'), ('##s', 'O'), (',', 'O'), ('and', 'O'), ('reduced', 'O'), ('urine', 'O'), ('output', 'O'), ('.', 'O'), ('[SEP]', None)]


## Batching

In [38]:
batch_size = 4
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, collate_fn=lambda x: x)


In [39]:
batch = next(iter(train_loader))

Tokenize and get labels for all examples in the batch. Get the identifiers of tokens using `tokenizer.convert_tokens_to_ids`

Save the length (number of tokens) of each example in a separate list

You should end up with a list of list like so

In [41]:
def process_batch(batch, tokenizer, i2label):
    """
    Traite un batch d'exemples en tokenisant, alignant les labels, convertissant les tokens en identifiants et en padant les séquences.

    Paramètres:
      - batch: liste d'exemples, où chaque exemple est un tuple (tokens, label_indices)
      - tokenizer: le tokenizer BERT (ex: SciBERT)
      - i2label: dictionnaire de conversion des indices de labels en leur représentation (ex: {0: "O", 1: "B-ety", 2: "I-ety"})

    Retourne:
      - tokens_batch: liste de listes d'identifiants de tokens, padées à la même longueur
      - labels_batch: liste de listes de labels alignés, padés avec -100
      - lengths: liste contenant la longueur (nombre de tokens) de chaque exemple avant padding
    """
    tokens_batch = []
    labels_batch = []
    lengths = []

    for example in batch:
        tokens, label_indices = example
        # Conversion des indices de labels en labels de base (ex: "O", "B-ety", "I-ety")
        base_labels = [i2label[label] for label in label_indices]
        # Tokenize et alignement des labels pour l'exemple courant
        tokenized_input, aligned_labels = tokenize_and_align_labels(tokenizer, tokens, base_labels)
        # Extraire la liste des tokens sous forme de chaînes
        tokens_list = [token for token, _ in tokenized_input]
        # Conversion en identifiants
        token_ids = tokenizer.convert_tokens_to_ids(tokens_list)

        tokens_batch.append(token_ids)
        labels_batch.append(aligned_labels)
        lengths.append(len(token_ids))

    # Padding des séquences
    max_len = max(lengths)
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0

    for i in range(len(tokens_batch)):
        pad_len = max_len - len(tokens_batch[i])
        tokens_batch[i] = tokens_batch[i] + [pad_token_id] * pad_len
        labels_batch[i] = labels_batch[i] + [-100] * pad_len

    return tokens_batch, labels_batch, lengths

In [42]:
batch_size = 4
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, collate_fn=lambda x: x)
batch = next(iter(train_loader))

tokens_batch, labels_batch, lengths = process_batch(batch, tokenizer, i2label)

print("tokens_batch:")
print(tokens_batch)
print("\nlabels_batch:")
print(labels_batch)
print("\nLengths:", lengths)


tokens_batch:
[[101, 186, 16583, 12929, 125, 111, 333, 3135, 30116, 10689, 253, 9812, 4449, 188, 17504, 146, 111, 9794, 5903, 18325, 211, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 186, 14012, 2504, 2899, 105, 1155, 5991, 1129, 942, 23994, 16861, 30113, 28294, 430, 188, 15409, 14881, 125, 111, 2185, 18542, 1346, 430, 25549, 11189, 12199, 30108, 15396, 30108, 136, 1326, 1877, 6409, 2979, 125, 111, 12551, 19647, 10050, 430, 15463, 111, 2891, 125, 4572, 17691, 5812, 945, 211, 102], [101, 186, 2654, 19218, 125, 111, 4523, 21055, 8197, 253, 148, 30141, 243, 30129, 194, 30129, 211, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 29203, 273, 2170, 210, 19241, 19079, 16556, 211, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

labels_batch:
[[-100, 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',

Notice that each example in the batch has a different size so we will pad them. The padded tokens will be masked in the attention mechanism (already implemented in BERT, you simply need to pass `attention_mask`) and will be given the label `-100` so they are not taken into account when computing the loss.


In [45]:
lengths

[22, 51, 18, 10]



```
# Ce texte est au format code
```

Here we simply add padding so that all examples can fit in the same Tensor

In [51]:
input_ids = torch.zeros(len(batch), max(lengths), dtype=int, device="cuda")
attention_mask = torch.zeros(len(batch), max(lengths), dtype=int, device="cuda")
labels = torch.full((len(batch), max(lengths)), -100, dtype=int, device="cuda")
label_to_id=LABELS

In [54]:
for i, (token, label) in enumerate(zip(tokens_batch, labels_batch)):
    # Convertir la liste des tokens et des labels en tenseur.
    input_ids[i, :len(token)] = torch.tensor(token, dtype=torch.long)
    attention_mask[i, :len(token)] = 1

    # Convertir les labels (chaînes) en entiers à l'aide de label_to_id.
    numeric_labels = [label_to_id[l] if l in label_to_id else -100 for l in label]
    labels[i, :len(numeric_labels)] = torch.tensor(numeric_labels, dtype=torch.long)

Voilà!

In [55]:
input_ids.shape, attention_mask.shape, labels.shape

(torch.Size([4, 51]), torch.Size([4, 51]), torch.Size([4, 51]))

In [56]:
input_ids

tensor([[  101,   186, 16583, 12929,   125,   111,   333,  3135, 30116, 10689,
           253,  9812,  4449,   188, 17504,   146,   111,  9794,  5903, 18325,
           211,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0],
        [  101,   186, 14012,  2504,  2899,   105,  1155,  5991,  1129,   942,
         23994, 16861, 30113, 28294,   430,   188, 15409, 14881,   125,   111,
          2185, 18542,  1346,   430, 25549, 11189, 12199, 30108, 15396, 30108,
           136,  1326,  1877,  6409,  2979,   125,   111, 12551, 19647, 10050,
           430, 15463,   111,  2891,   125,  4572, 17691,  5812,   945,   211,
           102],
        [  101,   186,  2654, 19218,   125,   111,  4523, 21055,  8197,   253,
           148, 30141,   243, 30129,   194, 30129,   211,   102,     0,     0,
             0,   

In [57]:
labels

tensor([[-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100],
        [-100,    0,    0,    0,    0,    0,    0,    1,    2,    2,    2,    2,
            2,    2,    0,    0,    1,    2,    2,    2,    2,    2,    2,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    1,    2,    2,
            2,    0, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -1

In [58]:
attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]], device='cuda:0')

# Model
## Encoding with BERT

BERT provides contextualized embeddings for all tokens in the batch

In [59]:
output = encoder(input_ids=input_ids, attention_mask=attention_mask)

In [60]:
# (batch size, sequence length, embedding dimension)
output.last_hidden_state.shape

torch.Size([4, 51, 768])

## Sequence Tagging
For Named Entity Recognition, we use a simple tagging model: we simply add a linear layer on top of the encoder for multi-class classification:
- What is the input dimension of the classifier?
- How many classes are there?

Note: this Linear classifier is really simple and **assigns a label to each token independently**. A better way to do Named Entity Recognition that we did not cover in class is to use Conditional Random Field (CRF) for **structured prediction** (basically, assign the label of a token based on the other labels already assigned).

Note 2: HuggingFace's `transformers` provides built-in sequence tagging models like https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForTokenClassification which you are not allowed to use until the end of this class

In [None]:
class Tagger(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        raise NotImplementedError()
        self.classifier = TODO

    def forward(self, input_ids, attention_mask):
        embeddings = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        raise NotImplementedError()

In [None]:
tagger = Tagger(encoder).cuda()

In [None]:
logits = tagger(input_ids, attention_mask)

In [None]:
# (batch size, sequence length, number of classes)
logits.shape

torch.Size([4, 24, 3])

## Loss function

Use Cross-entropy to train the classifier. First compute the loss on a single batch.

In [None]:
# value may vary but you might get a loss like this one

tensor(1.0238, device='cuda:0', grad_fn=<NllLossBackward0>)

# Training

In [None]:
%load_ext tensorboard

In [None]:
import torch
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("logs/pw4")

Run tensorboard before training. Refresh during training.

In [None]:
%tensorboard --logdir logs/pw4

In [None]:
tagger = Tagger(encoder).cuda()
# gradient checkpoint: lower memory footprint but you need to compute forward passes twice
tagger.encoder.gradient_checkpointing_enable()

loss_fct = nn.CrossEntropyLoss(ignore_index=-100)



optimizer = torch.optim.AdamW(tagger.parameters(), lr=0.0001)

batch_size = 8
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, collate_fn=lambda x: x)
validation_loader = torch.utils.data.DataLoader(dev_set, batch_size=batch_size, shuffle=False, collate_fn=lambda x: x)

steps = 0
for epoch in range(5):
    for batch in train_loader:
        raise NotImplementedError("Format batch and compute loss as you did above")
        loss = TODO

        writer.add_scalar("Loss/train", loss.item(), steps)
        steps += 1
        loss.backward()
        # gradient clipping to avoid exploding gradients
        nn.utils.clip_grad_norm_(tagger.parameters(), 1.0)
        optimizer.step()

    # validation
    with torch.no_grad():
        tagger.eval()
        valid_loss = 0
        valid_batches = 0
        for batch in validation_loader:
            raise NotImplementedError("Format batch and compute loss as you did above")
            loss = TODO
            valid_loss += loss.item()
            valid_batches += 1
        tagger.train()
        writer.add_scalar("Loss/validation", valid_loss/valid_batches, epoch)

    # saving checkpoint
    torch.save(tagger.state_dict(), f"tagger_{epoch}.pt")

# Testing

Notice that the model quickly overfits (after 1 epoch?) given the small size of the training steps. Load the best checkpoint according to the validation loss (epoch 1 in my case)

In [None]:
tagger.load_state_dict(torch.load("tagger_0.pt"))
tagger.eval()

  tagger.load_state_dict(torch.load("tagger_0.pt"))


Tagger(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31116, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_aff

In [None]:
LABELS

{'O': 0, 'B-ety': 1, 'I-ety': 2}

Now run your model on your test set. This time we're not only interested in a low loss but in the actual predictions of the models:
- How to get the class predicted by your model on each token?

Once you get classes, compute:
- precision (of classifying 'B-ety' and 'I-ety')
- recall (of classifying 'B-ety' and 'I-ety')
- F1-score (of classifying 'B-ety' and 'I-ety')

Bonus: what is the issue with computing scores on tokens tokenized using BERT tokenizer?

In [None]:
# example prediction of my model

[('[CLS]', 'O'),
 ('Given', 'O'),
 ('the', 'O'),
 ('positive', 'O'),
 ('screen', 'O'),
 ('for', 'O'),
 ('ce', 'B-ety'),
 ('##li', 'I-ety'),
 ('##ac', 'I-ety'),
 ('disease', 'I-ety'),
 ('(', 'O'),
 ('positive', 'O'),
 ('anti', 'O'),
 ('-', 'O'),
 ('tissue', 'O'),
 ('trans', 'O'),
 ('##glut', 'O'),
 ('##aminase', 'O'),
 ('antibodies', 'O'),
 ('and', 'O'),
 ('results', 'O'),
 ('of', 'O'),
 ('duoden', 'O'),
 ('##al', 'O'),
 ('biopsy', 'O'),
 (')', 'O'),
 (',', 'O'),
 ('dietary', 'O'),
 ('intervention', 'O'),
 ('was', 'O'),
 ('immediately', 'O'),
 ('comm', 'O'),
 ('##enced', 'O'),
 ('.', 'O'),
 ('[SEP]', 'O')]

In [None]:
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, collate_fn=lambda x: x)
for batch in test_loader:
    raise NotImplementedError()

The result is pretty bad but consider we use a very small training set

In [None]:
print(f"{precision=:.1%}, {recall=:.1%}, {f_score=:.1%}")

precision=61.3%, recall=29.3%, f_score=39.7%


# Bonus

## Evaluation

- Resplit the data by keeping the 426 longest texts as out-of-distribution test set. The remaining can be randomly split in train-validation.
- Compute F1 score using detokenized outputs

## Parameter-efficient fine-tuning

Compare full fine-tuning like above with LoRA

## More encoders

Evaluate other models, e.g.:
- another encoder, e.g. BERT or RoBERTa that were trained on general domain texts or https://huggingface.co/medicalai/ClinicalBERT that was trained on clinical texts
- a decoder-only like GPT-2
- an encoder-decoder like BART

## Better classifiers

use Conditional Random Field (CRF) instead of a linear classifier