# Named entity recognition with transformers

<a target="_blank" href="https://colab.research.google.com/github/jaspock/me/blob/main/docs/materials/transformers/assets/notebooks/nerbert.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a href="http://dlsi.ua.es/~japerez/"><img src="https://img.shields.io/badge/Universitat-d'Alacant-5b7c99" style="margin-left:10px"></a>

Notebook and code written by Juan Antonio Pérez in 2023–2024.

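This notebook uses the encoder-like transformer from our previous notebook to train and test a toy named entity recognition model on a tiny dataset. The toy dataset labels each token with a part-of-speech tag; the token-classification setup is the same one used for named entity recognition.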

We assume that you are already familiar with the basics of PyTorch. This notebook complements a [learning guide](https://dlsi.ua.es/~japerez/materials/transformers/intro/) based on studying the math behind the models by reading the book "[Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)" (3rd edition) by Jurafsky and Martin. It is part of a series of notebooks meant to be studied in order, so make sure you follow the right sequence. If your learning is being supervised by a teacher, follow any additional instructions you may have received. Although you may use a GPU environment to execute the code, the computational requirements of the default settings are so low that you can probably run it on a CPU.

In [1]:
%%capture
%pip install torch

## Mini-batch preparation

In [2]:
import torch
import itertools

def make_batch(input_sentences, output_tags, word_index, tag_index, max_len, batch_size, device):
    input_batch = []
    output_batch = []
    data_cycle = itertools.cycle(zip(input_sentences, output_tags))

    # to-do: adjust T to be minimum of the actual max length of the batch or max_len

    for s, t in data_cycle:  # itertools.cycle never ends, so no outer loop is needed
        words = s.split()
        tags = t.split()
        assert len(words) == len(tags)
        assert len(words) <= max_len, "sentence longer than max_len"
        inputs = [word_index[n] for n in words]
        inputs = inputs + [0] * (max_len - len(inputs))  # padded inputs
        tags = [tag_index[n] for n in tags]
        tags = tags + [0] * (max_len - len(tags))  # padded outputs
        input_batch.append(inputs)
        output_batch.append(tags)

        if len(input_batch) == batch_size:
            # torch.tensor accepts any device; the legacy torch.LongTensor constructor rejects non-CPU devices
            yield (torch.tensor(input_batch, dtype=torch.long, device=device),
                   torch.tensor(output_batch, dtype=torch.long, device=device))
            input_batch = []
            output_batch = []
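
The to-do in `make_batch` asks for padding each batch only up to its longest sentence instead of always up to `max_len`. A minimal sketch of that idea, in plain Python, could look as follows. The helper name `pad_to_batch_max` and its `pad_id` parameter are hypothetical and not part of the notebook; we also assume that the model accepts sequences shorter than `max_len`, which should be checked against the `EncoderTransformer` implementation.

```python
def pad_to_batch_max(sequences, max_len, pad_id=0):
    """Pad lists of ids to min(longest sequence in the batch, max_len).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    target_len = min(max(len(seq) for seq in sequences), max_len)
    padded = []
    for seq in sequences:
        seq = seq[:target_len]  # truncate anything longer than max_len
        padded.append(seq + [pad_id] * (target_len - len(seq)))
    return padded


# Two sentences of 7 and 5 tokens: the batch is padded to 7, not to max_len=12.
batch = [[19, 24, 18, 11, 4, 5, 13], [25, 9, 16, 10, 13]]
padded = pad_to_batch_max(batch, max_len=12)
print(padded)
```

The resulting lists could then be turned into tensors exactly as `make_batch` does, which would save computation on the padded positions.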

## Import our transformer code

We load the `EncoderTransformer` class implemented in the previous notebook. If we are running this notebook in Google Colab, we download the file from GitHub. If we are running it locally, we assume that the file is in the same directory as this notebook. The seed is also set to a fixed value to ensure reproducibility.

In [3]:
%%capture
import os
colab = bool(os.getenv("COLAB_RELEASE_TAG"))  # running in Google Colab?
if not os.path.isfile('transformer.ipynb') and colab:
    # wget is a shell command, so it needs the `!` prefix rather than a `%` magic
    !wget https://raw.githubusercontent.com/jaspock/minGPT/master/transformer.ipynb

%pip install nbformat
%run './transformer.ipynb'

set_seed(42)

## Corpus preprocessing

In [4]:
input_sentences = [
    "The cat sat on the mat .",
    "I love eating pizza .",
    "John is running in the park .",
    "She gave him a beautiful gift .",
    "They are playing soccer together .",
    "The cat is eating pizza in the park ."
]

output_tags = [
    "DET NOUN VERB ADP DET NOUN PUNCT",
    "PRON VERB VERB NOUN PUNCT",
    "PROPN AUX VERB ADP DET NOUN PUNCT",
    "PRON VERB PRON DET ADJ NOUN PUNCT",
    "PRON AUX VERB NOUN ADV PUNCT",
    "DET NOUN AUX VERB NOUN ADP DET NOUN PUNCT"
]

word_list = list(set(" ".join(input_sentences).split()))
word_index = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}
special_tokens = len(word_index) 
for step, w in enumerate(word_list):
    word_index[w] = step + special_tokens
index_word = {i: w for w, i in word_index.items()}  # inverse mapping: id -> word
input_vocab_size = len(word_index)
tag_list = list(set(" ".join(output_tags).split()))
tag_index = {'[PAD]': 0}  # padding index must be 0
for step, t in enumerate(tag_list):
    tag_index[t] = step + 1
index_tag = {i: t for t, i in tag_index.items()}  # inverse mapping: id -> tag
output_vocab_size = len(tag_index)
print("input vocab size: %d" % input_vocab_size)
print("output vocab size: %d" % output_vocab_size)

input vocab size: 31
output vocab size: 11
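
Because `list(set(...))` does not guarantee any particular order, the ids assigned to words and tags may change from one run of the notebook to another. If stable ids are wanted, one option is to sort the tokens before numbering them. The sketch below illustrates this with a hypothetical `build_index` helper and `demo_` variables that are not part of the notebook; note that the ids it produces differ from those shown in the outputs of this notebook.

```python
def build_index(tokens, specials):
    """Map special tokens first, then the remaining tokens in sorted order.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    index = {tok: i for i, tok in enumerate(specials)}
    for tok in sorted(set(tokens)):
        if tok not in index:
            index[tok] = len(index)
    return index


sentences = ["The cat sat on the mat .", "I love eating pizza ."]
tokens = " ".join(sentences).split()
demo_word_index = build_index(tokens, ["[PAD]", "[CLS]", "[SEP]", "[MASK]"])
demo_index_word = {i: w for w, i in demo_word_index.items()}

# Encode a sentence and decode it back to check that the mappings are consistent.
ids = [demo_word_index[w] for w in sentences[1].split()]
decoded = " ".join(demo_index_word[i] for i in ids)
print(ids)
print(decoded)
```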


## Model training

In [5]:
n_layer = 2
n_head = 2
n_embd =  64
embd_pdrop = 0.1
resid_pdrop = 0.1
attn_pdrop = 0.1
batch_size = 3
max_len = 12
lr = 0.001
training_steps = 1000
eval_steps = 100

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = EncoderTransformer(n_embd=n_embd, n_head=n_head, n_layer=n_layer, input_vocab_size=input_vocab_size, output_vocab_size=output_vocab_size, 
                max_len=max_len, embd_pdrop=embd_pdrop, attn_pdrop=attn_pdrop, resid_pdrop=resid_pdrop)
model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5, total_iters=training_steps)

model.train()
step = 0
for inputs, outputs in make_batch(input_sentences=input_sentences, output_tags=output_tags, word_index=word_index, 
                                    tag_index=tag_index, max_len=max_len, batch_size=batch_size, device=device):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = criterion(logits.view(-1,logits.size(-1)), outputs.view(-1)) 
    if step % eval_steps == 0:
        print(f'Step [{step}/{training_steps}], loss: {loss.item():.4f}')
    loss.backward()
    optimizer.step()
    scheduler.step()
    step += 1
    if step == training_steps:
        break

print(f'Step [{step}/{training_steps}], loss: {loss.item():.4f}')

number of parameters: 0.10M
Step [0/1000], loss: 2.3123
Step [100/1000], loss: 0.0666
Step [200/1000], loss: 0.0203
Step [300/1000], loss: 0.0103
Step [400/1000], loss: 0.0065
Step [500/1000], loss: 0.0046
Step [600/1000], loss: 0.0032
Step [700/1000], loss: 0.0027
Step [800/1000], loss: 0.0023
Step [900/1000], loss: 0.0018
Step [1000/1000], loss: 0.0019
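
The `ignore_index=0` argument makes `nn.CrossEntropyLoss` skip padded positions: with the default mean reduction, the loss is averaged only over the tokens whose target is not `[PAD]`. The following sketch reproduces that computation in plain Python for a few hand-written logits. The helper `masked_cross_entropy` is hypothetical and written only for illustration; it is not the PyTorch implementation.

```python
import math


def masked_cross_entropy(logits, targets, ignore_index=0):
    """Mean negative log-likelihood over positions whose target != ignore_index.

    Hypothetical helper for illustration; assumes at least one non-ignored target.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count


# Three positions with three classes each; the last position is padding.
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]
loss = masked_cross_entropy(logits, targets)

# Changing the logits at the padded position leaves the loss untouched.
other_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [-3.0, 7.0, 1.0]]
same_loss = masked_cross_entropy(other_logits, targets)
print(f"{loss:.4f} {same_loss:.4f}")
```

This is also why the model is free to output anything at padded positions: no gradient ever pushes those predictions towards a particular tag.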


## Model evaluation

In [9]:
# predict tags
# note: this evaluates on the first training batch, not on held-out data
model.eval()
inputs, outputs = next(make_batch(input_sentences=input_sentences, output_tags=output_tags, word_index=word_index, tag_index=tag_index, max_len=max_len, batch_size=batch_size, device=device))
print(inputs)
print(outputs)
with torch.no_grad():  # no gradients are needed for inference
    logits = model(inputs)
indices = logits.argmax(dim=-1)
# compute accuracy excluding pads:
mask = outputs != 0
accuracy = (indices[mask] == outputs[mask]).sum().item() / mask.sum().item()
print(f"Accuracy: {accuracy*100:.2f}%")


predict_tags, true_tags, input_words = [], [], []  # 3 lists are required, not one
for step in range(batch_size):
    predict_tags.append(" ".join([index_tag[each.item()] for each in indices[step]]))
    true_tags.append(" ".join([index_tag[each.item()] for each in outputs[step]]))
    input_words.append(" ".join([index_word[each.item()] for each in inputs[step]]))
print("Input:\n", "\n".join(input_words))
print("Prediction: \n", "\n".join(predict_tags))
print("Target: \n", "\n".join(true_tags))

tensor([[19, 24, 18, 11,  4,  5, 13,  0,  0,  0,  0,  0],
        [25,  9, 16, 10, 13,  0,  0,  0,  0,  0,  0,  0],
        [26, 22, 17, 23,  4, 20, 13,  0,  0,  0,  0,  0]])
tensor([[ 7, 10,  1,  3,  7, 10,  2,  0,  0,  0,  0,  0],
        [ 6,  1,  1, 10,  2,  0,  0,  0,  0,  0,  0,  0],
        [ 4,  5,  1,  3,  7, 10,  2,  0,  0,  0,  0,  0]])
Accuracy: 100.00%
Input:
 The cat sat on the mat . [PAD] [PAD] [PAD] [PAD] [PAD]
I love eating pizza . [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
John is running in the park . [PAD] [PAD] [PAD] [PAD] [PAD]
Prediction: 
 DET NOUN VERB ADP DET NOUN PUNCT NOUN PUNCT ADP PUNCT PUNCT
PRON VERB VERB NOUN PUNCT NOUN PUNCT NOUN PUNCT NOUN PUNCT PUNCT
PROPN AUX VERB ADP DET NOUN PUNCT NOUN PUNCT ADP PUNCT PUNCT
Target: 
 DET NOUN VERB ADP DET NOUN PUNCT [PAD] [PAD] [PAD] [PAD] [PAD]
PRON VERB VERB NOUN PUNCT [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
PROPN AUX VERB ADP DET NOUN PUNCT [PAD] [PAD] [PAD] [PAD] [PAD]
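
The predictions at padded positions in the output above are arbitrary, because those positions are excluded from both the loss and the accuracy. For display purposes, they could be dropped with a small helper such as the hypothetical `strip_padding` below, which is a sketch and not part of the original notebook. It assumes that predictions and targets are whitespace-separated strings of equal length, as printed above.

```python
def strip_padding(predicted, target, pad_label="[PAD]"):
    """Keep only the predicted tags whose target tag is not padding.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    pairs = zip(predicted.split(), target.split())
    return " ".join(p for p, t in pairs if t != pad_label)


# Second sentence of the batch shown above.
prediction = "PRON VERB VERB NOUN PUNCT NOUN PUNCT NOUN PUNCT NOUN PUNCT PUNCT"
target = "PRON VERB VERB NOUN PUNCT [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"
cleaned = strip_padding(prediction, target)
print(cleaned)
```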


## Exercises

If your learning path is supervised by a teacher, they may have provided you with additional instructions on how to proceed with the exercises.

✎ Compare the original pre-norm implementation of the transformer with the post-norm implementation under this task.

✎ Add a pre-training step to the model that implements the masked language model objective and is trained on a separate corpus. Note that the `[MASK]` token is already included in the vocabulary.