[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/train_bert_for_ner.ipynb)

# Train BERT for Named Entity Recognition using the Transformers library

This notebook shows how to train a Named Entity Recognition (NER) model by fine-tuning a pre-trained BERT model using the Hugging Face [Transformers](https://huggingface.co/transformers/) library.

This notebook largely follows the notebook on [training BERT for text classification](https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/train_bert_for_text_classification.ipynb) that was covered in the 29.3.2021 session. Knowledge of that material is assumed here. The NER-specific code is based in part on the [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) script.

**NOTE**: it's recommended to run this using a runtime with a GPU. Select "Runtime" ➔ "Change runtime type" from the top menu in Colab and set "Hardware accelerator" to "GPU" when starting.

# Install libraries

First, we'll use [`pip`](https://pypi.org/project/pip/) to install the Python libraries that are used in this notebook: [`transformers`](https://huggingface.co/transformers/), [`datasets`](https://huggingface.co/docs/datasets/), and [`seqeval`](https://pypi.org/project/seqeval/). The first two should already be familiar to you; the last is used for NER evaluation metrics.

In [1]:
!pip --quiet install transformers
!pip --quiet install datasets
!pip --quiet install seqeval



In [None]:
!pip --quiet install torch

# Import libraries

Next, let's import the classes and functions we'll be using (click links for documentation):

* [AutoTokenizer](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer): tokenizer class with support for the tokenization conventions of various pre-trained models
* [AutoModelForTokenClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification): an [AutoModel](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel) class that can load various pre-trained models and supports fine-tuning for classifying each token in a sequence
* [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments): simple class for holding hyperparameters and other settings for model training
* [Trainer](https://huggingface.co/transformers/main_classes/trainer.html): class supporting various forms of training transformer models
* [DataCollatorForTokenClassification](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py): class for padding token classification examples to the same length for training
* [load_dataset](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset): function for loading datasets e.g. from the [datasets](https://huggingface.co/datasets) collection
* [load_metric](https://huggingface.co/docs/datasets/loading_metrics.html): function for loading metrics from the [Hugging Face collection](https://huggingface.co/metrics)

In [2]:
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from transformers import TrainingArguments
from transformers import Trainer
from transformers import DataCollatorForTokenClassification
from datasets import load_dataset, load_metric

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


# Settings

Let's then set some global variables such as the name of the pre-trained model and the hyperparameters we'll use for fine-tuning it:

* `MODEL_NAME`: the name of a pretrained model included in the [model repository](https://huggingface.co/models)
* `DATASET`: the path and name of a dataset included in the [dataset repository](https://huggingface.co/datasets)
* `LEARNING_RATE`, `BATCH_SIZE`, and `TRAIN_EPOCHS`: hyperparameters to use for fine-tuning the model. (Try different values here!)

Here, we'll use the [CoNLL'03 English](https://huggingface.co/datasets/conll2003) dataset, a standard benchmark dataset for NER evaluation in English.

In [3]:
MODEL_NAME = 'bert-base-cased'
DATASET = 'conll2003'

# Hyperparameters
LEARNING_RATE=2e-5
BATCH_SIZE=32
TRAIN_EPOCHS=2

We'll also define a placeholder label ID for special tokens (e.g. `[CLS]`) and tokens that represent continuation wordpieces. For example, if the word `Partition` is tokenized into the parts `Part` and `##ition`, the subword token `##ition` would get this ID.

(Here the "magic" value -100 is significant: it matches the default PyTorch `ignore_index`, a label value that is ignored when computing the loss.)

In [4]:
DUMMY_LABEL_ID = -100    # Don't change this!
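
To see why this value matters, here is a minimal pure-Python sketch of a mean cross-entropy loss that skips positions labelled `-100`. It only illustrates the idea behind `ignore_index` and is not PyTorch's implementation; the helper name `masked_cross_entropy` and the toy numbers are made up for this illustration.

```python
import math

IGNORE_INDEX = -100    # same value as DUMMY_LABEL_ID above


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-token score lists; labels: list of int label IDs.
    (Hypothetical helper, for illustration only.)
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue    # special token or continuation wordpiece: no loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])    # -log softmax(scores)[label]
    return sum(losses) / len(losses)


# Toy example: 3 tokens, 2 labels; the middle token is ignored
toy_logits = [[2.0, 0.0], [0.0, 5.0], [0.0, 1.0]]
toy_labels = [0, IGNORE_INDEX, 1]
print(masked_cross_entropy(toy_logits, toy_labels))
```

Whatever scores the model assigns to the ignored position, the loss stays the same, so those positions don't influence training.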

# Load dataset

We'll first load the dataset with [`load_dataset`](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset), grab the labels from the dataset, and store the number of distinct labels in the data.

In [5]:
dataset = load_dataset(DATASET)
label_list = dataset["train"].features['ner_tags'].feature.names
num_labels = len(label_list)

Downloading and preparing dataset conll2003/conll2003 (download: 4.63 MiB, generated: 9.78 MiB, post-processed: Unknown size, total: 14.41 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6...


Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6. Subsequent calls will reuse this data.


Before doing anything else, let's just downsample the `train` subset of the data to reduce processing time (mostly training).

In [6]:
dataset['train'] = dataset['train'].filter(lambda example, idx: idx % 5 == 0, with_indices=True)

The loaded dataset is a dictionary-like container for [`Dataset`](https://huggingface.co/docs/datasets/exploring.html) objects for training, development (validation), and test data. We'll here be using the `tokens` and `ner_tags` fields and ignoring `pos_tags` (POS tags) and `chunk_tags` (shallow parsing tags identifying e.g. noun phrases).

In [7]:
print(dataset)
print(f'labels: {label_list}')
print(f'number of distinct labels: {num_labels}')

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 2809
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
number of distinct labels: 9
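
As a sanity check on the `train` size above, the sketch below applies the same index test as the downsampling filter to a plain `range`. It assumes that the full CoNLL'03 training split has 14,041 sentences (a figure that doesn't appear in this notebook's output); `full_train_size` is a name introduced here for illustration.

```python
full_train_size = 14041    # assumed size of the full train split

# Same condition as in the filter: keep every fifth example
kept = [idx for idx in range(full_train_size) if idx % 5 == 0]

print(len(kept))    # number of sentences kept
print(kept[:3])     # first few kept indices
```

Under that assumption the count comes out to 2809, the `num_rows` reported for `train`.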


In the text classification tasks we looked at previously, each example in the data contained a string representing a whole sentence. Here, the data comes pre-split into words (key `tokens` in the `Dataset` object -- somewhat confusingly), and the tags in the data correspond to these words. Let's have a look at an example sentence.

In [8]:
train_words = dataset['train']['tokens']
train_tags = dataset['train']['ner_tags']
for word, tag in zip(train_words[0], train_tags[0]):
    print(f'{word:8s} ➔ {label_list[tag]}')

EU       ➔ B-ORG
rejects  ➔ O
German   ➔ B-MISC
call     ➔ O
to       ➔ O
boycott  ➔ O
British  ➔ B-MISC
lamb     ➔ O
.        ➔ O


The need to keep labels (tags) aligned to words that may not match the model tokenization is a major source of added complexity. We can see this more easily after loading a tokenizer.

# Load tokenizer and tokenize data

We'll load an appropriate tokenizer for the pre-trained model with [`AutoTokenizer.from_pretrained`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer.from_pretrained). We request a ["fast" tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html) as it provides a mapping from tokens to input words, which we need for NER.

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

Because the text is pre-split into words, we'll always invoke the tokenizer with the argument `is_split_into_words=True`. Now, let's tokenize an example sentence in the training data and map the token IDs back to strings and see what happens:

In [10]:
words = dataset['train']['tokens'][10][:10]
tags = [label_list[t] for t in dataset['train']['ner_tags'][10][:10]]

tokenized = tokenizer(words, is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized.input_ids)

from itertools import zip_longest    # for display
for tag, token in zip_longest(tags, tokens):
    print(f'{token:8s} ➔ {tag}')

[CLS]    ➔ B-ORG
Op       ➔ I-ORG
##el     ➔ O
AG       ➔ O
together ➔ B-ORG
with     ➔ I-ORG
General  ➔ O
Motors   ➔ O
came     ➔ O
in       ➔ O
second   ➔ None
place    ➔ None
[SEP]    ➔ None


The tokenizer output no longer matches up with the tag sequence because

1. Special symbols such as `[CLS]` and `[SEP]` have been added by the tokenizer (as required for BERT), and
2. Some words have been split into subword tokens, such as `Opel` into `Op` and `##el`

As mentioned above, we resolve this by giving special symbols and continuation word pieces (e.g. `##el`) a "dummy" label ID that will be ignored in training. We need to additionally map the tags from their word indices in the source data to token indices in the tokenized data.
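
To make the alignment concrete before writing the real function, here's a toy sketch on the first four words of the example above. The `tokens` and `word_ids` lists are written by hand to mimic what a fast tokenizer would return (they are not computed here), and `DUMMY` stands in for `DUMMY_LABEL_ID`. For readability the sketch keeps the tags as strings, whereas the dataset stores integer label IDs.

```python
DUMMY = -100    # same value as DUMMY_LABEL_ID

# Hand-written stand-ins for tokenizer output (assumed, not computed here)
tokens = ['[CLS]', 'Op', '##el', 'AG', 'together', 'with', '[SEP]']
word_ids = [None, 0, 0, 1, 2, 3, None]     # index of the source word per token
word_tags = ['B-ORG', 'I-ORG', 'O', 'O']   # one tag per input word

aligned = []
prev = None
for idx in word_ids:
    if idx is None or idx == prev:
        aligned.append(DUMMY)             # special token or continuation piece
    else:
        aligned.append(word_tags[idx])    # first wordpiece keeps the word's tag
    prev = idx

for token, label in zip(tokens, aligned):
    print(f'{token:8s} -> {label}')
```

After alignment there is exactly one entry per token, and only the first wordpiece of each word carries a real tag.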

Let's define an encoding function that applies the tokenizer to the words (key `tokens`) and aligns the word labels (key `ner_tags`) to the tokens. Note the `is_split_into_words` specifying that the dataset texts are lists of words (as opposed to strings).

After defining this function, we use the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.DatasetDict.map) function of the `DatasetDict` to encode the train, development, and test [`Dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=dataset#datasets.Dataset) objects.

In [11]:
def encode_dataset(data):
    tokenized = tokenizer(data['tokens'], is_split_into_words=True)
    # align data['ner_tags'] to tokens using tokenized.word_ids()
    # and store tags and dummy labels in tokenized['labels'].
    labels = []
    prev_word_idx = None
    for word_idx in tokenized.word_ids():
        if word_idx is None or word_idx == prev_word_idx:
            # Special token (e.g. [SEP]) or part of the previous word
            labels.append(DUMMY_LABEL_ID)
        else:
            # Word start
            labels.append(data['ner_tags'][word_idx])
        prev_word_idx = word_idx
    tokenized['labels'] = labels
    assert len(labels) == len(tokenized.tokens())
    return tokenized


encoded_dataset = dataset.map(encode_dataset)

## Load pre-trained model

Next, we'll load the pre-trained model with support for token classification output. As previously, we need to pass in the number of distinct labels so that the model will be built with the correct output layer size.

In [12]:
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

ImportError: 
AutoModelForTokenClassification requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.


You'll recall that we previously used [AutoModelForSequenceClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelforsequenceclassification) for sentence classification. We're here using a different "auto" model, [AutoModelForTokenClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification). One important difference between the two is the output layer: for sentence classification, we only look at the output corresponding to the special `[CLS]` token and other outputs are ignored, while for token classification we look at the outputs of all input tokens (excluding continuation tokens) and ignore the `[CLS]`, as there is no sentence level label.

<center>Illustration of BERT for sentence classification (left) and BERT for token classification (e.g. NER; right).</center>

<a href="https://raw.githubusercontent.com/TurkuNLP/Text_Mining_Course/master/figs/bert-for-text-classification.png"><img src="https://raw.githubusercontent.com/TurkuNLP/Text_Mining_Course/master/figs/bert-for-text-classification.png" width="500px" align="left"/></a>
<a href="https://raw.githubusercontent.com/TurkuNLP/Text_Mining_Course/master/figs/bert-for-token-classification.png"><img src="https://raw.githubusercontent.com/TurkuNLP/Text_Mining_Course/master/figs/bert-for-token-classification.png" width="500px" align="right"/></a>

<center>(Figures adapted from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT, ELMo, and co.</a> CC BY-NC-SA)</center>

Assuming correctly formatted input, the models can otherwise be used largely identically.
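
The difference in output shape can be sketched with plain nested lists: a sentence classifier produces one score vector per sentence, while a token classifier produces one score vector per token. The numbers below are made up and are not real model output, and the small `argmax` helper is introduced here only for illustration.

```python
def argmax(scores):
    """Index of the largest value in a flat list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Sequence classification: shape (batch=1, labels=3)
sentence_logits = [[0.1, 2.0, -1.0]]

# Token classification: shape (batch=1, tokens=4, labels=3)
token_logits = [[[3.0, 0.2, 0.1],
                 [0.0, 1.5, 0.3],
                 [0.2, 0.1, 2.2],
                 [1.0, 0.9, 0.8]]]

sentence_pred = [argmax(s) for s in sentence_logits]
token_pred = [[argmax(t) for t in sent] for sent in token_logits]

print(sentence_pred)    # one label ID per sentence
print(token_pred)       # one label ID per token
```

The nested comprehension takes the maximum over the last (label) dimension, which is why the evaluation code for token classification uses `argmax(axis=2)`.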

## Training parameters and metrics
We're almost ready to train. We'll next create a [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) object to hold the hyperparameters and other settings that are appropriate for training on Colab:

* `save_strategy`: set to `"no"` so that model checkpoints are not saved, to avoid exhausting Colab storage space
* `evaluation_strategy` and `logging_strategy`: set to `"epoch"` so that evaluation and logging are performed once per epoch
* The hyperparameters `LEARNING_RATE`, `BATCH_SIZE` and `TRAIN_EPOCHS` set above are passed to the training process through this object

In [None]:
train_args = TrainingArguments(
    'output_dir',    # output directory for checkpoints and predictions
    save_strategy='no',
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    num_train_epochs=TRAIN_EPOCHS,
)

Finally, we'll define a function measuring how many named entity mentions were correctly predicted using the standard precision, recall and F-score metrics as implemented in the [seqeval](https://pypi.org/project/seqeval/) library, a Python version of the standard `conlleval` script.

Note that we're here explicitly ignoring the labels assigned to any token where the true label ID is `DUMMY_LABEL_ID`, i.e. we're ignoring special tokens and continuation word pieces to focus only on the tokens corresponding to word starts. We also map the label indices to strings (e.g. `B-ORG`) as required by the library.

In [None]:
seq_eval_metrics = load_metric('seqeval')


def compute_metrics(pred):
    true_ids = pred.label_ids
    pred_ids = pred.predictions.argmax(axis=2)    # note axis
    true_word_ids = [
        [label_list[l] for (p, l) in zip(sent_pred, sent_true) if l != DUMMY_LABEL_ID]
        for sent_pred, sent_true in zip(pred_ids, true_ids)
    ]
    pred_word_ids = [
        [label_list[p] for (p, l) in zip(sent_pred, sent_true) if l != DUMMY_LABEL_ID]
        for sent_pred, sent_true in zip(pred_ids, true_ids)
    ]
    results = seq_eval_metrics.compute(predictions=pred_word_ids, references=true_word_ids)
    return {
        'precision': results['overall_precision'],
        'recall': results['overall_recall'],
        'f1': results['overall_f1'],
        'accuracy': results['overall_accuracy'],
    }
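
To get a feel for what entity-level metrics measure, here's a simplified pure-Python sketch. It is not seqeval's implementation: the helpers `iob_spans` and `entity_prf` are hypothetical, and the sketch assumes that a predicted mention counts as correct only if its type and both boundaries match exactly.

```python
def iob_spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB tag list."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):    # sentinel closes last span
        # A span begins at B-X, or at an I-X that doesn't continue the open span
        begins = tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != etype)
        if start is not None and (tag == 'O' or begins):
            spans.add((etype, start, i))
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]
    return spans


def entity_prf(true_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    true_spans, pred_spans = iob_spans(true_tags), iob_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


true = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'O']
pred = ['B-PER', 'O', 'O', 'B-LOC', 'O', 'O']    # PER mention cut short
print(entity_prf(true, pred))
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
print(accuracy)
```

A single wrong tag leaves token accuracy high but makes the whole `PER` mention count as an error, which is one reason accuracy tends to look much better than F1 in NER.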



# Training (fine-tuning)

For fine-tuning the pre-trained model, we'll create a [`Trainer`](https://huggingface.co/transformers/main_classes/trainer.html) object, providing it with the pre-trained model, settings, training and development (validation) data, and the evaluation metric created above.

This should all be familiar from the sentence classification notebook except for one detail: we additionally provide a [`DataCollatorForTokenClassification`](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py) to the trainer. This class takes care of padding token classification examples in batches to the same length, as required by PyTorch: the tokenizer `[PAD]` symbol is used to pad the input, and the given `label_pad_token_id` to pad the output (labels).

In [None]:
data_collator = DataCollatorForTokenClassification(
    tokenizer,
    label_pad_token_id=DUMMY_LABEL_ID
)

trainer = Trainer(
    model,
    train_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
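
What the collator does to a batch can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `pad_batch` is a hypothetical helper, the token IDs are made up, and `PAD_TOKEN_ID = 0` is assumed to be the ID of `[PAD]`. The real collator works on the tokenizer's actual fields and returns tensors.

```python
PAD_TOKEN_ID = 0       # assumed ID of [PAD] in the vocabulary
LABEL_PAD_ID = -100    # same value as DUMMY_LABEL_ID


def pad_batch(examples):
    """Pad 'input_ids' and 'labels' of each example to the longest in the batch."""
    max_len = max(len(e['input_ids']) for e in examples)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for e in examples:
        n_pad = max_len - len(e['input_ids'])
        batch['input_ids'].append(e['input_ids'] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch['attention_mask'].append([1] * len(e['input_ids']) + [0] * n_pad)
        batch['labels'].append(e['labels'] + [LABEL_PAD_ID] * n_pad)
    return batch


# Two toy examples of different lengths (made-up IDs)
examples = [
    {'input_ids': [101, 7, 8, 9, 102], 'labels': [-100, 3, -100, 0, -100]},
    {'input_ids': [101, 5, 102], 'labels': [-100, 1, -100]},
]
batch = pad_batch(examples)
for key, rows in batch.items():
    print(key, rows)
```

Because the labels are padded with the dummy label ID, padding positions are ignored in the loss just like special tokens and continuation wordpieces.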

Training is then performed simply by calling the `train` method of the `Trainer` object.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Runtime,Samples Per Second
1,0.5076,0.199873,0.667367,0.694884,0.680848,0.949866,26.4375,122.931
2,0.1349,0.134727,0.789152,0.825143,0.806746,0.968965,26.5708,122.315


TrainOutput(global_step=176, training_loss=0.32122746922753076, metrics={'train_runtime': 155.5446, 'train_samples_per_second': 1.132, 'total_flos': 198305712559224.0, 'epoch': 2.0, 'init_mem_cpu_alloc_delta': 468816, 'init_mem_gpu_alloc_delta': 431435264, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 635993, 'train_mem_gpu_alloc_delta': 1314402816, 'train_mem_cpu_peaked_delta': 33208540, 'train_mem_gpu_peaked_delta': 2110889472})
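
The `global_step` value in this output can be reproduced with a bit of arithmetic. The sketch assumes training on a single device without gradient accumulation and with the last, smaller batch of each epoch kept; the variable names are introduced here for illustration.

```python
import math

num_train_examples = 2809    # rows in the downsampled train split
batch_size = 32              # BATCH_SIZE
epochs = 2                   # TRAIN_EPOCHS

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This gives 88 steps per epoch and 176 in total, so doubling the batch size would roughly halve the number of updates the model receives.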

## Evaluation and predictions for user input

We can use `trainer.evaluate()` to evaluate the trained model on the `eval_dataset` given to the trainer:

In [None]:
results = trainer.evaluate()
print(f'Accuracy : {results["eval_accuracy"]:.1%}\n'
      f'Precision: {results["eval_precision"]:.1%}\n'
      f'Recall   : {results["eval_recall"]:.1%}\n'
      f'F1-score : {results["eval_f1"]:.1%}')

Accuracy : 96.9%
Precision: 78.9%
Recall   : 82.5%
F1-score : 80.7%


(This isn't a great result, but it should improve if we don't downsample the data to make training faster.)

The fine-tuned model is now available as `trainer.model`. Here we define functions to predict tags for a user-defined string.

The main complexity here arises from the need to map back from token labels to word labels, inverting the mapping performed in `encode_dataset`. The process is basically the same as in `compute_metrics` above.

In [None]:
model = trainer.model
model.eval()    # switch to evaluation mode
model.to('cpu')    # switch to CPU


def word_start_tokens(tokenized):
    """Return list of bool identifying which tokens start words."""
    prev_word_idx = None
    is_word_start = []
    for word_idx in tokenized.word_ids():
        if word_idx is None or word_idx == prev_word_idx:
            is_word_start.append(False)
        else:
            is_word_start.append(True)
        prev_word_idx = word_idx
    return is_word_start


def predict_ner(words):
    tokenized = tokenizer(words, is_split_into_words=True, return_tensors='pt')
    pred = model(**tokenized)
    pred_idx = pred.logits.detach().numpy().argmax(axis=2)
    token_labels = [label_list[i] for s in pred_idx for i in s]
    word_labels = []
    for label, is_word_start in zip(token_labels, word_start_tokens(tokenized)):
        if is_word_start:
            word_labels.append(label)
    return word_labels
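
The mapping from token labels back to word labels can be tried out on toy data without running the model. In this sketch the `word_ids` list and the token-level predictions are hand-written and purely illustrative.

```python
# Hand-written stand-ins (assumed): [CLS] Op ##el AG together with [SEP]
word_ids = [None, 0, 0, 1, 2, 3, None]
token_labels = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O']    # made-up predictions

word_labels = []
prev = None
for idx, label in zip(word_ids, token_labels):
    if idx is not None and idx != prev:
        word_labels.append(label)    # keep the first wordpiece's prediction
    prev = idx

print(word_labels)
```

Predictions for special tokens and continuation wordpieces are simply dropped, leaving one label per input word.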

Let's try that out on a couple of example sentences:

In [None]:
example_sentences = [
    'Turku is the oldest city in Finland',
    'Google was founded in September 1998 by Larry Page and Sergey Brin',
    'American Dennis Mitchell outclassed Olympic champion Donovan Bailey',
]

for e in example_sentences:
    words = e.split()    # Note: assumes white-space tokenization is OK
    ner_tags = predict_ner(words)
    for word, tag in zip(words, ner_tags):
        print(f'{word:10s} ➔ {tag}')
    print()

Turku      ➔ B-LOC
is         ➔ O
the        ➔ O
oldest     ➔ O
city       ➔ O
in         ➔ O
Finland    ➔ B-LOC

Google     ➔ B-ORG
was        ➔ O
founded    ➔ O
in         ➔ O
September  ➔ O
1998       ➔ O
by         ➔ O
Larry      ➔ B-PER
Page       ➔ I-PER
and        ➔ O
Sergey     ➔ B-PER
Brin       ➔ I-PER

American   ➔ B-MISC
Dennis     ➔ B-PER
Mitchell   ➔ I-PER
outclassed ➔ O
Olympic    ➔ B-MISC
champion   ➔ O
Donovan    ➔ B-PER
Bailey     ➔ I-PER



# Visualization

The IOB notation can be a bit tricky to interpret. To get a better intuitive understanding of tagging results, let's implement a visualization using the [`displacy`](https://explosion.ai/demos/displacy-ent) library.

The code here mostly maps the IOB tags to character offsets and formats the data for displacy. Unless you're interested in modifying this or otherwise working with this library, there's no need to go through this in detail.

In [None]:
from spacy import displacy


# Mapping of CoNLL'03 types for displacy
type_map = {
    'PER': 'PERSON',
}

def render_with_displacy(words, tags):
    tagged, offset, start, label = [], 0, None, None
    for word, tag in zip(words, tags):
        if tag[0] in 'OB' and start is not None:    # current ends
            tagged.append({
                'start': start,
                'end': offset - 1,    # exclude the space before this word
                'label': type_map.get(label, label)
            })
            start, label = None, None
        if tag[0] == 'B':
            start, label = offset, tag[2:]
        elif tag[0] == 'I':
            if start is None:    # I without B, but nevermind
                start, label = offset, tag[2:]
        else:
            assert tag == 'O', 'unexpected tag {}'.format(tag)
        offset += len(word) + 1    # +1 for space
    if start is not None:    # span open at sentence end (start may be 0)
        tagged.append({
            'start': start,
            'end': offset - 1,    # exclude the trailing +1 added above
            'label': type_map.get(label, label)
        })
    doc = {
        'text': ' '.join(words),
        'ents': tagged
    }
    displacy.render(doc, style='ent', jupyter=True, manual=True)


for e in example_sentences:
    words = e.split()    # Note: assumes white-space tokenization is OK
    ner_tags = predict_ner(words)
    render_with_displacy(words, ner_tags)

Try to find English sentences that the model tags wrong, and then test if you can improve the model to fix these errors by e.g. using more of the training data or adjusting the model hyperparameters!