<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# Transformers for Named Entity Recognition 

**Token classification** is a task whose goal is to assign a label to each token in a sentence. Some examples of this kind of task are:
- **Named entity recognition (NER)**: The goal of this task is to find named entities (such as persons, locations, or organizations) in a sentence. The set of entity types depends on the domain. So, for example, for a biomedical domain, the entity types could be: drug, gene, disease, etc.

- **Part-of-speech tagging (POS)**: The goal is to label each token in a sentence with its part of speech (such as noun, verb, adjective, etc.).

- **Chunking (or shallow parsing)**: The goal is to segment a sentence into a sequence of syntactic chunks.

As named entities and chunks can span several tokens, we usually use the IOB format. Each token is labeled with an IOB tag plus the entity or chunk type (for example, PERSON, LOCATION or ORGANIZATION for NER). The IOB tags mean the following:
- B- marks a token at the beginning of an entity/chunk,
- I- marks a token inside an entity/chunk,
- O marks a token that does not belong to any entity/chunk.
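
As a minimal sketch of this encoding, the helper below converts entity spans into IOB tags. The function name `spans_to_iob`, the span format, and the example sentence are assumptions made for this illustration; they are not part of any library or of the dataset.

```python
def spans_to_iob(num_tokens, spans):
    """Convert entity spans into IOB tags.

    spans: list of (start, end, entity_type) tuples, with `end` exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining tokens of the entity
    return tags


# Made-up sentence with a two-token person and a two-token location.
sentence = ["Peter", "Blackburn", "lives", "in", "New", "York", "."]
entity_spans = [(0, 2, "PER"), (4, 6, "LOC")]
for token, tag in zip(sentence, spans_to_iob(len(sentence), entity_spans)):
    print(f"{token:10} {tag}")
```

Every token outside the two spans keeps the tag O, and only the first token of each entity receives the B- prefix.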



In this notebook, we will fine-tune a pretrained model (BERT) on a NER task. In particular, we will use the CoNLL-2003 dataset, a collection of news stories from Reuters annotated with four entity types: Person, Location, Organization, and Miscellaneous.



First, we need to install some libraries provided by HuggingFace:

In [2]:
!pip install -q datasets transformers

Let's load this dataset from HuggingFace:


In [3]:
from datasets import load_dataset
dataset_dict = load_dataset("conll2003")
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

You can see that the dataset is distributed with three splits: train, validation and test. Moreover, each instance contains the following features:
- id: the identifier of the text.
- tokens: the list of the tokens in the text.
- pos_tags: the list of the POS tags for the tokens in the text.
- chunk_tags: the list of the chunk tags for the tokens in the text.
- ner_tags: the list of the NER tags for the tokens in the text.

We will use the dataset to train a NER system, but it could also be used to train a chunker or a POS tagger.

Let's show some instances, for example, the first text of the training split. Instead of providing the text of sentences, the dataset provides tokenized sentences:

In [4]:
dataset_dict["train"][0]["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

Each token is classified with a NER tag:

In [5]:
dataset_dict["train"][0]["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

We can see that the NER tags are already encoded as integers.
What does the tag 3 mean? Which NER tag does it correspond to?
To obtain the set of NER labels, you can use the "features" attribute of the dataset:


In [6]:
ner_tags = dataset_dict["train"].features["ner_tags"]
ner_tags


Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

Let's extract the label names and save them in a list:

In [7]:
ner_tags = ner_tags.feature.names
ner_tags

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

The NER tags are based on the IOB encoding, where:

- O is used for tokens that don’t correspond to any entity.
- B-PER/I-PER mark a token at the beginning of/inside a person entity.
- B-ORG/I-ORG mark a token at the beginning of/inside an organization entity.
- B-LOC/I-LOC mark a token at the beginning of/inside a location entity.
- B-MISC/I-MISC mark a token at the beginning of/inside a miscellaneous entity.



Let's decode the NER tags of the first sentence:

In [8]:
tokens = dataset_dict["train"][0]["tokens"]
tags = dataset_dict["train"][0]["ner_tags"]

text = ""
annotated_text = ""

for token, tag in zip(tokens, tags):
    text += token + " "
    text_label = ner_tags[tag]
    annotated_text += str(text_label) + "   "

print(text)
print(annotated_text)

EU rejects German call to boycott British lamb . 
B-ORG   O   B-MISC   O   O   O   B-MISC   O   O   
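
Going one step further, the decoded tags can be grouped back into entities. The following is a minimal sketch, not part of the original pipeline: the function name `iob_to_entities` is hypothetical, and an I- tag that does not continue an open entity is simply dropped here, which is a simplifying assumption.

```python
def iob_to_entities(tokens, labels):
    """Group IOB-labeled tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # a B- tag always opens a new entity, closing any open one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


example_tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
example_labels = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(iob_to_entities(example_tokens, example_labels))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```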


## Encoding the data

As usual, we need to preprocess the texts and transform them into the encodings (input_ids, token_type_ids, and attention_mask) that the model needs as input.
To do this, we can use a tokenizer, for example, the one provided with BERT:

### Tokenization

In [9]:
from transformers import AutoTokenizer

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [10]:
tokenizer.is_fast

True

The texts in the dataset are already tokenized. However, we still need to encode these tokens. 
We have already worked with BERT's tokenizer and know how to process full sentences. The tokenizer can also process input that is already split into words through the parameter **is_split_into_words=True**.

For example, in the next cell, we are encoding the tokens of the first text. The method tokens() allows us to get the list of tokens. We can see that two special tokens, [CLS] and [SEP] have been added. Moreover, those words that are not in the vocabulary have been split into wordpieces. For example, 'lamb' was split into 'la' and '##mb'. 

In [12]:
inputs = tokenizer(dataset_dict["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

The method word_ids() returns, for each token, the index of the word it comes from. The special tokens are mapped to None. The token 'EU' has the word id 0, while the tokens 'la' and '##mb' both correspond to the word at position 7.

In [13]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
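
To make the relation between wordpieces and word ids concrete, here is a toy sketch of greedy longest-match-first WordPiece splitting. It is only an illustration: the four-entry vocabulary and the function names are assumptions, and the real BERT tokenizer also normalizes the text and splits punctuation before this step.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                break
            end -= 1
        if end == start:                       # no piece matched: the word is unknown
            return [unk_token]
        pieces.append(candidate)
        start = end
    return pieces


def tokenize_words(words, vocab):
    """Return the wordpieces and, for each piece, the index of its source word."""
    pieces, piece_word_ids = [], []
    for index, word in enumerate(words):
        for piece in wordpiece(word, vocab):
            pieces.append(piece)
            piece_word_ids.append(index)
    return pieces, piece_word_ids


toy_vocab = {"British", "la", "##mb", "."}     # hypothetical tiny vocabulary
toy_tokens, toy_word_ids = tokenize_words(["British", "lamb", "."], toy_vocab)
print(toy_tokens)     # ['British', 'la', '##mb', '.']
print(toy_word_ids)   # [0, 1, 1, 2]
```

Both pieces of 'lamb' carry the same word id, which is exactly the information needed later to copy the word's label to each of its pieces.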

Let us see how many word ids it contains:

In [14]:
print('Number of word ids', len(inputs.word_ids()))
print('Number of word ids without special tokens', len(inputs.word_ids()) - 2)

Number of word ids 12
Number of word ids without special tokens 10


Let's get the NER tags of this sentence:


In [15]:
tags = dataset_dict["train"][0]["ner_tags"]
print(tags)
print("Number of tokens", len(tags))


[3, 0, 7, 0, 0, 0, 7, 0, 0]
Number of tokens 9


So the tokenizer produces more word ids (10) than there are NER tags for this sentence in the dataset (9). This is due to the WordPiece tokenization used by BERT, which splits words that are not in the vocabulary into wordpieces.
Therefore, we should define a function that maps each token, through its word id, to the corresponding NER tag. In the previous example, both 'la' and '##mb' should be annotated with O.

Let's show a sentence containing named entities that have two or more tokens:

In [16]:
inputs = tokenizer(dataset_dict["train"][4]["tokens"], is_split_into_words=True)
print("Tokens: " , inputs.tokens())
print("word_ids: " ,inputs.word_ids())

print('Number of word ids', len(inputs.word_ids()))
print('Number of word ids without special tokens', len(inputs.word_ids()) - 2)


Tokens:  ['[CLS]', 'Germany', "'", 's', 'representative', 'to', 'the', 'European', 'Union', "'", 's', 'veterinary', 'committee', 'Werner', 'Z', '##wing', '##mann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']
word_ids:  [None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]
Number of word ids 39
Number of word ids without special tokens 37


In addition to the special tokens, we can see that the tokenizer has split several words into several tokens. For example:
- The first "'s" has been split into "'" and 's'. The two tokens share the same word id: 1.
- The second "'s" has been split in the same way. Its two tokens share the word id 7.
- 'sheepmeat' has been split into 'sheep', '##me' and '##at', which share the word id 18.
- 'Zwingmann' has been split into 'Z', '##wing' and '##mann', which share the word id 11. The full word is labeled as 'I-PER' in the dataset:


In [17]:
tags = dataset_dict["train"][4]["ner_tags"]
print(tags)
for t in tags:
    print(ner_tags[t], end = ' ')

print("Number of tokens with tags", len(tags))

[5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0]
B-LOC O O O O B-ORG I-ORG O O O B-PER I-PER O O O O O O O O O O O B-LOC O O O O O O O Number of tokens with tags 31


We can see that we have more tokens (37 without the special tokens) than tags (31) for this sentence. This is because several words have been split into wordpieces. For example, "Werner Zwingmann" has been tokenized into four tokens: 'Werner', 'Z', '##wing', '##mann', but in the dataset there are only two labels: B-PER (for 'Werner') and I-PER (for 'Zwingmann').

Therefore, we have to implement a function to align the tokens with their corresponding labels:

In [18]:
def align_labels_with_tokens(word_ids, tags):
    new_labels = []
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)     # for special tokens, we will use the tag -100.
        else:
            new_labels.append(tags[word_id])

    return new_labels
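
The value -100 is not arbitrary: it is the default `ignore_index` of PyTorch's `torch.nn.CrossEntropyLoss`, so positions labeled -100 do not contribute to the loss. The sketch below reproduces that behavior in plain Python for illustration only. The function name and the toy numbers are assumptions, and it assumes that at least one position has a real label.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue                                   # special tokens are skipped
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])        # -log softmax(scores)[label]
    return sum(losses) / len(losses)


toy_logits = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
toy_labels = [-100, 1, 2]                              # first position is ignored
print(masked_cross_entropy(toy_logits, toy_labels))

# Changing the scores of the ignored position leaves the loss unchanged.
changed_logits = [[0.0, 9.0, 0.0]] + toy_logits[1:]
print(masked_cross_entropy(changed_logits, toy_labels))
```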

In [19]:
tags = dataset_dict["train"][0]["ner_tags"]
inputs = tokenizer(dataset_dict["train"][0]["tokens"], is_split_into_words=True)
word_ids = inputs.word_ids()

aligned_labels = align_labels_with_tokens(word_ids, tags)
print(inputs.tokens())
print(aligned_labels)
for t in aligned_labels:
    if t != -100:
        print(ner_tags[t], end = ' ')
    else:
        print(str(-100), end = ' ')

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
-100 B-ORG O B-MISC O O O B-MISC O O O -100 

In [20]:
tags = dataset_dict["train"][4]["ner_tags"]
inputs = tokenizer(dataset_dict["train"][4]["tokens"], is_split_into_words=True)
word_ids = inputs.word_ids()

aligned_labels = align_labels_with_tokens(word_ids, tags)
print(inputs.tokens())
print(aligned_labels)
for t in aligned_labels:
    if t != -100:
        print(ner_tags[t], end = ' ')
    else:
        print(str(-100), end = ' ')


['[CLS]', 'Germany', "'", 's', 'representative', 'to', 'the', 'European', 'Union', "'", 's', 'veterinary', 'committee', 'Werner', 'Z', '##wing', '##mann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']
[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]
-100 B-LOC O O O O O B-ORG I-ORG O O O O B-PER I-PER I-PER I-PER O O O O O O O O O O O O O B-LOC O O O O O O O -100 

However, the function does not work well when a word labeled with a B- tag has been split into several wordpieces. In this case, only the first wordpiece should be labeled with B-, while the following ones should be labeled with I-:

In [21]:
tags = dataset_dict["train"][2]["ner_tags"]
inputs = tokenizer(dataset_dict["train"][2]["tokens"], is_split_into_words=True)
word_ids = inputs.word_ids()

aligned_labels = align_labels_with_tokens(word_ids, tags)
print(inputs.tokens())
print(aligned_labels)
for t in aligned_labels:
    if t != -100:
        print(ner_tags[t], end = ' ')
    else:
        print(str(-100), end = ' ')


['[CLS]', 'BR', '##US', '##SE', '##LS', '1996', '-', '08', '-', '22', '[SEP]']
[-100, 5, 5, 5, 5, 0, 0, 0, 0, 0, -100]
-100 B-LOC B-LOC B-LOC B-LOC O O O O O -100 

Let's redefine the previous function so that it works correctly for these examples:

In [22]:
def align_labels_with_tokens(word_ids, tags):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)     # for special tokens, we will use the tag -100.
        elif word_id != current_word:
            # first wordpiece of a new word: keep the original tag
            current_word = word_id
            new_labels.append(tags[word_id])
        else:
            # following wordpieces of the same word
            label = tags[current_word]
            # ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
            #   0      1        2        3        4        5        6        7         8
            if label % 2 != 0:       # odd index: the word is tagged B-X, so the following wordpieces get I-X
                new_labels.append(label + 1)
            else:
                new_labels.append(label)

    return new_labels
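
Note that the `label % 2` test relies on the CoNLL-2003 label order, where every B-X index is odd and is immediately followed by the matching I-X. For a label set with a different order, one possible alternative is to derive the mapping from the label names. This is a sketch under that assumption; `begin_to_inside` and `continuation_label` are hypothetical names, not part of the notebook.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Map the index of every B-X label to the index of the matching I-X label.
begin_to_inside = {
    index: label_names.index("I-" + name[2:])
    for index, name in enumerate(label_names)
    if name.startswith("B-")
}


def continuation_label(label):
    """Label for the second and later wordpieces of a word."""
    return begin_to_inside.get(label, label)   # O and I-X labels stay unchanged


print(begin_to_inside)   # {1: 2, 3: 4, 5: 6, 7: 8}
```

Inside the alignment function, `continuation_label(label)` could then replace the parity check.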

Please check that the function works for several sentences:

In [23]:
for i in range(5):
    print("\nSentence: ", str(i))
    tags = dataset_dict["train"][i]["ner_tags"]
    inputs = tokenizer(dataset_dict["train"][i]["tokens"], is_split_into_words=True)
    word_ids = inputs.word_ids()

    aligned_labels = align_labels_with_tokens(word_ids, tags)
    print(inputs.tokens())
    print(aligned_labels)
    for t in aligned_labels:
        if t != -100:
            print(ner_tags[t], end = ' ')
        else:
            print(str(-100), end = ' ')

    print()


    


Sentence:  0
['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
-100 B-ORG O B-MISC O O O B-MISC O O O -100 

Sentence:  1
['[CLS]', 'Peter', 'Blackburn', '[SEP]']
[-100, 1, 2, -100]
-100 B-PER I-PER -100 

Sentence:  2
['[CLS]', 'BR', '##US', '##SE', '##LS', '1996', '-', '08', '-', '22', '[SEP]']
[-100, 5, 6, 6, 6, 0, 0, 0, 0, 0, -100]
-100 B-LOC I-LOC I-LOC I-LOC O O O O O -100 

Sentence:  3
['[CLS]', 'The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 's', '##hun', 'British', 'la', '##mb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.', '[SEP]']
[-100, 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
-100 O B-ORG I-ORG O O O O O O B-MISC O O O O O O B-MISC O O O O O O O O O O O O O O O -100

The following function takes a batch of examples as input. It applies the tokenizer to the tokens of each instance and aligns the labels with the resulting tokens by using the previous function.
Finally, it adds a new feature containing the aligned labels.

In [24]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        #get the ids for the instance i
        word_ids = tokenized_inputs.word_ids(i)
        #align the words with their corresponding labels
        new_labels.append(align_labels_with_tokens(word_ids, labels))

    # we add a new feature to the dataset with the aligned labels for each instance
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

train_encodings = tokenize_and_align_labels(dataset_dict['train'])
# train_encodings

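We can apply this function directly to all the splits of the dataset with the map() method. With batched=True, the function receives several instances at a time. To avoid errors, we also remove the original columns of the dataset, which we no longer need: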

In [25]:
tokenized_datasets = dataset_dict.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset_dict["train"].column_names,
)


Let us see some records:

In [26]:
tokenized_datasets["train"][0]

{'input_ids': [101,
  7270,
  22961,
  1528,
  1840,
  1106,
  21423,
  1418,
  2495,
  12913,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]}

In [27]:
tokenized_datasets["test"][1]

{'input_ids': [101, 11896, 3309, 1306, 2001, 1181, 2293, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 1, 2, 2, 2, 2, 2, -100]}

In [28]:
dataset_dict["test"][1]["tokens"]

['Nadim', 'Ladki']

We can see that the sentences have different numbers of tokens and labels, so we need to pad these features to a common length within each batch:

In [29]:
for i in range(5):
    print(tokenized_datasets["train"][i]["input_ids"])
    print(tokenized_datasets["train"][i]["labels"])

[101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[101, 1943, 14428, 102]
[-100, 1, 2, -100]
[101, 26660, 13329, 12649, 15928, 1820, 118, 4775, 118, 1659, 102]
[-100, 5, 6, 6, 6, 0, 0, 0, 0, 0, -100]
[101, 1109, 1735, 2827, 1163, 1113, 9170, 1122, 19786, 1114, 1528, 5566, 1106, 11060, 1106, 188, 17315, 1418, 2495, 12913, 1235, 6479, 4959, 2480, 6340, 13991, 3653, 1169, 1129, 12086, 1106, 8892, 119, 102]
[-100, 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[101, 1860, 112, 188, 4702, 1106, 1103, 1735, 1913, 112, 188, 27431, 3914, 14651, 163, 7635, 4119, 1163, 1113, 9031, 11060, 1431, 4417, 8892, 3263, 2980, 1121, 2182, 1168, 1190, 2855, 1235, 1103, 3812, 5566, 1108, 27830, 119, 102]
[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]


### Data collator
We need to group the data into batches with a data collator. In particular, we will use **DataCollatorForTokenClassification**, which pads all the features: input_ids, token_type_ids, attention_mask and labels.


We cannot just use "DataCollatorWithPadding" because it only pads the inputs, not the labels:

In [30]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# let us collate the first five sentences into a batch
# we can see that all label sequences have the same length
batch = data_collator([tokenized_datasets["train"][i] for i in range(5)])
batch["labels"]


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100],
        [-100,    5,    6,    6,    6,    0,    0,    0,    0,    0, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100],
        [-100,    0,    3,    4,    0,    0,    0,    0,    0,    0,    7,    0,
            0,    0,    0,    0,    0,    7,    0,    0,    0,    0,    0,    0,
            0,    0,    0

As we can see, every label sequence has been padded with -100 to the length of the longest sequence in the batch (39 tokens).
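
What happens to the labels can be sketched in plain Python as follows. This is only an illustration with a hypothetical helper name: the actual collator also pads the other features and returns tensors, and the sketch assumes padding on the right.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_length = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_length - len(labels)) for labels in label_lists]


toy_batch = [[-100, 3, 0, -100], [-100, 1, 2, 2, 0, -100]]
for row in pad_labels(toy_batch):
    print(row)
# [-100, 3, 0, -100, -100, -100]
# [-100, 1, 2, 2, 0, -100]
```

Because the padding value is -100, the padded positions are ignored by the loss in the same way as the special tokens.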

## Model

### Metrics

We need to define the metrics that the trainer should compute at every epoch. So we need to define a compute_metrics() function that takes the predictions and their corresponding gold labels and returns a dictionary with the metric names and their scores. For token classification, we can use the seqeval library, which provides precision, recall and F1 for each entity type:



First we need to install the seqeval library and also the Evaluate library from HuggingFace:

In [31]:
!pip install seqeval Evaluate



In [32]:
import evaluate

metric = evaluate.load("seqeval")

import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[ner_tags[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [ner_tags[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }


### Defining the model and its arguments

As our task is a token classification problem, we must use the **AutoModelForTokenClassification** class.

When we define the model, we need to provide information about the number of labels. Using the num_labels argument would be enough, but we can also define the label correspondences instead. So we will create two dictionaries that allow us to easily translate the NER tags:

In [33]:
# we create a dictionary where the key is the index of each tag, and the value is the name of the tag
id2label = {i: label for i, label in enumerate(ner_tags)}
print(id2label)
# we create a dictionary where the key is the name of the tag, and the value is the corresponding index
label2id = {v: k for k, v in id2label.items()}
print(label2id)


{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


Now we define the model

In [34]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    # instead of using num_labels, for NER it is better to provide the correspondence between labels and their indexes
    id2label=id2label,
    label2id=label2id,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

Let's check that the number of labels is 9:

In [35]:
model.config.num_labels


9

Before training the model, we need to define training arguments:



In [36]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='./outputs/',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=1,  # 1
    weight_decay=0.01,
)

### Training (by using Trainer)
Please make sure that you are using a GPU before running the next cell.

In [37]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0852 | 0.068595 | 0.901385 | 0.930663 | 0.91579 | 0.981589 |


TrainOutput(global_step=1756, training_loss=0.13683612688798838, metrics={'train_runtime': 188.4957, 'train_samples_per_second': 74.49, 'train_steps_per_second': 9.316, 'total_flos': 307643323612176.0, 'train_loss': 0.13683612688798838, 'epoch': 1.0})
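
The reported `global_step=1756` is consistent with one epoch over the 14,041 training sentences. The short check below assumes the default `per_device_train_batch_size` of 8 (the training arguments above do not set it), a single device, and no gradient accumulation.

```python
import math

train_examples = 14041        # size of the training split shown earlier
assumed_batch_size = 8        # assumption: default per-device batch size, one device
epochs = 1

steps_per_epoch = math.ceil(train_examples / assumed_batch_size)
print(steps_per_epoch * epochs)   # 1756
```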

If you want to learn how to train your model without using Trainer, please visit [this link](https://huggingface.co/course/chapter7/2#a-custom-training-loop).

### Evaluation (on validation dataset)
We can now report the final results of the trained model on the whole validation dataset.

In [38]:
# evaluate the current model after training
trainer.evaluate()

{'eval_loss': 0.06859466433525085,
 'eval_precision': 0.9013854930725347,
 'eval_recall': 0.9306630764052508,
 'eval_f1': 0.915790345284425,
 'eval_accuracy': 0.9815888620709955,
 'eval_runtime': 10.7052,
 'eval_samples_per_second': 303.591,
 'eval_steps_per_second': 38.019,
 'epoch': 1.0}

### Evaluation (on test dataset)

In [39]:
predictions, labels, _ = trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [[id2label[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

true_labels = [[id2label[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

results = metric.compute(predictions=true_predictions, references=true_labels)

results


{'LOC': {'precision': 0.9016393442622951,
  'recall': 0.9232613908872902,
  'f1': 0.9123222748815166,
  'number': 1668},
 'MISC': {'precision': 0.6719706242350061,
  'recall': 0.782051282051282,
  'f1': 0.7228439763001974,
  'number': 702},
 'ORG': {'precision': 0.8384481760277939,
  'recall': 0.8717639975918121,
  'f1': 0.8547815820543093,
  'number': 1661},
 'PER': {'precision': 0.9446135118685332,
  'recall': 0.9598021026592455,
  'f1': 0.9521472392638036,
  'number': 1617},
 'overall_precision': 0.8632739609838846,
 'overall_recall': 0.9010269121813032,
 'overall_f1': 0.8817465130382051,
 'overall_accuracy': 0.9696033784529081}
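
As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall. The small helper below (a hypothetical name, written only for this check) recomputes the overall F1 from the overall precision and recall reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_precision = 0.8632739609838846
overall_recall = 0.9010269121813032
print(round(f1_score(overall_precision, overall_recall), 4))   # 0.8817
```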

#### Results on token level

In [40]:
from sklearn.metrics import classification_report
print(classification_report(np.concatenate(true_labels), np.concatenate(true_predictions)))

              precision    recall  f1-score   support

       B-LOC       0.93      0.93      0.93      1668
      B-MISC       0.79      0.82      0.80       702
       B-ORG       0.91      0.90      0.91      1661
       B-PER       0.96      0.96      0.96      1617
       I-LOC       0.91      0.90      0.90      1748
      I-MISC       0.58      0.63      0.60       886
       I-ORG       0.89      0.91      0.90      3172
       I-PER       0.97      0.97      0.97      4082
           O       0.99      0.99      0.99     47925

    accuracy                           0.97     63461
   macro avg       0.88      0.89      0.89     63461
weighted avg       0.97      0.97      0.97     63461



#### Results on entity level

In [41]:
from seqeval.metrics import classification_report as classification_report_seqeval
print(classification_report_seqeval(true_labels, true_predictions))


              precision    recall  f1-score   support

         LOC       0.90      0.92      0.91      1668
        MISC       0.67      0.78      0.72       702
         ORG       0.84      0.87      0.85      1661
         PER       0.94      0.96      0.95      1617

   micro avg       0.86      0.90      0.88      5648
   macro avg       0.84      0.88      0.86      5648
weighted avg       0.87      0.90      0.88      5648
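
The entity-level scores are lower than the token-level ones because an entity only counts as correct when both its boundaries and its type match the gold annotation exactly. The following is a simplified sketch of that idea, not seqeval's actual implementation. The function names and the toy sentence are assumptions, and a stray I- tag is treated here as the start of a new entity.

```python
def extract_spans(labels):
    """Return the set of (type, start, end) spans in one IOB label sequence."""
    spans, start, current = set(), None, None
    for i, label in enumerate(labels + ["O"]):          # sentinel closes a trailing span
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current
        )
        if current is not None and (label == "O" or starts_new):
            spans.add((current, start, i))               # close the open span
            current = None
        if starts_new:
            start, current = i, label[2:]
    return spans


def entity_scores(gold_sequences, predicted_sequences):
    """Entity-level precision, recall and F1 with exact span matching."""
    gold, predicted = set(), set()
    for index, (g, p) in enumerate(zip(gold_sequences, predicted_sequences)):
        gold |= {(index,) + span for span in extract_spans(g)}
        predicted |= {(index,) + span for span in extract_spans(p)}
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold_labels = [["B-PER", "I-PER", "O", "B-LOC"]]
predicted_labels = [["B-PER", "O", "O", "B-LOC"]]
print(entity_scores(gold_labels, predicted_labels))      # (0.5, 0.5, 0.5)
```

In this toy example three of the four tokens are labeled correctly, but the truncated person entity counts as a full error at the entity level, so only one of the two entities is correct.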

