# Named Entity Recognition with the 🤗 Transformers pipeline() function and with a manual approach: tokenization, applying the model, and post-processing the results

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## Batch encoding

The output of the tokenizer is not a simple Python dictionary; we actually get a special BatchEncoding object. It behaves like a dictionary (which is why we can index it by key), but it has additional methods that are mainly used by fast tokenizers.

In addition to their parallelization capabilities, a key feature of fast tokenizers is that they always keep track of the original span of text each final token comes from - a feature we call *offset mapping*. This in turn enables features such as mapping each word to the tokens it generated, or mapping each character of the source text to the token that contains it, and vice versa.

Consider an example:

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

As mentioned earlier, we get a BatchEncoding object in the tokenizer output:

In [None]:
example = "I love KPI IASA and my name is Dmytro."
encoding = tokenizer(example)
print(type(encoding))
print(encoding)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [101, 146, 1567, 148, 23203, 146, 10719, 1592, 1105, 1139, 1271, 1110, 141, 4527, 8005, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Since the AutoTokenizer class chooses a fast tokenizer by default, we can use the additional methods provided by this BatchEncoding object. We have two ways to test if our tokenizer is fast or slow. We can check the is_fast attribute of the tokenizer:

In [None]:
tokenizer.is_fast

True

or check the same attribute of our encoding:

In [None]:
encoding.is_fast

True

Let's see what a fast tokenizer allows us to do. First, we can access the tokens without having to convert the IDs back to tokens:

In [None]:
encoding.tokens()

['[CLS]',
 'I',
 'love',
 'K',
 '##PI',
 'I',
 '##AS',
 '##A',
 'and',
 'my',
 'name',
 'is',
 'D',
 '##my',
 '##tro',
 '.',
 '[SEP]']

In this case, the token at index 4 is ##PI, which is part of the abbreviation "KPI" in the original sentence. We can also use the word_ids() method to get the index of the word that each token comes from:

In [None]:
encoding.word_ids()

[None, 0, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 8, 9, None]

We can see that the tokenizer's [CLS] and [SEP] special tokens are mapped to None, and that every other token is mapped to the word it comes from. This is especially useful for determining whether a token is at the beginning of a word or whether two tokens belong to the same word. We could rely on the ## prefix for this, but that only works for BERT-like tokenizers; this method works for any type of tokenizer as long as it is fast. Next, we'll see how we can use this capability to correctly apply the labels we have for each word to the tokens in named entity recognition (NER) tasks.
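As a small sketch of that idea, the two lists printed above are enough to flag which tokens start a word and to spread one label per word over the tokens of that word. The `word_labels` list below is invented for illustration (it is not model output), and the lists are copied by hand so the snippet runs without downloading a tokenizer.

```python
# Values copied from the outputs of encoding.tokens() and encoding.word_ids() above.
tokens = ["[CLS]", "I", "love", "K", "##PI", "I", "##AS", "##A", "and", "my",
          "name", "is", "D", "##my", "##tro", ".", "[SEP]"]
word_ids = [None, 0, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 8, 9, None]

# Hypothetical word-level labels, one per word (illustrative, not model output).
word_labels = ["O", "O", "ORG", "ORG", "O", "O", "O", "O", "PER", "O"]

aligned = []
previous_word = None
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        # Special tokens do not belong to any word, so they get no label.
        aligned.append((token, "special", None))
    else:
        # A token starts a word when its word index differs from the previous one.
        kind = "word start" if word_id != previous_word else "continuation"
        aligned.append((token, kind, word_labels[word_id]))
    previous_word = word_id

for row in aligned:
    print(row)
```

Nothing here depends on the ## prefix, so the same loop would apply to any fast tokenizer that provides word indices.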

Similarly, there is a sequence_ids() method that we can use to map a token to the sentence it comes from (although in this case the token_type_ids returned by the tokenizer can give us the same information).

Finally, we can map any word or token to characters in the original text, and vice versa, using the word_to_chars() or token_to_chars() and char_to_word() or char_to_token() methods. For example, the word_ids() method told us that ##PI is part of the word at index 2, but what is that word in the sentence? We can find out like this:

In [None]:
start, end = encoding.word_to_chars(2)
example[start:end]

'KPI'

As we mentioned earlier, this is all possible because the fast tokenizer keeps track of the span of text each token comes from in a list of offsets. To illustrate their use, we next show how to reproduce the results of the token classification pipeline manually.
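To make the role of the offsets a little more concrete, here is a rough sketch of how a lookup like word_to_chars() could be emulated from a list of offsets and the word indices. This only illustrates the idea and is not how the library implements it; `word_span` is a made-up helper, and the offsets are the values this tokenizer returns for our sentence, copied by hand.

```python
example = "I love KPI IASA and my name is Dmytro."

# Word index of each token, as returned by encoding.word_ids() above.
word_ids = [None, 0, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 8, 9, None]

# (start, end) character span of each token; (0, 0) marks special tokens.
offsets = [(0, 0), (0, 1), (2, 6), (7, 8), (8, 10), (11, 12), (12, 14), (14, 15),
           (16, 19), (20, 22), (23, 27), (28, 30), (31, 32), (32, 34), (34, 37),
           (37, 38), (0, 0)]


def word_span(word_index):
    """Hypothetical helper: character span covered by all tokens of one word."""
    spans = [span for span, wid in zip(offsets, word_ids) if wid == word_index]
    # The word runs from the start of its first token to the end of its last one.
    return spans[0][0], spans[-1][1]


start, end = word_span(2)
print(start, end, example[start:end])
```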

## Inside the token-classification pipeline

We have already applied NER using the 🤗 Transformers pipeline() function, and we have seen how a pipeline combines the three steps required to get predictions from raw text: tokenization, passing the inputs through the model, and post-processing. The first two steps in the token classification pipeline are the same as in any other pipeline, but the post-processing is a bit more complicated - let's take a look!

[Figure: full_nlp_pipeline.svg]

### Getting the base results with the pipeline

First, let's grab the token classification pipeline so we have some results to compare against manually. The default model is dbmdz/bert-large-cased-finetuned-conll03-english; it performs NER on sentences:

In [None]:
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("I love KPI IASA and my name is Dmytro.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'ORG',
  'score': 0.99054736,
  'word': 'KPI IASA',
  'start': 7,
  'end': 15},
 {'entity_group': 'PER',
  'score': 0.99773145,
  'word': 'Dmytro',
  'start': 31,
  'end': 37}]

The model correctly identified "KPI IASA" as an organization and "Dmytro" as a person. Because we passed aggregation_strategy="simple", the pipeline grouped the tokens that belong to the same entity; without this argument it returns one entry per token.

The selected aggregation_strategy changes the scores computed for each grouped entity. With "simple", the score is just the mean of the scores of the tokens in the given entity: for example, the "KPI IASA" score is the mean of the scores of the tokens K, ##PI, I, ##AS and ##A (these per-token scores are computed manually later in this section). Other available strategies are:


* "first", where the score of each entity is the score of the first token of that entity (so for "KPI IASA" it would be 0.9960531, the score of the token K)
* "max", where the score of each entity is the maximum score of the tokens in that entity (so for "KPI IASA" it would again be 0.9960531, the score of the same token K)
* "average", where the score of each entity is the average of the scores of the words that make up that entity (so for "Dmytro" there would be no difference from the "simple" strategy, but "KPI IASA" would have a score of 0.9912031, the average of 0.9944813 for "KPI" and 0.9879248 for "IASA")
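The arithmetic behind these numbers can be checked with a few lines of plain Python. This is only a sketch that follows the definitions as stated in the list above; it is not the pipeline's own code, whose implementation may differ in detail. The per-token scores are the ones obtained manually later in this section.

```python
from statistics import mean

# Per-token scores for "KPI IASA", grouped by the word each token belongs to.
word_scores = {
    "KPI": [0.9960530996322632, 0.9929094910621643],                       # K, ##PI
    "IASA": [0.9920504689216614, 0.9813381433486938, 0.9903860092163086],  # I, ##AS, ##A
}
token_scores = [score for scores in word_scores.values() for score in scores]

strategies = {
    "simple": mean(token_scores),   # mean over all tokens of the entity
    "first": token_scores[0],       # score of the first token
    "max": max(token_scores),       # highest token score
    # mean over words, where each word's score is the mean of its tokens
    "average": mean(mean(scores) for scores in word_scores.values()),
}

for name, score in strategies.items():
    print(f"{name:>7}: {score:.7f}")
```

"simple" and "average" differ here because the two words contain different numbers of tokens, so averaging per word first weights the tokens differently.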


Now let's see how to get these results without using the pipeline() function!

### From inputs to predictions

First we need to tokenize our input and pass it through the model. We instantiate the tokenizer and model using the AutoXxx classes and then use them in our example:

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "I love KPI IASA and my name is Dmytro."

In [None]:
tokens = tokenizer.tokenize(example)
print(tokens)

['I', 'love', 'K', '##PI', 'I', '##AS', '##A', 'and', 'my', 'name', 'is', 'D', '##my', '##tro', '.']


In [None]:
inputs = tokenizer(example, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,   146,  1567,   148, 23203,   146, 10719,  1592,  1105,  1139,
          1271,  1110,   141,  4527,  8005,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
outputs = model(**inputs)
print(outputs.logits)

tensor([[[ 8.5987, -2.2723, -1.7703, -1.8164, -1.0974, -1.7894, -0.0731,
          -1.9788,  0.4751],
         [10.2989, -2.3537, -1.3576, -2.5748, -0.3721, -2.2373,  0.3227,
          -2.2982, -0.4618],
         [10.1152, -2.0770, -1.0784, -2.7295, -1.0400, -1.9658,  1.1737,
          -2.1848, -0.5156],
         [ 0.5800, -2.6068, -1.2598, -3.0422, -1.1074, -0.5813,  7.1818,
          -1.9558,  0.6878],
         [ 1.3509, -2.7273, -1.4815, -3.0313, -1.3565, -0.7153,  6.9782,
          -2.1722,  0.9556],
         [ 0.7015, -2.4438, -1.3839, -3.1034, -1.3865,  0.0136,  6.6990,
          -1.8907,  0.9896],
         [ 1.2606, -2.4662, -0.9454, -2.8025, -1.6704, -0.2027,  6.0982,
          -1.8591,  1.1760],
         [ 1.3620, -2.5348, -1.1794, -2.6840, -1.2094, -0.6736,  6.5684,
          -2.1349,  0.4874],
         [10.5169, -2.4071, -1.4378, -2.4828, -1.3460, -1.7373,  0.3820,
          -1.9807, -0.4918],
         [10.6922, -2.3155, -1.2691, -2.5498, -0.6087, -2.0638,  0.1229,
         

Since we use AutoModelForTokenClassification, we get one set of logits for each token in the input sequence:

In [None]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 17])
torch.Size([1, 17, 9])


We have a batch with 1 sequence of 17 tokens, and the model has 9 different labels, so the output of the model has shape 1 x 17 x 9. As with the text classification pipeline, we use the softmax function to convert these logits to probabilities, and we take the argmax to get predictions (note that we can take the argmax on the logits directly, since softmax does not change the order):

In [None]:
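import torch

# Softmax over the label dimension; [0] selects the only sequence in the batch
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
print(probabilities)
# Convert to nested Python lists so that scores can be read as plain floats later
probabilities = probabilities.tolist()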

tensor([[9.9933e-01, 1.8989e-05, 3.1371e-05, 2.9956e-05, 6.1482e-05, 3.0776e-05,
         1.7124e-04, 2.5466e-05, 2.9627e-04],
        [9.9989e-01, 3.1987e-06, 8.6612e-06, 2.5642e-06, 2.3205e-05, 3.5938e-06,
         4.6488e-05, 3.3815e-06, 2.1214e-05],
        [9.9980e-01, 5.0690e-06, 1.3759e-05, 2.6395e-06, 1.4298e-05, 5.6648e-06,
         1.3082e-04, 4.5509e-06, 2.4155e-05],
        [1.3527e-03, 5.5867e-05, 2.1487e-04, 3.6148e-05, 2.5024e-04, 4.2345e-04,
         9.9605e-01, 1.0713e-04, 1.5065e-03],
        [3.5727e-03, 6.0511e-05, 2.1032e-04, 4.4648e-05, 2.3833e-04, 4.5252e-04,
         9.9291e-01, 1.0542e-04, 2.4060e-03],
        [2.4650e-03, 1.0614e-04, 3.0631e-04, 5.4875e-05, 3.0550e-04, 1.2391e-03,
         9.9205e-01, 1.8452e-04, 3.2882e-03],
        [7.7783e-03, 1.8723e-04, 8.5672e-04, 1.3375e-04, 4.1494e-04, 1.8004e-03,
         9.8134e-01, 3.4356e-04, 7.1471e-03],
        [5.4289e-03, 1.1024e-04, 4.2754e-04, 9.4965e-05, 4.1493e-04, 7.0905e-04,
         9.9039e-01, 1.6445e-0

In [None]:
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)

[0, 0, 0, 6, 6, 6, 6, 6, 0, 0, 0, 0, 4, 4, 4, 0, 0]
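The remark that softmax does not change the order can be verified on a single row of logits. The sketch below uses the (rounded) logits printed above for the token K and only the standard library, so it runs without the model.

```python
import math

# Logits for the token "K", copied from the model output above
# (the row whose largest value is 7.1818).
logits = [0.5800, -2.6068, -1.2598, -3.0422, -1.1074, -0.5813, 7.1818, -1.9558, 0.6878]

# Numerically stable softmax: subtract the maximum before exponentiating.
largest = max(logits)
exponentials = [math.exp(x - largest) for x in logits]
total = sum(exponentials)
probs = [value / total for value in exponentials]

# Index of the largest value, computed on the logits and on the probabilities.
argmax_logits = max(range(len(logits)), key=logits.__getitem__)
argmax_probs = max(range(len(probs)), key=probs.__getitem__)
print(argmax_logits, argmax_probs)
```

Both indices are the same because the exponential is strictly increasing and the division by a positive constant preserves the ordering.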


The *attribute* model.config.id2label contains a mapping of indices to labels that we can use to understand predictions:

In [None]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}
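As a quick sketch of how this mapping is used, each predicted index can be looked up next to its token. The values below are copied by hand from the outputs above so the snippet runs on its own.

```python
# Mapping copied from model.config.id2label above.
id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}

# Predictions and tokens copied from the outputs above.
predictions = [0, 0, 0, 6, 6, 6, 6, 6, 0, 0, 0, 0, 4, 4, 4, 0, 0]
tokens = ["[CLS]", "I", "love", "K", "##PI", "I", "##AS", "##A", "and", "my",
          "name", "is", "D", "##my", "##tro", ".", "[SEP]"]

# Pair every token with the label name of its predicted index.
labeled = [(token, id2label[pred]) for token, pred in zip(tokens, predictions)]
for token, label in labeled:
    print(f"{token:>6} -> {label}")
```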

As we saw earlier, there are 9 labels: O is the label for tokens that are not in any named entity (it stands for "outside"), and we have two labels for each type of entity (miscellaneous, person, organization, and location). The label B-XXX indicates that the token is at the beginning of an entity XXX, and the label I-XXX indicates that the token is inside an entity XXX. For instance, in the current example we would expect our model to classify the token D as B-PER (beginning of a person entity) and the tokens ##my and ##tro as I-PER (inside a person entity).

We might think that the model was wrong in this case because it gave the label I-PER to all three of these tokens, but this is not the case. There are actually two formats for these B- and I- labels: IOB1 and IOB2. The IOB2 format (pink below) is the one we introduced, while the IOB1 format (blue) uses labels starting with B- only to separate two adjacent entities of the same type. The model we use was fine-tuned on a dataset using the IOB1 format, so it assigns the I-PER label to the D token.

[Figure: IOB_versions.svg, comparing the IOB1 and IOB2 formats]
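To make the difference between the two formats concrete, here is a minimal sketch that rewrites IOB1-style tags as IOB2. The helper name `iob1_to_iob2` is made up for this illustration and is not part of 🤗 Transformers; the tag list mirrors the per-token predictions for our example.

```python
def iob1_to_iob2(tags):
    """Hypothetical helper: rewrite IOB1 tags so that every entity starts with B-."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            # The previous tag was O or another entity type, so this token opens an entity.
            converted.append("B-" + tag[2:])
        else:
            # O, an explicit B- tag, or an I- tag continuing the same entity type.
            converted.append(tag)
        previous = tag
    return converted


# Per-token labels predicted for "I love KPI IASA and my name is Dmytro."
iob1_tags = ["O", "O", "O", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG",
             "O", "O", "O", "O", "I-PER", "I-PER", "I-PER", "O", "O"]
iob2_tags = iob1_to_iob2(iob1_tags)
print(iob2_tags)
```

After the conversion, K is tagged B-ORG and D is tagged B-PER, which is the labeling we first expected from the model.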

We're now ready to reproduce (almost entirely) the results the pipeline returns without an aggregation strategy: we can simply grab the score and label of each token that wasn't classified as O:

In [None]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)

[{'entity': 'I-ORG', 'score': 0.9960530996322632, 'word': 'K'}, {'entity': 'I-ORG', 'score': 0.9929094910621643, 'word': '##PI'}, {'entity': 'I-ORG', 'score': 0.9920504689216614, 'word': 'I'}, {'entity': 'I-ORG', 'score': 0.9813381433486938, 'word': '##AS'}, {'entity': 'I-ORG', 'score': 0.9903860092163086, 'word': '##A'}, {'entity': 'I-PER', 'score': 0.998601496219635, 'word': 'D'}, {'entity': 'I-PER', 'score': 0.9962921142578125, 'word': '##my'}, {'entity': 'I-PER', 'score': 0.9983012080192566, 'word': '##tro'}]


This is very similar to the pipeline's output, with one exception: the pipeline also gave us information about the start and end of each entity in the original sentence. This is where our offset mapping comes into play. To get the offsets, we just need to set return_offsets_mapping=True when we apply the tokenizer to our input:

In [None]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 1),
 (2, 6),
 (7, 8),
 (8, 10),
 (11, 12),
 (12, 14),
 (14, 15),
 (16, 19),
 (20, 22),
 (23, 27),
 (28, 30),
 (31, 32),
 (32, 34),
 (34, 37),
 (37, 38),
 (0, 0)]

Each tuple is the span of text corresponding to one token, where (0, 0) is reserved for special tokens. We saw earlier that the token at index 4 is ##PI, which has (8, 10) as its offsets here. If we grab the corresponding fragment in our example:

In [None]:
example[8:10]

'PI'

we get the correct span of text without the ##.

Using this, we can now complete the previous results:

In [None]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)

[{'entity': 'I-ORG', 'score': 0.9960530996322632, 'word': 'K', 'start': 7, 'end': 8}, {'entity': 'I-ORG', 'score': 0.9929094910621643, 'word': '##PI', 'start': 8, 'end': 10}, {'entity': 'I-ORG', 'score': 0.9920504689216614, 'word': 'I', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9813381433486938, 'word': '##AS', 'start': 12, 'end': 14}, {'entity': 'I-ORG', 'score': 0.9903860092163086, 'word': '##A', 'start': 14, 'end': 15}, {'entity': 'I-PER', 'score': 0.998601496219635, 'word': 'D', 'start': 31, 'end': 32}, {'entity': 'I-PER', 'score': 0.9962921142578125, 'word': '##my', 'start': 32, 'end': 34}, {'entity': 'I-PER', 'score': 0.9983012080192566, 'word': '##tro', 'start': 34, 'end': 37}]


This matches what the token classification pipeline returns when no aggregation strategy is set!

### Entity grouping

Using the offsets to determine the start and end keys for each entity is convenient, but that information isn't strictly necessary. When we want to group entities together, however, the offsets will save us a lot of messy code. For example, if we wanted to group the tokens K, ##PI and I, ##AS, ##A, we could write special rules saying that the first two should be joined while removing the ##, that the I should be added with a space because it doesn't start with ##, and that the last two should be joined in the same way - but that would only work for this particular type of tokenizer. We would have to write a different set of rules for a SentencePiece or Byte-Pair-Encoding tokenizer (discussed later).

With the offsets, all of that custom code goes away: we can simply take the span of the source text that starts at the first token and ends at the last token. So, in the case of the tokens K, ##PI and I, ##AS, ##A, we start at character 7 (the beginning of K) and end before character 15 (the end of ##A):
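For comparison, the rule-based approach described above might look like the following sketch. `join_wordpieces` is a made-up helper, it assumes words are separated by exactly one space, and it only understands the ## convention of WordPiece tokenizers.

```python
def join_wordpieces(tokens):
    """Hypothetical helper: rebuild text from WordPiece tokens using the ## rule."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            # Continuation piece: glue it to the previous piece without the ## marker.
            text += token[2:]
        elif text:
            # New word: assume it was separated from the previous one by a single space.
            text += " " + token
        else:
            # Very first token: nothing to separate it from.
            text = token
    return text


print(join_wordpieces(["K", "##PI", "I", "##AS", "##A"]))
```

The single-space assumption and the hard-coded ## marker are exactly the kind of tokenizer-specific details that make this approach fragile.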

In [None]:
example[7:15]

'KPI IASA'

To write code that post-processes the predictions while grouping entities, we group together tokens that are consecutive and labeled I-XXX, except for the first one, which may be labeled B-XXX or I-XXX (so we stop grouping an entity when we reach an O, a new entity type, or a B-XXX that tells us another entity of the same type is starting):

In [None]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]
id2label = model.config.id2label

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix to get the entity type
    label = label[2:]
    start, end = offsets[idx]
    # The first token of an entity may be labeled B-XXX or I-XXX
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Grab all the following tokens labeled I-XXX for the same entity type
    while idx < len(predictions) and id2label[predictions[idx]] == f"I-{label}":
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of the scores of all tokens in the grouped entity.
    # idx already points at the next unprocessed token, so it is not advanced again.
    results.append(
        {
            "entity_group": label,
            "score": np.mean(all_scores).item(),
            "word": example[start:end],
            "start": start,
            "end": end,
        }
    )

print(results)

[{'entity_group': 'ORG', 'score': 0.9905474424362183, 'word': 'KPI IASA', 'start': 7, 'end': 15}, {'entity_group': 'PER', 'score': 0.997731606165568, 'word': 'Dmytro', 'start': 31, 'end': 37}]


And we get the same results as the pipeline with aggregation_strategy="simple"!