#**Batch encoding**#
##The output of a tokenizer isn't a simple Python dictionary; what we get is actually a special `BatchEncoding` object. It's a subclass of a dictionary (which is why we were able to index into that result without any problem before), but with additional methods that are mostly used by fast tokenizers.

##Besides their parallelization capabilities, the key functionality of fast tokenizers is that they always keep track of the original span of texts the final tokens come from -- a feature we call offset mapping. This in turn unlocks features like mapping each word to the tokens it generated or mapping each character of the original text to the token it's inside, and vice versa.
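##To make the idea of offset mapping concrete, here is a minimal sketch of a toy pre-tokenizer that records, for every piece it produces, the character span it came from. The function name `tokenize_with_offsets` and the regular expression are illustrative assumptions; this is not how the 🤗 Tokenizers library is implemented, only a picture of what "keeping track of the original span" means.

```python
import re

def tokenize_with_offsets(text):
    """Toy pre-tokenizer (hypothetical helper): split text into runs of word
    characters and single punctuation marks, keeping each (start, end) span."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Hello, world!"
tokens_and_offsets = tokenize_with_offsets(text)

for token, (start, end) in tokens_and_offsets:
    # Slicing the original text with the stored span gives back the token
    assert text[start:end] == token

print(tokens_and_offsets)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

##Because every span indexes into the original string, any later step can recover the exact source text of a token without knowing how the tokenizer splits or marks subwords.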

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "✏️ Try it out! Create your own example text and see if you can understand which tokens are associated with word ID, and also how to extract the character spans for a single word. For bonus points, try using two sentences as input and see if the sentence IDs make sense to you."
encoding = tokenizer(example)
print(type(encoding))
print(encoding)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [101, 100, 13665, 1122, 1149, 106, 140, 15998, 1240, 1319, 1859, 3087, 1105, 1267, 1191, 1128, 1169, 2437, 1134, 22559, 1116, 1132, 2628, 1114, 1937, 10999, 117, 1105, 1145, 1293, 1106, 16143, 1103, 1959, 15533, 1111, 170, 1423, 1937, 119, 1370, 6992, 1827, 117, 2222, 1606, 1160, 12043, 1112, 7758, 1105, 1267, 1191, 1103, 5650, 10999, 1116, 1294, 2305, 1106, 1128, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


##Since the `AutoTokenizer` class picks a fast tokenizer by default, we can use the additional methods this `BatchEncoding` object provides. We have two ways to check whether our tokenizer is a fast or a slow one. We can either check the attribute `is_fast` of the `tokenizer`:

In [2]:
tokenizer.is_fast

True

##or check the same attribute of our `encoding`:

In [3]:
encoding.is_fast

True

##Let's see what a fast tokenizer enables us to do. First, we can access the tokens without having to convert the IDs back to tokens:

In [None]:
encoding.tokens()

['[CLS]',
 '[UNK]',
 'Try',
 'it',
 'out',
 '!',
 'C',
 '##reate',
 'your',
 'own',
 'example',
 'text',
 'and',
 'see',
 'if',
 'you',
 'can',
 'understand',
 'which',
 'token',
 '##s',
 'are',
 'associated',
 'with',
 'word',
 'ID',
 ',',
 'and',
 'also',
 'how',
 'to',
 'extract',
 'the',
 'character',
 'spans',
 'for',
 'a',
 'single',
 'word',
 '.',
 'For',
 'bonus',
 'points',
 ',',
 'try',
 'using',
 'two',
 'sentences',
 'as',
 'input',
 'and',
 'see',
 'if',
 'the',
 'sentence',
 'ID',
 '##s',
 'make',
 'sense',
 'to',
 'you',
 '.',
 '[SEP]']

##In this case the token at index 7 is `##reate`, which is part of the word "Create" in the original sentence. We can also use the `word_ids()` method to get the index of the word each token comes from:

In [4]:
encoding.word_ids()

[None,
 0,
 1,
 2,
 3,
 4,
 5,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 52,
 53,
 54,
 55,
 56,
 57,
 None]

##We can see that the tokenizer's special tokens `[CLS]` and `[SEP]` are mapped to `None`, and then each token is mapped to the word it originates from. This is especially useful to determine if a token is at the start of a word or if two tokens are in the same word. We could rely on the `##` prefix for that, but it only works for BERT-like tokenizers; this method works for any type of tokenizer as long as it's a fast one. In the next chapter, we'll see how we can use this capability to apply the labels we have for each word properly to the tokens in tasks like named entity recognition (NER) and part-of-speech (POS) tagging. We can also use it to mask all the tokens coming from the same word in masked language modeling (a technique called whole word masking).
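##As a small illustration of how word IDs can be used, the sketch below works on a hand-written `word_ids` list rather than a real tokenizer output. The helper `align_labels_with_tokens` and the part-of-speech labels are assumptions made for this example; it only shows the general idea of copying each word's label to all of its tokens and of spotting word starts.

```python
def align_labels_with_tokens(word_labels, word_ids, ignore_label=None):
    """Give every token the label of the word it comes from (hypothetical helper).
    Special tokens, whose word ID is None, receive ignore_label."""
    return [
        ignore_label if word_id is None else word_labels[word_id]
        for word_id in word_ids
    ]

# Hand-written word IDs for the tokens
# ["[CLS]", "C", "##reate", "token", "##s", "[SEP]"]
word_ids = [None, 0, 0, 1, 1, None]
word_labels = ["VERB", "NOUN"]  # made-up labels for "Create" and "tokens"

token_labels = align_labels_with_tokens(word_labels, word_ids)
print(token_labels)
# [None, 'VERB', 'VERB', 'NOUN', 'NOUN', None]

# A token starts a word when its word ID differs from the previous token's
word_starts = [
    wid is not None and (i == 0 or wid != word_ids[i - 1])
    for i, wid in enumerate(word_ids)
]
print(word_starts)
# [False, True, False, True, False, False]
```

##Nothing here depends on the `##` prefix, which is why the same logic would carry over to tokenizers that mark subwords differently.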

##What counts as a word is not straightforward. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so they will consider it two words.

##Similarly, there is a `sequence_ids()` method that we can use to map a token to the sentence it came from (though in this case, the `token_type_ids` returned by the tokenizer can give us the same information).

##Lastly, we can map any word or token to characters in the original text, and vice versa, via the `word_to_chars()` or `token_to_chars()` and `char_to_word()` or `char_to_token()` methods. For instance, the `word_ids()` method told us that the token `out` is part of the word at index 3, but which word is it in the sentence? We can find out like this:

In [5]:
start, end = encoding.word_to_chars(3)
example[start:end]

'out'
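##Conceptually, these lookups are searches over the stored character spans. The sketch below imitates two of them on hand-written data. The functions are toy stand-ins that borrow the names of the real methods for readability; the token list, offsets, and word IDs are assumptions written out by hand, not produced by a tokenizer.

```python
text = "Create tokens"
tokens = ["C", "##reate", "token", "##s"]
offsets = [(0, 1), (1, 6), (7, 12), (12, 13)]  # character span of each token
word_ids = [0, 0, 1, 1]                        # word index of each token

def word_to_chars(word_index):
    """Toy stand-in: span from the first to the last token of the word."""
    spans = [offsets[i] for i, wid in enumerate(word_ids) if wid == word_index]
    return spans[0][0], spans[-1][1]

def char_to_token(char_index):
    """Toy stand-in: index of the token covering the character, else None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

start, end = word_to_chars(1)
print(text[start:end])    # tokens
print(char_to_token(12))  # 3, the token "##s"
print(char_to_token(6))   # None, the space belongs to no token
```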

#**Inside the token-classification pipeline**#
##In Chapter 1 we got our first taste of applying NER -- where the task is to identify which parts of the text correspond to entities like persons, locations, or organizations -- with the 🤗 Transformers `pipeline()` function. Then, in Chapter 2, we saw how a pipeline groups together the three stages necessary to get the predictions from a raw text: tokenization, passing the inputs through the model, and post-processing. The first two steps in the token-classification pipeline are the same as in any other pipeline, but the post-processing is a little more complex -- let's see how!

##**Getting the base results with the pipeline**##
##First, let's grab a token classification pipeline so we can get some results to compare manually. The model used by default is `dbmdz/bert-large-cased-finetuned-conll03-english`; it performs NER on sentences:

In [6]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-PER',
  'score': 0.99938285,
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.99815494,
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.99590707,
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.99923277,
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': 0.9738931,
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': 0.976115,
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.9887976,
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.9932106,
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

##The model properly identified each token generated by "Sylvain" as a person, each token generated by "Hugging Face" as an organization, and the token "Brooklyn" as a location. We can also ask the pipeline to group together the tokens that correspond to the same entity:

In [7]:
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

##The `aggregation_strategy` picked will change the scores computed for each grouped entity. With `"simple"` the score is just the mean of the scores of each token in the given entity: for instance, the score of "Sylvain" is the mean of the scores we saw in the previous example for the tokens `S`, `##yl`, `##va`, and `##in`. Other strategies available are:

##* `"first"`, where the score of each entity is the score of the first token of that entity (so for "Sylvain" it would be 0.9993828, the score of the token `S`)
##* `"max"`, where the score of each entity is the maximum score of the tokens in that entity (so for "Hugging Face" it would be 0.98879766, the score of "Face")
##* `"average"`, where the score of each entity is the average of the scores of the words composing that entity (so for "Sylvain" there would be no difference from the `"simple"` strategy, but "Hugging Face" would have a score of 0.9819, the average of the scores for "Hugging", 0.975, and "Face", 0.98879)
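##The arithmetic behind these strategies can be checked by hand with the token scores printed by the first pipeline call. The sketch below follows the descriptions given above, not the pipeline's actual implementation (which also has to decide on a label for each group); `entity_score` and the nested-list layout are assumptions made for this example.

```python
from statistics import mean

# Token scores copied from the ungrouped pipeline output, grouped by word:
# each entity is a list of words, each word a list of token scores
entities = {
    "Sylvain": [[0.99938285, 0.99815494, 0.99590707, 0.99923277]],
    "Hugging Face": [[0.9738931, 0.976115], [0.9887976]],
}

def entity_score(words, strategy):
    """Score of one entity under the strategies as described in the text."""
    token_scores = [score for word in words for score in word]
    if strategy == "simple":
        return mean(token_scores)                    # mean over all tokens
    if strategy == "first":
        return token_scores[0]                       # first token only
    if strategy == "max":
        return max(token_scores)                     # best token
    if strategy == "average":
        return mean([mean(word) for word in words])  # mean of per-word means
    raise ValueError(f"unknown strategy: {strategy}")

for name, words in entities.items():
    for strategy in ("simple", "first", "max", "average"):
        print(f"{name:13s} {strategy:8s} {entity_score(words, strategy):.4f}")
```

##For "Hugging Face", `"simple"` gives about 0.9796 while `"average"` gives about 0.9819, because the two tokens of "Hugging" are first collapsed into a single word score.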

#**From inputs to predictions**#

##First we need to tokenize our input and pass it through the model. This is done exactly as in Chapter 2; we instantiate the tokenizer and the model using the `AutoXxx` classes and then use them on our example:

In [8]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)


##Since we're using `AutoModelForTokenClassification` here, we get one set of logits for each token in the input sequence:

In [9]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])


##The same steps work in TensorFlow: we instantiate the tokenizer as before, load the model with the matching `TFAutoXxx` class, and ask the tokenizer for TensorFlow tensors with `return_tensors="tf"`:

In [10]:
from transformers import AutoTokenizer, TFAutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)

All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


##Since we're using `TFAutoModelForTokenClassification` here, we again get one set of logits for each token in the input sequence:

In [11]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

(1, 19)
(1, 19, 9)


##We have a batch with 1 sequence of 19 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 19 x 9. Like for the text classification pipeline, we use a softmax function to convert those logits to probabilities, and we take the argmax to get predictions (note that we can take the argmax on the logits because the softmax does not change the order):

In [12]:
import tensorflow as tf  # Import TensorFlow library

# Apply softmax function to logits to get probabilities
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
# Convert probabilities to a Python list
probabilities = probabilities.numpy().tolist()

# Get the index of the highest probability (predicted class)
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
# Convert predictions to a Python list
predictions = predictions.numpy().tolist()

print(predictions)  # Print the predictions
print(probabilities)

[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
[[0.9994322061538696, 1.6470285117975436e-05, 3.4267035516677424e-05, 1.6042313291109167e-05, 8.250691462308168e-05, 2.138227500836365e-05, 0.00015649088891223073, 1.9652095943456516e-05, 0.00022089220874477178], [0.9989631175994873, 1.851577326306142e-05, 5.240462633082643e-05, 1.253474511031527e-05, 0.00043473768164403737, 3.087438381044194e-05, 0.0003146878443658352, 2.7860780392074957e-05, 0.00014510893379338086], [0.9997084140777588, 8.308127689815592e-06, 2.874564415833447e-05, 5.6503645282646175e-06, 8.694856660440564e-05, 9.783467248780653e-06, 6.78614669595845e-05, 1.1794005331466906e-05, 7.241901039378718e-05], [0.9998350143432617, 5.645536930387607e-06, 1.3955179383629002e-05, 4.313381850806763e-06, 4.017698665848002e-05, 8.123078259814065e-06, 5.648501610266976e-05, 8.991642971523106e-06, 2.723914076341316e-05], [0.00018333389016333967, 2.5156619813060388e-05, 4.846203955821693e-05, 1.490058366471203e-05, 0.999382913
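##The remark that the argmax can be taken directly on the logits is easy to verify: softmax applies an increasing function to every logit and divides by a common positive sum, so the largest logit remains the largest probability. Here is a minimal pure-Python check on made-up logits (the values and helper names are assumptions for this sketch):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, -1.0, 0.5, 7.5, 3.0]  # made-up logits for 5 labels
probs = softmax(logits)

assert abs(sum(probs) - 1.0) < 1e-9           # probabilities sum to 1
assert argmax(probs) == argmax(logits) == 3   # same winner either way
```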

##The `model.config.id2label` attribute contains the mapping of indexes to labels that we can use to make sense of the predictions:

In [13]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

##As we saw earlier, there are 9 labels: `O` is the label for the tokens that are not in any named entity (it stands for "outside"), and we then have two labels for each type of entity (miscellaneous, person, organization, and location). The label `B-XXX` indicates the token is at the beginning of an entity `XXX` and the label `I-XXX` indicates the token is inside the entity `XXX`. For instance, in the current example we would expect our model to classify the token `S` as `B-PER` (beginning of a person entity) and the tokens `##yl`, `##va` and `##in` as `I-PER` (inside a person entity).

##You might think the model was wrong in this case as it gave the label `I-PER` to all four of these tokens, but that's not entirely true. There are actually two formats for those `B-` and `I-` labels: IOB1 and IOB2. The IOB2 format (in pink below) is the one we introduced, whereas in the IOB1 format (in blue), the labels beginning with `B-` are only ever used to separate two adjacent entities of the same type. The model we are using was fine-tuned on a dataset using that format, which is why it assigns the label `I-PER` to the `S` token.
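##The difference between the two formats is easiest to see by converting one into the other. The function below is a hedged sketch (the name `iob1_to_iob2` and the sample tag sequences are assumptions for illustration): it rewrites an IOB1 sequence so that every entity starts with a `B-` tag, as IOB2 requires.

```python
def iob1_to_iob2(tags):
    """Convert IOB1 tags to IOB2: the first token of every entity becomes B-XXX."""
    converted = []
    previous_type = None  # entity type of the previous token, None after "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            previous_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and previous_type != entity_type:
            # In IOB1 an entity may open with I-XXX; IOB2 wants B-XXX there
            converted.append(f"B-{entity_type}")
        else:
            converted.append(tag)
        previous_type = entity_type
    return converted

# Sample tags in the style of the model's predictions (IOB1)
iob1 = ["O", "I-PER", "I-PER", "O", "I-ORG", "I-ORG", "O", "I-LOC"]
print(iob1_to_iob2(iob1))
# ['O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC']

# In IOB1, B- only appears between two adjacent entities of the same type
print(iob1_to_iob2(["I-PER", "B-PER", "I-PER"]))
# ['B-PER', 'B-PER', 'I-PER']
```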

##With this map, we are ready to reproduce (almost entirely) the results of the first pipeline -- we can just grab the score and label of each token that was not classified as O:

In [14]:
# Collect one dictionary per token that belongs to an entity
results = []

# Retrieve the token strings the fast tokenizer stored in the encoding
tokens = inputs.tokens()

# Walk through the predicted label ID of each token
for idx, pred in enumerate(predictions):
    # Convert the numeric prediction to its label string
    label = model.config.id2label[pred]

    # Skip tokens labeled "O" (outside any named entity)
    if label != "O":
        results.append({
            # The predicted label, e.g. "I-PER" or "I-LOC"
            "entity": label,
            # The probability the model assigned to that label
            "score": probabilities[idx][pred],
            # The token that was classified
            "word": tokens[idx]
        })

print(results)

[{'entity': 'I-PER', 'score': 0.9993829131126404, 'word': 'S'}, {'entity': 'I-PER', 'score': 0.998154878616333, 'word': '##yl'}, {'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va'}, {'entity': 'I-PER', 'score': 0.9992326498031616, 'word': '##in'}, {'entity': 'I-ORG', 'score': 0.9738930463790894, 'word': 'Hu'}, {'entity': 'I-ORG', 'score': 0.9761150479316711, 'word': '##gging'}, {'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face'}, {'entity': 'I-LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn'}]


##This is very similar to what we had before, with one exception: the pipeline also gave us information about the start and end of each entity in the original sentence. This is where our offset mapping will come into play. To get the offsets, we just have to set `return_offsets_mapping=True` when we apply the tokenizer to our inputs:

In [15]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 12),
 (12, 14),
 (14, 16),
 (16, 18),
 (19, 22),
 (23, 24),
 (25, 29),
 (30, 32),
 (33, 35),
 (35, 40),
 (41, 45),
 (46, 48),
 (49, 57),
 (57, 58),
 (0, 0)]

##Each tuple is the span of text corresponding to each token, where `(0, 0)` is reserved for the special tokens. We saw before that the token at index 5 is `##yl`, which has `(12, 14)` as offsets here. If we grab the corresponding slice in our example:

In [16]:
example[12:14]

'yl'

##we get the proper span of text without the `##`. Using this, we can now complete the previous results:

In [17]:
# Collect one dictionary per token that belongs to an entity
results = []

# Tokenize again, this time asking for the character span of each token
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)

# Token strings and their (start, end) spans in the original text
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

# Walk through the predicted label ID of each token
for idx, pred in enumerate(predictions):
    # Convert the numeric prediction to its label string
    label = model.config.id2label[pred]

    # Skip tokens labeled "O" (outside any named entity)
    if label != "O":
        # Character span of this token in the original text
        start, end = offsets[idx]

        results.append({
            # The predicted label, e.g. "I-PER" or "I-LOC"
            "entity": label,
            # The probability the model assigned to that label
            "score": probabilities[idx][pred],
            # The token that was classified
            "word": tokens[idx],
            # Where the token starts and ends in the original text
            "start": start,
            "end": end,
        })

print(results)

[{'entity': 'I-PER', 'score': 0.9993829131126404, 'word': 'S', 'start': 11, 'end': 12}, {'entity': 'I-PER', 'score': 0.998154878616333, 'word': '##yl', 'start': 12, 'end': 14}, {'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va', 'start': 14, 'end': 16}, {'entity': 'I-PER', 'score': 0.9992326498031616, 'word': '##in', 'start': 16, 'end': 18}, {'entity': 'I-ORG', 'score': 0.9738930463790894, 'word': 'Hu', 'start': 33, 'end': 35}, {'entity': 'I-ORG', 'score': 0.9761150479316711, 'word': '##gging', 'start': 35, 'end': 40}, {'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face', 'start': 41, 'end': 45}, {'entity': 'I-LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


#**Grouping entities**#
##Using the offsets to determine the start and end keys for each entity is handy, but that information isn't strictly necessary. When we want to group the entities together, however, the offsets will save us a lot of messy code. For example, if we wanted to group together the tokens `Hu`, `##gging`, and `Face`, we could make special rules that say the first two should be attached while removing the `##`, and the `Face` should be added with a space since it does not begin with `##` -- but that would only work for this particular type of tokenizer. We would have to write another set of rules for a SentencePiece or a Byte-Pair-Encoding tokenizer (discussed later in this chapter).

##With the offsets, all that custom code goes away: we can just take the span in the original text that begins with the first token and ends with the last token. So, in the case of the tokens `Hu`, `##gging`, and `Face`, we should start at character 33 (the beginning of `Hu`) and end before character 45 (the end of `Face`):

In [18]:
example[33:45]

'Hugging Face'
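##To compare the two approaches side by side, here is a small sketch. The helper `merge_wordpieces` is a hypothetical example of the tokenizer-specific rules described above; the tokens and offsets are copied from the earlier outputs. Both routes give the same string here, but only the offset-based one is independent of the `##` convention.

```python
def merge_wordpieces(tokens):
    """Rule-based merge that only works for WordPiece-style '##' prefixes."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]       # continuation: glue to the previous piece
        elif text:
            text += " " + token     # new word: separate with a space
        else:
            text = token            # very first token
    return text

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
tokens = ["Hu", "##gging", "Face"]
offsets = [(33, 35), (35, 40), (41, 45)]

rule_based = merge_wordpieces(tokens)

# Offset-based: from the start of the first token to the end of the last one
start, end = offsets[0][0], offsets[-1][1]
offset_based = example[start:end]

assert rule_based == offset_based == "Hugging Face"
```

##The rule-based version also guesses that words are separated by exactly one space, which the offset-based slice never needs to assume.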

##To write the code that post-processes the predictions while grouping entities, we will group together entities that are consecutive and labeled with `I-XXX`, except for the first one, which can be labeled as `B-XXX` or `I-XXX` (so, we stop grouping an entity when we get an `O`, a new type of entity, or a `B-XXX` that tells us an entity of the same type is starting):

In [19]:
import numpy as np

# Collect one dictionary per grouped entity
results = []

# Tokenize again, asking for the character span of each token
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]

    if label != "O":
        # Strip the "B-" or "I-" prefix to keep only the entity type
        label = label[2:]

        # The first token (B-XXX or I-XXX) always opens the group
        start, end = offsets[idx]
        all_scores = [probabilities[idx][pred]]
        idx += 1

        # Extend the group with every following token labeled I-XXX
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][predictions[idx]])
            # Push the end of the span to the end of this token
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of the scores of the tokens in the group
        score = np.mean(all_scores).item()

        # The text of the entity is the slice between the first and last token
        word = example[start:end]

        results.append({
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        })
    else:
        # Not part of an entity: move on to the next token
        idx += 1

print(results)



[{'entity_group': 'PER', 'score': 0.998169407248497, 'word': 'Sylvain', 'start': 11, 'end': 18}, {'entity_group': 'ORG', 'score': 0.9796018600463867, 'word': 'Hugging Face', 'start': 33, 'end': 45}, {'entity_group': 'LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
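##The example sentence never puts two entities of the same type next to each other, so the `B-XXX` boundary rule is not exercised by it. The sketch below restates the grouping rule as a standalone function on plain lists and runs it on a made-up case; `group_entities`, the labels, scores, and offsets are all assumptions for illustration, not output from the model.

```python
def group_entities(labels, scores, offsets, text):
    """Group consecutive tokens of one entity; a B-XXX label starts a new group."""
    groups = []
    idx = 0
    while idx < len(labels):
        if labels[idx] == "O":
            idx += 1
            continue
        entity_type = labels[idx][2:]      # drop the "B-" or "I-" prefix
        start, end = offsets[idx]          # the first token opens the group
        group_scores = [scores[idx]]
        idx += 1
        # Only I-XXX tokens of the same type extend the current group
        while idx < len(labels) and labels[idx] == f"I-{entity_type}":
            group_scores.append(scores[idx])
            end = offsets[idx][1]
            idx += 1
        groups.append({
            "entity_group": entity_type,
            "score": sum(group_scores) / len(group_scores),
            "word": text[start:end],
            "start": start,
            "end": end,
        })
    return groups

# Made-up IOB1 predictions: two adjacent locations separated by a B- label
text = "Paris London"
labels = ["I-LOC", "B-LOC"]
scores = [0.9, 0.8]
offsets = [(0, 5), (6, 12)]

groups = group_entities(labels, scores, offsets, text)
print([g["word"] for g in groups])
# ['Paris', 'London']
```

##Had the second label been `I-LOC` instead, the two tokens would have been merged into a single "Paris London" entity.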
