# Fast tokenizers' special powers 

Reproduce the results of the token-classification (NER) and question-answering pipelines.

In [1]:
%%capture
!pip install datasets transformers[sentencepiece]

In [6]:
%%capture
!pip install fastcore
!pip install rich

In [8]:
from rich import print as rprint
from rich import inspect

- How to define a fast tokenizer (the default) and a slow tokenizer using the `AutoTokenizer` API from transformers
  - fast -> `AutoTokenizer.from_pretrained('bert-base-uncased')`
  - slow -> `AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=False)`

- Properly using a fast tokenizer requires giving it multiple texts at the same time (batches of text).
`raw_datasets.map(tokenize_with_fast, batched=True)` -> this is much faster, i.e. more than 20 times faster than the slow tokenizer (10.8s vs. 4min 41s in the table below).

- Slow tokenizers are written in Python inside the transformers library, whereas the fast versions are provided by the tokenizers library, which is written in Rust.

| batched | Fast tokenizer | Slow tokenizer |
| --- | --- | --- |
|  True | 10.8s | 4min 41s |
| False | 59.2s | 5min 3s |  


⚠️ When tokenizing a single sentence, you won’t always see a difference in speed between the slow and fast versions of the same tokenizer. In fact, the fast version might actually be slower! 

It’s only when ***tokenizing lots of texts in parallel at the same time*** that you will be able to clearly see the difference.


## Batch encoding

- When performing tokenization, we may lose information.
- It is not obvious which word a token belongs to.
- Fast tokenizers keep track of the word each token comes from.
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Double quotes, so the apostrophe in "Let's" does not end the string
encoding = tokenizer("Let's talk about tokenizer superpowers", return_offsets_mapping=True)
print(encoding.tokens())    # the tokens as strings
print(encoding.word_ids())  # index of the word each token comes from
```
- Fast tokenizers also keep track of the character span in the original text that each token came from.
```
print(encoding['offset_mapping'])
```
- Internal Pipeline of the tokenizer is **Normalization -> Pre-tokenization -> Model -> Special tokens**

- Word IDs application (Whole word masking, Token Classification)

- Offset Mapping application (question answering, token classification)
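To make the four pipeline stages concrete, here is a minimal pure-Python sketch of a WordPiece-style tokenizer that records word IDs and offsets along the way. It is only an illustration, not how the tokenizers library is implemented: the vocabulary `VOCAB`, the regex used for pre-tokenization, and every function name are assumptions made up for this example.

```python
import re

# Hypothetical toy vocabulary; a real checkpoint ships tens of thousands of entries.
VOCAB = {"[CLS]", "[SEP]", "[UNK]", "let", "'", "s", "talk", "about",
         "token", "##izer", "super", "##powers"}

def normalize(text):
    # Normalization: lowercase, as an uncased model would (length-preserving for ASCII).
    return text.lower()

def pre_tokenize(text):
    # Pre-tokenization: split into words and punctuation, keeping character spans.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def wordpiece(word):
    # Model: greedy longest-match-first split of one word into sub-tokens.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                break
            end -= 1
        if end == start:  # nothing matched: the whole word becomes unknown
            return [("[UNK]", 0, len(word))]
        pieces.append((piece, start, end))
        start = end
    return pieces

def encode(text):
    # Special tokens: wrap the sequence in [CLS] ... [SEP], mapped to word ID None.
    tokens, word_ids, offsets = ["[CLS]"], [None], [(0, 0)]
    for word_id, (word, w_start, _) in enumerate(pre_tokenize(normalize(text))):
        for piece, p_start, p_end in wordpiece(word):
            tokens.append(piece)
            word_ids.append(word_id)
            offsets.append((w_start + p_start, w_start + p_end))
    tokens.append("[SEP]")
    word_ids.append(None)
    offsets.append((0, 0))
    return tokens, word_ids, offsets

text = "Let's talk about tokenizer superpowers"
tokens, word_ids, offsets = encode(text)
print(tokens)
print(word_ids)
print(offsets)
```

Because every sub-token remembers the word it was cut from and its character span, the word IDs and the offset mapping fall out of the pipeline for free.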

In [27]:
import transformers

In [28]:
from transformers import AutoTokenizer
from fastcore.test import *

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
ex = 'My name is Manikandan and I work at Red Hat in Canada.'

encoding = tok(ex)
print(type(encoding))
test_eq(type(encoding), transformers.tokenization_utils_base.BatchEncoding)

<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [29]:
# check if our tokenizer is a fast or a slow one by checking the attribute of tokenizer or encoding
test_eq(tok.is_fast, True)
test_eq(encoding.is_fast, True)

In [30]:
# access the tokens without having to convert the IDs back to tokens
print(encoding.tokens())

['[CLS]', 'my', 'name', 'is', 'mani', '##kan', '##dan', 'and', 'i', 'work', 'at', 'red', 'hat', 'in', 'canada', '.', '[SEP]']


In [31]:
expected = ['[CLS]', 'my', 'name', 'is', 'mani', '##kan', '##dan', 'and', 'i', 'work', 'at', 'red', 'hat', 'in', 'canada', '.', '[SEP]']

test_eq(encoding.tokens(), expected)

In [32]:
# use the word_ids() method to get the index of the word each token comes from
encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, None]

- Note: special tokens such as [CLS] and [SEP] are mapped to None.

- This is useful to determine whether a token is **at the start of a word** or whether **two tokens are in the same word**.

- Capabilities powered by word IDs
  - applying the label of each word to the right tokens (NER, POS tagging)
  - masking all tokens coming from the same word in masked language modeling (**whole word masking**)
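As a small sketch of these uses, the snippet below works directly on the `tokens` and `word_ids` lists printed above, so it needs no model download. The helper names (`starts_word`, `same_word`, `mask_whole_word`) are hypothetical and not part of transformers.

```python
# Copied from the outputs of encoding.tokens() and encoding.word_ids() above
tokens = ['[CLS]', 'my', 'name', 'is', 'mani', '##kan', '##dan', 'and', 'i',
          'work', 'at', 'red', 'hat', 'in', 'canada', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, None]

def starts_word(word_ids, idx):
    """True if token `idx` is the first token of a word (special tokens never are)."""
    wid = word_ids[idx]
    return wid is not None and (idx == 0 or word_ids[idx - 1] != wid)

def same_word(word_ids, i, j):
    """True if tokens `i` and `j` come from the same word."""
    return word_ids[i] is not None and word_ids[i] == word_ids[j]

def mask_whole_word(tokens, word_ids, word_id, mask_token="[MASK]"):
    """Replace every token of word `word_id` with the mask token."""
    return [mask_token if wid == word_id else tok
            for tok, wid in zip(tokens, word_ids)]

print(starts_word(word_ids, 4), starts_word(word_ids, 5))
print(same_word(word_ids, 4, 6))
print(mask_whole_word(tokens, word_ids, 3))
```

Masking word 3 replaces 'mani', '##kan' and '##dan' together, which is the idea behind whole word masking.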

In [35]:
start, end = encoding.word_to_chars(3)
test_eq(ex[start:end], 'Manikandan')

**Tip**

The notion of what a word is is complicated. For instance, does “I’ll” (a contraction of “I will”) count as one or two words? It actually ***depends on the tokenizer and the pre-tokenization operation*** it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.

✏️ Try it out! Create a tokenizer from the bert-base-cased and roberta-base checkpoints and tokenize “81s” with them. What do you observe? What are the word IDs?

Ans: bert-base-cased treats 81s as a single word: it splits it into two tokens, '81' and '##s', which both map to word ID 0. roberta-base also produces two tokens, '81' and 's', but assigns them to two different words (word IDs 0 and 1).

In [25]:
# bert-base-cased
tok = AutoTokenizer.from_pretrained('bert-base-cased')
enc = tok('81s')

test_eq(enc.tokens(), ['[CLS]', '81', '##s', '[SEP]'])
test_eq(enc.word_ids(), [None, 0, 0, None])

In [26]:
# roberta-base
tok = AutoTokenizer.from_pretrained('roberta-base')
enc = tok('81s')

test_eq(enc.tokens(),  ['<s>', '81', 's', '</s>'])
test_eq(enc.word_ids(), [None, 0, 1, None])

✏️ Try it out! Create your own example text and see if you can understand which tokens are associated with which word ID, and also how to extract the character spans for a single word. For bonus points, try using two sentences as input and see if the sentence IDs make sense to you.

## Inside the token-classification pipeline

- The token-classification pipeline assigns a label to each token in the sentence.
- It then groups together the tokens belonging to the same entity.
- Token-classification pipeline: Tokenizer -> Model -> Postprocessing
- The model outputs logits, which need to be converted into probabilities.
- The label correspondence then lets us match each prediction to its label.
- Start and end character positions are found using the offset mappings.
- Finally, all the tokens corresponding to the same entity are grouped together.
- We group together consecutive tokens with the same label, unless the label is a B-XXX, which marks the start of a new entity.
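The first post-processing steps (logits to probabilities, then predictions to labels) can be sketched without loading a model. The `id2label` mapping is the one reported by `model.config.id2label` for the default checkpoint; the logits are made-up numbers chosen for illustration, not real model output.

```python
import math

# Label mapping of dbmdz/bert-large-cased-finetuned-conll03-english
id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}

def softmax(logits):
    """Turn one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three tokens, one row of 9 values per token
logits = [
    [8.0, 0.1, 0.2, 0.0, 0.5, 0.1, 0.3, 0.2, 0.1],
    [0.2, 0.1, 0.0, 1.0, 7.5, 0.3, 0.2, 0.1, 0.4],
    [0.1, 0.3, 0.2, 0.0, 0.4, 0.2, 0.5, 1.5, 6.0],
]

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    predictions.append({"entity": id2label[best], "score": probs[best]})

for p in predictions:
    print(p)
```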

## Getting the base results with the pipeline


In [41]:
from transformers import pipeline
from pprint import pprint

# The model used by default is dbmdz/bert-large-cased-finetuned-conll03-english; it performs NER on sentences
tok_cls = pipeline('token-classification')

pprint(tok_cls('My name is Manikandan and I work at Red Hat in Canada.'))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'end': 14,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.99883395,
  'start': 11,
  'word': 'Man'},
 {'end': 17,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.98901176,
  'start': 14,
  'word': '##ika'},
 {'end': 20,
  'entity': 'I-PER',
  'index': 6,
  'score': 0.99430144,
  'start': 17,
  'word': '##nda'},
 {'end': 21,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.9943057,
  'start': 20,
  'word': '##n'},
 {'end': 39,
  'entity': 'I-ORG',
  'index': 12,
  'score': 0.9986511,
  'start': 36,
  'word': 'Red'},
 {'end': 43,
  'entity': 'I-ORG',
  'index': 13,
  'score': 0.99593204,
  'start': 40,
  'word': 'Hat'},
 {'end': 53,
  'entity': 'I-LOC',
  'index': 15,
  'score': 0.9997646,
  'start': 47,
  'word': 'Canada'}]


In [39]:
from pprint import pprint

In [37]:
# let's ask pipeline to group together the tokens that correspond to the same entity

from transformers import pipeline

# The aggregation_strategy picked will change the scores computed for each grouped entity. 
# With "simple" the score is just the mean of the scores of each token in the given entity
# aggregation_strategy = ['first', 'simple', 'max', 'average']

tok_cls = pipeline('token-classification', aggregation_strategy='simple')
print(tok_cls('My name is Manikandan and I work at Red Hat in Canada.'))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity_group': 'PER', 'score': 0.9941132, 'word': 'Manikandan', 'start': 11, 'end': 21}, {'entity_group': 'ORG', 'score': 0.99729156, 'word': 'Red Hat', 'start': 36, 'end': 43}, {'entity_group': 'LOC', 'score': 0.9997646, 'word': 'Canada', 'start': 47, 'end': 53}]


In [40]:
pprint(tok_cls('My name is Manikandan and I work at Red Hat in Canada.'))

[{'end': 21,
  'entity_group': 'PER',
  'score': 0.9941132,
  'start': 11,
  'word': 'Manikandan'},
 {'end': 43,
  'entity_group': 'ORG',
  'score': 0.99729156,
  'start': 36,
  'word': 'Red Hat'},
 {'end': 53,
  'entity_group': 'LOC',
  'score': 0.9997646,
  'start': 47,
  'word': 'Canada'}]
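The claim that the "simple" strategy averages the token scores can be checked by hand against the two outputs shown so far. This sketch only copies the numbers printed above; the variable names are chosen for the example.

```python
from statistics import mean

# Per-token scores copied from the ungrouped pipeline output
token_scores = {
    "Manikandan": [0.99883395, 0.98901176, 0.99430144, 0.9943057],
    "Red Hat": [0.9986511, 0.99593204],
    "Canada": [0.9997646],
}

# Grouped scores copied from the aggregation_strategy='simple' output
reported = {"Manikandan": 0.9941132, "Red Hat": 0.99729156, "Canada": 0.9997646}

for word, scores in token_scores.items():
    print(f"{word}: mean of token scores = {mean(scores):.7f}, pipeline = {reported[word]:.7f}")
```

The means agree with the grouped scores up to float32 rounding.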


## From inputs to predictions

Let's try the same without using pipeline.

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

In [4]:
model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tok = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

In [10]:
#inspect(tok, methods=True)

In [12]:
ex = 'My name is Manikandan and I work at Red Hat in Canada.'
inputs = tok(ex, return_tensors='pt')

In [13]:
type(inputs)

transformers.tokenization_utils_base.BatchEncoding

In [14]:
inspect(inputs)

In [15]:
output = model(**inputs)

In [16]:
inspect(output)

In [18]:
rprint(inputs['input_ids'].shape)

In [20]:
# batch with 1 sequence of 18 tokens with 9 different labels
rprint(output.logits.shape)

In [21]:
import torch

In [22]:
inspect(torch.nn.functional.softmax)

In [30]:
proba = torch.nn.functional.softmax(output.logits, dim=-1)[0].tolist()
# we can take the argmax on the logits because the softmax does not change the order
preds = output.logits.argmax(dim=-1)[0].tolist()
rprint(preds)
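The comment in the cell above relies on softmax preserving the order of the logits. As a short derivation, for the logits $z_1, \dots, z_K$ of one token:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}},
\qquad
z_i > z_k \iff e^{z_i} > e^{z_k} \iff \mathrm{softmax}(z)_i > \mathrm{softmax}(z)_k
$$

The exponential is strictly increasing and both probabilities share the same positive denominator, so the largest logit always gets the largest probability and the argmax is unchanged.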

In [33]:
# The model.config.id2label attribute contains the mapping of indexes to labels
inspect(model.config, methods=True)

In [34]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

In [35]:
# Let's reproduce the results of the first pipeline

In [36]:
inspect(preds)

In [49]:
tokens = inputs.tokens()
results = []
for idx, pred in enumerate(preds):
  label = model.config.id2label[pred]
  if label != 'O':
    results.append({"entity": label, "score": proba[idx][pred], "word": tokens[idx]})
rprint(results)

To get the start and end of each entity in the original sentence, let's use the offset mapping.

In [40]:
inputs_with_offsets = tok(ex, return_offsets_mapping=True)
inputs_with_offsets['offset_mapping']

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 14),
 (14, 17),
 (17, 20),
 (20, 21),
 (22, 25),
 (26, 27),
 (28, 32),
 (33, 35),
 (36, 39),
 (40, 43),
 (44, 46),
 (47, 53),
 (53, 54),
 (0, 0)]

In [41]:
ex[11:14]

'Man'
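Each offset covers a single token, so the span of a whole word has to be stitched together from its pieces. Below is a minimal sketch of that idea using the offsets printed above. The token list is reconstructed from the pipeline output and the offsets, and `word_span` is a hypothetical helper that assumes WordPiece's `##` continuation prefix; with a fast tokenizer, `word_to_chars` does this for you.

```python
text = "My name is Manikandan and I work at Red Hat in Canada."

# Tokens reconstructed from the outputs above (cased checkpoint)
tokens = ["[CLS]", "My", "name", "is", "Man", "##ika", "##nda", "##n", "and", "I",
          "work", "at", "Red", "Hat", "in", "Canada", ".", "[SEP]"]
# Copied from inputs_with_offsets['offset_mapping']
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 14), (14, 17), (17, 20), (20, 21),
           (22, 25), (26, 27), (28, 32), (33, 35), (36, 39), (40, 43), (44, 46),
           (47, 53), (53, 54), (0, 0)]

def word_span(tokens, offsets, idx):
    """Character span of the whole word that contains token `idx`."""
    first = idx
    while tokens[first].startswith("##"):  # walk back to the first piece
        first -= 1
    last = idx
    while last + 1 < len(tokens) and tokens[last + 1].startswith("##"):
        last += 1  # walk forward over continuation pieces
    return offsets[first][0], offsets[last][1]

start, end = word_span(tokens, offsets, 5)  # token 5 is '##ika'
print((start, end), text[start:end])
```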

In [50]:
inputs_with_offsets = tok(ex, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets['offset_mapping']
results = []

for idx, pred in enumerate(preds):
  label = model.config.id2label[pred]
  if label != 'O':
    start, end = offsets[idx]
    results.append(
        {"entity": label, "score": proba[idx][pred], "word": tokens[idx],
         "start": start, "end": end})
rprint(results)

## Grouping entities

- Consecutive tokens labeled I-XXX should be grouped together into one entity.
- The first token of an entity can be labeled B-XXX or I-XXX; the prefix is dropped to get the entity type.

In [52]:
import numpy as np

results = []
inputs_with_offsets = tok(ex, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

In [59]:
preds, ex

([0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 0, 8, 0, 0],
 'My name is Manikandan and I work at Red Hat in Canada.')

In [60]:
predictions = preds
example = ex
probabilities = proba

In [61]:
idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix to get the entity type
    label = label[2:]
    start, end = offsets[idx]
    # The first token belongs to the entity whether it is B-label or I-label
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Grab all the following tokens labeled with I-label
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )
    # idx already points at the first token after the entity, so it is not incremented again

print(results)

[{'entity_group': 'PER', 'score': 0.9941132068634033, 'word': 'Manikandan', 'start': 11, 'end': 21}, {'entity_group': 'ORG', 'score': 0.9972915649414062, 'word': 'Red Hat', 'start': 36, 'end': 43}, {'entity_group': 'LOC', 'score': 0.999764621257782, 'word': 'Canada', 'start': 47, 'end': 53}]


In [62]:
rprint(results)
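The grouping rule stated earlier (group consecutive tokens of the same entity type, unless a B-XXX label starts a new entity) can also be written as a small standalone function over label strings. This is a sketch under that reading of the rule, not the pipeline's actual implementation; `group_entities` and the second example (`text2`, `labels2`, `offsets2`) are hypothetical.

```python
id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}

def group_entities(labels, offsets, text):
    """Merge consecutive tokens of one entity type; B-XXX always opens a new entity."""
    entities = []
    current = None  # entity currently being extended, if any
    for label, (start, end) in zip(labels, offsets):
        if label == "O":
            current = None
            continue
        prefix, entity_type = label.split("-", 1)
        if current is None or prefix == "B" or current["entity_group"] != entity_type:
            current = {"entity_group": entity_type, "start": start, "end": end}
            entities.append(current)
        else:
            current["end"] = end  # extend the running entity
    for entity in entities:
        entity["word"] = text[entity["start"]:entity["end"]]
    return entities

# Predictions and offsets copied from the cells above
text = "My name is Manikandan and I work at Red Hat in Canada."
preds = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 0, 8, 0, 0]
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 14), (14, 17), (17, 20), (20, 21),
           (22, 25), (26, 27), (28, 32), (33, 35), (36, 39), (40, 43), (44, 46),
           (47, 53), (53, 54), (0, 0)]
entities = group_entities([id2label[p] for p in preds], offsets, text)
print(entities)

# Hypothetical case: two adjacent people, separated only by a B-PER label
text2 = "Ann Lee Bob Ray"
labels2 = ["B-PER", "I-PER", "B-PER", "I-PER"]
offsets2 = [(0, 3), (4, 7), (8, 11), (12, 15)]
entities2 = group_entities(labels2, offsets2, text2)
print(entities2)
```

On the sentence used throughout, this yields the same three spans as the pipeline; the second case shows where the B- prefix matters, since without it the two names would merge into one entity.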