# THE HUGGINGFACE TOKENIZERS LIBRARY

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

# Training a new tokenizer from an old one

Training a tokenizer is not the same as training a model!
* Model training uses SGD to make the loss a little bit smaller for each batch.
* Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus. The exact rules used to pick them depend on the tokenization algorithm. The process is deterministic: the same algorithm run on the same corpus always gives the same result.
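
As a minimal illustration of the second point, assume a toy corpus and a made-up statistic (how often each pair of adjacent characters occurs). This is not the algorithm of any particular tokenizer, and `count_pairs` and `corpus` are hypothetical names. The sketch only shows that the quantities involved are plain counts, so repeating the computation on the same corpus gives the same answer:

```python
from collections import Counter


def count_pairs(words):
    """Count every pair of adjacent characters in a list of words."""
    counts = Counter()
    for word in words:
        for left, right in zip(word, word[1:]):
            counts[(left, right)] += 1
    return counts


corpus = ["low", "lower", "lowest"]  # toy corpus, for illustration only

first_run = count_pairs(corpus)
second_run = count_pairs(corpus)

print(first_run == second_run)  # the counts do not depend on any random state
print(first_run[("l", "o")])    # number of times "l" is followed by "o"
```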

## Assembling a corpus

Assume we want to train GPT-2 from scratch to write Python code.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python")

In [None]:
raw_datasets['train']

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

The dataset separates docstrings from code and suggests a tokenization of both. We will use the `whole_func_string` column to train our tokenizer.

In [None]:
# check an example
raw_datasets['train'][0]['whole_func_string']

'def __msgc_step3_discontinuity_localization(self):\n        """\n        Estimate discontinuity in basis of low resolution image segmentation.\n        :return: discontinuity in low resolution\n        """\n        import scipy\n\n        start = self._start_time\n        seg = 1 - self.segmentation.astype(np.int8)\n        self.stats["low level object voxels"] = np.sum(seg)\n        self.stats["low level image voxels"] = np.prod(seg.shape)\n        # in seg is now stored low resolution segmentation\n        # back to normal parameters\n        # step 2: discontinuity localization\n        # self.segparams = sparams_hi\n        seg_border = scipy.ndimage.filters.laplace(seg, mode="constant")\n        logger.debug("seg_border: %s", scipy.stats.describe(seg_border, axis=None))\n        # logger.debug(str(np.max(seg_border)))\n        # logger.debug(str(np.min(seg_border)))\n        seg_border[seg_border != 0] = 1\n        logger.debug("seg_border: %s", scipy.stats.describe(seg_border, a

The first step is to transform the dataset into an *iterator* of lists of texts. Using lists of texts lets the tokenizer train faster (on batches of texts instead of individual texts one by one), and using an iterator avoids having everything in memory at once.

In [None]:
# Do NOT try this if the dataset is large
training_corpus = [
    raw_datasets['train'][i: i+1000]['whole_func_string']
    for i in range(0, len(raw_datasets['train']), 1000)
]

This creates a list of lists of 1,000 texts each, but it loads everything into memory.

Using a Python generator, we can avoid loading anything into memory until it is actually necessary.

In [None]:
training_corpus = (
    raw_datasets['train'][i: i+1000]['whole_func_string']
    for i in range(0, len(raw_datasets['train']), 1000)
)

This does not fetch any elements of the dataset; it just creates an object we can use in a Python `for` loop. The texts will only be loaded when we need them, and only 1,000 texts at a time will be loaded. This way we will not exhaust all our memory even if we are processing a huge dataset.


The problem with a generator object is that it can only be used once. The second time we iterate over it, we get an empty list:

In [None]:
gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


That is why we define a function that returns a generator instead:

In [None]:
def get_training_corpus():
    return (
        raw_datasets['train'][i : i+1000]['whole_func_string']
        for i in range(0, len(raw_datasets['train']), 1000)
    )

training_corpus = get_training_corpus()

We can also define the generator inside a `for` loop by using the `yield` statement:

In [None]:
def get_training_corpus():
    dataset = raw_datasets['train']

    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples['whole_func_string']
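
The same pattern can be tried without downloading the dataset. The sketch below assumes a small in-memory list as a stand-in for the `whole_func_string` column; `make_batches` and `texts` are hypothetical names. Because every call returns a fresh generator, the corpus can be traversed more than once:

```python
def make_batches(texts, batch_size):
    """Return a new generator of batches (lists of texts) on every call."""
    return (
        texts[i : i + batch_size]
        for i in range(0, len(texts), batch_size)
    )


# Stand-in for the real dataset column, for illustration only
texts = [f"def f{i}(): pass" for i in range(5)]

first_pass = list(make_batches(texts, batch_size=2))
second_pass = list(make_batches(texts, batch_size=2))

print([len(batch) for batch in first_pass])  # sizes of the successive batches
print(first_pass == second_pass)             # a second traversal sees the same data
```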

## Training a new tokenizer

In [None]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [None]:
# test how the tokenizer behaves with our dataset
example = '''
def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b
'''

tokens = old_tokenizer.tokenize(example)
tokens

['Ċ',
 'def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb',
 'Ċ']

We will use the method `train_new_from_iterator()` to train a new tokenizer:

In [None]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)



The `train_new_from_iterator()` method only works if the tokenizer we are using is a "fast" tokenizer.

Most of the Transformer models have a fast tokenizer available, and the `AutoTokenizer` API always selects the fast tokenizer if it is available.

In [None]:
# retry the example
tokens = tokenizer.tokenize(example)
tokens

['Ċ',
 'def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb',
 'Ċ']

In [None]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

29
38


## Saving the tokenizer

In [None]:
tokenizer.save_pretrained('code-search-net-tokenizer')

This creates a new folder named *code-search-net-tokenizer*, which will contain all the files the tokenizer needs to be reloaded.

# Fast tokenizers' special powers

Slow tokenizers are those written in Python inside the Transformers library, while the fast versions are the ones provided by Tokenizers, which are written in Rust.

## Batch encoding

The output of a tokenizer is a special `BatchEncoding` object, which is a subclass of a dictionary, but with additional methods that are mostly used by fast tokenizers.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

example = "My name is Bin and I work at anywhere I want."
encoding = tokenizer(example)

In [None]:
type(encoding)

Check if our tokenizer is a fast or a slow one:

In [None]:
tokenizer.is_fast

True

In [None]:
encoding.is_fast

True

Using a fast tokenizer, we can access the tokens without having to convert the IDs back to tokens:

In [None]:
encoding.tokens()

['[CLS]',
 'My',
 'name',
 'is',
 'Bin',
 'and',
 'I',
 'work',
 'at',
 'anywhere',
 'I',
 'want',
 '.',
 '[SEP]']

We can use the `word_ids()` method to get the index of the word each token comes from:

In [None]:
encoding.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, None]

The special tokens `[CLS]` and `[SEP]` are mapped to `None`.

The `sequence_ids()` method is used to map a token to the sentence it came from (though in this case, the `token_type_ids` returned by the tokenizer can give us the same information).

Lastly, we can map any word or token to characters in the original text, and vice versa, via the `word_to_chars()` or `token_to_chars()` and `char_to_word()` or `char_to_token()` methods.

In [None]:
start, end = encoding.word_to_chars(3)
example[start: end]

'Bin'

## Inside the token-classification pipeline

Recall that NER identifies which parts of the text correspond to entities like persons, locations, or organizations.

### Getting the base results with the pipeline

The default model for `token-classification` is `dbmdz/bert-large-cased-finetuned-conll03-english` and it performs NER on sentences:

In [None]:
from transformers import pipeline

token_classifier = pipeline('token-classification')

In [None]:
token_classifier('My name is Bin and I work in Houston, Texas.')

[{'entity': 'I-PER',
  'score': 0.9989341,
  'index': 4,
  'word': 'Bin',
  'start': 11,
  'end': 14},
 {'entity': 'I-LOC',
  'score': 0.99905556,
  'index': 9,
  'word': 'Houston',
  'start': 29,
  'end': 36},
 {'entity': 'I-LOC',
  'score': 0.9996307,
  'index': 11,
  'word': 'Texas',
  'start': 38,
  'end': 43}]

We can also ask the pipeline to group together the tokens that correspond to the same entity:

In [None]:
token_classifier = pipeline('token-classification', aggregation_strategy='simple')

token_classifier('My name is Bin and I work in Houston, Texas.')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9989341,
  'word': 'Bin',
  'start': 11,
  'end': 14},
 {'entity_group': 'LOC',
  'score': 0.99905556,
  'word': 'Houston',
  'start': 29,
  'end': 36},
 {'entity_group': 'LOC',
  'score': 0.9996307,
  'word': 'Texas',
  'start': 38,
  'end': 43}]

The `aggregation_strategy` will change the scores computed for each grouped entity. With `"simple"` the score is the mean of the scores of each token in the given entity.

### From inputs to predictions

We can obtain these results without using the `pipeline()` function.

First we need to tokenize our input and pass it through the model.

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = 'dbmdz/bert-large-cased-finetuned-conll03-english'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)



In [None]:
example = "My name is Bin and I want to play for the Golden State Warriors but work in Houston, Texas."

inputs = tokenizer(example, return_tensors='pt')
outputs = model(**inputs)

Since we used `AutoModelForTokenClassification`, we get one set of logits for each token in the input sequence:

In [None]:
print(inputs['input_ids'].shape)
print(outputs.logits.shape)

torch.Size([1, 23])
torch.Size([1, 23, 9])


We have a batch with 1 sequence of 23 tokens and the model has 9 different labels, so the output of the model has a shape of 1x23x9.

In [None]:
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
predictions

[0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 6, 6, 6, 0, 0, 0, 8, 0, 8, 0, 0]

The `model.config.id2label` attribute contains the mapping of indexes to labels that we can use to make sense of the predictions:

In [None]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

`O` (index 0) is the label for the tokens that are not in any named entity (it stands for "outside"). The label `B-XXX` indicates the token is at the beginning of an entity `XXX` and the label `I-XXX` indicates the token is inside the entity `XXX`.
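
As a small sketch of how this mapping is applied, the snippet below copies the `predictions` list and the `id2label` dictionary from the outputs above into plain Python objects (no model is loaded) and lists the positions that carry an entity label:

```python
# Values copied from the outputs above
id2label = {
    0: "O",
    1: "B-MISC", 2: "I-MISC",
    3: "B-PER", 4: "I-PER",
    5: "B-ORG", 6: "I-ORG",
    7: "B-LOC", 8: "I-LOC",
}
predictions = [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 6, 6, 6, 0, 0, 0, 8, 0, 8, 0, 0]

# Translate every predicted index into its label
labels = [id2label[p] for p in predictions]

# Keep only the token positions that belong to an entity
entity_positions = [(i, label) for i, label in enumerate(labels) if label != "O"]
print(entity_positions)
```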

With this map, we can reproduce the results of the first pipeline:

In [None]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]

    if label != 'O':
        results.append(
            {'entity': label,
             'score': probabilities[idx][pred],
             'word': tokens[idx]}
        )

results

[{'entity': 'I-PER', 'score': 0.9991528987884521, 'word': 'Bin'},
 {'entity': 'I-ORG', 'score': 0.9997157454490662, 'word': 'Golden'},
 {'entity': 'I-ORG', 'score': 0.9996688365936279, 'word': 'State'},
 {'entity': 'I-ORG', 'score': 0.9996359348297119, 'word': 'Warriors'},
 {'entity': 'I-LOC', 'score': 0.9990701079368591, 'word': 'Houston'},
 {'entity': 'I-LOC', 'score': 0.9996472597122192, 'word': 'Texas'}]

The previous pipeline also gave us information about the `start` and `end` of each entity in the original sentence. To get the offsets, we have to set `return_offsets_mapping=True` when we apply the tokenizer to our inputs:

In [None]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)

inputs_with_offsets['offset_mapping']

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 14),
 (15, 18),
 (19, 20),
 (21, 25),
 (26, 28),
 (29, 33),
 (34, 37),
 (38, 41),
 (42, 48),
 (49, 54),
 (55, 63),
 (64, 67),
 (68, 72),
 (73, 75),
 (76, 83),
 (83, 84),
 (85, 90),
 (90, 91),
 (0, 0)]

Each tuple is the span of text corresponding to each token, where `(0, 0)` is reserved for the special tokens.

In [None]:
example[11:14]

'Bin'

Using this, we can complete the previous results:

In [None]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets['offset_mapping']


for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]

    if label != 'O':
        start, end = offsets[idx]

        results.append(
            {'entity': label,
             'score': probabilities[idx][pred],
             'word': tokens[idx],
             'start': start,
             'end': end,}
        )

results

[{'entity': 'I-PER',
  'score': 0.9991528987884521,
  'word': 'Bin',
  'start': 11,
  'end': 14},
 {'entity': 'I-ORG',
  'score': 0.9997157454490662,
  'word': 'Golden',
  'start': 42,
  'end': 48},
 {'entity': 'I-ORG',
  'score': 0.9996688365936279,
  'word': 'State',
  'start': 49,
  'end': 54},
 {'entity': 'I-ORG',
  'score': 0.9996359348297119,
  'word': 'Warriors',
  'start': 55,
  'end': 63},
 {'entity': 'I-LOC',
  'score': 0.9990701079368591,
  'word': 'Houston',
  'start': 76,
  'end': 83},
 {'entity': 'I-LOC',
  'score': 0.9996472597122192,
  'word': 'Texas',
  'start': 85,
  'end': 90}]

### Grouping entities

With the offsets, we can take the span in the original text that begins with the first token and ends with the last token.

To post-process the predictions while grouping entities, we will group together entities that are consecutive and labeled with `I-XXX`, except for the first one, which can be labeled as `B-XXX` or `I-XXX`.

In [None]:
example[42:63]

'Golden State Warriors'

In [None]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets['offset_mapping']

In [None]:
idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]

    if label != 'O':
        # Remove the B- or I- prefix
        label = label[2:]
        start, end = offsets[idx]

        # the first token may be labeled B-label or I-label, so always keep it
        all_scores = [probabilities[idx][pred]]
        idx += 1

        # grab all the following tokens labeled with I-label
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][predictions[idx]])
            _, end = offsets[idx]
            idx += 1

        # the score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]

        results.append(
            {'entity_group': label,
             'score': score,
             'word': word,
             'start': start,
             'end': end,}
        )
    else:
        # only advance here: the entity branch already moved past its tokens
        idx += 1

results

[{'entity_group': 'PER',
  'score': 0.9991528987884521,
  'word': 'Bin',
  'start': 11,
  'end': 14},
 {'entity_group': 'ORG',
  'score': 0.9996735056241354,
  'word': 'Golden State Warriors',
  'start': 42,
  'end': 63},
 {'entity_group': 'LOC',
  'score': 0.9990701079368591,
  'word': 'Houston',
  'start': 76,
  'end': 83},
 {'entity_group': 'LOC',
  'score': 0.9996472597122192,
  'word': 'Texas',
  'start': 85,
  'end': 90}]
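
All entity tokens in this example are labeled `I-XXX`. As a sketch of how the grouping rule behaves when an entity starts with `B-XXX`, the function below applies the same rule to plain Python lists, without a model. The text, labels, and offsets are made up for illustration (the three scores are the `ORG` token scores from above), and `group_entities` is a hypothetical helper, not part of Transformers:

```python
from statistics import mean


def group_entities(labels, scores, offsets, text):
    """Merge a B-/I- token followed by consecutive I- tokens of the same type."""
    groups = []
    idx = 0
    while idx < len(labels):
        if labels[idx] == "O":
            idx += 1
            continue

        entity = labels[idx][2:]       # drop the "B-" or "I-" prefix
        start, end = offsets[idx]
        token_scores = [scores[idx]]   # the first token always belongs to the group
        idx += 1

        # Extend the group while the following tokens are I-<entity>
        while idx < len(labels) and labels[idx] == f"I-{entity}":
            token_scores.append(scores[idx])
            end = offsets[idx][1]
            idx += 1

        groups.append({
            "entity_group": entity,
            "score": mean(token_scores),
            "word": text[start:end],
            "start": start,
            "end": end,
        })
    return groups


text = "the Golden State Warriors"
labels = ["O", "B-ORG", "I-ORG", "I-ORG"]          # hypothetical labels
scores = [0.99, 0.9997157454490662, 0.9996688365936279, 0.9996359348297119]
offsets = [(0, 3), (4, 10), (11, 16), (17, 25)]

groups = group_entities(labels, scores, offsets, text)
print(groups)
```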

# Fast Tokenizers in the QA pipeline

We can use the `question-answering` pipeline to extract the answer to a question from a given context.

## Using the question-answering pipeline

In [None]:
from transformers import pipeline

question_answerer = pipeline('question-answering')

In [4]:
context = """
The HuggingFace Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question = 'Which deep learning libraries back HuggingFace Transformers?'

question_answerer(question=question, context=context)

{'score': 0.9851435422897339,
 'start': 92,
 'end': 120,
 'answer': 'Jax, PyTorch, and TensorFlow'}

This pipeline can deal with very long contexts and will return the answer to the question even if it's at the end:

In [5]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question_answerer(question=question, context=long_context)

{'score': 0.9794967174530029,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

## Using a model for question answering

We start by tokenizing our input and then send it through the model. The default checkpoint for the `question-answering` pipeline is `distilbert-base-cased-distilled-squad`:

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = 'distilbert-base-cased-distilled-squad'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [7]:
inputs = tokenizer(question, context, return_tensors='pt')
outputs = model(**inputs)

We tokenize the question and the context as a pair, with the question first.

The model has been trained to predict the index of the token starting the answer and the index of the token where the answer ends. This is why those models don't return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer.

In [8]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits

print(start_logits.shape)
print(end_logits.shape)

torch.Size([1, 74])
torch.Size([1, 74])


We need to make sure we mask the indices that are not part of the context before converting logits into probabilities.

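Our input is `[CLS] question [SEP] context [SEP]`, so we need to mask the tokens of the question as well as the `[SEP]` tokens. We keep the `[CLS]` token unmasked, because some models use it to indicate that the answer is not in the context.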

Since we will apply a softmax afterward, we need to replace the logits we want to mask with a large negative number:

In [11]:
import torch

sequence_ids = inputs.sequence_ids()
# mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]

# unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

mask

tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False,  True]])

In [12]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]
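
To see why a large negative logit works as a mask, here is a minimal sketch in plain Python with made-up logits (`softmax`, `logits`, and `mask` are hypothetical names, unrelated to the tensors above). After the softmax, the masked position receives a probability of zero and the remaining positions share the full probability mass:

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 3.0, 0.5]        # made-up logits, for illustration only
mask = [False, True, False, False]   # True marks a position to exclude

# Replace masked logits with a large negative number before the softmax
masked_logits = [-10000.0 if m else x for x, m in zip(logits, mask)]
probs = softmax(masked_logits)

print(probs)
```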

We do NOT take the argmax of the start and end probabilities because we may end up with a start index that is greater than the end index.

We will compute the probability of each possible pair `(start_index, end_index)` where `start_index <= end_index`, then take the pair with the highest probability.

Assuming the events "The answer starts at `start_index`" and "The answer ends at `end_index`" to be independent, the probability that the answer starts at `start_index` and ends at `end_index` is:
\begin{equation}
\text{start_probabilities[start_index]}\times \text{end_probabilities[end_index]}
\end{equation}
To compute all the scores, we need to compute all the products where `start_index <= end_index`:

In [13]:
scores = start_probabilities[:, None] * end_probabilities[None, :]
scores.shape

torch.Size([74, 74])

Then we mask values where `start_index > end_index` by setting them to 0. The `torch.triu()` function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:

In [14]:
scores = torch.triu(scores)

Now we have to get the index of the maximum.

In [15]:
max_index = scores.argmax().item()

start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

tensor(0.9851, grad_fn=<SelectBackward0>)
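
The span search can be checked by hand on a tiny example. The sketch below uses plain Python lists and made-up probabilities (all names are hypothetical). It builds the table of products, zeroes the entries where the start comes after the end, and recovers the row and column from the flat index of the maximum, as the integer division and modulo do above:

```python
start_probs = [0.1, 0.6, 0.3]   # made-up start probabilities
end_probs = [0.5, 0.1, 0.4]     # made-up end probabilities

# scores[i][j] = P(start = i) * P(end = j), kept only when i <= j
scores = [
    [s * e if i <= j else 0.0 for j, e in enumerate(end_probs)]
    for i, s in enumerate(start_probs)
]

# Flatten row by row, find the maximum, then map back to (row, column)
flat = [value for row in scores for value in row]
max_index = max(range(len(flat)), key=flat.__getitem__)
start_index, end_index = divmod(max_index, len(end_probs))

print(start_index, end_index, scores[start_index][end_index])
```

Taking the two argmaxes independently would give start 1 and end 0 here, which is not a valid span; the masked table selects start 1 and end 2 instead.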


Once we have the `start_index` and `end_index` of the answer in terms of tokens, we need to convert to the character indices in the context.

In [16]:
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets['offset_mapping']

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]

answer = context[start_char : end_char]

In [18]:
result = {
    'answer': answer,
    'start': start_char,
    'end': end_char,
    'score': scores[start_index, end_index].item()
}

print(result)

{'answer': 'Jax, PyTorch, and TensorFlow', 'start': 92, 'end': 120, 'score': 0.9851435422897339}


## Handling long contexts

If we try to tokenize the question and long context we used as an example previously, we will get a number of tokens higher than the maximum length used in the `question-answering` pipeline (which is 384):

In [19]:
inputs = tokenizer(question, long_context)
print(len(inputs['input_ids']))

464


So we need to truncate our inputs at that maximum length. We can use the `only_second` truncation strategy to truncate the context, but the answer to the question may not be in the truncated context:

In [20]:
inputs = tokenizer(question, long_context, max_length=384, truncation='only_second')
print(tokenizer.decode(inputs['input_ids']))

[CLS] Which deep learning libraries back HuggingFace Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to learn. - A uni

To fix this, the `question-answering` pipeline allows us to split the context into smaller chunks, specifying the maximum length.

We can have the tokenizer do this for us by adding `return_overflowing_tokens=True`, and we can specify the overlap we want with the `stride` argument:

In [21]:
sentence = "This sentence is not too long but we are going to split it anyway."

inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs['input_ids']:
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]


In [22]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])


The `overflow_to_sample_mapping` is a map that tells us which sentence each of the results corresponds to.

In [23]:
print(inputs['overflow_to_sample_mapping'])

[0, 0, 0, 0, 0, 0, 0]


This is more useful when we tokenize several sentences together:

In [24]:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]

inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs['overflow_to_sample_mapping'])

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


This means that the first sentence is split into 7 chunks and the second sentence into the next 4 chunks.
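
The windowing itself is easy to reproduce. The sketch below is not the tokenizer's implementation: it assumes whitespace-separated words as stand-ins for tokens, and `split_with_stride` is a hypothetical helper. With `max_length=6`, two slots go to `[CLS]` and `[SEP]`, which leaves 4 tokens per chunk, and `stride=2` means consecutive chunks share 2 tokens:

```python
def split_with_stride(tokens, max_tokens, stride):
    """Split tokens into windows of at most max_tokens that overlap by stride."""
    if not 0 <= stride < max_tokens:
        raise ValueError("stride must be non-negative and smaller than max_tokens")

    step = max_tokens - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Whitespace-separated stand-ins for the tokens of the two sentences
first = "This sentence is not too long but we are going to split it anyway .".split()
second = "This sentence is shorter but will still get split .".split()

first_chunks = split_with_stride(first, max_tokens=4, stride=2)
second_chunks = split_with_stride(second, max_tokens=4, stride=2)

for chunk in first_chunks:
    print(" ".join(chunk))
print(len(first_chunks), len(second_chunks))
```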

By default the `question-answering` pipeline uses a maximum length of 384, and a stride of 128. We will use those parameters when tokenizing. We will also add padding as well as ask for the offsets:

In [25]:
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding='longest',
    truncation='only_second',
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

The `inputs` will contain the input IDs and attention masks the model expects, as well as the offsets and the `overflow_to_sample_mapping`. Since those two are not parameters used by the model, we will pop them out of the inputs before converting the rest to tensors:

In [26]:
_ = inputs.pop('overflow_to_sample_mapping')
offsets = inputs.pop('offset_mapping')

inputs = inputs.convert_to_tensors('pt')
print(inputs['input_ids'].shape)

torch.Size([2, 384])


Our long context was split into two, which means that after it goes through our model, we will have two sets of start and end logits:

In [27]:
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits

print(start_logits.shape, end_logits.shape)

torch.Size([2, 384]) torch.Size([2, 384])


Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

In [28]:
sequence_ids = inputs.sequence_ids()
# mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# unmask the [CLS] token
mask[0] = False
# mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs['attention_mask']==0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the softmax to convert our logits to probabilities:

In [29]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

Lastly, we assign a score to every possible answer span in each chunk, then take the span with the best score:

In [30]:
candidates = []

for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]

    score = scores[start_idx, end_idx].item()

    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 0, 0.8773406744003296), (179, 190, 0.9794965982437134)]


Those two candidates correspond to the best answers the model was able to find in each chunk. The model is much more confident that the right answer is in the second chunk.

Now we have to map those two token spans to spans of characters in the context.

The `offsets` we grabbed earlier is actually a list of offsets, with one list per chunk of text:

In [31]:
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]

    answer = long_context[start_char:end_char]

    result = {
        'answer': answer,
        'start': start_char,
        'end': end_char,
        'score': score
    }
    print(result)

{'answer': '', 'start': 0, 'end': 0, 'score': 0.8773406744003296}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.9794965982437134}
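
The first candidate is the span `(0, 0)`, that is, the `[CLS]` token, which yields an empty answer. One possible way to pick a final answer, sketched here with the two results copied from the output above, is to drop the empty answers and keep the highest-scoring remaining one. The filtering rule is an assumption made for this illustration, not necessarily what the pipeline does:

```python
# Results copied from the output above
results = [
    {"answer": "", "start": 0, "end": 0, "score": 0.8773406744003296},
    {"answer": "Jax, PyTorch and TensorFlow", "start": 1892, "end": 1919,
     "score": 0.9794965982437134},
]

# Assumption: an empty answer means "no answer found in this chunk"
non_empty = [r for r in results if r["answer"]]
best = max(non_empty, key=lambda r: r["score"])

print(best["answer"])
```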


# Normalization and pre-tokenization

Before splitting a text into subtokens (according to its model), the tokenizer performs two steps: *normalization* and *pre-tokenization*.

## Normalization

The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents.

A Transformers `tokenizer` has an attribute called `backend_tokenizer` that provides access to the underlying tokenizer from the Tokenizers library:

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(type(tokenizer.backend_tokenizer))

<class 'tokenizers.Tokenizer'>


The `normalizer` attribute has a `normalize_str()` method that we can use to see how the normalization is performed:

In [4]:
tokenizer.backend_tokenizer.normalizer.normalize_str('Héllò hôw are ü?')

'hello how are u?'

## Pre-tokenization

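A tokenizer cannot be trained on raw text directly: the text first has to be split into smaller units such as words, and that is the job of the pre-tokenization step.

We can use the `pre_tokenize_str()` method of the `pre_tokenizer` attribute to see how a fast tokenizer does this: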

In [5]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str('Hello, how are  you?')

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('how', (7, 10)),
 ('are', (11, 14)),
 ('you', (16, 19)),
 ('?', (19, 20))]

Other tokenizers can have different rules for this step:

In [7]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str('Hello, how are  you?')

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġhow', (6, 10)),
 ('Ġare', (10, 14)),
 ('Ġ', (14, 15)),
 ('Ġyou', (15, 19)),
 ('?', (19, 20))]
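
The `Ġ` symbol comes from GPT-2's byte-level representation: every byte is mapped to a printable character, and the space byte ends up as `Ġ`. The sketch below rebuilds that table in the way we understand the original GPT-2 code to do it. Treat it as an illustration rather than the library's exact implementation; the function name is reused from that code, and `byte_map` and `encoded` are names chosen for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already print nicely keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (space, control characters, ...) are moved past 255.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


byte_map = bytes_to_unicode()
encoded = "".join(byte_map[b] for b in " how".encode("utf-8"))
print(encoded)
```

Because the space is byte 32 and is the 33rd byte without a printable form, it lands on code point 256 + 32 = 288, which is `Ġ`.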

We can also check the T5 tokenizer, which is based on the SentencePiece algorithm:

In [9]:
tokenizer = AutoTokenizer.from_pretrained('t5-small')
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str('Hello, how are  you?')

[('▁Hello,', (0, 6)),
 ('▁how', (7, 10)),
 ('▁are', (11, 14)),
 ('▁you?', (16, 20))]

# Byte-Pair Encoding tokenization

Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model.

## Implementing BPE

In [23]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

We need to pre-tokenize that corpus into words. Since we are replicating a BPE tokenizer, we will use the `gpt2` tokenizer for the pre-tokenization:

In [24]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')



Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:

In [25]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]

    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})


Then we compute the base vocabulary, formed by all the characters used in the corpus:

In [26]:
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)

alphabet.sort()
print(alphabet)

[',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


We also add the special tokens used by the model at the beginning of that vocabulary. In the case of GPT-2, the only special token is `"<|endoftext|>"`:

In [27]:
vocab = ['<|endoftext|>'] + alphabet.copy()

We now need to split each word into individual characters, to be able to start training:

In [28]:
splits = {
    word: [c for c in word] for word in word_freqs.keys()
}
splits

{'This': ['T', 'h', 'i', 's'],
 'Ġis': ['Ġ', 'i', 's'],
 'Ġthe': ['Ġ', 't', 'h', 'e'],
 'ĠHugging': ['Ġ', 'H', 'u', 'g', 'g', 'i', 'n', 'g'],
 'ĠFace': ['Ġ', 'F', 'a', 'c', 'e'],
 'ĠCourse': ['Ġ', 'C', 'o', 'u', 'r', 's', 'e'],
 '.': ['.'],
 'Ġchapter': ['Ġ', 'c', 'h', 'a', 'p', 't', 'e', 'r'],
 'Ġabout': ['Ġ', 'a', 'b', 'o', 'u', 't'],
 'Ġtokenization': ['Ġ',
  't',
  'o',
  'k',
  'e',
  'n',
  'i',
  'z',
  'a',
  't',
  'i',
  'o',
  'n'],
 'Ġsection': ['Ġ', 's', 'e', 'c', 't', 'i', 'o', 'n'],
 'Ġshows': ['Ġ', 's', 'h', 'o', 'w', 's'],
 'Ġseveral': ['Ġ', 's', 'e', 'v', 'e', 'r', 'a', 'l'],
 'Ġtokenizer': ['Ġ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r'],
 'Ġalgorithms': ['Ġ', 'a', 'l', 'g', 'o', 'r', 'i', 't', 'h', 'm', 's'],
 'Hopefully': ['H', 'o', 'p', 'e', 'f', 'u', 'l', 'l', 'y'],
 ',': [','],
 'Ġyou': ['Ġ', 'y', 'o', 'u'],
 'Ġwill': ['Ġ', 'w', 'i', 'l', 'l'],
 'Ġbe': ['Ġ', 'b', 'e'],
 'Ġable': ['Ġ', 'a', 'b', 'l', 'e'],
 'Ġto': ['Ġ', 't', 'o'],
 'Ġunderstand': ['Ġ', 'u', 'n'

Now we are ready for training. We need to write a function that computes the frequency of each pair. We will use it at each step of the training:

In [29]:
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)

    for word, freq in word_freqs.items():
        split = splits[word]

        if len(split) == 1:
            continue

        for i in range(len(split)-1):
            pair = (split[i], split[i+1])
            pair_freqs[pair] += freq

    return pair_freqs

In [30]:
pair_freqs = compute_pair_freqs(splits)

for i,key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break

('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 2
('Ġ', 't'): 7
('t', 'h'): 3


Now, finding the most frequent pair only takes a quick loop:

In [31]:
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)

('Ġ', 't') 7
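
As a side note, the same search can be written with the built-in `max()`. This is only a sketch on a hand-copied subset of the pair counts printed earlier; the `_sample` names are ours.

```python
pair_freqs_sample = {
    ("T", "h"): 3,
    ("h", "i"): 3,
    ("i", "s"): 5,
    ("Ġ", "i"): 2,
    ("Ġ", "t"): 7,
    ("t", "h"): 3,
}

# max() iterates over the keys and compares them by their count.
best_pair_sample = max(pair_freqs_sample, key=pair_freqs_sample.get)
print(best_pair_sample, pair_freqs_sample[best_pair_sample])
```

When several pairs share the highest count, `max()` returns the first one it meets, which matches the explicit loop with its strict `<` comparison.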


The first merge to learn:

In [32]:
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")

To continue, we need to apply that merge in our `splits` dictionary.

In [33]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i+1] == b:
                split = split[:i] + [a + b] + split[i+2:]
            else:
                i += 1

        splits[word] = split

    return splits

Now we can use this function to apply the first merge and look at the result:

In [34]:
splits = merge_pair('Ġ', 't', splits)
print(splits['Ġtrained'])

['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']


Now we need to loop until we have learned all the merges we want. Let's aim for a vocab size of 50:

In [35]:
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None

    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq

    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])

In [36]:
merges

{('Ġ', 't'): 'Ġt',
 ('i', 's'): 'is',
 ('e', 'r'): 'er',
 ('Ġ', 'a'): 'Ġa',
 ('Ġt', 'o'): 'Ġto',
 ('e', 'n'): 'en',
 ('T', 'h'): 'Th',
 ('Th', 'is'): 'This',
 ('o', 'u'): 'ou',
 ('s', 'e'): 'se',
 ('Ġto', 'k'): 'Ġtok',
 ('Ġtok', 'en'): 'Ġtoken',
 ('n', 'd'): 'nd',
 ('Ġ', 'is'): 'Ġis',
 ('Ġt', 'h'): 'Ġth',
 ('Ġth', 'e'): 'Ġthe',
 ('i', 'n'): 'in',
 ('Ġa', 'b'): 'Ġab',
 ('Ġtoken', 'i'): 'Ġtokeni'}

We have learned 19 merge rules (the initial vocabulary has a size of 31: 30 characters in the alphabet, plus a special token).

In [37]:
vocab

['<|endoftext|>',
 ',',
 '.',
 'C',
 'F',
 'H',
 'T',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'y',
 'z',
 'Ġ',
 'Ġt',
 'is',
 'er',
 'Ġa',
 'Ġto',
 'en',
 'Th',
 'This',
 'ou',
 'se',
 'Ġtok',
 'Ġtoken',
 'nd',
 'Ġis',
 'Ġth',
 'Ġthe',
 'in',
 'Ġab',
 'Ġtokeni']

The vocabulary is composed of the special token, the initial alphabet, and all the results of the merges.

To tokenize a new text, we pre-tokenize it, split it, then apply all the merge rules learned:

In [38]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]

    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i=0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i+1] == pair[1]:
                    split = split[:i] + [merge] + split[i+2:]
                else:
                    i += 1
            splits[idx] = split

    return sum(splits, [])

We can try on any text composed of characters in the alphabet:

In [39]:
tokenize("This is not a token.")

['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']

This is the BPE algorithm.
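
For reference, the whole training loop can be condensed into one function. The sketch below is a compact restatement of the steps in this section, run on a small made-up word-frequency table so that it needs no pretrained tokenizer; `train_bpe` and the `toy_` names are chosen for this example.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: count} mapping."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        if not pair_freqs:
            break  # every word is already a single symbol
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        for word in word_freqs:
            symbols, merged, i = splits[word], [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits


toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
toy_merges, toy_splits = train_bpe(toy_word_freqs, num_merges=3)
print(toy_merges)
print(toy_splits)
```

On this toy table the first merge is `('u', 'g')` (count 20), then `('u', 'n')` (count 16), then `('h', 'ug')` (count 15).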

# WordPiece tokenization

WordPiece is the tokenization algorithm Google developed to pretrain BERT.

In [40]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

First, we need to pre-tokenize the corpus into words. We will use the `bert-base-cased` tokenizer for the pre-tokenization:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:

In [42]:
word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]

    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(int,
            {'This': 3,
             'is': 2,
             'the': 1,
             'Hugging': 1,
             'Face': 1,
             'Course': 1,
             '.': 4,
             'chapter': 1,
             'about': 1,
             'tokenization': 1,
             'section': 1,
             'shows': 1,
             'several': 1,
             'tokenizer': 1,
             'algorithms': 1,
             'Hopefully': 1,
             ',': 1,
             'you': 1,
             'will': 1,
             'be': 1,
             'able': 1,
             'to': 1,
             'understand': 1,
             'how': 1,
             'they': 1,
             'are': 1,
             'trained': 1,
             'and': 1,
             'generate': 1,
             'tokens': 1})

The alphabet here is the unique set composed of all the first letters of words, and all the other letters that appear in words prefixed by `##`:

In [43]:
alphabet = []

for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])

    for letter in word[1:]:
        if f'##{letter}' not in alphabet:
            alphabet.append(f'##{letter}')

alphabet.sort()
print(alphabet)

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y']


We also add the special tokens. In the case of BERT, it's the list `["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]`:

In [44]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

Next, we need to split each word, with all the letters that are not the first prefixed by `##`:

In [45]:
splits = {
    word: [c if i == 0 else f'##{c}' for i,c in enumerate(word)]
    for word in word_freqs.keys()
}
splits

{'This': ['T', '##h', '##i', '##s'],
 'is': ['i', '##s'],
 'the': ['t', '##h', '##e'],
 'Hugging': ['H', '##u', '##g', '##g', '##i', '##n', '##g'],
 'Face': ['F', '##a', '##c', '##e'],
 'Course': ['C', '##o', '##u', '##r', '##s', '##e'],
 '.': ['.'],
 'chapter': ['c', '##h', '##a', '##p', '##t', '##e', '##r'],
 'about': ['a', '##b', '##o', '##u', '##t'],
 'tokenization': ['t',
  '##o',
  '##k',
  '##e',
  '##n',
  '##i',
  '##z',
  '##a',
  '##t',
  '##i',
  '##o',
  '##n'],
 'section': ['s', '##e', '##c', '##t', '##i', '##o', '##n'],
 'shows': ['s', '##h', '##o', '##w', '##s'],
 'several': ['s', '##e', '##v', '##e', '##r', '##a', '##l'],
 'tokenizer': ['t', '##o', '##k', '##e', '##n', '##i', '##z', '##e', '##r'],
 'algorithms': ['a',
  '##l',
  '##g',
  '##o',
  '##r',
  '##i',
  '##t',
  '##h',
  '##m',
  '##s'],
 'Hopefully': ['H', '##o', '##p', '##e', '##f', '##u', '##l', '##l', '##y'],
 ',': [','],
 'you': ['y', '##o', '##u'],
 'will': ['w', '##i', '##l', '##l'],
 'be': ['b', '##e

Now we are ready for training. We need a function that computes the score of each pair:

In [46]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)

    for word, freq in word_freqs.items():
        split = splits[word]

        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue

        for i in range(len(split) - 1):
            pair = (split[i], split[i+1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq

        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }

    return scores

In [48]:
pair_scores = compute_pair_scores(splits)

for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904
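
The score computed by `compute_pair_scores()` is the pair frequency divided by the product of the frequencies of its two parts. For instance, `('T', '##h')` appears 3 times, `'T'` 3 times and `'##h'` 8 times, giving 3 / (3 × 8) = 0.125. The sketch below applies the same formula to a small made-up word-frequency table; the `toy_` names and `wordpiece_scores` are ours.

```python
from collections import defaultdict

toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
toy_splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in toy_word_freqs
}


def wordpiece_scores(word_freqs, splits):
    """score(a, b) = freq(a, b) / (freq(a) * freq(b))."""
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        pieces = splits[word]
        for piece in pieces:
            token_freqs[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_freqs[(a, b)] += freq
    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }


toy_scores = wordpiece_scores(toy_word_freqs, toy_splits)
toy_best = max(toy_scores, key=toy_scores.get)
print(toy_best, toy_scores[toy_best])
```

Here `('##u', '##g')` is the most frequent pair, yet `('##g', '##s')` wins because `'##s'` is rare: dividing by the part frequencies favors pairs whose parts seldom occur elsewhere.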


Now, find the pair with the best score:

In [49]:
best_pair = ""
max_score = None

for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('a', '##b') 0.2


So the first merge to learn is `('a', '##b') -> 'ab'`, and we add `'ab'` to the vocabulary:

In [50]:
vocab.append('ab')

Now we need to apply that merge in our `splits` dictionary.

In [51]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i+1] == b:
                merge = a + b[2:] if b.startswith('##') else a + b
                split = split[:i] + [merge] + split[i+2:]

            else:
                i += 1

        splits[word] = split

    return splits

In [52]:
# for the first merge
splits = merge_pair('a', '##b', splits)
splits['about']

['ab', '##o', '##u', '##t']

Now we have everything. Let's aim for a vocab size of 70:

In [53]:
vocab_size = 70

while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair = ""
    max_score = None

    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score

    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith('##') else best_pair[0] + best_pair[1]
    )

    vocab.append(new_token)

In [54]:
vocab

['[PAD]',
 '[UNK]',
 '[CLS]',
 '[SEP]',
 '[MASK]',
 '##a',
 '##b',
 '##c',
 '##d',
 '##e',
 '##f',
 '##g',
 '##h',
 '##i',
 '##k',
 '##l',
 '##m',
 '##n',
 '##o',
 '##p',
 '##r',
 '##s',
 '##t',
 '##u',
 '##v',
 '##w',
 '##y',
 '##z',
 ',',
 '.',
 'C',
 'F',
 'H',
 'T',
 'a',
 'b',
 'c',
 'g',
 'h',
 'i',
 's',
 't',
 'u',
 'w',
 'y',
 'ab',
 '##fu',
 'Fa',
 'Fac',
 '##ct',
 '##ful',
 '##full',
 '##fully',
 'Th',
 'ch',
 '##hm',
 'cha',
 'chap',
 'chapt',
 '##thm',
 'Hu',
 'Hug',
 'Hugg',
 'sh',
 'th',
 'is',
 '##thms',
 '##za',
 '##zat',
 '##ut']

Compared to BPE, this tokenizer learns parts of words as tokens a bit faster.

To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word. That is, we look for the biggest subword starting at the beginning of the first word and split it, then we repeat the process on the second part, and so on for the rest of that word and the following words in the text:

In [55]:
def encode_word(word):
    tokens = []

    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1

        if i == 0:
            return ['[UNK]']

        tokens.append(word[:i])
        word = word[i:]

        if len(word) > 0:
            word = f'##{word}'

    return tokens

In [56]:
encode_word('Hugging')

['Hugg', '##i', '##n', '##g']

In [58]:
encode_word('HOgging')

['[UNK]']

In [59]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]

    return sum(encoded_words, [])

In [60]:
tokenize("This is not a token.")

['Th', '##i', '##s', 'is', '[UNK]', 'a', 't', '##o', '##k', '##e', '##n', '.']

# Unigram tokenization

The Unigram algorithm is often used in SentencePiece, which is the tokenization algorithm used by models like ALBERT, T5, mBART, Big Bird, and XLNet.

## Implementing Unigram

In [2]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')

Like for BPE and WordPiece, we begin by counting the number of occurrences of each word in the corpus:

In [5]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]

    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(int,
            {'▁This': 3,
             '▁is': 2,
             '▁the': 1,
             '▁Hugging': 1,
             '▁Face': 1,
             '▁Course.': 1,
             '▁chapter': 1,
             '▁about': 1,
             '▁tokenization.': 1,
             '▁section': 1,
             '▁shows': 1,
             '▁several': 1,
             '▁tokenizer': 1,
             '▁algorithms.': 1,
             '▁Hopefully,': 1,
             '▁you': 1,
             '▁will': 1,
             '▁be': 1,
             '▁able': 1,
             '▁to': 1,
             '▁understand': 1,
             '▁how': 1,
             '▁they': 1,
             '▁are': 1,
             '▁trained': 1,
             '▁and': 1,
             '▁generate': 1,
             '▁tokens.': 1})

We need to initialize the vocabulary to something larger than the vocab size we want at the end. We have to include all the basic characters, but for the bigger substrings we will only keep the most common ones, so we sort them by frequency:

In [7]:
char_freqs = defaultdict(int)
subwords_freqs = defaultdict(int)

for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq

        # loop through the subwords of length at least 2
        for j in range(i+2, len(word)+1):
            subwords_freqs[word[i:j]] += freq

# sort subwords by frequency
sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
sorted_subwords[:10]

[('▁t', 7),
 ('is', 5),
 ('er', 5),
 ('▁a', 5),
 ('▁to', 4),
 ('to', 4),
 ('en', 4),
 ('▁T', 3),
 ('▁Th', 3),
 ('▁Thi', 3)]

We group the characters with the best subwords to arrive at an initial vocabulary of size 300:

In [8]:
token_freqs = list(char_freqs.items()) + sorted_subwords[: 300 - len(char_freqs)]
token_freqs = {
    token: freq for token, freq in token_freqs
}

Next, we compute the sum of all frequencies, to convert the frequencies into probabilities.

In [9]:
from math import log

total_sum = sum([
    freq for token, freq in token_freqs.items()
])
model = {
    token: -log(freq / total_sum)
    for token, freq in token_freqs.items()
}
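
Because the model stores negative log probabilities, the cost of a segmentation is the sum of its token costs, and a lower sum means a more probable segmentation. The sketch below checks this on a small made-up frequency table; the `toy_` names and `segmentation_cost` are ours.

```python
from math import exp, isclose, log

toy_freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17, "n": 16,
    "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
toy_total = sum(toy_freqs.values())
toy_model = {tok: -log(freq / toy_total) for tok, freq in toy_freqs.items()}


def segmentation_cost(tokens, model):
    """Sum of negative log probabilities, i.e. -log of the product of probabilities."""
    return sum(model[tok] for tok in tokens)


cost_chars = segmentation_cost(["p", "u", "g"], toy_model)
cost_merged = segmentation_cost(["p", "ug"], toy_model)
print(cost_chars, cost_merged)

# Turning the summed cost back into a probability gives the product of the token probabilities.
print(isclose(exp(-cost_merged), (17 / toy_total) * (20 / toy_total)))
```

Segmentations with fewer tokens tend to be cheaper, since every extra token adds a positive cost.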

Now the main function is the one that tokenizes words using the Viterbi algorithm. The algorithm computes the best segmentation of each substring of the word, which we store in a variable named `best_segmentations`. We store one dictionary per position in the word (from 0 to its total length), with two keys: the index of the start of the last token in the best segmentation, and the score of the best segmentation. With the index of the start of the last token, we will be able to retrieve the full segmentation once the list is completely populated.

In [10]:
def encode_word(word, model):
    best_segmentations = [{"start": 0, "score": 1}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                score = model[token] + best_score_at_start
                # If we have found a better segmentation ending at end_idx, we update
                if (
                    best_segmentations[end_idx]["score"] is None
                    or best_segmentations[end_idx]["score"] > score
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}

    segmentation = best_segmentations[-1]
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None

    score = segmentation["score"]
    start = segmentation["start"]
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score

In [11]:
print(encode_word("Hopefully", model))
print(encode_word("This", model))

(['H', 'o', 'p', 'e', 'f', 'u', 'll', 'y'], 41.5157494601402)
(['This'], 6.288267030694535)


Now we can compute the loss of the model on the corpus:

In [12]:
def compute_loss(model):
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss

In [13]:
compute_loss(model)

413.10377642940875

Compute the scores for each token:

In [14]:
import copy

def compute_scores(model):
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        # We always keep tokens of length 1
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores

In [15]:
scores = compute_scores(model)
print(scores["ll"])
print(scores["his"])

6.376412403623874
0.0
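
A score of 0 means the corpus can be tokenized just as cheaply without that token, so it is a good candidate for removal. The self-contained sketch below reproduces this effect on a small made-up vocabulary. It uses a simplified Viterbi that only returns the best cost; the function names and `toy_` tables are ours.

```python
from math import inf, log


def viterbi_cost(word, model):
    """Lowest total cost over all segmentations of `word` (inf if none exists)."""
    best = [0.0] + [inf] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in model and best[start] + model[piece] < best[end]:
                best[end] = best[start] + model[piece]
    return best[-1]


def corpus_loss(word_freqs, model):
    return sum(freq * viterbi_cost(word, model) for word, freq in word_freqs.items())


toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
toy_freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17, "n": 16,
    "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
toy_total = sum(toy_freqs.values())
toy_model = {tok: -log(freq / toy_total) for tok, freq in toy_freqs.items()}
base_loss = corpus_loss(toy_word_freqs, toy_model)


def removal_score(token):
    """How much the corpus loss grows when `token` is dropped from the model."""
    reduced = {tok: cost for tok, cost in toy_model.items() if tok != token}
    return corpus_loss(toy_word_freqs, reduced) - base_loss


print(removal_score("pu"))
print(removal_score("hug"))
```

Dropping `"pu"` costs nothing because `"p" + "ug"` and `"p" + "un"` are equally cheap alternatives, while dropping `"hug"` forces the 10 occurrences of the word `hug` into a two-token segmentation.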


The last thing we need to do is loop until we have pruned enough tokens from the vocabulary to reach our desired size (a complete implementation would also add the special tokens used by the model to the vocabulary, which we skip here):

In [16]:
percent_to_remove = 0.1
while len(model) > 100:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove percent_to_remove tokens with the lowest scores.
    for i in range(int(len(model) * percent_to_remove)):
        _ = token_freqs.pop(sorted_scores[i][0])

    total_sum = sum([freq for token, freq in token_freqs.items()])
    model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}

Then, to tokenize some text, we just need to apply the pre-tokenization and then use our `encode_word()` function:

In [17]:
def tokenize(text, model):
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])


tokenize("This is the Hugging Face course.", model)

['▁This',
 '▁is',
 '▁the',
 '▁Hugging',
 '▁Face',
 '▁',
 'c',
 'ou',
 'r',
 's',
 'e',
 '.']

# Building a tokenizer, block by block

Tokenization comprises several steps:
* Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
* Pre-tokenization (splitting the input into words)
* Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
* Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)
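
To see how these four steps chain together, here is a deliberately tiny pure-Python sketch. Every function, the regular expression and the toy vocabulary are stand-ins chosen for this illustration; none of them is the Tokenizers library API.

```python
import re

TOY_VOCAB = {"[UNK]", "[CLS]", "[SEP]", "hello", "world", ","}


def normalize(text):
    # Normalization: here just lowercasing and trimming outer whitespace.
    return text.lower().strip()


def pre_tokenize(text):
    # Pre-tokenization: split into runs of word characters or of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


def lookup_model(words):
    # "Model": a plain vocabulary lookup with an unknown-token fallback.
    return [word if word in TOY_VOCAB else "[UNK]" for word in words]


def post_process(tokens):
    # Post-processing: wrap the sequence in special tokens.
    return ["[CLS]"] + tokens + ["[SEP]"]


def toy_tokenize(text):
    return post_process(lookup_model(pre_tokenize(normalize(text))))


toy_tokens = toy_tokenize("  Hello, brave World ")
print(toy_tokens)
```

Each stage only sees the output of the previous one, which is why the library can let us swap the components independently.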

The Tokenizers library is built around a central `Tokenizer` class with the building blocks grouped in submodules:
* `normalizers` contains all the possible types of `Normalizer`
* `pre_tokenizers` contains all the possible types of `PreTokenizer`
* `models` contains the various types of `Model`, like `BPE`, `WordPiece`, and `Unigram`
* `trainers` contains all the different types of `Trainer` to train our model on a corpus
* `processors` contains the various types of `PostProcessor`
* `decoders` contains the various types of `Decoder` to decode the outputs of tokenization.

## Acquiring a corpus

In [None]:
from datasets import load_dataset

dataset = load_dataset('wikitext', name='wikitext-2-raw-v1', split='train')

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i+1000]['text']

The `get_training_corpus()` function is a generator that yields batches of 1,000 texts.
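
The same batching pattern can be tried on a plain list, without downloading a dataset. This is a sketch; `iter_batches` and the fake texts are made up for the example.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


fake_texts = [f"text {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(fake_texts, 1000)]
print(batch_sizes)
```

A generator object can only be iterated once, which is why `get_training_corpus()` is called again each time a training run needs the corpus.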

Tokenizers can also be trained on text files directly.

In [20]:
with open('wikitext-2.txt', 'w', encoding='utf-8') as f:
    for i in range(len(dataset)):
        f.write(dataset[i]['text'] + '\n')

## Building a WordPiece tokenizer from scratch

In [21]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

We need to specify the `unk_token` so the model knows what to return when it encounters characters it has not seen before.

The first step of tokenization is normalization.

In [23]:
# replicate the `bert-base-uncased` tokenizer
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

We can also compose several normalizers using a `Sequence`:

In [24]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

In [26]:
tokenizer.normalizer.normalize_str("Héllò hôw are ü?")

'hello how are u?'
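
For intuition, the same three steps can be approximated with the standard library. This is a sketch assuming that stripping accents amounts to dropping the combining marks (Unicode category `Mn`) left by NFD decomposition; the helper name is ours and the library's normalizers may differ in edge cases.

```python
import unicodedata


def normalize_like_sequence(text):
    """NFD-decompose, lowercase, then drop combining marks."""
    text = unicodedata.normalize("NFD", text)  # 'é' becomes 'e' + combining accent
    text = text.lower()
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


normalized = normalize_like_sequence("Héllò hôw are ü?")
print(normalized)
```

The NFD step matters: without it, an accented letter is a single code point and there is no separate mark to remove.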

Next is the pre-tokenization step.

In [27]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Or we can build it from scratch:

In [28]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

The `Whitespace` pre-tokenizer splits on whitespace and on all characters that are not letters, digits, or the underscore character, so it splits on whitespace and punctuation:

In [29]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('the', (11, 14)),
 ('pre', (15, 18)),
 ('-', (18, 19)),
 ('tokenizer', (19, 28)),
 ('.', (28, 29))]
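
This behavior corresponds to the regular expression `\w+|[^\w\s]+`, which the Tokenizers documentation gives for `Whitespace`. As a rough stand-in, Python's `re` module reproduces the output above, although its notion of `\w` may differ from the library's on unusual characters. The helper name is ours.

```python
import re


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs for runs of word or punctuation characters."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]


pieces = whitespace_pre_tokenize("Let's test the pre-tokenizer.")
print(pieces)
```

Keeping the `(start, end)` offsets alongside each piece is what later allows tokens to be mapped back to the original text.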

If we only want to split on whitespace, we should use the `WhitespaceSplit` pre-tokenizer:

In [30]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('the', (11, 14)),
 ('pre-tokenizer.', (15, 29))]

We can use a `Sequence` to compose several pre-tokenizers:

In [31]:
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)

pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('the', (11, 14)),
 ('pre', (15, 18)),
 ('-', (18, 19)),
 ('tokenizer', (19, 28)),
 ('.', (28, 29))]

Next is running the inputs through the model. We have already specified our model but we still need to train it. When instantiating a trainer, we need to pass it all the special tokens we intend to use:

In [32]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

trainer = trainers.WordPieceTrainer(
    vocab_size=25000, special_tokens=special_tokens,
)

We can also set the `min_frequency` (the number of times a token must appear to be included in the vocabulary) or change the `continuing_subword_prefix` (if we want to use something different from `##`).

In [33]:
tokenizer.train_from_iterator(get_training_corpus(),
                              trainer=trainer)

We can also use text files to train our tokenizer:

In [34]:
tokenizer.model = models.WordPiece(unk_token="[UNK]")

tokenizer.train(['wikitext-2.txt'],
                trainer=trainer)

We can then test the tokenizer on a text:

In [35]:
encoding = tokenizer.encode("Let's test the tokenizer.")
print(encoding.tokens)

['let', "'", 's', 'test', 'the', 'tok', '##eni', '##zer', '.']


The last step is the post-processing. We need to add the `[CLS]` token at the beginning and the `[SEP]` token at the end (or after each sentence).

In [48]:
cls_token_id = tokenizer.token_to_id('[CLS]')
sep_token_id = tokenizer.token_to_id('[SEP]')
print(cls_token_id, sep_token_id)

2 3


We will use `TemplateProcessing` and specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by `$A`, while the second sentence (if encoding a pair) is represented by `$B`. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.

The classic BERT template is defined as follows:

In [49]:
tokenizer.post_processor = processors.TemplateProcessing(
    single = "[CLS]:0 $A:0 [SEP]:0",
    pair = "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens = [
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ]
)

Once this is added, going back to our previous example:

In [50]:
encoding = tokenizer.encode("Let's test the tokenizer.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'the', 'tok', '##eni', '##zer', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


And on a pair of sentences:

In [52]:
encoding = tokenizer.encode("Let's test the tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'the', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
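
To show what a template does, here is a small pure-Python sketch that expands a template string into tokens and type IDs. It only imitates the idea; `apply_template` is a name made up for this example and handles nothing beyond `$A`, `$B` and literal special tokens.

```python
def apply_template(template, seq_a, seq_b=None):
    """Expand items like '[CLS]:0' or '$A:0' into (tokens, type_ids)."""
    tokens, type_ids = [], []
    for item in template.split():
        name, type_id = item.rsplit(":", 1)
        if name == "$A":
            pieces = seq_a
        elif name == "$B":
            pieces = seq_b
        else:
            pieces = [name]  # a literal special token such as [CLS]
        tokens.extend(pieces)
        type_ids.extend([int(type_id)] * len(pieces))
    return tokens, type_ids


pair_tokens, pair_type_ids = apply_template(
    "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", ["let", "'", "s"], ["on", "a", "pair"]
)
print(pair_tokens)
print(pair_type_ids)
```

The number after each colon is copied to every token produced by that item, so the first sentence and its closing `[SEP]` get type 0 and everything after gets type 1.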


The last step is to include a decoder:

In [53]:
tokenizer.decoder = decoders.WordPiece(prefix='##')
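
Conceptually, this decoder glues each continuation piece (a token starting with the prefix) onto the previous token and joins the rest with spaces. The sketch below shows only that core idea on a hand-written token list. The library's decoder also tidies some spacing around punctuation, which this example does not attempt; `decode_wordpiece` is a name made up here.

```python
def decode_wordpiece(tokens, prefix="##"):
    """Merge continuation pieces into the previous word and join words with spaces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)


decoded = decode_wordpiece(["let", "'", "s", "test", "the", "tok", "##eni", "##zer", "."])
print(decoded)
```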

In [54]:
# test our previous encoding
tokenizer.decode(encoding.ids)

"let ' s test the tokenizer... on a pair of sentences."

In [None]:
# save the tokenizer
tokenizer.save('tokenizer.json')

In [None]:
# reload from disk
new_tokenizer = Tokenizer.from_file('tokenizer.json')

To use this tokenizer in the Transformers library, we have to wrap it in a `PreTrainedTokenizerFast`. We can either pass the tokenizer we built as a `tokenizer_object` or pass the tokenizer file we saved as `tokenizer_file`. We have to set all the special tokens manually, since that class cannot infer from the `tokenizer` object which token is the mask token, etc.:

In [56]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file='tokenizer.json', # if we load from the tokenizer file
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)



If we use a specific tokenizer class (like `BertTokenizerFast`), we will only need to specify the special tokens that are different from the default ones:

In [57]:
from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)



Then we can use this tokenizer like any other Transformers tokenizer.

## Building a BPE tokenizer from scratch

In [58]:
tokenizer = Tokenizer(models.BPE())

As with BERT, we could initialize this model with a vocabulary if we had one (we would need to pass the `vocab` and `merges` in this case), but since we will train from scratch, we don't need to do that.

We also do NOT need to specify an `unk_token` because GPT-2 uses byte-level BPE, which does NOT require it.
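The reason no unknown token is needed can be sketched in pure Python: text is first turned into UTF-8 bytes, and each of the 256 possible byte values is mapped to a visible character, so any input can be written with a fixed 256-symbol base alphabet. The code below is modelled on the byte-to-character table commonly used by GPT-2-style tokenizers; the function names are assumptions made for this illustration.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character."""
    # Bytes that already correspond to printable characters keep their own symbol.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are shifted to code points >= 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def to_byte_level(text):
    """Rewrite a string as its byte-level representation (illustrative helper)."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(len(bytes_to_unicode()))  # 256
print(to_byte_level(" test"))   # Ġtest
```

With this table the space byte becomes `Ġ`, which is the symbol that shows up in the byte-level pre-tokenization output.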

GPT-2 does not use a normalizer, so we skip that step and go directly to the pre-tokenization:

In [59]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

Setting `add_prefix_space=False` on `ByteLevel` tells it not to add a space at the beginning of a sentence (by default, it adds one).

In [60]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenization.")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġthe', (10, 14)),
 ('Ġpre', (14, 18)),
 ('-', (18, 19)),
 ('tokenization', (19, 31)),
 ('.', (31, 32))]

Next is the model, which needs training. For GPT-2, the only special token is the end-of-text token:

In [61]:
trainer = trainers.BpeTrainer(
    vocab_size=25000,
    special_tokens=['<|endoftext|>']
)

tokenizer.train_from_iterator(get_training_corpus(),
                              trainer=trainer)

As with the `WordPieceTrainer`, in addition to `vocab_size` and `special_tokens`, we can specify `min_frequency` if we want to, and if we have an end-of-word suffix (like `</w>`), we can set it with `end_of_word_suffix`.
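To give a feel for what BPE training does and where `min_frequency` comes in, here is a minimal sketch of the merge loop on a toy word-frequency table. It is a simplification under stated assumptions: `train_bpe` is a hypothetical function, it works on characters rather than bytes, and it stops after a fixed number of merges instead of targeting a vocabulary size.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges, min_frequency=1):
    """Learn BPE merges from a {word: count} table (toy illustration)."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best, count = max(pair_counts.items(), key=lambda item: item[1])
        if count < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        # Apply the new merge to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(word_freqs, num_merges=3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

Raising `min_frequency` in this sketch ends training earlier, because pairs that occur fewer times than the threshold are never merged.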

This tokenizer can also be trained on text files:

In [62]:
tokenizer.model = models.BPE()
tokenizer.train(['wikitext-2.txt'],
                trainer=trainer)

In [64]:
# test
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']


Next, we apply the byte-level post-processing for the GPT-2 tokenizer:

In [65]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

The `trim_offsets=False` option tells the post-processor to leave the offsets of tokens that begin with `Ġ` as they are: this way, the start of the offsets will point to the space before the word, not to the first character of the word.

In [66]:
sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

' test'
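For contrast, the following sketch shows what trimming would do to the same span: the start offset is moved past the leading space marker so that the span covers only the word. `trim_leading_marker` is a hypothetical helper written for this illustration; it only handles leading markers and is not the library's implementation.

```python
def trim_leading_marker(token, offsets, marker="Ġ"):
    """Shift the start offset past any leading space markers in the token."""
    start, end = offsets
    n_markers = len(token) - len(token.lstrip(marker))
    return (min(start + n_markers, end), end)


sentence = "Let's test this tokenizer."
untrimmed = (5, 10)  # span of the token 'Ġtest' when offsets are not trimmed
trimmed = trim_leading_marker("Ġtest", untrimmed)

print(repr(sentence[untrimmed[0]:untrimmed[1]]))  # ' test'
print(repr(sentence[trimmed[0]:trimmed[1]]))      # 'test'
```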

Finally, we add a byte-level decoder:

In [67]:
tokenizer.decoder = decoders.ByteLevel()

In [68]:
# test
tokenizer.decode(encoding.ids)

"Let's test this tokenizer."

In [69]:
# wrap it
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token='<|endoftext|>',
    eos_token='<|endoftext|>',
)



In [70]:
# alternatively
from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)



## Building a Unigram tokenizer from scratch

In [71]:
tokenizer = Tokenizer(models.Unigram())

For the normalization, XLNet uses a few replacements:

In [72]:
from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

This replaces a pair of backticks and a pair of single quotes (`''`) with `"`, collapses any sequence of two or more spaces into a single space, and removes the accents from the texts to tokenize.
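The same sequence of steps can be approximated with the standard library alone, which makes it easy to check what each step contributes. This is a sketch, not the library's code: `xlnet_like_normalize` is a hypothetical name, and dropping combining characters is assumed here to be a close stand-in for `StripAccents`.

```python
import re
import unicodedata


def xlnet_like_normalize(text):
    """Approximate the normalizer sequence with standard-library calls."""
    # 1. Replace LaTeX-style quotes with a plain double quote.
    text = text.replace("``", '"').replace("''", '"')
    # 2. NFKD decomposition separates base letters from their accents.
    text = unicodedata.normalize("NFKD", text)
    # 3. Drop the combining marks, which removes the accents.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 4. Collapse runs of two or more spaces into one.
    return re.sub(r" {2,}", " ", text)


print(xlnet_like_normalize("``Héllo''   wörld"))
# "Hello" world
```

The order matters: the accents can only be stripped after NFKD has split each accented letter into a base letter plus a combining mark.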

The pre-tokenizer to use for any SentencePiece tokenizer is `Metaspace`:

In [73]:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

In [74]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer.")

[("▁Let's", (0, 5)),
 ('▁test', (5, 10)),
 ('▁the', (10, 14)),
 ('▁pre-tokenizer.', (14, 29))]
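The behaviour shown above can be imitated in a few lines: spaces are replaced by the `▁` marker, a marker is added at the start, and the text is split in front of every marker. The helpers below are hypothetical and ignore offsets and the options of the real `Metaspace` class; the decoding function simply reverses the process.

```python
import re


def metaspace_pre_tokenize(text, replacement="▁"):
    """Split text into pieces that each start with the space marker."""
    marked = replacement + text.replace(" ", replacement)
    marker = re.escape(replacement)
    return re.findall(f"{marker}[^{marker}]*", marked)


def metaspace_decode(pieces, replacement="▁"):
    """Turn markers back into spaces and drop the one added at the start."""
    text = "".join(pieces).replace(replacement, " ")
    return text[1:] if text.startswith(" ") else text


pieces = metaspace_pre_tokenize("Let's test the pre-tokenizer.")
print(pieces)
# ["▁Let's", '▁test', '▁the', '▁pre-tokenizer.']
print(metaspace_decode(pieces))
# Let's test the pre-tokenizer.
```

Because the spaces are kept as explicit markers rather than thrown away, the original text can be reconstructed exactly from the pieces.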

Next is the model. XLNet has a few special tokens:

In [75]:
special_tokens = ['<cls>', '<sep>', '<unk>', '<pad>', '<mask>', '<s>', '</s>']

trainer = trainers.UnigramTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    unk_token='<unk>'
)

tokenizer.train_from_iterator(get_training_corpus(),
                              trainer=trainer)

For `UnigramTrainer`, we need to set the `unk_token`. We can also pass along other arguments specific to the Unigram algorithm, such as the `shrinking_factor` for each step where we remove tokens (defaults to 0.75) or the `max_piece_length` to specify the maximum length of a given token (defaults to 16).

In [76]:
tokenizer.model = models.Unigram()

tokenizer.train(['wikitext-2.txt'],
                trainer=trainer)

In [77]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.']
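Once trained, a Unigram model picks, among all ways of cutting a word into vocabulary pieces, the segmentation with the highest total log-probability. The sketch below shows that search with a small dynamic program. The vocabulary and its scores are invented for the example, and `unigram_tokenize` is a hypothetical helper rather than the library's implementation.

```python
import math


def unigram_tokenize(text, log_probs):
    """Return the highest-scoring segmentation of text, or None if none exists."""
    n = len(text)
    # best[i] = (best score of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # text cannot be built from the vocabulary
    # Walk back through the stored start indices to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Made-up log-probabilities, for illustration only.
toy_vocab = {
    "▁": -5.0, "t": -6.0, "o": -6.0, "k": -6.0, "e": -6.0, "n": -6.0,
    "▁to": -4.0, "ken": -5.0, "▁token": -12.0,
}
print(unigram_tokenize("▁token", toy_vocab))
# ['▁to', 'ken']
```

In this toy vocabulary, `▁to` plus `ken` scores -9, which beats the single piece `▁token` at -12 and the character-by-character split at -35.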


XLNet puts the `<cls>` token at the end of the sentence, with a type ID of 2. It pads the text on the left.
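Left padding can be pictured with a short sketch: padding IDs are placed in front of the sequence, so a `<cls>` token at the end stays in the last position of every row in a batch. `pad_left` and the IDs below are hypothetical and used only for illustration.

```python
def pad_left(ids, target_length, pad_id):
    """Left-pad token IDs and build the matching attention mask."""
    n_pad = max(target_length - len(ids), 0)
    padded = [pad_id] * n_pad + ids
    attention_mask = [0] * n_pad + [1] * len(ids)
    return padded, attention_mask


# Made-up IDs: 1 stands in for <sep>, 0 for <cls>, 3 for <pad>.
padded, mask = pad_left([5, 6, 1, 0], target_length=6, pad_id=3)
print(padded)  # [3, 3, 5, 6, 1, 0]
print(mask)    # [0, 0, 1, 1, 1, 1]
```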

In [78]:
cls_token_id = tokenizer.token_to_id('<cls>')
sep_token_id = tokenizer.token_to_id('<sep>')
print(cls_token_id, sep_token_id)

0 1


We can deal with all the special tokens and token type IDs with a template, like for BERT.

In [79]:
tokenizer.post_processor = processors.TemplateProcessing(
    single = "$A:0 <sep>:0 <cls>:2",
    pair = "$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens = [
        ("<sep>", sep_token_id),
        ("<cls>", cls_token_id),
    ]
)

Now we can test if it works by encoding a pair of sentences:

In [80]:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair', '▁of', '▁sentence', 's', '.', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]


Finally, we add a `Metaspace` decoder:

In [81]:
tokenizer.decoder = decoders.Metaspace()

In [82]:
tokenizer.decode(encoding.ids)

"Let's test this tokenizer... on a pair of sentences."

Finally, we wrap it in a `PreTrainedTokenizerFast`:

In [83]:
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token='<s>',
    eos_token='</s>',
    unk_token='<unk>',
    pad_token='<pad>',
    cls_token='<cls>',
    sep_token='<sep>',
    mask_token='<mask>',
    padding_side='left',
)



In [84]:
# alternatively
from transformers import XLNetTokenizerFast

wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)

