Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus.

In [6]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import pipeline
import numpy as np
import torch
from tokenizers import (decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer)
from transformers import PreTrainedTokenizerFast
from transformers import BertTokenizerFast


In [None]:
raw_datasets = load_dataset("code_search_net", "python")
raw_datasets["train"]

In [None]:
# Only 1,000 texts at a time will be loaded
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()
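One property of this pattern is easy to miss: `get_training_corpus()` returns a generator, and a generator can be consumed only once. The sketch below reproduces the batching logic on a toy in-memory stand-in (`toy_dataset`, `get_toy_corpus`, and the `batch_size` argument are hypothetical names for illustration, not part of the notebook).

```python
# Toy stand-in for raw_datasets["train"]: a dict of columns (hypothetical data).
toy_dataset = {"whole_func_string": [f"def f{i}(): pass" for i in range(5)]}

def get_toy_corpus(batch_size=2):
    """Yield lists of texts, batch_size at a time, like get_training_corpus()."""
    texts = toy_dataset["whole_func_string"]
    for start_idx in range(0, len(texts), batch_size):
        yield texts[start_idx : start_idx + batch_size]

corpus = get_toy_corpus()
first_pass = list(corpus)   # consumes the generator
second_pass = list(corpus)  # nothing left: generators are single-use

print([len(batch) for batch in first_pass])
print(second_pass)
```

Calling the function again returns a fresh generator, so a second training run needs a new call rather than the old `training_corpus` object.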

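Even though we are going to train a new tokenizer, it’s a good idea to start from an existing one (here, GPT-2’s) to avoid starting entirely from scratch: the tokenization algorithm and the special tokens are reused, and the only thing that will change is the vocabulary.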

In [5]:
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000) # Only works with fast tokenizer (backed by Tokenizers instead of pure Python code)
tokenizer.save_pretrained("code-search-net-tokenizer")
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokens = tokenizer.tokenize(example)
print(tokens) # Just the tokens as a list
encoding = tokenizer(example) 
print(type(encoding)) # The output of a tokenizer is a BatchEncoding object (a subclass of dict with additional methods for fast tokenizers)

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']
<class 'transformers.tokenization_utils_base.BatchEncoding'>


Here we see the special symbols Ġ and Ċ, which denote spaces and newlines, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, there is a ĊĠĠĠ token that represents an indentation and a ĊĠĠĠĠĠĠĠ token that represents a double indentation. The vocabulary also contains a Ġ""" token for the three quotes that start a docstring, although it does not appear here because this example has no docstring. This is quite a compact representation. Python-specific words such as class, init, call, self, and return are each a single token, and as well as splitting on _ and ., the tokenizer splits camel-cased names correctly: LinearLayer is tokenized as ["ĠLinear", "Layer"].
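To make the role of Ġ and Ċ concrete, here is a minimal sketch that maps a few of the tokens above back to text. It assumes ASCII-only input, where Ġ stands for a space and Ċ for a newline; the real byte-level decoder handles every byte value, and `detokenize` is a hypothetical helper, not a library function.

```python
def detokenize(tokens):
    """Rebuild text from byte-level BPE tokens (ASCII-only sketch)."""
    text = "".join(tokens)
    # In GPT-2's byte-level alphabet, "Ġ" encodes a space and "Ċ" a newline.
    return text.replace("Ġ", " ").replace("Ċ", "\n")

# First tokens of the output above.
sample = ["class", "ĠLinear", "Layer", "():", "ĊĠĠĠ", "Ġdef", "Ġ__", "init", "__("]
print(repr(detokenize(sample)))
```

Note how the four-space indentation is split between the ĊĠĠĠ token (newline plus three spaces) and the leading Ġ of `Ġdef`.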

In [None]:
# Batch encoding
print(tokenizer.is_fast) # True when the tokenizer is backed by the Tokenizers library
print(encoding.is_fast) # True when the BatchEncoding was produced by a fast tokenizer
print(encoding.word_ids()) # Index of the word each token comes from; special tokens such as [CLS] and [SEP] are mapped to None
start, end = encoding.word_to_chars(3) # Character span of the word at index 3
print(example[start:end])

### Token-classification pipeline
It handles entities that span several tokens by using one label for the beginning of an entity (`B-`) and another for its continuation (`I-`).

In [7]:
token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'entity': 'I-PER',
  'score': np.float32(0.99938285),
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': np.float32(0.99815494),
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': np.float32(0.9959072),
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': np.float32(0.99923277),
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': np.float32(0.97389334),
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': np.float32(0.97611505),
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': np.float32(0.9887977),
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': np.float32(0.9932106),
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [9]:
token_classifier = pipeline("token-classification", aggregation_strategy="simple", device=-1) # device=-1 runs the pipeline on the CPU (no GPU was available here)
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The `aggregation_strategy` argument controls how the score of a grouped entity is computed:

- `simple`: the score is just the mean of the scores of each token in the given entity.
- `first`: the score of each entity is the score of the first token.
- `max`: the score of each entity is the maximum score of the tokens in that entity.
- `average`: the score of each entity is the average of the scores of the words composing that entity, so for a single-word entity the score is the same as with `simple`.
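As a rough check of these definitions, the sketch below recomputes the four scores for "Hugging Face" from the per-token scores in the ungrouped output above. It follows the list literally and is not the pipeline's actual implementation; the word grouping in `hugging_face` is written by hand.

```python
from statistics import mean

# Per-token scores copied from the ungrouped pipeline output, grouped by word by hand.
hugging_face = {"Hugging": [0.97389334, 0.97611505], "Face": [0.9887977]}

token_scores = [s for scores in hugging_face.values() for s in scores]
word_scores = [mean(scores) for scores in hugging_face.values()]

strategies = {
    "simple": mean(token_scores),   # mean over all tokens of the entity
    "first": token_scores[0],       # score of the first token
    "max": max(token_scores),       # highest token score
    "average": mean(word_scores),   # mean over word-level scores
}
for name, score in strategies.items():
    print(f"{name}: {score:.4f}")
```

The `simple` value agrees with the 0.9796 reported for `Hugging Face` in the grouped output, while `average` comes out slightly higher because "Face" counts as much as the two tokens of "Hugging" together.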

In [15]:
# Post-processing the predictions while grouping entities
from transformers import AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn." # Same sentence as in the pipeline calls
results = []
inputs = tokenizer(example, return_tensors="pt") # Tensors to feed the model
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True) # Adds offset_mapping: one (start, end) character span in the original text per token
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

outputs = model(**inputs) # outputs is a ModelOutput (dict-like) object
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist() # Shape of logits is (batch_size, sequence_length, num_labels)

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred] # Converting the numeric label to its string form using the model config
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx] # Getting the starting character index of the entity (from the first token).

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}" # Keep collecting tokens as long as they are labeled I-{label}
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

print(results)

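For reference, these are the ungrouped per-token predictions that the loop above merges into entities:

[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S'},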
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl'}, 
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va'}, 
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in'}, 
 {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu'}, 
 {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging'}, 
 {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face'}, 
 {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn'}]


### Question-answering task
It handles long contexts by splitting the context into several parts, with overlap so that an answer is not cut in two at a boundary, and then keeps the answer with the maximum score across all parts (averaging would not make sense, as some parts of the context won't contain the answer).
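The splitting step can be sketched in plain Python. This is an illustration under simple assumptions (a flat token list, a fixed window length, hypothetical per-window scores), not the pipeline's code; with a Transformers tokenizer, similar behavior comes from the `max_length`, `stride`, and `return_overflowing_tokens` arguments.

```python
def split_with_overlap(tokens, max_len, stride):
    """Split tokens into windows of at most max_len items; consecutive windows share `stride` items."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start : start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the context
        start += max_len - stride
    return windows

context = list(range(10))  # stand-in for a tokenized context
windows = split_with_overlap(context, max_len=4, stride=2)
print(windows)

# Hypothetical best-answer score found in each window; the final answer is the maximum.
window_scores = [0.02, 0.91, 0.40, 0.01]
best_window = max(range(len(window_scores)), key=window_scores.__getitem__)
print(best_window, window_scores[best_window])
```

Taking the maximum rather than the mean keeps the low scores of windows that do not contain the answer from diluting the one that does.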

- More info on question answering task: https://huggingface.co/learn/llm-course/chapter6/3b

### Normalization and pre-tokenization

- The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents.
- Pre-tokenization splits the text into smaller units, such as words.
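Both steps can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in: `normalize` lowercases and strips accents through NFD decomposition, and `pre_tokenize` uses a regular expression that only approximates BERT's rules (for instance, it keeps underscores inside words).

```python
import re
import unicodedata

def normalize(text):
    """Lowercase and strip accents: NFD decomposition, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()

def pre_tokenize(text):
    """Split on whitespace and isolate punctuation, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

print(normalize("Héllò hôw are ü?"))
print(pre_tokenize("Hello, how are  you?"))
```

Keeping the offsets alongside each piece is what later lets a tokenizer map tokens back to positions in the original text.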

In [17]:
# See how a fast tokenizer performs these two steps
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))

hello how are u?
[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]


| Feature        | BPE                                                                 | WordPiece                                                                                                                        | Unigram                                                                                              |
|----------------|---------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| Training       | Starts from a small vocabulary and learns rules to merge tokens    | Starts from a small vocabulary and learns rules to merge tokens                                                                 | Starts from a large vocabulary and learns rules to remove tokens                                     |
| Training step  | Merges the tokens corresponding to the most common pair            | Merges the tokens corresponding to the pair with the best score based on frequency, privileging pairs where individual tokens are less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus    |
| Learns         | Merge rules and a vocabulary                                        | Just a vocabulary                                                                                                                 | A vocabulary with a score for each token                                                             |
| Encoding       | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training                    |
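The difference between the BPE and WordPiece training steps shows up clearly on a toy example. The sketch below uses made-up word frequencies and the WordPiece score described in the course linked below, pair frequency divided by the product of the two tokens' frequencies. For simplicity, both criteria are evaluated on the same `##`-marked symbols, which real BPE does not use.

```python
from collections import Counter

# Toy word frequencies (hypothetical corpus).
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Split each word into characters, marking non-initial ones with "##".
splits = {w: [c if i == 0 else f"##{c}" for i, c in enumerate(w)] for w in word_freqs}

token_freqs, pair_freqs = Counter(), Counter()
for word, freq in word_freqs.items():
    symbols = splits[word]
    for sym in symbols:
        token_freqs[sym] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_freqs[pair] += freq

# BPE criterion: merge the most frequent pair.
bpe_choice = max(pair_freqs, key=pair_freqs.get)

# WordPiece criterion: pair frequency divided by the product of its parts' frequencies.
def wordpiece_score(pair):
    return pair_freqs[pair] / (token_freqs[pair[0]] * token_freqs[pair[1]])

wordpiece_choice = max(pair_freqs, key=wordpiece_score)

print(bpe_choice, pair_freqs[bpe_choice])
print(wordpiece_choice, round(wordpiece_score(wordpiece_choice), 3))
```

On this corpus the frequency criterion picks `("##u", "##g")`, whereas the WordPiece score favors `("##g", "##s")` because `##s` is rare on its own.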


- Byte-Pair Encoding tokenization: https://huggingface.co/learn/llm-course/chapter6/5?fw=pt
- WordPiece tokenization: https://huggingface.co/learn/llm-course/chapter6/6
- Unigram tokenization: https://huggingface.co/learn/llm-course/chapter6/7

### Building a WordPiece tokenizer from scratch
(Hello how are U tday?) -> Normalization -> (hello how are u tday?) -> Pre-tokenization into words -> ([hello, how, are, u, tday, ?]) -> Model -> ([hello, how, are, u, td, ##ay, ?]) -> Postprocessor -> ([CLS, hello, how, are, u, td, ##ay, ?, SEP])  
  
\## is a subword prefix that indicates a token is a continuation of the previous word.
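The model step in this pipeline can be sketched as a greedy longest-match search. The function below is a simplified illustration with a hand-written toy vocabulary (`wordpiece_encode` and `toy_vocab` are hypothetical names), not the Tokenizers library's implementation.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"hello", "how", "are", "u", "td", "##ay", "?"}
print(wordpiece_encode("tday", toy_vocab))
print(wordpiece_encode("hello", toy_vocab))
print(wordpiece_encode("xyz", toy_vocab))
```

With this vocabulary, `tday` is split into `td` and `##ay`, as in the pipeline line above, and a word with no matching piece falls back to `[UNK]`.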

In [None]:
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]")) # unk_token must be specified so the model knows what to return when it encounters characters it hasn’t seen before

In [1]:
# Normalization from...

# Existing model
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

# Scratch
tokenizer.normalizer = normalizers.Sequence( # You can compose several normalizers using a Sequence
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()] # NFD Unicode normalizer allows StripAccents normalizer to recognize the accented characters
)
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?


In [3]:
# Pre-tokenization from...

# ... existing model
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# ... scratch
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()] # WhitespaceSplit() splits on whitespace only, then Punctuation() isolates punctuation marks; Whitespace() alone would split on whitespace and on every character that is not a letter, digit, or underscore
)
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre', (14, 17)), ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]


In [4]:
# Model training from...
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# ... iterator
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

# ... file (alternative to the iterator; use one of the two)
# Assumption: `dataset` is a text dataset with a "text" column, e.g. WikiText-2
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

# Testing
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']


Other parameters are min_frequency (the number of times a token must appear to be included in the vocabulary) and continuing_subword_prefix (if we want to use something different from ##).

In [6]:
# Post-processing
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

# BERT template
tokenizer.post_processor = processors.TemplateProcessing( # We have to specify how to treat a single sentence and a pair of sentences
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

# Testing
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']
['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
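To see where the token type IDs come from, the template can be mimicked in a few lines of plain Python. `apply_bert_template` is a hypothetical helper written for illustration; it hard-codes the BERT layout instead of parsing a template string the way `TemplateProcessing` does.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and build type IDs: 0 for the first segment, 1 for the second."""
    tokens = ["[CLS]", *tokens_a, "[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += [*tokens_b, "[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # the closing [SEP] belongs to segment 1
    return tokens, type_ids

first = ["let", "'", "s", "test", "this", "tok", "##eni", "##zer", "..."]
second = ["on", "a", "pair", "of", "sentences", "."]
tokens, type_ids = apply_bert_template(first, second)
print(tokens)
print(type_ids)
```

The result has eleven 0s followed by seven 1s, the same pattern as the `type_ids` printed above.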


In [8]:
# Last steps
tokenizer.decoder = decoders.WordPiece(prefix="##") # Including a decoder
tokenizer.decode(encoding.ids) # Testing
tokenizer.save("tokenizer.json") # Saving
new_tokenizer = Tokenizer.from_file("tokenizer.json") # Loading

"let's test this tokenizer... on a pair of sentences."


In [None]:
# Wrap the raw tokenizer into something compatible with transformers ...
# ... using a specific tokenizer class
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer) # Only special tokens that differ from the class defaults need to be specified (here, none)

# ... using PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]", # This class can’t infer from the tokenizer object which token is the mask token, the [CLS] token, etc., so each one must be passed explicitly
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

- Building a BPE tokenizer from scratch (covered in the same course chapter as the Unigram link below)
- Building a Unigram tokenizer from scratch: https://huggingface.co/learn/llm-course/chapter6/8#building-a-unigram-tokenizer-from-scratch