# Tokenizers

Learn about the `tokenizers` library from Hugging Face.

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 17/12/2025   | Martin | Create  | Notebook created for Ch6 Tokenizers | 

# Content

* [Introduction](#introduction)
* [Training a Tokenizer](#training-a-tokenizer)
* [Fast Tokenizers](#fast-tokenizers)

# Introduction

Learn how to train a brand new tokenizer on a corpus of texts, so it can then be used to pretrain a language model

<u>Questions to Answer</u>

- How to train a new tokenizer similar to the one used by a given checkpoint on a new corpus of texts
- The special features of fast tokenizers
- The differences between the three main subword tokenization algorithms used in NLP today
- How to build a tokenizer from scratch with the 🤗 Tokenizers library and train it on some data

> Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm
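
As one concrete instance of such a rule, byte-pair encoding (the algorithm family GPT-2 uses) repeatedly merges the most frequent pair of adjacent symbols. The sketch below performs a single merge step on a hand-made toy corpus; the word frequencies and the helper names `count_pairs` and `merge_pair` are assumptions for illustration, not part of any library.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency (illustrative values only).
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Start with every word split into single characters.
splits = {word: list(word) for word in word_freqs}


def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts


def merge_pair(pair, splits):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[word] = out
    return merged


pair_counts = count_pairs(splits, word_freqs)
best_pair = max(pair_counts, key=pair_counts.get)  # most frequent pair
splits = merge_pair(best_pair, splits)

print(best_pair)       # ('u', 'g')
print(splits["hugs"])  # ['h', 'ug', 's']
```

A real trainer repeats this step until the vocabulary reaches the requested size; other algorithms such as WordPiece and Unigram use different selection criteria.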

# Training a Tokenizer

1. Assemble a corpus of text

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [None]:
raw_datasets = load_dataset("code_search_net", "default", revision="refs/convert/parquet")

In [None]:
raw_datasets['train']

In [None]:
# Prints the whole function string
print(raw_datasets["train"][123456]["whole_func_string"])

In [None]:
# Load batches of text - Create Python generator
def get_training_corpus():
  training_corpus = (
    raw_datasets['train'][i : i + 1000]['whole_func_string']  # slice a batch of 1000 rows
    for i in range(0, len(raw_datasets['train']), 1000)       # start at 0, step by 1000
  )
  return training_corpus

training_corpus = get_training_corpus()

In [None]:
# Alternative version to create corpus
def get_training_corpus():
  dataset = raw_datasets["train"]
  for start_idx in range(0, len(dataset), 1000):
    samples = dataset[start_idx : start_idx + 1000]
    yield samples["whole_func_string"]
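
The batching pattern above can be tried without downloading the dataset. This is a minimal sketch on a toy list; `batched` and `toy_corpus` are hypothetical names standing in for the generator function and the `whole_func_string` column. It also shows that a generator object can be consumed only once, which matters if the corpus has to be iterated a second time.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start_idx in range(0, len(items), batch_size):
        yield items[start_idx : start_idx + batch_size]


# Hypothetical stand-in for the `whole_func_string` column.
toy_corpus = [f"def f{i}(): pass" for i in range(7)]

batches = batched(toy_corpus, 3)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [3, 3, 1]

# The generator is now exhausted: a second pass yields nothing.
leftover = list(batches)
print(leftover)  # []

# Calling the function again creates a fresh generator.
first_batch = next(batched(toy_corpus, 3))
print(first_batch)
```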


Training a new tokenizer:

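- Rather than defining a tokenizer from scratch, start from an existing one: it keeps the same algorithm and special tokens, and only the vocabulary is relearned from the new corpus.
- `train_new_from_iterator` - Method that trains a new tokenizer from an iterator of text batches. It is only available on "fast" tokenizers.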
- `AutoTokenizer` will automatically select the "fast" tokenizer if available

In [None]:
old_tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [None]:
example = '''def add_numbers(a, b):
  """Add the two numbers `a` and `b`."""
  return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

In [None]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

In [None]:
tokens = tokenizer.tokenize(example)
tokens

In [None]:
# The new tokenizer needs fewer tokens for the same example than the old one
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

Saving the tokenizer

In [None]:
tokenizer.save_pretrained("code-search-net-tokenizer")

# Can also push to HF repo
tokenizer.push_to_hub("code-search-net-tokenizer")

---

# Fast Tokenizers

- __Slow Tokenizers__ - Written in Python
- __Fast Tokenizers__ - Written in Rust (much faster)

> The speed difference only becomes clearly visible when tokenizing many texts in parallel, as a batch

- _Offset Mapping:_ Keeps track of the original span of texts the final tokens come from
  - Allows accessing more granular details about the subword positions and relationships
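
To make the idea of an offset mapping concrete, here is a minimal sketch that uses a simple regex splitter instead of a real tokenizer. The function `tokenize_with_offsets` is a hypothetical helper; the point is only that each token carries the `(start, end)` character span it came from, so slicing the original text recovers it.

```python
import re


def tokenize_with_offsets(text):
    """Split into words and punctuation, keeping each token's (start, end) span."""
    return [
        (match.group(), (match.start(), match.end()))
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


text = "Hello, world!"
tokens_and_offsets = tokenize_with_offsets(text)
print(tokens_and_offsets)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]

# Slicing the original text with each span gives back the token.
recovered = [text[start:end] for _, (start, end) in tokens_and_offsets]
print(recovered)  # ['Hello', ',', 'world', '!']
```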

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

In [None]:
print(tokenizer.is_fast) # Check if using fast or slow
print(encoding.tokens()) # Access the tokens without converting the IDs back
print(encoding.word_ids()) # Indicates which subwords belong to the same word
start, end = encoding.word_to_chars(3) # Character span of the word with index 3 in the original text
example[start:end]
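
The way word IDs tie subword tokens back to words can be sketched without loading a model. The two lists below are written by hand in the shape that `encoding.tokens()` and `encoding.word_ids()` return; the values are illustrative assumptions, not captured output, and the `##` continuation prefix follows the WordPiece convention used by BERT.

```python
# Hand-written, illustrative values (not real tokenizer output).
tokens = ["[CLS]", "My", "name", "is", "S", "##yl", "##va", "##in", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, None]

# Collect the tokens of each word; special tokens have a word ID of None.
pieces_by_word = {}
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        continue
    pieces_by_word.setdefault(word_id, []).append(token)

# Join the pieces, dropping the "##" prefix on continuation subwords.
words = {
    word_id: "".join(piece.removeprefix("##") for piece in pieces)
    for word_id, pieces in pieces_by_word.items()
}
print(words[3])  # Sylvain
```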

## Token classification pipeline

Understand the post-processing that turns the model's per-token predictions into entities in a _Named Entity Recognition (NER)_ task.

Uses: `dbmdz/bert-large-cased-finetuned-conll03-english`

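- `aggregation_strategy` (how the score of a grouped entity is computed):
  - `"simple"`: Mean of the scores of the tokens in the entity
  - `"max"`: Maximum score among the tokens in the entity
  - `"average"`: Average of the scores of the words that make up the entity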
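
The difference between these scores is easiest to see on numbers. The sketch below is a simplified illustration with made-up token scores; it only computes the final score and ignores how the pipeline picks the label. Treating a word's score as the mean of its token scores is an assumption of this sketch.

```python
from statistics import mean

# Made-up scores for one entity of three tokens spanning two words.
token_scores = [0.90, 0.80, 1.00]
word_ids = [0, 0, 1]  # the first two tokens belong to the same word


def simple_score(scores):
    """Mean over all tokens of the entity."""
    return mean(scores)


def max_score(scores):
    """Highest token score in the entity."""
    return max(scores)


def average_score(scores, word_ids):
    """Mean over words, where each word's score is the mean of its tokens."""
    per_word = {}
    for score, word_id in zip(scores, word_ids):
        per_word.setdefault(word_id, []).append(score)
    return mean([mean(word_scores) for word_scores in per_word.values()])


print(f"simple : {simple_score(token_scores):.3f}")             # 0.900
print(f"max    : {max_score(token_scores):.3f}")                # 1.000
print(f"average: {average_score(token_scores, word_ids):.3f}")  # 0.925
```

For an entity made of a single word, `"simple"` and `"average"` give the same score under this definition.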

In [None]:
import torch

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

In [None]:
# Omit "aggregation_strategy" to get one prediction per subword token instead of grouped entities
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

Making input predictions

In [None]:
checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

example = "My name is Martin and I am a student at the University of Michigan, but staying in Singapore"
inputs = tokenizer(example, return_tensors='pt')
outputs = model(**inputs)

In [None]:
# The logits have one score per class label (9 labels) for every token
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

In [None]:
# Convert to class labels
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
preds = outputs.logits.argmax(dim=-1)[0].tolist()
print(preds)

results = []
tokens = inputs.tokens()
for idx, pred in enumerate(preds):
  label = model.config.id2label[pred]
  if label != "O":
    results.append(
      {'entity': label, 'score': probs[idx][pred], 'word': tokens[idx]}
    )
print(results)
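
The softmax and argmax step can be checked by hand on a tiny example. This is a pure-Python sketch with made-up logits for two tokens and a hypothetical three-label `id2label` mapping; the real model has nine labels and returns tensors rather than lists.

```python
import math

# Hypothetical label mapping and logits (illustrative values only).
id2label = {0: "O", 1: "I-PER", 2: "I-LOC"}
logits = [
    [4.0, 0.5, 0.1],  # token 0
    [0.2, 3.0, 0.3],  # token 1
]


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


probs = [softmax(row) for row in logits]
# The argmax of each row is the predicted class index.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
labels = [id2label[pred] for pred in preds]

print(preds)   # [0, 1]
print(labels)  # ['O', 'I-PER']
```

Softmax does not change which index is largest, so the argmax can be taken on either the logits or the probabilities.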

In [None]:
# Combine individual tokens into full words and calculate the score
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(preds):  # `preds` and `probs` come from the previous cell
  pred = preds[idx]
  label = model.config.id2label[pred]
  if label != "O":
    # Remove the B- or I- prefix
    label = label[2:]
    start, _ = offsets[idx]

    # Grab all the tokens labeled with I-label
    all_scores = []
    while (
      idx < len(preds)
      and model.config.id2label[preds[idx]] == f"I-{label}"
    ):
      all_scores.append(probs[idx][pred])
      _, end = offsets[idx]
      idx += 1

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
      {
        "entity_group": label,
        "score": score,
        "word": word,
        "start": start,
        "end": end,
      }
    )
  idx += 1

print(results)

## Question-Answering pipeline

TBD

---

In [2]:
%watermark

Last updated: 2025-06-18T19:03:45.452311+08:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.31.0

Compiler    : MSC v.1938 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
CPU cores   : 20
Architecture: 64bit

