# Tokenization Tutorial

This tutorial demonstrates how to use the tokenization task in the Continual Pretraining Framework. Tokenization is a crucial step in the NLP pipeline: it converts raw text into numerical token IDs that language models can process.

## What is Tokenization?

Tokenization is the process of breaking down text into smaller units called tokens. In the context of language models, these tokens are typically words, subwords, or characters that are then converted into numerical IDs using a vocabulary. These numerical representations are what the model actually processes during training and inference.
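To make the text-to-ID mapping concrete, here is a minimal sketch using a toy word-level vocabulary. The vocabulary and the `toy_encode` and `toy_decode` helpers are invented for illustration; real tokenizers such as GPT-2's use learned subword vocabularies, and this is not the framework's implementation.

```python
# Toy word-level tokenizer: illustrative only, not the framework's implementation.
TOY_VOCAB = {"<unk>": 0, "the": 1, "model": 2, "reads": 3, "tokens": 4}


def toy_encode(text, vocab=TOY_VOCAB):
    """Split on whitespace and map each lowercased word to its ID (0 if unknown)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def toy_decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to their vocabulary strings."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


ids = toy_encode("The model reads tokens quickly")
print(ids)              # [1, 2, 3, 4, 0]
print(toy_decode(ids))  # the model reads tokens <unk>
```

The word "quickly" is not in the toy vocabulary, so it maps to the unknown-token ID. Subword tokenizers reduce this problem by splitting rare words into smaller known pieces.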

## Why is Tokenization Important?

- **Model Input Preparation**: Language models don't understand raw text; they need numerical inputs.
- **Vocabulary Management**: Tokenization helps manage the size of the vocabulary the model needs to learn.
- **Context Length Control**: Sequence length is measured in tokens, so tokenization determines how much text fits in the model's context length.
- **Performance Optimization**: Proper tokenization can significantly improve training efficiency and model performance.

In this tutorial, we'll walk through the process of tokenizing a dataset using the Continual Pretraining Framework's tokenization task.

# =============================================================================

## Load the Tokenization Config File

First, let's import the necessary modules and set up our environment.

In [1]:
import yaml
from box import Box


# Load the YAML config file for tokenization
with open("/workspace/tutorials/configs/tokenization_config.yaml", "r") as f:
    tokenization_config = Box(yaml.safe_load(f), default_box=True)

print("Loaded tokenization config:")
print(tokenization_config)

Loaded tokenization config:
{'task': 'tokenization', 'experiment_name': 'tutorial_tokenization', 'verbose_level': 4, 'tokenizer': {'name': 'gpt2', 'use_fast': True, 'task': 'clm_training', 'context_length': 1024, 'overlap': 256, 'batch_size': 1024, 'num_proc': 2, 'show_progress': True}, 'dataset': {'source': 'local', 'path': 'tutorials/data/raw_text_data', 'format': 'text'}, 'output': {'path': 'tutorials/data/sample_tokenized_dataset', 'format': 'hf', 'split': True, 'train_size': 1, 'valid_size': 0.0, 'shuffle': True, 'seed': 42}}


# =============================================================================

## Main Parameters

The main parameters for tokenization are:

- **context_length**: Maximum sequence length (in tokens) for the model input
- **overlap**: Number of tokens to overlap between sequences when processing long texts
- **name**: Name or path of the Hugging Face tokenizer to use (`gpt2` in the config above, under the `tokenizer` section)
- **batch_size**: Number of examples to process at once
- **num_proc**: Number of processes for parallel processing
- **show_progress**: Whether to display progress bars
- **verbose_level**: Logging verbosity level
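The interaction between `context_length` and `overlap` can be pictured as a sliding window over a long token sequence. The sketch below is one common way to implement such a window; the tutorial does not show the framework's own chunking code, so `chunk_with_overlap` is a hypothetical helper and its behavior is an assumption.

```python
def chunk_with_overlap(token_ids, context_length, overlap):
    """Split token_ids into windows of at most context_length tokens.

    Each window after the first repeats the last `overlap` tokens of the
    previous window. Hypothetical helper, not the framework's code.
    """
    if not 0 <= overlap < context_length:
        raise ValueError("overlap must satisfy 0 <= overlap < context_length")
    stride = context_length - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + context_length])
        if start + context_length >= len(token_ids):
            break  # this window already reached the end of the sequence
        start += stride
    return chunks


toy_ids = list(range(10))
for chunk in chunk_with_overlap(toy_ids, context_length=4, overlap=2):
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Under this scheme, the values in the tutorial config (`context_length: 1024`, `overlap: 256`) would advance each window by 768 tokens.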

# =============================================================================

## Run Tokenization on "El Quijote"


In [8]:
import os
os.chdir("/workspace")
!python src/main.py --config tutorials/configs/tokenization_config.yaml

2025-06-16 13:42:46 - src.utils.orchestrator - INFO - Starting tokenization workflow
2025-06-16 13:42:46 - src.utils.orchestrator - INFO - Loading dataset from files at dir 'tutorials/data/raw_text_data'
2025-06-16 13:42:46 - src.utils.dataset.storage - INFO - Processing files from 'tutorials/data/raw_text_data' and grouping by file extension.
2025-06-16 13:42:46 - src.utils.dataset.storage - INFO - Starting directory scan in: tutorials/data/raw_text_data
2025-06-16 13:42:46 - src.utils.dataset.storage - DEBUG - Directory: tutorials/data/raw_text_data - Found 1/1 files with supported extensions ['txt', 'csv', 'json', 'jsonl']
2025-06-16 13:42:46 - src.utils.dataset.storage - DEBUG - Scan completed: Found 1 matching files across 1 directories
2025-06-16 13:42:46 - src.utils.dataset.storage - DEBUG - Grouped files by extensions: ['txt (1)']

# =============================================================================

## Inspect Tokenized Dataset

In [12]:
from datasets import Dataset, DatasetDict, load_from_disk

# Load the tokenized dataset from the output path set in the config
tokenized_path = tokenization_config.output.path
dataset = load_from_disk(tokenized_path)

# Check if it's a DatasetDict (multiple splits) or Dataset (single split)

if isinstance(dataset, DatasetDict):
    print("Tokenized dataset splits:", list(dataset.keys()))
    print("Number of examples in train split:", len(dataset["train"]))
    print("First example from train split:", dataset["train"][0])
elif isinstance(dataset, Dataset):
    print("Single split dataset loaded.")
    print("Number of examples:", len(dataset))
    print("First example:", dataset[0])
else:
    print("Unknown dataset type:", type(dataset))

Single split dataset loaded.
Number of examples: 37453
First example: {'input_ids': [9527, 27016, 4267, 78, 289, 11624, 2188, 836, 2264, 2926, 1258, 390, 8591, 1869, 11693, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5
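In the output above, the first example starts with 15 distinct IDs followed by a long run of `50256`, which is GPT-2's `<|endoftext|>` token ID. This suggests the sequence was padded with that token, although the tutorial does not state so explicitly. A minimal sketch for separating content from trailing padding, under that assumption (`split_content_and_padding` is a hypothetical helper):

```python
GPT2_EOS_ID = 50256  # GPT-2's <|endoftext|> token ID


def split_content_and_padding(input_ids, pad_id=GPT2_EOS_ID):
    """Return (content_ids, n_pad), treating the trailing run of pad_id as padding."""
    end = len(input_ids)
    while end > 0 and input_ids[end - 1] == pad_id:
        end -= 1
    return input_ids[:end], len(input_ids) - end


# Leading IDs copied from the output above; the padding run is shortened
# to 20 tokens here purely for illustration.
example = [9527, 27016, 4267, 78, 289, 11624, 2188, 836, 2264, 2926,
           1258, 390, 8591, 1869, 11693] + [GPT2_EOS_ID] * 20

content, n_pad = split_content_and_padding(example)
print(len(content), n_pad)  # 15 20
```

Because the same ID serves as both end-of-text marker and padding in this sketch, a genuine end-of-text token at the end of the content would also be counted as padding.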

# =============================================================================

## Performance Considerations

Tokenization can be a performance bottleneck when processing large datasets. Here are some tips for optimizing tokenization performance:

### Fast vs. Slow Tokenizers

The framework automatically detects whether a tokenizer is "fast" (Rust-based) or "slow" (Python-based) and optimizes accordingly:

- **Fast tokenizers** use internal Rust parallelism and don't benefit from Python multiprocessing
- **Slow tokenizers** benefit from Python multiprocessing with `num_proc > 1`

### Batch Size

Batch size affects memory usage and processing speed:

- Larger batch sizes generally improve throughput but require more memory
- The default batch size is 2000, which works well for most cases (the config in this tutorial sets `batch_size` to 1024)
- For very large documents, you might need to reduce the batch size
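The memory trade-off comes from how many examples are held and processed together. As a rough illustration, the sketch below slices a list of documents into batches; `iter_batches` is a hypothetical helper, not the framework's batching code.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


docs = [f"document {i}" for i in range(10)]
print([len(batch) for batch in iter_batches(docs, batch_size=4)])  # [4, 4, 2]
```

A smaller `batch_size` means more iterations with fewer documents in memory at once, which is why reducing it can help when individual documents are very large.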

### Number of Processes

The `num_proc` parameter controls parallelism:

- For fast tokenizers, `num_proc=None` is optimal (uses internal Rust parallelism)
- For slow tokenizers, setting `num_proc` to half the available CPU cores is a good starting point
- Setting `num_proc=1` forces single-process mode
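The guidance above can be condensed into a small decision helper. This is a sketch only: `suggest_num_proc` is a hypothetical function, not part of the framework, which performs its own detection. In Hugging Face Transformers, a tokenizer's `is_fast` property reports whether it is Rust-based, and that value could be passed in as the first argument.

```python
import os


def suggest_num_proc(is_fast_tokenizer, cpu_count=None):
    """Suggest a num_proc value following the rules of thumb above.

    Hypothetical helper, not part of the framework.
    """
    if is_fast_tokenizer:
        return None  # rely on the tokenizer's internal Rust parallelism
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores // 2)  # half the cores, never below one process


print(suggest_num_proc(True))                # None
print(suggest_num_proc(False, cpu_count=8))  # 4
print(suggest_num_proc(False, cpu_count=1))  # 1
```

Half the cores is only a starting point; the best value depends on the dataset and on what else is running on the machine.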

# =============================================================================

## Conclusion

In this tutorial, we've explored the tokenization task in the Continual Pretraining Framework. We've learned how to:

1. Configure the tokenizer with appropriate parameters
2. Tokenize datasets using the CausalLMTokenizer
3. Use the TokenizationOrchestrator for end-to-end tokenization workflow
4. Analyze tokenized data and understand token distributions
5. Optimize tokenization performance

Tokenization is a critical step in the language model training pipeline, and the framework provides efficient tools to handle this task at scale. The tokenized datasets produced by this process can be directly used for causal language model training in the next step of the pipeline.

## Next Steps

After tokenization, the next steps in the Continual Pretraining Framework are:

1. **CLM Training**: Train a causal language model on the tokenized dataset
2. **Publishing**: Publish the trained model for use in downstream tasks

Check out the other tutorials in this series to learn more about these tasks!