# Tokenization Tutorial

This tutorial demonstrates how to use the tokenization task in the Continual Pretraining Framework. Tokenization is a crucial step in the NLP pipeline that converts raw text into numerical tokens that can be processed by language models.

## What is Tokenization?

Tokenization is the process of breaking down text into smaller units called tokens. In the context of language models, these tokens are typically words, subwords, or characters that are then converted into numerical IDs using a vocabulary. These numerical representations are what the model actually processes during training and inference.
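As a minimal illustration of the text-to-ID mapping described above, the sketch below uses a whitespace tokenizer and a hand-built toy vocabulary. The names `toy_vocab`, `encode`, and `decode` are hypothetical and not part of the framework. Real tokenizers such as GPT-2's learn a subword vocabulary from data, but the round trip follows the same idea.

```python
# Toy illustration only: a whitespace tokenizer with a hand-built vocabulary.
# Real subword tokenizers learn their vocabulary from data.
toy_vocab = {"<unk>": 0, "the": 1, "quick": 2, "brown": 3, "fox": 4}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(text: str) -> list[int]:
    """Split on whitespace and map each lowercased token to its ID (0 if unknown)."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in text.lower().split()]


def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("The quick brown fox jumps")
print(ids)          # [1, 2, 3, 4, 0]
print(decode(ids))  # the quick brown fox <unk>
```

The word "jumps" is not in the toy vocabulary, so it maps to the unknown-token ID. Subword tokenizers avoid most such cases by splitting rare words into smaller known pieces.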

## Why is Tokenization Important?

- **Model Input Preparation**: Language models don't understand raw text; they need numerical inputs.
- **Vocabulary Management**: Tokenization helps manage the size of the vocabulary the model needs to learn.
- **Context Length Control**: The tokenizer determines how many tokens a text occupies, and therefore how much text fits within the model's context length (sequence length).
- **Performance Optimization**: Efficient tokenization, with little padding and few wasted tokens, can improve training throughput and may also affect model quality.

In this tutorial, we'll walk through the process of tokenizing a dataset using the Continual Pretraining Framework's tokenization task.

## Setup

First, let's import the necessary modules and set up our environment.

In [None]:
# Import necessary libraries
import os
import sys
from box import Box
from datasets import load_dataset, Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

# Add the project root to the Python path to import modules from src
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import framework modules
from src.tasks.tokenization import execute
from src.tasks.tokenization.orchestrator import TokenizationOrchestrator
from src.tasks.tokenization.tokenizer import CausalLMTokenizer
from src.tasks.tokenization.tokenizer.config import TokenizerConfig
from src.utils.logging import VerboseLevel, get_logger

# Set up logging
logger = get_logger(__name__, VerboseLevel.INFO)

## 1. Understanding the Tokenization Configuration

The tokenization process in the framework is controlled by a configuration object. Let's explore the key parameters:

In [None]:
# Let's examine the TokenizerConfig class
help(TokenizerConfig)

The main parameters for tokenization are:

- **context_length**: Maximum sequence length (in tokens) for the model input
- **overlap**: Number of tokens shared between consecutive sequences when a long text is split into several sequences
- **tokenizer_name**: Name or path of the HuggingFace tokenizer to use
- **batch_size**: Number of examples to process at once
- **num_proc**: Number of processes for parallel processing
- **show_progress**: Whether to display progress bars
- **verbose_level**: Logging verbosity level
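
The interaction between `context_length` and `overlap` can be sketched as follows. This is a minimal sketch under the assumption that consecutive sequences advance by `context_length - overlap` tokens. `WindowSettings` and its `stride` property are hypothetical names used for illustration, not the framework's `TokenizerConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowSettings:
    """Hypothetical stand-in for the two windowing parameters (not the framework class)."""

    context_length: int
    overlap: int = 0

    def __post_init__(self) -> None:
        if self.context_length <= 0:
            raise ValueError("context_length must be positive")
        # If overlap >= context_length, the window would never move forward.
        if not 0 <= self.overlap < self.context_length:
            raise ValueError("overlap must satisfy 0 <= overlap < context_length")

    @property
    def stride(self) -> int:
        """Number of new tokens each consecutive sequence contributes (assumed)."""
        return self.context_length - self.overlap


settings = WindowSettings(context_length=512, overlap=128)
print(settings.stride)  # 384
```

Under this assumption, an overlap of 128 with a context length of 512 means every sequence after the first repeats 128 tokens and adds 384 new ones.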

Now, let's create a configuration for our tokenization task:

In [None]:
# Create a tokenizer configuration
tokenizer_config = TokenizerConfig(
    context_length=512,  # Maximum sequence length
    overlap=128,         # Overlap between sequences
    tokenizer_name="gpt2",  # Using GPT-2 tokenizer
    batch_size=32,       # Process 32 examples at once
    num_proc=2,          # Use 2 processes
    show_progress=True,  # Show progress bars
    verbose_level=VerboseLevel.INFO  # Set logging level
)

print(f"Tokenizer configuration:\n- Context length: {tokenizer_config.context_length}")
print(f"- Overlap: {tokenizer_config.overlap}")
print(f"- Tokenizer: {tokenizer_config.tokenizer_name}")

## 2. Preparing a Sample Dataset

For this tutorial, we'll create a small sample dataset. In a real-world scenario, you might load a dataset from a file or use the Hugging Face datasets library.

In [None]:
# Create a simple sample dataset
sample_texts = [
    "The quick brown fox jumps over the lazy dog. This sentence is often used as a pangram because it contains all the letters of the English alphabet.",
    "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.",
    "Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters.",
    "Large language models like GPT-3 and GPT-4 have billions of parameters and are trained on massive datasets of text from the internet.",
    "The Continual Pretraining Framework provides tools for efficient tokenization, training, and deployment of language models."
]

# Convert to a Hugging Face Dataset
sample_dataset = Dataset.from_dict({"text": sample_texts})
print(f"Sample dataset size: {len(sample_dataset)} examples")
print("\nSample example:")
print(sample_dataset[0])

## 3. Exploring the Tokenizer

Before we tokenize the entire dataset, let's explore how the tokenizer works on a single example:

In [None]:
# Initialize the tokenizer directly
tokenizer = AutoTokenizer.from_pretrained(tokenizer_config.tokenizer_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no padding token, so reuse the end-of-text token

# Tokenize a single example
example_text = sample_texts[0]
encoded = tokenizer(example_text, return_tensors="pt")

print(f"Original text: {example_text}")
print(f"\nTokenized to {len(encoded['input_ids'][0])} tokens")
print(f"Input IDs: {encoded['input_ids'][0][:10].tolist()}...")

# Decode back to text
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"\nDecoded text: {decoded}")

# Visualize token to text mapping
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print("\nFirst 20 tokens:")
for i, token in enumerate(tokens[:20]):
    print(f"{i}: {token}")
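
GPT-2 uses a byte-level BPE vocabulary, so the tokens printed by `convert_ids_to_tokens` mark a leading space with `Ġ` and a newline with `Ċ`. The sketch below makes such output easier to read. The token list is hard-coded for illustration rather than produced by the cell above, and `readable` is a hypothetical helper, not a `transformers` function.

```python
# Byte-level BPE (as used by GPT-2) shows a leading space as "Ġ" and a newline as "Ċ".
# The token list below is hard-coded for illustration.
example_tokens = ["The", "Ġquick", "Ġbrown", "Ġfox", ".", "Ċ"]


def readable(token: str) -> str:
    """Replace byte-level markers with visible placeholders for display."""
    return token.replace("Ġ", "␣").replace("Ċ", "\\n")


for position, token in enumerate(example_tokens):
    print(f"{position}: {readable(token)}")

# Joining the tokens and mapping the markers back recovers the original text.
text = "".join(example_tokens).replace("Ġ", " ").replace("Ċ", "\n")
print(repr(text))  # 'The quick brown fox.\n'
```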

## 4. Using the CausalLMTokenizer

Now, let's use the framework's `CausalLMTokenizer` to tokenize our sample dataset:

In [None]:
# Initialize the CausalLMTokenizer with our configuration
causal_tokenizer = CausalLMTokenizer(tokenizer_config)

# Tokenize the sample dataset
tokenized_dataset = causal_tokenizer.tokenize(sample_dataset)

# Examine the tokenized dataset
print(f"Tokenized dataset features: {tokenized_dataset.features}")
print(f"Number of examples: {len(tokenized_dataset)}")

# Display a tokenized example
example = tokenized_dataset[0]
print("\nExample tokenized data:")
print(f"- Input IDs length: {len(example['input_ids'])}")
print(f"- Attention mask length: {len(example['attention_mask'])}")
print(f"- Labels length: {len(example['labels'])}")

# Show the first few tokens
print("\nFirst 10 input IDs:", example['input_ids'][:10])
print("First 10 attention mask values:", example['attention_mask'][:10])
print("First 10 labels:", example['labels'][:10])
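
The three fields inspected above follow a common convention for causal language modeling. The sketch below shows that convention in isolation, assuming right-padding and the Hugging Face practice of marking ignored label positions with `-100`. Whether `CausalLMTokenizer` pads or masks in exactly this way is an assumption, and `build_example` is a hypothetical helper. The token IDs are arbitrary; `50256` is GPT-2's end-of-text ID, reused as padding as in section 3.

```python
IGNORE_INDEX = -100  # label value that Hugging Face causal LM losses skip


def build_example(token_ids: list[int], context_length: int, pad_id: int) -> dict:
    """Right-pad one sequence and derive attention_mask and labels from it."""
    token_ids = token_ids[:context_length]
    n_pad = context_length - len(token_ids)
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
        # Labels copy the inputs; the model shifts them internally
        # so that each position predicts the next token.
        "labels": token_ids + [IGNORE_INDEX] * n_pad,
    }


padded = build_example([101, 102, 103], context_length=5, pad_id=50256)
print(padded["input_ids"])       # [101, 102, 103, 50256, 50256]
print(padded["attention_mask"])  # [1, 1, 1, 0, 0]
print(padded["labels"])          # [101, 102, 103, -100, -100]
```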

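## 5. Understanding Tokenization with Overlap

The framework supports overlapping sequences, which is useful for processing long texts: when a text does not fit in a single sequence of `context_length` tokens, consecutive sequences share `overlap` tokens so that context is preserved across sequence boundaries. Let's explore how this works: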

In [None]:
# Create a long text example
long_text = " ".join([sample_texts[i % len(sample_texts)] for i in range(20)])
print(f"Long text length: {len(long_text)} characters")

# Create a dataset with the long text
long_dataset = Dataset.from_dict({"text": [long_text]})

# Tokenize with different overlap settings
results = {}
for overlap in [0, 64, 128]:
    config = TokenizerConfig(
        context_length=256,
        overlap=overlap,
        tokenizer_name="gpt2",
        batch_size=1,
        show_progress=True
    )
    # Use a separate name so the Hugging Face `tokenizer` from section 3 is not overwritten
    overlap_tokenizer = CausalLMTokenizer(config)
    tokenized = overlap_tokenizer.tokenize(long_dataset)
    results[overlap] = {
        "num_examples": len(tokenized),
        "first_example": tokenized[0]
    }

# Compare results
print("\nComparison of different overlap settings:")
for overlap, result in results.items():
    print(f"\nOverlap = {overlap}:")
    print(f"- Generated {result['num_examples']} examples")

    # Decode the first 20 tokens of the first example with the
    # Hugging Face tokenizer created in section 3
    if result['num_examples'] > 0:
        first_example = result['first_example']
        decoded = tokenizer.decode(first_example['input_ids'][:20])
        print(f"- First 20 tokens decode to: '{decoded}...'")
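
To see why a larger overlap yields more sequences, the windowing can be reproduced on a plain list of token IDs. This is a sketch of a standard sliding-window split, assuming each sequence advances by `context_length - overlap` tokens. `split_with_overlap` is a hypothetical helper, and the framework's handling of the final partial window may differ.

```python
def split_with_overlap(token_ids: list[int], context_length: int, overlap: int) -> list[list[int]]:
    """Split token_ids into windows of at most context_length tokens.

    Consecutive windows share `overlap` tokens. The last window may be shorter.
    """
    if not 0 <= overlap < context_length:
        raise ValueError("overlap must satisfy 0 <= overlap < context_length")
    stride = context_length - overlap
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + context_length])
        # Stop once a window reaches the end, so no window only repeats the tail.
        if start + context_length >= len(token_ids):
            break
    return windows


token_ids = list(range(10))
for overlap in (0, 2):
    windows = split_with_overlap(token_ids, context_length=4, overlap=overlap)
    print(overlap, windows)
# 0 [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# 2 [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With no overlap, ten tokens produce three windows; with an overlap of two, the same tokens produce four, because each window after the first contributes only two new tokens.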

## 6. Using the TokenizationOrchestrator

The framework provides a `TokenizationOrchestrator` that handles the complete tokenization workflow. Let's see how to use it:

In [None]:
# Create a temporary directory for output
import tempfile
output_dir = tempfile.mkdtemp()
print(f"Output directory: {output_dir}")

# Create a configuration for the orchestrator
config = Box({
    "tokenizer": {
        "tokenizer_name": "gpt2",
        "context_length": 512,
        "overlap": 128,
        "batch_size": 32,
        "show_progress": True
    },
    "dataset": {
        "in_memory": True,  # We're providing the dataset directly
    },
    "output": {
        "path": os.path.join(output_dir, "tokenized_dataset")
    },
    "verbose_level": VerboseLevel.INFO,
    "task": "clm_training"
})

# Create the orchestrator
orchestrator = TokenizationOrchestrator(config)

# We need to provide the dataset since we're not loading it from disk
# In a real scenario, the orchestrator would load the dataset based on config
orchestrator.load_dataset = lambda: sample_dataset

# Execute the tokenization workflow
orchestrator.execute()

print(f"\nTokenized dataset saved to: {config.output.path}")

## 7. Loading and Verifying the Tokenized Dataset

Now, let's load the tokenized dataset from disk and verify its contents:

In [None]:
# Load the tokenized dataset from disk
from datasets import load_from_disk

loaded_dataset = load_from_disk(config.output.path)
print(f"Loaded dataset features: {loaded_dataset.features}")
print(f"Number of examples: {len(loaded_dataset)}")

# Verify the first example
example = loaded_dataset[0]
print("\nFirst example:")
print(f"- Input IDs length: {len(example['input_ids'])}")
print(f"- First 10 input IDs: {example['input_ids'][:10]}")

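# Decode the first example. Reload the Hugging Face tokenizer so that `tokenizer`
# refers to an object with a `decode` method regardless of earlier cells.
tokenizer = AutoTokenizer.from_pretrained(config.tokenizer.tokenizer_name)
decoded_text = tokenizer.decode(example['input_ids'])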
print(f"\nDecoded text from first example:\n{decoded_text[:200]}...")

## 8. Analyzing Token Distribution

Let's analyze the distribution of tokens in our tokenized dataset:

In [None]:
# Collect token statistics
token_counts = {}
sequence_lengths = []

for example in loaded_dataset:
    # Count non-padding tokens
    non_padding = sum(example['attention_mask'])
    sequence_lengths.append(non_padding)
    
    # Count individual tokens
    for token_id in example['input_ids']:
        if token_id in token_counts:
            token_counts[token_id] += 1
        else:
            token_counts[token_id] = 1

# Plot sequence length distribution
plt.figure(figsize=(10, 6))
plt.hist(sequence_lengths, bins=20)
plt.title('Distribution of Sequence Lengths')
plt.xlabel('Number of Tokens')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

# Plot most common tokens
top_tokens = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)[:20]
token_ids, counts = zip(*top_tokens)
token_texts = [tokenizer.decode([tid]) for tid in token_ids]

plt.figure(figsize=(12, 6))
plt.bar(range(len(token_texts)), counts)
plt.xticks(range(len(token_texts)), token_texts, rotation=45, ha='right')
plt.title('Most Common Tokens')
plt.xlabel('Token')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
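
The same statistics can also be summarized numerically. The sketch below is a standalone version that runs on a small hand-written list of rows (the `examples` data is made up for illustration, and `summarize` is a hypothetical helper). In the notebook, `loaded_dataset` could be passed in its place, assuming each row exposes `input_ids` and `attention_mask` as above. Unlike the cell above, it skips padded positions when counting tokens.

```python
from collections import Counter
from statistics import mean


def summarize(examples):
    """Return sequence-length stats and token counts, ignoring padded positions."""
    lengths = []
    counts = Counter()
    for row in examples:
        mask = row["attention_mask"]
        lengths.append(sum(mask))
        # Count only tokens whose attention mask is 1, so padding does not dominate.
        counts.update(tid for tid, keep in zip(row["input_ids"], mask) if keep)
    total_slots = sum(len(row["input_ids"]) for row in examples)
    return {
        "mean_length": mean(lengths),
        "padding_fraction": 1 - sum(lengths) / total_slots,
        "top_tokens": counts.most_common(3),
    }


# Made-up rows for illustration; 0 is used as the padding ID here.
examples = [
    {"input_ids": [5, 7, 7, 0], "attention_mask": [1, 1, 1, 0]},
    {"input_ids": [7, 9, 0, 0], "attention_mask": [1, 1, 0, 0]},
]
stats = summarize(examples)
print(stats["mean_length"])       # 2.5
print(stats["padding_fraction"])  # 0.375
print(stats["top_tokens"][0])     # (7, 3)
```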

## 9. Using the Execute Function

The framework provides a simple `execute` function that can be used to run the tokenization task with a configuration. Let's see how to use it:

In [None]:
# Create a new output directory
new_output_dir = tempfile.mkdtemp()
print(f"New output directory: {new_output_dir}")

# Create a configuration for the execute function
execute_config = Box({
    "tokenizer": {
        "tokenizer_name": "gpt2",
        "context_length": 256,
        "overlap": 64,
        "batch_size": 32,
        "show_progress": True
    },
    "dataset": {
        "path": "wikitext",  # Using a HuggingFace dataset
        "name": "wikitext-2-raw-v1",
        "split": "test",
        "text_field": "text"
    },
    "output": {
        "path": os.path.join(new_output_dir, "wikitext_tokenized")
    },
    "verbose_level": VerboseLevel.INFO,
    "task": "clm_training"
})

# `execute` was already imported in the Setup cell. To run the task for real
# (which would fetch the WikiText-2 dataset configured above), uncomment the next line:
# execute(execute_config)

# For this tutorial, we'll just print the configuration
print("\nConfiguration for execute function:")
for key, value in execute_config.items():
    if isinstance(value, dict):
        print(f"\n{key}:")
        for k, v in value.items():
            print(f"  {k}: {v}")
    else:
        print(f"{key}: {value}")

## 10. Performance Considerations

Tokenization can be a performance bottleneck when processing large datasets. Here are some tips for optimizing tokenization performance:

### Fast vs. Slow Tokenizers

The framework automatically detects whether a tokenizer is "fast" (Rust-based) or "slow" (Python-based) and optimizes accordingly:

- **Fast tokenizers** use internal Rust parallelism and don't benefit from Python multiprocessing
- **Slow tokenizers** benefit from Python multiprocessing with `num_proc > 1`

### Batch Size

Batch size affects memory usage and processing speed:

- Larger batch sizes generally improve throughput but require more memory
- The default batch size is 2000, which works well for most cases
- For very large documents, you might need to reduce the batch size
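
For intuition, batching simply groups examples before they are handed to the tokenizer. The sketch below is a generic illustration (`iter_batches` is a hypothetical helper, not the framework's implementation) of how the batch size determines the number of tokenizer calls.

```python
import math
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"document {i}" for i in range(500)]
for batch_size in (10, 100, 1000):
    n_batches = sum(1 for _ in iter_batches(texts, batch_size))
    # Fewer, larger batches mean fewer calls but more text held in memory per call.
    assert n_batches == math.ceil(len(texts) / batch_size)
    print(batch_size, n_batches)
# 10 50
# 100 5
# 1000 1
```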

### Number of Processes

The `num_proc` parameter controls parallelism:

- For fast tokenizers, `num_proc=None` is optimal (uses internal Rust parallelism)
- For slow tokenizers, setting `num_proc` to half the available CPU cores is a good starting point
- Setting `num_proc=1` forces single-process mode
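
The guidance above can be written down as a small helper. This is a sketch of the rule of thumb only, not the framework's detection logic. `suggest_num_proc` is a hypothetical function, and the `is_fast` flag would in practice come from the Hugging Face tokenizer's `is_fast` attribute.

```python
import os
from typing import Optional


def suggest_num_proc(is_fast: bool, cpu_count: Optional[int] = None) -> Optional[int]:
    """Suggest a num_proc value following the rules of thumb listed above."""
    if is_fast:
        # Fast (Rust-based) tokenizers parallelize internally, so leave num_proc unset.
        return None
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Slow tokenizers: start with half the available cores, but never fewer than one.
    return max(1, cores // 2)


print(suggest_num_proc(is_fast=True))                # None
print(suggest_num_proc(is_fast=False, cpu_count=8))  # 4
print(suggest_num_proc(is_fast=False, cpu_count=1))  # 1
```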

Let's measure tokenization performance with different configurations:

In [None]:
# Create a larger dataset for performance testing
large_texts = sample_texts * 100  # Repeat the sample texts 100 times
large_dataset = Dataset.from_dict({"text": large_texts})
print(f"Large dataset size: {len(large_dataset)} examples")

# Measure tokenization performance with different configurations
import time

performance_results = []

# Test different batch sizes
for batch_size in [10, 100, 1000]:
    config = TokenizerConfig(
        context_length=512,
        overlap=128,
        tokenizer_name="gpt2",
        batch_size=batch_size,
        num_proc=None,  # Let the tokenizer decide
        show_progress=True
    )

    # Use a separate name so the Hugging Face `tokenizer` is not overwritten
    perf_tokenizer = CausalLMTokenizer(config)

    # Measure tokenization time with a monotonic, high-resolution clock
    start_time = time.perf_counter()
    tokenized = perf_tokenizer.tokenize(large_dataset)
    elapsed_time = time.perf_counter() - start_time
    
    performance_results.append({
        "batch_size": batch_size,
        "num_proc": "auto",
        "elapsed_time": elapsed_time,
        "examples_per_second": len(large_dataset) / elapsed_time
    })

# Display performance results
print("\nPerformance results:")
for result in performance_results:
    print(f"Batch size: {result['batch_size']}, Num proc: {result['num_proc']}")
    print(f"  Time: {result['elapsed_time']:.2f} seconds")
    print(f"  Throughput: {result['examples_per_second']:.2f} examples/sec")

# Plot performance results
plt.figure(figsize=(10, 6))
batch_sizes = [r['batch_size'] for r in performance_results]
throughputs = [r['examples_per_second'] for r in performance_results]
plt.bar(range(len(batch_sizes)), throughputs)
plt.xticks(range(len(batch_sizes)), [f"Batch={bs}" for bs in batch_sizes])
plt.title('Tokenization Performance by Batch Size')
plt.xlabel('Configuration')
plt.ylabel('Examples per Second')
plt.grid(True, alpha=0.3)
plt.show()

## Conclusion

In this tutorial, we've explored the tokenization task in the Continual Pretraining Framework. We've learned how to:

1. Configure the tokenizer with appropriate parameters
2. Tokenize datasets using the CausalLMTokenizer
3. Use the TokenizationOrchestrator for end-to-end tokenization workflow
4. Analyze tokenized data and understand token distributions
5. Optimize tokenization performance

Tokenization is a critical step in the language model training pipeline, and the framework provides efficient tools to handle this task at scale. The tokenized datasets produced by this process can be directly used for causal language model training in the next step of the pipeline.

## Next Steps

After tokenization, the next steps in the Continual Pretraining Framework are:

1. **CLM Training**: Train a causal language model on the tokenized dataset
2. **Publishing**: Publish the trained model for use in downstream tasks

Check out the other tutorials in this series to learn more about these tasks!