# Training a Custom BPE Tokenizer for Python Code

This notebook demonstrates how to train a custom Byte Pair Encoding (BPE) tokenizer for Python code. The tokenizer is trained on a corpus of about 412,000 Python functions. On the example used in this notebook, it encodes the same function in fewer tokens than the general-purpose GPT-2 tokenizer (27 versus 36).

## What you'll learn:
- How to load and process Python code datasets
- How to configure and train a BPE tokenizer
- How to compare tokenizers and evaluate their performance
- Best practices for tokenizing code (preserving syntax, case sensitivity, etc.)

## Dataset:
The notebook uses the Python code dataset from [Zenodo](https://zenodo.org/records/7908468), which contains Python functions from GitHub repositories.


In [2]:
# Install required packages for tokenizer training
# transformers: Hugging Face library for tokenizers and models
# datasets: Library for loading and processing datasets
!pip install transformers datasets -q

In [3]:
# Import transformers and datasets libraries
# AutoTokenizer: For loading pre-trained tokenizers
# load_dataset: For loading datasets (not used directly here, but useful for other datasets)
from transformers import AutoTokenizer
from datasets import load_dataset

In [4]:
# Import tokenizers library components for building custom BPE tokenizer
# Tokenizer: Base tokenizer class
# models: Tokenizer models (BPE, WordPiece, etc.)
# normalizers: Text normalization (lowercasing, unicode normalization, etc.)
# pre_tokenizers: Pre-tokenization strategies (ByteLevel, Whitespace, etc.)
# trainers: Training algorithms (BpeTrainer, WordPieceTrainer, etc.)
# processors: Post-processing (adding special tokens, etc.)
# decoders: Decoding strategies for converting token IDs back to text
# PreTrainedTokenizerFast: Wrapper to make tokenizer compatible with transformers library
from tokenizers import (
    Tokenizer,
    models,
    normalizers,
    pre_tokenizers,
    trainers,
    processors,
    decoders
)
from transformers import PreTrainedTokenizerFast

In [5]:
# Import os for file system operations (changing directories, etc.)
import os

## Prepare dataset

In [10]:
from google.colab import drive
drive.mount('/content/drive')


ValueError: mount failed

In [6]:
# List files in current directory to verify data location
# Note: This is specific to Google Colab environment
!ls -l

total 4
drwxr-xr-x 1 root root 4096 Dec  9 14:42 sample_data


In [8]:
!ls sample_data/

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


In [7]:
# Change to the directory containing the dataset
# Note: This path is specific to Google Colab. Adjust for your environment.
# When running locally, skip this cell or change to your own data directory
os.chdir("drive/MyDrive/colab")

FileNotFoundError: [Errno 2] No such file or directory: 'drive/MyDrive/colab'

In [20]:
# List files in the data directory to verify python.zip exists
!ls -l data/

total 918858
-rw------- 1 root root 940909997 Dec 30 13:36 python.zip


In [None]:
# Download python.zip from https://zenodo.org/records/7908468 and place it in the data/ directory

In [21]:
# Extract the Python code dataset
# The dataset contains Python functions from GitHub repositories
# Structure: python/final/jsonl/train/ contains training files
!unzip data/python.zip

Archive:  data/python.zip
   creating: python/
   creating: python/final/
   creating: python/final/jsonl/
   creating: python/final/jsonl/train/
  inflating: python/final/jsonl/train/python_train_9.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_12.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_10.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_0.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_6.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_2.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_4.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_8.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_11.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_5.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_13.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_3.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_1.jsonl.gz  
  inflating: pytho

In [22]:
# Import libraries for reading compressed JSON files
import gzip
import json
from pathlib import Path

def generate_examples(filepaths):
    """
    Generator function to read Python code examples from compressed JSONL files.
    
    Args:
        filepaths: List of file paths to JSONL.gz files containing Python code
        
    Yields:
        Dictionary containing parsed data from each line in the JSONL files
    """
    for filepath in filepaths:
        # Open gzipped JSONL file in text mode with UTF-8 encoding
        with gzip.open(filepath, "rt", encoding="utf-8") as f:
            for row in f:
                # Parse each line as JSON
                data = json.loads(row)
                # Yield a structured dictionary with relevant fields
                yield {
                    "repository_name": data["repo"],
                    "func_path_in_repository": data["path"],
                    "func_name": data["func_name"],
                    "whole_func_string": data["original_string"],  # Full function code
                    "language": data["language"],
                    "func_code_string": data["code"],
                    "func_code_tokens": data["code_tokens"],
                    "func_documentation_string": data["docstring"],
                    "func_documentation_tokens": data["docstring_tokens"],
                    "split_name": data["partition"],
                    "func_code_url": data["url"],
                }
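
The reading pattern used by `generate_examples` can be tried without downloading the dataset. The following is a minimal sketch that writes a tiny gzipped JSONL file and reads it back. The two records, the file name `toy.jsonl.gz`, and the helper `read_jsonl_gz` are made up for illustration and only mimic a small subset of the real fields.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical records: not taken from the real dataset.
records = [
    {"func_name": "add", "code": "def add(a, b):\n    return a + b"},
    {"func_name": "neg", "code": "def neg(x):\n    return -x"},
]


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.jsonl.gz"
    # JSONL stores one JSON object per line; json.dumps escapes the
    # newlines inside the code strings, so each record stays on one line.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    loaded = list(read_jsonl_gz(path))

print([r["func_name"] for r in loaded])
```

Because the reader is a generator, only one line is decoded at a time, which is what keeps memory use low on the full-size files.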


In [23]:
# Set the path to the training data directory
train_dir = Path("python/final/jsonl/train")

# Find all training files matching the pattern
# These are compressed JSONL files containing Python code examples
train_files = sorted(train_dir.glob("python_train_*.jsonl.gz"))

# Verify we have all 14 training files
len(train_files)  # should be 14


14

In [24]:
# Import Dataset class from Hugging Face datasets library
from datasets import Dataset

# Create a Hugging Face Dataset from the generator function
# from_generator consumes the generator once and caches the examples on disk in Arrow format,
# so the dataset is memory-mapped from the cache rather than held entirely in RAM
dataset = Dataset.from_generator(
    lambda: generate_examples(train_files)
)

# Display dataset information (number of examples, features, etc.)
dataset



Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

In [25]:
def get_training_corpus():
    """
    Generator function that yields batches of Python code strings for tokenizer training.
    
    This function processes the dataset in batches of 1000 examples to avoid memory issues.
    It extracts only the 'whole_func_string' field which contains the complete function code.
    
    Yields:
        List of function code strings (up to 1000 per batch; the last batch may be shorter)
    """
    # Process dataset in batches of 1000 examples
    for i in range(0, len(dataset), 1000):
        # Slicing a Dataset returns a dict that maps each column name to a list of values
        batch = dataset[i: i+1000]
        # Yield only the function code strings (this is what we'll train the tokenizer on)
        yield batch['whole_func_string']
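
The batching logic can be checked on a plain list. This is a small sketch with a hypothetical helper `batched` and a made-up seven-item corpus; it mirrors the `range(0, len(...), step)` slicing used in `get_training_corpus` but does not depend on the `datasets` library.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in for the real corpus of function strings.
toy_corpus = [f"def f{i}(): pass" for i in range(7)]

batch_sizes = [len(batch) for batch in batched(toy_corpus, 3)]
print(batch_sizes)  # [3, 3, 1]

# A generator can be consumed only once: a second pass yields nothing.
corpus_iter = batched(toy_corpus, 3)
first_pass = list(corpus_iter)
second_pass = list(corpus_iter)
print(len(first_pass), len(second_pass))  # 3 0
```

The second half is the practical point: if training is run again, call the generator function again to get a fresh iterator instead of reusing an exhausted one.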

## Training

In [26]:
# Example Python code to test tokenization
# We'll use this to compare the old (GPT-2) and new (custom BPE) tokenizers
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

In [27]:
# Load the GPT-2 tokenizer as a baseline for comparison
# GPT-2 uses BPE tokenization, so this gives us a reference point
old_tokenizer = AutoTokenizer.from_pretrained('gpt2')


In [28]:
# Tokenize the example with GPT-2 tokenizer to see how it handles Python code
# This shows us the baseline tokenization behavior
old_tokenizer(example).tokens()

['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

In [29]:
# Initialize a new BPE (Byte Pair Encoding) tokenizer
# BPE is a subword tokenization algorithm that learns to merge frequent byte pairs
# This tokenizer will be trained on Python code, so it should be better suited for code
new_tokenizer = Tokenizer(models.BPE())

In [30]:
# Set normalizer to None - we don't want to normalize the Python code
# Normalization (like lowercasing) would break Python syntax
# Python is case-sensitive, so we preserve the original case
new_tokenizer.normalizer = None
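
To see why normalization is skipped, the sketch below applies two common normalizations to small, made-up snippets using only the standard library. It is an illustration of the risk, not a description of what any particular normalizer in the `tokenizers` library does.

```python
import keyword
import unicodedata

source = "flag = True"
lowered = source.lower()  # "flag = true"

# `True` is a reserved keyword; `true` is only an undefined name.
print(keyword.iskeyword("True"), keyword.iskeyword("true"))  # True False

# The lowercased source still parses, but it fails when executed.
try:
    exec(lowered, {})
    lowered_runs = True
except NameError:
    lowered_runs = False
print(lowered_runs)  # False

# Unicode normalization can silently rewrite string literals:
# NFKC folds the ligature U+FB01 into the two letters "fi".
literal = 's = "\ufb01le"'
normalized = unicodedata.normalize("NFKC", literal)
print(len(literal), len(normalized))  # 9 10
```

A tokenizer that normalized its input this way could not reproduce the original source text after decoding.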

In [31]:
# Set the pre-tokenizer to ByteLevel
# ByteLevel maps each UTF-8 byte to a printable character (a space becomes 'Ġ', a newline 'Ċ')
# and splits the text into word-like chunks, so any Unicode input can be represented as byte symbols
# add_prefix_space=False means we don't add a space at the beginning of text
# This is important for code where whitespace matters (Python indentation)
new_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
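
The `Ġ` and `Ċ` symbols in the token listings come from the byte-to-character table introduced with GPT-2. The sketch below reimplements that table in plain Python, on the assumption that the `ByteLevel` components use the same mapping; the helper `to_byte_level` is hypothetical and exists only for this illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already display as visible characters keep their code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are moved past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()


def to_byte_level(text):
    """Render the UTF-8 bytes of `text` in the byte-level alphabet."""
    return "".join(mapping[b] for b in text.encode("utf-8"))


print(to_byte_level("):\n    return"))  # ):ĊĠĠĠĠreturn
```

The space byte (32) lands on `Ġ` (U+0120) and the newline byte (10) on `Ċ` (U+010A), which is why Python indentation shows up as runs of `Ġ` in the token lists.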

In [32]:
# Configure the BPE trainer
# vocab_size=25000: Target vocabulary size (number of subword tokens to learn)
# special_tokens: Special tokens to include in the vocabulary
#   - <|endoftext|>: Marks the end of a text sequence (useful for language modeling)
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])

In [34]:
# Train the BPE tokenizer on the Python code corpus
# This is the main training step where the tokenizer learns subword merges
# The iterator yields batches of Python code strings
# Training may take several minutes depending on dataset size
new_tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
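
What the trainer learns can be illustrated with a toy version of the BPE merge loop. This is a simplified sketch, not the library's implementation: it works on characters rather than bytes, uses a made-up word list, stops after a fixed number of merges instead of a target vocabulary size, and breaks ties by comparing the pairs themselves (an arbitrary choice made here for determinism).

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties fall back to tuple comparison.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical mini-corpus: 3 x "def", 2 x "del", 4 x "self".
toy_words = ["def"] * 3 + ["del"] * 2 + ["self"] * 4
merges, final_vocab = learn_bpe_merges(toy_words, num_merges=3)
print(merges)  # [('e', 'l'), ('s', 'el'), ('sel', 'f')]
```

After three merges the frequent word `self` has become a single symbol, while the rarer `def` is still split into characters. Training on Python code instead of web text has the same effect at scale: frequent code patterns such as indentation runs become single tokens.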

In [35]:
# Set the post-processor to ByteLevel
# The post-processor adjusts the character offsets reported for each byte-level token
# trim_offsets=False: keep the leading whitespace (the 'Ġ' marker) inside each token's offsets instead of trimming it
new_tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

In [36]:
# Set the decoder to ByteLevel
# The decoder converts token IDs back to text
# ByteLevel decoder handles the byte-to-character conversion
new_tokenizer.decoder = decoders.ByteLevel()

In [37]:
# Wrap the tokenizer in PreTrainedTokenizerFast to make it compatible with transformers
# This allows us to use it with Hugging Face models and save/load it easily
# bos_token: Beginning of sequence token (using endoftext token for simplicity)
# eos_token: End of sequence token
# Both strings must match the special token given to the trainer exactly: "<|endoftext|>"
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=new_tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>"
)

In [38]:
# Tokenize the example with our custom-trained tokenizer
# Compare this output with the GPT-2 tokenizer to see the difference
# Our tokenizer should produce fewer tokens for Python code since it was trained on code
wrapped_tokenizer(example).tokens()

['def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

## Save the Tokenizer

Once training is complete, you can save the tokenizer for later use:


In [None]:
# Save the trained tokenizer to disk
# This will create a directory with all necessary files (vocab, merges, config, etc.)
# You can later load it using: PreTrainedTokenizerFast.from_pretrained("./python_code_tokenizer")
# wrapped_tokenizer.save_pretrained("./python_code_tokenizer")


In [39]:
# Compare token counts between GPT-2 and our custom tokenizer
# The custom tokenizer should generally produce fewer tokens for Python code
# because it learned code-specific subword patterns during training
len(old_tokenizer(example).tokens()), len(wrapped_tokenizer(example).tokens())

(36, 27)
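
The two counts can be turned into a relative saving. The sketch below uses the numbers reported above; the helper names `token_reduction` and `average_token_count` are hypothetical, and `str.split` is only a stand-in tokenizer so that the code runs without a trained model.

```python
def token_reduction(baseline_count, new_count):
    """Return the fraction of tokens saved relative to the baseline."""
    return 1 - new_count / baseline_count


# Counts reported above for the single `add_numbers` example.
gpt2_count, custom_count = 36, 27
saving = token_reduction(gpt2_count, custom_count)
print(f"{saving:.0%} fewer tokens")  # 25% fewer tokens


def average_token_count(tokenize, texts):
    """Average number of tokens that `tokenize` produces per text."""
    return sum(len(tokenize(text)) for text in texts) / len(texts)


# Stand-in tokenizer and made-up samples, for illustration only.
samples = ["return a + b", "x = 1"]
print(average_token_count(str.split, samples))  # 3.5
```

A single function is not a benchmark. For a more reliable comparison, one option is to pass a callable such as `lambda s: wrapped_tokenizer(s).tokens()` to `average_token_count` over a sample of functions that were not used for training, and do the same for the GPT-2 tokenizer.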