# Tokenizer Training

## Objective

This notebook picks up where the data preparation notebook (`00_data.ipynb`) left off and covers the next step: **training a tokenizer**. We train a custom Byte-Pair Encoding (BPE) tokenizer on the clean, filtered training corpus produced there.

The final artifact produced by this notebook is a trained tokenizer (vocabulary and merge rules) saved to disk. This tokenizer will be used in all subsequent stages to convert raw text into token IDs that the model can understand.

## Imports and Dependencies

In [1]:
from mini_transformer.data.tokenize.bpe import BPETokenization
from mini_transformer.container import MiniTransformerContainer

In [2]:
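# Build the dependency container, initialize its resources, and wire it into this notebook.
container = MiniTransformerContainer()
container.init_resources()
container.wire(modules=[__name__])
# Resolve the dataset repository service used in the cells below.
repo = container.data.repo()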

## 1. Setup and Data Loading

The cell above initializes the project's dependency container, which gives us access to our services, specifically the `DatasetRepository` (bound to `repo`).

We now query the repository to find and load the **filtered training dataset** that was created in the `00_data.ipynb` notebook. This ensures the tokenizer is trained on the same clean data that will later be used to train the model.

In [3]:
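# Look up the filtered training split in the repository.
dataset_info = repo.show(stage="filtered", split="train")
# Load the first matching dataset by its ID.
dataset = repo.get(dataset_id=dataset_info['id'].values[0])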

## 2. Train and Save the Tokenizer

With the training corpus loaded, we can now perform the core task of this notebook:

1.  **Instantiate the Tokenizer Service**: We get the `BPETokenization` service from our container.
2.  **Train**: We call the `.train()` method, passing our dataset. This process iterates through the text corpus to learn the subword vocabulary and merge rules that define our BPE tokenizer.
3.  **Save**: Once training is complete, we call the `.save()` method. This serializes the trained tokenizer's state (vocabulary and merge rules) to disk, making it a reusable artifact for model training and inference.
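
To make the training step concrete, here is a minimal, character-level sketch of a BPE training loop. It is not the `BPETokenization` implementation, whose internals are not shown in this notebook. The function name `train_bpe`, the `</w>` end-of-word marker, the whitespace pre-tokenization, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy sketch)."""
    # Split each whitespace-separated word into characters plus an
    # end-of-word marker, and count how often each word occurs.
    words = Counter()
    for line in corpus:
        for word in line.split():
            words[tuple(word) + ("</w>",)] += 1

    # The base vocabulary is every symbol seen before any merge.
    base = sorted({s for symbols in words for s in symbols})

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words

    # Assign IDs: base symbols first, then merged symbols in learned order.
    vocab = {}
    for token in base + [a + b for a, b in merges]:
        vocab.setdefault(token, len(vocab))
    return merges, vocab


# Toy corpus (illustrative only, not the WMT14 data used in this notebook).
corpus = [
    "low low low low low",
    "lower lower",
    "newest newest newest newest newest newest",
    "widest widest widest",
]
merges, vocab = train_bpe(corpus, num_merges=3)
print(merges)
```

On this toy corpus the first three merges build up the suffix `est</w>`, the kind of frequent subword that a real training run learns at a much larger scale.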

In [4]:
tokenization = container.data.tokenization()
tokenization.train(dataset=dataset)
tokenization.save()

BPE tokenizer training on wmt14-fr-en-train-filtered-50000_examples-62438e4e started.

BPE tokenizer training complete.
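
Once saved, the vocabulary and merge rules are all that later stages need to turn raw text into token IDs. The sketch below shows one simple way to do that by replaying merge rules in the order they were learned. The helper names `apply_merges` and `encode`, the `<unk>` fallback, the `</w>` marker, and the hard-coded `merges` and `vocab` are hypothetical stand-ins, not the contents or API of the artifact written by `.save()`.

```python
def apply_merges(word, merges):
    """Split one word into subword tokens by replaying merge rules in order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it appears next to each other.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


def encode(text, merges, vocab, unk_id):
    """Map whitespace-separated text to token IDs, using `unk_id` for unknown tokens."""
    ids = []
    for word in text.split():
        for token in apply_merges(word, merges):
            ids.append(vocab.get(token, unk_id))
    return ids


# Hypothetical merge rules and vocabulary, standing in for a trained tokenizer.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]
vocab = {
    "<unk>": 0, "l": 1, "o": 2, "w": 3, "e": 4, "r": 5, "</w>": 6,
    "lo": 7, "low": 8, "er": 9, "er</w>": 10,
}

tokens = apply_merges("lower", merges)
ids = encode("lower low", merges, vocab, unk_id=vocab["<unk>"])
print(tokens, ids)
```

Because the merge order determines the segmentation, the vocabulary and the merge rules have to be saved and loaded together.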
