# Data

## Objective

This notebook details the end-to-end data preparation pipeline for the WMT14 French-to-English translation dataset. The primary goal is to process the raw source data into clean, filtered, and appropriately sized datasets for training, validation, and testing a transformer model.

## Process Overview

The workflow is executed in two primary stages:

1.  **Raw Data Extraction**: We begin by extracting a large number of examples from the source dataset for each split (4M for training, 3k for validation, and 3k for testing). This initial step ensures we have a comprehensive base of raw data stored and managed by our dataset repository.

2.  **Filtering & Subsampling**: The raw datasets are then filtered for quality, and the training set is subsampled from 4 million down to 50,000 examples. The result is a cleaner, more manageable dataset for the next steps in the modeling pipeline.


## Imports and Dependencies

In [None]:
from mini_transformer.data.builder.data_filter import TranslationDatasetBuilderFilteredConfig, TranslationDatasetBuilderFiltered
from mini_transformer.data.builder.extractor import TranslationDatasetBuilderRaw, TranslationDatasetBuilderRawConfig
from mini_transformer.container import MiniTransformerContainer

## 1. Setup and Initialization

We begin by setting up the environment. This involves initializing a central `MiniTransformerContainer` which handles dependency injection for the project. By wiring this container, we gain access to a `DatasetRepository`, which is a dedicated service for managing, storing, and retrieving our datasets throughout their various processing stages.

In [2]:
container = MiniTransformerContainer()
container.init_resources()
container.wire(modules=[__name__])
repo = container.data.repo()
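
To make the dependency-injection idea concrete, here is a minimal, standard-library sketch of a container that builds a repository once and hands out the same instance on every access. `SketchContainer` and `InMemoryDatasetRepository` are hypothetical names invented for illustration; they are not the project's `MiniTransformerContainer` or `DatasetRepository`, whose internals are not shown in this notebook.

```python
from functools import cached_property


class InMemoryDatasetRepository:
    """Hypothetical stand-in for a dataset repository (not the project's class)."""

    def __init__(self) -> None:
        self._datasets: dict[str, object] = {}

    def add(self, dataset_id: str, dataset: object) -> None:
        self._datasets[dataset_id] = dataset

    def exists(self, dataset_id: str) -> bool:
        return dataset_id in self._datasets


class SketchContainer:
    """Hypothetical container: each service is created lazily, exactly once."""

    @cached_property
    def repo(self) -> InMemoryDatasetRepository:
        # Built on first access, then cached on the container instance.
        return InMemoryDatasetRepository()


sketch_container = SketchContainer()
sketch_repo = sketch_container.repo
sketch_repo.add("demo-id", {"n": 3})
```

Because every consumer asks the container for the repository instead of constructing its own, all pipeline stages see the same stored datasets. Note that the sketch exposes `repo` as an attribute, whereas the real container is called as `container.data.repo()`.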

## 2. Stage 1: Raw Data Extraction

The first step in our data pipeline is to extract the raw data from the source, the WMT14 French-to-English dataset. We define the desired number of samples for each split:

* **Training**: 4,000,000 examples
* **Validation**: 3,000 examples
* **Test**: 3,000 examples

The `extract` function is a helper that encapsulates this logic. It checks if a dataset with the specified configuration already exists in our repository. If not, it uses the `TranslationDatasetBuilderRaw` to build it from the source and add it to the repository. The output from the subsequent cell confirms that these raw datasets have already been created and saved.

In [3]:
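def extract(split, n, stage="raw"):
    """Build the raw dataset for `split` and store it, unless it already exists."""
    config = TranslationDatasetBuilderRawConfig(split=split, stage=stage, n=n, seed=55)
    if not repo.exists(config.dataset_id):
        # Not in the repository yet: build from the source and persist it.
        extractor = TranslationDatasetBuilderRaw(config=config)
        dataset = extractor.build()
        repo.add(dataset=dataset)
        print(dataset.metrics)
    else:
        # Same configuration was built before: skip the expensive extraction.
        print(f"Dataset {config.dataset_name} already exists.")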

In [4]:
splits = ["train", "validation", "test"]
sizes = [4_000_000, 3_000, 3_000]
for split, n in zip(splits, sizes):
    extract(split=split, n=n)  # stage defaults to "raw"

Dataset wmt14-fr-en-train-raw-4000000_examples-80e3e0a4 already exists.
Dataset wmt14-fr-en-validation-raw-3000_examples-adb3cda6 already exists.
Dataset wmt14-fr-en-test-raw-3000_examples-019a7ef1 already exists.
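
The dataset names above end in a short hexadecimal suffix, which suggests that the identifier is derived from the configuration. The following is a minimal sketch of how such a scheme could support the "build only if missing" check. It assumes the ID is a hash of the config fields; `SketchRawConfig` and `build_once` are hypothetical names, and the actual derivation used by `TranslationDatasetBuilderRawConfig` is not shown in this notebook, so the suffixes produced here will not match the real ones.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SketchRawConfig:
    """Hypothetical config whose ID is a hash of its own fields (an assumption)."""

    split: str
    n: int
    stage: str = "raw"
    seed: int = 55

    @property
    def dataset_id(self) -> str:
        # Serialize fields in a stable order so equal configs hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:8]

    @property
    def dataset_name(self) -> str:
        return f"wmt14-fr-en-{self.split}-{self.stage}-{self.n}_examples-{self.dataset_id}"


def build_once(config, store, build):
    """Build and store a dataset only if its ID is absent. Returns True if built."""
    if config.dataset_id in store:
        return False
    store[config.dataset_id] = build(config)
    return True


store = {}
cfg = SketchRawConfig(split="validation", n=3000)
first = build_once(cfg, store, build=lambda c: ["placeholder"] * 3)
second = build_once(cfg, store, build=lambda c: ["placeholder"] * 3)
```

With a content-derived ID, rerunning the cell with an unchanged configuration is a no-op, while changing any field (for example `n` or `seed`) yields a new ID and triggers a fresh build.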


## 3. Stage 2: Filtering and Processing

With the raw datasets available, the next stage is to apply filtering and cleaning. This is a critical step to create a high-quality, manageable dataset for training. The `dataset_filter` function handles this process. It retrieves a raw dataset, creates a new configuration for a "filtered" version, and then uses the `TranslationDatasetBuilderFiltered` to perform the actual processing.

The final code cell iterates through all available "raw" datasets and applies this filtering logic to each one. The output shows the details of the newly created filtered datasets. It's important to note the change in size:

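* The training set is subsampled from **4,000,000** to **50,000** examples.
* The validation and test sets shrink from 3,000 examples each to **2,895** and **2,968** respectively, presumably because filtering removes pairs that fail specific criteria (e.g., sentence length or character ratios).
* The validation and test dataset names still contain `50000_examples` even though their reported `n` is smaller, so the name appears to reflect the configured target size rather than the actual number of examples.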

This results in a smaller, cleaner dataset that is more efficient for model training and validation.
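
As an illustration of what a filter-then-subsample step can look like, here is a minimal sketch. The criteria (token-count bounds and a length-ratio cap) and their thresholds are assumptions chosen for the example; the notebook does not show the rules that `TranslationDatasetBuilderFiltered` actually applies. `keep_pair` and `filter_and_subsample` are hypothetical helpers.

```python
import random


def keep_pair(src: str, tgt: str, min_len: int = 1, max_len: int = 128,
              max_ratio: float = 2.5) -> bool:
    """Keep a pair if both sides are within length bounds and similar in length."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return False  # empty side; also avoids division by zero below
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_and_subsample(pairs, n, seed=55):
    """Filter first, then draw at most n pairs with a seeded RNG."""
    kept = [pair for pair in pairs if keep_pair(*pair)]
    if len(kept) <= n:
        return kept  # fewer survivors than the target: keep them all
    return random.Random(seed).sample(kept, n)


demo_pairs = [
    ("the cat sat", "le chat s'est assis"),  # kept: 3 vs 4 tokens
    ("", "vide"),                            # dropped: empty source
    ("yes", "oui oui oui oui oui oui"),      # dropped: length ratio of 6
]
demo_filtered = filter_and_subsample(demo_pairs, n=50_000)
demo_sampled = filter_and_subsample([("a b", "c d")] * 10, n=4)
```

This shape is consistent with the sizes reported in this notebook: a split with fewer surviving pairs than the target keeps everything that passes the filter, while a large split is cut down to exactly the target size.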

In [None]:
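def dataset_filter(dataset_id: str, force: bool = False):
    """Build the filtered version of a raw dataset, unless it already exists."""
    dataset = repo.get(dataset_id=dataset_id)
    # Derive the "filtered" configuration from the raw dataset's configuration.
    config = TranslationDatasetBuilderFilteredConfig.from_config(dataset.config)
    # With force=True, drop any existing filtered dataset so it is rebuilt.
    if force and repo.exists(config.dataset_id):
        repo.remove(config.dataset_id)
    if not repo.exists(config.dataset_id):
        # Named `builder` rather than `filter` to avoid shadowing the built-in.
        builder = TranslationDatasetBuilderFiltered(dataset=dataset, config=config)
        filtered_dataset = builder.build()
        repo.add(filtered_dataset)
        print(filtered_dataset)
    else:
        print(f"Dataset {config.dataset_name} already exists.")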

In [6]:
datasets = repo.show(stage="raw")
for dataset_id in datasets['id']:
    dataset_filter(dataset_id=dataset_id)


100%|██████████| 50000/50000 [00:04<00:00, 11561.98it/s]
[08/27/2025 01:00:48 PM] [INFO] [mini_transformer.data.repo] [add] : Added dataset id: 62438e4e, name: wmt14-fr-en-train-filtered-50000_examples-62438e4e to the Dataset repository.




                       TranslationDataset                       
                              id | 62438e4e
                            name | wmt14-fr-en-train-filtered-50000_examples-62438e4e
                           split | train
                               n | 50000
                          source | HuggingFace
             source_dataset_name | wmt14
                           stage | filtered
                         created | 2025-08-27 13:00:47.317834
                            lang | fr-en
                        lang_src | en
                        lang_tgt | fr




  6%|▌         | 2895/50000 [00:00<00:00, 54613.58it/s]
[08/27/2025 01:00:54 PM] [INFO] [mini_transformer.data.repo] [add] : Added dataset id: a768da37, name: wmt14-fr-en-validation-filtered-50000_examples-a768da37 to the Dataset repository.




                       TranslationDataset                       
                              id | a768da37
                            name | wmt14-fr-en-validation-filtered-50000_examples-a768da37
                           split | validation
                               n | 2895
                          source | HuggingFace
             source_dataset_name | wmt14
                           stage | filtered
                         created | 2025-08-27 13:00:54.263693
                            lang | fr-en
                        lang_src | en
                        lang_tgt | fr




  6%|▌         | 2968/50000 [00:00<00:00, 55077.11it/s]
[08/27/2025 01:00:54 PM] [INFO] [mini_transformer.data.repo] [add] : Added dataset id: bf240e52, name: wmt14-fr-en-test-filtered-50000_examples-bf240e52 to the Dataset repository.




                       TranslationDataset                       
                              id | bf240e52
                            name | wmt14-fr-en-test-filtered-50000_examples-bf240e52
                           split | test
                               n | 2968
                          source | HuggingFace
             source_dataset_name | wmt14
                           stage | filtered
                         created | 2025-08-27 13:00:54.383958
                            lang | fr-en
                        lang_src | en
                        lang_tgt | fr
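
The sizes reported above can be turned into retention rates with a few lines of arithmetic. The numbers are copied from the raw extraction targets and the `n` values in the output; nothing else is assumed.

```python
raw_sizes = {"train": 4_000_000, "validation": 3_000, "test": 3_000}
filtered_sizes = {"train": 50_000, "validation": 2_895, "test": 2_968}

# Fraction of each raw split that remains after Stage 2.
retention = {split: filtered_sizes[split] / raw_sizes[split] for split in raw_sizes}

for split, rate in retention.items():
    print(f"{split}: {filtered_sizes[split]:,} of {raw_sizes[split]:,} kept ({rate:.2%})")
```

Validation keeps 96.50% of its pairs and test keeps 98.93%. The 1.25% figure for training combines filtering with subsampling, so it should not be read as the filter's pass rate.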


