# Data



## Imports and Dependencies

In [1]:
from mini_transformer.data.extractor.config import TranslationDatasetExtractorConfig
from mini_transformer.data.extractor.extract import TranslationDatasetExtractor
from mini_transformer.container import MiniTransformerContainer

In [2]:
container = MiniTransformerContainer()
container.init_resources()
container.wire(modules=[__name__, "mini_transformer.data.extractor.extract"])
repo = container.data.repo()

## Dataset Extractor
From the WMT14 dataset, we extract training, validation, and test data as follows:

- training: 4,000,000
- validation: 3,000
- test: 3,000

We oversample by a factor of 5 so that enough samples remain after validation and filtering.
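
The oversampling arithmetic can be sketched as follows. This is only an illustration: the helper `raw_sample_size` is hypothetical, and reading 4,000,000 as the already-oversampled count (a target of 800,000 usable training pairs) is an assumption that the notebook does not confirm.

```python
import math


def raw_sample_size(target: int, oversample_factor: float = 5) -> int:
    """Return how many raw examples to extract so that roughly `target`
    examples can survive later validation and filtering.

    Hypothetical helper, not part of mini_transformer.
    """
    if target < 0:
        raise ValueError("target must be non-negative")
    if oversample_factor < 1:
        raise ValueError("oversample_factor must be at least 1")
    # Round up so a fractional factor never undershoots the target.
    return math.ceil(target * oversample_factor)


# Assumed target of 800,000 usable pairs, oversampled 5x.
print(raw_sample_size(800_000, 5))
```

With a factor of 5, up to four out of every five raw examples could be discarded while still meeting the target.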

In [3]:
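def extract(split, n, stage="raw"):
    """Extract n examples from a split, store them in the repo, and print metrics."""
    # Configure the extractor for the requested split, stage, and sample count.
    config = TranslationDatasetExtractorConfig(split=split, stage=stage, n=n)
    extractor = TranslationDatasetExtractor(config=config)
    # Run the extraction and persist the resulting dataset.
    dataset = extractor.extract()
    repo.add(dataset=dataset)
    # Report timing and sequence-length statistics for this extraction.
    print(dataset.metrics)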
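
The internals of `TranslationDatasetExtractor` are not shown in this notebook. As a rough sketch of what taking the first `n` pairs from a split could look like, assuming the data comes from the Hugging Face `datasets` copy of WMT14 (`"wmt14"`, config `"fr-en"`) and that each record has a `translation` dict keyed by language code: the function names `take_pairs` and `load_wmt14_stream`, and the English-to-French direction, are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_pairs(
    stream: Iterable[dict], n: int, src: str = "en", tgt: str = "fr"
) -> Iterator[tuple[str, str]]:
    """Yield at most n (source, target) sentence pairs from a record stream.

    Assumes each record looks like {"translation": {"en": ..., "fr": ...}}.
    """
    for example in islice(stream, n):
        pair = example["translation"]
        yield pair[src], pair[tgt]


def load_wmt14_stream(split: str):
    """Open a split lazily so that only the requested examples are read.

    Assumption: the `datasets` package is installed and network access works.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset("wmt14", "fr-en", split=split, streaming=True)


# Offline usage with stand-in records shaped like the assumed format.
records = [
    {"translation": {"en": "hello", "fr": "bonjour"}},
    {"translation": {"en": "thank you", "fr": "merci"}},
]
print(list(take_pairs(records, n=1)))
```

Streaming avoids materializing the full training split when only a prefix of it is needed; whether the real extractor streams or samples randomly is not stated here.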
    

In [4]:
# Sample counts requested per split; every split is extracted at the "raw" stage.
splits = ["train", "validation", "test"]
sizes = [4_000_000, 3_000, 3_000]
stages = ["raw", "raw", "raw"]
for split, n, stage in zip(splits, sizes, stages):
    extract(split=split, n=n, stage=stage)

100%|██████████| 4000000/4000000 [35:43<00:00, 1866.26it/s] 
[08/26/2025 12:22:30 AM] [INFO] [mini_transformer.data.repo] [add] : Added dataset id: 76174952, name: wmt14-fr-en-train-raw-4000000_examples-76174952 to the Dataset repository.




               TranslationDatasetExtractorMetrics               
                      started_at | 2025-08-25 23:13:28.982748
                        ended_at | 2025-08-25 23:49:20.104557
                        duration | 2151.122
                               n | 4000000
                      throughput | 1859.495
                     avg_seq_len | 27.068
                     max_seq_len | 8480
                     min_seq_len | 1
                 src_avg_seq_len | 25.026
                 src_max_seq_len | 4345
                 src_min_seq_len | 1
                 tgt_avg_seq_len | 29.111
                 tgt_max_seq_len | 8480
                 tgt_min_seq_len | 1
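
The reported figures are consistent with simple definitions: throughput matches `n / duration` (4,000,000 / 2151.122 ≈ 1859.49), and `avg_seq_len` matches the mean of the source and target averages. A minimal sketch of how such metrics could be computed, assuming length means whitespace-delimited words; the actual unit used by `TranslationDatasetExtractorMetrics` is not stated, and the function names below are hypothetical.

```python
def seq_len(text: str) -> int:
    """Length of a sentence in whitespace-delimited words (an assumption)."""
    return len(text.split())


def length_metrics(pairs: list[tuple[str, str]]) -> dict:
    """Compute combined, source, and target length statistics for sentence pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    src = [seq_len(s) for s, _ in pairs]
    tgt = [seq_len(t) for _, t in pairs]
    both = src + tgt  # pooling both sides gives the mean of the two averages
    return {
        "avg_seq_len": sum(both) / len(both),
        "max_seq_len": max(both),
        "min_seq_len": min(both),
        "src_avg_seq_len": sum(src) / len(src),
        "src_max_seq_len": max(src),
        "src_min_seq_len": min(src),
        "tgt_avg_seq_len": sum(tgt) / len(tgt),
        "tgt_max_seq_len": max(tgt),
        "tgt_min_seq_len": min(tgt),
    }


def throughput(n: int, duration_s: float) -> float:
    """Examples processed per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return n / duration_s


pairs = [("the cat sat", "le chat était assis"), ("hello", "bonjour")]
print(length_metrics(pairs))
print(round(throughput(4_000_000, 2151.122), 2))
```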




100%|██████████| 1000/1000 [00:01<00:00, 779.34it/s]
[08/26/2025 12:22:37 AM] [INFO] [mini_transformer.data.repo] [add] : Added dataset id: b8deb1ce, name: wmt14-fr-en-validation-raw-1000_examples-b8deb1ce to the Dataset repository.




               TranslationDatasetExtractorMetrics               
                      started_at | 2025-08-26 00:22:32.740874
                        ended_at | 2025-08-26 00:22:36.139364
                        duration | 3.398
                               n | 1000
                      throughput | 294.291
                     avg_seq_len | 19.66
                     max_seq_len | 84
                     min_seq_len | 1
                 src_avg_seq_len | 18.718
                 src_max_seq_len | 76
                 src_min_seq_len | 1
                 tgt_avg_seq_len | 20.603
                 tgt_max_seq_len | 84
                 tgt_min_seq_len | 1




100%|██████████| 1000/1000 [00:00<00:00, 1067.39it/s]
[08/26/2025 12:22:39 AM] [INFO] [mini_transformer.data.repo] [add] : Added dataset id: d897ef5d, name: wmt14-fr-en-test-raw-1000_examples-d897ef5d to the Dataset repository.




               TranslationDatasetExtractorMetrics               
                      started_at | 2025-08-26 00:22:37.036132
                        ended_at | 2025-08-26 00:22:38.970474
                        duration | 1.934
                               n | 1000
                      throughput | 517.063
                     avg_seq_len | 22.184
                     max_seq_len | 75
                     min_seq_len | 3
                 src_avg_seq_len | 20.958
                 src_max_seq_len | 61
                 src_min_seq_len | 3
                 tgt_avg_seq_len | 23.41
                 tgt_max_seq_len | 75
                 tgt_min_seq_len | 3
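
The training metrics show lengths ranging from 1 to 8,480, which is the kind of outlier a later filtering step would be expected to remove. A minimal sketch of a length-based filter, assuming whitespace-delimited lengths: the function `keep_pair` and the thresholds `min_len`, `max_len`, and `max_ratio` are hypothetical values chosen for illustration, not settings taken from mini_transformer.

```python
def keep_pair(
    src: str,
    tgt: str,
    min_len: int = 1,
    max_len: int = 256,
    max_ratio: float = 3.0,
) -> bool:
    """Return True if a sentence pair passes simple length checks.

    All thresholds are illustrative assumptions.
    """
    s, t = len(src.split()), len(tgt.split())
    # Reject empty sides outright; this also avoids division by zero below.
    if min(s, t) == 0:
        return False
    # Reject pairs where either side is too short or too long.
    if not (min_len <= s <= max_len and min_len <= t <= max_len):
        return False
    # Reject pairs whose lengths differ too much, a common sign of misalignment.
    return max(s, t) / min(s, t) <= max_ratio


pairs = [
    ("the cat sat on the mat", "le chat était assis sur le tapis"),
    ("ok", "ceci est une phrase beaucoup trop longue"),
    ("", "vide"),
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))
```

Filters like this discard a data-dependent share of the raw pairs, which is why the raw extraction is sized larger than the number of examples ultimately needed.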


