In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created: January 2026
Author: Thomas Moerman
Description: Notebook for training an encoder-decoder NMT model using OpenNMT-py.
"""



# Training an Encoder-Decoder NMT Model with OpenNMT-py

This notebook walks you through **training a Transformer-based Neural Machine Translation (NMT) model** using [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py).

## What you'll learn
- How to download and prepare parallel translation data
- How to create OpenNMT-py configuration files
- How to build vocabularies for source and target languages
- How to train an encoder-decoder Transformer model
- How to translate with the trained model and evaluate with BLEU

## Encoder-Decoder vs. Decoder-Only

Unlike most LLMs, which are decoder-only, traditional NMT uses an **encoder-decoder** architecture:
- **Encoder**: Reads the source sentence and creates a representation
- **Decoder**: Generates the target sentence based on the encoder's output

This architecture is specifically designed for sequence-to-sequence tasks like translation.

## OpenNMT-py

OpenNMT-py is a mature, research-friendly NMT framework that supports:
- Transformer and RNN architectures
- Various attention mechanisms
- Subword tokenization (BPE, SentencePiece)
- Multi-GPU training
- Model ensembling


## 1) Setup and Imports

First, let's verify that OpenNMT-py is installed and check our environment.


In [2]:
import os
import subprocess
import yaml
import torch

# Check PyTorch and CUDA
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
    print('GPU Memory:', round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1), 'GB')

# Check OpenNMT-py installation
try:
    import onmt
    print(f'\nOpenNMT-py version: {onmt.__version__}')
except ImportError:
    print('\nOpenNMT-py not installed. Install with: pip install OpenNMT-py')

# Set random seed for reproducibility
torch.manual_seed(42)


PyTorch: 2.2.2+cu121
CUDA available: True
GPU: Tesla V100-SXM2-16GB
GPU Memory: 15.8 GB

OpenNMT-py version: 3.5.1


<torch._C.Generator at 0x72744c905990>

## 2) Download and Prepare the Data

We'll use the same English-French translation dataset from Hugging Face.

OpenNMT-py expects **parallel text files**:
- `src-train.txt`: Source language sentences (one per line)
- `tgt-train.txt`: Target language sentences (one per line)

The lines must be aligned (line N in src corresponds to line N in tgt).
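
Misaligned files silently corrupt training, so it can be worth checking alignment before going further. The following is a minimal sketch, assuming two UTF-8 files with one sentence per line; `check_alignment` is a hypothetical helper name and is not part of OpenNMT-py:

```python
from itertools import zip_longest
import os
import tempfile


def check_alignment(src_path, tgt_path):
    """Return (n_pairs, problems) for two parallel files.

    `problems` lists 1-based line numbers where exactly one side is empty
    or where one file has run out of lines.
    """
    problems = []
    n_pairs = 0
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            s_empty = s is None or not s.strip()
            t_empty = t is None or not t.strip()
            if s_empty != t_empty:
                problems.append(line_no)
            elif not s_empty:
                n_pairs += 1
    return n_pairs, problems


# Tiny demonstration with temporary files (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "src.txt")
    tgt_file = os.path.join(tmp, "tgt.txt")
    with open(src_file, "w", encoding="utf-8") as f:
        f.write("Hello.\nGood morning.\nThank you.\n")
    with open(tgt_file, "w", encoding="utf-8") as f:
        f.write("Bonjour.\n\nMerci.\n")
    demo_pairs, demo_problems = check_alignment(src_file, tgt_file)

print(demo_pairs, demo_problems)
```

In the demonstration, line 2 is flagged because the target side is empty while the source side is not.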


In [3]:
# Configuration
REPO_NAME = "LT3/nfr_bt_nmt_english-french"
DATA_PATH = "data/en-fr"
OPENNMT_DATA_PATH = "data/opennmt-en-fr"

# Create OpenNMT data directory
os.makedirs(OPENNMT_DATA_PATH, exist_ok=True)

# Download the data if not already present
from download_data import download_and_save_dataset

if not os.path.exists(DATA_PATH):
    dataset_paths = download_and_save_dataset(REPO_NAME, DATA_PATH)
else:
    print(f"Data already exists at {DATA_PATH}")


Data already exists at data/en-fr


In [4]:
# Prepare data in OpenNMT format
# For this tutorial, we'll use a subset of the data

def prepare_opennmt_data(en_path, fr_path, out_src_path, out_tgt_path, max_samples=None):
    """
    Prepare parallel data for OpenNMT-py.
    Reads English and French files line by line, keeps only pairs where
    both sides are non-empty, and writes the aligned pairs to new files.
    """
    with open(en_path, 'r', encoding='utf-8') as f:
        en_lines = [line.strip() for line in f]
    with open(fr_path, 'r', encoding='utf-8') as f:
        fr_lines = [line.strip() for line in f]

    # Filter pairs jointly: dropping empty lines per file would shift the alignment.
    # zip() also truncates to the shorter file.
    pairs = [(en, fr) for en, fr in zip(en_lines, fr_lines) if en and fr]

    if max_samples:
        pairs = pairs[:max_samples]

    # Write source (English)
    with open(out_src_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(en for en, _ in pairs) + '\n')

    # Write target (French)
    with open(out_tgt_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(fr for _, fr in pairs) + '\n')

    return len(pairs)

# Prepare training data (2000 examples for tutorial)
n_train = prepare_opennmt_data(
    f"{DATA_PATH}/en_train.txt",
    f"{DATA_PATH}/fr_train.txt",
    f"{OPENNMT_DATA_PATH}/src-train.txt",
    f"{OPENNMT_DATA_PATH}/tgt-train.txt",
    max_samples=2000
)
print(f"Training data: {n_train} sentence pairs")

# Prepare validation data (500 examples)
n_val = prepare_opennmt_data(
    f"{DATA_PATH}/en_validation.txt",
    f"{DATA_PATH}/fr_validation.txt",
    f"{OPENNMT_DATA_PATH}/src-val.txt",
    f"{OPENNMT_DATA_PATH}/tgt-val.txt",
    max_samples=500
)
print(f"Validation data: {n_val} sentence pairs")

# Prepare test data (100 examples for evaluation)
n_test = prepare_opennmt_data(
    f"{DATA_PATH}/en_test.txt",
    f"{DATA_PATH}/fr_test.txt",
    f"{OPENNMT_DATA_PATH}/src-test.txt",
    f"{OPENNMT_DATA_PATH}/tgt-test.txt",
    max_samples=100
)
print(f"Test data: {n_test} sentence pairs")


Training data: 2000 sentence pairs
Validation data: 500 sentence pairs
Test data: 100 sentence pairs


In [5]:
# Preview the data
print("=" * 60)
print("SAMPLE TRAINING DATA")
print("=" * 60)

with open(f"{OPENNMT_DATA_PATH}/src-train.txt", 'r') as f:
    src_lines = f.readlines()[:3]
with open(f"{OPENNMT_DATA_PATH}/tgt-train.txt", 'r') as f:
    tgt_lines = f.readlines()[:3]

for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines)):
    print(f"\n--- Example {i+1} ---")
    print(f"SRC (EN): {src.strip()}")
    print(f"TGT (FR): {tgt.strip()}")


SAMPLE TRAINING DATA

--- Example 1 ---
SRC (EN): Article 199b is replaced by the following:
TGT (FR): l'article 199 ter est remplacé par le texte suivant:

--- Example 2 ---
SRC (EN): at consular offices:
TGT (FR): dans les bureaux consulaires:

--- Example 3 ---
SRC (EN): The Portuguese authorities have explained that this public interest mission was entrusted to the private sector in accordance with Decree-Law No 197/99 of 8 June 1999 [7], which transposed into national law European Parliament and Council Directive 97/52/EC of 13 October 1997 amending Directives 92/50/EEC, 93/36/EEC and 93/37/EEC concerning the coordination of procedures for the award of public service contracts, public supply contracts and public works contracts respectively [8].
TGT (FR): Les autorités portugaises ont précisé que cette mission d’intérêt public avait été attribuée au secteur privé, dans le respect des prescriptions établies par le décret loi no 197/1999 du 8 juin 1999 [7], qui est l’instrument nati

## 2.5) Tokenize the Data

Before building the vocabulary, we need to **tokenize** the data using a subword tokenizer.

### Why tokenize?

1. **Consistent vocabulary**: After subword tokenization, every whitespace-separated item in the file is exactly one vocabulary entry
2. **Handles rare words**: Rare words are split into subword units that the model can learn
3. **Language-agnostic**: Works well across different languages

We use the `xlm-roberta-base` tokenizer, which:
- Uses SentencePiece under the hood
- Is multilingual and handles both English and French well
- Produces space-separated subword tokens
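
To make the token format concrete: in SentencePiece-style output, `▁` (U+2581) marks the start of a word, so detokenization is essentially a join followed by a replace. The sketch below illustrates that idea only. It is not the implementation of the `tok_detok` helper used in this notebook, and `detokenize_line` is a hypothetical name:

```python
WORD_BOUNDARY = "\u2581"  # the '▁' marker SentencePiece puts at word starts


def detokenize_line(tokenized_line):
    """Turn space-separated subword tokens back into plain text."""
    pieces = tokenized_line.split()
    text = "".join(pieces)  # glue subword pieces together
    # Word-boundary markers become ordinary spaces
    return text.replace(WORD_BOUNDARY, " ").strip()


example = "▁at ▁consul ar ▁office s :"
restored = detokenize_line(example)
print(restored)
```

Applied to the tokenized form of the second training example, this recovers the original sentence `at consular offices:`.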


In [6]:
# Tokenize all data files using the TokenDetokenizer
from tok_detok import TokenDetokenizer

# Initialize the tokenizer (xlm-roberta-base is multilingual)
print("Loading tokenizer (xlm-roberta-base)...")
tok_detok = TokenDetokenizer(model_name="xlm-roberta-base", keep_unk=True)

# Tokenize training data
print("\nTokenizing training data...")
tok_detok.tokenize(f"{OPENNMT_DATA_PATH}/src-train.txt", f"{OPENNMT_DATA_PATH}/src-train.tok")
tok_detok.tokenize(f"{OPENNMT_DATA_PATH}/tgt-train.txt", f"{OPENNMT_DATA_PATH}/tgt-train.tok")

# Tokenize validation data
print("\nTokenizing validation data...")
tok_detok.tokenize(f"{OPENNMT_DATA_PATH}/src-val.txt", f"{OPENNMT_DATA_PATH}/src-val.tok")
tok_detok.tokenize(f"{OPENNMT_DATA_PATH}/tgt-val.txt", f"{OPENNMT_DATA_PATH}/tgt-val.tok")

# Tokenize test source (we'll tokenize, translate, then detokenize)
print("\nTokenizing test data...")
tok_detok.tokenize(f"{OPENNMT_DATA_PATH}/src-test.txt", f"{OPENNMT_DATA_PATH}/src-test.tok")
# Keep original target for BLEU evaluation (we compare detokenized output with original reference)

print("\n" + "=" * 60)
print("Tokenization complete!")
print("=" * 60)


Loading tokenizer (xlm-roberta-base)...

Tokenizing training data...
Tokenizing data/opennmt-en-fr/src-train.txt to data/opennmt-en-fr/src-train.tok...
Tokenization complete.
Tokenizing data/opennmt-en-fr/tgt-train.txt to data/opennmt-en-fr/tgt-train.tok...
Tokenization complete.

Tokenizing validation data...
Tokenizing data/opennmt-en-fr/src-val.txt to data/opennmt-en-fr/src-val.tok...
Tokenization complete.
Tokenizing data/opennmt-en-fr/tgt-val.txt to data/opennmt-en-fr/tgt-val.tok...
Tokenization complete.

Tokenizing test data...
Tokenizing data/opennmt-en-fr/src-test.txt to data/opennmt-en-fr/src-test.tok...
Tokenization complete.

Tokenization complete!


In [7]:
# Preview the tokenized data
print("=" * 60)
print("SAMPLE TOKENIZED TRAINING DATA")
print("=" * 60)

with open(f"{OPENNMT_DATA_PATH}/src-train.tok", 'r') as f:
    src_tok_lines = f.readlines()[:3]
with open(f"{OPENNMT_DATA_PATH}/tgt-train.tok", 'r') as f:
    tgt_tok_lines = f.readlines()[:3]

for i, (src, tgt) in enumerate(zip(src_tok_lines, tgt_tok_lines)):
    print(f"\n--- Example {i+1} ---")
    print(f"SRC (EN tokenized): {src.strip()[:100]}...")
    print(f"TGT (FR tokenized): {tgt.strip()[:100]}...")


SAMPLE TOKENIZED TRAINING DATA

--- Example 1 ---
SRC (EN tokenized): ▁Article ▁199 b ▁is ▁replace d ▁by ▁the ▁following :...
TGT (FR tokenized): ▁l ' article ▁199 ▁ter ▁est ▁rem plac é ▁par ▁le ▁texte ▁suivant :...

--- Example 2 ---
SRC (EN tokenized): ▁at ▁consul ar ▁office s :...
TGT (FR tokenized): ▁dans ▁les ▁bureau x ▁consul aires :...

--- Example 3 ---
SRC (EN tokenized): ▁The ▁Portu gues e ▁authorities ▁have ▁explained ▁that ▁this ▁public ▁interest ▁mission ▁was ▁en tru...
TGT (FR tokenized): ▁Les ▁autorités ▁portu ga ises ▁ont ▁précis é ▁que ▁cette ▁mission ▁d ’ intérêt ▁public ▁avait ▁été ...


## 3) Create Configuration Files

OpenNMT-py uses YAML configuration files for all settings. We'll create:
1. **Vocabulary config**: For building source and target vocabularies
2. **Training config**: For model architecture and training hyperparameters

### Key Configuration Options

- **Vocabulary**: Size, minimum frequency
- **Model**: Transformer layers, hidden size, attention heads
- **Training**: Batch size, learning rate, optimizer
- **Hardware**: GPU settings, precision


In [8]:
# Output directory for models and configs
OUTPUT_DIR = "../models/opennmt-en-fr"
CONFIG_DIR = f"{OPENNMT_DATA_PATH}/config"

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(CONFIG_DIR, exist_ok=True)

print(f"Output directory: {OUTPUT_DIR}")
print(f"Config directory: {CONFIG_DIR}")


Output directory: ../models/opennmt-en-fr
Config directory: data/opennmt-en-fr/config


In [9]:
# Create vocabulary configuration
# Uses the TOKENIZED data files (.tok)
vocab_config_content = f"""
# Where the samples will be written
save_data: {OUTPUT_DIR}

# Where the vocab(s) will be written
src_vocab: {OUTPUT_DIR}/vocab.src
tgt_vocab: {OUTPUT_DIR}/vocab.tgt

# Allow overwriting existing files in the folder
overwrite: true

# Corpus opts: using TOKENIZED files
data:
    corpus_1:
        path_src: {OPENNMT_DATA_PATH}/src-train.tok
        path_tgt: {OPENNMT_DATA_PATH}/tgt-train.tok
    valid:
        path_src: {OPENNMT_DATA_PATH}/src-val.tok
        path_tgt: {OPENNMT_DATA_PATH}/tgt-val.tok
"""

# Write vocab config to file
vocab_config_path = f'{CONFIG_DIR}/vocab_config.yaml'
with open(vocab_config_path, 'w') as f:
    f.write(vocab_config_content)

print(f"Vocabulary config saved to: {vocab_config_path}")
print("\n--- Vocabulary Configuration ---")
print(vocab_config_content)


Vocabulary config saved to: data/opennmt-en-fr/config/vocab_config.yaml

--- Vocabulary Configuration ---

# Where the samples will be written
save_data: ../models/opennmt-en-fr

# Where the vocab(s) will be written
src_vocab: ../models/opennmt-en-fr/vocab.src
tgt_vocab: ../models/opennmt-en-fr/vocab.tgt

# Allow overwriting existing files in the folder
overwrite: true

# Corpus opts: using TOKENIZED files
data:
    corpus_1:
        path_src: data/opennmt-en-fr/src-train.tok
        path_tgt: data/opennmt-en-fr/tgt-train.tok
    valid:
        path_src: data/opennmt-en-fr/src-val.tok
        path_tgt: data/opennmt-en-fr/tgt-val.tok



## 4) Build Vocabulary

OpenNMT-py needs to build vocabularies before training. This creates a mapping from tokens to IDs for both source and target languages.
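
Conceptually, a vocabulary file is a frequency table of tokens. The following is a rough sketch of that idea in plain Python, assuming whitespace-tokenized input. It is not how `onmt_build_vocab` is implemented, and `count_tokens` is a hypothetical helper:

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Made-up tokenized lines for illustration
sample_lines = [
    "▁the ▁cat ▁sat",
    "▁the ▁dog ▁sat",
    "▁the ▁end",
]
counts = count_tokens(sample_lines)

# Format entries as 'token<TAB>count', most frequent first
vocab_lines = [f"{token}\t{freq}" for token, freq in counts.most_common()]
print(vocab_lines[:2])

# Map each token to an integer ID in frequency order
token_to_id = {token: idx for idx, (token, _) in enumerate(counts.most_common())}
```

The real vocabulary files use the same `token<TAB>count` layout. At training time OpenNMT-py additionally places special tokens such as `<unk>`, `<blank>`, `<s>`, and `</s>` ahead of the counted tokens, as the training log later in this notebook shows.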


In [10]:
# Build vocabulary using onmt_build_vocab
# -n_sample -1 means use ALL training data to build the vocabulary

# First, remove old vocab files to start fresh
for vocab_file in [f'{OUTPUT_DIR}/vocab.src', f'{OUTPUT_DIR}/vocab.tgt']:
    if os.path.exists(vocab_file):
        os.remove(vocab_file)
        print(f"Removed old vocab: {vocab_file}")

print("\nBuilding vocabulary...")
print("=" * 60)

result = subprocess.run(
    [
        'onmt_build_vocab',
        '-config', vocab_config_path,
        '-n_sample', '-1',  # Use ALL training data
        '-num_threads', '2',
    ],
    capture_output=True,
    text=True
)

print(result.stdout)
if result.stderr:
    print("Warnings/Errors:")
    print(result.stderr)

# Check if vocab files were created
if os.path.exists(f'{OUTPUT_DIR}/vocab.src'):
    print(f"\n✓ Source vocab created: {OUTPUT_DIR}/vocab.src")
if os.path.exists(f'{OUTPUT_DIR}/vocab.tgt'):
    print(f"✓ Target vocab created: {OUTPUT_DIR}/vocab.tgt")

# Debug: Show raw content of vocab file
print("\n" + "=" * 60)
print("RAW VOCAB FILE CONTENT (first 10 lines of src vocab):")
print("=" * 60)
with open(f'{OUTPUT_DIR}/vocab.src', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        # Show the line with visible representation of tabs and special chars
        print(f"Line {i}: {repr(line.strip())}")


Removed old vocab: ../models/opennmt-en-fr/vocab.src
Removed old vocab: ../models/opennmt-en-fr/vocab.tgt

Building vocabulary...

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2026-01-21 10:39:10,839 INFO] Counter vocab from -1 samples.
[2026-01-21 10:39:10,839 INFO] n_sample=-1: Build vocab on full datasets.
[2026-01-21 10:39:11,002 INFO] Counters src: 7338
[2026-01-21 10:39:11,002 INFO] Counters tgt: 7286


✓ Source vocab created: ../models/opennmt-en-fr/vocab.src
✓ Target vocab created: ../models/opennmt-en-fr/vocab.tgt

RAW VOCAB FILE CONTENT (first 10 lines of src vocab):
Line 0: '▁the\t3546'
Line 1: '▁of\t2299'
Line 2: ',\t1927'
Line 3: 's\t1704'
Line 4: '.\t1297'
Line 5: '▁to\t1264'
Line 6: '▁and\t1263'
Line 7: '▁in\t1146'
Line 8: '▁for\t584'
Line 9: '▁a\t575'


In [11]:
# Count vocabulary sizes: one line per token in each vocab file
# The counts are used as src_vocab_size / tgt_vocab_size in the training config

def count_lines(file_path):
    """Count the number of lines in a file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)

# Count vocab sizes - will be used in training config
SRC_VOCAB_SIZE = count_lines(f'{OUTPUT_DIR}/vocab.src')
TGT_VOCAB_SIZE = count_lines(f'{OUTPUT_DIR}/vocab.tgt')

print(f"Source vocab size: {SRC_VOCAB_SIZE} tokens")
print(f"Target vocab size: {TGT_VOCAB_SIZE} tokens")

# Preview the vocabularies
def preview_vocab(vocab_path, n_lines=20):
    """Preview first n lines of a vocabulary file."""
    with open(vocab_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()[:n_lines]
    total = count_lines(vocab_path)
    return lines, total

print("\n" + "=" * 60)
print("SOURCE VOCABULARY (first 20 tokens)")
print("=" * 60)
src_vocab_lines, src_vocab_total = preview_vocab(f'{OUTPUT_DIR}/vocab.src')
for line in src_vocab_lines:
    print(line.strip())
print(f"\n... Total source vocab size: {src_vocab_total}")

print("\n" + "=" * 60)
print("TARGET VOCABULARY (first 20 tokens)")
print("=" * 60)
tgt_vocab_lines, tgt_vocab_total = preview_vocab(f'{OUTPUT_DIR}/vocab.tgt')
for line in tgt_vocab_lines:
    print(line.strip())
print(f"\n... Total target vocab size: {tgt_vocab_total}")


Source vocab size: 7338 tokens
Target vocab size: 7286 tokens

SOURCE VOCABULARY (first 20 tokens)
▁the	3546
▁of	2299
,	1927
s	1704
.	1297
▁to	1264
▁and	1263
▁in	1146
▁for	584
▁a	575
ing	556
▁(	542
ed	501
▁be	491
)	474
▁	442
tion	431
▁or	424
▁that	398
-	389

... Total source vocab size: 7338

TARGET VOCABULARY (first 20 tokens)
▁de	3225
s	2584
,	2371
▁la	1720
'	1628
▁l	1458
’	1410
▁et	1256
.	1229
▁à	1140
▁des	1139
▁les	1112
▁d	1087
▁du	959
▁le	912
e	782
▁	721
▁en	638
▁dans	465
es	463

... Total target vocab size: 7286


## 5) Create Training Configuration

Now we'll create the full training configuration. This includes:
- Model architecture (Transformer encoder-decoder)
- Training hyperparameters
- Optimization settings

### Transformer Architecture

For this tutorial, we use a small Transformer:
- **Encoder/Decoder layers**: 4 (the standard Transformer-base uses 6)
- **Hidden size**: 256 (Transformer-base: 512)
- **Attention heads**: 4 (Transformer-base: 8)
- **Feed-forward size**: 1024 (Transformer-base: 2048)

This smaller model trains faster and is suitable for our small dataset.
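
The training config for this model uses `decay_method: noam` with `learning_rate: 2.0` and `warmup_steps: 400`, so `2.0` acts as a scale factor rather than the actual step size. The following is a minimal sketch of the Noam schedule from the original Transformer paper, under the assumption that OpenNMT-py applies it as `learning_rate` times the decay factor; `noam_lr` is a hypothetical helper:

```python
def noam_lr(step, base_lr=2.0, model_size=256, warmup_steps=400):
    """Effective learning rate at a given optimizer step (step >= 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup until `warmup_steps`, then inverse square-root decay
    decay = model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * decay


peak_lr = noam_lr(400)    # end of warmup: the highest learning rate
final_lr = noam_lr(2000)  # last training step in this tutorial
print(f"peak: {peak_lr:.5f}, final: {final_lr:.5f}")
```

Under these assumptions the peak is 2.0 × 256^-0.5 × 400^-0.5 = 0.00625, reached at the end of warmup, after which the rate decays with the inverse square root of the step.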

### ⚠️ Important: Small Dataset Considerations

When working with **small datasets** (< 50k sentences), you may see OpenNMT-py repeatedly logging:
```
Weighted corpora loaded so far: * corpus_1: N
```

This happens because OpenNMT-py's default `bucket_size` (262144 examples) is much larger than the dataset, so the corpus is reloaded many times to fill the bucket.

**Solution**: Reduce `bucket_size` to fit your dataset size. For our ~2k sentence tutorial dataset, the training config below uses `bucket_size: 144`.

See: [OpenNMT Forum Discussion](https://forum.opennmt.net/t/is-it-normal-to-see-weighted-corpora-loaded-so-far-in-a-loop)


In [None]:
# Create training configuration as a YAML string
# (an f-string, so paths and vocab sizes are filled in from the variables above)
# Uses the TOKENIZED data files (.tok)

num_gpus = 1 if torch.cuda.is_available() else 0
gpu_ranks = list(range(num_gpus)) if num_gpus > 0 else []

train_config_content = f"""
# Where the samples will be written
save_data: {OUTPUT_DIR}

# Training files - using TOKENIZED data
data:
    corpus_1:
        path_src: {OPENNMT_DATA_PATH}/src-train.tok
        path_tgt: {OPENNMT_DATA_PATH}/tgt-train.tok
        transforms: [filtertoolong]
    valid:
        path_src: {OPENNMT_DATA_PATH}/src-val.tok
        path_tgt: {OPENNMT_DATA_PATH}/tgt-val.tok
        transforms: [filtertoolong]

# Vocabulary files
src_vocab: {OUTPUT_DIR}/vocab.src
tgt_vocab: {OUTPUT_DIR}/vocab.tgt

# Vocabulary size - must match the actual vocab file sizes
src_vocab_size: {SRC_VOCAB_SIZE}
tgt_vocab_size: {TGT_VOCAB_SIZE}

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 200
tgt_seq_length: 200

# Logging
log_file: {OUTPUT_DIR}/train.log
save_model: {OUTPUT_DIR}/model

# Early stopping
early_stopping: 10

# Checkpointing
save_checkpoint_steps: 500
keep_checkpoint: 5

seed: 42

# Training steps
train_steps: 2000
valid_steps: 500
warmup_steps: 400
report_every: 50

# Model architecture (Transformer)
decoder_type: transformer
encoder_type: transformer
word_vec_size: 256
hidden_size: 256
layers: 4
transformer_ff: 1024
heads: 4

# Optimization
accum_count: 4
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

# Batching - use sentences for simpler setup
batch_size: 32
valid_batch_size: 32
batch_type: sents
normalization: sents
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

# NOTE: bucket_size is important for small datasets!
# The default (262144) causes OpenNMT-py to reload the corpus many times
# trying to fill the bucket. For small datasets, reduce this value.
# See: https://forum.opennmt.net/t/is-it-normal-to-see-weighted-corpora-loaded-so-far-in-a-loop
bucket_size: 144

param_init: 0.0
param_init_glorot: true
position_encoding: true

# Hardware
world_size: {num_gpus}
gpu_ranks: {gpu_ranks}
"""

# Write training config to file
train_config_path = f'{CONFIG_DIR}/train_config.yaml'
with open(train_config_path, 'w') as f:
    f.write(train_config_content)

print(f"Training config saved to: {train_config_path}")
print("\n--- Training Configuration Summary ---")
print(f"  Source vocab size: {SRC_VOCAB_SIZE}")
print(f"  Target vocab size: {TGT_VOCAB_SIZE}")
print(f"  Model: Transformer (4 enc layers, 4 dec layers)")
print(f"  Hidden size: 256")
print(f"  Attention heads: 4")
print(f"  Training steps: 2000")
print(f"  Batch size: 32 sentences")
print(f"  Learning rate: 2.0 (with noam decay)")
print(f"  GPU: {'Yes' if num_gpus > 0 else 'No (CPU)'}")


Training config saved to: data/opennmt-en-fr/config/train_config.yaml

--- Training Configuration Summary ---
  Source vocab size: 7338
  Target vocab size: 7286
  Model: Transformer (4 enc layers, 4 dec layers)
  Hidden size: 256
  Attention heads: 4
  Training steps: 2000
  Batch size: 32 sentences
  Learning rate: 2.0 (with noam decay)
  GPU: Yes


## 6) Train the Model

Now we'll train the model using `onmt_train`. 

Training time depends on:
- Dataset size
- Model size
- Hardware (GPU vs CPU)

With our small model and 2000 examples, training should take ~5-15 minutes on a GPU.
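
As a rough sanity check on the workload, the settings used in this notebook (`batch_size: 32`, `accum_count: 4`, `train_steps: 2000`) can be turned into an approximate number of passes over the data. The sketch assumes that one training step consumes `accum_count` batches on a single GPU, and it ignores any sentences dropped by `filtertoolong`:

```python
batch_size = 32        # sentences per batch (batch_type: sents)
accum_count = 4        # batches accumulated per optimizer step
train_steps = 2000
n_train_pairs = 2000   # sentence pairs in the tutorial training set

sentences_per_step = batch_size * accum_count
total_sentences_seen = sentences_per_step * train_steps
approx_epochs = total_sentences_seen / n_train_pairs

print(f"{sentences_per_step} sentences per step")
print(f"~{approx_epochs:.0f} passes over the training data")
```

That many passes over only 2,000 pairs makes overfitting likely, which is acceptable for a tutorial but worth keeping in mind when reading the validation metrics.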


In [13]:
# Train the model
print("Starting training...")
print("=" * 60)
print("This may take several minutes...")
print("=" * 60)

# Run training
result = subprocess.run(
    ['onmt_train', '-config', train_config_path],
    capture_output=False,  # Show output in real-time
    text=True
)

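print("\n" + "=" * 60)
# Report success only if onmt_train exited cleanly
if result.returncode == 0:
    print("Training complete!")
else:
    print(f"Training failed with exit code {result.returncode}")
print("=" * 60)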


Starting training...
This may take several minutes...


[2026-01-21 10:39:15,686 INFO] Parsed 2 corpora from -data.
[2026-01-21 10:39:15,686 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2026-01-21 10:39:15,732 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', '▁the', '▁of', ',', 's', '.', '▁to']
[2026-01-21 10:39:15,732 INFO] The decoder start token is: <s>
[2026-01-21 10:39:15,733 INFO] Building model...
[2026-01-21 10:39:16,391 INFO] Switching model to float32 for amp/apex_amp
[2026-01-21 10:39:16,391 INFO] Non quantized layer compute is fp32
[2026-01-21 10:39:16,533 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(7344, 256, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-3): 4 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
       

KeyboardInterrupt: 

In [None]:
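# List saved checkpoints
import re

print("Saved model checkpoints:")
print("-" * 40)

def checkpoint_step(filename):
    """Extract the step number from a name like 'model_step_500.pt'."""
    match = re.search(r'model_step_(\d+)', filename)
    return int(match.group(1)) if match else -1

# Sort numerically by step; a plain string sort would place step 500 after step 2000
checkpoints = sorted(
    (f for f in os.listdir(OUTPUT_DIR) if f.startswith('model_step_')),
    key=checkpoint_step,
)
for ckpt in checkpoints:
    path = os.path.join(OUTPUT_DIR, ckpt)
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"  {ckpt}: {size_mb:.1f} MB")

# Use the latest checkpoint (highest step) for translation
if checkpoints:
    BEST_MODEL = os.path.join(OUTPUT_DIR, checkpoints[-1])
    print(f"\nLatest checkpoint: {BEST_MODEL}")
else:
    raise FileNotFoundError(f"No checkpoints found in {OUTPUT_DIR}")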


## 7) Translate with the Trained Model

Now let's use our trained model to translate the test set. We use `onmt_translate` for inference.


In [None]:
# Create translation configuration
# Using TOKENIZED test source - output will be tokenized too
translate_config = {
    'model': BEST_MODEL,
    'src': f'{OPENNMT_DATA_PATH}/src-test.tok',  # Tokenized input
    'output': f'{OUTPUT_DIR}/predictions.tok',   # Tokenized output
    
    # Decoding settings
    'beam_size': 5,
    'max_length': 200,
    
    # GPU
    'gpu': 0 if torch.cuda.is_available() else -1,
    
    # Batch size for inference
    'batch_size': 32,
    'batch_type': 'sents',
    
    # Verbosity
    'verbose': True,
}

# Write translate config
translate_config_path = f'{CONFIG_DIR}/translate_config.yaml'
with open(translate_config_path, 'w') as f:
    yaml.dump(translate_config, f, default_flow_style=False)

print(f"Translation config saved to: {translate_config_path}")


In [None]:
# Run translation
print("Translating test set...")
print("=" * 60)

result = subprocess.run(
    ['onmt_translate', '-config', translate_config_path],
    capture_output=True,
    text=True
)

print(result.stdout)
if result.stderr:
    # Filter out just the important info
    for line in result.stderr.split('\n'):
        if 'PRED' not in line:  # Skip individual predictions in verbose mode
            print(line)

print("\n" + "=" * 60)
print(f"Tokenized translations saved to: {OUTPUT_DIR}/predictions.tok")

# Detokenize the predictions for evaluation
print("\nDetokenizing predictions...")
tok_detok.detokenize(
    f'{OUTPUT_DIR}/predictions.tok',
    f'{OUTPUT_DIR}/predictions.txt'
)
print(f"Detokenized translations saved to: {OUTPUT_DIR}/predictions.txt")


In [None]:
# Compare translations with references
# Using ORIGINAL (non-tokenized) source and reference for readability
# Using DETOKENIZED predictions
print("=" * 70)
print("TRANSLATION EXAMPLES")
print("=" * 70)

# Load original source, reference, and detokenized predictions
with open(f'{OPENNMT_DATA_PATH}/src-test.txt', 'r') as f:
    sources = f.readlines()
with open(f'{OPENNMT_DATA_PATH}/tgt-test.txt', 'r') as f:
    references = f.readlines()
with open(f'{OUTPUT_DIR}/predictions.txt', 'r') as f:  # Detokenized
    predictions = f.readlines()

# Show first 10 examples (zip stops early if any file has fewer lines)
for i, (src, ref, pred) in enumerate(zip(sources[:10], references, predictions)):
    print(f"\n--- Example {i+1} ---")
    print(f"Source (EN):     {src.strip()}")
    print(f"Reference (FR):  {ref.strip()}")
    print(f"Prediction (FR): {pred.strip()}")


## 8) Evaluate with BLEU Score

BLEU (Bilingual Evaluation Understudy) is the standard automatic metric for machine translation. It measures n-gram overlap between predictions and references.
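
The core ingredient of BLEU is clipped (modified) n-gram precision. The following is a simplified sketch for a single sentence pair, assuming whitespace tokenization. Full BLEU also combines n-gram orders 1 to 4 with a geometric mean and applies a brevity penalty, which sacreBLEU handles for us. `ngram_precision` is a hypothetical helper and the sentences are made up:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(prediction, reference, n):
    """Clipped n-gram precision of `prediction` against one `reference`."""
    pred_counts = ngrams(prediction.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total


pred = "le chat est sur le tapis"
ref = "le chat est assis sur le tapis"
p1 = ngram_precision(pred, ref, 1)
p2 = ngram_precision(pred, ref, 2)
print(p1, p2)
```

Here every predicted word appears in the reference, so the unigram precision is perfect, while the missing word `assis` breaks one of the five predicted bigrams (`est sur`), lowering the bigram precision.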


In [None]:
# Evaluate with sacrebleu
# Using DETOKENIZED predictions vs ORIGINAL references
try:
    import sacrebleu
    
    # Load detokenized predictions and original references
    with open(f'{OUTPUT_DIR}/predictions.txt', 'r') as f:  # Detokenized
        predictions = [line.strip() for line in f]
    with open(f'{OPENNMT_DATA_PATH}/tgt-test.txt', 'r') as f:  # Original
        references = [line.strip() for line in f]
    
    # Compute BLEU
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    
    print("=" * 60)
    print("EVALUATION RESULTS")
    print("=" * 60)
    print(f"\nBLEU Score: {bleu.score:.2f}")
    print(f"\nDetailed: {bleu}")
    
except ImportError:
    print("sacrebleu not installed. Install with: pip install sacrebleu")
    print("\nAlternatively, you can evaluate from command line:")
    print(f"  sacrebleu {OPENNMT_DATA_PATH}/tgt-test.txt < {OUTPUT_DIR}/predictions.txt")


## 9) Translate Custom Sentences

Let's try translating some custom sentences!


In [None]:
# Custom sentences to translate
custom_sentences = [
    "The weather is beautiful today.",
    "I love learning new languages.",
    "Machine translation has improved significantly in recent years.",
    "Can you help me find the train station?",
    "This is a test sentence for our translation model.",
]

# Write to temp file and tokenize
custom_input_path = f'{OUTPUT_DIR}/custom_input.txt'
custom_input_tok_path = f'{OUTPUT_DIR}/custom_input.tok'
custom_output_tok_path = f'{OUTPUT_DIR}/custom_output.tok'
custom_output_path = f'{OUTPUT_DIR}/custom_output.txt'

with open(custom_input_path, 'w') as f:
    f.write('\n'.join(custom_sentences) + '\n')

# Tokenize input
tok_detok.tokenize(custom_input_path, custom_input_tok_path)

# Translate (tokenized input -> tokenized output)
result = subprocess.run(
    [
        'onmt_translate',
        '-model', BEST_MODEL,
        '-src', custom_input_tok_path,
        '-output', custom_output_tok_path,
        '-gpu', '0' if torch.cuda.is_available() else '-1',
        '-beam_size', '5',
    ],
    capture_output=True,
    text=True
)

# Detokenize output
tok_detok.detokenize(custom_output_tok_path, custom_output_path)

# Read detokenized translations
with open(custom_output_path, 'r') as f:
    translations = f.readlines()

# Display results
print("=" * 70)
print("CUSTOM TRANSLATIONS")
print("=" * 70)

for en, fr in zip(custom_sentences, translations):
    print(f"\nEN: {en}")
    print(f"FR: {fr.strip()}")


## Appendix: Full Training Configuration Reference

Here's a more complete configuration for production-quality models:

```yaml
# Full production configuration example
data:
  corpus_1:
    path_src: data/src-train.txt
    path_tgt: data/tgt-train.txt
    transforms: [sentencepiece, filtertoolong]
    weight: 1
  valid:
    path_src: data/src-val.txt
    path_tgt: data/tgt-val.txt
    transforms: [sentencepiece]

# Subword tokenization
src_subword_model: sentencepiece.model
tgt_subword_model: sentencepiece.model
src_subword_type: sentencepiece
tgt_subword_type: sentencepiece

# Filter long sentences
src_seq_length: 200
tgt_seq_length: 200

# Vocabulary
src_vocab: vocab.src
tgt_vocab: vocab.tgt
share_vocab: true

# Model (standard Transformer)
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout: [0.1]
attention_dropout: [0.1]
position_encoding: true

# Training
train_steps: 100000
valid_steps: 5000
batch_type: tokens
batch_size: 8192
accum_count: [2]

# Optimization
optim: adam
learning_rate: 2.0
warmup_steps: 8000
decay_method: noam
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1

# Saving
save_model: models/model
save_checkpoint_steps: 5000
keep_checkpoint: 5

# Hardware
world_size: 1
gpu_ranks: [0]
```
