# Build a language translation model

In this notebook we fine-tune a pretrained MarianMT model on the OPUS Books dataset so that it translates English into French.

# Set up

This command installs three Python libraries used for machine learning and natural language processing.


*   Transformers: a Hugging Face library that provides pretrained models, including the MarianMT translation model used here.
*   Datasets: another Hugging Face library that provides ready-to-use datasets for machine learning.
*   Torch: the core library of PyTorch, an open-source deep learning framework used to build and train models, with optional GPU acceleration.

In [1]:
!pip install transformers datasets torch
# We may also need to run: !pip install --upgrade transformers accelerate
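
Before going further it can help to confirm which versions were actually installed, since an upgrade may be needed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Check the three installed libraries plus accelerate, mentioned in the comment above
versions = installed_versions(["transformers", "datasets", "torch", "accelerate"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as "not installed" can be added with another pip command before continuing.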


We import our required packages:

In [None]:
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

This code loads the OPUS Books dataset, splits it into training and validation sets, and prints one example from each. The "en-fr" configuration selects the English-French sentence pairs (127,085 of them), and test_size=0.1 holds out 10% of those pairs for validation. Because the split is shuffled randomly and no seed is set, the examples printed will differ from run to run.

In [None]:
# Load the dataset
dataset = load_dataset("opus_books", "en-fr")

# Split into train and validation subsets
dataset = dataset["train"].train_test_split(test_size=0.1)  # 90% train, 10% validation

train_data = dataset["train"]
val_data = dataset["test"]

print("Sample Training Example:")
print(train_data[0])
print("Sample Validation Example:")
print(val_data[0])


Sample Training Example:
{'id': '10549', 'translation': {'en': '"I have not had the opportunity of speaking to him this morning."', 'fr': "-- Je n'ai pas encore eu occasion de lui parler ce matin."}}
Sample Validation Example:
{'id': '763', 'translation': {'en': "He found young men's costumes of days long gone by, frock coats with high velvet collars, dainty waistcoats cut very open, interminable white cravats, and patent-leather shoes dating from the beginning of the century.", 'fr': 'C’étaient des costumes de jeunes gens d’il y a longtemps, des redingotes à hauts cols de velours, de fins gilets très ouverts, d’interminables cravates blanches et des souliers vernis du début de ce siècle.'}}
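
To make the 90/10 split concrete, here is a rough standard-library sketch of what a shuffled train/test split does: shuffle the row indices, then carve off a fraction. The function name `train_test_split_indices`, the rounding rule, and the seed are our own assumptions for illustration; this is not the Datasets library's implementation.

```python
import math
import random


def train_test_split_indices(n_examples, test_size=0.1, seed=None):
    """Shuffle the row indices and hold out a test_size fraction of them."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_examples)  # assumed rounding: round the held-out part up
    return indices[n_test:], indices[:n_test]


# 127,085 is the number of en-fr sentence pairs in the dataset
train_idx, val_idx = train_test_split_indices(127085, test_size=0.1, seed=42)
print(len(train_idx), len(val_idx))  # 114376 12709
```

Passing a fixed seed makes the shuffle repeatable, which is useful when we want the same validation examples on every run.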


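This code loads the pretrained Helsinki-NLP English-to-French MarianMT model and its tokenizer, then defines a preprocessing function that tokenizes the English source sentences and the French target sentences, truncating or padding each one to 128 tokens.

In [None]: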
# Load the pretrained MarianMT tokenizer and model
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)

model = MarianMTModel.from_pretrained(model_name)

def preprocess_function(examples):
    # Extract lists of English (source) and French (target) sentences
    inputs = [item["en"] for item in examples["translation"]]
    targets = [item["fr"] for item in examples["translation"]]

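    # Tokenize the inputs and targets; the target token ids are returned under "labels"
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True, padding="max_length")

    # Replace padding in the labels with -100 so the loss ignores padded positions
    pad_id = tokenizer.pad_token_id
    model_inputs["labels"] = [
        [tok if tok != pad_id else -100 for tok in label] for label in model_inputs["labels"]
    ]
    return model_inputs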
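
The effect of truncation=True and padding="max_length" can be sketched with plain lists. This is a toy illustration only: the function `pad_or_truncate` and the pad id of 0 are hypothetical, and a real tokenizer also handles special tokens such as the end-of-sequence marker, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Force a list of token ids to exactly max_length and build its attention mask."""
    ids = token_ids[:max_length]                   # truncation: drop anything past max_length
    n_real = len(ids)
    n_pad = max_length - n_real
    ids = ids + [pad_id] * n_pad                   # padding: fill up to max_length
    attention_mask = [1] * n_real + [0] * n_pad    # 1 = real token, 0 = padding
    return ids, attention_mask


PAD_ID = 0  # hypothetical pad id, for illustration only
short_ids, short_mask = pad_or_truncate([17, 42, 8], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)  # [17, 42, 8, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Padding every example to the same length makes batching simple, at the cost of extra computation on the padded positions.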



This code prepares the dataset for training: it applies the preprocessing function to both splits, removes the raw "translation" and "id" columns that the model does not use, and sets the output format to PyTorch tensors.

In [None]:
# Apply preprocessing to train and validation datasets
train_dataset = train_data.map(preprocess_function, batched=True)
val_dataset = val_data.map(preprocess_function, batched=True)

# Remove the raw "translation" and "id" columns, which the model does not use
train_dataset = train_dataset.remove_columns(["translation", "id"])
val_dataset = val_dataset.remove_columns(["translation", "id"])

# Convert datasets to PyTorch format
train_dataset.set_format("torch")
val_dataset.set_format("torch")
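
With batched=True, the function passed to map receives a dictionary of columns, each holding a list of values, rather than one row at a time. The sketch below imitates that calling convention with plain Python so the data flow is easier to see. The helpers `batched_map` and `toy_preprocess` are our own stand-ins and only approximate what the Datasets library does.

```python
def toy_preprocess(batch):
    """Receives a dict of column -> list, like a function passed to map(batched=True)."""
    pairs = batch["translation"]
    return {
        "source_chars": [len(p["en"]) for p in pairs],
        "target_chars": [len(p["fr"]) for p in pairs],
    }


def batched_map(rows, fn, batch_size=1000):
    """Apply fn to slices of rows and merge the returned columns back into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_cols.items()}})
    return out


rows = [
    {"id": "0", "translation": {"en": "Hello", "fr": "Bonjour"}},
    {"id": "1", "translation": {"en": "Thank you", "fr": "Merci"}},
    {"id": "2", "translation": {"en": "Good night", "fr": "Bonne nuit"}},
]
mapped = batched_map(rows, toy_preprocess, batch_size=2)

# Equivalent of remove_columns(["translation", "id"])
trimmed = [{k: v for k, v in row.items() if k not in ("translation", "id")} for row in mapped]
print(trimmed[0])  # {'source_chars': 5, 'target_chars': 7}
```

Processing rows in batches lets the tokenizer work on many sentences per call, which is usually much faster than tokenizing one sentence at a time.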


Let's display the first example of the raw training split. Note that this prints train_data (the untokenized text), not the tokenized train_dataset.

In [None]:
print(train_data[0])

{'id': '10549', 'translation': {'en': '"I have not had the opportunity of speaking to him this morning."', 'fr': "-- Je n'ai pas encore eu occasion de lui parler ce matin."}}


This block sets the training parameters for a sequence-to-sequence (Seq2Seq) model, such as a translation or text summarization model. They tell the Trainer how to train, where to save results, and which performance settings to use.



*   output_dir="./results": saves checkpoints and other training outputs in the "./results" folder.
*   evaluation_strategy="epoch": evaluates the model on the validation set after each epoch (one full pass through the training data).
*   learning_rate=5e-5: sets the initial learning rate (0.00005), which scales the size of each weight update.
*   per_device_train_batch_size=16: processes 16 examples at a time per device (GPU or CPU) during training.
*   per_device_eval_batch_size=16: uses the same batch size for evaluation.
*   weight_decay=0.01: adds a small penalty on large weights to improve generalization.
*   save_total_limit=3: keeps only the latest 3 model checkpoints to save storage space.
*   num_train_epochs=1: trains for 1 full pass through the training data; increase this (for example to 3) for a longer run.
*   predict_with_generate=True: uses the model's generate() method during evaluation, so that generation-based metrics can be computed on the translated text.
*   logging_dir="./logs": stores logs in the "./logs" folder.
*   logging_steps=50: logs training progress every 50 steps.
*   report_to="none": disables reporting to external trackers such as TensorBoard or Weights & Biases (WandB).
*   dataloader_num_workers=2: uses 2 worker processes to load data in parallel.
*   no_cuda=False: uses a GPU if available, otherwise falls back to CPU.
*   fp16=False: disables 16-bit mixed precision and uses standard 32-bit calculations. On GPUs that support it, fp16=True typically makes training faster and reduces memory use.


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=50,
    report_to="none",
    dataloader_num_workers=2,
    no_cuda=False,
    fp16=False
)
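
To get a feel for how long one epoch is with the values passed in this cell, we can estimate the number of optimizer steps. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; the helper `steps_per_epoch` is our own, and the Trainer's exact bookkeeping may differ slightly.

```python
import math


def steps_per_epoch(n_examples, per_device_batch_size, n_devices=1):
    """Approximate optimizer steps for one pass over the training set."""
    return math.ceil(n_examples / (per_device_batch_size * n_devices))


n_train = 114376           # 90% of the 127,085 sentence pairs
steps = steps_per_epoch(n_train, per_device_batch_size=16)
log_events = steps // 50   # logging_steps=50
print(steps, log_events)   # 7149 142
```

So a single epoch is roughly 7,149 steps, with the training loss logged about 142 times along the way.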



This cell creates the Trainer, which manages training and evaluation of the translation model, and then starts the training process.

In [None]:
# Create the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

# Train the model
trainer.train()


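This code evaluates the trained model on the validation dataset and prints the resulting metrics. Because no compute_metrics function was passed to the Trainer, the output contains the validation loss and runtime statistics rather than a translation-quality score such as BLEU.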

In [None]:
# Evaluate the model
metrics = trainer.evaluate()
print("Evaluation Metrics:", metrics)
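
The validation loss is easier to interpret when converted to perplexity, which is the exponential of the loss. The sketch below assumes the "eval_loss" entry is the mean per-token cross-entropy measured in nats; the loss value shown is a placeholder chosen for illustration, not a result from this notebook.

```python
import math


def perplexity_from_loss(eval_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


# Placeholder value for illustration only, NOT a result from this notebook
example_metrics = {"eval_loss": 1.0}
print("Perplexity:", perplexity_from_loss(example_metrics["eval_loss"]))
```

In the notebook itself we would pass metrics["eval_loss"] instead of the placeholder. Lower perplexity means the model assigns higher probability to the reference translations.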

Finally, we translate a sample sentence to see how the model performs.

In [None]:
# Example test input
test_text = "I love AI security."

# Tokenize the input text and move the tensors to the model's device (GPU or CPU)
inputs = tokenizer(test_text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(model.device)

# Generate translation
outputs = model.generate(**inputs)

# Decode the translation
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Input: {test_text}")
print(f"Translation: {translation}")
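
To translate several sentences at once, the same steps can be wrapped in a small helper. This is a minimal sketch: `translate_batch` is a name we introduce here, and it assumes it is given a model and tokenizer like the ones loaded earlier in this notebook.

```python
def translate_batch(texts, model, tokenizer, max_length=128):
    """Translate a list of sentences with a seq2seq model and its tokenizer."""
    # Tokenize all sentences together, padding to the longest one in the batch
    inputs = tokenizer(
        texts, return_tensors="pt", truncation=True, padding=True, max_length=max_length
    )
    # Keep the inputs on the same device as the model
    inputs = inputs.to(model.device)
    outputs = model.generate(**inputs)
    # Turn the generated token ids back into text
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

For example, translate_batch(["I love AI security.", "Where is the library?"], model, tokenizer) would return a list with one French string per input sentence.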


Et voilà!