In [1]:
!pip install transformers datasets accelerate sacremoses



accelerate: This library helps optimize your training process. When you train a large machine learning model, it can be slow and resource-intensive. Accelerate handles a lot of the complexity behind the scenes, like using multiple GPUs, mixed-precision training (which makes training faster), and distributing the training across multiple machines. It makes it easier to run large-scale training jobs without having to rewrite your entire code.

sacremoses: This is a text preprocessing library specifically designed for machine translation. Its main job is to tokenize and detokenize text, but it also handles other tasks like cleaning up punctuation and special characters. When you're working with multiple languages, these preprocessing steps are crucial to ensure the data is in the correct format for the model to learn from. sacremoses helps make sure that the raw text is properly prepared before it's fed to the model and that the output from the model is correctly converted back into human-readable text.

In [2]:
!pip install torch



In [3]:
!pip install --upgrade transformers



In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import Dataset, load_dataset
import torch
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

In [5]:
# Load pre-trained models and tokenizers from Hugging Face
model_name_en_fr = "Helsinki-NLP/opus-mt-en-fr"
tokenizer_en_fr = AutoTokenizer.from_pretrained(model_name_en_fr)
model_en_fr = AutoModelForSeq2SeqLM.from_pretrained(model_name_en_fr)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
# Load pre-trained models and tokenizers from Hugging Face
model_name_en_hi = "Helsinki-NLP/opus-mt-en-hi"
tokenizer_en_hi = AutoTokenizer.from_pretrained(model_name_en_hi)
model_en_hi = AutoModelForSeq2SeqLM.from_pretrained(model_name_en_hi)

In [7]:
# --- Step 3: Create a Custom Dataset for Fine-tuning ---
# This version downloads a real dataset from the Hugging Face Hub.
print("Downloading and preparing dataset for fine-tuning...")
# Downloading a small subset of a public dataset for demonstration.
dataset_fr = load_dataset("opus_books", "en-fr", split="train[:500]")
dataset_hi = load_dataset("cfilt/iitb-english-hindi", "default", split="train[:500]")

Downloading and preparing dataset for fine-tuning...


In [8]:
#--- Step 4: Tokenization ----#
def tokenize_and_prepare_fr(examples):
    # To handle the batched list of translation dictionaries
    source_sentences = [ex['en'] for ex in examples["translation"]]
    target_sentences = [ex['fr'] for ex in examples["translation"]]

    inputs = tokenizer_en_fr(source_sentences, max_length=128, truncation=True, padding="max_length")
    with tokenizer_en_fr.as_target_tokenizer():
        labels = tokenizer_en_fr(target_sentences, max_length=128, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

def tokenize_and_prepare_hi(examples):
    # To handle the batched list of translations
    source_sentences = [ex['en'] for ex in examples["translation"]]
    target_sentences = [ex['hi'] for ex in examples["translation"]]
    inputs = tokenizer_en_hi(source_sentences, max_length=128, truncation=True, padding="max_length")
    with tokenizer_en_hi.as_target_tokenizer():
        labels = tokenizer_en_hi(target_sentences, max_length=128, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_datasets_fr = dataset_fr.map(tokenize_and_prepare_fr, batched=True)
tokenized_datasets_hi = dataset_hi.map(tokenize_and_prepare_hi, batched=True)

In [12]:
!pip install transformers==4.55.2



In [15]:
#--Step 5: Fine-Tuning The Model ---#
# Setup training arguments and trainer for the French model
training_args_fr = Seq2SeqTrainingArguments(
    output_dir="./fr_model",
    per_device_train_batch_size=2,
    num_train_epochs=5,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=1,
    save_strategy="epoch",
    logging_dir='./fr_logs',
    load_best_model_at_end=False,
    metric_for_best_model="loss",
    report_to="none"
)

trainer_fr = Seq2SeqTrainer(
    model=model_en_fr,
    args=training_args_fr,
    train_dataset=tokenized_datasets_fr,
    tokenizer=tokenizer_en_fr,
    data_collator=DataCollatorForSeq2Seq(tokenizer_en_fr, model=model_en_fr)
)

print("\nStarting French model fine-tuning...")
trainer_fr.train()

# Setup training arguments and trainer for the Hindi model
training_args_hi = Seq2SeqTrainingArguments(
    output_dir="./hi_model",
    per_device_train_batch_size=2,
    num_train_epochs=5,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=1,
    save_strategy="epoch",
    logging_dir='./hi_logs',
    load_best_model_at_end=False,
    metric_for_best_model="loss",
    report_to="none"
)

trainer_hi = Seq2SeqTrainer(
    model=model_en_hi,
    args=training_args_hi,
    train_dataset=tokenized_datasets_hi,
    tokenizer=tokenizer_en_hi,
    data_collator=DataCollatorForSeq2Seq(tokenizer_en_hi, model=model_en_hi)
)

print("\nStarting Hindi model fine-tuning...")
trainer_hi.train()

  trainer_fr = Seq2SeqTrainer(



Starting French model fine-tuning...


Step,Training Loss
500,0.367
1000,0.1267


  trainer_hi = Seq2SeqTrainer(



Starting Hindi model fine-tuning...


Step,Training Loss
500,0.2067
1000,0.0071




TrainOutput(global_step=1250, training_loss=0.08630350852012635, metrics={'train_runtime': 4528.8805, 'train_samples_per_second': 0.552, 'train_steps_per_second': 0.276, 'total_flos': 84745912320000.0, 'train_loss': 0.08630350852012635, 'epoch': 5.0})

In [16]:
# --- 6. Save the Fine-tuned Models ---
print("\nSaving final fine-tuned models to disk...")
model_en_fr.save_pretrained("./fine-tuned-en-fr-model")
tokenizer_en_fr.save_pretrained("./fine-tuned-en-fr-model")

model_en_hi.save_pretrained("./fine-tuned-en-hi-model")
tokenizer_en_hi.save_pretrained("./fine-tuned-en-hi-model")

print("Fine-tuned models saved successfully to disk!")



Saving final fine-tuned models to disk...
Fine-tuned models saved successfully to disk!


In [17]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
