# Fine-Tuning mBart for English to Persian Subtitle Translation 🎥📝🤖

## Introduction

<img src='https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-01_at_9.49.47_PM.png' />

In this notebook, the pre-trained [mBart 50](https://arxiv.org/abs/2008.00401) model from the [Hugging Face Model Hub](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) has been fine-tuned on a dataset of [English-Persian subtitle pairs](https://huggingface.co/datasets/Peymansoft/English-Persian-Subtitle). The primary goal of this fine-tuning process is to enhance the model's ability to generate translations that closely mimic the style and tone typical of subtitles.

In our experiments, the fine-tuned model adapts to the conventions of subtitle language and reaches a SacreBLEU score of about 19.8 on the validation split, producing translations that read more naturally and fit the context better for viewers.

The final model, demonstrating improved translation performance, has been [pushed to Hugging Face](https://huggingface.co/Peymansoft/MBart-50-Subtitle-English-Persian) for open-source access and further development by the community. This repository aims to provide a comprehensive overview of the fine-tuning process and facilitate further advancements in subtitle translation.



In [None]:
# install dependencies
!pip install datasets sacrebleu evaluate


# Load Dataset 📂📊

In [None]:
import pandas as pd
import numpy as np
from datasets import load_dataset


# load the dataset (from Hugging Face datasets hub)
raw_datasets = load_dataset("Peymansoft/English-Persian-Subtitle")


In [None]:
# Raw dataset structure
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['source', 'target'],
        num_rows: 151420
    })
    test: Dataset({
        features: ['source', 'target'],
        num_rows: 18928
    })
    validation: Dataset({
        features: ['source', 'target'],
        num_rows: 18928
    })
})
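
As a quick sanity check on the split sizes printed above, the short sketch below works out each split's share of the data. The row counts are copied from the `DatasetDict` output; nothing else is assumed.

```python
# Row counts copied from the DatasetDict output above.
split_sizes = {"train": 151420, "test": 18928, "validation": 18928}

total = sum(split_sizes.values())
shares = {name: 100 * rows / total for name, rows in split_sizes.items()}

for name, pct in shares.items():
    print(f"{name:<10} {split_sizes[name]:>7} rows  {pct:5.1f}%")
```

The shares come out at roughly 80/10/10 for train/test/validation.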

In [None]:
# First train sample - English
raw_datasets['train'][0]['source']

'Hey, one second!'

In [None]:
# First train sample - Persian
raw_datasets['train'][0]['target']

'هی یه لحظه'

Optionally, select a subset of the training samples for faster training or testing.

In [None]:
# Randomly select a subset of the training samples if you do not need all of them (uncomment to use)
#num_samples = 1000
#raw_datasets['train'] = raw_datasets['train'].shuffle(seed=42).select(range(num_samples))

In [None]:
# Raw dataset structure
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['source', 'target'],
        num_rows: 151420
    })
    test: Dataset({
        features: ['source', 'target'],
        num_rows: 18928
    })
    validation: Dataset({
        features: ['source', 'target'],
        num_rows: 18928
    })
})

# Tokenization 🔤✂️

Here, you must select the **checkpoint** path. This is crucial because the tokenization and model structure are determined based on this path.

In [None]:
from transformers import AutoTokenizer

checkpoint = 'facebook/mbart-large-50-many-to-many-mmt'  # Pre-trained mBart model checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # tensors are created later by the data collator
tokenizer.src_lang = "en_XX"  # source language: English
tokenizer.tgt_lang = "fa_IR"  # target language: Persian


Let's examine how the tokenizer performs on a single instance.

In [None]:
# a single tokenization example

en_sentence = raw_datasets['train'][0]['source']
fa_sentence = raw_datasets['train'][0]['target']

inputs = tokenizer(en_sentence, text_target=fa_sentence)  # Called "inputs" because this is what will be fed to the model.

In [None]:
# The tokenization result for a single instance is as follows
inputs

{'input_ids': [250004, 28240, 4, 1632, 17932, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [250029, 60678, 21333, 88102, 2]}

In [None]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [None]:
# Tokens for the English instance:
print(tokenizer.convert_ids_to_tokens(inputs['input_ids']))

['en_XX', '▁Hey', ',', '▁one', '▁second', '!', '</s>']


In [None]:
# Tokens for the Persian instance
print(tokenizer.convert_ids_to_tokens(inputs['labels']))

['fa_IR', '▁هی', '▁یه', '▁لحظه', '</s>']
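
The two token lists above share one layout: a language-code token, then the sentence tokens, then `</s>`. The sketch below rebuilds that layout by hand from the IDs printed earlier. The helper `wrap_with_language_code` is a hypothetical name introduced only for illustration, not part of the tokenizer API.

```python
# IDs read off the tokenizer output shown above.
EN_XX, FA_IR, EOS = 250004, 250029, 2


def wrap_with_language_code(token_ids, lang_code_id, eos_id=EOS):
    """Return [lang_code] + tokens + [eos], the layout seen in the example."""
    return [lang_code_id, *token_ids, eos_id]


# Sentence-level token IDs, also taken from the example output.
source_ids = wrap_with_language_code([28240, 4, 1632, 17932, 38], EN_XX)
target_ids = wrap_with_language_code([60678, 21333, 88102], FA_IR)

print(source_ids)
print(target_ids)
```

Both results match the `input_ids` and `labels` produced by the tokenizer for the first training sample.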


In [None]:
max_length = 128  # Maximum tokenized sequence length; adjust it to suit your data.

# Define a function that tokenizes raw_datasets via the map() method.

def preprocess_function(examples):
    # With batched=True, each column arrives as a list of strings.
    inputs = examples["source"]
    targets = examples["target"]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
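
One way to judge whether `max_length = 128` suits the data is to count how many sequences would be truncated. The sketch below does this for a list of token lengths. The helper `truncation_report` and the toy lengths are hypothetical, introduced only for illustration.

```python
def truncation_report(token_lengths, max_length):
    """Summarise how many sequences exceed max_length."""
    truncated = sum(1 for n in token_lengths if n > max_length)
    return {
        "longest": max(token_lengths),
        "truncated": truncated,
        "share_truncated": truncated / len(token_lengths),
    }


# Hypothetical token lengths, NOT measured on the subtitle dataset.
toy_lengths = [7, 5, 12, 31, 140, 9]
report = truncation_report(toy_lengths, max_length=128)
print(report)
```

On the real data, the lengths would need to come from a tokenization pass without `truncation=True`, since truncated sequences are already capped at `max_length`.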

In [None]:
# Tokenize raw_datasets
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


In [None]:
# Remove the columns that are no longer needed; we only require the tokenization results.
tokenized_datasets = tokenized_datasets.remove_columns(raw_datasets["train"].column_names)

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 151420
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 18928
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 18928
    })
})

# Model 🤖🧠

In [None]:
# Load the pre-trained mBart model from Hugging Face.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

## Freeze 🧊🔒

In [None]:
# Optionally freeze some of the pre-trained layers. There are several ways to do this;
# here the encoder is frozen, while the decoder is still updated during fine-tuning.
for param in model.model.encoder.parameters():
    param.requires_grad = False
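
After freezing, it can be useful to confirm how many parameters remain trainable. The sketch below is a minimal counter that relies only on the `parameters()`, `numel()` and `requires_grad` interface that PyTorch modules expose. The helper `count_parameters` and the stand-in objects are hypothetical, and the stand-ins exist only so the example runs without loading the full model. A check like this may also reveal whether shared weights, such as tied token embeddings, were frozen along with the encoder.

```python
from types import SimpleNamespace


def count_parameters(model):
    """Return (trainable, frozen) parameter counts for a PyTorch-style model."""
    trainable = frozen = 0
    for param in model.parameters():
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    return trainable, frozen


def fake_param(size, requires_grad):
    # Stand-in for a torch Parameter; only numel() and requires_grad are mimicked.
    return SimpleNamespace(numel=lambda: size, requires_grad=requires_grad)


# Toy "model": one frozen tensor with 1000 values and one trainable tensor with 250.
toy_model = SimpleNamespace(
    parameters=lambda: [fake_param(1000, False), fake_param(250, True)]
)

trainable, frozen = count_parameters(toy_model)
print(f"trainable={trainable}, frozen={frozen}")
```

In the notebook, calling `count_parameters(model)` right after the freezing loop should work the same way on the real model.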

## Create Batches using DataCollator 📦

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])
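
To make the collator's padding behaviour concrete, here is a simplified pure-Python sketch of dynamic padding for one batch. It assumes right-side padding, a pad token ID of 1, and a label padding value of -100 (the value the loss ignores). `pad_batch` is a hypothetical helper, the second example is made up, and the real collator additionally builds `decoder_input_ids`, which this sketch omits.

```python
PAD_ID = 1           # assumed pad token ID for this tokenizer
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features, pad_id=PAD_ID, label_pad_id=LABEL_PAD_ID):
    """Pad every example to the longest input and longest label in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * in_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * in_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * lab_pad)
    return batch


features = [
    # First training sample, as tokenized earlier in the notebook.
    {"input_ids": [250004, 28240, 4, 1632, 17932, 38, 2],
     "labels": [250029, 60678, 21333, 88102, 2]},
    # Hypothetical shorter example, made up for illustration.
    {"input_ids": [250004, 28240, 2], "labels": [250029, 60678, 2]},
]

batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
print(batch["labels"][1])
```

Padding per batch rather than to a fixed `max_length` keeps short subtitle lines from being dominated by padding tokens.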

# Evaluation Definition 📊🔍

In [None]:
import evaluate

metric = evaluate.load("sacrebleu")


In [None]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
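
The `np.where` step in `compute_metrics` matters because -100 is not a valid vocabulary index, so the tokenizer cannot decode it. The sketch below shows the replacement on a small array. The pad ID of 1 is an assumption standing in for `tokenizer.pad_token_id`, and the second row is a made-up example.

```python
import numpy as np

PAD_ID = 1  # assumed stand-in for tokenizer.pad_token_id

labels = np.array([
    [250029, 60678, 21333, 88102, 2],  # first training sample from the notebook
    [250029, 60678, 2, -100, -100],    # hypothetical shorter label, padded with -100
])

# Keep real token IDs and swap the -100 placeholders for the pad ID.
cleaned = np.where(labels != -100, labels, PAD_ID)
print(cleaned)
```

Because decoding uses `skip_special_tokens=True`, the substituted pad tokens should drop out of the decoded text.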

# Training 🤖📚

In [None]:
from transformers import Seq2SeqTrainingArguments

# Training settings
args = Seq2SeqTrainingArguments(
    checkpoint,  # output_dir: checkpoints are written to a local folder with this name
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    gradient_accumulation_steps=2,
    dataloader_num_workers=16,
    logging_strategy="steps",
    logging_steps=500
)



In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



In [None]:
# Start fine-tuning process
trainer.train()





Non-default generation parameters: {'max_length': 200, 'early_stopping': True, 'num_beams': 5, 'forced_eos_token_id': 2}


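| Step | Training Loss | Validation Loss | Bleu |
|------|---------------|-----------------|------|
| 1000 | 1.5132 | 1.49896 | 18.007231 |
| 2000 | 1.3571 | 1.448733 | 18.903832 |
| 3000 | 1.2572 | 1.427403 | 19.525188 |
| 4000 | 1.1832 | 1.420664 | 19.507284 |
| 5000 | 1.1418 | 1.422482 | 19.625366 |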
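
The logged values above can be inspected programmatically to see where validation performance levels off. The sketch below copies the five logged rows and looks up the step with the best BLEU, the step with the lowest validation loss, and the gap between validation and training loss. It is a reading aid only, not part of the training run.

```python
# (step, training loss, validation loss, BLEU), copied from the log above.
history = [
    (1000, 1.5132, 1.498960, 18.007231),
    (2000, 1.3571, 1.448733, 18.903832),
    (3000, 1.2572, 1.427403, 19.525188),
    (4000, 1.1832, 1.420664, 19.507284),
    (5000, 1.1418, 1.422482, 19.625366),
]

best_bleu = max(history, key=lambda row: row[3])
lowest_val = min(history, key=lambda row: row[2])
# Validation loss minus training loss at each logged step.
gaps = [(step, val - train) for step, train, val, _ in history]

print("best BLEU at step", best_bleu[0])
print("lowest validation loss at step", lowest_val[0])
for step, gap in gaps:
    print(step, round(gap, 4))
```

BLEU peaks at step 5000 while validation loss bottoms out at step 4000 and the gap to training loss keeps widening, which may suggest diminishing returns from further epochs.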




TrainOutput(global_step=5910, training_loss=1.2958590331618152, metrics={'train_runtime': 3380.9134, 'train_samples_per_second': 447.867, 'train_steps_per_second': 1.748, 'total_flos': 7.196330256000614e+16, 'train_loss': 1.2958590331618152, 'epoch': 9.991546914623838})
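
The `global_step=5910` reported above can be reproduced from the training settings. The sketch below works through the arithmetic, assuming a single GPU and assuming that the number of optimizer updates per epoch is the batch count floored by the accumulation factor. Both assumptions are consistent with the reported figures but are not stated in the log.

```python
import math

train_rows = 151420     # size of the train split
per_device_batch = 128  # per_device_train_batch_size
grad_accum = 2          # gradient_accumulation_steps
epochs = 10             # num_train_epochs
num_devices = 1         # assumption: training ran on a single GPU

effective_batch = per_device_batch * grad_accum * num_devices
batches_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs
# Fractional epoch reached after the final optimizer update.
epochs_completed = total_updates / (batches_per_epoch / grad_accum)

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
print(epochs_completed)
```

With these settings each optimizer update sees 256 examples, and the 5910 updates correspond to just under 10 full epochs, in line with the `epoch` value of about 9.9915 in `TrainOutput`.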

# Evaluate the Model 📊🔍

## Scores 📊

In [None]:
trainer.evaluate(max_length=max_length)



{'eval_loss': 1.4232980012893677,
 'eval_bleu': 19.833595434409624,
 'eval_runtime': 332.9199,
 'eval_samples_per_second': 56.855,
 'eval_steps_per_second': 0.445,
 'epoch': 9.991546914623838}

## Inference 🔍

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
fine_tuned_checkpoint = "/content/facebook/mbart-large-50-many-to-many-mmt/checkpoint-5910"
translator = pipeline("translation", model=fine_tuned_checkpoint, src_lang="en_XX", tgt_lang="fa_IR")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
translator("I'm gonna make him an offer he can't refuse.")

[{'translation_text': 'من به اون پیشنهادی میدم که نمیتونه رد کنه'}]

In [None]:
translator("Toto, I've a feeling we're not in Kansas anymore.")

[{'translation_text': 'توتو، حس میکنم دیگه توی کانزاس نیستیم'}]
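
For subtitle files it is usually more convenient to translate many lines at once and get plain strings back. The sketch below wraps that pattern, assuming the translator accepts a list of strings and returns one `{'translation_text': ...}` dictionary per input, as in the single-sentence outputs above. `translate_lines` is a hypothetical helper, and `fake_translator` is a stand-in used only so the example runs without loading the model.

```python
def translate_lines(translator, lines):
    """Translate subtitle lines in one call and return plain strings."""
    results = translator(list(lines))
    return [item["translation_text"] for item in results]


def fake_translator(texts):
    # Stand-in that mimics the output shape of the translation pipeline.
    return [{"translation_text": f"<fa> {text}"} for text in texts]


subtitle_lines = ["Hey, one second!", "Thank you."]
print(translate_lines(fake_translator, subtitle_lines))
```

In the notebook, passing the real `translator` object instead of `fake_translator` should return the Persian strings in the same order as the input lines.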

# Pushing the Model to the Hugging Face Hub 🚀🤗☁️

## Push to Hub 🚀

In [None]:
#from huggingface_hub import notebook_login

#notebook_login()


In [None]:
#model = AutoModelForSeq2SeqLM.from_pretrained( "/content/facebook/mbart-large-50-many-to-many-mmt/checkpoint-5910")
#tokenizer = AutoTokenizer.from_pretrained("/content/facebook/mbart-large-50-many-to-many-mmt/checkpoint-5910")

In [None]:
#model.push_to_hub("Peymansoft/MBart-50-Subtitle-English-Persian")
#tokenizer.push_to_hub("Peymansoft/MBart-50-Subtitle-English-Persian")


CommitInfo(commit_url='https://huggingface.co/Peymansoft/MBart-50-Subtitle-English-Persian/commit/e64534ef6e2c71983cb15c97ee35777de46ac7d1', commit_message='Upload tokenizer', commit_description='', oid='e64534ef6e2c71983cb15c97ee35777de46ac7d1', pr_url=None, pr_revision=None, pr_num=None)

## Load the Pushed Model  📥☁️

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("Peymansoft/MBart-50-Subtitle-English-Persian")
tokenizer = AutoTokenizer.from_pretrained("Peymansoft/MBart-50-Subtitle-English-Persian")


In [None]:
from transformers import pipeline


fine_tuned_checkpoint = "Peymansoft/MBart-50-Subtitle-English-Persian"
translator = pipeline("translation", model=fine_tuned_checkpoint, src_lang="en_XX", tgt_lang="fa_IR")

