# Leveraging Fine-Tuned mBart 50 for English-Persian Subtitle Translation


## Introduction

This notebook demonstrates how to use the [mBart 50 model fine-tuned for translating English subtitles into Persian](https://huggingface.co/Peymansoft/MBart-50-Subtitle-English-Persian). A [Gradio](https://www.gradio.app/) user interface lets users interact with the model without writing any code.

This notebook features two primary functionalities:

### 1. Subtitle Translation:
Users can upload an English SRT subtitle file. The model translates the text of each subtitle and produces a Persian SRT file, which can be downloaded as a .zip archive. (**To use this functionality, please run the code cell in the <mark style="background-color: yellow;">'Subtitle Translation With User Interface'</mark> section.**)

### 2. Sentence Translation:
Users can enter arbitrary English sentences and compare the translations produced by the pre-trained mBart 50 model before and after fine-tuning. This comparison highlights the improvements gained through fine-tuning. (**To use this functionality, please run the code cell in the <mark style="background-color: yellow;">'Sentence Translation With User Interface'</mark> section.**)
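
To make the subtitle workflow concrete, here is a minimal, standard-library-only sketch of the round trip: parse the SRT cues, translate only the text, and re-compose the file with indices and timings untouched. The `Cue` class and the `parse_srt`, `compose_srt`, and `translate_cues` helpers are hypothetical names written for illustration (the notebook itself relies on the `srt` package), and `fake_translate` simply upper-cases text as a stand-in for the model.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    timing: str  # e.g. "00:00:01,000 --> 00:00:03,000", kept verbatim
    text: str    # the only field a translator should change


def parse_srt(content: str) -> list[Cue]:
    """Split SRT text into cues; cues are separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks rather than guessing
        cues.append(Cue(int(lines[0]), lines[1], "\n".join(lines[2:])))
    return cues


def compose_srt(cues: list[Cue]) -> str:
    """Serialize cues back to SRT text."""
    return "\n\n".join(f"{c.index}\n{c.timing}\n{c.text}" for c in cues) + "\n"


def translate_cues(cues: list[Cue], translate_fn) -> list[Cue]:
    """Apply translate_fn to each cue's text, leaving index and timing alone."""
    return [Cue(c.index, c.timing, translate_fn(c.text)) for c in cues]


sample = (
    "1\n00:00:01,000 --> 00:00:03,000\nHello there.\n\n"
    "2\n00:00:04,000 --> 00:00:06,500\nHow are you?\n"
)

fake_translate = str.upper  # hypothetical stand-in for the translation model
translated = translate_cues(parse_srt(sample), fake_translate)
print(compose_srt(translated))
```

The point of the sketch is the separation of concerns: timing data passes through unchanged, so the translated file stays synchronized with the video regardless of what the translation function returns.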

In [None]:
# install dependencies
!pip install datasets srt gradio

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

# Subtitle Translation Without User Interface ⚙️

In [None]:
from datasets import Dataset
import srt
from tqdm import tqdm
from transformers import pipeline

checkpoint = 'Peymansoft/MBart-50-Subtitle-English-Persian'
translator = pipeline("translation", model=checkpoint, src_lang="en_XX", tgt_lang="fa_IR", device=0)

# Read and parse the .srt file
def read_srt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    subtitles = list(srt.parse(content))
    return subtitles

# Create a dataset from the subtitles
def create_dataset(subtitles):
    subtitle_texts = [subtitle.content for subtitle in subtitles]
    data_dict = {"text": subtitle_texts}
    dataset = Dataset.from_dict(data_dict)
    return dataset

# Translate using dataset and Hugging Face pipeline
def translate_with_dataset(subtitles, translator, batch_size=16):
    dataset = create_dataset(subtitles)

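    # Translate each mapped batch. batch_size is passed to the pipeline explicitly,
    # because by default the pipeline processes a list one item at a time.
    def translate_batch(batch):
        translations = translator(batch['text'], batch_size=len(batch['text']))
        return {"translated_text": [t['translation_text'] for t in translations]}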

    translated_dataset = dataset.map(translate_batch, batched=True, batch_size=batch_size)

    # Update subtitle contents with translations
    translated_texts = translated_dataset["translated_text"]
    for idx, subtitle in enumerate(subtitles):
        subtitle.content = translated_texts[idx]

    return subtitles

# Write the translated subtitles back to .srt
def write_srt(file_path, subtitles):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(srt.compose(subtitles))

# Full process
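input_file = "english_subtitle.srt"   # English source subtitle (placeholder name; use your own path)
output_file = "persian_subtitle.srt"  # translated Persian subtitle is written here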

subtitles = read_srt(input_file)
translated_subtitles = translate_with_dataset(subtitles, translator)
write_srt(output_file, translated_subtitles)
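
Because the script writes its result to a path on disk, it helps to derive the output name from the input name so the source file can never be overwritten by accident. The helper below is a small sketch of one way to do that; `translated_path` and the `fa` language tag are hypothetical choices for illustration, not part of the original notebook.

```python
from pathlib import Path


def translated_path(input_path: str, lang_tag: str = "fa") -> Path:
    """Return a sibling path with a language tag inserted before the suffix."""
    src = Path(input_path)
    return src.with_name(f"{src.stem}.{lang_tag}{src.suffix}")


out = translated_path("english_subtitle.srt")
print(out.name)  # english_subtitle.fa.srt
```

With this in place, `output_file = translated_path(input_file)` would keep the translated subtitle next to the original under a distinct name.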


# Subtitle Translation With User Interface  🖥️ 🎛️

In [None]:
import gradio as gr
from transformers import pipeline
from datasets import Dataset
import srt
from tqdm import tqdm
import zipfile
import os

# Load the model
checkpoint = 'Peymansoft/MBart-50-Subtitle-English-Persian'
translator = pipeline("translation", model=checkpoint, src_lang="en_XX", tgt_lang="fa_IR", device=0)

# Function to read and parse .srt file
def read_srt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    subtitles = list(srt.parse(content))
    return subtitles

# Create a dataset from subtitles
def create_dataset(subtitles):
    subtitle_texts = [subtitle.content for subtitle in subtitles]
    data_dict = {"text": subtitle_texts}
    dataset = Dataset.from_dict(data_dict)
    return dataset

# Translate subtitles using batches
def translate_with_dataset(subtitles, translator, batch_size=16):
    dataset = create_dataset(subtitles)

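    # Pass batch_size explicitly; by default the pipeline processes a list one item at a time
    def translate_batch(batch):
        translations = translator(batch['text'], batch_size=len(batch['text']))
        return {"translated_text": [t['translation_text'] for t in translations]}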

    translated_dataset = dataset.map(translate_batch, batched=True, batch_size=batch_size)

    translated_texts = translated_dataset["translated_text"]
    for idx, subtitle in enumerate(subtitles):
        subtitle.content = translated_texts[idx]

    return subtitles

# Convert back to .srt format
def write_srt(subtitles, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(srt.compose(subtitles))

# Function to create a zip file
def create_zip_file(output_srt):
    zip_filename = "translated_subtitle.zip"

    with zipfile.ZipFile(zip_filename, 'w') as zipf:
        zipf.write(output_srt)

    return zip_filename

# Gradio interface function
def translate_subtitles(srt_file):
    subtitles = read_srt(srt_file.name)
    translated_subtitles = translate_with_dataset(subtitles, translator)

    # Create output file path for the translated subtitles
    output_srt = "translated_subtitle.srt"
    write_srt(translated_subtitles, output_srt)

    # Create a zip file for download
    zip_file = create_zip_file(output_srt)

    # Clean up the .srt file if needed
    os.remove(output_srt)

    # Return the path to the zip file for download
    return zip_file

# Gradio Interface setup
interface = gr.Interface(
    fn=translate_subtitles,
    inputs=gr.File(label="Upload .srt File"),
    outputs=gr.File(label="Download Translated .zip File"),
    title="Subtitle Translation (English to Persian)",
    description="Upload an .srt file in English, and the translated version will be generated using a fine-tuned model.",
    allow_flagging='never',
    show_progress='full'
)

# Launch the interface
interface.launch(share=True, debug=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://458793258851c986c2.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)



You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
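
The `create_zip_file` helper in the cell above packs the translated subtitle into an archive for download. As a hedged variation, the sketch below shows two options that the standard `zipfile` module offers for this step: `arcname`, which controls the name stored inside the archive (useful if the subtitle lives in a temporary directory), and `ZIP_DEFLATED`, which compresses the text. The function name `zip_single_file` and the sample cue are assumptions made for illustration.

```python
import os
import tempfile
import zipfile


def zip_single_file(src_path: str, zip_path: str) -> str:
    """Zip one file, storing it under its base name with DEFLATE compression."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path, arcname=os.path.basename(src_path))
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny Persian subtitle file to a temporary location
    srt_path = os.path.join(tmp, "translated_subtitle.srt")
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write("1\n00:00:01,000 --> 00:00:03,000\nسلام\n")

    zip_path = zip_single_file(srt_path, os.path.join(tmp, "translated_subtitle.zip"))

    # Inspect the archive: only the base name is stored, not the temp directory
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()

print(names)
```

Storing only the base name keeps temporary directory paths out of the archive the user downloads.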


# Sentence Translation With User Interface 🖥️ 🎛️

In [None]:
import gradio as gr
from transformers import pipeline


checkpoint_base = 'facebook/mbart-large-50-many-to-many-mmt'
checkpoint_finetuned = 'Peymansoft/MBart-50-Subtitle-English-Persian'



# Load the pre-trained and fine-tuned translation pipelines
pretrained_translator = pipeline("translation", model=checkpoint_base, src_lang="en_XX", tgt_lang="fa_IR", device=0)
finetuned_translator = pipeline("translation", model=checkpoint_finetuned, src_lang="en_XX", tgt_lang="fa_IR", device=0)

# Define the translation function
def translate(input_text):
    # Pre-trained translation
    pre_translation = pretrained_translator(input_text)[0]['translation_text']

    # Fine-tuned translation
    fine_translation = finetuned_translator(input_text)[0]['translation_text']

    return pre_translation, fine_translation

# Create the Gradio interface
interface = gr.Interface(
    fn=translate,
    inputs=gr.Textbox(lines=2, placeholder="Enter an English sentence..."),
    outputs=[
        gr.Textbox(label="Before fine-tuning 😔"),
        gr.Textbox(label="After  fine-tuning 🥳")
    ],
    title="Translation before vs after fine-tuning",
    description="Compare translation results before and after fine-tuning.",
)

# Launch the Gradio interface
interface.launch()



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://b2c0752f0aae9086ff.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


