# 🌍 Multilingual Translation with mBART50
This notebook demonstrates how to translate between English and three Indian languages (Tamil, Hindi, Malayalam) using Facebook's mBART50 many-to-many model (`facebook/mbart-large-50-many-to-many-mmt`).

In [1]:
# 📦 Install Required Libraries
!pip install transformers sentencepiece torch
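
After the install cell has run, a quick check that the three packages are importable can save a confusing failure later. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of any of the installed libraries.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The same three packages that the pip command above installs.
required = ['transformers', 'sentencepiece', 'torch']
print('Missing packages:', missing_packages(required) or 'none')
```

If the list is not empty, re-running the install cell (and restarting the runtime if needed) is the usual fix.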


In [2]:
# 🔄 Load mBART50 Model and Tokenizer
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = 'facebook/mbart-large-50-many-to-many-mmt'
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)


In [3]:
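# 🌐 Define Language Map and Translation Function
# Human-readable language names mapped to mBART50 language codes.
lang_map = {
    'tamil': 'ta_IN',
    'hindi': 'hi_IN',
    'malayalam': 'ml_IN',
    'english': 'en_XX'
}

def translate(text, source_lang='english', target_lang='tamil'):
    """Translate `text` from `source_lang` to `target_lang` (both keys of `lang_map`)."""
    src_lang_code = lang_map[source_lang.lower()]
    tgt_lang_code = lang_map[target_lang.lower()]

    # The tokenizer prepends the source language code to the encoded input.
    tokenizer.src_lang = src_lang_code
    encoded = tokenizer(text, return_tensors='pt')
    # Forcing the first generated token to the target language code
    # tells the many-to-many model which language to produce.
    generated_tokens = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang_code]
    )
    return tokenizer.decode(generated_tokens[0], skip_special_tokens=True)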
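
As written, `translate` raises a bare `KeyError` when it is given a language name that is not in `lang_map`. One possible way to make that failure easier to read is a small lookup helper, sketched below. The name `resolve_lang_code` is hypothetical, and the mapping is repeated here only so the sketch runs on its own.

```python
# Same mapping as in the cell above, repeated so this sketch is self-contained.
lang_map = {
    'tamil': 'ta_IN',
    'hindi': 'hi_IN',
    'malayalam': 'ml_IN',
    'english': 'en_XX',
}

def resolve_lang_code(name, mapping=lang_map):
    """Return the mBART50 code for a language name, or raise a readable error."""
    key = name.strip().lower()
    if key not in mapping:
        supported = ', '.join(sorted(mapping))
        raise ValueError(f'Unsupported language {name!r}; choose one of: {supported}')
    return mapping[key]

print(resolve_lang_code(' Tamil '))  # prints: ta_IN
```

Inside `translate`, the two `lang_map[...]` lookups could be swapped for calls to this helper. Whether that is worth the extra code depends on how often the function is called with user-supplied names.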

In [17]:
# 🔤 Try Translations
source_text = 'I have headache'
print('English ➝ Tamil:', translate(source_text, 'english', 'tamil'))
print('English ➝ Hindi:', translate(source_text, 'english', 'hindi'))
print('English ➝ Malayalam:', translate(source_text, 'english', 'malayalam'))

English ➝ Tamil: எனக்கு தலை வலி இருக்கிறது.
English ➝ Hindi: मुझे सिर दर्द है
English ➝ Malayalam: എനിക്ക് തലവേദനയുണ്ട്.
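
An informal way to sanity-check outputs like these is a round trip: translate into the target language, translate the result back, and compare it with the original by eye. The sketch below shows only the plumbing. `round_trip` and `stub_translate` are hypothetical names, and the stub merely tags the text with the requested direction so the example runs without downloading the model. In the notebook, the `translate` function defined earlier would be passed instead.

```python
def round_trip(text, translate_fn, source='english', pivot='tamil'):
    """Translate source -> pivot -> source and return both intermediate results."""
    forward = translate_fn(text, source, pivot)
    back = translate_fn(forward, pivot, source)
    return forward, back

def stub_translate(text, source_lang, target_lang):
    """Placeholder that performs no translation; it only records the direction."""
    return f'<{source_lang}->{target_lang}> {text}'

forward, back = round_trip('I have headache', stub_translate)
print('Forward:', forward)
print('Back   :', back)
```

With the real model, calling `round_trip(source_text, translate, 'english', 'hindi')` would run two generations, so it takes roughly twice as long as a single translation. A back-translation that matches the input is only a rough signal, not proof that the forward translation is correct.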
