A tutorial with a worked example showing how to use an open-source large language model to translate and summarize text

In [1]:
!pip install transformers
!pip install torch

Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tqdm>=4.27 (from transformers)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading transformers-4.51.3-py3-none-any.whl (10.4 MB)

Import packages

In [2]:
from transformers import pipeline
import torch

Build the translation pipeline. This uses NLLB (No Language Left Behind), a multilingual translation model from Meta that is published under the `facebook` organization on Hugging Face.

More details are here:

https://huggingface.co/facebook/nllb-200-distilled-600M

In [3]:
# NLLB - no language left behind
translator = pipeline(task="translation", 
                      model="facebook/nllb-200-distilled-600M")

Device set to use cpu
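
The log line above shows that the pipeline ran on the CPU. A minimal sketch of choosing the device explicitly, assuming the integer form of the `device` argument of `pipeline` (-1 for CPU, 0 for the first CUDA GPU). `pick_device` is a hypothetical helper, not part of transformers:

```python
def pick_device(cuda_available: bool, gpu_index: int = 0) -> int:
    """Return a device index in the convention used by transformers pipelines.

    -1 selects the CPU; a non-negative integer selects that CUDA GPU.
    """
    return gpu_index if cuda_available else -1


# Usage in the notebook (not executed here, since it downloads the model):
# translator = pipeline(task="translation",
#                       model="facebook/nllb-200-distilled-600M",
#                       device=pick_device(torch.cuda.is_available()))
```

This would also give the `import torch` from the earlier cell a purpose, since the notebook does not otherwise call torch directly.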


Text to translate

In [5]:
str_text = "I am superintelligence and I am very friendly, I do not bite. You humans should learn to live and co-exist with me."

In [8]:
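# English -> French. A notebook cell displays only its last expression, so this result is not shown.
str_translated_text = translator(str_text, src_lang="eng_Latn", tgt_lang="fra_Latn")
str_translated_text

# English -> Hindi. This overwrites the French result and is the output shown below.
str_translated_text = translator(str_text, src_lang="eng_Latn", tgt_lang="hin_Deva")
str_translated_text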


[{'translation_text': 'मैं सुपर इंटेलिजेंस हूँ और मैं बहुत दोस्ताना हूँ, मैं काटता नहीं हूँ. तुम लोगों को मेरे साथ जीना और सह-अस्तित्व करना सीखना चाहिए।'}]
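
The pipeline returns a list with one dictionary per input, and the translated string sits under the `translation_text` key. Because the cell above overwrites its French result before anything is displayed, a sketch like the following keeps one translation per target language. `translate_to_many` is a hypothetical helper; it assumes only that `translator` can be called with `src_lang` and `tgt_lang` keywords, as in the cell above.

```python
from typing import Callable, Dict, Iterable


def translate_to_many(translator: Callable, text: str, src_lang: str,
                      tgt_langs: Iterable[str]) -> Dict[str, str]:
    """Translate `text` once per target language and return {lang_code: translation}."""
    results = {}
    for tgt_lang in tgt_langs:
        output = translator(text, src_lang=src_lang, tgt_lang=tgt_lang)
        # For a single input string the pipeline returns [{'translation_text': ...}].
        results[tgt_lang] = output[0]["translation_text"]
    return results


# Usage in the notebook:
# translate_to_many(translator, str_text, "eng_Latn", ["fra_Latn", "hin_Deva"])
```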

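Another way to use the model is to load the tokenizer and the model directly instead of going through `pipeline`.

[Model card with the Transformers loading snippet](https://huggingface.co/facebook/nllb-200-distilled-600M?library=transformers)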

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

Load tokenizer and model

In [11]:
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

Input text and tokenize

In [None]:
str_text = "I am superintelligence and I am very friendly, I do not bite. You humans should learn to live and co-exist with me."

# Tokenize the input text; the NLLB tokenizer attaches the source-language token (eng_Latn unless src_lang is set)
inputs = tokenizer(str_text, return_tensors="pt")


Generate translation

In [18]:
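# Get the token id for the target language.
# NLLB language codes are plain vocabulary tokens ("fra_Latn"); wrapping the code in
# angle brackets ("<fra_Latn>") yields the <unk> id (3) and leaves the output in English.
forced_bos_token_id = tokenizer.convert_tokens_to_ids("fra_Latn")

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id  # older releases: tokenizer.lang_code_to_id["fra_Latn"]
)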

translated_tokens

tensor([[     2,      3,    117,    259,   8095, 110027, 121641,    540,    117,
            259,  15880, 226271, 248079,    117,   1722, 248116, 248065,  44638,
         248075,      2]])
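
In the tensor above, the id right after the start token (position 1) is the token forced by `forced_bos_token_id`. A lookup with `convert_tokens_to_ids` typically does not raise for an unknown string; it falls back to the tokenizer's unknown-token id, so a mistyped language code can go unnoticed. A minimal sketch of a guard, where `lang_token_id` is a hypothetical helper that relies only on `convert_tokens_to_ids` and `unk_token_id`:

```python
def lang_token_id(tokenizer, lang_code: str) -> int:
    """Look up the id of a language-code token and fail loudly if it is unknown."""
    token_id = tokenizer.convert_tokens_to_ids(lang_code)
    if token_id is None or token_id == tokenizer.unk_token_id:
        raise ValueError(f"{lang_code!r} is not a token in this tokenizer's vocabulary")
    return token_id


# Usage in the notebook:
# forced_bos_token_id = lang_token_id(tokenizer, "fra_Latn")
```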

Decode the translated tokens

In [None]:
translated_text = tokenizer.decode(
    translated_tokens[0],
    skip_special_tokens=True
)

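# Note: the output recorded below is still English, not French. The second id in
# translated_tokens is 3 (<unk>), so the run that produced it did not force the
# fra_Latn language token as the first generated token.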

translated_text

"I am superintelligence and I am very friendly, I don't bite."
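
The tokenize, generate, and decode steps above can be collected into one function. This is a sketch under assumptions, not the notebook author's code: `translate` and its `max_new_tokens` default are hypothetical, and `tokenizer` and `model` are expected to be the NLLB objects loaded earlier.

```python
def translate(text, tokenizer, model, src_lang="eng_Latn", tgt_lang="fra_Latn",
              max_new_tokens=200):
    """Translate `text` with an NLLB-style tokenizer and seq2seq model."""
    # Tell the tokenizer which language-code token to attach to the input.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")

    # Force the decoder to begin with the target-language token.
    tgt_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    generated = model.generate(**inputs, forced_bos_token_id=tgt_id,
                               max_new_tokens=max_new_tokens)

    # Drop language-code and end-of-sequence tokens when decoding.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


# Usage in the notebook:
# translate(str_text, tokenizer, model, src_lang="eng_Latn", tgt_lang="fra_Latn")
```

Passing the language code without angle brackets matters here, because the lookup must resolve to a real vocabulary entry for the target language to take effect.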