
NLLB trunc translation #24169

Closed
2 of 4 tasks
FranPuentes opened this issue Jun 11, 2023 · 2 comments

Comments

FranPuentes commented Jun 11, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.13.2
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker
@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

#!/bin/python3

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# NLLB_MODEL = "facebook/nllb-200-3.3B"
# NLLB_MODEL = "facebook/nllb-200-distilled-600M"
NLLB_MODEL = "facebook/nllb-200-distilled-1.3B"

tokenizer = AutoTokenizer.from_pretrained(NLLB_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_MODEL)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)


def translate(lang: str, text: str):
    # Tokenize the source text and force the decoder to start with the
    # target-language token.
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[lang],
        max_new_tokens=4096,
    )
    texts = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    return texts[0]


if __name__ == "__main__":
    text = """La Voz de Galicia es el cuarto periódico generalista de España.
            Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020.
            En el territorio gallego es la cabecera hegemónica.
            Su edición digital es la primera web informativa de la comunidad."""
    text = translate("eng_Latn", text)
    print(text)

Its output is "La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios of 2020.", which is only the first two sentences. The output is the same when the carriage returns are removed.
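One way to narrow this down is to check the tokenized input length and decode the generated ids without skipping special tokens, to see whether the tokenizer truncates the input or the model itself emits the end-of-sequence token early. A minimal diagnostic sketch, reusing the tokenizer, model, and device objects from the script above:

# Diagnostic sketch (assumes tokenizer/model/device/text from the script above).
inputs = tokenizer(text, return_tensors="pt").to(device)
print(inputs["input_ids"].shape)  # full input length; the tokenizer applies no truncation by default

tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"],
    max_new_tokens=4096,
)
# Decoding with skip_special_tokens=False shows where the model places </s>.
print(tokenizer.batch_decode(tokens, skip_special_tokens=False)[0])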

Expected behavior

All of the text should be translated, not only the first two sentences; for example:

"La Voz de Galicia is the fourth largest generalist newspaper in Spain. It has a readership of 492,000 readers throughout the country, according to data from the first wave of the General Media Study 2020. In Galicia, it is the leading newspaper.
Its digital edition is the leading news website in the region.
"


younesbelkada commented Jun 12, 2023

Hi @FranPuentes
I played with the model a bit and its behavior is quite interesting. If you manage to fit everything into a single sentence (without "." separators), the model successfully translates the entire text, but otherwise it seems to stop generating (in your case) after the second sentence.
I also advise you to run generation in lower precision, such as 4-bit, so that you can use the largest model (if you run the script on a GPU device).

Below is the script I played with (the 4-bit model, after installing bitsandbytes):

# pip install bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

NLLB_MODEL = "facebook/nllb-200-3.3B"
# NLLB_MODEL = "facebook/nllb-200-distilled-600M"
# NLLB_MODEL = "facebook/nllb-200-distilled-1.3B"

tokenizer = AutoTokenizer.from_pretrained(NLLB_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(
    NLLB_MODEL, torch_dtype=torch.float16, load_in_4bit=True
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# No model.to(device) here: load_in_4bit already places the model on the GPU.


def translate(lang: str, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[lang],
        max_new_tokens=4096,
    )
    texts = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    return texts[0]


if __name__ == "__main__":
    # Four sentences separated by periods: the translation stops after the
    # second sentence.
    text = """La Voz de Galicia es el cuarto periódico generalista de España. Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020. En el territorio gallego es la cabecera hegemónica. Su edición digital es la primera web informativa de la comunidad."""
    print(translate("eng_Latn", text))
    # >>> La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios de 2020.

    # The same text with the last two periods replaced by commas: the whole
    # text is translated.
    text = """La Voz de Galicia es el cuarto periódico generalista de España. Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020, en el territorio gallego es la cabecera hegemónica, su edición digital es la primera web informativa de la comunidad."""
    print(translate("eng_Latn", text))
    # >>> La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios de 2020, in the Galician territory it is the hegemonic headquarters, its digital edition is the first informative web of the community.

I am really not sure about this behaviour; maybe it is related to the way the NLLB models were trained.
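In the meantime, a possible workaround (a sketch only, not a fix for the underlying behaviour) is to split the input on sentence boundaries and translate each sentence separately, so an early end-of-sequence on one sentence cannot drop the rest of the paragraph. The naive split on ". " is just for illustration; a proper sentence splitter would be more robust. This reuses the tokenizer, model, and device from the script above:

def translate_by_sentence(lang: str, text: str) -> str:
    # Naive sentence split for illustration; use a real splitter in practice.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    outputs = []
    for sentence in sentences:
        if not sentence.endswith("."):
            sentence += "."
        inputs = tokenizer(sentence, return_tensors="pt").to(device)
        tokens = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[lang],
            max_new_tokens=512,
        )
        outputs.append(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
    return " ".join(outputs)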

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
