
NLLB trunc translation #24169

Closed
2 of 4 tasks
FranPuentes opened this issue Jun 11, 2023 · 2 comments

Comments

FranPuentes commented Jun 11, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.13.2
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker
@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

#!/bin/python3

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# NLLB_MODEL = "facebook/nllb-200-3.3B"
# NLLB_MODEL = "facebook/nllb-200-distilled-600M"
NLLB_MODEL = "facebook/nllb-200-distilled-1.3B"

tokenizer = AutoTokenizer.from_pretrained(NLLB_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_MODEL)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)


def translate(lang: str, text: str):
    # Tokenize the source text and force the decoder to start with the
    # target-language token.
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[lang],
        max_new_tokens=4096,
    )
    texts = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    return texts[0]


if __name__ == "__main__":
    text = """La Voz de Galicia es el cuarto periódico generalista de España.
            Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020.
            En el territorio gallego es la cabecera hegemónica.
            Su edición digital es la primera web informativa de la comunidad."""
    text = translate("eng_Latn", text)
    print(text)

Its output is "La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios of 2020.", which is only the first two sentences. The output is the same when the carriage returns are removed.
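One way to narrow this down is to check the tokenized input length and decode the generated ids without skipping special tokens, to see whether the tokenizer truncates the input or the model itself emits the end-of-sequence token early. A minimal diagnostic sketch, reusing the tokenizer, model, and device objects from the script above:

# Diagnostic sketch (assumes tokenizer/model/device/text from the script above).
inputs = tokenizer(text, return_tensors="pt").to(device)
print(inputs["input_ids"].shape)  # full input length; the tokenizer applies no truncation by default

tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"],
    max_new_tokens=4096,
)
# Decoding with skip_special_tokens=False shows where the model places </s>.
print(tokenizer.batch_decode(tokens, skip_special_tokens=False)[0])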

Expected behavior

All of the text should be translated, not only the first two sentences; for example:

"La Voz de Galicia is the fourth largest generalist newspaper in Spain. It has a readership of 492,000 readers throughout the country, according to data from the first wave of the General Media Study 2020. In Galicia, it is the leading newspaper.
Its digital edition is the leading news website in the region.
"


younesbelkada commented Jun 12, 2023

Hi @FranPuentes
I played with the model a bit and its behavior is quite interesting. If you manage to fit everything into a single sentence (without "." separators), the model successfully translates the entire text, but otherwise it seems to stop generating (in your case) after the second sentence.
I also advise you to run generation in lower precision, such as 4-bit, so that you can use the largest model (if you run the script on a GPU device).

Below is the script I played with (the 4-bit model, after installing bitsandbytes):

# pip install bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

NLLB_MODEL = "facebook/nllb-200-3.3B"
# NLLB_MODEL = "facebook/nllb-200-distilled-600M"
# NLLB_MODEL = "facebook/nllb-200-distilled-1.3B"

tokenizer = AutoTokenizer.from_pretrained(NLLB_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(
    NLLB_MODEL, torch_dtype=torch.float16, load_in_4bit=True
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# No model.to(device) here: load_in_4bit already places the model on the GPU.


def translate(lang: str, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[lang],
        max_new_tokens=4096,
    )
    texts = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    return texts[0]


if __name__ == "__main__":
    # Four sentences separated by periods: the translation stops after the
    # second sentence.
    text = """La Voz de Galicia es el cuarto periódico generalista de España. Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020. En el territorio gallego es la cabecera hegemónica. Su edición digital es la primera web informativa de la comunidad."""
    print(translate("eng_Latn", text))
    # >>> La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios de 2020.

    # The same text with the last two periods replaced by commas: the whole
    # text is translated.
    text = """La Voz de Galicia es el cuarto periódico generalista de España. Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020, en el territorio gallego es la cabecera hegemónica, su edición digital es la primera web informativa de la comunidad."""
    print(translate("eng_Latn", text))
    # >>> La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios de 2020, in the Galician territory it is the hegemonic headquarters, its digital edition is the first informative web of the community.

I am really not sure about this behaviour; maybe it is related to the way the NLLB models were trained.
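In the meantime, a possible workaround (a sketch only, not a fix for the underlying behaviour) is to split the input on sentence boundaries and translate each sentence separately, so an early end-of-sequence on one sentence cannot drop the rest of the paragraph. The naive split on ". " is just for illustration; a proper sentence splitter would be more robust. This reuses the tokenizer, model, and device from the script above:

def translate_by_sentence(lang: str, text: str) -> str:
    # Naive sentence split for illustration; use a real splitter in practice.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    outputs = []
    for sentence in sentences:
        if not sentence.endswith("."):
            sentence += "."
        inputs = tokenizer(sentence, return_tensors="pt").to(device)
        tokens = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[lang],
            max_new_tokens=512,
        )
        outputs.append(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
    return " ".join(outputs)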

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
