
AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id' #31348

Open
1 of 4 tasks
rajanish4 opened this issue Jun 10, 2024 · 9 comments

Comments

@rajanish4

rajanish4 commented Jun 10, 2024

System Info

  • transformers version: 4.42.0.dev0
  • Platform: Windows-10-10.0.20348-SP0
  • Python version: 3.9.7
  • Huggingface_hub version: 0.23.3
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.13.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX A6000

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn", token=token)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=token)

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Expected behavior

It should output translated text: UN-Chef sagt, es gibt keine militärische Lösung in Syrien

Complete error:

translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id'

@ArthurZucker
Collaborator

Yes, we had a deprecation cycle and this attribute was removed 😉

@rajanish4
Author

Thanks, but then how can I provide the language code for translation?

@ArthurZucker
Collaborator

You should simply do tokenizer.encode("deu_Latn")[0]

@tokenizer-decode
Contributor

Then why does the doc say otherwise? This is v4.42.0.
I also don't understand how to use tokenizer.encode("deu_Latn")[0]. What's the keyword? Is this a positional argument? @ArthurZucker

@fe1ixxu

fe1ixxu commented Jul 2, 2024

It seems there is an error: whatever language code I give to the NLLB tokenizer, it always outputs the English token id. My version is v4.42.3 @ArthurZucker :

[screenshot: tokenizer.encode output]

@ShayekhBinIslam

ShayekhBinIslam commented Jul 2, 2024

I think tokenizer.encode("deu_Latn")[0] is the regular BOS token, and tokenizer.encode("deu_Latn")[1] is the expected token. @ArthurZucker
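[Editor's note] The observations above can be reproduced without downloading the model. The fast NLLB tokenizer wraps every input as `[src_lang_code, ...piece_ids..., </s>]`, so `encode("deu_Latn")[0]` returns the *source*-language id (eng_Latn by default), which matches the "always English" report. The toy vocabulary below is invented for illustration; the ids are not the real NLLB ids:

```python
# Toy stand-in for the NLLB vocabulary (invented ids, NOT the real ones).
MOCK_VOCAB = {"eng_Latn": 1001, "deu_Latn": 1002, "ron_Latn": 1003, "</s>": 2}

def mock_encode(text, src_lang="eng_Latn"):
    """Mimics NllbTokenizerFast.encode(): prepend src_lang code, append </s>."""
    pieces = [MOCK_VOCAB[p] for p in text.split() if p in MOCK_VOCAB]
    return [MOCK_VOCAB[src_lang]] + pieces + [MOCK_VOCAB["</s>"]]

def mock_convert_tokens_to_ids(token):
    """Mimics convert_tokens_to_ids(): a plain vocabulary lookup, no specials."""
    return MOCK_VOCAB[token]

print(mock_encode("deu_Latn"))                 # [1001, 1002, 2] -> index 0 is eng_Latn
print(mock_convert_tokens_to_ids("deu_Latn"))  # 1002, the id actually wanted
```

This is why indexing `[1]` "works" here while `[0]` silently returns the source-language code, and why a direct lookup is the safer fix.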

@ArthurZucker
Collaborator

Yes! You should use convert_tokens_to_ids rather than encode, sorry 😉

@tnitn

tnitn commented Jul 12, 2024

What worked for me is
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30)

@ArthurZucker
Collaborator

Yep, this is what we expect!

blackmesataiwan added a commit to blackmesataiwan/OneRingTranslator that referenced this issue Jul 19, 2024
Labels: none yet
Projects: none yet
Development: no branches or pull requests
6 participants