
Some MarianMT models broken and output garbage #26216

Closed
3 of 4 tasks
fergusq opened this issue Sep 18, 2023 · 18 comments · Fixed by #30173
Comments

@fergusq

fergusq commented Sep 18, 2023

System Info

  • transformers version: 4.30.2
  • Platform: macOS-13.4.1-arm64-arm-64bit
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.3
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.1.0.dev20230413 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @younesbelkada @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Some of the MarianMT models I use like Helsinki-NLP/opus-mt-tc-big-fi-en are currently broken. This bug is reproduced in the Huggingface Hub inference widget: https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-fi-en?text=Kissa+k%C3%A4velee+kadulla You can see that it gives a garbage translation.

This model worked just last week! I also see that the model hasn't been updated in the Hub for over a month, so the bug must be in the transformers library.

Simple code to reproduce locally:

>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-fi-en")
>>> model = transformers.AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-tc-big-fi-en")
>>> model.generate(**tokenizer("kissa kävelee kadulla", return_tensors="pt"))
tensor([[57829, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542, 16542,
         16542, 16542, 16542, 16542, 16542, 16542, 16542, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074, 19074,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825, 11825,
         11825, 41756]])

Expected behavior

The model should give a proper translation.
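For illustration, degenerate output like the tensor above can be flagged programmatically rather than by eye: a healthy translation is never dominated by a single token id. The helper below is hypothetical (not part of transformers); the token counts are taken from the broken generation shown in the reproduction.

```python
from collections import Counter

def top_token_fraction(ids):
    """Fraction of positions occupied by the single most frequent token id."""
    counts = Counter(ids)
    return counts.most_common(1)[0][1] / len(ids)

# Token id runs from the broken generation above:
# 57829, then 46x 16542, 43x 19074, 421x 11825, and a final 41756.
broken = [57829] + [16542] * 46 + [19074] * 43 + [11825] * 421 + [41756]
print(top_token_fraction(broken))  # ≈ 0.82 — one token fills most of the 512 positions
```

A value this close to 1.0 is a strong sign of the repetition loop seen here, whereas a correct translation distributes ids far more evenly.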

@fergusq fergusq changed the title Some MarianMT broken and output garbage Some MarianMT models broken and output garbage Sep 18, 2023
@fergusq
Author

fergusq commented Sep 18, 2023

I can confirm that the bug exists even after reinstalling the models and updating transformers to 4.33.2.

@ArthurZucker
Collaborator

Hey! As I answered on the related issue, the Helsinki team needs to merge the PRs we opened to fix the models on the Hub!

@ArthurZucker
Collaborator

In the meantime, use model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-tatoeba-en-ja", revision="refs/pr/3") 😉

@fergusq
Author

fergusq commented Sep 21, 2023

The Finnish–English model doesn't seem to have a fix PR like the English–Japanese model does. Could you share the script used to create the fixed version?

@ArthurZucker
Collaborator

I'll upload PRs to all models, working on it!

@thfrkielikone

Hi. Is the updated script available somewhere? We are using some models that have not been published to the Hugging Face Hub and would need to convert them ourselves.

@ArthurZucker
Collaborator

I think that the conversion script works as expected, but some models have to be converted again! Really sorry I could not go through this yet. Will update here soon!

@thfrkielikone

thfrkielikone commented Oct 2, 2023

I attempted to convert a model using the master branch of the transformers library with the following steps, but the result was broken:

  • Cloned transformers and the Tatoeba Challenge repo
  • Installed transformers via pip install . and then the packages it was missing: torch, wget, sentencepiece, gitpython.
  • Ran the conversion with python src/transformers/models/marian/convert_marian_tatoeba_to_pytorch.py --models fin-eng --save_dir converted, which created converted/opus-mt-fin-eng without issue.
  • Loaded the model into a pipeline and ran it with:
from transformers import pipeline
p = pipeline("translation", model="/path/to/converted/opus-mt-fin-eng", max_new_tokens=50)
p("Mitähän tein väärin?")

The result it gave was [{'translation_text': ',,,,,,, W W W W W W W W,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'}]. It also gave the warning

spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between added_tokens, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.

Am I doing something wrong, or is there something broken in the conversion script itself? (Am I perhaps using a wrong version?)

@ArthurZucker
Collaborator

No, I think the warning can be safely ignored in that case. Most probably it's an issue in the conversion. If you can push the raw model to the Hub, it will help me in debugging!

@thfrkielikone

The model is the one downloaded by the command above. Do you want the converted result, or should I fish out whatever the command downloaded (it is mt-tc-big-fi-en, which is publicly available)?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@thfrkielikone

thfrkielikone commented Oct 27, 2023

Thank you @ArthurZucker for uploading the fixed version of the Finnish-English model.

Converting the Finnish-Swedish model, which is not yet on the Hugging Face Hub, still gives garbage results, but so does (for me) converting the Finnish-English model. However, using the revision of the Finnish-English model from the PR, like so, gives correct results:

p = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-fi-en", revision = "refs/pr/6", max_new_tokens=50)

This leads me to believe that there's something wrong (or at least different) with the steps I am taking to convert it. Is the correct way to convert the models documented somewhere? I know of https://github.com/huggingface/transformers/blob/main/scripts/tatoeba/README.md and followed it (plus also installed torch and SentencePiece, which are not mentioned there). Before this breaking change, I used these steps to successfully convert the Swedish model once.

I will contact the Helsinki people and ask why the Finnish-Swedish model hasn't been uploaded to the Hugging Face Hub (or, if it has been and I missed it, where). If it can be put there and you could take a look at it, that'd be great.

In the meantime, if you want to check if there is something surprising about the Finnish-Swedish model itself after all, it is this one: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/fin-swe#opustcv20210807bt-2021-12-08zip

@ArthurZucker
Collaborator

Hey! I think I need to update the conversion script, READMEs, etc. I might not have time next week (I will be off), but I will definitely work on fixing this once and for all!
Feel free to share whatever you can (a repo with both sets of weights, for example); that would help!

@huggingface huggingface deleted a comment from github-actions bot Nov 21, 2023
@ArthurZucker
Collaborator

I'll keep this last one open to track the conversion script update 😉

@ArthurZucker
Collaborator

No time to tackle this yet, sorry; adding the "Good Difficult Issue" tag!

@PinzhenChen

PinzhenChen commented Mar 13, 2024

We converted some Marian models here, unfortunately with the old scripts. As of now, a solution that worked for us is to downgrade transformers, e.g. to 4.28.

Adding revision="xyz" can work for specific models on HF, but not universally. I wonder if there is a plan to update the conversion script? Also, is there a plan to make the conversion script, as well as the Marian class, more "universal", so that it 1) works with any Marian-trained model (not just OPUS models from Helsinki), and 2) lets users specify the paths to the Marian vocabulary file, model file, and decoder file when converting weights?

@ArthurZucker
Collaborator

It would be very nice, but I'm super low on bandwidth for that. Maybe @zucchini-nlp could take it on as a nice challenge! Otherwise, community contributions are really welcome!

@fergusq
Author

fergusq commented Apr 23, 2024

Related to this issue, I made a conversion script that can correctly convert non-OPUS/Tatoeba Marian models to the transformers format: https://gist.github.com/fergusq/bc8e4f2d3d74027df9232f05d08f2ce4

Currently, convert_marian_to_pytorch.py is OPUS/Tatoeba-specific and cannot really be used for other models, such as the HPLT Marian models recently added to the Hub. Somehow, even these other models have the same issue with tied weights; I'm not sure why. While the issue is being corrected, the linked script can be used to convert arbitrary Marian models to the correct format, so that the weights that are tied come out correct.
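The tied-weights failure mode discussed in this thread can be illustrated with a minimal plain-PyTorch sketch (not the actual Marian conversion code; the names are illustrative). When a seq2seq model ties its output projection to the embedding matrix, both modules must share one tensor; a conversion that merely copies values produces two equal but separate tensors, and any stale or mismatched checkpoint entry then desynchronizes them:

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 16
embedding = nn.Embedding(vocab_size, hidden)

# Correct tying: the LM head shares storage with the embedding matrix,
# so there is exactly one weight tensor.
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embedding.weight
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()

# Broken "tying": copying values gives equal numbers but separate storage.
# The model may still produce identical logits right after conversion, yet
# the checkpoint now carries two independent tensors that can drift apart.
broken_head = nn.Linear(hidden, vocab_size, bias=False)
broken_head.weight = nn.Parameter(embedding.weight.detach().clone())
assert broken_head.weight.data_ptr() != embedding.weight.data_ptr()
assert torch.equal(broken_head.weight, embedding.weight)  # equal values, distinct tensors
```

This is one plausible way a conversion script can silently break a model: the saved weights look plausible tensor-by-tensor, but the sharing relationship is lost.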
