
Generation issues with seq2seq LMs #23413

Closed · abarbet opened this issue May 16, 2023 · 7 comments
@abarbet

abarbet commented May 16, 2023

System Info

  • transformers version: 4.27.1
  • Platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
  • Python version: 3.9.12
  • Huggingface_hub version: 0.13.2
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes, parallel (accelerate auto-mapping)

Who can help?

@ArthurZucker @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This has most recently arisen while using trlX to do reinforcement learning on flan-T5. I opened an issue on their repo, but there has been no response, and the problem is better suited to this repo since it stems from transformers code at its core.

The main issue is that calling generate with a seq2seq model, namely flan-t5, sometimes raises the following error: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0. This has been well documented in other issues like this one, but the setup in that issue is more custom than calling generate in its standard configuration.

Here is a code example to reproduce:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

m = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", device_map="auto")
t = AutoTokenizer.from_pretrained("google/flan-t5-large")

in_text = """You are a highly intelligent and accurate HVAC domain Resource Description Framework (RDF) data model. You take Passage as input and convert it into HVAC domain RDF triples. A triple is a set of three entities that codifies a statement about semantic data in the form of subject–predicate–object expressions.
Your output format is only [[ subject, predicate, object ], ...] nothing else

Examples: 
Input: The HV123 heating unit can supply 50W of power
Output: [[HV123, powerSupply, 50W]]

Input: Unit: ft. (m)
Model | Cooling Mode | Heating Mode
ABC123 | 28.8 (8.8) | 19.0 (5.8)
ABC456 | 28.8 (8.8) | 19.0 (5.8)
ABC789 | 28.8 (8.8) | 21.3 (6.5)
ABC987 | 29.0 (8.9) | 22.9 (7.0)
Output:"""

ins = t(in_text, return_tensors="pt").input_ids.to("cuda")
# Sampling combined with beam search; this call intermittently raises
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
outs = m.generate(ins, do_sample=True, max_length=512, top_k=0, temperature=0.7, num_beams=2)

NB: temperature seems to be one of the main causes of this issue, as removing that kwarg from the generate call makes the error disappear in the above case. However, that is not true of all cases: I have also seen the error in my trlX training loops with kwargs as simple as {"max_new_tokens": 512, "do_sample": True, "top_k": 0, "top_p": 1}, so the error is not always related to temperature.
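For reference, a minimal sketch of how such kwargs would be passed to generate (illustrative only; m and ins reuse the objects from the snippet above, standing in for the trlX-managed model and inputs):

# Plain sampling, no beam search -- the error still appeared intermittently
# inside trlX training loops with kwargs like these.
gen_kwargs = {"max_new_tokens": 512, "do_sample": True, "top_k": 0, "top_p": 1}
outs = m.generate(ins, **gen_kwargs)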

Expected behavior

The expected behavior in this case would be for the sampling to work every time instead of having strange edge cases where tokens are unreachable.

@gante
Member

gante commented May 16, 2023

Hey @abarbet 👋

This issue may arise when beam search, sampling, and long outputs are used together. A potential bug in PyTorch itself compounds it. You can read the full story in this issue.

TL;DR -- my immediate suggestion would be to avoid using num_beams and do_sample together. If you want to use them both, you'll have to read the issue linked above, which describes the problem and solutions :)
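As a minimal sketch of that suggestion (reusing m and ins from the reproduction snippet above; the exact kwargs are illustrative):

# Option 1: sampling only, no beam search
outs = m.generate(ins, do_sample=True, max_length=512, top_k=0, temperature=0.7)

# Option 2: beam search only, no sampling
outs = m.generate(ins, num_beams=2, max_length=512)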

@abarbet
Author

abarbet commented May 16, 2023

Ah, thank you, that issue is very helpful! Do you have any idea why we would see a similar error in trlX training despite not using beam sampling? I know you don't have access to my training script and are most likely not familiar with their codebase, so this is a complete long shot.

The only thing I can think of, if it's not caused by a sampling bug, is some kind of destructive learning in the PPO step that throws the token distributions completely out of whack.

@gante
Member

gante commented May 17, 2023

@abarbet It may be due to this PyTorch issue, where the sampling step may pick very low probability tokens that it shouldn't and, in turn, cause computations to derail.

Try running your script with PT 1.x instead of 2.0!
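For context, a minimal sketch of what that failure mode can look like in isolation (illustrative only, not from the thread; it assumes a CUDA device and a torch build affected by the linked bug, and the vocabulary size is just a placeholder):

import torch

vocab_size = 32128  # placeholder vocabulary size
probs = torch.zeros(vocab_size, device="cuda")
probs[:10] = 0.1  # only the first 10 tokens should ever be drawn

# On an affected build, torch.multinomial can occasionally return an index
# whose probability is exactly zero, which then derails generation downstream.
for step in range(10_000):
    idx = torch.multinomial(probs, num_samples=1)
    if probs[idx] == 0:
        print(f"step {step}: sampled a zero-probability token {idx.item()}")
        break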

@Daryl149

> @abarbet It may be due to this PyTorch issue, where the sampling step may pick very low probability tokens that it shouldn't and, in turn, cause computations to derail.
>
> Try running your script with PT 1.x instead of 2.0!

For me, this issue also occurs with PyTorch 1.13.1: #22914 (comment)
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@yungsinatra0

yungsinatra0 commented Aug 23, 2023

Hello, has a fix been found for this issue? I'm using the latest version of transformers and can confirm that running inference with model.generate() and parameters such as temperature and do_sample causes this issue.

  summary_ids = model.generate(
      inputs["input_ids"],
      max_length=max_length,
      min_length=128,
      temperature=0.1,
      do_sample=True,
      # top_p=0.3
      )

edit: I can now confirm that do_sample combined with temperature is the cause of the issue, as top_p works fine for me.
edit2: I forgot to mention that the model I'm using is BRIO, loading pre-trained weights from HF.
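A sketch of the configuration that reportedly worked (same variable names as the snippet above; the parameter values are just the ones mentioned):

# Keeping do_sample with top_p and dropping temperature avoided the error here.
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=max_length,
    min_length=128,
    do_sample=True,
    top_p=0.3,
)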

@gante
Member

gante commented Aug 23, 2023

@yungsinatra0 The issue should only be gone with the next PT release (i.e. torch>2.0)
