Inconsistent behavior in generate when output_scores=True #17424

Closed
2 of 4 tasks
shijie-wu opened this issue May 25, 2022 · 6 comments · Fixed by #17442
@shijie-wu
Contributor

shijie-wu commented May 25, 2022

System Info

main branch

Who can help?

@patrickvonplaten, @Narsil, @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In generate with output_scores=True, the behavior is inconsistent. In greedy_search mode, the stored scores are the raw logits:

next_token_logits = outputs.logits[:, -1, :]

# Store scores, attentions and hidden_states when required
if return_dict_in_generate:
    if output_scores:
        scores += (next_token_logits,)

but in sample mode (and the various beam search modes), the stored scores are the processed logits:

next_token_logits = outputs.logits[:, -1, :]

# pre-process distribution
next_token_scores = logits_processor(input_ids, next_token_logits)
next_token_scores = logits_warper(input_ids, next_token_scores)

# Store scores, attentions and hidden_states when required
if return_dict_in_generate:
    if output_scores:
        scores += (next_token_scores,)
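
A minimal sketch (not part of the original report) that makes the inconsistency visible; the model name and prompt are arbitrary choices, and the printed values assume the behavior described above, i.e. that the sampling path stores the warped logits (so a top_k warper leaves -inf entries) while the greedy path stores the raw logits:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, my dog is", return_tensors="pt")

# Greedy search: the stored scores are the raw logits.
greedy = model.generate(
    **inputs, max_new_tokens=3, do_sample=False,
    return_dict_in_generate=True, output_scores=True,
)

# Sampling with top_k: the stored scores are the processed logits,
# so most entries have been set to -inf by the top_k warper.
sampled = model.generate(
    **inputs, max_new_tokens=3, do_sample=True, top_k=5,
    return_dict_in_generate=True, output_scores=True,
)

print(torch.isinf(greedy.scores[0]).any())   # False: raw logits are finite
print(torch.isinf(sampled.scores[0]).any())  # True: top_k masked most entries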

Expected behavior

In generate with output_scores=True, the returned scores should be consistent: either the raw logits or the processed logits. For my use case I only need the raw logits, but there might be use cases that require the processed logits. So there are multiple options:

  1. Return raw logits when output_scores=True
  2. Return processed logits when output_scores=True
  3. Return processed logits when output_scores=True, and raw logits when output_raw_scores=True
@shijie-wu shijie-wu added the bug label May 25, 2022
@patrickvonplaten
Contributor

Great find @shijie-wu,

We've settled on outputting the processed scores, since those are the ones that determine the next token, e.g. the argmax and the sampling are taken over those scores. Given the name of the flag (output_scores), I think this makes the most sense.
Will open a PR to fix greedy_search here.

It's a good point that people might need the "raw scores" though. I think it's sensible to output the model's logits in this case, as this would be the most understandable & consistent across generation methods. E.g. every LM outputs logits, which are the "rawest" scores, so I'd be fine with adding an output_logits=True/False flag for this.
What do you think @patil-suraj @gante @shijie-wu ?

@gante
Member

gante commented May 26, 2022

@patrickvonplaten regarding a flag for the logits: on paper, yes... but we are starting to accumulate many boolean flags to control the output of internal variables (related issue: #17016, which requests the output of past key values). I wonder whether there is a better way to collect and expose the internals of generate for advanced uses 🤔

@ydshieh
Collaborator

ydshieh commented May 26, 2022

For testing purposes, especially the PT/TF generation equivalence tests, I think it would be better to be able to return the raw scores from the models --> so we can identify which part goes wrong if any test failure occurs.
(But I understand that we have a lot of flags in generate already.)

@shijie-wu
Contributor Author

Having an output_logits=True/False flag for the raw logits sounds good. Regarding the concern about too many flags in generate, we could have something like output_flags: Set[ModelInternal] = set(["logits", "scores"])?

@Narsil
Contributor

Narsil commented May 27, 2022

Just a general comment that may seem obvious to some, but I feel it's always good to restate the common options when dealing with issues like this (the tension between rampant option growth and enabling users to do powerful things).
I don't mean to say that any particular idea should be applied; these are just my go-to options in such situations, and they might give you some insight into how to deal with this!

#Idea number 1:

  • If you have too many arguments, usually some combinations make no sense. For instance, here output_logits=True with output_scores=False makes no sense: you're not outputting scores, so why would the output_logits value be of any interest? Having invalid, bogus combinations is a strong sign that two arguments should be fused into one that's an enum, for instance output_scores: ["none", "logits", "scores"] (keeping False, None, True for backward compatibility). Now there's no way to express the previous bogus combination (see the sketch below).
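
A minimal sketch (my illustration, not Narsil's code) of the fused-argument pattern, assuming a hypothetical resolve_output_scores helper; the parameter name and string values mirror the enum suggested above, and the legacy booleans are mapped onto it for backward compatibility:

from typing import Optional, Union

def resolve_output_scores(output_scores: Optional[Union[bool, str]]) -> str:
    # Hypothetical helper: collapse the legacy booleans and the new string
    # values into a single internal representation.
    if output_scores in (None, False):
        return "none"
    if output_scores is True:
        return "scores"  # what True has always meant
    if output_scores in ("none", "logits", "scores"):
        return output_scores
    raise ValueError(f"Unsupported value for output_scores: {output_scores!r}")

With a single fused argument, the bogus combination output_logits=True + output_scores=False simply cannot be expressed.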

#Idea number 2:

  • Grouping arguments is a good option too, since users usually touch some arguments much more than others. Some users are really interested in looking at the scores, while others care much more about the generation options like top_k or decoder_input_ids. Having some form of groups makes things easier: generate(input_ids, logits_options, model_options, return_options). It's super important that the groups are extremely clear (so users don't have to wonder where option X lives). An even better option for power users is to directly expose objects like LogitsProcessor or StoppingCriteria, which enables full freedom (a grouping sketch follows below).
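
A minimal sketch (my illustration, not an existing transformers API) of the argument-grouping idea, using a hypothetical ReturnOptions dataclass for the "what to return" flags discussed in this thread:

from dataclasses import dataclass

@dataclass
class ReturnOptions:
    # Hypothetical group of return-related flags.
    return_dict_in_generate: bool = False
    output_scores: bool = False
    output_attentions: bool = False
    output_hidden_states: bool = False

# Hypothetical call shape:
# generate(input_ids, return_options=ReturnOptions(output_scores=True), ...)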

#Idea number 3:

In general, for power users wanting to access internals, I think enabling tons of options to flag what needs to be outputted is just asking for general-purpose computation expressed as parameters. Exposing the internals seems like a better option.
For instance, one could add a LogitsProcessor to see the raw model scores (and at each step, at that!) and manually save them. It is a bit of work, but then users are empowered to save exactly what they want without relying on our code to enable their option (see the sketch below).
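
A minimal sketch of that idea, assuming a pass-through processor is enough: a hypothetical ScoreRecorder that stores whatever scores it is handed at each generation step and returns them unchanged.

import torch
from transformers import LogitsProcessor

class ScoreRecorder(LogitsProcessor):
    # Hypothetical processor: record the scores seen at each step, change nothing.
    def __init__(self):
        self.recorded = []

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        self.recorded.append(scores.detach().clone())
        return scores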

#Idea number 4:

It's OK to say no, more is not always better.

@shijie-wu
Contributor Author

Thank you for sharing the options! Idea 3 seems to be the fastest way to enable returning the raw logits without any code change. However, I just went through the relevant path: the user-provided logits_processor is appended to a new LogitsProcessorList after the default processors. As a result, the user cannot get the raw logits with the current implementation even with a custom LogitsProcessor. I might be missing something.

IMO, callbacks like a custom LogitsProcessor seem to be the best way to enable advanced usage while keeping the main generate code clean.
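
To make the limitation concrete, here is a hedged usage sketch that plugs the hypothetical ScoreRecorder from above into generate; model and inputs are the same arbitrary ones as in the reproduction sketch:

from transformers import LogitsProcessorList

recorder = ScoreRecorder()
out = model.generate(
    **inputs, max_new_tokens=3, do_sample=True, top_k=5,
    logits_processor=LogitsProcessorList([recorder]),
    return_dict_in_generate=True, output_scores=True,
)
# recorder.recorded[t] holds whatever scores reached the recorder at step t.
# Because the user-supplied list is merged after the built-in processors,
# those scores may already have been modified, so this is not guaranteed to
# recover the raw model logits the issue asks for.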
