Inconsistent behavior in generate when output_scores=True #17424

Closed
2 of 4 tasks
shijie-wu opened this issue May 25, 2022 · 6 comments · Fixed by #17442
@shijie-wu
Contributor

shijie-wu commented May 25, 2022

System Info

main branch

Who can help?

@patrickvonplaten, @Narsil, @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In generate with output_scores=True, the behavior is inconsistent. In greedy_search mode, the stored scores are the raw logits:

next_token_logits = outputs.logits[:, -1, :]

# Store scores, attentions and hidden_states when required
if return_dict_in_generate:
    if output_scores:
        scores += (next_token_logits,)

but in sample mode (and the various beam search modes), the stored scores are the processed logits:

next_token_logits = outputs.logits[:, -1, :]

# pre-process distribution
next_token_scores = logits_processor(input_ids, next_token_logits)
next_token_scores = logits_warper(input_ids, next_token_scores)

# Store scores, attentions and hidden_states when required
if return_dict_in_generate:
    if output_scores:
        scores += (next_token_scores,)
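
A minimal sketch (not part of the original report) that makes the inconsistency visible; the model name and prompt are arbitrary choices, and the printed values assume the behavior described above, i.e. that the sampling path stores the warped logits (so a top_k warper leaves -inf entries) while the greedy path stores the raw logits:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, my dog is", return_tensors="pt")

# Greedy search: the stored scores are the raw logits.
greedy = model.generate(
    **inputs, max_new_tokens=3, do_sample=False,
    return_dict_in_generate=True, output_scores=True,
)

# Sampling with top_k: the stored scores are the processed logits,
# so most entries have been set to -inf by the top_k warper.
sampled = model.generate(
    **inputs, max_new_tokens=3, do_sample=True, top_k=5,
    return_dict_in_generate=True, output_scores=True,
)

print(torch.isinf(greedy.scores[0]).any())   # False: raw logits are finite
print(torch.isinf(sampled.scores[0]).any())  # True: top_k masked most entries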

Expected behavior

In generate with output_scores=True, the returned scores should be consistent: either the raw logits or the processed logits. For my use case I only need the raw logits, but there might be use cases that require the processed logits. So there are multiple options:

  1. Return raw logits when output_scores=True
  2. Return processed logits when output_scores=True
  3. Return processed logits when output_scores=True, and raw logits when output_raw_scores=True
@shijie-wu shijie-wu added the bug label May 25, 2022
@patrickvonplaten
Contributor

Great find @shijie-wu,

We've settled on outputting the processed scores, since those are the ones that determine the next token, e.g. the argmax and the sampling are taken over those scores. Given the name of the flag (output_scores), I think this makes the most sense.
Will open a PR to fix greedy_search here.

It's a good point that people might need the "raw scores" though. I think it's sensible to output the model's logits in this case, as this would be the most understandable & consistent across generation methods. E.g. every LM outputs logits, which are the "rawest" scores, so I'd be fine with adding an output_logits=True/False flag for this.
What do you think @patil-suraj @gante @shijie-wu ?

@gante
Member

gante commented May 26, 2022

@patrickvonplaten regarding a flag for the logits: on paper, yes... but we are starting to accumulate many boolean flags to control the output of internal variables (related issue: #17016, which requests the output of past key values). I wonder whether there is a better way to collect and expose the internals of generate for advanced uses 🤔

@ydshieh
Collaborator

ydshieh commented May 26, 2022

For testing purposes, especially the PT/TF generation equivalence tests, I think it would be better to be able to return the raw scores from the models --> so we can identify which part goes wrong if any test failure occurs.
(But I understand that we have a lot of flags in generate already.)

@shijie-wu
Contributor Author

Having an output_logits=True/False flag for the raw logits sounds good. Regarding the concern about too many flags in generate, we could have something like output_flags: Set[ModelInternal] = set(["logits", "scores"])?

@Narsil
Contributor

Narsil commented May 27, 2022

Just a general comment that may seem obvious to some, but I feel it's always good to restate the common options when dealing with issues like this (the tension between rampant option growth and enabling users to do powerful things).
I don't mean to say that any particular idea should be applied; these are just my go-to options in such situations, and they might give you some insight into how to deal with this!

#Idea number 1:

  • If you have too many arguments, usually some combinations make no sense. For instance, here output_logits=True with output_scores=False makes no sense: you're not outputting scores, so why would the output_logits value be of any interest? Having invalid, bogus combinations is a strong sign that two arguments should be fused into one that's an enum, for instance output_scores: ["none", "logits", "scores"] (keeping False, None, True for backward compatibility). Now there's no way to express the previous bogus combination (see the sketch below).
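
A minimal sketch (my illustration, not Narsil's code) of the fused-argument pattern, assuming a hypothetical resolve_output_scores helper; the parameter name and string values mirror the enum suggested above, and the legacy booleans are mapped onto it for backward compatibility:

from typing import Optional, Union

def resolve_output_scores(output_scores: Optional[Union[bool, str]]) -> str:
    # Hypothetical helper: collapse the legacy booleans and the new string
    # values into a single internal representation.
    if output_scores in (None, False):
        return "none"
    if output_scores is True:
        return "scores"  # what True has always meant
    if output_scores in ("none", "logits", "scores"):
        return output_scores
    raise ValueError(f"Unsupported value for output_scores: {output_scores!r}")

With a single fused argument, the bogus combination output_logits=True + output_scores=False simply cannot be expressed.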

#Idea number 2:

  • Grouping arguments is a good option too, since users usually touch some arguments much more than others. Some users are really interested in looking at the scores, while others care much more about the generation options like top_k or decoder_input_ids. Having some form of groups makes things easier: generate(input_ids, logits_options, model_options, return_options). It's super important that the groups are extremely clear (so users don't have to wonder where option X lives). An even better option for power users is to directly expose objects like LogitsProcessor or StoppingCriteria, which enables full freedom (a grouping sketch follows below).
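
A minimal sketch (my illustration, not an existing transformers API) of the argument-grouping idea, using a hypothetical ReturnOptions dataclass for the "what to return" flags discussed in this thread:

from dataclasses import dataclass

@dataclass
class ReturnOptions:
    # Hypothetical group of return-related flags.
    return_dict_in_generate: bool = False
    output_scores: bool = False
    output_attentions: bool = False
    output_hidden_states: bool = False

# Hypothetical call shape:
# generate(input_ids, return_options=ReturnOptions(output_scores=True), ...)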

#Idea number 3:

In general, for power users wanting to access internals, I think enabling tons of options to flag what needs to be outputted is just asking for general-purpose computation expressed as parameters. Exposing the internals seems like a better option.
For instance, one could add a LogitsProcessor to see the raw model scores (and at each step, at that!) and manually save them. It is a bit of work, but then users are empowered to save exactly what they want without relying on our code to enable their option (see the sketch below).
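
A minimal sketch of that idea, assuming a pass-through processor is enough: a hypothetical ScoreRecorder that stores whatever scores it is handed at each generation step and returns them unchanged.

import torch
from transformers import LogitsProcessor

class ScoreRecorder(LogitsProcessor):
    # Hypothetical processor: record the scores seen at each step, change nothing.
    def __init__(self):
        self.recorded = []

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        self.recorded.append(scores.detach().clone())
        return scores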

#Idea number 4:

It's OK to say no, more is not always better.

@shijie-wu
Contributor Author

Thank you for sharing the options! Idea 3 seems to be the fastest way to enable returning the raw logits without any code change. However, I just went through the relevant path: the user-provided logits_processor is appended to a new LogitsProcessorList after the default processors. As a result, the user cannot get the raw logits with the current implementation even with a custom LogitsProcessor. I might be missing something.

IMO, callbacks like a custom LogitsProcessor seem to be the best way to enable advanced usage while keeping the main generate code clean.
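
To make the limitation concrete, here is a hedged usage sketch that plugs the hypothetical ScoreRecorder from above into generate; model and inputs are the same arbitrary ones as in the reproduction sketch:

from transformers import LogitsProcessorList

recorder = ScoreRecorder()
out = model.generate(
    **inputs, max_new_tokens=3, do_sample=True, top_k=5,
    logits_processor=LogitsProcessorList([recorder]),
    return_dict_in_generate=True, output_scores=True,
)
# recorder.recorded[t] holds whatever scores reached the recorder at step t.
# Because the user-supplied list is merged after the built-in processors,
# those scores may already have been modified, so this is not guaranteed to
# recover the raw model logits the issue asks for.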
