
Mismatch between beam search score transition probabilities and beam sequence scores #15869

Closed
2 of 4 tasks
PastelBelem8 opened this issue Mar 1, 2022 · 9 comments · Fixed by #17311

Comments

@PastelBelem8

PastelBelem8 commented Mar 1, 2022

Environment info

  • transformers version: 4.16.2
  • Platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.31
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

I'm tagging @patrickvonplaten because he recently worked on this in #14654.

Information

The model is T5. I'm using it for conditional generation and inspecting the beam search decoding outputs.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

Purpose: Create two simple dummy examples and test whether the transition probability scores sum to the sequence scores. The goal is to understand what the sequence scores represent (are they length-normalized or unnormalized?). From the discussion in PR #14654, I had the impression that it would be enough to sum the transition probabilities, but these do not seem to match. Would you please help me understand?

To reproduce

Steps to reproduce the behavior:

  1. Create two simple dummy input examples.
  2. Encode them using a t5 model (model) and generate the output scores using beam search.
  3. Obtain transition scores using model.compute_transition_beam_scores.
  4. Sum the results of step 3 and compare them with the returned BeamSearchEncoderDecoderOutput.sequences_scores.
import torch
import transformers

model_name = "t5-small"
encoding_hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    "truncation": True,
    "add_special_tokens": True,
    "return_attention_mask": True,
    "return_tensors": "pt",
}

tokenizer = transformers.T5TokenizerFast.from_pretrained(model_name)
model = transformers.T5ForConditionalGeneration.from_pretrained(model_name)

EXAMPLE = ["question: How are you? \n context: I had a long day, she said. I am so exhausted.", 
           "question: What did the fox do? \n context: The fox jumped over the fence into a very green lawn."]

BEAM_SEARCH_KWARGS = {
    "num_beams": 4,
    "do_sample": False,
    "num_return_sequences": 1,
}

# Encode inputs
inputs_ids = tokenizer(EXAMPLE, **encoding_hyperparameters)

# Generate using beam search
beamsearch_results = model.generate(
    input_ids=inputs_ids["input_ids"],
    attention_mask=inputs_ids["attention_mask"],    
    max_length=10,
    return_dict_in_generate=True,
    output_scores=True,
    # the id of the token to force as the last generated token when max_length is reached
    forced_eos_token_id=tokenizer.eos_token_id,
    **BEAM_SEARCH_KWARGS
)

trs_bs = model.compute_transition_beam_scores(
    sequences=beamsearch_results.sequences,
    scores=beamsearch_results.scores, 
    beam_indices=beamsearch_results.beam_indices
)

print("Summ:", torch.sum(trs_bs, dim=1), "Expected:", beamsearch_results.sequences_scores)
print("Sum/length:", torch.sum(trs_bs, dim=1)/beamsearch_results.beam_indices.shape[-1], "Expected:", beamsearch_results.sequences_scores)
# output
# Sum: tensor([-1.5411, -0.3851]) Expected: tensor([-0.1712, -0.0428])
# Sum/length: tensor([-0.1712, -0.0428]) Expected: tensor([-0.1712, -0.0428])

From the example above I deduced that, in order to obtain the same scores as those in sequences_scores, it would suffice to divide by the length of the generated sequences. In this case it works nicely because both sequences have the same length:

# output of beamsearch_results.sequences
tensor([[    0,    27,   141,     3,     9,   307,   239,     6,   255,     1],
        [    0,     3, 16287,   147,     8,  8227,   139,     3,     9,     1]])
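As a quick sanity check (this call is not in the original script), the generated ids can be decoded to inspect the texts:

# Sanity check (not in the original script): decode the generated ids to text.
print(tokenizer.batch_decode(beamsearch_results.sequences, skip_special_tokens=True))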

So I tried a different example that causes the beamsearch_results.sequences to have different lengths:

# Example 2
# The only change to the script above is the example, where we modify the first sequence in the batch
EXAMPLE = ["question: Is this yes or no question? \n context: It is yes", 
           "question: What did the fox do? \n context: The fox jumped over the fence into a very green lawn."]
# ...

print("Sum:", torch.sum(trs_bs, dim=1), "Expected:", beamsearch_results.sequences_scores)
print("Sum/length:", torch.sum(trs_bs, dim=1)/beamsearch_results.beam_indices.shape[-1], "Expected:", beamsearch_results.sequences_scores)
print("Sum/rel_length:", torch.sum(trs_bs, dim=1) / torch.sum(trs_bs != 0, dim=1), "Expected:", beamsearch_results.sequences_scores)
# outputs
# Sum: tensor([-0.0770, -0.3851]) Expected: tensor([-0.0385, -0.0428])
# Sum/length: tensor([-0.0086, -0.0428]) Expected: tensor([-0.0385, -0.0428])
# Sum/rel_length: tensor([-0.0385, -0.0481]) Expected: tensor([-0.0385, -0.0428])

The output of beamsearch_results.sequences for the above example is:

tensor([[    0,  4273,     1,     0,     0,     0,     0,     0,     0,     0],
        [    0,     3, 16287,   147,     8,  8227,   139,     3,     9,     1]])

The difference between Sum/length and Sum/rel_length is that in the former I divide by the maximum length of the generated sequences, whereas in the latter I divide by the number of non-zero transition probabilities. We can see that in the latter case (i.e., when dividing by the relative length) only the first example's score matches the original beamsearch_results.sequences_scores.
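For context, my current understanding (an assumption from reading the beam search code, not a confirmed spec) is that the beam scorer stores each finished hypothesis with a score of the form sum_logprobs / (length ** length_penalty), where length_penalty defaults to 1.0. Under that assumption, a rough reconstruction of sequences_scores from the transition scores would look like this:

# Sketch, assuming sequences_scores = sum(log-probs) / (generated_length ** length_penalty)
# with the default length_penalty = 1.0. The per-sequence generated length is approximated
# here by the number of non-zero transition scores; the mismatch in the second example above
# suggests this length accounting is exactly where things diverge.
gen_len = torch.sum(trs_bs != 0, dim=1).clamp(min=1)
reconstructed = torch.sum(trs_bs, dim=1) / gen_len
print("Reconstructed:", reconstructed, "Expected:", beamsearch_results.sequences_scores)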

Would you please help me better understand the computation of these probabilities and their connection with sequences_scores? In particular, are the individual scores returned by compute_transition_beam_scores length-normalized? Do these individual scores represent the joint probability up to that step, or the individual per-token probabilities? Are we supposed to account for the initial padding token when computing the scores?

Thanks in advance for your time!

@github-actions

github-actions bot commented Apr 1, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@patrickvonplaten
Contributor

Hey @PastelBelem8,

Sorry to answer so late. We've added some tests to make sure the transition probabilities work correctly. Could you take a look at this answer: #16413 (comment) and see whether it applies to your use case?

@github-actions

github-actions bot commented May 1, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@hacobe

hacobe commented May 3, 2022

I find the code above from @PastelBelem8 works if I set the length_penalty to 0. However, if I change the prompt so that the model produces a completion that has fewer tokens than the max_length, then the sequences_scores and the output of compute_transition_beam_scores are very different again. @patrickvonplaten, any thoughts on what might be going on there? Thanks in advance for your help!

See the code in this colab:
https://colab.research.google.com/drive/1KAc_Mk2k8qiiqKgXfAcggJWfaBEvvzox#scrollTo=6in7zwm7Dqxf
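For reference, a minimal sketch of the length_penalty check described above (reusing the variables from the first script; with length ** 0 == 1 the length normalization drops out, so the plain sum of transition scores should match sequences_scores):

# Sketch: disable length normalization (length_penalty=0.0) so that sequences_scores
# should equal the plain sum of the per-step transition scores.
beamsearch_results = model.generate(
    input_ids=inputs_ids["input_ids"],
    attention_mask=inputs_ids["attention_mask"],
    max_length=10,
    length_penalty=0.0,
    return_dict_in_generate=True,
    output_scores=True,
    **BEAM_SEARCH_KWARGS,
)
trs_bs = model.compute_transition_beam_scores(
    sequences=beamsearch_results.sequences,
    scores=beamsearch_results.scores,
    beam_indices=beamsearch_results.beam_indices,
)
print("Sum:", torch.sum(trs_bs, dim=1), "Expected:", beamsearch_results.sequences_scores)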

@patrickvonplaten
Contributor

Thanks for the script @hacobe,

I can reproduce the problem. It looks like an error in Transformers. Investigating now.

@patrickvonplaten
Contributor

Hey @hacobe,

The problem is actually much more difficult than I thought. It'll need a bigger refactor of the beam scorer. I'll keep you updated!

@hacobe

hacobe commented May 16, 2022

Got it. Thanks for looking into it!

@superhero-7

superhero-7 commented May 25, 2022

@PastelBelem8 @hacobe @patrickvonplaten I am confused about a few things (I also work with a T5 model on a vision-language task):

  1. Do you find that the first token id in every beam of beamsearch_results.sequences is always equal to zero? I also find that the length of every beam in beamsearch_results.sequences always seems to equal the length of the scores tuple plus one.
  2. I noticed this because I am currently working on the REINFORCE algorithm with a CIDEr reward. I think most people who want the transition probabilities also want them for REINFORCE, but we don't really need the probability of every token at every time step; what we need is the probability of the whole sequence, which can be formalized via the chain rule as P(W) = P(W_t | W_{t-1}, ..., W_1) · ... · P(W_2 | W_1) · P(W_1). I thought the last entry of the returned scores tuple might represent the probability of the sequence. Am I missing something? And can I compute the gradient of P(W) (obtained from the last entry of the returned scores) using standard backpropagation?
  3. Right now I am working with an old version of transformers that doesn't support the new compute_transition_beam_scores function. Is there a way to use compute_transition_beam_scores without upgrading the whole library?

I would very much appreciate any advice!

@PastelBelem8
Author

@superhero-7 The first token id in every beam is always 0 because the model prepends a pad token (T5's decoder start token) to every continuation of the string you pass to the generate method.
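A quick way to see this (a sketch, reusing the objects from the script above): for t5-small the decoder start token is the pad token, so each returned beam starts with id 0 and is one token longer than the number of scored generation steps.

# Sketch: T5 seeds the decoder with the pad token (id 0) as its start token, so each
# beam in sequences begins with 0 and carries one extra token relative to the scores tuple.
print(model.config.decoder_start_token_id, tokenizer.pad_token_id)             # both 0 for t5-small
print(beamsearch_results.sequences.shape[-1], len(beamsearch_results.scores))  # differ by one here

As for P(W): as I understand it, summing the per-step transition scores gives log P(W) for each returned sequence, which is what the Sum values earlier in this thread compute.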
