
Update past_key_values in GPT-2 #9596

Conversation

forest1988
Contributor

@forest1988 forest1988 commented Jan 14, 2021

What does this PR do?

It seems GPT-2 and BartDecoder use different formats for past_key_values.
As advised by @patrickvonplaten,
I opened this PR to change GPT-2's cache format from a single stacked tensor per layer to a tuple of 2 tensors per layer.
Once this is done, past_key_values in GPT-2 can be handled in the same way as in Bart.
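To make the difference concrete, here is a minimal sketch of the two layouts (tensor shapes and the layer count are made-up values for illustration, not taken from the actual modeling code):

import torch

batch, heads, seq_len, head_dim, n_layers = 2, 12, 5, 64, 12
key = torch.randn(batch, heads, seq_len, head_dim)
value = torch.randn(batch, heads, seq_len, head_dim)

# old GPT-2 style: one stacked tensor per layer, shape (2, batch, heads, seq_len, head_dim)
old_layer_past = torch.stack((key, value))

# new style (matching BartDecoder): a tuple of 2 tensors per layer
new_layer_past = (key, value)

# the full cache is then a tuple with one entry per layer
past_key_values = tuple(new_layer_past for _ in range(n_layers))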

Sorry, some errors remain; this PR is [WIP].
I would appreciate your advice on how to update generation_utils.py.
Can I modify _reorder_cache so that past changes from Tuple[torch.Tensor] to Tuple[Tuple[torch.Tensor]],
or should I also consider the other output variations, outputs.mems and outputs.past_buckets_states?

Fixes #9391

From patrickvonplaten:

This PR cleans up the _reorder_cache logic. _reorder_cache in generation_utils.py now defaults to raising a NotImplementedError, forcing each model to implement its corresponding _reorder_cache in its modeling_...py file itself. This is cleaner because _reorder_cache strongly differs from model to model. In addition, this PR makes sure that gradient_checkpointing can only be used if the model is in training mode, and makes sure that use_cache is disabled when training with gradient_checkpointing enabled, to prevent errors.
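As a rough illustration of the use_cache rule described above (a hypothetical helper for illustration, not the code added by this PR):

def resolve_use_cache(use_cache: bool, gradient_checkpointing: bool, training: bool) -> bool:
    """Force the key/value cache off when gradient checkpointing is active during training."""
    if gradient_checkpointing and training:
        return False
    return use_cache

# e.g. inside a model's forward pass (names illustrative):
# use_cache = resolve_use_cache(use_cache, self.config.gradient_checkpointing, self.training)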

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

GPT2: @LysandreJik, @patrickvonplaten

@forest1988
Contributor Author

The CircleCI error messages are as follows.

In run_tests_torch:

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_beam_sample_generate
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate_dict_outputs_use_cache
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_gpt2_gradient_checkpointing
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_group_beam_search_generate
==== 5 failed, 4202 passed, 1775 skipped, 744 warnings in 216.47s (0:03:36) ====

Exited with code exit status 1
CircleCI received exit code 1

In run_tests_flax:

FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_beam_sample_generate
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate_dict_outputs_use_cache
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_gpt2_gradient_checkpointing
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_group_beam_search_generate
==== 5 failed, 4172 passed, 1805 skipped, 751 warnings in 282.27s (0:04:42) ====

Exited with code exit status 1
CircleCI received exit code 1

@forest1988
Contributor Author

Is there a difference between past_key_value and layer_past? I understand that they both represent the contents of past_key_values, i.e. the cache of each layer, but are they different?

I first thought it might be a difference between causal language models and seq2seq language models, but it seems that both past_key_value and layer_past are used in modeling_bart.py.

And as for the contents of layer_past, should they be named past_state, as in the following part of modeling_bart.py?

@staticmethod
def _reorder_cache(past, beam_idx):
    reordered_past = ()
    for layer_past in past:
        # cached cross_attention states don't have to be reordered -> they are always the same
        reordered_past += (
            tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        )
    return reordered_past
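For comparison, a decoder-only model like GPT-2 has no cached cross-attention states, so once its cache is a tuple of tuples, its _reorder_cache could presumably just reorder every tensor in each layer's tuple. A minimal sketch (my understanding, not necessarily the exact code this PR ends up with):

@staticmethod
def _reorder_cache(past, beam_idx):
    # past has one entry per layer; each entry is a tuple of (key, value) tensors
    # whose first dimension is the batch (beam) dimension
    return tuple(
        tuple(past_state.index_select(0, beam_idx) for past_state in layer_past)
        for layer_past in past
    )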

@forest1988
Contributor Author

I've updated generation_utils.py, and it seems that mems in transfo_xl and xlnet now cause new errors.

=========================== short test summary info ============================
FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_gpt2_gradient_checkpointing
FAILED tests/test_modeling_transfo_xl.py::TransfoXLModelTest::test_beam_sample_generate
FAILED tests/test_modeling_transfo_xl.py::TransfoXLModelTest::test_beam_sample_generate_dict_output
FAILED tests/test_modeling_transfo_xl.py::TransfoXLModelTest::test_beam_search_generate
FAILED tests/test_modeling_transfo_xl.py::TransfoXLModelTest::test_beam_search_generate_dict_output
FAILED tests/test_modeling_transfo_xl.py::TransfoXLModelTest::test_group_beam_search_generate
FAILED tests/test_modeling_transfo_xl.py::TransfoXLModelTest::test_group_beam_search_generate_dict_output
FAILED tests/test_modeling_xlnet.py::XLNetModelTest::test_beam_sample_generate
FAILED tests/test_modeling_xlnet.py::XLNetModelTest::test_beam_sample_generate_dict_output
FAILED tests/test_modeling_xlnet.py::XLNetModelTest::test_beam_search_generate
FAILED tests/test_modeling_xlnet.py::XLNetModelTest::test_beam_search_generate_dict_output
FAILED tests/test_modeling_xlnet.py::XLNetModelTest::test_group_beam_search_generate
FAILED tests/test_modeling_xlnet.py::XLNetModelTest::test_group_beam_search_generate_dict_output
=== 13 failed, 4194 passed, 1775 skipped, 743 warnings in 205.38s (0:03:25) ====

Exited with code exit status 1
CircleCI received exit code 1

@dataclass
class XLNetModelOutput(ModelOutput):
    """
    Output type of :class:`~transformers.XLNetModel`.

    Args:
        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, hidden_size)`):
            Sequence of hidden-states at the last layer of the model.

            ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
            ``num_predict`` corresponds to ``sequence_length``.
        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
            have already been computed.
        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
            of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

It seems mems is something similar to past_key_values.
Is there any difference between these two elements beyond the name?
Also, is it safe to change mems from List[torch.Tensor] to Tuple[Tuple[torch.Tensor]]?
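To make the structural question concrete, here is how I picture the two cache styles (shapes and sizes are made up for illustration, and the mems layout is my understanding rather than a quote from the code):

import torch

n_layers, mem_len, batch, hidden, heads = 2, 5, 3, 8, 4

# mems-style cache (TransfoXL / XLNet): one tensor per layer, with the batch in dim 1
mems = [torch.zeros(mem_len, batch, hidden) for _ in range(n_layers)]

# past_key_values-style cache (Bart, and GPT-2 after this PR): one (key, value) pair per layer
past_key_values = tuple(
    (
        torch.zeros(batch, heads, mem_len, hidden // heads),
        torch.zeros(batch, heads, mem_len, hidden // heads),
    )
    for _ in range(n_layers)
)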

@patrickvonplaten
Contributor

Hey @forest1988,

Your PR looks very nice! Yes, it is actually expected that XLNet and TransfoXL fail, since they have also been using the "default" _reorder_cache function of generation_utils.py. Could you make the following changes to correct this:

  1. Copy the old _reorder_cache function (the one before your changes) from generation_utils.py to both the modeling_xlnet.py and modeling_transfo_xl.py files, so that those models keep the same function as before (a sketch of that old function is shown after this list).
  2. Copy the current _reorder_cache function of generation_utils.py into modeling_gpt2.py.
  3. Add a default _reorder_cache function to generation_utils.py that looks as follows:
def _reorder_cache(self, past, beam_idx):
    raise NotImplementedError(...)
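For reference, the old default in generation_utils.py looked roughly like this (a sketch from memory; the exact signature may differ). It reorders each layer's single cached tensor along dimension 1, which is the batch (beam) dimension both for GPT-2's old stacked tensors and for the mems of XLNet/TransfoXL:

@staticmethod
def _reorder_cache(past, beam_idx):
    # each layer_past is a single tensor whose dim 1 is the batch (beam) dimension
    return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)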

@forest1988
Contributor Author

I've just updated the torch.utils.checkpoint.checkpoint check in modeling_gpt2.py, referring to modeling_bart.py.

@patrickvonplaten
Contributor

This way it's much cleaner and correct :-) The reason I'm proposing this change is that the _reorder_cache function is so different for each model that there should be no default implementation. A default function could confuse people who want to add a new model into thinking it works out of the box, when in most cases it just doesn't. A clear error message such as the following is better:

def _reorder_cache(self, past, beam_idx):
    raise NotImplementedError(f"Make sure that a `_reorder_cache` function is correctly implemented in {self.__class__.__module__} to enable beam search for {self.__class__}")

@patrickvonplaten
Contributor

I think this should solve the problems, let me know if you need more help :-)

@forest1988
Contributor Author

Thank you for your advice! I'll update _reorder_cache soon and commit it.

@forest1988 forest1988 force-pushed the forest1988-fix-gpt2-past_key_values branch from 89ee453 to d04b10c on January 14, 2021 17:00
@forest1988
Contributor Author

forest1988 commented Jan 14, 2021

Hi @patrickvonplaten,

Thanks to your kind advice, I was able to solve the problem of _reorder_cache in GPT-2, XLNet, TransfoXL (and CTRL).
Referring to modeling_bart.py, where _reorder_cache is placed in the ConditionalGeneration model, I added _reorder_cache to the LMHead model of each causal language model.

The last remaining failing test is:

FAILED tests/test_modeling_gpt2.py::GPT2ModelTest::test_gpt2_gradient_checkpointing

I think I should either modify test_gpt2_gradient_checkpointing so that it uses use_cache=False, or reconsider my previous update and re-modify the usage of checkpoint in modeling_gpt2.py.

I've just updated the torch.utils.checkpoint.checkpoint check in modeling_gpt2.py, referring to modeling_bart.py.
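For context, the pattern in modeling_bart.py that I'm referring to wraps each block in a small closure before handing it to torch.utils.checkpoint.checkpoint and skips caching while checkpointing. A self-contained toy version of that pattern (the block and all names here are made up for illustration; this is not the actual GPT-2 code):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    # stand-in for a transformer block, for illustration only
    def __init__(self, hidden_size=16):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, use_cache=False):
        return torch.tanh(self.linear(hidden_states))


def run_block(block, hidden_states, use_cache, gradient_checkpointing, training):
    # mimic the modeling_bart.py pattern: disable the cache and checkpoint while training
    if gradient_checkpointing and training:
        use_cache = False  # caching is incompatible with checkpointed re-computation

        def create_custom_forward(module):
            # checkpoint() only passes tensor arguments through, so capture flags in a closure
            def custom_forward(*inputs):
                return module(*inputs, use_cache=use_cache)
            return custom_forward

        return checkpoint(create_custom_forward(block), hidden_states)
    return block(hidden_states, use_cache=use_cache)


block = ToyBlock()
x = torch.randn(2, 4, 16, requires_grad=True)
out = run_block(block, x, use_cache=True, gradient_checkpointing=True, training=True)
out.sum().backward()  # gradients still flow through the checkpointed block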

@forest1988
Contributor Author

All checks have passed!
I appreciate all your help.

However, in the documentation of _reorder_cache, there are references to both past_key_values and mems regardless of which object the model actually uses.
I think we can either fix that and mention only the one that is used, or leave the reference to both to show that the aim of the function is the same.
If it needs to be modified, please let me know.

@forest1988 forest1988 changed the title from "[WIP] Update past_key_values in GPT-2" to "Update past_key_values in GPT-2" on Jan 14, 2021
called. This is required to match :obj:`past_key_values` or :obj:`mems` with the correct beam_idx at every
generation step.

For custom re-ordering of :obj:`past_key_values` or :obj:`mems`, the function should be implemented in
Contributor

remove those lines and past_key_values above

Contributor Author

I cleaned it as well.

Contributor

@patrickvonplaten patrickvonplaten left a comment

The PR looks very nice - thanks so much for taking the time to tackle this @forest1988. Let's wait a bit to see how to proceed with gradient_checkpointing in GPT2, as this question will come up more often. IMO, use_cache should always be False for training, so either we update all use_cache in the models with use_cache = not self.training and (use_cache if use_cache is not None else self.config.use_cache), or we force it somehow in the Trainer. Similarly, gradient_checkpointing should never be set to True when the model is not training, IMO (we could also automatically disable this using self.training). Let's see what @LysandreJik and @sgugger think.

Collaborator

@sgugger sgugger left a comment

This is not a part of the library I'm very familiar with, so the changes look okay on my side, but I'm no expert.

Member

@LysandreJik LysandreJik left a comment

These changes look good to me! Thanks for taking care of it @forest1988.

Contributor

@patrickvonplaten patrickvonplaten left a comment

Great work @forest1988,

I hope it's fine for you that I went into the PR to do some final fixes. Thanks a lot for cleaning this up :-)

@forest1988
Contributor Author

Hi @patrickvonplaten,

I hope it's fine for you that I went into the PR to do some final fixes. Thanks a lot for cleaning this up :-)

Of course! Thank you for adding fixes to make this PR more valuable!

Member

@LysandreJik LysandreJik left a comment

Your commit looks good to me @patrickvonplaten! Thanks.

Collaborator

@sgugger sgugger left a comment

The new changes look good to me, thanks!

@patrickvonplaten patrickvonplaten merged commit b020a73 into huggingface:master Jan 19, 2021
@patrickvonplaten
Contributor

Awesome, merging - great job @forest1988 !

@forest1988
Contributor Author

Thank you for your advice and encouraging comments!
It’s my pleasure to have opened this PR!

@@ -232,7 +232,7 @@ def forward(
             value = torch.cat((past_value, value), dim=-2)

         if use_cache is True:
-            present = torch.stack((key.transpose(-2, -1), value))  # transpose to have same shapes for stacking
+            present = (key.transpose(-2, -1), value)  # transpose to have same shapes
Contributor

This is the reason for the recent failure of the slow test:

RUN_SLOW=1 pytest tests/test_onnx.py::OnnxExportTestCase::test_export_pytorch

Can you fix the onnx part easily? @mfuntowicz @Narsil


Successfully merging this pull request may close these issues.

Similar usage of past_key_values in CausalLM and Seq2SeqLM