
Make gradient_checkpointing a training argument #13657

Merged: 12 commits merged into master on Sep 22, 2021

Conversation

@sgugger (Collaborator) commented on Sep 20, 2021

What does this PR do?

This PR reworks the logic behind gradient checkpointing. It is currently set as a configuration argument, which is annoying because:

  • it's not easily discoverable
  • when someone pushes a model trained with gradient checkpointing activated to the Hub, that model keeps gradient checkpointing enabled even if new users don't want to use it.

That's why this PR deprecates the gradient_checkpointing argument in all configs and adds:

  • a gradient_checkpointing_enable method on PreTrainedModel to activate gradient checkpointing
  • a gradient_checkpointing training argument for users of the Trainer API, which calls that method.

Internally, the implementation still relies on the config, as it's the easiest place to set something that needs to be passed through several layers of a model (for a BertForMaskedLM, for instance, the actual gradient checkpointing only applies to the BertEncoder inside the BertModel inside that BertForMaskedLM), but that argument is made private and is not saved when pushing to the Hub.
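
As a quick illustration of the two new entry points (a minimal sketch assuming a checkpoint such as bert-base-cased and a transformers version that includes this change; argument names follow the PR description):

from transformers import AutoModelForMaskedLM, TrainingArguments

# Model-level API: activate gradient checkpointing on a loaded model
# instead of baking the setting into the config that gets pushed to the Hub.
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.gradient_checkpointing_enable()

# Trainer-level API: the new training argument makes the Trainer call the
# method above on the model it receives.
args = TrainingArguments(output_dir="output", gradient_checkpointing=True)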

@stas00 (Contributor) left a comment

Fantastic! Thanks so much for adding this feature and making it independent from tweaking the config object. Loving it!

Left a few small suggestions.

src/transformers/configuration_utils.py (outdated, resolved)
        Will activate gradient checkpointing if :obj:`True`, deactivate it if :obj:`False`.
        """
        if not self.supports_gradient_checkpointing and flag:
            logger.warn(f"{self.__class__.__name__} does not support gradient checkpointing so nothing will happen.")
Contributor commented:

Any reason not to assert here instead? The user can then change their setup and proceed without problems.

It's a clear error to activate this option if a model doesn't support it, IMHO.

Collaborator (Author) commented:

It's to be consistent with the previous behavior, where we did nothing if the user set gradient_checkpointing for a model that did not support it.

I'm not opposed to asserting, but let's see what @LysandreJik and @patrickvonplaten think.

Contributor commented:

I would also be in favor of raising an error here, actually. It's a new function, so I think we can add this behavior here.

Collaborator (Author) commented:

Will switch then!
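
For reference, a rough sketch of the agreed-upon behavior; everything after the check is a placeholder, not the merged implementation:

def gradient_checkpointing_enable(self):
    # Raise instead of silently warning when the architecture does not
    # support gradient checkpointing, as agreed in this thread.
    if not self.supports_gradient_checkpointing:
        raise ValueError(f"{self.__class__.__name__} does not support gradient checkpointing.")
    # Actual activation is delegated to the model-specific internals.
    ...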

src/transformers/modeling_utils.py (outdated, resolved)
sgugger and others added 4 commits on September 20, 2021 (Co-authored-by: Stas Bekman)
@LysandreJik (Member) left a comment

Looks good to me! Thanks for taking care of all the mentions of gradient_checkpointing in the repository, very cool work!

Before:
- To fine-tune LED on all 16384, it is necessary to enable *gradient checkpointing* by setting
  ``config.gradient_checkpointing = True``.
After:
- To fine-tune LED on all 16384, it is necessary to enable *gradient checkpointing* by executing
  ``model.gradient_checkpointing_enable()``.
Member commented:

How about enable_gradient_checkpointing?

@@ -932,6 +933,21 @@ def prune_heads(self, heads_to_prune: Dict[int, List[int]]):

        self.base_model._prune_heads(heads_to_prune)

    def gradient_checkpointing_enable(self, flag: bool = True):
Member commented:

Should there be a disable too?

Member commented:

Ah I didn't see this had a flag! Maybe toggle then? Or set_gradient_checkpointing to follow traditional boolean setter conventions?

Collaborator (Author) commented:

@stas00 really wanted the method name to start with gradient_checkpointing to be more easily discoverable.

Collaborator (Author) commented:

After some discussion with Lysandre, we decided to try gradient_checkpointing_enable and gradient_checkpointing_disable (no arguments for either).
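
In user code, the agreed pair would then read roughly like this (sketch only):

# Trade compute for memory during training...
model.gradient_checkpointing_enable()
# ...and turn it back off, e.g. before pure inference, where it only costs speed.
model.gradient_checkpointing_disable()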

@stas00 (Contributor) commented on Sep 20, 2021

I took the liberty of also documenting this feature in https://huggingface.co/transformers/performance.html and pushed it here, so if you rename the method, please adjust the doc as well. Thank you!

@patrickvonplaten (Contributor) commented:

I'm not very happy about keeping gradient_checkpointing in the config internally, as it adds, IMO, significantly more complexity to what a user has to know about model configurations. Before this PR, every configuration parameter one sees in configuration_utils.py is stored when saving the configuration file. If we now introduce private configuration parameters that are not saved when the model is saved, it forces users to learn and understand a new exception and makes the code harder to understand and read.

I'm very much in favor of removing gradient_checkpointing from the config, but the better option IMO is not to go through the config at all and instead provide _disable_gradient_checkpointing and _enable_gradient_checkpointing functions to all sub-modules. It's much more work, but IMO there are also many more upsides to this approach.

@sgugger (Collaborator, Author) commented on Sep 21, 2021

> I'm not very happy about keeping gradient_checkpointing in the config internally, as it adds, IMO, significantly more complexity to what a user has to know about model configurations. Before this PR, every configuration parameter one sees in configuration_utils.py is stored when saving the configuration file. If we now introduce private configuration parameters that are not saved when the model is saved, it forces users to learn and understand a new exception and makes the code harder to understand and read.

I am not following, since this is all private. The user does not have to know anything about model configurations to use this option. I'm also not sure which new exceptions you mean.

> I'm very much in favor of removing gradient_checkpointing from the config, but the better option IMO is not to go through the config at all and instead provide _disable_gradient_checkpointing and _enable_gradient_checkpointing functions to all sub-modules. It's much more work, but IMO there are also many more upsides to this approach.

Note that those submodules are often not even PreTrainedModels, so we would have to add those functions manually to tons of nn.Modules. For backward compatibility, we would also need to keep something stored in the config, since the config can't call the gradient_checkpointing_enable method on the model, so this effort is a bit pointless before v5, in the sense that there will be private parameters that are not saved anyway.

In any case, if this second approach is selected, I would still urge merging this PR as soon as possible to avoid merge conflicts or many users diverging from the templates. We can then change the internal implementation of the models more progressively.
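
For concreteness, a rough sketch of how the flag could be propagated to arbitrary submodules without going through the config; the hook name and the gradient_checkpointing attribute are assumptions for illustration, not the merged code:

from functools import partial

import torch.nn as nn


class SomePreTrainedModel(nn.Module):  # stand-in for a PreTrainedModel subclass
    def _set_gradient_checkpointing(self, module: nn.Module, value: bool = False):
        # Model-specific hook: flip the flag only on submodules that actually
        # implement a checkpointed forward pass (hypothetical attribute).
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing = value

    def gradient_checkpointing_enable(self):
        # Walk every submodule instead of routing the flag through the config.
        self.apply(partial(self._set_gradient_checkpointing, value=True))

    def gradient_checkpointing_disable(self):
        self.apply(partial(self._set_gradient_checkpointing, value=False))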

@patrickvonplaten (Contributor) commented:

I'm just a bit worried that we'll start using the "private" configuration parameters of PreTrainedConfig just as a way to easily pass flags to all the nn.Modules, even though those parameters shouldn't be in the config at all. For me, the configuration should really just be static configuration and not serve any purpose other than defining the model architecture.

For a user who just looks at the configuration on the Hub, this PR is great, but for users who actually look into the code, adding a NO_SAVE_CONFIG_KEYS option to PreTrainedConfig adds a new layer of complexity to understand. This could be avoided, IMO.

I think we should be able to add a single method to BertPreTrainedModel like this:

def _enable_gradient_checkpointing(self):
    model = self
    if hasattr(model, self.base_model_prefix):
        model = getattr(model, self.base_model_prefix)
    
    # set gradient checkpointing to True in the encoder
    model.encoder.gradient_checkpointing = True

=> this should work just fine, no?

Given that we will have to leave it in the config anyway until v5, I'm fine with leveraging the config, I guess. I just don't think it's good practice to introduce "special" configuration parameters with NO_SAVE_CONFIG_KEYS.

@sgugger mentioned this pull request on Sep 21, 2021
@stas00 (Contributor) commented on Sep 21, 2021

If we leave the config as is, as Patrick proposes, should we perhaps discuss giving the user the ability to choose what goes into the published model's config? We are sort of trying to do DWIM (do what I mean) and magically have the published model end up with all the right settings.

How about adding default filters to the model-saving interface, which would, for example, automatically disable gradient_checkpointing, and then allowing users to override those filters if they need to? That way we keep the ease of use of sensible defaults while still letting users override any of them.

In the current PR, the user has no control over NO_SAVE_CONFIG_KEYS.

And we won't need to wait until v5 to do so.
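
A hypothetical sketch of that idea, with default save-time filters that the user can override (all names below are made up for illustration, assuming a PretrainedConfig-like object with a to_dict method; this is not an existing API):

# Defaults applied when exporting a config for the Hub.
DEFAULT_SAVE_OVERRIDES = {"gradient_checkpointing": False}

def config_dict_for_hub(config, user_overrides=None):
    # Start from the full config, apply the sensible defaults, then let the
    # user have the final say on anything they explicitly want to publish.
    output = config.to_dict()
    output.update(DEFAULT_SAVE_OVERRIDES)
    output.update(user_overrides or {})
    return output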

@sgugger (Collaborator, Author) commented on Sep 21, 2021

@stas00 This is out of scope for this PR (which, by the way, no longer contains NO_SAVE_CONFIG_KEYS, to address Patrick's comments), so maybe the discussion should be moved elsewhere?

@stas00 (Contributor) commented on Sep 21, 2021

I was just following up on Patrick's comment. I have no problem with not discussing it here.

@patrickvonplaten (Contributor) left a comment

Thanks a lot for the extra effort! Really like the new design

@LysandreJik (Member) left a comment

Looks good to me! Thank you for iterating.

@sgugger merged commit 27d4639 into master on Sep 22, 2021
@sgugger deleted the gradient_checkpointing branch on September 22, 2021 at 11:51
Narsil pushed a commit to Narsil/transformers that referenced this pull request Sep 25, 2021
* Make gradient_checkpointing a training argument

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update src/transformers/configuration_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Fix tests

* Style

* document Gradient Checkpointing as a performance feature

* Small rename

* PoC for not using the config

* Adapt BC to new PoC

* Forgot to save

* Rollout changes to all other models

* Fix typo

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>
stas00 added a commit to stas00/transformers that referenced this pull request Oct 12, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 13, 2022
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
versae added a commit to versae/transformers that referenced this pull request Apr 20, 2023
It uses `flax.linen.remat` and follows on PRs huggingface#13657 and huggingface#17994