
Fix gradient_checkpointing backward compatibility #14408

Merged

sgugger merged 5 commits into master from gradient_checkpointing_fix on Nov 16, 2021

Conversation

sgugger (Collaborator) commented Nov 15, 2021

What does this PR do?

This supersedes #14405 and fixes #14388 by going to the root of the problem. When the backward-compatibility code is executed in the main init, the submodules of the model have not been created yet, so there is nothing for it to do. That code needs to be executed in some kind of post_init.

We currently don't have a post_init in our models, but for another, very similar operation (init_weights, which needs to be executed at the end of the init), we have a call to that method at the end of the init of every model. The proper fix will thus be to replace that call to init_weights with a call to post_init (which will call init_weights internally). This will be a big PR that touches every model, so I will implement it by the end of the week.
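
For illustration, a minimal sketch of the intended pattern (hypothetical names and simplified classes, not the actual implementation): every model's init would end with a post_init() call, which runs init_weights() and any backward-compatibility code once all submodules exist.

from types import SimpleNamespace

class PreTrainedModelSketch:
    def __init__(self, config):
        self.config = config
        self.gradient_checkpointing = False

    def init_weights(self):
        pass  # existing weight initialization would run here

    def gradient_checkpointing_enable(self):
        self.gradient_checkpointing = True

    def post_init(self):
        # Runs at the very end of each model's __init__, once all submodules exist.
        self.init_weights()
        if getattr(self.config, "gradient_checkpointing", False):
            self.gradient_checkpointing_enable()
            # Drop the legacy attribute so it is not saved back into the config.
            delattr(self.config, "gradient_checkpointing")

class SomeModelSketch(PreTrainedModelSketch):
    def __init__(self, config):
        super().__init__(config)
        self.submodules = ["encoder", "decoder"]  # stand-in for real submodules
        self.post_init()  # replaces the old trailing call to init_weights()

model = SomeModelSketch(SimpleNamespace(gradient_checkpointing=True))
assert model.gradient_checkpointing  # the legacy config flag was honored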

As a quick fix, since we need to do a patch release because of the BC problem, this PR uses a forward pre-hook (executed before the forward method) that removes itself. The code thus runs just before the first forward: not as clean as a post_init, but the next best thing.

LysandreJik (Member) left a comment:

OK, this looks good to me! Thanks for working on it.

Comment on lines +415 to +423
def gradient_checkpointing_hook(module, _):
    # Hook to enable backward compatibility for gradient checkpointing. Will be removed once all models have a
    # proper post_init method.
    if getattr(module.config, "gradient_checkpointing", False):
        module.gradient_checkpointing_enable()
        # Remove the attribute now that it has been consumed, so it's not saved in the config.
        delattr(module.config, "gradient_checkpointing")
    # The hook will remove itself after the first execution.
    module._gradient_checkpointing_hook.remove()

Member commented:

Ok that works for me

sgugger merged commit 040fd47 into master on Nov 16, 2021
sgugger deleted the gradient_checkpointing_fix branch on November 16, 2021, 13:58
LysandreJik pushed a commit that referenced this pull request Nov 16, 2021
* Fix gradient_checkpointing backward compatibility

* Remove needless line

* make sure mask prob is big enough and length small enough

* Fix tests

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
stas00 (Contributor) commented Nov 17, 2021

This broke the HF/DeepSpeed integration with pt-1.8 and pt-1.9; it works fine with pt-1.10. Found via git bisect and reported by @jeffra, as their CI broke with our master.

RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_clm_1_zero3 -sv
E           Traceback (most recent call last):
E             File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_clm.py", line 524, in <module>
E               main()
E             File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_clm.py", line 472, in main
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1316, in train
E               tr_loss_step = self.training_step(model, inputs)
E             File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1849, in training_step
E               loss = self.compute_loss(model, inputs)
E             File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1881, in compute_loss
E               outputs = model(**inputs)
E             File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
E               return forward_call(*input, **kwargs)
E             File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1580, in forward
E               loss = self.module(*inputs, **kwargs)
E             File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1057, in _call_impl
E               for hook in itertools.chain(
E           RuntimeError: OrderedDict mutated during iteration

Comment on lines +493 to +494
if self.supports_gradient_checkpointing:
    self._gradient_checkpointing_hook = self.register_forward_pre_hook(gradient_checkpointing_hook)

stas00 (Contributor) commented:

This is the culprit for the failure I reported here: #14408 (comment)

stas00 (Contributor) commented:

Removing it fixes the problem.

sgugger (Collaborator, Author) commented:

Does this mean that DeepSpeed does not support PyTorch hooks?

stas00 (Contributor) commented:

It does, and it uses them extensively. That is perhaps the cause of the problem, if some hooks conflict or don't follow the prescribed rule of not modifying certain things.

If you're not sure about the cause, I can investigate it and report back what I find.

stas00 (Contributor) commented Nov 17, 2021:

Investigated: the issue is triggered by:

module._gradient_checkpointing_hook.remove()

stas00 (Contributor) commented Nov 17, 2021:

So it looks like DeepSpeed is just a harbinger here: any other application that also registers hooks will trigger this issue.

What appears to happen is that the hook removes itself from the very dict that is being traversed one or more frames above.

I looked at how others solved this: they had to move the hook removal out of the hook itself and into forward, at a point where it is safe to remove. Except we don't have a forward for this superclass.

I can't reproduce this with pt-1.10, which suggests the loop that traverses the hooks dict was reworked to let hooks self-remove, probably by traversing a copy of the dict.
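
A minimal repro sketch of that failure mode, independent of PyTorch and DeepSpeed (it assumes, as the traceback's itertools.chain over .values() suggests, that pt < 1.10 iterates the live hooks OrderedDict):

from collections import OrderedDict

# Stand-in for nn.Module._forward_pre_hooks, which pt-1.8/1.9 iterate directly.
hooks = OrderedDict()

def self_removing_hook():
    del hooks[0]  # mutates the dict the caller is still traversing

hooks[0] = self_removing_hook

try:
    for hook in hooks.values():  # iterating the live dict, as pt < 1.10 does
        hook()
except RuntimeError as e:
    print(e)  # OrderedDict mutated during iteration

# Traversing a copy instead makes self-removal safe (presumably the pt-1.10 behavior):
hooks[0] = self_removing_hook
for hook in list(hooks.values()):
    hook()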

stas00 (Contributor) commented:

Possible fix: #14427

Successfully merging this pull request may close these issues.

Wav2Vec2 CUDA memory usage doubled in v4.11.3 compared to v4.10.3 with the same batch size