fix hook removal issue #14427

Closed
wants to merge 2 commits into from

Conversation

stas00 (Contributor) commented Nov 17, 2021

Here is a possible workaround for an issue triggered by #14408 and reported at #14408 (comment). I repeat all the relevant information below.

This broke the HF/DeepSpeed integration with pt-1.8 and pt-1.9 (it works fine with pt-1.10). Found via git bisect and reported by @jeffra, as their CI broke against our master.

RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_clm_1_zero3 -sv
E           Traceback (most recent call last):
E             File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_clm.py", line 524, in <module>
E               main()
E             File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_clm.py", line 472, in main
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1316, in train
E               tr_loss_step = self.training_step(model, inputs)
E             File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1849, in training_step
E               loss = self.compute_loss(model, inputs)
E             File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1881, in compute_loss
E               outputs = model(**inputs)
E             File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
E               return forward_call(*input, **kwargs)
E             File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1580, in forward
E               loss = self.module(*inputs, **kwargs)
E             File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1057, in _call_impl
E               for hook in itertools.chain(
E           RuntimeError: OrderedDict mutated during iteration

The issue is triggered by:

module._gradient_checkpointing_hook.remove()

So it looks like DeepSpeed is just a harbinger here: any other application that registers hooks after this hook will trigger the same issue.

It appears that the hook is being removed from the hooks dict while that dict is still being traversed one or more frames above.

Perhaps if the hook is the last one Python doesn't report the issue; but if more hooks are registered after it, that's when the dict mutation is detected.
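
To make the failure mode concrete, here is a minimal standalone reproduction sketch (illustrative only, not code from this repo): a forward pre-hook removes its own handle while a second hook is registered after it, so on pt-1.8/1.9 the hooks dict gets mutated while Module._call_impl is still iterating over it.

import torch
import torch.nn as nn

module = nn.Linear(4, 4)

def self_removing_hook(mod, inputs):
    # removing the handle deletes an entry from mod._forward_pre_hooks
    # while _call_impl is still iterating over that OrderedDict
    handle.remove()

handle = module.register_forward_pre_hook(self_removing_hook)

# a second hook registered after the first one: iteration has to continue
# past the removed entry, which is what surfaces the mutation error
module.register_forward_pre_hook(lambda mod, inputs: None)

# raises "RuntimeError: OrderedDict mutated during iteration" on pt-1.8/1.9
# (per the report above); not reproducible on pt-1.10
module(torch.randn(2, 4))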

I looked at how others solved this: they had to move the hook removal out of the hook itself and into forward, where it is safe to remove it. Except we don't have a forward for this superclass.

For some reason I can't reproduce this with pt-1.10, which suggests that PyTorch has reworked the loop that traverses the hooks dict to allow hooks to remove themselves - probably by using a copy of the dict for the traversal.
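
For illustration only (a guess at what such a rework might look like, not verified against the PyTorch source), iterating over a snapshot of the hooks dict would let a hook remove itself without tripping the mutation check:

# hypothetical sketch, not PyTorch's actual code
for hook in list(self._forward_pre_hooks.values()):
    hook(self, input)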

So this PR is an attempt to make things work by rendering the hook a no-op for subsequent calls instead of removing it. Since it's described as a temporary hook that will be removed soon, perhaps that's OK? For pt-1.10 we can safely remove it.
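
A rough sketch of what "render the hook a no-op" could look like (the flag name _gradient_checkpointing_disarmed is made up for illustration; the actual diff in this PR may differ):

def _gradient_checkpointing_pre_hook(module, inputs):
    # instead of module._gradient_checkpointing_hook.remove(), which mutates
    # the hooks dict mid-iteration on pt-1.8/1.9, disarm the hook and return
    if getattr(module, "_gradient_checkpointing_disarmed", False):
        return
    # ... the original one-time backward-compatibility work goes here ...
    module._gradient_checkpointing_disarmed = True

The handle could then be removed later, at a point where no forward call is iterating over the hooks dict, or simply left in place until the temporary hook is dropped altogether.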

Obviously, this is just a suggestion. Now that you understand the issue, perhaps you will come up with a more efficient solution.

@sgugger

sgugger (Collaborator) commented Nov 17, 2021

Thanks a lot for the investigation! The hook was only a temporary quick fix for the backward-compatibility issue. I will work on the "real fix" today (which won't depend on hooks), and if I finish it quickly enough, I propose we just merge the "real fix". If it starts taking too much time, we can merge your PR as a quicker fix.

sgugger (Collaborator) commented Nov 17, 2021

#14431 should fix the issue in a better way, by removing the hook entirely :-)

stas00 (Contributor, Author) commented Nov 17, 2021

I confirm that #14431 resolved the problem. Thank you, Sylvain!

@stas00 stas00 closed this Nov 17, 2021
@stas00 stas00 deleted the fix-hook branch April 27, 2022 16:05