Here is a possible workaround for an issue triggered by #14408 and reported at #14408 (comment). I repeat all the relevant information below.
This broke the HF/DeepSpeed integration with pt-1.8 and pt-1.9; it works fine with pt-1.10. The problem was found by git bisecting and reported by @jeffra, as their CI broke with our master.
The issue is triggered by line 423 of transformers/src/transformers/modeling_utils.py (at commit b567510), so it looks like DeepSpeed is just a harbinger here: any other application that registers hooks after this one will trigger the same issue.
It appears that the hook is being removed from the dict while the dict is being traversed one or more frames above.
Perhaps if the hook is the last one, Python doesn't report the problem. But if more hooks are registered after it, that's when the dict mutation is detected.
I looked at what others did to solve this, and they had to move the hook removal out of the hook itself and into `forward`, where it's safe to remove it. Except we don't have a `forward` for this super class.
For some reason I can't reproduce this with pt-1.10, which suggests that PyTorch has reworked the loop that traverses the hooks dict to allow hooks to remove themselves, probably by traversing a copy of the dict.
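The failure mode and the copy-based traversal guessed at above can be reproduced without PyTorch. The following is only a minimal sketch: `run_hooks`, `make_hooks` and `try_run` are hypothetical names, and an `OrderedDict` stands in for whatever container the module uses to hold its hooks.

```python
from collections import OrderedDict


def make_hooks():
    """Build a hooks dict whose first hook removes itself when called."""
    hooks = OrderedDict()

    def self_removing():
        # Mutates the dict that the caller may currently be iterating over.
        del hooks["temp"]
        return "temp"

    def other():
        return "other"

    hooks["temp"] = self_removing
    hooks["other"] = other  # registered after the self-removing hook
    return hooks


def run_hooks(hooks, snapshot):
    """Call every hook, optionally iterating over a snapshot of the values."""
    callbacks = list(hooks.values()) if snapshot else hooks.values()
    return [hook() for hook in callbacks]


def try_run(snapshot):
    """Return the list of hooks that ran, or the name of the error raised."""
    try:
        return run_hooks(make_hooks(), snapshot)
    except RuntimeError as exc:
        return type(exc).__name__


live_result = try_run(snapshot=False)  # iterates the live dict
copy_result = try_run(snapshot=True)   # iterates a copy of the values
```

Iterating the live dict fails as soon as the loop tries to advance past the self-removing hook, while iterating a snapshot lets both hooks run. Whether pt-1.10 actually does this is a guess, as noted above.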
So this PR is an attempt to make things work by leaving the hook registered and making it a no-op for subsequent calls. Since the code says this is a temporary hook that will be removed soon, perhaps that's OK? With pt-1.10 we can safely remove it.
Obviously, this is just a suggestion. Now that you understand the issue, perhaps you will come up with a more efficient solution.
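One way to picture the "no-op after the first call" idea is sketched below. This is not the code from this PR: `TinyModule`, `register_pre_hook` and `make_one_shot_hook` are hypothetical stand-ins, and the guard flag is just one possible way to disable the hook without touching the hooks dict.

```python
from collections import OrderedDict


class TinyModule:
    """Hypothetical stand-in for a module that runs pre-forward hooks."""

    def __init__(self):
        self._pre_hooks = OrderedDict()
        self.setup_runs = 0

    def register_pre_hook(self, name, hook):
        self._pre_hooks[name] = hook

    def __call__(self, x):
        # Iterates the live dict, so hooks must not remove themselves here.
        for hook in self._pre_hooks.values():
            hook(self, x)
        return x


def make_one_shot_hook():
    """Return a hook that does its work once and is a no-op afterwards."""
    state = {"done": False}

    def hook(module, x):
        if state["done"]:
            return  # already ran: do nothing, and leave the dict untouched
        state["done"] = True
        module.setup_runs += 1  # placeholder for the real one-time work

    return hook


module = TinyModule()
module.register_pre_hook("temp", make_one_shot_hook())
module.register_pre_hook("later", lambda m, x: None)  # hook added afterwards

module(1)
module(2)
```

The hook stays registered, so each call pays for one extra function call and flag check, but the hooks dict is never mutated during traversal.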
@sgugger