[zero3] fix reference counting in backward over multiple forwards #1227

Merged: tjruwase merged 2 commits into deepspeedai:master from stas00:fix-prefetch-with-repeat-layer on Jul 14, 2021

@stas00 stas00 commented Jul 14, 2021

Models like Albert run the same layer's forward multiple times in a loop before doing backward. The current implementation can't handle that because it assumes forward/backward pairs and runs the prefetch hooks only once. The subsequent backward passes then operate on the partitioned (ungathered) param, which fails with:

Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/language-modeling/run_mlm.py", line 550, in <module>
    main()
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/language-modeling/run_mlm.py", line 501, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/trainer.py", line 1275, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/trainer.py", line 1784, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1191, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage3.py", line 2972, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function LinearFunctionForZeroStage3Backward returned an invalid gradient at index 0 - got [2, 512] but expected shape compatible with [2, 512, 256]

This PR

  • switches from an on/off flag to reference counting, which solves the problem
  • adds tests
  • makes a few other small improvements to the tests and code
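
To illustrate why a counter helps where an on/off flag does not, here is a minimal toy sketch in plain Python. It is not DeepSpeed's code: `SharedParam`, `FlagPolicy`, `RefCountPolicy` and `run_step` are hypothetical names invented for this example, and the hook points are simplified assumptions about how a gather/release cycle around backward might look.

```python
class SharedParam:
    """Toy stand-in for a partitioned parameter (hypothetical, not a DeepSpeed class)."""

    def __init__(self):
        self.gathered = False

    def gather(self):
        self.gathered = True

    def release(self):
        self.gathered = False


class FlagPolicy:
    """Assumed old behavior: gather once, guarded by an on/off flag."""

    def __init__(self, param):
        self.param = param
        self.fetched = False

    def on_forward(self):
        pass  # the flag does not record how many times the layer ran

    def pre_backward(self):
        if not self.fetched:
            self.param.gather()
            self.fetched = True

    def post_backward(self):
        self.param.release()  # released after the first backward already


class RefCountPolicy:
    """Assumed new behavior: count forward uses, release when the count hits zero."""

    def __init__(self, param):
        self.param = param
        self.pending = 0

    def on_forward(self):
        self.pending += 1  # one more backward will need the full param

    def pre_backward(self):
        if not self.param.gathered:
            self.param.gather()

    def post_backward(self):
        self.pending -= 1
        if self.pending == 0:
            self.param.release()


def run_step(policy, param, num_forwards):
    """Run the same layer num_forwards times, then the matching backwards.

    Returns, per backward call, whether the full param was available.
    """
    for _ in range(num_forwards):
        policy.on_forward()
    seen = []
    for _ in range(num_forwards):
        policy.pre_backward()
        seen.append(param.gathered)
        policy.post_backward()
    return seen
```

With three forwards of the same layer, `run_step(FlagPolicy(p), p, 3)` on a fresh `p = SharedParam()` returns `[True, False, False]`: only the first backward sees the gathered param. `run_step(RefCountPolicy(p), p, 3)` returns `[True, True, True]` and leaves the param released at the end.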

Thank you @tjruwase and @samyam for helping to diagnose and fix this problem.

@stas00 stas00 changed the title [WIP] [zero3] fix reference counting in backward over multiple forwards [zero3] fix reference counting in backward over multiple forwards Jul 14, 2021
@tjruwase tjruwase merged commit 3fa2420 into deepspeedai:master Jul 14, 2021
@stas00 stas00 deleted the fix-prefetch-with-repeat-layer branch July 14, 2021 16:13