Bug description
In manual optimization, the user can call `self.backward()` anywhere in `training_step()`. There are no limitations for this in single-device execution, but for distributed strategies there are challenges associated with that.
In DDP, we solve that problem by disabling a backward hook before calling the actual backward:
pytorch-lightning/src/lightning/pytorch/strategies/ddp.py
Lines 316 to 317 in 6cfc590
However, no such mechanism exists for FSDP, and calling backward during "forward" is not supported in sharded models. We should investigate whether it is OK to do this from the root FSDP model, and discuss possible workarounds if there are issues.
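To make the pattern concrete, here is a minimal single-device sketch in plain PyTorch. It is not Lightning's implementation: `ManualStepModule` and `run_single_device_step` are hypothetical names, and the module's `forward()` merely stands in for `training_step()`, on the assumption that a distributed wrapper would invoke the step as its own forward. It only shows that calling backward before "forward" returns works without restrictions on one device; it does not reproduce the FSDP behavior.

```python
import torch
from torch import nn


class ManualStepModule(nn.Module):
    """Hypothetical stand-in for a LightningModule using manual optimization."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def forward(self, batch, optimizer):
        # Plays the role of training_step(): a distributed wrapper would
        # run this whole method as its forward pass (assumption).
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        optimizer.zero_grad()
        loss.backward()  # backward is called before "forward" returns
        optimizer.step()
        return loss.detach()


def run_single_device_step(seed=0):
    """Run one manual step on CPU; return the loss and whether weights changed."""
    torch.manual_seed(seed)
    model = ManualStepModule()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    batch = (torch.randn(8, 4), torch.randn(8, 1))
    before = model.layer.weight.detach().clone()
    loss = model(batch, optimizer)
    changed = not torch.equal(before, model.layer.weight.detach())
    return loss.item(), changed
```

Calling `run_single_device_step()` performs one optimizer update from inside the module's forward. The open question is whether the same call order is safe once the module is sharded.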
What version are you seeing the problem on?
master
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
No response
cc @awaelchli @carmocca