
Address FSDP + manual optimization #19685

Open · awaelchli opened this issue Mar 22, 2024 · 1 comment
Labels: bug (Something isn't working) · strategy: fsdp (Fully Sharded Data Parallel) · ver: 2.2.x

awaelchli (Member) commented Mar 22, 2024

Bug description

In manual optimization, the user can call self.manual_backward() anywhere in training_step(). There are no limitations on this in single-device execution, but it poses challenges for distributed strategies.
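
For illustration (not part of the issue), here is a minimal manual-optimization module; ManualOptModel, the layer sizes, and the optimizer choice are hypothetical:

```python
import torch
import lightning.pytorch as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).sum()
        opt.zero_grad()
        # In manual optimization the user controls where backward happens,
        # so this call can appear anywhere in the step.
        self.manual_backward(loss)
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```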

In DDP, we solve that problem by disabling a backward hook before calling the actual backward:

```python
# In Lightning's DDPStrategy.pre_backward, which runs before the precision
# plugin executes backward:
if not self.lightning_module.automatic_optimization:
    prepare_for_backward(self.model, closure_loss)
```

However, no such mechanism exists for FSDP, and calling backward during "forward" is not supported in sharded models. We should investigate whether it is OK to do this from the root FSDP model or not, and discuss possible workarounds if there are issues.
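
A minimal sketch of the combination in question, reusing the hypothetical ManualOptModel above on a 2-GPU machine (the repro details are assumptions, not from the issue):

```python
import torch
from torch.utils.data import DataLoader
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

# Running the manual-optimization module under FSDP is exactly the case
# discussed above: manual_backward() fires while the sharded model is
# still inside its forward/step logic.
train_loader = DataLoader(torch.randn(64, 32), batch_size=8)
trainer = Trainer(accelerator="gpu", devices=2, strategy=FSDPStrategy(), max_epochs=1)
trainer.fit(ManualOptModel(), train_loader)
```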

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli @carmocca

awaelchli added the bug (Something isn't working) and strategy: fsdp (Fully Sharded Data Parallel) labels, and added and removed the needs triage label, on Mar 22, 2024
carmocca (Contributor) commented Apr 1, 2024

#19626 has users trying this already
