
Address FSDP + manual optimization #19685

Open · awaelchli opened this issue Mar 22, 2024 · 1 comment
Labels: bug (Something isn't working) · strategy: fsdp (Fully Sharded Data Parallel) · ver: 2.2.x

awaelchli (Member) commented Mar 22, 2024

Bug description

In manual optimization, the user can call self.manual_backward() anywhere in training_step(). There are no limitations on this in single-device execution, but it poses challenges for distributed strategies.
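
For illustration (not part of the issue), here is a minimal manual-optimization module; ManualOptModel, the layer sizes, and the optimizer choice are hypothetical:

```python
import torch
import lightning.pytorch as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).sum()
        opt.zero_grad()
        # In manual optimization the user controls where backward happens,
        # so this call can appear anywhere in the step.
        self.manual_backward(loss)
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```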

In DDP, we solve that problem by disabling a backward hook before calling the actual backward:

```python
# In Lightning's DDPStrategy.pre_backward, which runs before the precision
# plugin executes backward:
if not self.lightning_module.automatic_optimization:
    prepare_for_backward(self.model, closure_loss)
```

However, no such mechanism exists for FSDP, and calling backward during "forward" is not supported in sharded models. We should investigate whether it is OK to do this from the root FSDP model or not, and discuss possible workarounds if there are issues.
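
A minimal sketch of the combination in question, reusing the hypothetical ManualOptModel above on a 2-GPU machine (the repro details are assumptions, not from the issue):

```python
import torch
from torch.utils.data import DataLoader
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

# Running the manual-optimization module under FSDP is exactly the case
# discussed above: manual_backward() fires while the sharded model is
# still inside its forward/step logic.
train_loader = DataLoader(torch.randn(64, 32), batch_size=8)
trainer = Trainer(accelerator="gpu", devices=2, strategy=FSDPStrategy(), max_epochs=1)
trainer.fit(ManualOptModel(), train_loader)
```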

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli @carmocca

awaelchli added the bug (Something isn't working) and strategy: fsdp (Fully Sharded Data Parallel) labels, and added and removed the needs triage label, on Mar 22, 2024
carmocca (Contributor) commented Apr 1, 2024

#19626 has users trying this already
