
fsdp refactoring #2177

Merged (8 commits) on Nov 24, 2023

Conversation

@pacman100 (Contributor) commented on Nov 21, 2023

What does this PR do?

FSDP refactoring based on:

  1. Torch 2.1 official docs: https://pytorch.org/docs/stable/fsdp.html
  2. With use_orig_params=True, we no longer need to prepare the model before creating the optimizer object. Earlier, we needed to prepare the model, i.e., wrap it with FSDP, before creating the optimizer because of the following warning from the PyTorch official docs:
The optimizer must be initialized after the module has been wrapped with FSDP since FSDP will shard and transform the module’s parameters in a way that may not preserve the original parameter variables. Thus, the previously initialized optimizer may have stale references to the parameters.

Now, with use_orig_params=True, this is no longer the case. This makes the Accelerate training API consistent: users on single GPU, DDP, FSDP, and DeepSpeed now follow the same logic, shown below:

model, optimizer, lr_scheduler, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, lr_scheduler, train_dataloader, eval_dataloader)

Earlier, for FSDP, the recommended practice was as shown below; otherwise we had to recreate the optimizer after preparing the model, which did not preserve optimizer parameter groups. All of that is now resolved, and optimizer parameter groups are supported.

model = accelerator.prepare(model)

optim = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = ...
optimizer, lr_scheduler, train_dataloader, eval_dataloader = accelerator.prepare(optimizer, lr_scheduler, train_dataloader, eval_dataloader)

As such, use_orig_params=True is now the default.
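
For illustration, here is a minimal sketch of the unified flow with optimizer parameter groups. The toy model, dataloader, and hyperparameters below are made up for this sketch and are not taken from the PR:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP options come from the accelerate config
model = torch.nn.Linear(128, 2)  # toy model standing in for a real network
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,))), batch_size=8
)

# optimizer parameter groups can now be built before the model is wrapped with FSDP
decay = [p for n, p in model.named_parameters() if "bias" not in n]
no_decay = [p for n, p in model.named_parameters() if "bias" in n]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01}, {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-4,
)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# a single prepare call, identical across single GPU, DDP, FSDP and DeepSpeed
model, optimizer, lr_scheduler, train_dataloader = accelerator.prepare(
    model, optimizer, lr_scheduler, train_dataloader
)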

  3. https://github.com/facebookresearch/llama-recipes: using this as a best-practices guide for FSDP, we are in line with it for all the features and usage of the FSDP APIs. For checkpointing, it supports FULL_STATE_DICT and SHARDED_STATE_DICT; we also support both of these and already have tests for them (see the checkpointing sketch after this list). It doesn't show how to save and load with the LOCAL_STATE_DICT state dict type.
  4. Regression: a test for the LOCAL_STATE_DICT checkpointing feature of FSDP is now failing. Couldn't find anything about it in llama-recipes, the FSDP documentation, the torch FSDP codebase https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp, or elsewhere on the internet. Will raise an issue with the PyTorch team about it.
  5. Ran all the slow tests for FSDP and they are all green! Updated the documentation and examples.
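
As a rough illustration of the checkpointing flow mentioned in point 3 above, continuing the sketch shown earlier: the directory name is arbitrary, and the state dict type (FULL_STATE_DICT or SHARDED_STATE_DICT) is assumed to be chosen in the accelerate/FSDP config rather than in code.

# save model, optimizer, scheduler and RNG states in the configured state dict format
accelerator.save_state("checkpoints/step_1000")

# ... later, resume training from that checkpoint
accelerator.load_state("checkpoints/step_1000")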

@HuggingFaceDocBuilderDev commented on Nov 21, 2023

The documentation is not available anymore as the PR was closed or merged.

@pacman100 pacman100 requested review from muellerzr and BenjaminBossan and removed request for muellerzr November 21, 2023 12:20
@pacman100 pacman100 marked this pull request as ready for review November 21, 2023 12:21
@BenjaminBossan (Member) left a comment

This looks like a great change, love to see so many lines deleted.

I don't have experience with FSDP, so a few questions:

  1. Does this still work as expected when using PyTorch < 2.1?
  2. use_orig_params default was changed to True. Is there any disadvantage to that, e.g. more memory usage?

Review threads on src/accelerate/accelerator.py (resolved)
@pacman100 (Contributor, Author) commented on Nov 21, 2023

Hello Benjamin,

Does this still work as expected when using PyTorch < 2.1?
Accelerate will now throw an error when the FSDP integration is used with PyTorch < 2.1 because of the lines below.

if is_torch_version("<", FSDP_PYTORCH_VERSION):
    raise ValueError(f"FSDP requires PyTorch >= {FSDP_PYTORCH_VERSION}")

use_orig_params default was changed to True. Is there any disadvantage to that, e.g. more memory usage?

more memory usage -> No.
It is meant to enable the following (a small sketch follows this list):

  1. Multiple optimizer parameter groups: https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019#tldr-1
  2. Non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. Think PEFT, where the majority of params are frozen and only the adapters are trainable.
  3. Creating the optimizer object before wrapping the model in the FSDP module (this is what simplified a lot of code in this PR).
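
A minimal, PEFT-style sketch of points 2 and 3, assuming a toy two-layer model and an already-initialized process group (e.g. launched via torchrun); this is illustrative only and not code from the PR:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.Linear(128, 2))
for p in model[0].parameters():  # freeze the "backbone"; only the head trains
    p.requires_grad = False

# the optimizer is created before FSDP wrapping, which is valid because
# use_orig_params=True preserves the original parameter variables
optimizer = torch.optim.AdamW(
    [{"params": [p for p in model.parameters() if p.requires_grad], "lr": 1e-4}]
)

# mixed frozen/trainable requires_grad within one FlatParameter is allowed
# with use_orig_params=True
model = FSDP(model, use_orig_params=True)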

It is expected to become the default, as per the above dev blogpost:

These semantics to use the original parameters are available today by passing use_orig_params=True to the FSDP constructor, and they were added exactly by augmenting the existing unshard/reshard logic. In that case, named_parameters() returns the original fully-qualified names (FQNs), not ones like .flat_param . This enables using multiple optimizer parameter groups and/or different requires_grad within one FlatParameter 's original parameters, and this helps hide the FlatParameter abstraction from users. We hope to converge to setting use_orig_params=True by default in the future.
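
As a small follow-up, still using the toy FSDP-wrapped model from the sketch above: with use_orig_params=True, iterating over named_parameters() yields one entry per original parameter rather than a single flattened flat_param entry.

# inspect parameter names after wrapping; each original parameter appears
# individually instead of being hidden behind a flat_param
for name, param in model.named_parameters():
    print(name, param.shape, param.requires_grad)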

@BenjaminBossan (Member) commented

Accelerate will throw an error when FSDP integration is used with PyTorch < 2.1

Ah okay, I missed the version bump, thanks for pointing me to it.

It is expected to become default as per the above dev blogpost

Thanks for providing more context. My question arose because use_orig_params=True seems to be strictly better, so I wondered why it wasn't already the default and whether there is any disadvantage; it seems to be kept at False mainly for backwards compatibility in PyTorch.

@muellerzr (Collaborator) left a comment

Nicely done @pacman100! Excellent refactor and loving that diff. Keeping the API simple all around is a phenomenal win!

@BenjaminBossan (Member) left a comment

Great work, thanks Sourab.
