[Fix] Fix FSDP bug #553
Conversation
I'm leaving a comment because I think FSDP will need extra logic to save the optimizer state and use it to resume training. I'm really sorry that the place where this comment is posted differs from the part where the new proposal is located. It worked fine in the environment we used, but if there are any problems, please leave a new comment. Thank you.
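For illustration, that extra logic might look like the following minimal sketch, assuming PyTorch >= 1.12 and FSDP's dedicated optimizer-state helpers; `model`, `optimizer`, and `rank` are placeholder names, not identifiers from this PR:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# FSDP shards optimizer state across ranks, so a plain `optimizer.state_dict()`
# only captures the local shard. A resumable checkpoint needs the consolidated
# state. `model` is assumed to be the FSDP-wrapped module.

# Saving: gather the full optimizer state (all ranks must participate),
# then write it to disk on rank 0 only.
full_osd = FSDP.full_optim_state_dict(model, optimizer)
if rank == 0:
    torch.save(full_osd, 'optim_state.pth')

# Resuming: load the full state dict and re-shard it for the local rank.
full_osd = torch.load('optim_state.pth')
sharded_osd = FSDP.shard_full_optim_state_dict(full_osd, model)
optimizer.load_state_dict(sharded_osd)
```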
I'm really sorry for leaving a comment that doesn't quite fit this PR, but could I ask you to add a `sharding_strategy` option? (The sharding strategy option seems to have been added in PyTorch 1.12.) If this request is accepted, users could weigh the memory/time tradeoff between strategies. Currently, `shard_grad_op` and `full_shard` are supported. Thank you.
```diff
@@ -16,7 +19,7 @@
 @MODEL_WRAPPERS.register_module()
-class MMFullyShardedDataParallel(FullyShardedDataParallel):
+class MMFullyShardedDataParallel(FSDP):
```
```python
from torch.distributed.fsdp import ShardingStrategy

# Map the user-facing string option onto the ShardingStrategy enum.
if sharding_strategy is not None:
    if isinstance(sharding_strategy, str):
        assert sharding_strategy in ['shard_grad_op', 'full_shard']
        sharding_strategy = (
            ShardingStrategy.SHARD_GRAD_OP
            if sharding_strategy == 'shard_grad_op'
            else ShardingStrategy.FULL_SHARD)
    elif not isinstance(sharding_strategy, ShardingStrategy):
        raise TypeError('`sharding_strategy` should be `None`, `str` '
                        'or `ShardingStrategy`, but has type '
                        f'{type(sharding_strategy)}')
```
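For reference, a hypothetical usage sketch of how the resolved enum would be forwarded to the FSDP constructor, whose `sharding_strategy` parameter is available since PyTorch 1.12 (`module` is a placeholder, not code from this PR):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# ZeRO-2-style sharding: shard gradients and optimizer state, but keep the
# full parameters on each rank during forward/backward.
model = FSDP(module, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```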
@yhna940 Great suggestions! First, I apologize for my late response because I was on vacation ^_^ Your suggestions are correct, and I am glad to see the community's interest in requesting this FSDP feature in MMEngine. However, this PR originally aims at resolving bugs for basic use cases without breaking BC, not at fully supporting FSDP. I have actually planned some new modifications:
These modifications are of relatively low priority because I am focusing on some other things (e.g. better Chinese/English documentation), and they will probably be carried out next month. But if you are interested in either of them, you are welcome to contribute by opening issues/PR(s) ^_^
mmengine/runner/runner.py (Outdated)

```python
# initialize the model weights
self._init_model_weights()
# make sure checkpoint-related hooks are triggered after `before_run`
```
The description "after `before_run`" is not consistent with the code.
Codecov Report

Base: 78.40% // Head: 78.10% // Decreases project coverage by -0.31%.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #553      +/-   ##
==========================================
- Coverage   78.40%   78.10%   -0.31%
==========================================
  Files         127      127
  Lines        9175     9212      +37
  Branches     1826     1838      +12
==========================================
+ Hits         7194     7195       +1
- Misses       1670     1702      +32
- Partials      311      315       +4
```
☔ View full report at Codecov.
Waiting for this feature...
```python
# the former depends on the latter in FSDP
self.optim_wrapper = self.build_optim_wrapper(self.optim_wrapper)
# Automatically scaling lr by linear scaling rule
self.scale_lr(self.optim_wrapper, self.auto_scale_lr)
```
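A side note on why the optimizer must be built after wrapping: FSDP flattens and shards the module's parameters, so an optimizer constructed over the unwrapped parameters would reference tensors that FSDP no longer updates. A minimal illustration (hypothetical, not this PR's code):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Linear(16, 16)
wrapped = FSDP(model)  # wrapping replaces parameters with flat shards
# Correct: the optimizer tracks the (sharded) parameters FSDP actually trains.
optimizer = torch.optim.SGD(wrapped.parameters(), lr=0.01)
```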
Summary of the bugs

Hi @C1rN09. I deeply appreciate your commitment! FSDP is critical for training large models. However, I used this commit to train my own model and found two bugs when resuming from checkpoints. The bugs are caused by changes in the execution order of some functions, and they are fixed in pull request C1rN09#1. Below are the details.

Bug 1: `RuntimeError` in `self.scale_lr()`

Description: Currently, the `scale_lr()` method must be called before building the `ParamScheduler`, or it will raise a `RuntimeError`. However, if there is a checkpoint, the `ParamScheduler` will be built in `self.load_or_resume()`, before we run `self.scale_lr(self.optim_wrapper, self.auto_scale_lr)`.

Fix: I think the `scale_lr()` method should be modified to fix the bug, because the execution order of the other functions cannot be changed. Concretely, since `wrap_model()` must be called after `load_or_resume()` to be compatible with FSDP, the order must be: `load_or_resume()` -> `wrap_model()` -> `build_optim_wrapper()` -> `scale_lr()`.
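In pseudocode, the required order would look roughly like this (a sketch of the proposed fix; method names follow this thread, and exact signatures may differ in MMEngine):

```python
# `scale_lr()` must tolerate ParamSchedulers that were already built during
# `load_or_resume()`, because the FSDP-compatible order is:
self.load_or_resume()                                    # may build ParamScheduler
self.model = self.wrap_model(model_wrapper_cfg, self.model)  # FSDP wrapping
self.optim_wrapper = self.build_optim_wrapper(self.optim_wrapper)
self.scale_lr(self.optim_wrapper, self.auto_scale_lr)
```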
Bug 2: state dict of `optim_wrapper` is loaded onto the CPU instead of CUDA

Description: When we call `self.load_or_resume()`, the state dict of the optim_wrapper is loaded onto the CPU, because `self.model` is still on the CPU at that point. The original code does not have this problem, because `self.load_or_resume()` is called after `self.wrap_model()`, where the model is moved to CUDA.

Fix: There could be many solutions, as long as `model = model.to(get_device())` is inserted before `self.load_or_resume()`. However, I could not find an insertion place that keeps the code as elegant as before, so I inserted it right before `self.load_or_resume()`.
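Concretely, the inserted line could look like the following, using MMEngine's `get_device()` helper (a sketch of the proposed fix, not the exact patch):

```python
from mmengine.device import get_device

# Move the model to the target device (e.g. cuda) *before* resuming, so the
# optimizer state dict from the checkpoint is loaded onto that device rather
# than onto the CPU copy of the model.
self.model = self.model.to(get_device())
self.load_or_resume()
```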
Thanks for your bug report & solutions! Honestly speaking, this PR does not solve all issues in the FSDP integration, especially the issue with checkpoint loading. The PR stalled because we found that FSDP and some other frameworks (DeepSpeed, ColossalAI, etc.) change the execution order of model setup, initialization, and checkpoint saving/loading, which spoils the whole thing; sometimes they may even conflict with each other. Therefore, we are working on a more elegant way to solve this issue by refactoring `Runner`, as discussed in the Discussion topic.
This refactor is a work in progress, and we'll take your bug reports into consideration. Hopefully it will come out soon, and you will be able to use FSDP in MMEngine then!
supported by #1213
Motivation
Fix bugs in MMEngine's FSDP wrapper
Modification
- `Runner`: changes to the execution order (see BC-breaking below)
- Modify `train_step`, `val_step` and `test_step` in `MMFullyShardedDataParallel` to be consistent with `BaseModel`
- Fix an `AssertionError` caused by a PyTorch issue when some stages are frozen.

BC-breaking (Optional)
The execution order of `Runner` has been changed, especially the hook point of `before_run`, as follows. This modification is potentially BC-breaking.
Use cases (Optional)
Checklist