
[Feature] Add FlexibleRunner and Strategies #1183

Merged: 55 commits merged into open-mmlab:flexible-runner on Jun 27, 2023

Conversation


@zhouzaida (Member) commented on Jun 2, 2023

Background

While adding support for FSDP, DeepSpeed, and ColossalAI, we ran into limits in the Runner's extensibility, mainly in the following three areas:

  1. The fixed training procedure of the existing Runner is incompatible with new training methods such as ZeRO.

    MMEngine unifies the single-GPU and DDP (Distributed Data Parallel) training procedures, and this unified, fixed procedure is hard-coded into the Runner. However, the ZeRO family of methods (FSDP, ColossalAI ZeroDDP, DeepSpeed ZeRO) does not fit this fixed procedure and requires the order of steps to be adjusted.

    For example, after model = FSDP(model), the model's parameters and buffers are sharded across GPUs, so no single rank holds the complete tensors. Any operation that directly modifies the model (such as init_weights in MMEngine) will then fail.

    Furthermore, even within the ZeRO family, implementations differ across frameworks, so the order of operations must be adjusted and dispatched per framework. For example, load_checkpoint for FSDP must be called before model = FSDP(model), whereas DeepSpeed and ColossalAI require it to be called after model = initialize(model); see the sketch after this list.

  2. Training components are coupled in other frameworks (DeepSpeed, ColossalAI).

    In MMEngine, the model wrapper and optim wrapper are independent, whereas in DeepSpeed and ColossalAI the model and optimizer are coupled and need to access each other to accomplish certain tasks. This can be seen in the colossalai.initialize and deepspeed.initialize interfaces.

  3. A single, unified save/load checkpoint function cannot satisfy the requirements of different models and frameworks.

    The current save_checkpoint and load_checkpoint are standalone functions in Runner with no knowledge of the model or the training framework, which is counterintuitive. For example, FSDP requires gathering model parameters and optimizer states onto GPU 0 before saving, while ColossalAI and DeepSpeed each have their own involved logic for saving and loading weights.
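
Below is a minimal sketch (plain PyTorch, not MMEngine code) of the ordering constraint described in point 1. The model and checkpoint path are placeholders, and it assumes a process group has already been initialized:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = nn.Linear(16, 16)                                  # placeholder model
state = torch.load('checkpoint.pth', map_location='cpu')   # placeholder path
model.load_state_dict(state)   # must happen while every rank still sees whole tensors
model = FSDP(model)            # from here on, each rank only holds a shard
# Directly modifying parameters now (e.g. an init_weights() pass, or loading a
# full state dict) fails, because no rank owns the complete parameters anymore.
# DeepSpeed and ColossalAI invert the order: their engines are created first and
# expose their own checkpoint loading that understands the sharded format.
```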

Design

To avoid impacting the existing Runner, we re-implement the training entry point as a new FlexibleRunner and introduce a new Strategy abstraction.

The Strategy is primarily responsible for:

  • Constructing and initializing training components such as the model, optimizer, parameter scheduler, etc.
  • Initializing the distributed training environment.
  • Saving and loading the model, optimizer state, etc.
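
To make this division of responsibilities concrete, here is a schematic sketch of what a strategy's surface could look like; the method names and signatures below are illustrative, not the final MMEngine API:

```python
from abc import ABCMeta, abstractmethod


class BaseStrategySketch(metaclass=ABCMeta):
    """Illustrative only: one strategy encapsulates how training components are
    built, wrapped, and checkpointed for a particular parallelism backend."""

    @abstractmethod
    def setup_distributed(self, **kwargs):
        """Initialize the distributed environment (process group, device, seed)."""

    @abstractmethod
    def prepare(self, model, optim_wrapper=None, param_scheduler=None, **kwargs):
        """Build and wrap the model, optimizer and schedulers in the order the
        backend requires (e.g. load weights before or after wrapping)."""

    @abstractmethod
    def save_checkpoint(self, filename, **kwargs):
        """Save model/optimizer state in the format the backend expects."""

    @abstractmethod
    def load_checkpoint(self, filename, **kwargs):
        """Load a checkpoint, respecting the backend's ordering constraints."""
```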

This PR will support three types of strategies:

  • SingleDeviceStrategy
  • DDPStrategy
  • DeepSpeedStrategy

Note: This is an experimental feature, and the interface is subject to change.
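
As a usage sketch, switching between these strategies is intended to be a one-line config change. The toy model, dataloader, and the exact FlexibleRunner import path below are assumptions for illustration, not code from this PR:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from mmengine.model import BaseModel
from mmengine.runner import FlexibleRunner  # assumed export location


class ToyModel(BaseModel):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)

    def forward(self, inputs, labels, mode='loss'):
        outputs = self.linear(inputs)
        if mode == 'loss':
            return {'loss': nn.functional.mse_loss(outputs, labels)}
        return outputs


def collate(batch):
    # Pack samples into the dict that BaseModel.forward expects as kwargs.
    return dict(inputs=torch.stack([x for x, _ in batch]),
                labels=torch.stack([y for _, y in batch]))


dataset = TensorDataset(torch.randn(64, 2), torch.randn(64, 2))
train_dataloader = DataLoader(dataset, batch_size=8, collate_fn=collate)

runner = FlexibleRunner(
    model=ToyModel(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    train_cfg=dict(by_epoch=True, max_epochs=1),
    # Swap in dict(type='DDPStrategy') or dict(type='DeepSpeedStrategy', ...)
    # under a distributed launcher to change the parallelism backend.
    strategy=dict(type='SingleDeviceStrategy'),
)
runner.train()
```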

Environment

  • PyTorch: 2.0.0
  • deepspeed: 0.9.3+d755b9d6
  • CUDA: 11.7
  • GPU: 8 x A100 (80 GB)

Validation

  • resume
  • load_from
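
Both were exercised through the usual runner arguments. A hedged sketch, assuming FlexibleRunner keeps Runner's resume/load_from arguments and reusing the toy setup from the Design section:

```python
# Resume a run: restores model weights plus optimizer/scheduler state and the
# training progress from the latest checkpoint found in work_dir.
runner = FlexibleRunner(
    model=ToyModel(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    train_cfg=dict(by_epoch=True, max_epochs=2),
    strategy=dict(type='SingleDeviceStrategy'),
    resume=True,
    # load_from='path/to/checkpoint.pth',  # alternatively, warm-start from a specific file
)
runner.train()
```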

Experiment

MMPreTrain

  • vit-huge-p14_8xb128-coslr-50e_in1k.py

    DDP: Out of memory
    strategy = dict(
        type='DDPStrategy',
    )
    DDP + fp16: 58G per GPU
    optim_wrapper = dict(
        type='AmpOptimWrapper',
        optimizer=dict(
            type='AdamW',
            lr=0.004,
            weight_decay=0.05,
            eps=1e-08,
            betas=(0.9, 0.999)),
        paramwise_cfg=dict(
            norm_decay_mult=0.0,
            bias_decay_mult=0.0,
            flat_decay_mult=0.0,
            custom_keys=dict({
                '.absolute_pos_embed': dict(decay_mult=0.0),
                '.relative_position_bias_table': dict(decay_mult=0.0),
                '.ln': dict(decay_mult=0.0),
                '.bias': dict(decay_mult=0.0),
                '.cls_token': dict(decay_mult=0.0),
                '.pos_embed': dict(decay_mult=0.0)
            }),
            layer_decay_rate=0.75),
        constructor='LearningRateDecayOptimWrapperConstructor')
    strategy = dict(
        type='DDPStrategy',
    )
    DeepSpeed ZeRO1 + fp16: 44G per GPU
    strategy = dict(
        type='DeepSpeedStrategy',
        fp16=dict(
            enabled=True,
            fp16_master_weights_and_grads=False,
            loss_scale=0,
            loss_scale_window=500,
            hysteresis=2,
            min_loss_scale=1,
            initial_scale_power=15,
        ),
        inputs_to_half=['inputs'],
        zero_optimization=dict(
        stage=1,
            allgather_partitions=True,
            reduce_scatter=True,
            allgather_bucket_size=50000000,
            reduce_bucket_size=50000000,
            overlap_comm=True,
            contiguous_gradients=True,
            cpu_offload=False,
        )
    )
    DeepSpeed ZeRO3 + fp16:
    strategy = dict(
        type='DeepSpeedStrategy',
        fp16=dict(
            enabled=True,
            fp16_master_weights_and_grads=False,
            loss_scale=0,
            loss_scale_window=500,
            hysteresis=2,
            min_loss_scale=1,
            initial_scale_power=15,
        ),
        inputs_to_half=['inputs'],
        zero_optimization=dict(
            stage=3,
            allgather_partitions=True,
            reduce_scatter=True,
            allgather_bucket_size=50000000,
            reduce_bucket_size=50000000,
            overlap_comm=True,
            contiguous_gradients=True,
            cpu_offload=False,
        )
    )
  • vit-large-p16_8xb128-coslr-50e_in1k.py

    DeepSpeed ZeRO1 + fp16: 21G per GPU, accuracy: 85.6040
    strategy = dict(
        type='DeepSpeedStrategy',
        fp16=dict(
            enabled=True,
            fp16_master_weights_and_grads=False,
            loss_scale=0,
            loss_scale_window=500,
            hysteresis=2,
            min_loss_scale=1,
            initial_scale_power=15,
        ),
        inputs_to_half=['inputs'],
        zero_optimization=dict(
        stage=1,
            allgather_partitions=True,
            reduce_scatter=True,
            allgather_bucket_size=50000000,
            reduce_bucket_size=50000000,
            overlap_comm=True,
            contiguous_gradients=True,
            cpu_offload=False,
        )
    )
    DeepSpeed ZeRO3 + fp16
    strategy = dict(
        type='DeepSpeedStrategy',
        fp16=dict(
            enabled=True,
            fp16_master_weights_and_grads=False,
            loss_scale=0,
            loss_scale_window=500,
            hysteresis=2,
            min_loss_scale=1,
            initial_scale_power=15,
        ),
        inputs_to_half=['inputs'],
        zero_optimization=dict(
            stage=3,
            allgather_partitions=True,
            reduce_scatter=True,
            allgather_bucket_size=50000000,
            reduce_bucket_size=50000000,
            overlap_comm=True,
            contiguous_gradients=True,
            cpu_offload=False,
        )
    )

MMDet

| config | mAP (FlexibleRunner + DDPStrategy) | mAP (Runner + DDP) | time (FlexibleRunner + DDPStrategy) | time (Runner + DDP) |
| --- | --- | --- | --- | --- |
| atss_r50_fpn_1x_coco.py | 39.3 | 39.4 | | |
| faster-rcnn_r50_fpn_1x_coco.py | 37.4 | 37.4 | | |

A collaborator commented on the following checkpoint-loading parameters:

    map_location: Union[str, Callable] = 'cpu',
    strict: bool = False,
    revise_keys: list = [(r'^module.', '')],
    callback: Optional[Callable] = None,
callback is not used in DeepSpeedStrategy, is that expected?

@zhouzaida merged commit 1c3f9f7 into open-mmlab:flexible-runner on Jun 27, 2023.
16 of 19 checks passed.
@zhouzaida deleted the flexible-runner branch on July 3, 2023.