[feature] A new ZeRO implementation #1644

1SAA · 2022-09-26T02:53:14Z

A New ZeRO Implementation

This PR introduces a new implementation of ZeRO.

Background

The current ZeRO implementation has a performance issue. Chunks are distributed asymmetrically across processes, so one process can block the others while it reads a chunk located in CPU memory. This prolongs data transmission and undermines the efficiency of ZeRO.

Implementation

To solve this problem, I refactored the Chunk class. Each new chunk is distributed evenly across all processes, so all processes can move their data from CPU memory to CUDA memory at the same time. I also added an option to enable pinned memory for chunks, so every chunk can keep a copy in pinned CPU memory. Together, these optimizations significantly improve the efficiency of data movement between CPU and CUDA.
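As a rough illustration of the even distribution described above, the sketch below computes which slice of a flat chunk each process would own. It is not the code in colossalai/gemini/chunk/chunk.py: the function names `padded_chunk_size` and `shard_bounds` are hypothetical, and padding the chunk to a multiple of the world size is an assumption made for the example.

```python
from typing import Tuple


def padded_chunk_size(numel: int, world_size: int) -> int:
    """Round `numel` up to the nearest multiple of `world_size`.

    Assumption for this sketch: a chunk is padded so that it can be
    split into equally sized shards, one per process.
    """
    if numel < 0 or world_size <= 0:
        raise ValueError("numel must be >= 0 and world_size must be > 0")
    shards = -(-numel // world_size)  # ceiling division
    return shards * world_size


def shard_bounds(chunk_size: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) range of a flat chunk owned by `rank`."""
    if chunk_size % world_size != 0:
        raise ValueError("chunk_size must be divisible by world_size")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    shard_size = chunk_size // world_size
    start = rank * shard_size
    return start, start + shard_size


if __name__ == "__main__":
    size = padded_chunk_size(1000, 4)
    for rank in range(4):
        print(rank, shard_bounds(size, 4, rank))
```

Because every rank owns a shard of the same size, each process has the same amount of data to copy from CPU to CUDA memory, and no single process holds a whole chunk that the others must wait for.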

Another Advantage

The new ZeRO supports true hybrid parallelism. It creates separate chunk groups for parameters that have different DP communication groups. This brings great flexibility to our upcoming automatic configuration of parallelism.
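The grouping idea can be sketched as follows, assuming each parameter is tagged with the ranks of its DP communication group. The function `group_params_by_dp_group` and the parameter names are hypothetical and only illustrate the bookkeeping; the PR's actual chunk-group logic may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_params_by_dp_group(
    params: Sequence[Tuple[str, Sequence[int]]],
) -> Dict[Tuple[int, ...], List[str]]:
    """Bucket parameter names by the ranks of their DP communication group.

    Each bucket stands in for one chunk group: parameters in the same
    bucket share a DP group, so their chunks can be sharded and gathered
    over the same set of processes.
    """
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, dp_ranks in params:
        groups[tuple(dp_ranks)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical model: two parameters replicated over 4 ranks,
    # one parameter whose DP group contains only ranks 0 and 2.
    example = [
        ("embed.weight", (0, 1, 2, 3)),
        ("mlp.weight", (0, 2)),
        ("ln.weight", (0, 1, 2, 3)),
    ]
    for ranks, names in group_params_by_dp_group(example).items():
        print(ranks, names)
```

Keeping one chunk group per DP group means parameters with different DP degrees are never packed into the same chunk.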

colossalai/gemini/chunk/chunk.py

colossalai/zero/zero_optimizer.py

tests/test_gemini/update/test_zeroddp_state_dict.py

tests/test_gemini/update/test_zerooptim_state_dict.py

tests/test_tensor/test_zero_optim.py

1SAA added 13 commits September 21, 2022 18:18

[zero] use ChunkV2 for zero training

41121cb

[zero] fix cuda, cpu memory leak

d5f83d7

[zero] add zero optimizer

04082a4

[zero] add unit test for ZeROOptimV2

7f31381

[zero] make fp32_param in pin_memory

c01c62e

[zero] add state_dict, load_state_dict, unit test for ZeRODDP

9f5195b

[hotfix] fix the init device for each chunk

5feb621

[hotfix] fix error for gathered chunks

0f60d90

[zero] add support for different parameter groups

81f33bb

[zero] add state_dict for zero optimizer

b7fd676

[polish] polish zero code and add docstrings

d36ff1c

[refactor] refactor zero and its test files

662e38d

[polish] fix all unit tests with zero

dac2388

1SAA requested review from ver217 and feifeibear September 26, 2022 02:53