[feature] new zero implementation #1623
Conversation
self.fp16_param_to_fp32_param[p] = fp32_p
chunk_16 = self.chunk_manager.get_chunk(p)
chunk_32 = self.chunk_manager.get_chunk(fp32_p)
chunk_32.init_pair(chunk_16)
Why not init pair in ZeroDDP?
It's done this way for a legacy reason. I'll move it to ZeroDDP.
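Moving the pairing into a ZeroDDP-style setup step could look roughly like the sketch below. The classes here are simplified stand-ins for the real Chunk and ChunkManager (init_pair just records a back-reference between the fp16 chunk and its fp32 copy), not Colossal-AI's actual API:

```python
class Chunk:
    """Minimal stand-in for the real Chunk class (illustration only)."""
    def __init__(self, dtype):
        self.dtype = dtype
        self.paired_chunk = None

    def init_pair(self, friend_chunk):
        # Link this fp32 chunk with the fp16 chunk it mirrors, and vice versa.
        self.paired_chunk = friend_chunk
        friend_chunk.paired_chunk = self


class ChunkManager:
    """Maps each parameter to the chunk that stores it (illustration only)."""
    def __init__(self):
        self._param_to_chunk = {}

    def register(self, param, chunk):
        self._param_to_chunk[param] = chunk

    def get_chunk(self, param):
        return self._param_to_chunk[param]


def pair_fp16_fp32(chunk_manager, fp16_param_to_fp32_param):
    # ZeroDDP-style pairing: for each fp16 param, link the chunk holding it
    # with the chunk holding its fp32 master copy.
    for p, fp32_p in fp16_param_to_fp32_param.items():
        chunk_16 = chunk_manager.get_chunk(p)
        chunk_32 = chunk_manager.get_chunk(fp32_p)
        if chunk_32.paired_chunk is None:  # pair each chunk only once
            chunk_32.init_pair(chunk_16)
```

Doing the pairing once in the DDP wrapper keeps the optimizer free of chunk bookkeeping.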
else:
    assert self.cuda_shard is not None  # only check in CUDA
    valid_tensor = self.cuda_shard[:self.valid_end]
Where do you copy fp32 param to fp16 param when both of them are on CPU?
See the shard_move function in the Chunk class.
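As a rough illustration of the CPU-to-CPU case (simplified stand-ins, not the real shard_move signature): each rank owns one shard of the chunk, and when both the fp32 and fp16 shards sit in CPU memory, the updated fp32 values are cast and written into the fp16 shard with no device transfer. Python's struct 'e' (half-precision) format stands in for the fp32 → fp16 cast here:

```python
import struct


class Shard:
    """One rank's slice of a chunk (illustration only)."""
    def __init__(self, data, device="cpu"):
        self.data = list(data)
        self.device = device


def to_fp16(value):
    # Round-trip through IEEE half precision using struct's 'e' format,
    # mimicking the precision loss of an fp32 -> fp16 cast.
    return struct.unpack("e", struct.pack("e", value))[0]


def copy_fp32_shard_to_fp16(fp32_shard, fp16_shard):
    # Both shards are resident in CPU memory, so the copy is a plain
    # cast-and-write with no CPU<->CUDA traffic.
    assert fp32_shard.device == "cpu" and fp16_shard.device == "cpu"
    fp16_shard.data = [to_fp16(v) for v in fp32_shard.data]
```

Each rank only ever touches its own shard, which is what lets every process do this copy independently.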
This reverts commit 5be118f.
A New ZeRO Implementation
Background
In the current version, our ZeRO has a performance issue: chunks are distributed asymmetrically, so one process can stall the others when it reads the content of a chunk located in CPU memory. This prolongs data transmission and undermines the efficiency of ZeRO.
Implementation
To solve this problem, I refactored the Chunk class. The new chunk is distributed evenly across all processes, so all processes can move their data from CPU memory to CUDA memory at the same time. Furthermore, I provide an option to enable pinned memory for chunks: every chunk can keep a copy in pinned CPU memory. These optimizations prominently improve the efficiency of data movement between CPU and CUDA.
Another Advantage
The new ZeRO supports true hybrid parallelism. It creates separate chunk groups for parameters that belong to different DP communication groups. This brings great flexibility to our upcoming automatic configuration of parallelism.
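The even distribution and the per-group chunks could be sketched as below. The names (shard_bounds, group_chunks_by_dp_group) and the padding assumption are illustrative, not the actual ChunkManager API:

```python
def shard_bounds(chunk_size, world_size, rank):
    """Return the [start, end) slice of the chunk owned by `rank`,
    assuming the chunk is padded to a multiple of world_size so that
    every rank holds a shard of identical size and can move it
    CPU <-> CUDA concurrently with the other ranks."""
    shard_size = -(-chunk_size // world_size)  # ceil division
    start = rank * shard_size
    return start, start + shard_size


def group_chunks_by_dp_group(params, dp_group_of):
    """Bucket parameters by their DP communication group, so each
    group gets its own set of chunks (the 'chunk groups' above)."""
    groups = {}
    for p in params:
        groups.setdefault(dp_group_of(p), []).append(p)
    return groups
```

With identical shard sizes, every rank issues the same-shaped transfer at the same time, which is what removes the asymmetry described in the Background section.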