
[tensor] ColoTensor supports ZeRo #1015

Merged: 27 commits merged into main on May 31, 2022

Conversation

@ver217 (Member) commented on May 24, 2022

Usage:

chunk_size = 38 * 1024**2 if use_chunk else None
chunk_manager = ChunkManager(chunk_size, enable_distributed_storage=use_zero)
model = ColoDDPV2(model, chunk_manager)

chunk_size=None means chunking is not used.
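
As a rough end-to-end sketch of this usage (not code from this PR): the import paths are inferred from the files touched here (colossalai/tensor/chunk.py and colossalai/nn/parallel.py), and the forward/backward calls at the end are assumptions about ColoDDPV2's interface.

import torch
from colossalai.tensor.chunk import ChunkManager   # path inferred from colossalai/tensor/chunk.py
from colossalai.nn.parallel import ColoDDPV2       # path inferred from colossalai/nn/parallel.py

use_chunk = True    # pack parameters into fixed-size chunks
use_zero = True     # shard chunk storage across data-parallel ranks

chunk_size = 38 * 1024**2 if use_chunk else None    # None disables chunking
chunk_manager = ChunkManager(chunk_size, enable_distributed_storage=use_zero)

model = torch.nn.Linear(1024, 1024).cuda()          # any nn.Module
model = ColoDDPV2(model, chunk_manager)

# Hypothetical training step; the exact ColoDDPV2 interface is not shown in
# this PR, so treat these two lines as a sketch.
out = model(torch.randn(8, 1024, device='cuda'))
model.backward(out.sum())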


def __init__(self, chunk_manager: ChunkManager) -> None:
    super().__init__()
    self.chunk_manager = chunk_manager
Contributor:
self._chunk_manager
Use _xxx for internal (private) variables of a class.

@ver217 marked this pull request as ready for review on May 27, 2022, 09:38
    self._update_tensors_state(TensorState.HOLD)

def tensor_trans_state(self, tensor: torch.Tensor, tensor_state: TensorState) -> None:
    assert tensor != TensorState.FREE, 'Can only set a chunk of tesors to FREE'
Contributor:

Typo: 'tesors' should be 'tensors'.

Member Author:

fixed

@feifeibear (Contributor) commented:

The main concern is about suspended parameters. For example:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(...)
        self.suspend_param = nn.Parameter(...)

self.suspend_param will appear in module.parameters(), so your DDPV2 will append it to the chunk manager. Won't managing the state of such a param be a disaster for your design?

@ver217 (Member Author) commented on May 30, 2022:

> The main concern is about suspended parameters. For example:
>
> class Net(nn.Module):
>     def __init__(self):
>         super().__init__()
>         self.fc1 = nn.Linear(...)
>         self.suspend_param = nn.Parameter(...)
>
> self.suspend_param will appear in module.parameters(), so your DDPV2 will append it to the chunk manager. Won't managing the state of such a param be a disaster for your design?

Eventually, we will use the computation graph to derive the computation order of params and filter out the unused ones. The chunk manager will only manage used params.
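
As an illustration of that plan (not the PR's code), a traced computation graph can be used to find which parameters actually participate in the forward pass, so that only those are handed to the chunk manager. The helper below uses torch.fx and is only a sketch; the function name used_parameter_names is hypothetical.

import torch
import torch.nn as nn
from torch.fx import symbolic_trace

def used_parameter_names(module: nn.Module):
    # Trace the forward pass and collect the qualified names of parameters
    # that actually appear in the graph. get_attr nodes may also name
    # buffers, which is acceptable for this sketch.
    gm = symbolic_trace(module)
    used = set()
    for node in gm.graph.nodes:
        if node.op == 'call_module':
            sub = gm.get_submodule(node.target)
            used.update(f'{node.target}.{name}' for name, _ in sub.named_parameters())
        elif node.op == 'get_attr':
            used.add(node.target)
    return used

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 4)
        self.suspend_param = nn.Parameter(torch.zeros(4))  # never used in forward

    def forward(self, x):
        return self.fc1(x)

net = Net()
used = used_parameter_names(net)
print(sorted(used))             # ['fc1.bias', 'fc1.weight']
print('suspend_param' in used)  # False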

        return
    self.tensors_info[tensor].state = tensor_state

def update_tensor(self, tensor: torch.Tensor, data_slice: torch.Tensor) -> None:
Contributor:

The name update is not informative enough.
I think you mean
def copy_tensor_to_chunk_slice()

Member Author:

done
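
For reference, a minimal standalone sketch of what a copy_tensor_to_chunk_slice method does. The real implementation lives in colossalai/tensor/chunk.py; the TinyChunk class below is a made-up stand-in whose layout (flat buffer plus per-tensor offsets) is an assumption for illustration only.

import torch

class TinyChunk:
    # A deliberately simplified stand-in for the PR's Chunk: a flat buffer
    # plus per-tensor offsets, just enough to illustrate the renamed method.

    def __init__(self, capacity: int, dtype=torch.float32, device='cpu'):
        self.data = torch.zeros(capacity, dtype=dtype, device=device)
        self.offsets = {}   # tensor -> (begin, numel)
        self.used = 0

    def append(self, tensor: torch.Tensor) -> None:
        numel = tensor.numel()
        assert self.used + numel <= self.data.numel(), 'chunk is full'
        self.offsets[tensor] = (self.used, numel)
        self.data[self.used:self.used + numel].copy_(tensor.detach().flatten())
        self.used += numel

    def copy_tensor_to_chunk_slice(self, tensor: torch.Tensor,
                                   data_slice: torch.Tensor) -> None:
        # Copy new data into the region of the chunk owned by `tensor`.
        begin, numel = self.offsets[tensor]
        self.data[begin:begin + numel].copy_(data_slice.detach().flatten())

chunk = TinyChunk(capacity=8)
p = torch.ones(4)
chunk.append(p)
chunk.copy_tensor_to_chunk_slice(p, torch.full((4,), 2.0))
print(chunk.data)   # first four elements are now 2.0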

if not self.enable_distributed_storage:
    return
chunk = self.tensor_chunk_map[tensor]
if chunk not in self.accessed_chunks:
Contributor:

Is accessed_chunks necessary?
It is only used on this line. Can't you tell whether a chunk has been accessed from its tensor states?

Member Author:

If a rank stores a chunk, the chunk's initial state on that rank is HOLD, but the rank still has to broadcast it.

Contributor:

I cannot understand what you mean. You can explain in Chinese here.

Member Author:

For example, with dp size = 2, rank0 stores chunk0. The initial state of chunk0 on rank0 is HOLD, while on rank1 it is FREE. When we want to access chunk0, rank0 still has to call broadcast() even though it already holds chunk0. We can determine whether a rank holds a chunk from its state, but we cannot tell from the state whether the broadcast() has been done.
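
A small sketch of that bookkeeping (not the PR's code): the chunk states describe which rank stores the data, while a separate accessed_chunks set remembers whether the collective has already run. The ChunkAccessTracker class and src_rank argument are hypothetical.

import torch
import torch.distributed as dist

class ChunkAccessTracker:
    # Sketch only: in the PR the chunks live in ChunkManager, and the source
    # rank would be derived from which rank actually stores the chunk.

    def __init__(self):
        self.accessed_chunks = set()

    def access_chunk(self, chunk_data: torch.Tensor, src_rank: int) -> None:
        if id(chunk_data) in self.accessed_chunks:
            return  # broadcast already performed for this chunk
        # Every rank must join the collective, including the rank whose copy
        # is already in state HOLD, otherwise the FREE ranks would hang.
        dist.broadcast(chunk_data, src=src_rank)
        self.accessed_chunks.add(id(chunk_data))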

colossalai/tensor/chunk.py: 2 resolved review threads
if self.chunk_manager.is_chunk_free(p) or not p.requires_grad:
    p.grad = None
else:
    p.grad = p.data
Contributor:

What does this mean?
Reusing the fp16 grad storage with the fp16 param?
Should it be p.data = p.grad?

Member Author:

This sets p.grad to the correct pointer. p.data holds the grad here, due to the reuse.

Contributor:

If a chunk is moved from GPU to CPU later, this line leaves p.grad pointing to an old memory space.

Member Author:

No, moving the chunk to another device updates p.data at the same time.

Contributor:

This line makes p.grad point to p.data (addr1).
Afterwards, you move the chunk, so p.data now points to addr2.
However, p.grad still points to addr1.

Member Author:

If so, we can just set p.grad again, or move this code snippet into optimizer.step() after the device move is done.

Member Author:

To check grads in the unit test, I just set p.grad here.

Contributor:

Is it necessary to build a dict {param: chunk slice} to index a grad and its true memory space (which may be reused with param.data)?

Member Author:

Not necessary for now, I think. p.grad should always point to the chunk slice memory of p, since we always reuse it now. Without reuse, it would be necessary.

Contributor:

OK, add it if necessary later.
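
A compact sketch of the reuse pattern discussed in this thread (not the PR's code): the fp16 grad shares storage with the fp16 param, so p.grad is pointed at p.data and must be re-pointed if the chunk's storage moves.

import torch

p = torch.nn.Parameter(torch.randn(4, dtype=torch.half))

# Pretend backward wrote the gradient into the reused storage behind p.data.
p.data.copy_(torch.full_like(p.data, 0.5))

p.grad = p.data   # the line under review: expose the grad through p.grad

# If the chunk manager later moves the chunk, p.data gets re-assigned to new
# storage (a clone stands in for the CPU<->GPU move here), so p.grad has to
# be set again afterwards.
p.data = p.data.clone()
p.grad = p.data
assert p.grad.data_ptr() == p.data.data_ptr()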

colossalai/nn/parallel.py: 2 resolved review threads (one outdated)
@ver217 merged commit 9492a56 into main on May 31, 2022.
@ver217 deleted the feature/colo-tensor-zero branch on May 31, 2022, 07:37.