[zero] add zero optimizer for ColoTensor #1046
Conversation
UNSCALED = 1

class ZeroOptimizer(ColossalaiOptimizer):
Should this be put in the zero module?
OK
tests/test_tensor/test_zero_optim.py
Outdated
torch_model.train()
set_seed(gpc.get_local_rank(ParallelMode.DATA))
for i, (input_ids, attn_mask) in enumerate(train_dataloader):
    if i > 1:
Why only one pass? Will it fail if i > 5?
This test takes too long if i > 5.
OK, but you should have a more robust test, e.g. i > 3.
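As an aside on capping test iterations: instead of an in-loop counter check like `if i > 1: break`, the test could bound the number of batches up front with `itertools.islice`. A minimal sketch with a stand-in dataloader (the real test iterates a GPT dataloader; the loader and step below are placeholders):

```python
from itertools import islice

def run_few_steps(dataloader, num_steps=3):
    """Run only the first `num_steps` batches to keep the test fast."""
    steps_done = 0
    for input_ids, attn_mask in islice(dataloader, num_steps):
        # Placeholder for the real forward/backward/step logic.
        steps_done += 1
    return steps_done

# Stand-in dataloader: 10 dummy (input_ids, attn_mask) batches.
fake_loader = (([i], [1]) for i in range(10))
steps = run_few_steps(fake_loader, num_steps=3)  # → 3
```

This keeps the iteration budget visible in one place, which makes tuning it (1 vs. 3 vs. 5 passes) a one-line change.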
colossalai/zero/zero_optimizer.py
Outdated
max_scale=max_scale)
self._found_overflow: torch.Tensor = torch.zeros(1, dtype=torch.int64, device=torch.cuda.current_device())
self.dp_process_group = gpc.get_group(ParallelMode.DATA)
self.mp_process_group = gpc.get_group(ParallelMode.MODEL)
Why don't we have a ParallelMode.Global?
@FrankLeeeee shall we add this mode?
I think implementing it is not difficult. You can consider it as a patch in the future.
We have this mode actually...
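On the overflow check behind `_found_overflow`: because the DP and MP process groups jointly cover every rank, a MAX-reduce of the overflow flag over the DP group followed by one over the MP group agrees with a single MAX-reduce over a global group. A pure-Python sketch of that equivalence (the 2x2 rank layout and the `allreduce_max` helper are illustrative stand-ins, not ColossalAI's actual topology or API):

```python
# World of 4 ranks arranged as 2 DP groups x 2 MP groups (illustrative layout).
# overflow[rank] is 1 if that rank saw inf/nan in its local gradients.
overflow = {0: 0, 1: 1, 2: 0, 3: 0}

dp_groups = [[0, 2], [1, 3]]   # data-parallel groups
mp_groups = [[0, 1], [2, 3]]   # model-parallel groups

def allreduce_max(flags, group):
    """Simulate an all_reduce with op=MAX within one process group."""
    m = max(flags[r] for r in group)
    for r in group:
        flags[r] = m

# Two-stage reduce: first across the DP group, then across the MP group.
staged = dict(overflow)
for g in dp_groups:
    allreduce_max(staged, g)
for g in mp_groups:
    allreduce_max(staged, g)

# One reduce over a hypothetical global group.
global_flags = dict(overflow)
allreduce_max(global_flags, list(global_flags))

# Both strategies leave every rank with the same flag.
```

So a `ParallelMode` covering all ranks would only replace the two reduces with one call; the result is identical either way.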
for p in group['params']:
    if not self.module.chunk_manager.is_chunk_free(p):
        fp32_p = self.fp16_param_to_fp32_param[p]
        self.module.chunk_manager.copy_tensor_to_chunk_slice(p, fp32_p)
Copying tensors one by one like this is inefficient.
You should add a TODO and improve this line later.
OK
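To illustrate the inefficiency the review flags: if the fp16 parameters live back to back in a flat chunk, the per-parameter copies can in principle be fused into one contiguous write per chunk. A hedged pure-Python sketch of the idea (the flat-list chunk, slice table, and both helpers are made up for illustration; the real ChunkManager API differs):

```python
# Hypothetical flat chunk holding fp16 params back to back.
chunk = [0.0] * 8
# (offset, length) of each param's slice inside the chunk -- illustrative.
param_slices = {"w": (0, 4), "b": (4, 2)}
fp32_master = {"w": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 0.25]}

def copy_per_param(chunk, slices, master):
    """Naive: one copy call per parameter (what the review flags as slow)."""
    for name, (off, n) in slices.items():
        chunk[off:off + n] = master[name][:n]

def copy_batched(chunk, slices, master):
    """Batched: build one contiguous buffer, then write it in a single
    slice assignment. Assumes the slices are contiguous in the chunk."""
    ordered = sorted(slices.items(), key=lambda kv: kv[1][0])
    buf = [v for name, (off, n) in ordered for v in master[name][:n]]
    start = ordered[0][1][0]
    chunk[start:start + len(buf)] = buf

copy_batched(chunk, param_slices, fp32_master)
# chunk[:6] now holds [1.0, 2.0, 3.0, 4.0, 0.5, 0.25]
```

On GPU the analogous win is replacing many small kernel launches with one bulk copy, which is what a TODO here would target.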
tests/test_tensor/test_gpt.py
Outdated
@@ -68,6 +68,7 @@ def run_gpt(init_spec_func, use_ddp):
    for i, (input_ids, attn_mask) in enumerate(train_dataloader):
        logits = model(input_ids, attn_mask)
        torch_logits = torch_model(input_ids, attn_mask)
+       print(torch_logits, logits)
remove this.
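Rather than printing the logits, the test could assert their closeness directly. A minimal sketch with a hand-rolled elementwise check (in the real test, `torch.allclose` with a tolerance suited to fp16 would be the natural choice; the sample values below are invented):

```python
def allclose(a, b, rtol=1e-3, atol=1e-5):
    """Elementwise closeness check, mirroring torch.allclose semantics:
    |x - y| <= atol + rtol * |y| for every pair."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Invented sample logits standing in for model / torch_model outputs.
logits = [0.1001, -2.3002, 5.4999]
torch_logits = [0.1000, -2.3000, 5.5000]
assert allclose(logits, torch_logits), "model and torch_model diverged"
```

An assertion fails loudly in CI, whereas a stray `print` only clutters the log.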