[checkpoint] save sharded optimizer states #1237

feifeibear · 2022-07-08T07:09:46Z

No description provided.


feifeibear · 2022-07-08T07:11:19Z

colossalai/utils/checkpoint/module_checkpoint.py

+    # only rank 0 saves the REPLICATE tensors.
+    optim_state = {
+        'epoch': epoch,
+        'optimizer': colo_state_dict(optimizer, state_dict_func=torch.optim.Optimizer.state_dict),


@FrankLeeeee

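The tensors in the optimizer state_dict may now be ColoTensors that have already been sharded.
Serializing them directly fails because the process_group cannot be serialized.
I would like to follow the way you wrote colo_state_dict to solve this.

It may be enough to replace the line above with access to the optimizer's OS (optimizer states).

OK, that sounds fine to me. However, someone has since updated that function (I don't know who), and it now has some unclear parameter names, e.g. mapping1, mapping2. You could fix those while you are at it.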
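To illustrate the rule stated in the diff comment ("only rank 0 saves the REPLICATE tensors"), here is a minimal sketch in plain Python. It is not the code from this PR: `select_entries_to_save` and the `is_sharded` map are hypothetical names, and the assumption that every rank writes its own sharded entries is mine, not something the diff confirms.

```python
from typing import Any, Dict


def select_entries_to_save(state: Dict[str, Any],
                           is_sharded: Dict[str, bool],
                           rank: int) -> Dict[str, Any]:
    """Return the entries that the given rank should write to its checkpoint.

    Hypothetical helper: replicated entries are identical on every rank, so
    only rank 0 keeps them; sharded entries differ per rank, so (by
    assumption) every rank keeps its own.
    """
    selected = {}
    for name, value in state.items():
        if is_sharded[name]:
            selected[name] = value  # each rank holds a different shard
        elif rank == 0:
            selected[name] = value  # one copy of a replicated entry suffices
    return selected


# Usage: rank 0 keeps everything, other ranks keep only their shards.
state = {"step": 10, "exp_avg": [0.1, 0.2]}
is_sharded = {"step": False, "exp_avg": True}
rank0_entries = select_entries_to_save(state, is_sharded, rank=0)
rank1_entries = select_entries_to_save(state, is_sharded, rank=1)
```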

feifeibear · 2022-07-08T08:27:13Z

Previously, ColoTensor had attributes of type dist.ProcessGroup. Therefore, it was impossible to save a ColoTensor when calling state_dict().
In this PR, I removed the dist.ProcessGroup type attributes from ColoTensor.
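The failure mode described here can be reproduced with the standard library alone. The sketch below is an illustration, not ColoTensor code: `ShardedTensorStub` and `PicklableStub` are hypothetical classes, and a `threading.Lock` stands in for a process group handle because it is likewise not picklable. Note that the sketch drops the handle at pickling time via `__getstate__`, which is a related but different technique from this PR, which removes the attribute from the class altogether.

```python
import pickle
import threading


class ShardedTensorStub:
    """Hypothetical stand-in for a tensor wrapper; not ColoTensor itself."""

    def __init__(self, data, group_handle):
        self.data = data
        # Stands in for a process group handle, which cannot be pickled.
        self.group_handle = group_handle


class PicklableStub(ShardedTensorStub):
    """Same object, but the unpicklable handle is left out of the pickle."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("group_handle", None)  # drop the non-serializable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.group_handle = None  # must be re-attached after loading


def can_pickle(obj) -> bool:
    """Report whether pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


plain = ShardedTensorStub([1, 2], threading.Lock())
fixed = PicklableStub([1, 2], threading.Lock())
restored = pickle.loads(pickle.dumps(fixed))
```

After loading, `restored.data` holds the original payload and `restored.group_handle` is `None`, so the caller has to attach a fresh handle before using the object in a distributed setting.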

feifeibear and others added 30 commits July 4, 2022 14:48

init a checkpoint dir

543dbb2

[checkpoint]support resume for cosinewarmuplr

5837f30

[checkpoint]add unit test

8b0ce12

Merge branch 'main' into ckp

f9bce8f

fix some bugs but still not OK

f67aa3a

fix bugs

66a4d81

make it faster

86be744

Merge branch 'main' of github.com:hpcaitech/ColossalAI into main

4b2333e

[checkpoint]support generalized scheduler

76abb58

Merge branch 'ckp' of github.com:hpcaitech/ColossalAI into ckp

3abafb8

polish

47d7cf0

Merge branch 'main' of github.com:hpcaitech/ColossalAI into main

5078a72

[tensor] torch function return colotensor

8288746

polish

4f7e146

fix bugs

3e17cba

remove debug info

24a9c86

polish

5e780b7

polish

73a13e5

Merge branch 'hotfix/torch_func' into ckp

9f20458

Merge branch 'main' of github.com:hpcaitech/ColossalAI into ckp

339e8e4

[tensor] test_model pass unittests

490bf07

polish

3566167

[hotfix] fx get comm size bug

87da29c

polish

23d9aad

[tensor] fix some unittests

369abef

Merge branch 'main' into tensor/unittests

4bf067f

polish

174d284

Merge branch 'tensor/unittests' of github.com:feifeibear/ColossalAI into tensor/test_model

6027f48

Merge branch 'main' of github.com:hpcaitech/ColossalAI into tensor/test_model

b4f4297

fix unitest bugs in test_model

563b967

feifeibear added 4 commits July 8, 2022 13:25

polish code

2a66c13

Merge branch 'main' of github.com:hpcaitech/ColossalAI into tensor/test_model

d121a5b

polish code

5ce8263

try to debug save ckp for optimizer

3de2fa9

feifeibear requested a review from ver217 July 8, 2022 07:09

Merge branch 'main' of github.com:hpcaitech/ColossalAI into tensor/debug_ckp

601c915


correctly serialized colo tensor

8efe509

polish

c255f67

feifeibear added the Run Build and Test label Jul 8, 2022

ZhaoYi1222 approved these changes Jul 8, 2022


feifeibear merged commit 20da6e4 into hpcaitech:main Jul 8, 2022

feifeibear deleted the tensor/debug_ckp branch July 8, 2022 08:33
