[checkpoint] save sharded optimizer states #1237

Merged
merged 37 commits into from
Jul 8, 2022

Conversation

feifeibear
Contributor

No description provided.

@feifeibear feifeibear requested a review from ver217 July 8, 2022 07:09
# only rank 0 saves the REPLICATE tensors.
optim_state = {
'epoch': epoch,
'optimizer': colo_state_dict(optimizer, state_dict_func=torch.optim.Optimizer.state_dict),
Contributor Author

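The Tensors in the Optimizer state_dict may now be ColoTensors that have already been sharded.
Serializing them directly fails because the process_group cannot be serialized.
I would like to follow the colo_state_dict you wrote to solve this.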
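
A minimal sketch of the failure described above, using only the standard library. `FakeProcessGroup` and `ShardedTensor` are hypothetical stand-ins, not ColossalAI classes; a `threading.Lock` plays the role of the unserializable communicator handle. `to_plain_state` is an assumed helper written in the spirit of `colo_state_dict`, not its actual implementation.

```python
import pickle
import threading


class FakeProcessGroup:
    """Hypothetical stand-in for dist.ProcessGroup: holds an unpicklable handle."""

    def __init__(self, ranks):
        self.ranks = list(ranks)
        self._handle = threading.Lock()  # locks cannot be pickled


class ShardedTensor:
    """Hypothetical stand-in for a sharded ColoTensor in the optimizer state."""

    def __init__(self, data, process_group):
        self.data = list(data)
        self.process_group = process_group


def to_plain_state(state):
    """Replace every ShardedTensor by its plain payload; keep other values as-is."""
    plain = {}
    for key, value in state.items():
        if isinstance(value, dict):
            plain[key] = to_plain_state(value)
        elif isinstance(value, ShardedTensor):
            plain[key] = value.data
        else:
            plain[key] = value
    return plain


pg = FakeProcessGroup(ranks=[0, 1])
optim_state = {"state": {0: {"step": 3, "exp_avg": ShardedTensor([0.1, 0.2], pg)}}}

try:
    pickle.dumps(optim_state)
    direct_ok = True
except (TypeError, pickle.PicklingError):
    direct_ok = False  # the handle reachable from the tensor blocks pickling

# After stripping the group reference, the state round-trips through pickle.
restored = pickle.loads(pickle.dumps(to_plain_state(optim_state)))
```

In this sketch the conversion happens at save time, so the in-memory optimizer state is left untouched; whether the real `colo_state_dict` works this way is not shown in this thread.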

Contributor Author

image

It may be enough to replace that line above with an access to the optimizer's OS (optimizer states).

Contributor

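OK, that sounds fine to me. However, that function has been updated by someone at some point and now has some unclear parameter names, e.g. mapping1, mapping2; you could rename them while you are at it.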

Contributor Author

Sure!

@feifeibear
Contributor Author

Previously, ColoTensor had attributes of type dist.ProcessGroup, so it was impossible to save a ColoTensor when calling state_dict().
In this PR, I removed the dist.ProcessGroup-type attributes from ColoTensor.
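
One way to picture the change, as a rough sketch only: the tensor record keeps nothing but plain, picklable fields (here a tuple of ranks), and the live group handle is looked up when it is needed. The thread does not show how the PR recovers the group afterwards, so the `_GROUPS` registry, `get_group`, and `make_tensor` below are all hypothetical, and a `threading.Lock` again stands in for the real handle.

```python
import pickle
import threading

# Hypothetical registry: maps a picklable key (tuple of ranks) to a live handle.
_GROUPS = {}


def get_group(ranks):
    """Return the live handle for these ranks, creating it on first use."""
    key = tuple(ranks)
    if key not in _GROUPS:
        _GROUPS[key] = threading.Lock()  # stand-in for a real communicator handle
    return _GROUPS[key]


def make_tensor(data, ranks):
    """Hypothetical sharded-tensor record that stores only picklable fields."""
    return {"data": list(data), "ranks": tuple(ranks)}


t = make_tensor([1.0, 2.0], ranks=[0, 1])
blob = pickle.dumps({"exp_avg": t})     # works: no handle is stored on the record
loaded = pickle.loads(blob)["exp_avg"]
group = get_group(loaded["ranks"])      # handle is resolved again after loading
```

The design choice illustrated here is that serialization only ever sees metadata, while anything tied to a live communicator stays outside the saved object.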

@feifeibear feifeibear merged commit 20da6e4 into hpcaitech:main Jul 8, 2022
@feifeibear feifeibear deleted the tensor/debug_ckp branch July 8, 2022 08:33

3 participants