[checkpoint] save sharded optimizer states #1237
Conversation
```python
# only rank 0 saves the REPLICATE tensors.
optim_state = {
    'epoch': epoch,
    'optimizer': colo_state_dict(optimizer, state_dict_func=torch.optim.Optimizer.state_dict),
}
```
The tensors inside the optimizer's state_dict may now be ColoTensors produced by sharding. If we serialize them directly, the process_group attribute cannot be serialized. I'd like to solve this by following the approach of the colo_state_dict you wrote.
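A minimal sketch of that idea, assuming a hypothetical `to_plain` callable that turns a (possibly sharded) ColoTensor into an ordinary `torch.Tensor` (the PR's actual `colo_state_dict` may work differently): every tensor-valued entry in the optimizer state is converted before pickling, so no `ProcessGroup` handle ever reaches `torch.save`.

```python
import torch

def serializable_optim_state(optimizer, to_plain):
    # Hypothetical helper, not this PR's API. `to_plain` is assumed to
    # convert a (possibly sharded) ColoTensor into a plain torch.Tensor,
    # e.g. by gathering its shards; ordinary tensors pass through as-is.
    raw = optimizer.state_dict()
    state = {
        pid: {k: to_plain(v) if torch.is_tensor(v) else v
              for k, v in param_state.items()}
        for pid, param_state in raw['state'].items()
    }
    # param_groups hold only plain Python scalars and parameter-id lists,
    # so they pickle without any conversion.
    return {'state': state, 'param_groups': raw['param_groups']}
```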
OK, that works for me. But that function has been updated by someone (I'm not sure who), and it now has some ambiguous parameter names, e.g. mapping1, mapping2; please rename them while you're at it.
Sure!
Previously, ColoTensor had attributes of type dist.ProcessGroup. Therefore, it was impossible to save a ColoTensor when calling state_dict().
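For illustration only, a rank-0 save using the hypothetical helper sketched above (names like `serializable_optim_state` and `to_plain` are assumptions, not this PR's API) could look like:

```python
import torch
import torch.distributed as dist

if dist.get_rank() == 0:
    # `epoch` comes from the surrounding training loop, as in the diff
    # above. The converted state_dict holds only plain tensors, so
    # torch.save can pickle it without hitting a ProcessGroup.
    optim_state = {
        'epoch': epoch,
        'optimizer': serializable_optim_state(optimizer, to_plain),
    }
    torch.save(optim_state, 'optim_checkpoint.pt')
```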