[Gemini] fix grad unreleased issue and param recovery issue #2052
Conversation
```python
if p.data.device.type == "cpu":
    raise NotImplementedError("Only free cuda memory")
# Back up the CUDA tensor to host memory before freeing it.
p.cpu_data = torch.empty(p.data.shape, dtype=self.dtype, device="cpu")
p.cpu_data.copy_(p.data)
```
Isn't `p.data` also a tensor you allocated with `empty` yourself? Would this actually back up any meaningful data?
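To make the concern concrete, here is a minimal sketch of a backup/restore pair (the names `backup_param` / `restore_param` and the ordering are assumptions for illustration, not the PR's actual code): the snapshot must be taken while `p.data` still holds the real values, before the CUDA storage is freed or re-allocated.

```python
import torch

def backup_param(p: torch.nn.Parameter) -> None:
    # Hypothetical sketch. The copy must read from the *original* CUDA
    # storage: if p.data has already been replaced by a fresh
    # torch.empty(...), the backup captures only uninitialized memory,
    # which is the concern raised in the review above.
    if p.data.device.type == "cpu":
        raise NotImplementedError("Only free cuda memory")
    p.cpu_data = torch.empty_like(p.data, device="cpu")
    p.cpu_data.copy_(p.data)
    # Only now is it safe to release or re-allocate the CUDA storage.
    p.data = torch.empty(0, dtype=p.dtype, device=p.data.device)

def restore_param(p: torch.nn.Parameter) -> None:
    # Hypothetical counterpart for the "param recovery" half of the PR title.
    p.data = p.cpu_data.to("cuda")
```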
```diff
@@ -15,42 +15,76 @@ class TrainingPhase(Enum):
     BACKWARD = 1


+class MemInfo():
```
Is all of this information about GPU memory? If so, the class should be renamed to `CUDAMemInfo`.
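A sketch of the suggested rename, with hypothetical fields (the excerpt does not show `MemInfo`'s actual attributes, so every field below is an assumption, chosen only to illustrate that all members describe CUDA memory):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CUDAMemInfo:
    # Assumed fields: per-step CUDA usage plus gradients not yet released.
    model_data_list: List[int] = field(default_factory=list)      # CUDA bytes of model data per step
    non_model_data_list: List[int] = field(default_factory=list)  # CUDA bytes of non-model data per step
    unreleased_grad_volume: int = 0                               # CUDA bytes held by unreleased grads
```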
- Parameter save and restore
- Split the grad hook and the param hook (see the sketch after this list)
- MemInfo
- Shared module handling
- Tests: https://github.com/hpcaitech/ColossalAI/blob/main/tests/test_gemini/test_param_tracer.py
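As a rough illustration of the hook split only (the hook bodies and the `info` counters are assumptions, not the PR's code): the param hook is registered per-module around forward, while the grad hook is registered per-parameter so each gradient can be tracked, and later released, the moment it is produced.

```python
import torch
import torch.nn as nn

def register_split_hooks(model: nn.Module, info: dict) -> None:
    # Hypothetical registration helper; `info` holds byte counters.
    def param_pre_forward(module, args):
        # Param hook: per-module, fires before forward. A natural place to
        # restore paged-out parameters and record their CUDA footprint.
        for p in module.parameters(recurse=False):
            if p.device.type == "cuda":
                info["param_volume"] += p.numel() * p.element_size()

    def make_grad_hook():
        def grad_hook(grad: torch.Tensor):
            # Grad hook: per-parameter, fires the moment its grad exists.
            # Tracking it here lets a later release step free it promptly,
            # the "grad unreleased" half of this PR's title.
            info["unreleased_grad_volume"] += grad.numel() * grad.element_size()
            return grad
        return grad_hook

    for module in model.modules():
        module.register_forward_pre_hook(param_pre_forward)
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_grad_hook())

# Usage sketch:
model = nn.Linear(4, 4)
info = {"param_volume": 0, "unreleased_grad_volume": 0}
register_split_hooks(model, info)
model(torch.randn(2, 4)).sum().backward()
```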