
[Gemini] fix grad unreleased issue and param recovery issue #2052

Merged
merged 4 commits into from
Dec 2, 2022
Conversation

zengzh95
Contributor

@zengzh95 zengzh95 commented Nov 30, 2022

Parameter saving and restoration

  1. Define a cpu_param_data_dict attribute in ParamTracerWrapper; before forward runs, call _save_param_data_on_cpu to back up the parameters
  2. During tracing, _free_cuda_params and _allocate_params_on_cuda are only responsible for freeing and allocating CUDA memory; they do not copy data
  3. After backward finishes, call _restore_param_data to restore the parameters, then clear cpu_param_data_dict
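The save/restore flow described above can be sketched as follows. This is illustrative only: the real ParamTracerWrapper operates on torch tensors, while plain Python lists stand in for parameter data here; the method names follow the PR description, but the bodies are hypothetical.

```python
class ParamTracerWrapperSketch:
    def __init__(self, params):
        self.params = params           # name -> data (stand-in for tensors)
        self.cpu_param_data_dict = {}  # name -> CPU backup

    def _save_param_data_on_cpu(self):
        # Back up every parameter before forward runs.
        for name, data in self.params.items():
            self.cpu_param_data_dict[name] = list(data)

    def _restore_param_data(self):
        # After backward, restore the backups and clear the dict.
        for name, backup in self.cpu_param_data_dict.items():
            self.params[name] = list(backup)
        self.cpu_param_data_dict.clear()

wrapper = ParamTracerWrapperSketch({"w": [1.0, 2.0]})
wrapper._save_param_data_on_cpu()
wrapper.params["w"] = [0.0, 0.0]   # tracing may clobber the data
wrapper._restore_param_data()      # parameters are back to [1.0, 2.0]
```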

Splitting the grad hook and the param hook

  1. Add a GradHook class
  2. Before a ParamTracerWrapper instance runs forward, call the GradHook instance's register_grad_hook function to attach hooks to the parameters
  3. After backward runs, call the GradHook instance's remove_grad_hook function to remove the hooks from the parameters
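The register/remove life cycle above can be sketched with a minimal hook registry. In the real code the hooks go through torch.Tensor.register_hook, which returns a removable handle; the Param and Handle classes below are hypothetical stand-ins that emulate that API.

```python
class Handle:
    """Removable handle, mimicking torch's RemovableHandle."""
    def __init__(self, hooks, key):
        self.hooks, self.key = hooks, key
    def remove(self):
        self.hooks.pop(self.key, None)

class Param:
    """Stand-in for a parameter tensor that can fire grad hooks."""
    def __init__(self):
        self._hooks, self._next = {}, 0
    def register_hook(self, fn):
        key, self._next = self._next, self._next + 1
        self._hooks[key] = fn
        return Handle(self._hooks, key)
    def fire(self, grad):
        for fn in list(self._hooks.values()):
            fn(grad)

class GradHook:
    def __init__(self, params):
        self.params, self.handles, self.seen_grads = params, [], []
    def register_grad_hook(self):
        # Attach a hook to every parameter before forward.
        for p in self.params:
            self.handles.append(p.register_hook(self.seen_grads.append))
    def remove_grad_hook(self):
        # Detach all hooks after backward.
        for h in self.handles:
            h.remove()
        self.handles.clear()

p = Param()
gh = GradHook([p])
gh.register_grad_hook()
p.fire(0.5)            # recorded
gh.remove_grad_hook()
p.fire(1.0)            # no hooks left; not recorded
```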

MemInfo

  1. model_data_list records the memory used by model data
  2. non_model_data_list records the memory used by non-model data
  3. unreleased_grad_flag indicates whether a parameter's gradient is still unreleased
  4. unreleased_grad_volume is the volume of unreleased gradient data
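A minimal container holding the four fields above might look like this; the field names come from the PR description, while the types and defaults are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MemInfo:
    model_data_list: list = field(default_factory=list)       # per-sample model-data memory
    non_model_data_list: list = field(default_factory=list)   # per-sample non-model-data memory
    unreleased_grad_flag: dict = field(default_factory=dict)  # param id -> grad still alive?
    unreleased_grad_volume: int = 0                           # volume of unreleased grads

info = MemInfo()
```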

Handling shared modules

  1. When sample_model_data computes model data, initialize data_volume to unreleased_grad_volume
  2. In sample_model_data, check whether a parameter's gradient is still unreleased; if so, do not add it to unreleased_grad_volume again, since PyTorch's gradient accumulator keeps only one copy of the gradient
  3. Release the gradient in grad_handle
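A hedged sketch of the shared-module accounting described above: a gradient is counted only once even if a shared module fires its hook repeatedly, and sampling starts data_volume from the unreleased-gradient volume. The function names mirror the PR description, but a plain dict stands in for MemInfo and the bodies are illustrative, not the merged code.

```python
def grad_handle(meminfo, param_id, grad_bytes):
    # Count a gradient once; repeated hook firings from a shared module
    # do not accumulate, because the grad accumulator keeps one copy.
    if not meminfo["unreleased_grad_flag"].get(param_id, False):
        meminfo["unreleased_grad_volume"] += grad_bytes
        meminfo["unreleased_grad_flag"][param_id] = True

def sample_model_data(meminfo, param_bytes):
    # Model data = unreleased gradients + live parameter memory.
    data_volume = meminfo["unreleased_grad_volume"]
    data_volume += sum(param_bytes.values())
    meminfo["model_data_list"].append(data_volume)

mi = {"unreleased_grad_flag": {}, "unreleased_grad_volume": 0,
      "model_data_list": []}
grad_handle(mi, "w", 8)
grad_handle(mi, "w", 8)           # shared module fires again; counted once
sample_model_data(mi, {"w": 8})   # records 8 (grad) + 8 (param) = 16
```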

Testing

https://github.com/hpcaitech/ColossalAI/blob/main/tests/test_gemini/test_param_tracer.py

if p.data.device.type == "cpu":
    raise NotImplementedError("Only free cuda memory")
p.cpu_data = torch.empty(p.data.shape, dtype=self.dtype, device="cpu")
p.cpu_data.copy_(p.data)
Contributor

Isn't p.data itself something you allocated with empty? Does backing it up preserve any meaningful information?

@feifeibear
Contributor

Comments:

  1. Split the param hook and the grad hook into two classes. Put the current memory statistics into a separate class that instances of the other two both update.
  2. Pull the param.data backup logic out on its own: back up before the tracer runs, restore after it finishes.

@@ -15,42 +15,76 @@ class TrainingPhase(Enum):
BACKWARD = 1


class MemInfo():
Contributor

Is all of this GPU memory? If so, it should be renamed CUDAMemInfo.

Contributor

@feifeibear feifeibear Dec 2, 2022

You keep using it as a global variable. I suggest using the singleton pattern.
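One common way to realize the reviewer's singleton suggestion is to override __new__ so every construction returns the same MemInfo instance. This is an illustrative pattern under that assumption, not the code the PR actually merged.

```python
class MemInfo:
    _instance = None

    def __new__(cls):
        # First construction creates and initializes the instance;
        # every later call returns the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model_data_list = []
            cls._instance.unreleased_grad_volume = 0
        return cls._instance

a = MemInfo()
b = MemInfo()
a.unreleased_grad_volume = 42   # visible through every reference
```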

@feifeibear feifeibear changed the title fix grad unreleased issue and param recovery issue [Gemini] fix grad unreleased issue and param recovery issue Dec 2, 2022
@feifeibear feifeibear merged commit 38ea4ba into hpcaitech:main Dec 2, 2022
@zengzh95 zengzh95 deleted the rumtimeTracer branch December 2, 2022 08:51