
model.float() OOM as #3510 #3606

Closed
1 task done
lk137095576 opened this issue May 7, 2024 · 2 comments
Labels
solved This problem has been already solved.

Comments

@lk137095576

Reminder

  • I have read the README and searched the existing issues.

Reproduction

#3510
See the figure below: the backward pass runs in fp16, and the result is only cast to fp32 afterwards.
The Microsoft video introducing DeepSpeed (linked below) also shows that the backward pass uses fp16.
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
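To make the memory implications concrete, here is a rough per-parameter accounting sketch, following the byte counts the ZeRO blog gives for fp16 mixed-precision training with Adam (exact numbers vary by optimizer and ZeRO stage; this is an illustration, not the framework's actual bookkeeping):

```python
# Per-parameter memory accounting for fp16 mixed-precision training with
# Adam, loosely following the ZeRO blog's breakdown. All numbers are a
# sketch for illustration.
BYTES_FP16 = 2
BYTES_FP32 = 4

def bytes_per_param_mixed_precision():
    """fp16 weights and fp16 grads (the backward pass itself runs in
    fp16), plus the optimizer's fp32 master weights and Adam moments."""
    return (BYTES_FP16      # fp16 model weights
            + BYTES_FP16    # fp16 gradients from the fp16 backward pass
            + BYTES_FP32    # fp32 master copy of the weights
            + BYTES_FP32    # Adam momentum (fp32)
            + BYTES_FP32)   # Adam variance (fp32)

def bytes_per_param_after_float():
    """If model.float() upcasts the weights in place, weights and their
    gradients become fp32 while the optimizer state is unchanged."""
    return (BYTES_FP32 * 2          # fp32 weights + fp32 gradients
            + BYTES_FP32 * 3)       # master copy + two Adam moments

print(bytes_per_param_mixed_precision())  # 16
print(bytes_per_param_after_float())      # 20
```

So upcasting the weights inflates the weight-plus-gradient footprint from 4 bytes per parameter to 8, on top of optimizer state that is already fp32.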

This was mentioned there, but I don't know how to reopen an issue, so I opened a new one.

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga hiyouga added the pending This problem is yet to be addressed. label May 7, 2024
@lk137095576
Author

I looked at this again today. I was testing with a 14B model; for full-parameter training of a 72B model, even an 80 GB GPU cannot survive model.float().
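A quick sketch of the weight memory alone (parameter counts are nominal, and this ignores gradients, optimizer state, and activations) shows why fp32 weights for a 72B model cannot fit on a single 80 GB card:

```python
def weight_gib(n_params, bytes_per_param):
    """GiB needed just to store the model weights."""
    return n_params * bytes_per_param / 2**30

# Nominal parameter counts for the model sizes mentioned above.
for n_billion in (14, 72):
    n = n_billion * 10**9
    print(f"{n_billion}B: fp16 weights {weight_gib(n, 2):.0f} GiB, "
          f"fp32 weights {weight_gib(n, 4):.0f} GiB")
```

For 72B parameters, fp32 weights alone take roughly 268 GiB, so calling model.float() is hopeless on an 80 GB device even before any training state is allocated.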

@hiyouga hiyouga added the enhancement New feature or request label May 15, 2024
@hiyouga
Owner

hiyouga commented May 15, 2024

fixed

@hiyouga hiyouga added solved This problem has been already solved. and removed enhancement New feature or request pending This problem is yet to be addressed. labels May 15, 2024