Why doesn't cpu_checkpointing work? #522
Comments
I think this is related to #541
@hpourmodheji, thanks for the question. This should have been fixed by #1254. Are you still having this problem?
@tjruwase, thanks for your comment. I tried ZeRO-Infinity with DeepSpeed version 0.5.7+, but it seems that activations are not offloaded to the CPU. Here is the config part: "zero_optimization": {
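For reference, a ZeRO-Infinity-style config with CPU offload and activation checkpointing generally has the shape sketched below. This is an illustrative assumption, not the poster's actual (truncated) configuration; the key names are DeepSpeed's documented config options, but every value here is made up for the example.

```python
# Hedged sketch of a ZeRO-Infinity-style DeepSpeed config. Values are
# illustrative assumptions, not the configuration from this thread.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumption
    "zero_optimization": {
        "stage": 3,  # ZeRO-Infinity builds on stage 3
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    # Activation offload is configured in its own section, separate
    # from zero_optimization:
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
    },
}

print(sorted(ds_config["activation_checkpointing"]))
```

Note that `cpu_checkpointing` lives under `activation_checkpointing`, not under `zero_optimization`, which is a common source of confusion in threads like this one.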
Training stage | Baseline: GPU memory during training (single GPU) | ZeRO-Infinity: GPU memory during training (single GPU) | Memory reduction
---|---|---|---
before forward | MA 4.38 GB | MA 0.12 GB | ~37x
before backward | MA 5.74 GB | MA 1.55 GB | ~4x
before optimizer | MA 5.01 GB | MA 0.12 GB | ~42x
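The reduction factors in the table follow directly from the measured allocations (MA), as a quick recomputation shows:

```python
# Recompute the memory-reduction factors from the table above:
# (baseline MA in GB, ZeRO-Infinity MA in GB) per training stage.
stages = {
    "before forward":   (4.38, 0.12),
    "before backward":  (5.74, 1.55),
    "before optimizer": (5.01, 0.12),
}

reductions = {name: baseline / zero for name, (baseline, zero) in stages.items()}
for name, factor in reductions.items():
    print(f"{name}: ~{factor:.1f}x")
```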
@tjruwase, do you kindly have any comments on this and on offloading activations?
@hpourmodheji Sorry, I did not try activation_checkpointing with a single GPU again; I only added another GPU, and then it works.
@ghosthamlet, thanks for your comment. I tried with 2 GPUs, and it still does not work. I hope to hear from the DS team.
@ghosthamlet and @hpourmodheji, it seems there are a few issues here. I will take a closer look.
@ghosthamlet, can you please confirm that activation_checkpointing and cpu_offloading work as expected with 2 GPUs, but not with 1 GPU?
@hpourmodheji, thanks for the table you shared earlier, but I think we need something different for this investigation. Activation checkpointing (including CPU offload) can be enabled without ZeRO stage 3. So it would be good to disable ZeRO by removing the
The collected information will help with the next steps of the investigation. Thanks!
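As a concrete sketch of that isolation experiment: enable activation checkpointing while dropping the `zero_optimization` section entirely, so ZeRO cannot interfere with the measurement. The key names are DeepSpeed's documented `activation_checkpointing` options; the values are illustrative assumptions.

```python
# Hedged sketch: activation checkpointing with ZeRO disabled, by
# omitting "zero_optimization" from the config entirely. Values are
# assumptions for illustration only.
ds_config_no_zero = {
    "train_micro_batch_size_per_gpu": 1,  # assumption
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
        "profile": True,  # log checkpointing activity to aid the investigation
    },
    # no "zero_optimization" key: ZeRO is disabled
}

print("zero_optimization" in ds_config_no_zero)
```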
@tjruwase Sorry for the late reply.
Configs:
ds_report:
@ghosthamlet, awesome! Thanks so much for sharing this detailed feedback. If possible, can you please check whether these activation checkpointing features also work with ZeRO stages < 3? Thanks!
@tjruwase For stage 0/stage 1, the 2080Ti GPU runs out of memory, so I have to test a small model. I think stage 0/stage 1/stage 2 are working too:
I want to ask another question:
@tjruwase, sorry for my late reply. Please see the table below. Unfortunately, I see no offloading. Can you please advise?
No activation checkpointing, activation_checkpointing removed:
{ "_activation_checkpointing": {
},

Activation checkpointing without cpu offloading, cpu_checkpointing = false:
{ "activation_checkpointing": {
},

Activation checkpointing with cpu offloading, cpu_checkpointing = true:
{ "activation_checkpointing": {
},

$ ds_report
@hpourmodheji, thanks for your response. By the way, are you testing with an existing deepspeed examples code, or your own model? If you are using your own, did you properly wrap your forward pass as done below
I suspect you may be doing this already, but I wanted to be sure. If you are already doing this, then is it possible to share your model code so I can repro? Thanks!
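The wrapping being asked about follows the Megatron-style pattern: the layer stack is run through a `custom_forward` closure that is passed to the checkpointing function, so activations inside it can be dropped and recomputed during backward. The sketch below is illustrative only; in real code you would use `deepspeed.checkpointing.checkpoint`, but here a trivial stand-in is defined so the sketch runs without a GPU or DeepSpeed installed.

```python
# Hedged sketch of the forward-pass wrapping expected by DeepSpeed
# activation checkpointing. `checkpoint` below is a runnable stand-in
# for deepspeed.checkpointing.checkpoint, and TinyBlock is a toy layer.
def checkpoint(fn, *args):  # stand-in: just calls fn; the real one
    return fn(*args)        # drops/recomputes activations in backward

class TinyBlock:
    """Toy stand-in for a transformer layer: doubles its input."""
    def __call__(self, x):
        return [2 * v for v in x]

def custom_forward(*inputs):
    # The real checkpoint re-runs this closure during backward
    # instead of storing its intermediate activations.
    x = inputs[0]
    for layer in layers:
        x = layer(x)
    return x

layers = [TinyBlock(), TinyBlock()]
hidden = checkpoint(custom_forward, [1.0, 2.0])
print(hidden)  # → [4.0, 8.0]
```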
@tjruwase, I use your bing_bert example. It seems there is no checkpointing in this model. Since we have no Megatron module in this example, how can I use checkpointing in it?
@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:
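One plausible sketch of what enabling this involves for a BERT-style encoder: run the layer stack in chunks and wrap each chunk in the checkpointing function. Nothing below is the actual bing_bert patch; the `checkpoint` function is a runnable stand-in for `deepspeed.checkpointing.checkpoint` (which in real code would be set up first via `deepspeed.checkpointing.configure`), and the layers are toy stand-ins.

```python
# Hedged sketch: chunked activation checkpointing over an encoder stack.
# `checkpoint` is a stand-in for deepspeed.checkpointing.checkpoint so
# the example runs without DeepSpeed installed.
def checkpoint(fn, *args):
    return fn(*args)

def run_encoder(layers, hidden, chunk_size=2):
    def make_chunk_forward(start, end):
        def chunk_forward(x):
            for layer in layers[start:end]:
                x = layer(x)
            return x
        return chunk_forward

    for start in range(0, len(layers), chunk_size):
        end = min(start + chunk_size, len(layers))
        # With the real checkpoint function, only the chunk-boundary
        # activations are kept; the interior is recomputed in backward.
        hidden = checkpoint(make_chunk_forward(start, end), hidden)
    return hidden

layers = [lambda x: x + 1 for _ in range(6)]  # toy "encoder layers"
out = run_encoder(layers, 0)
print(out)  # → 6
```

The chunk size trades memory for recomputation: smaller chunks keep fewer live activations but re-run more work during backward.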
I have partition_activations and cpu_checkpointing enabled, but it seems the activations are still on the GPU. I have just one GPU, so I can't do model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (equivalent to 1-GPU model parallelism) offload all its checkpoints to the CPU?
My CPU memory is sufficient. Configs:
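The config itself is not preserved here, but a setup with the two flags described would look roughly like the sketch below. The key names are DeepSpeed's documented `activation_checkpointing` options; all other keys and values are assumptions, not the poster's actual settings.

```python
# Hedged reconstruction of the kind of config described in the question:
# partition_activations and cpu_checkpointing enabled. Everything except
# those two flags is an assumption for illustration.
config = {
    "train_batch_size": 8,  # assumption
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    },
}

print(config["activation_checkpointing"]["partition_activations"],
      config["activation_checkpointing"]["cpu_checkpointing"])
```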
Environment:
python 3.6
torch 1.6.0
deepspeed 0.3.7