
Question: Can DeepSpeedCPUAdam be used as a drop-in replacement for torch.optim.Adam? #479

Closed
ofirzaf opened this issue Oct 21, 2020 · 10 comments

Comments

ofirzaf commented Oct 21, 2020

Hi,

I want to use DeepSpeedCPUAdam instead of torch.optim.Adam to reduce the GPU memory usage while training.
Can DeepSpeedCPUAdam simply be dropped in in place of torch.optim.Adam, or are additional steps needed?
I tried to do exactly that and got a segmentation fault.

Thanks

tjruwase (Contributor)

@ofirzaf Thanks for using DeepSpeed.

Yes, you can use DeepSpeedCPUAdam as a drop-in replacement for torch.optim.Adam, which can give up to a 6x throughput boost. Let us know of any issues.
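
For reference, here is a minimal sketch of the swap, assuming DeepSpeed was installed with the CPU Adam op built (DS_BUILD_CPU_ADAM=1); the toy model and hyperparameters are illustrative only, and the parameter-placement caveat discussed later in this thread still applies:

    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    model = torch.nn.Linear(784, 10)  # toy model; its parameters live in CPU memory

    # before: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3)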

ofirzaf (Author) commented Oct 22, 2020

I tried to make it work with the MNIST example published by PyTorch: https://pastebin.pl/view/embed/d3921ab0
When I try to run it I get:

Adam Optimizer #0 is created with AVX2 arithmetic capability.
group 0 param 0 = 288
Segmentation fault (core dumped)

I am running PyTorch 1.6 and CUDA 10.1, with DeepSpeed installed from both master and v0.3 using DS_BUILD_CPU_ADAM=1 install.sh.
I also tried your Docker image to check whether the problem was in my installation, but the image doesn't have CPU Adam installed; I got a ModuleNotFoundError for cpu_adam_op.

RezaYazdaniAminabadi (Contributor)

Hi @ofirzaf

It seems that CPUAdam is instantiated correctly. However, it crashes when trying to access the parameters in CPU memory. I see in your code that the model is constructed at line 124 and can be sent to GPU memory if the use_cuda flag is set at line 98. CPUAdam requires that the parameters (the master copy in FP32) and the optimizer states, such as variance and momentum, reside in CPU memory. Could you please check whether this requirement is met in your test?

Thanks,
Reza
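
For example, a quick sanity check along these lines (illustrative, not part of DeepSpeed; it assumes a model object as in the MNIST example) would catch GPU-resident parameters before the optimizer touches them:

    # DeepSpeedCPUAdam expects every parameter it manages to reside in CPU
    # memory, so verify placement before constructing the optimizer
    for name, p in model.named_parameters():
        assert p.device.type == "cpu", f"{name} is on {p.device}, expected cpu"

    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3)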

ofirzaf (Author) commented Oct 27, 2020

I initialize the optimizer with the model's parameters, but the parameters I want to optimize are moved to the GPU when I move the entire model there. I am not sure how I should keep the parameters and other state on the CPU while still training on the GPU.
Or maybe I misunderstood, and this optimizer should only be used when training on the CPU?

RezaYazdaniAminabadi (Contributor)

Hi @ofirzaf ,

In this case, you probably want to run your code through DeepSpeed and turn on the ZeRO-Offload feature. You can find more information on ZeRO-Offload here: https://www.deepspeed.ai/tutorials/zero-offload/. Also, to get started with DeepSpeed, please check out this tutorial: https://www.deepspeed.ai/getting-started/
Thanks.
Reza
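
For illustration, a minimal ZeRO-Offload setup might look like the sketch below. The config keys follow the ZeRO-Offload tutorial linked above, but they have changed across DeepSpeed versions (early releases used "cpu_offload": true instead of "offload_optimizer"), so treat the exact keys as an assumption; the model object is assumed to come from your own script:

    import deepspeed

    ds_config = {
        "train_batch_size": 8,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            # offloads optimizer states to host memory and runs the
            # Adam update there via DeepSpeedCPUAdam
            "offload_optimizer": {"device": "cpu"},
        },
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    }

    # the engine keeps the FP32 master weights and optimizer states on the
    # CPU while forward/backward still run on the GPU
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config)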

tjruwase (Contributor) commented Oct 29, 2020

@ofirzaf, yes, you are right: DeepSpeedCPUAdam only runs the optimizer on the CPU. I apologize that my initial response was confusing; I did not clarify that while torch.optim.Adam works correctly on both GPU and CPU, DeepSpeedCPUAdam works only on the CPU. So DeepSpeedCPUAdam is a drop-in replacement for torch.optim.Adam only for CPU execution.
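
Put differently, the drop-in pattern only holds when the whole training step runs on the CPU. A hypothetical minimal loop (toy model and random data for illustration):

    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    model = torch.nn.Linear(784, 10)            # model stays on the CPU
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    x = torch.randn(32, 784)                    # dummy batch, also on the CPU
    y = torch.randint(0, 10, (32,))

    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()                            # Adam update runs in the C++ CPU op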

tjruwase (Contributor)

Closing for lack of activity. Please reopen as needed.

peterukk commented Aug 5, 2021

So DeepSpeedCPUAdam should work without CUDA?

  File "...python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 37, in installed_cuda_version
    assert cuda_home is not None, "CUDA_HOME does not exist, unable to compile CUDA op(s)"

tjruwase (Contributor) commented Aug 5, 2021

@peterukk, DeepSpeedCPUAdam will not work without CUDA (although in theory it could). The reason is that DeepSpeedCPUAdam has an execution mode in which it also copies the updated parameters back to the GPU using CUDA kernels. Do you have a scenario where you want to use DeepSpeedCPUAdam outside a CUDA environment? Depending on your answer, could you please open a new Question or reopen this one, as appropriate? Thanks.

peterukk commented Aug 5, 2021

@tjruwase thanks, I opened a new issue.
