
Fail to use Zero-offload: "ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'" #483

Closed
GrvLeo opened this issue Oct 22, 2020 · 6 comments

Comments

@GrvLeo

GrvLeo commented Oct 22, 2020

Hi,

I'm looking to use the ZeRO-Offload feature to train a 10B-parameter model on a single GPU. I have been able to train models using ZeRO-2, but when I enable the cpu-optimizer flag the job fails with the following error:
"ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'"

I'm not sure why this is happening, although I do see there's a recent change to disable the default installation: https://github.com/microsoft/DeepSpeed/pull/450/files/19c51251f1f6d32099fe321911316eeacaa9ed26

Is there something users need to do to enable the installation of this?
Appreciate the help.

@tjruwase
Contributor

To enable cpu-adam, please:

1. Install cpufeature: `pip install cpufeature`
2. Run `DS_BUILD_CPU_ADAM=1 ./install.sh -n`

The `-n` flag requests an incremental installation, so existing binaries are not rebuilt.
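After the rebuild, a minimal sanity check along these lines can confirm the CPU Adam extension actually built (an illustrative sketch, not from the thread; it assumes the standard `DeepSpeedCPUAdam` import path under `deepspeed.ops.adam`):

```python
# Sanity check: import and step DeepSpeedCPUAdam once on a CPU tensor.
# If the cpu_adam extension did not build, the import below fails with
# the same ModuleNotFoundError reported above.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

p = torch.nn.Parameter(torch.randn(8, 8))   # dummy CPU parameter
opt = DeepSpeedCPUAdam([p], lr=1e-3)
p.grad = torch.randn_like(p)                # fake gradient
opt.step()
print("cpu_adam extension loaded and stepped successfully")
```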

@GrvLeo
Author

GrvLeo commented Oct 22, 2020

Hi @tjruwase,

Thanks for the quick response. I saw the DS_BUILD_CPU_ADAM=1 change required in one of the commits, so I made the change and was able to install successfully.

This gets rid of the module-not-found error, but I hit another error while training with ZeRO-Offload enabled.
Stack trace added for reference.

```
[2020-10-22 21:31:28,140] [INFO] [config.py:624:print] zero_optimization_stage ...... 2
[2020-10-22 21:31:28,140] [INFO] [config.py:631:print] json = {
"activation_checkpointing":{
"contiguous_memory_optimization":true,
"cpu_checkpointing":true,
"partition_activations":true
},
"fp16":{
"enabled":true,
"hysteresis":2,
"loss_scale":4096,
"loss_scale_window":1000,
"min_loss_scale":1
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"steps_per_print":1,
"train_micro_batch_size_per_gpu":10,
"wall_clock_breakdown":true,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"cpu_offload":true,
"reduce_bucket_size":50000000,
"stage":2
}
}
[2020-10-22 21:31:32,170] [INFO] [checkpointing.py:63:see_memory_usage] First Forward Begining
[2020-10-22 21:31:32,171] [INFO] [checkpointing.py:66:see_memory_usage] Memory Allocated 19.562528133392334 GigaBytes
[2020-10-22 21:31:32,172] [INFO] [checkpointing.py:70:see_memory_usage] Max Memory Allocated 20.299896240234375 GigaBytes
[2020-10-22 21:31:32,172] [INFO] [checkpointing.py:74:see_memory_usage] Cache Allocated 20.3203125 GigaBytes
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:78:see_memory_usage] Max cache Allocated 20.3203125 GigaBytes
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:357:forward] Activation Checkpointing Information
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:359:forward] ----Partition Activations True, CPU CHECKPOINTING True
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:362:forward] ----contiguous Memory Checkpointing True with 50 total layers
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:364:forward] ----Synchronization False
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:365:forward] ----Profiling False
Traceback (most recent call last):
File "pretrain_gpt2.py", line 717, in
main()
File "pretrain_gpt2.py", line 694, in main
timers, args)
File "pretrain_gpt2.py", line 427, in train
args, timers)
File "pretrain_gpt2.py", line 391, in train_step
model.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 917, in step
self._take_model_step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 875, in _take_model_step
self.optimizer.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/runtime/zero/stage2.py", line 1472, in step
group=self.dp_process_group)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1153, in all_gather
work = group.allgather([tensor_list], [tensor])

RuntimeError: CUDA error: invalid device function (copy_device_to_device at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/ATen/native/cuda/Copy.cu:81)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7fe30d6ae627 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::copy_device_to_device(at::TensorIterator&, bool) + 0x8e5 (0x7fe3149869b5 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x55af1ab (0x7fe3149881ab in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x15e93dd (0x7fe3109c23dd in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: + 0x15e56bf (0x7fe3109be6bf in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x3e (0x7fe3109c0bae in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: + 0x3f2dc78 (0x7fe313306c78 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: + 0xaa73ce (0x7fe33fc623ce in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xaae7ac (0x7fe33fc697ac in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0x54 (0x7fe33fc6a514 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0xa357a4 (0x7fe33fbf07a4 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x28c076 (0x7fe33f447076 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: _PyCFunction_FastCallDict + 0x154 (0x55a8921dd334 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #13: + 0x198ade (0x55a892264ade in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #15: + 0x191b76 (0x55a89225db76 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #16: + 0x192be6 (0x55a89225ebe6 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #17: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x10cb (0x55a89228831b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #19: + 0x191b76 (0x55a89225db76 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #20: + 0x192b83 (0x55a89225eb83 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #21: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #23: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #24: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #26: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #27: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #29: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #30: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #32: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #33: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #35: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #36: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #38: PyEval_EvalCodeEx + 0x329 (0x55a89225f6c9 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #39: PyEval_EvalCode + 0x1c (0x55a89226045c in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #40: + 0x214d54 (0x55a8922e0d54 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #41: PyRun_FileExFlags + 0xa1 (0x55a8922e1151 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #42: PyRun_SimpleFileExFlags + 0x1c3 (0x55a8922e1353 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #43: Py_Main + 0x613 (0x55a8922e4e43 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #44: main + 0xee (0x55a8921af28e in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #45: __libc_start_main + 0xea (0x7fe35562f02a in /lib64/libc.so.6)
frame #46: + 0x1c1fff (0x55a89228dfff in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
```

@GrvLeo
Author

GrvLeo commented Oct 23, 2020

An additional data point: I'm able to train with ZeRO-2 when the cpu-optimizer is not enabled, and it works fine.

@GrvLeo
Author

GrvLeo commented Oct 23, 2020

FYI, this is what the result of the installation looks like:

```
Successfully installed deepspeed-0.3.0+d720fdb
Cleaning up...
Removed build tracker: '/tmp/pip-req-tracker-et4p62xs'
[SUCCESS] deepspeed successfully imported.
[INFO] torch install path: ['/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch']
[INFO] torch version: 1.4.0, torch.cuda: 10.1
[INFO] deepspeed install path: ['/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed']
[INFO] deepspeed info: 0.3.0+d720fdb, d720fdb, master
[SUCCESS] apex extensions successfully installed
[INFO] using new-style apex
[SUCCESS] fused lamb successfully installed.
[SUCCESS] transformer kernels successfully installed.
[WARNING] sparse attention is NOT installed.
[SUCCESS] cpu-adam (used by ZeRO-offload) successfully installed.
Installation is successful
```

@tjruwase
Contributor

@GrvLeo Thanks for sharing these updates.

The stack trace is quite mysterious because, although you are training on 1 GPU, the failure is in the all-gather. To help debug further, can you please try torch.optim.Adam instead of DeepSpeedCPUAdam by adding the following command-line argument:
`--torch-adam`
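For context, a minimal sketch of how such a toggle is typically wired (the argument name and helper below are illustrative assumptions, not the actual Megatron/DeepSpeed training-script code):

```python
# Hypothetical optimizer selection: fall back to torch.optim.Adam when
# --torch-adam is passed, otherwise use DeepSpeed's CPU Adam for ZeRO-Offload.
import argparse
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

def build_optimizer(model, args):
    params = [p for p in model.parameters() if p.requires_grad]
    if args.torch_adam:
        return torch.optim.Adam(params, lr=args.lr)   # plain PyTorch Adam
    return DeepSpeedCPUAdam(params, lr=args.lr)       # fused CPU Adam kernel

parser = argparse.ArgumentParser()
parser.add_argument("--torch-adam", action="store_true")
parser.add_argument("--lr", type=float, default=1.5e-4)
```

The idea is to isolate whether the crash comes from the fused CPU Adam kernel or from elsewhere in the ZeRO-2 path.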

@GrvLeo
Author

GrvLeo commented Oct 23, 2020

Hi @tjruwase,
Yeah, it was mysterious; I wasn't sure what was happening. I saw some threads on Stack Overflow which pointed to keeping the CUDA toolkit version the same as the one PyTorch was built with.
I tried reinstalling DeepSpeed from scratch while enabling DeepSpeed Adam.

It works for me now and I'm able to train the 10B-parameter model. Thanks for the help.
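For anyone hitting the same "invalid device function" error, a quick way to compare the local CUDA toolkit with the version PyTorch was built against is shown below (an illustrative check, not from the thread):

```python
# Compare the CUDA version PyTorch was compiled against with the local nvcc.
# A mismatch here is one common reason custom extensions fail at runtime.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```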

@GrvLeo GrvLeo closed this as completed Oct 23, 2020