
Fail to use Zero-offload: "ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'" #483

Closed
GrvLeo opened this issue Oct 22, 2020 · 6 comments

Comments

@GrvLeo

GrvLeo commented Oct 22, 2020

Hi,

I'm looking to use the ZeRO-Offload feature to train a 10B-parameter model on a single GPU. I have been able to train models using ZeRO-2, but when I enable the cpu-optimizer flag the job fails with the following error:
"ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'"

I'm not sure why this is happening, although I do see there's a recent change to disable the default installation: https://github.com/microsoft/DeepSpeed/pull/450/files/19c51251f1f6d32099fe321911316eeacaa9ed26

Is there something users need to do to enable the installation of this?
Appreciate the help.

@tjruwase
Contributor

To enable cpu-adam, please:

1. Install cpufeature: `pip install cpufeature`
2. Run `DS_BUILD_CPU_ADAM=1 ./install.sh -n`

The `-n` flag requests an incremental installation, so existing binaries are not rebuilt.
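After the rebuild, a minimal sanity check along these lines can confirm the CPU Adam extension actually built (an illustrative sketch, not from the thread; it assumes the standard `DeepSpeedCPUAdam` import path under `deepspeed.ops.adam`):

```python
# Sanity check: import and step DeepSpeedCPUAdam once on a CPU tensor.
# If the cpu_adam extension did not build, the import below fails with
# the same ModuleNotFoundError reported above.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

p = torch.nn.Parameter(torch.randn(8, 8))   # dummy CPU parameter
opt = DeepSpeedCPUAdam([p], lr=1e-3)
p.grad = torch.randn_like(p)                # fake gradient
opt.step()
print("cpu_adam extension loaded and stepped successfully")
```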

@GrvLeo
Author

GrvLeo commented Oct 22, 2020

Hi @tjruwase,

Thanks for the quick response. I saw the DS_BUILD_CPU_ADAM=1 change required in one of the commits, so I made the change and was able to install successfully.

This gets rid of the module-not-found error, but I hit another error while training with ZeRO-Offload enabled.
Stack trace added for reference.

```
[2020-10-22 21:31:28,140] [INFO] [config.py:624:print] zero_optimization_stage ...... 2
[2020-10-22 21:31:28,140] [INFO] [config.py:631:print] json = {
"activation_checkpointing":{
"contiguous_memory_optimization":true,
"cpu_checkpointing":true,
"partition_activations":true
},
"fp16":{
"enabled":true,
"hysteresis":2,
"loss_scale":4096,
"loss_scale_window":1000,
"min_loss_scale":1
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"steps_per_print":1,
"train_micro_batch_size_per_gpu":10,
"wall_clock_breakdown":true,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"cpu_offload":true,
"reduce_bucket_size":50000000,
"stage":2
}
}
[2020-10-22 21:31:32,170] [INFO] [checkpointing.py:63:see_memory_usage] First Forward Begining
[2020-10-22 21:31:32,171] [INFO] [checkpointing.py:66:see_memory_usage] Memory Allocated 19.562528133392334 GigaBytes
[2020-10-22 21:31:32,172] [INFO] [checkpointing.py:70:see_memory_usage] Max Memory Allocated 20.299896240234375 GigaBytes
[2020-10-22 21:31:32,172] [INFO] [checkpointing.py:74:see_memory_usage] Cache Allocated 20.3203125 GigaBytes
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:78:see_memory_usage] Max cache Allocated 20.3203125 GigaBytes
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:357:forward] Activation Checkpointing Information
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:359:forward] ----Partition Activations True, CPU CHECKPOINTING True
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:362:forward] ----contiguous Memory Checkpointing True with 50 total layers
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:364:forward] ----Synchronization False
[2020-10-22 21:31:32,173] [INFO] [checkpointing.py:365:forward] ----Profiling False
Traceback (most recent call last):
File "pretrain_gpt2.py", line 717, in
main()
File "pretrain_gpt2.py", line 694, in main
timers, args)
File "pretrain_gpt2.py", line 427, in train
args, timers)
File "pretrain_gpt2.py", line 391, in train_step
model.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 917, in step
self._take_model_step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 875, in _take_model_step
self.optimizer.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/runtime/zero/stage2.py", line 1472, in step
group=self.dp_process_group)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1153, in all_gather
work = group.allgather([tensor_list], [tensor])

RuntimeError: CUDA error: invalid device function (copy_device_to_device at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/ATen/native/cuda/Copy.cu:81)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7fe30d6ae627 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::copy_device_to_device(at::TensorIterator&, bool) + 0x8e5 (0x7fe3149869b5 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x55af1ab (0x7fe3149881ab in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x15e93dd (0x7fe3109c23dd in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: + 0x15e56bf (0x7fe3109be6bf in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x3e (0x7fe3109c0bae in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: + 0x3f2dc78 (0x7fe313306c78 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: + 0xaa73ce (0x7fe33fc623ce in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xaae7ac (0x7fe33fc697ac in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0x54 (0x7fe33fc6a514 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0xa357a4 (0x7fe33fbf07a4 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x28c076 (0x7fe33f447076 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: _PyCFunction_FastCallDict + 0x154 (0x55a8921dd334 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #13: + 0x198ade (0x55a892264ade in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #15: + 0x191b76 (0x55a89225db76 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #16: + 0x192be6 (0x55a89225ebe6 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #17: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x10cb (0x55a89228831b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #19: + 0x191b76 (0x55a89225db76 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #20: + 0x192b83 (0x55a89225eb83 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #21: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #23: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #24: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #26: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #27: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #29: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #30: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #32: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #33: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #35: + 0x19296b (0x55a89225e96b in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #36: + 0x198a65 (0x55a892264a65 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x30a (0x55a89228755a in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #38: PyEval_EvalCodeEx + 0x329 (0x55a89225f6c9 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #39: PyEval_EvalCode + 0x1c (0x55a89226045c in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #40: + 0x214d54 (0x55a8922e0d54 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #41: PyRun_FileExFlags + 0xa1 (0x55a8922e1151 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #42: PyRun_SimpleFileExFlags + 0x1c3 (0x55a8922e1353 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #43: Py_Main + 0x613 (0x55a8922e4e43 in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #44: main + 0xee (0x55a8921af28e in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
frame #45: __libc_start_main + 0xea (0x7fe35562f02a in /lib64/libc.so.6)
frame #46: + 0x1c1fff (0x55a89228dfff in /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python)
```

@GrvLeo
Author

GrvLeo commented Oct 23, 2020

An additional data point: I'm able to train with ZeRO-2 when the cpu-optimizer is not enabled, and it works fine.

@GrvLeo
Author

GrvLeo commented Oct 23, 2020

FYI, this is what the result of the installation looks like:

```
Successfully installed deepspeed-0.3.0+d720fdb
Cleaning up...
Removed build tracker: '/tmp/pip-req-tracker-et4p62xs'
[SUCCESS] deepspeed successfully imported.
[INFO] torch install path: ['/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch']
[INFO] torch version: 1.4.0, torch.cuda: 10.1
[INFO] deepspeed install path: ['/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed']
[INFO] deepspeed info: 0.3.0+d720fdb, d720fdb, master
[SUCCESS] apex extensions successfully installed
[INFO] using new-style apex
[SUCCESS] fused lamb successfully installed.
[SUCCESS] transformer kernels successfully installed.
[WARNING] sparse attention is NOT installed.
[SUCCESS] cpu-adam (used by ZeRO-offload) successfully installed.
Installation is successful
```

@tjruwase
Contributor

@GrvLeo Thanks for sharing these updates.

The stack trace is quite mysterious because, although you are training on 1 GPU, the failure is in the all-gather. To help debug further, can you please try torch.optim.Adam instead of DeepSpeedCPUAdam by adding the following command-line argument:
`--torch-adam`
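For context, a minimal sketch of how such a toggle is typically wired (the argument name and helper below are illustrative assumptions, not the actual Megatron/DeepSpeed training-script code):

```python
# Hypothetical optimizer selection: fall back to torch.optim.Adam when
# --torch-adam is passed, otherwise use DeepSpeed's CPU Adam for ZeRO-Offload.
import argparse
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

def build_optimizer(model, args):
    params = [p for p in model.parameters() if p.requires_grad]
    if args.torch_adam:
        return torch.optim.Adam(params, lr=args.lr)   # plain PyTorch Adam
    return DeepSpeedCPUAdam(params, lr=args.lr)       # fused CPU Adam kernel

parser = argparse.ArgumentParser()
parser.add_argument("--torch-adam", action="store_true")
parser.add_argument("--lr", type=float, default=1.5e-4)
```

The idea is to isolate whether the crash comes from the fused CPU Adam kernel or from elsewhere in the ZeRO-2 path.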

@GrvLeo
Author

GrvLeo commented Oct 23, 2020

Hi @tjruwase,
Yeah, it was mysterious; I wasn't sure what was happening. I saw some threads on Stack Overflow which pointed to keeping the CUDA toolkit version the same as the one PyTorch was built with.
I tried reinstalling DeepSpeed from scratch while enabling DeepSpeed Adam.

It works for me now and I'm able to train the 10B-parameter model. Thanks for the help.
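For anyone hitting the same "invalid device function" error, a quick way to compare the local CUDA toolkit with the version PyTorch was built against is shown below (an illustrative check, not from the thread):

```python
# Compare the CUDA version PyTorch was compiled against with the local nvcc.
# A mismatch here is one common reason custom extensions fail at runtime.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```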

@GrvLeo GrvLeo closed this as completed Oct 23, 2020