Fail to use Zero-offload: "ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'" #483
Comments
To enable cpu-adam, please install cpufeature. Run:
Hi @tjruwase, thanks for the quick response. I saw in one of the commits that DS_BUILD_CPU_ADAM=1 is required, so I set it and the installation succeeded. That gets rid of the module-not-found error, but I now hit a different error when training with Zero-offload enabled: `RuntimeError: CUDA error: invalid device function (copy_device_to_device at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/ATen/native/cuda/Copy.cu:81)`
One more data point: Zero-2 training works fine when the cpu-optimizer flag is not enabled.
FYI, this is what the installation output looks like: Successfully installed deepspeed-0.3.0+d720fdb
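For readers comparing the two setups, here is a minimal sketch of how the working Zero-2 configuration and the offload variant might differ. The thread does not show the actual config, so every value below is a placeholder, and the `cpu_offload` key is an assumption based on the `zero_optimization` options documented for DeepSpeed 0.3.x. The cpu-optimizer flag itself belongs to the training script and is not modeled here.

```python
import copy
import json

# Hypothetical baseline: ZeRO stage 2 with optimizer state kept on the GPU.
# Batch size and fp16 settings are placeholders, not values from this thread.
zero2_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}


def with_cpu_offload(config):
    """Return a copy of config with optimizer offload to CPU switched on.

    Assumes the `cpu_offload` key under `zero_optimization`.
    """
    updated = copy.deepcopy(config)
    updated["zero_optimization"]["cpu_offload"] = True
    return updated


offload_config = with_cpu_offload(zero2_config)
print(json.dumps(offload_config, indent=2))
```

The only difference between the two dictionaries is the extra `cpu_offload` entry, which is consistent with the failure appearing only once the CPU optimizer path is exercised.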
@GrvLeo Thanks for sharing these updates. The stack trace is quite mysterious because, although you are training on 1 GPU, the failure is in the all-gather. To help debug further, can you please try torch.optim.Adam instead of DeepSpeedCPUAdam by adding the following command line argument:
Hi @tjruwase, it works for me now and I'm able to train a 10B-parameter model. Thanks for the help.
Hi,
I'm looking to use the Zero-offload feature to train a 10B-parameter model on a single GPU. I have been able to train models using Zero-2, but when I enable the cpu-optimizer flag the job fails with the following error:
"ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'"
I'm not sure why this is happening, although I do see there's a recent change that disables the default installation: https://github.com/microsoft/DeepSpeed/pull/450/files/19c51251f1f6d32099fe321911316eeacaa9ed26
Is there something users need to do to enable the installation of this?
Appreciate the help.
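A quick way to confirm whether the extension from the error message was built, before launching a long training job, is to probe for it without importing it. This is a minimal sketch using only the Python standard library. The module name is taken from the error above, and the helper `module_available` is a hypothetical name introduced for illustration, not part of DeepSpeed.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without importing the module itself."""
    try:
        # find_spec imports parent packages, so a missing parent raises
        # ModuleNotFoundError (a subclass of ImportError).
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Module name copied from the ModuleNotFoundError reported in this issue.
CPU_ADAM_MODULE = "deepspeed.ops.adam.cpu_adam_op"

if module_available(CPU_ADAM_MODULE):
    print(f"{CPU_ADAM_MODULE} is installed")
else:
    print(f"{CPU_ADAM_MODULE} is missing; the cpu-adam op was probably not built")
```

If the check reports the module as missing, reinstalling with DS_BUILD_CPU_ADAM=1 set, as discussed in the comments, is the fix that worked in this thread.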