[BUG] Using and Building DeepSpeedCPUAdam #5677
Comments
I also encountered this bug. I have been using Docker images to install and use deepspeed, and the same code worked before. However, when I created a new container from the Docker image and installed deepspeed, the same issue as described above occurred. Is this a problem with version 0.14.3 of deepspeed? How can I resolve this? Specific system information is as follows:
|
I get the same error and noticed that cpu_adam.so didn't get built properly. In my case it seems to be a missing dependency. You can scroll back further in the log to see why the module failed to load.
|
@delock
|
@delock I got the same error; in fact, I do have libcurand.so |
The full error message from gcc might indicate what went wrong. The real reason for the kernel build failure might be different in your case. One thing I usually try is to manually execute the following command printed by DeepSpeed, so that this specific error can be reproduced and triaged.
|
Hi @delock
DS_BUILD_CPU_ADAM=1 pip install deepspeed
or
DS_BUILD_CPU_ADAM=1 pip install . # inside DeepSpeed directory
Here is the command it is stuck at
And here is where it is stuck
|
Now when using
It gives the same error as above, but tells which line
|
The messages posted are warnings, which should not stop the compiler. Compilation usually stops when it encounters an error. When you execute the gcc line manually, what is the first error the compiler reports? |
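To separate warnings from real errors in a long build log, something like the sketch below may help. It assumes GCC-style diagnostics ("file:line:col: error: ..."), and both the function name `first_compiler_error` and the sample log are hypothetical, written for illustration rather than taken from this thread.

```python
import re

# Matches GCC-style error diagnostics such as "file.cpp:10:5: error: ..."
# or "g++: fatal error: ...". Lines that only contain warnings do not match.
_ERROR_RE = re.compile(r":\s(?:fatal |internal compiler )?error:")


def first_compiler_error(log: str):
    """Return the first line that looks like a compiler error, or None."""
    for line in log.splitlines():
        if _ERROR_RE.search(line):
            return line.strip()
    return None


# Hypothetical log for illustration only; not output from this issue.
SAMPLE_LOG = """\
example.cpp:12:9: warning: ignoring '#pragma unroll' [-Wunknown-pragmas]
example.cpp:40:5: error: 'foo' was not declared in this scope
example.cpp:41:1: error: expected ';' before '}' token
"""
```

Feeding the captured output of the manually executed gcc command to `first_compiler_error` should surface the line worth posting here, if the compiler reported one.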
When running
or
I have this error
Then when running the latest gcc command
I get
|
OK, so the build sometimes gets stuck because ~/.cache/torch_extensions/ already exists. |
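For anyone who wants to clear that cache from a script, here is a minimal sketch. It assumes the cache lives at ~/.cache/torch_extensions unless the TORCH_EXTENSIONS_DIR environment variable points elsewhere; the helper names `torch_extensions_dir` and `clear_extension_cache` are made up for this example.

```python
import os
import shutil
from pathlib import Path


def torch_extensions_dir() -> Path:
    """Guess the JIT build cache location (assumption: default Linux layout)."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def clear_extension_cache(cache_dir=None) -> bool:
    """Delete the cache directory; return True if something was removed."""
    target = Path(cache_dir) if cache_dir is not None else torch_extensions_dir()
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Calling `clear_extension_cache()` with no argument removes the whole cache root, so every JIT-built extension is recompiled on next use. Pass an explicit path to limit the cleanup.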
This specific error indicates that gcc has hit a serious internal problem. I ran the same command on my system and I see gcc spend several seconds after the unroll warning message before it finally exits successfully. I suspect there is something complicated in the code that stresses some versions of gcc.
What is your gcc version? I'm using gcc 12.3.0 and I can finally finish compiling. |
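When comparing compiler versions across machines, a small helper can report them in a uniform way. This is only a sketch: it assumes the first line of `gcc --version` ends with a three-part version number, and the names `parse_gcc_version` and `installed_gcc_version` are hypothetical.

```python
import re
import shutil
import subprocess


def parse_gcc_version(version_text: str):
    """Extract (major, minor, patch) from `gcc --version` output, or None."""
    first_line = version_text.splitlines()[0] if version_text else ""
    matches = re.findall(r"\d+\.\d+\.\d+", first_line)
    if not matches:
        return None
    # Distro builds may repeat the version in parentheses; take the last one.
    return tuple(int(part) for part in matches[-1].split("."))


def installed_gcc_version(compiler: str = "gcc"):
    """Run `<compiler> --version` and parse it; None if it is not on PATH."""
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        [compiler, "--version"], capture_output=True, text=True, check=False
    )
    return parse_gcc_version(result.stdout)
```

Running `installed_gcc_version()` on both the CPU node and the GPU node would show whether the two environments actually build with the same compiler.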
Thanks @delock
|
Did you encounter the same issue with other GCC versions? Have you tried gcc 12.3? |
I checked out the latest code and it's back to the 'stuck' phase .. I used GCC 12.2.0 |
The build paused for several seconds for me, but it never got stuck. I saw that CPU Adam got some new features last month. Hi @BacharL, did you encounter anything abnormal during the build process when you changed this kernel? |
I never encountered compilation being aborted. I used GCC 11.4.0 and now I tested on 12.2.0 |
Can you try on a device with a GPU and CUDA please? |
(I use a supercomputer with a shared FS environment. The GPU node shares the venv with the CPU node) When I run
Then when I run the code I get this error
I ran the command suggested in other solutions here to fix this issue, but it still doesn't work
Now when I log in to the GPU node and run ds_report
And
Now when I uninstall deepspeed, remove .cache/torch_extensions, and run
or sometimes it continues to this error
This is the whole story :) |
Hi @oabuhamdan - can you summarize the state of this, is there a bug that needs more debugging, or do we think this is something perhaps unique to your setup/cuda/torch/hw? |
Hi @loadams Summary: Thanks! |
Thanks for the quick summary @oabuhamdan - I'll test this on my side as well. Though I believe this runs currently in the nv-pre-compile-ops workflow, so this may be setup related? |
Thanks for continuing this thread @loadams.
Cuda 11.6 was released in January 2022. We have 12.5 now (May 2024). Second, the ds_report doesn't look fine. It says it uses CPU, and CPUAdam is not even installed.
In my previous comment, I stated that the build works when I use the CPU node, but fails on the GPU node. Appreciate your help! |
@oabuhamdan - thanks for clarifying, I forgot that our node for that wasn't using GPUs, I'll work on getting a repro and will share my results here. |
Describe the bug
I installed deepspeed with pip install deepspeed and tried to use DeepSpeedCPUAdam, but got this error
After trying one of the solutions posted to one of the issues here
I got this error
Expected behavior
DeepSpeedCPUAdam to work.
ds_report output
System info (please complete the following information):