Cuda devices unavailable or busy #73
Comments
This seems to be a common issue caused by the TensorFlow Docker container not releasing GPUs when Slurm kills the job, which leaves Slurm believing a GPU is free when it is in fact still locked by a process that has not been correctly reaped. Related issues:
This can also happen when someone manually overrides |
Is it possible to check the GPU memory usage after a job is finished or killed? |
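One rough sketch of a way to do this from outside any job (an assumption, not taken from this thread) is to query the per-GPU memory counters directly, assuming `nvidia-smi` is available on the node:

```python
# Sketch (not from the thread): report per-GPU memory usage from outside a job,
# assuming nvidia-smi is installed on the node.
import subprocess

def gpu_memory_report():
    """Return nvidia-smi's per-GPU memory usage as CSV text."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Each line looks like "1, 16130, 16160" (index, used MiB, total MiB),
    # so a GPU whose job has ended but whose memory is still near-full stands out.
    print(gpu_memory_report())
```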
Same problem: slurmstepd-dgj401: error: task_p_post_term: rmdir(/dev/cpuset/slurm_dgj401_304107/slurm304107.4294967294_0) failed Device or resource busy |
I can confirm that I also get these errors. |
I obtain the same error, and it is occurring quite frequently these days. For instance, on a single-GPU job where I am assigned GPU 1:
One can see that device 1 has no process running on it, yet all the memory is already used. It is possible to find the processes responsible for locking the GPU memory with a command like: Using this command, maybe the processes listed by Thanks in advance. |
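As a hedged sketch (the exact command used in the comment above is not preserved here), one common approach is to ask `fuser` which processes hold the NVIDIA device files; such processes may hold GPU memory without appearing in `nvidia-smi`'s process table, and `fuser` typically needs root to report processes owned by other users:

```python
# Sketch (assumption, not the command from the comment above): list PIDs holding
# the NVIDIA device files, which can reveal processes that keep GPU memory
# allocated without showing up in nvidia-smi's process table.
import glob
import subprocess

def show_pids_holding_gpus():
    devices = glob.glob("/dev/nvidia*")
    # `fuser -v` prints, per device file, the owning user, PID and command name.
    # check=False: fuser exits non-zero when no process holds a given file.
    subprocess.run(["fuser", "-v", *devices], check=False)

if __name__ == "__main__":
    show_pids_holding_gpus()
```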
Unfortunately |
I too encounter this problem. This happened to me on As I requested 8 GPUs (and on the
Apparently some phantom process is occupying some memory on The following is a minimal script that reproduces the error:

import torch
a = torch.rand(1, device="cuda:0")

Produces:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

However, if I specify any other device from Have there been any solutions so far? |
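Since the comment above notes that other devices still work, a hypothetical workaround (a sketch under that assumption, not a fix from this thread) is to probe each visible device with a tiny allocation and use the first one that is not stuck in the busy/unavailable state:

```python
# Sketch of a workaround (assumption, not from the thread): probe each visible
# CUDA device with a tiny allocation and return the first one that works.
import torch

def first_usable_device():
    for i in range(torch.cuda.device_count()):
        try:
            torch.rand(1, device=f"cuda:{i}")
            return torch.device(f"cuda:{i}")
        except RuntimeError:
            # Typically "all CUDA-capable devices are busy or unavailable"
            # when this device is still locked by a leftover process.
            continue
    raise RuntimeError("No usable CUDA device found")

device = first_usable_device()
a = torch.rand(1, device=device)
```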
@qizhuli Looks like corrupted GPU state. I raised this back in January (this is slightly different from lberrada's issue, where processes are still running and Slurm has probably failed to reap them). I think the issue is being tracked internally on ServiceNow (https://stfc.service-now.com/hartreecentre) under the reference |
My previous post described a similar issue: I had a single-GPU job on device 1 (I've edited my previous post to make this clearer). Then nvidia-smi showed no process running on device 1, yet all its memory was used. |
@lberrada @willprice Thanks for the responses. It does look like we all encountered a similar issue. How do I add myself to the watch list of |
@lberrada, you're quite right, sorry, I misread your nvidia-smi output. |
I am still encountering the error. Lately I have noticed that when trying to use the GPU, there is no process running on it, yet the GPU's memory is still allocated:
CUDA_VISIBLE_DEVICES=3 in this case. I encounter this problem in 30% of my job submissions, which is very frustrating. |
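One hedged sketch of a mitigation (an assumption on my part, not something proposed in this thread) is to fail fast when the assigned GPU already has most of its memory taken by a leftover allocation, so the job can be resubmitted instead of crashing mid-run:

```python
# Sketch (assumption, not from the thread): abort early if the GPU assigned via
# CUDA_VISIBLE_DEVICES already has most of its memory held by a leftover process.
# torch.cuda.mem_get_info requires a reasonably recent PyTorch version.
import torch

def assert_gpu_mostly_free(min_free_fraction=0.9):
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # first visible device
    if free_bytes / total_bytes < min_free_fraction:
        raise RuntimeError(
            f"Only {free_bytes / 2**20:.0f} MiB of {total_bytes / 2**20:.0f} MiB free; "
            "the assigned GPU appears to be locked by a leftover process."
        )

assert_gpu_mostly_free()
```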
Hi,
Out of 15 jobs, usually 2-5 give me the following error message:
It would be very nice if you could fix this.
Since it happens so often, I do not think this is related to any specific node; it seems to be a general problem.
However, if you need more information in order to tackle the issue, please let me know!
Best,
Philipp