
Cuda devices unavailable or busy #73

Open
henzler opened this issue Aug 10, 2018 · 12 comments

@henzler

henzler commented Aug 10, 2018

Hi,

Out of 15 jobs, usually 2-5 give me the following error message:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable

It would be very nice if you could fix this.
Since it happens so often, I do not think this is related to any specific node; it seems to be a general problem.

However, if you need more information in order to tackle the issue, please let me know!

Best,
Philipp

@willprice

This seems to be a common issue caused by the TensorFlow Docker container not releasing GPUs when Slurm kills the job. Slurm then believes a GPU is free when it is in fact still locked by a process that has not been correctly reaped.

Related issues:

This can also happen when someone manually overrides CUDA_VISIBLE_DEVICES to take over GPUs on a node that are free but not allocated to them by Slurm... I don't think there are any permission controls in place to prevent this sort of abuse.

@KaiyangZhou

Is it possible to check the GPU memory usage after a job is finished or killed?
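
One possible way to do this (just a sketch for illustration, not something from this thread; it assumes nvidia-smi is on the path, e.g. when run from a job epilogue) is to read the per-GPU used memory via nvidia-smi's CSV query mode:

import subprocess

# Sketch only: report per-GPU memory usage using nvidia-smi's CSV output.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
    text=True,
)
for line in out.strip().splitlines():
    index, used_mib = (field.strip() for field in line.split(","))
    print(f"GPU {index}: {used_mib} MiB in use")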

@agniszczotka

Same problem:

slurmstepd-dgj401: error: task_p_post_term: rmdir(/dev/cpuset/slurm_dgj401_304107/slurm304107.4294967294_0) failed Device or resource busy

@danielamassiceti

I can confirm that I also get these errors.

@lberrada

lberrada commented Apr 24, 2019

I obtain the same error, and it is occurring quite frequently these days.

For instance on a single-GPU job where I am assigned GPU 1:

Wed Apr 24 10:42:42 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    56W / 300W |  15424MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   43C    P0    69W / 300W |  14402MiB / 16152MiB |     82%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   53C    P0   181W / 300W |  15012MiB / 16152MiB |     63%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   46C    P0   140W / 300W |  14980MiB / 16152MiB |     62%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    5     42126      C   ...xr01/anaconda3/envs/allennlp/bin/python 14392MiB |
|    6     42148      C   ...xr01/anaconda3/envs/allennlp/bin/python 15002MiB |
|    7     42157      C   ...xr01/anaconda3/envs/allennlp/bin/python 14970MiB |
+-----------------------------------------------------------------------------+
Using CUDA device 1
...
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

One can see that device 1 has no process running on it, yet all the memory is already used.

It is possible to find the processes responsible for locking the GPU memory with a command like:
lsof /dev/nvidia$CUDA_VISIBLE_DEVICES (https://stackoverflow.com/a/4571250).

Using this command, perhaps the processes listed by lsof could be killed at either the start or end of each job, to make sure that the GPU memory is actually free?
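
A rough sketch of what such a cleanup check could look like (assuming lsof is available, the job can see and signal the offending processes, and a single-GPU allocation; pids_holding_gpu is just a name made up for illustration):

import os
import signal
import subprocess

# Sketch only: list PIDs that still hold the assigned /dev/nvidiaN device open.
# Uses lsof's terse mode (-t), which prints one PID per line.
# Assumes a single-GPU job, so CUDA_VISIBLE_DEVICES holds a single index.
def pids_holding_gpu(device_index):
    result = subprocess.run(
        ["lsof", "-t", f"/dev/nvidia{device_index}"],
        capture_output=True, text=True,
    )
    return [int(pid) for pid in result.stdout.split()]

stale = pids_holding_gpu(os.environ.get("CUDA_VISIBLE_DEVICES", "0"))
print("PIDs holding the GPU:", stale)
# A cleanup step could then signal them, for example:
# for pid in stale:
#     os.kill(pid, signal.SIGKILL)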

Thanks in advance.

@lberrada

Unfortunately lsof /dev/nvidia$CUDA_VISIBLE_DEVICES does not seem to work on JADE; I get an error similar to this.

@qizhuli

qizhuli commented May 21, 2019

I too encounter this problem. It happened to me on dgj210, though it may not be a node-specific problem. (Edit: it is happening on dgj113 too. Exactly the same symptom, except that this time the problematic GPU is no. 7.)

As I requested 8 GPUs (and on the big partition), there shouldn't be anyone else sharing the node with me.

nvidia-smi returns the following:

Tue May 21 14:21:33 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    72W / 300W |   1092MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   37C    P0    41W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Apparently some phantom process is occupying some memory on GPU 0.

The following is a minimal script that reproduces the error:

import torch
a = torch.rand(1, device="cuda:0")

Produces:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

However, if I specify any other device from cuda:1 to cuda:7, the script produces no error, confirming that the problem is indeed with GPU 0 in this case.
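
For what it's worth, a small probe loop (just a sketch, not part of the original report) checks all visible devices in one go:

import torch

# Sketch only: attempt a tiny allocation on each visible CUDA device and report
# which ones fail with "all CUDA-capable devices are busy or unavailable".
for i in range(torch.cuda.device_count()):
    try:
        torch.rand(1, device=f"cuda:{i}")
        print(f"cuda:{i} OK")
    except RuntimeError as err:
        print(f"cuda:{i} unavailable: {err}")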

Have there been any solutions so far?

@willprice

@qizhuli Looks like corrupted GPU state. I raised this back in January (this is slightly different from @lberrada's issue, where processes are still running and Slurm has probably failed to reap them). I think the issue is being tracked internally on ServiceNow (https://stfc.service-now.com/hartreecentre) under the reference PRB0040275. It might be worth getting in contact there and asking to be added to the watch list.

@lberrada

My previous post described a similar issue: I had a single-GPU job on device 1 (I've edited my previous post to make this clearer). nvidia-smi then showed no process running on device 1, yet all of its memory was in use.

@qizhuli

qizhuli commented May 21, 2019

@lberrada @willprice Thanks for the responses. It does look like we all encountered a similar issue.

How do I add myself to the watch list of PRB0040275?

@willprice

@lberrada, you're quite right, sorry, I misread your nvidia-smi output.
@qizhuli Perhaps contact the Hartree team through https://stfc.service-now.com/hartreecentre?id=service_catalog.

@henzler
Author

henzler commented Jul 1, 2019

I am still encountering the error. Lately I have noticed that when I try to use my assigned GPU, there is no process running on it, yet its memory is still allocated:

Mon Jul  1 11:50:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   66C    P0   253W / 300W |   9000MiB / 16152MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   66C    P0   202W / 300W |  13968MiB / 16152MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   59C    P0   266W / 300W |   9000MiB / 16152MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0    57W / 300W |   8756MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   62C    P0   292W / 300W |  12122MiB / 16152MiB |     99%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   64C    P0   207W / 300W |    746MiB / 16152MiB |     84%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   63C    P0   281W / 300W |   9000MiB / 16152MiB |     95%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     38030      C   python3                                     8990MiB |
|    1     36646      C   pmemd.cuda                                 13958MiB |
|    2     56709      C   python3                                     8990MiB |
|    4     56703      C   pmemd.cuda                                 12112MiB |
|    6     69042      C   pmemd.cuda                                   736MiB |
|    7     47998      C   python3                                     8990MiB |
+-----------------------------------------------------------------------------+

CUDA_VISIBLE_DEVICES=3 in this case.

I encounter this problem in about 30% of my job submissions, which is very frustrating.
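
The only workaround I can think of (a rough sketch, not an official fix; it assumes PyTorch and that the assigned GPU shows up as cuda:0 inside the CUDA_VISIBLE_DEVICES mask) is a pre-flight check at the top of the job script, so the job fails fast and can be resubmitted instead of crashing mid-run:

import os
import torch

# Sketch only: fail fast if the assigned GPU cannot even hold a one-element tensor.
try:
    torch.rand(1, device="cuda:0")  # index 0 within the CUDA_VISIBLE_DEVICES mask
except RuntimeError as err:
    raise SystemExit(
        "Assigned GPU unusable (CUDA_VISIBLE_DEVICES="
        f"{os.environ.get('CUDA_VISIBLE_DEVICES')}): {err}"
    )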
