
Cuda devices unavailable or busy #73

Open
henzler opened this issue Aug 10, 2018 · 12 comments

@henzler

henzler commented Aug 10, 2018

Hi,

Out of 15 jobs, usually 2-5 give me the following error message:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable

It would be very nice if you could fix this.
Since it happens so often, I do not think this is related to any specific node; it seems to be a general problem.

However, if you need more information in order to tackle the issue, please let me know!

Best,
Philipp

@willprice

This seems to be a common issue caused by the TensorFlow Docker container not releasing GPUs when Slurm kills the job. Slurm then believes a GPU is free when it is in fact still locked by a process that has not been correctly reaped.

Related issues:

This can also happen when someone manually overrides CUDA_VISIBLE_DEVICES to take over GPUs on a node that are free but not allocated to them by Slurm... I don't think there are any permission controls in place to prevent this sort of abuse.

@KaiyangZhou

Is it possible to check the GPU memory usage after a job is finished or killed?
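
One possible way to do this (just a sketch for illustration, not something from this thread; it assumes nvidia-smi is on the path, e.g. when run from a job epilogue) is to read the per-GPU used memory via nvidia-smi's CSV query mode:

import subprocess

# Sketch only: report per-GPU memory usage using nvidia-smi's CSV output.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
    text=True,
)
for line in out.strip().splitlines():
    index, used_mib = (field.strip() for field in line.split(","))
    print(f"GPU {index}: {used_mib} MiB in use")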

@agniszczotka

Same problem:

slurmstepd-dgj401: error: task_p_post_term: rmdir(/dev/cpuset/slurm_dgj401_304107/slurm304107.4294967294_0) failed Device or resource busy

@danielamassiceti

I can confirm that I also get these errors.

@lberrada

lberrada commented Apr 24, 2019

I obtain the same error, and it is occurring quite frequently these days.

For instance on a single-GPU job where I am assigned GPU 1:

Wed Apr 24 10:42:42 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    56W / 300W |  15424MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   43C    P0    69W / 300W |  14402MiB / 16152MiB |     82%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   53C    P0   181W / 300W |  15012MiB / 16152MiB |     63%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   46C    P0   140W / 300W |  14980MiB / 16152MiB |     62%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    5     42126      C   ...xr01/anaconda3/envs/allennlp/bin/python 14392MiB |
|    6     42148      C   ...xr01/anaconda3/envs/allennlp/bin/python 15002MiB |
|    7     42157      C   ...xr01/anaconda3/envs/allennlp/bin/python 14970MiB |
+-----------------------------------------------------------------------------+
Using CUDA device 1
...
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

One can see that device 1 has no process running on it, yet all the memory is already used.

It is possible to find the processes responsible for locking the GPU memory with a command like:
lsof /dev/nvidia$CUDA_VISIBLE_DEVICES (https://stackoverflow.com/a/4571250).

Using this command, perhaps the processes listed by lsof could be killed at either the start or end of each job, to make sure that the GPU memory is actually free?
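
A rough sketch of what such a cleanup check could look like (assuming lsof is available, the job can see and signal the offending processes, and a single-GPU allocation; pids_holding_gpu is just a name made up for illustration):

import os
import signal
import subprocess

# Sketch only: list PIDs that still hold the assigned /dev/nvidiaN device open.
# Uses lsof's terse mode (-t), which prints one PID per line.
# Assumes a single-GPU job, so CUDA_VISIBLE_DEVICES holds a single index.
def pids_holding_gpu(device_index):
    result = subprocess.run(
        ["lsof", "-t", f"/dev/nvidia{device_index}"],
        capture_output=True, text=True,
    )
    return [int(pid) for pid in result.stdout.split()]

stale = pids_holding_gpu(os.environ.get("CUDA_VISIBLE_DEVICES", "0"))
print("PIDs holding the GPU:", stale)
# A cleanup step could then signal them, for example:
# for pid in stale:
#     os.kill(pid, signal.SIGKILL)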

Thanks in advance.

@lberrada

Unfortunately lsof /dev/nvidia$CUDA_VISIBLE_DEVICES does not seem to work on JADE; I get an error similar to this.

@qizhuli

qizhuli commented May 21, 2019

I too encounter this problem. It happened to me on dgj210, though it may not be a node-specific problem. (Edit: it is happening on dgj113 too. Exactly the same symptom, except that this time the problematic GPU is no. 7.)

As I requested 8 GPUs (and on the big partition), there shouldn't be anyone else sharing the node with me.

nvidia-smi returns the following:

Tue May 21 14:21:33 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    72W / 300W |   1092MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   37C    P0    41W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Apparently some phantom process is occupying some memory on GPU 0.

The following is a minimal script that reproduces the error:

import torch
a = torch.rand(1, device="cuda:0")

Produces:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

However, if I specify any other device from cuda:1 to cuda:7, the script produces no error, confirming that the problem is indeed with GPU 0 in this case.
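
For what it's worth, a small probe loop (just a sketch, not part of the original report) checks all visible devices in one go:

import torch

# Sketch only: attempt a tiny allocation on each visible CUDA device and report
# which ones fail with "all CUDA-capable devices are busy or unavailable".
for i in range(torch.cuda.device_count()):
    try:
        torch.rand(1, device=f"cuda:{i}")
        print(f"cuda:{i} OK")
    except RuntimeError as err:
        print(f"cuda:{i} unavailable: {err}")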

Have there been any solutions so far?

@willprice

@qizhuli Looks like corrupted GPU state. I raised this back in January (this is slightly different from @lberrada's issue, where processes are still running and Slurm has probably failed to reap them). I think the issue is being tracked internally on ServiceNow (https://stfc.service-now.com/hartreecentre) under the reference PRB0040275. It might be worth getting in contact there and asking to be added to the watch list.

@lberrada

My previous post described a similar issue: I had a single-GPU job on device 1 (I've edited my previous post to make this clearer). nvidia-smi then showed no process running on device 1, yet all of its memory was in use.

@qizhuli

qizhuli commented May 21, 2019

@lberrada @willprice Thanks for the responses. It does look like we all encountered a similar issue.

How do I add myself to the watch list of PRB0040275?

@willprice

@lberrada, you're quite right, sorry, I misread your nvidia-smi output.
@qizhuli Perhaps contact the Hartree team through https://stfc.service-now.com/hartreecentre?id=service_catalog.

@henzler
Author

henzler commented Jul 1, 2019

I am still encountering the error. Lately I have noticed that when I try to use my assigned GPU, there is no process running on it, yet its memory is still allocated:

Mon Jul  1 11:50:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   66C    P0   253W / 300W |   9000MiB / 16152MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   66C    P0   202W / 300W |  13968MiB / 16152MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   59C    P0   266W / 300W |   9000MiB / 16152MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0    57W / 300W |   8756MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   62C    P0   292W / 300W |  12122MiB / 16152MiB |     99%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     10MiB / 16152MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   64C    P0   207W / 300W |    746MiB / 16152MiB |     84%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   63C    P0   281W / 300W |   9000MiB / 16152MiB |     95%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     38030      C   python3                                     8990MiB |
|    1     36646      C   pmemd.cuda                                 13958MiB |
|    2     56709      C   python3                                     8990MiB |
|    4     56703      C   pmemd.cuda                                 12112MiB |
|    6     69042      C   pmemd.cuda                                   736MiB |
|    7     47998      C   python3                                     8990MiB |
+-----------------------------------------------------------------------------+

CUDA_VISIBLE_DEVICES=3 in this case.

I encounter this problem in about 30% of my job submissions, which is very frustrating.
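
The only workaround I can think of (a rough sketch, not an official fix; it assumes PyTorch and that the assigned GPU shows up as cuda:0 inside the CUDA_VISIBLE_DEVICES mask) is a pre-flight check at the top of the job script, so the job fails fast and can be resubmitted instead of crashing mid-run:

import os
import torch

# Sketch only: fail fast if the assigned GPU cannot even hold a one-element tensor.
try:
    torch.rand(1, device="cuda:0")  # index 0 within the CUDA_VISIBLE_DEVICES mask
except RuntimeError as err:
    raise SystemExit(
        "Assigned GPU unusable (CUDA_VISIBLE_DEVICES="
        f"{os.environ.get('CUDA_VISIBLE_DEVICES')}): {err}"
    )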
