XLA error when running on GPU #14480

Closed
mehdiataei opened this issue Feb 15, 2023 · 4 comments

mehdiataei (Contributor) commented Feb 15, 2023

Description

With jax/jaxlib 0.4.3, I get the following error whenever I run any jitted function on the GPU.

2023-02-15 05:44:23.038563: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:429] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2023-02-15 05:44:23.039220: E external/org_tensorflow/tensorflow/compiler/xla/status_macros.cc:57] INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:626) dnn != nullptr 
*** Begin stack trace ***

The problem goes away with jaxlib 0.4.2.
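
For reference, a minimal reproduction might look like the sketch below. The original report does not include its code, so the function name `double` and the input array are purely illustrative assumptions. Per the description, any jitted function should behave the same way on an affected setup.

```python
import jax
import jax.numpy as jnp


@jax.jit
def double(x):
    # Any jitted function should do; this one is a hypothetical stand-in.
    return x * 2.0


# Show which backend JAX picked up (GPU devices are listed if one is visible).
print(jax.devices())

# The first call triggers XLA compilation for the selected backend.
y = double(jnp.arange(4.0))
print(y)
```

On a working installation this prints the device list and the doubled array. On the affected setup, the error above would be expected at the first jitted call, when XLA compiles the function for the GPU.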

Ubuntu 22.04, CUDA 11.8, CuDNN 8.6
GPU: A6000 (I have a dual-GPU configuration, but I use only one of the GPUs with JAX)

What jax/jaxlib version are you using?

jax 0.4.3/jaxlib 0.4.3

Which accelerator(s) are you using?

GPU

Additional system info

Python 3.10

NVIDIA GPU info

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:41:00.0 Off |                  Off |
| 30%   45C    P0    80W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:61:00.0 Off |                  Off |
| 34%   56C    P0    38W / 140W |      0MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

mehdiataei added the "bug" (Something isn't working) label on Feb 15, 2023
mjsML added the "NVIDIA GPU" (Issues specific to NVIDIA GPUs) label on Feb 15, 2023

nouiz (Collaborator) commented Feb 15, 2023

How did you install jaxlib and cuDNN?
You could have a version mismatch. Maybe you are using a jaxlib build that expects a different cuDNN version?
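
One way to check for such a mismatch is to list the installed package versions. The following is a minimal sketch that assumes a pip-based install. The distribution name `nvidia-cudnn-cu11` is an assumption that only applies when cuDNN came from the pip wheel; a system-wide cuDNN install will not show up here and has to be checked separately.

```python
from importlib import metadata

# Distribution names to look up. The cuDNN wheel name is an assumption and
# only applies to pip-based cuDNN installs.
PACKAGES = ("jax", "jaxlib", "nvidia-cudnn-cu11")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or None if not found}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions()
for name, version in report.items():
    print(f"{name}: {version or 'not installed via pip'}")
```

Posting this output alongside the install command makes it easier to tell whether jaxlib and cuDNN come from matching sources.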

hawkinsp (Member) commented

I would have expected CuDNN 8.6 to work. Can you try a newer CuDNN, e.g., 8.8?

mehdiataei (Contributor, Author) commented

OK, this is fixed with cuDNN 8.8 👍 I'm not sure whether there was an issue with my cuDNN installation, so 8.6 might also work if I reinstall it. I may try that a bit later.

mjsML (Collaborator) commented Feb 16, 2023

@mehdiataei is the issue fixed so we can close the bug?
