pin cuda-nvcc to get a better working jaxlib from conda-forge #12776

Closed
ngam opened this issue Oct 12, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@ngam
Contributor

ngam commented Oct 12, 2022

Description

cc @hawkinsp

xref pangeo-data/pangeo-docker-images#387

import jax.numpy as jnp
from jax import grad, jit, vmap
from jax import random

key = random.PRNGKey(0)

fails with

2022-10-12 15:16:32.114395: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_asm_compiler.cc:57] cuLinkAddData fails. This is usually caused by stale driver version.
2022-10-12 15:16:32.114450: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:1325] The CUDA linking API did not work. Please use XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1 to bypass it, but expect to get longer compilation time due to the lack of multi-threading.
...
XlaRuntimeError: UNKNOWN: no kernel image is available for execution on the device

The failure occurs with cuda-nvcc>11.6.* installed.

What jax/jaxlib version are you using?

jaxlib 0.3.15 (cuda)

Which accelerator(s) are you using?

GPU

Additional system info

No response

NVIDIA GPU info

Tested on K80 and T4.

@ngam ngam added the bug Something isn't working label Oct 12, 2022
@hawkinsp
Collaborator

hawkinsp commented Oct 12, 2022

We have a planned fix for this, which boils down to: check the driver version, and if it is older than the ptxas version, disable parallel compilation.

The issue arises when your ptxas version is newer than the installed driver.
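The version check described above can be sketched in Python. This is a minimal illustration, not JAX's actual fix, and the helper names are hypothetical. It assumes the driver version is the integer returned by the CUDA driver API's `cuDriverGetVersion` (which encodes CUDA 11.4 as 11040, i.e. 1000*major + 10*minor) and that the ptxas version is read from the text printed by `ptxas --version`, which contains a line like `Cuda compilation tools, release 11.8, V11.8.89`. How those two values are obtained on a real machine is left out.

```python
import re


def cuda_version_from_driver_int(encoded: int) -> tuple[int, int]:
    """Decode a cuDriverGetVersion-style integer (1000*major + 10*minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def ptxas_release(version_output: str) -> tuple[int, int]:
    """Extract (major, minor) from `ptxas --version` text.

    Assumes a line such as "Cuda compilation tools, release 11.8, V11.8.89".
    """
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a CUDA release in ptxas output")
    return int(match.group(1)), int(match.group(2))


def should_disable_parallel_compilation(driver_encoded: int, ptxas_output: str) -> bool:
    """True when the driver's CUDA version is older than the ptxas release."""
    return cuda_version_from_driver_int(driver_encoded) < ptxas_release(ptxas_output)


sample = "Cuda compilation tools, release 11.8, V11.8.89"
# Driver supports CUDA 11.4, ptxas is 11.8: the driver is older, so disable.
print(should_disable_parallel_compilation(11040, sample))  # prints True
```

Comparing `(major, minor)` tuples keeps the ordering correct across major versions (for example 11.8 vs 12.0) without any string handling.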

@ngam
Contributor Author

ngam commented Oct 12, 2022

Thanks for the speedy response. Yep, that seems to be it (see xref pangeo-data/pangeo-docker-images#387).

@ngam
Contributor Author

ngam commented Oct 12, 2022

import os
# Set this before importing jax so XLA sees the flag when the GPU backend starts.
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"

also resolves this issue; closing!
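Assigning `os.environ["XLA_FLAGS"]` directly replaces any XLA flags that were already set. A minimal sketch of a variant that appends the flag instead, so existing flags survive; `add_xla_flag` is a hypothetical helper, not part of JAX. As the XLA error message warns, expect longer compilation times with this flag because compilation is no longer multi-threaded.

```python
import os

FORCE_SERIAL_FLAG = "--xla_gpu_force_compilation_parallelism=1"


def add_xla_flag(flag: str, environ=os.environ) -> str:
    """Append `flag` to XLA_FLAGS (hypothetical helper), keeping existing flags.

    Adding the same flag twice is a no-op.
    """
    current = environ.get("XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    environ["XLA_FLAGS"] = " ".join(current)
    return environ["XLA_FLAGS"]


# Call this before `import jax` so the flag is in place when XLA reads it.
add_xla_flag(FORCE_SERIAL_FLAG)
# import jax.numpy as jnp  # import jax only after XLA_FLAGS is set
```

Passing a plain `dict` as `environ` makes the helper easy to check without touching the real process environment.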
