
Failed to launch CUDA kernel when multiplying bool matrices with large batch size. #16286

Closed
BillHuang2001 opened this issue Jun 7, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@BillHuang2001

Description

Here is a minimal example that reproduces the bug.
Tested using a single RTX 3090.

import jax
from jax import jit, vmap
import jax.numpy as jnp

@jit
def f(adj, mat):
    # Row-normalize the product: divide each row of adj @ mat by the row sums of adj.
    return adj @ mat / jnp.sum(adj, axis=1)[:, jnp.newaxis]

# A large batch of boolean adjacency matrices and float matrices.
adj = jnp.ones((1024 * 100, 10, 10), dtype=bool)
mat = jnp.ones((1024 * 100, 10, 100), dtype=float)

jax.jit(vmap(f))(adj, mat)

Running this code produces the following error:

Traceback (most recent call last):
  File "/****/bug.py", line 12, in <module>
    jax.jit(vmap(f))(adj, mat)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to launch CUDA kernel: triton_gemm_dot_0 with block dimensions: 128x1x1 and grid dimensions: 4x1x102400 and shared memory size: 65536: CUDA_ERROR_INVALID_VALUE: invalid argument

Here adj is an adjacency matrix of type bool and mat is just a random matrix.
Casting adj to float, or avoiding @ in favor of a combination of vmap and jnp.sum, works around the problem.
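As a sketch of the first workaround mentioned above (casting adj to float before the matmul, so no bool GEMM is emitted); the batch size is shrunk here purely for illustration:

```python
import jax
from jax import jit, vmap
import jax.numpy as jnp

@jit
def f(adj, mat):
    # Row-normalize the product: divide each row of adj @ mat by the row sums of adj.
    return adj @ mat / jnp.sum(adj, axis=1)[:, jnp.newaxis]

# Small batch for illustration; the original report used 1024 * 100.
adj = jnp.ones((4, 10, 10), dtype=bool)
mat = jnp.ones((4, 10, 100), dtype=float)

# Workaround: cast the bool adjacency matrix to mat's float dtype up front.
out = jax.jit(vmap(f))(adj.astype(mat.dtype), mat)
print(out.shape)  # (4, 10, 100)
```

Numerically this is identical to the bool version, since matmul promotes bool operands to float anyway; the cast only changes which code path XLA lowers the contraction to.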

What jax/jaxlib version are you using?

jax v0.4.11, jaxlib 0.4.11+cuda12.cudnn88

Which accelerator(s) are you using?

GPU

Additional system info

Python 3.10, Linux

NVIDIA GPU info

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         Off| 00000000:17:00.0 Off |                  N/A |
| 35%   36C    P8               27W / 350W|     19MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         Off| 00000000:B3:00.0 Off |                  N/A |
| 37%   44C    P8               21W / 350W|      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
@hawkinsp (Member)

Thanks for the report, I filed an XLA bug.

If you need a workaround until it is fixed, try setting the environment variable XLA_FLAGS=--xla_gpu_enable_triton_gemm=false
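One way to apply that flag is to export it in the shell before launching the script (the script name bug.py here is just illustrative):

```shell
# Disable XLA's Triton GEMM emitter for this session (workaround suggested
# above); remove once a fixed jaxlib release is installed.
export XLA_FLAGS=--xla_gpu_enable_triton_gemm=false

# Then run the reproduction as usual, e.g.: python bug.py
echo "$XLA_FLAGS"
```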

@hawkinsp (Member)

openxla/xla#3530 fixed this, and should be in the next jaxlib release.
