Description
The CUDA Gather kernel crashes with cudaErrorIllegalAddress (error 700) when the input tensor has more than 2^31 (~2.15 billion) elements and the gather index references a row whose element offset exceeds INT32_MAX.
Root Cause
In gather_impl.cu, the input_index variable is typed as CUDA_LONG, which is int32_t:
CUDA_LONG input_index = 0; // int32_t — overflows for large tensors!
// ...
input_index = input_block_index * input_block_size + idx * block_size.d_ + offset;
output_data[id] = input_data[input_index]; // illegal memory access when input_index overflows
When idx * block_size.d_ + offset exceeds INT32_MAX (2,147,483,647), the result no longer fits in int32_t. The stored index wraps around, typically to a negative value, and the read from input_data is then out of bounds.
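The wraparound can be reproduced on the host without a GPU. The sketch below is an illustration, not the kernel code: `to_int32` and `gather_input_index` are hypothetical helpers that mimic storing the computed offset in a 32-bit signed integer, and the sketch assumes axis=0 so that the `input_block_index * input_block_size` term is zero.

```python
import ctypes


def to_int32(value: int) -> int:
    """Truncate a Python int to a 32-bit signed value, as an int32_t store would."""
    return ctypes.c_int32(value).value


def gather_input_index(idx: int, block_size: int, offset: int) -> tuple[int, int]:
    """Return (exact offset, offset after int32 truncation) for one element.

    Assumption: axis=0, so the input_block_index * input_block_size term is 0.
    """
    exact = idx * block_size + offset
    return exact, to_int32(exact)


# Last element of rows 239673 and 239674 in a [262144, 8960] tensor.
for row in (239673, 239674):
    exact, wrapped = gather_input_index(row, 8960, 8959)
    status = "ok" if exact == wrapped else "wrapped"
    print(row, exact, wrapped, status)
```

Under these assumptions, row 239673 keeps its offset unchanged, while the last element of row 239674 maps to a negative index.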
Exact Boundary
For a [262144, 8960] float32 tensor (2,348,810,240 elements, ~9.4 GB):
| Row index | Last element offset | Fits int32? | Result |
|---|---|---|---|
| 239,673 | 2,147,479,039 | ✅ Yes | OK |
| 239,674 | 2,147,487,999 | ❌ No (> 2,147,483,647) | CRASH |
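The boundary rows follow directly from the row width. As a quick check, here is the arithmetic in plain Python; the variable names are illustrative and not taken from the ONNX Runtime source.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647
ROWS, COLS = 262144, 8960

# Largest row whose last element offset (row * COLS + COLS - 1) still fits in int32.
last_safe_row = (INT32_MAX - (COLS - 1)) // COLS

# In the next row, the first column whose offset exceeds INT32_MAX.
first_bad_row = last_safe_row + 1
first_bad_col = INT32_MAX + 1 - first_bad_row * COLS

print(ROWS * COLS)                   # 2348810240
print(last_safe_row)                 # 239673
print(first_bad_row, first_bad_col)  # 239674 4608
```

By this arithmetic, row 239,674 is only partially affected: columns 0 to 4607 still have offsets that fit in int32, and columns from 4608 onward do not.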
Minimal Reproduction
import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort
# Build minimal model: Gather(data=[262144, 8960], indices=[N], axis=0)
data = helper.make_tensor_value_info("data", TensorProto.FLOAT, [262144, 8960])
indices = helper.make_tensor_value_info("indices", TensorProto.INT64, ["N"])
output = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["N", 8960])
gather_node = helper.make_node("Gather", inputs=["data", "indices"], outputs=["output"], axis=0)
graph = helper.make_graph([gather_node], "gather_int32_overflow", [data, indices], [output])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
onnx.save(model, "/tmp/gather_repro.onnx")
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
sess = ort.InferenceSession(
    "/tmp/gather_repro.onnx", opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
assert "CUDAExecutionProvider" in sess.get_providers()
data_np = np.zeros((262144, 8960), dtype=np.float32)
# Row 239673: last element offset fits int32 — OK
out = sess.run(None, {"data": data_np, "indices": np.array([239673], dtype=np.int64)})
print("Row 239673: OK")
# Row 239674: last element offset overflows int32 — CRASH
out = sess.run(None, {"data": data_np, "indices": np.array([239674], dtype=np.int64)})
print("Row 239674: OK")  # never reached
Requires: ~9.4 GB GPU memory for the input tensor. Set CUDA_LAUNCH_BLOCKING=1 to get synchronous error reporting.
Real-World Impact
This bug affects the Gemma 4 multimodal model, which has a per-layer embedding table of shape [262144, 8960]. Token IDs for image boundary markers (<|image> = 255,999; <image|> = 258,882) are above the overflow threshold, causing the model to crash on CUDA during vision inference.
The same model works correctly on CPU.
Suggested Fix
Change input_index from CUDA_LONG (int32_t) to int64_t in gather_impl.cu:
- CUDA_LONG input_index = 0;
+ int64_t input_index = 0;
This is the same pattern used in other kernels that handle large tensors. The N parameter (total output elements) and id variable should also be checked for similar overflow potential when output tensors exceed 2^31 elements.
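For the output side, a rough way to see when N or id could hit the same limit is to compute the output element count for a given number of gathered rows. This is only arithmetic under the assumption that those values are held in 32-bit signed integers, which has not been verified here; `output_count_overflows_int32` is a hypothetical helper name.

```python
INT32_MAX = 2**31 - 1


def output_count_overflows_int32(num_indices: int, block_size: int) -> bool:
    """True if the total output element count would not fit in a signed 32-bit int."""
    return num_indices * block_size > INT32_MAX


# Smallest number of gathered rows of width 8960 that pushes the count past INT32_MAX.
block_size = 8960
min_rows = INT32_MAX // block_size + 1

print(min_rows)                                                # 239675
print(output_count_overflows_int32(min_rows, block_size))      # True
print(output_count_overflows_int32(min_rows - 1, block_size))  # False
```

So for this embedding width, a single Gather call would need roughly 240 thousand indices before the output side becomes a concern.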
Environment
- onnxruntime-gpu: 1.24.4 (build 2d92497)
- CUDA: 13.0
- cuDNN: 9.x
- GPU: NVIDIA (driver 580.105.08)
- OS: Linux