CUDA Gather kernel crashes with illegal memory access on tensors with >2^31 elements (int32 overflow) #28107

@justinchuby

Description

The CUDA Gather kernel crashes with cudaErrorIllegalAddress (error 700) when the input tensor has more than 2^31 (~2.15 billion) elements and the gather index references a row whose element offset exceeds INT32_MAX.

Root Cause

In gather_impl.cu, the input_index variable is typed as CUDA_LONG which is int32_t:

CUDA_LONG input_index = 0;   // int32_t — overflows for large tensors!
// ...
input_index = input_block_index * input_block_size + idx * block_size.d_ + offset;
output_data[id] = input_data[input_index];  // illegal memory access when input_index overflows

When input_block_index * input_block_size + idx * block_size.d_ + offset exceeds INT32_MAX, the result no longer fits in int32_t. The computation then yields a negative or wrapped-around index, which causes an out-of-bounds memory access.
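The wrap-around can be illustrated on the host with plain integer arithmetic. This is a minimal sketch, not the kernel code: `wrap_int32` is a hypothetical helper that models the usual two's-complement wrap (signed overflow is formally undefined behavior in C++, so the actual value on a given build may differ), and it assumes the `input_block_index * input_block_size` term is zero for the axis=0 case in this report.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce a Python int to the value a 32-bit two's-complement integer would hold."""
    return (value + 2**31) % 2**32 - 2**31


# Values taken from the report: Gather on a [262144, 8960] tensor along axis 0.
block_size = 8960   # elements per row
idx = 239674        # gathered row index
offset = 8959       # last column of that row

# Assumption: the input_block_index * input_block_size term is 0 here.
true_index = idx * block_size + offset
wrapped_index = wrap_int32(true_index)

print(f"true index:    {true_index}")     # 2147487999, larger than INT32_MAX
print(f"wrapped index: {wrapped_index}")  # -2147479297 under two's-complement wrap
```

Under that assumption the kernel would read roughly 2.1 billion elements before the start of the buffer, which is consistent with the reported illegal memory access.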

Exact Boundary

For a [262144, 8960] float32 tensor (2,348,810,240 elements, ~9.4 GB):

| Row index | Last element offset | Fits int32? | Result |
|---|---|---|---|
| 239,673 | 2,147,479,039 | ✅ Yes | OK |
| 239,674 | 2,147,487,999 | ❌ No (> 2,147,483,647) | CRASH |
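The boundary can be derived from the row width alone. The sketch below uses hypothetical helper names (`last_element_offset`, `last_safe_row`) and assumes a row-major 2-D tensor gathered along axis 0, as in this report.

```python
INT32_MAX = 2**31 - 1


def last_element_offset(row: int, num_cols: int) -> int:
    """Flat offset of the final element in `row` of a row-major 2-D tensor."""
    return row * num_cols + (num_cols - 1)


def last_safe_row(num_cols: int) -> int:
    """Largest row index whose final element offset still fits in int32."""
    return (INT32_MAX - (num_cols - 1)) // num_cols


num_cols = 8960
safe = last_safe_row(num_cols)
print(safe)                                      # 239673
print(last_element_offset(safe, num_cols))       # 2147479039
print(last_element_offset(safe + 1, num_cols))   # 2147487999
```

Any gather index greater than `last_safe_row(num_cols)` touches at least one element whose offset does not fit in int32.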

Minimal Reproduction

import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# Build minimal model: Gather(data=[262144, 8960], indices=[N], axis=0)
data = helper.make_tensor_value_info("data", TensorProto.FLOAT, [262144, 8960])
indices = helper.make_tensor_value_info("indices", TensorProto.INT64, ["N"])
output = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["N", 8960])

gather_node = helper.make_node("Gather", inputs=["data", "indices"], outputs=["output"], axis=0)
graph = helper.make_graph([gather_node], "gather_int32_overflow", [data, indices], [output])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
onnx.save(model, "/tmp/gather_repro.onnx")

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

sess = ort.InferenceSession(
    "/tmp/gather_repro.onnx", opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
assert "CUDAExecutionProvider" in sess.get_providers()

data_np = np.zeros((262144, 8960), dtype=np.float32)

# Row 239673: last element offset fits int32 — OK
out = sess.run(None, {"data": data_np, "indices": np.array([239673], dtype=np.int64)})
print("Row 239673: OK")

# Row 239674: last element offset overflows int32 — CRASH
out = sess.run(None, {"data": data_np, "indices": np.array([239674], dtype=np.int64)})
print("Row 239674: OK")  # never reached

Requires: ~9.4 GB GPU memory for the input tensor. Set CUDA_LAUNCH_BLOCKING=1 to get synchronous error reporting.

Real-World Impact

This bug affects the Gemma 4 multimodal model, which has a per-layer embedding table of shape [262144, 8960]. Token IDs for image boundary markers (<|image> = 255,999; <image|> = 258,882) are above the overflow threshold, causing the model to crash on CUDA during vision inference.

The same model works correctly on CPU.

Suggested Fix

Change input_index from CUDA_LONG (int32_t) to int64_t in gather_impl.cu:

- CUDA_LONG input_index = 0;
+ int64_t input_index = 0;

This is the same pattern used in other kernels that handle large tensors. The N parameter (total output elements) and id variable should also be checked for similar overflow potential when output tensors exceed 2^31 elements.
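To judge whether the output side (N and id) is also at risk for a given call, the output element count can be checked before running the model. This is a rough sketch with hypothetical helper names; it relies only on the ONNX Gather shape rule (output shape is `data.shape[:axis] + indices.shape + data.shape[axis+1:]`) and assumes a non-negative `axis`.

```python
import math

INT32_MAX = 2**31 - 1


def gather_output_shape(data_shape, indices_shape, axis=0):
    """ONNX Gather output shape for a non-negative axis."""
    return tuple(data_shape[:axis]) + tuple(indices_shape) + tuple(data_shape[axis + 1:])


def output_exceeds_int32(data_shape, indices_shape, axis=0):
    """True if the largest flat output index would not fit in int32."""
    numel = math.prod(gather_output_shape(data_shape, indices_shape, axis))
    return numel - 1 > INT32_MAX


# With 8960-wide rows, the output stays indexable with int32 up to 239,674 gathered rows.
print(output_exceeds_int32((262144, 8960), (239674,)))  # False
print(output_exceeds_int32((262144, 8960), (239675,)))  # True
```

For the embedding lookup described above, the number of gathered rows is far below this limit, so the input-side index is the one that overflows in practice.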

Environment

  • onnxruntime-gpu: 1.24.4 (build 2d92497)
  • CUDA: 13.0
  • cuDNN: 9.x
  • GPU: NVIDIA (driver 580.105.08)
  • OS: Linux

Metadata

Labels

ep:CUDA (issues related to the CUDA execution provider)
