CUDA Gather kernel crashes with illegal memory access on tensors with >2^31 elements (int32 overflow) #28107

## Description

The CUDA `Gather` kernel crashes with `cudaErrorIllegalAddress` (error 700) when the input tensor has more than 2^31 (~2.15 billion) elements and the gather index references a row whose element offset exceeds `INT32_MAX`.

## Root Cause

In [`gather_impl.cu`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/tensor/gather_impl.cu#L38-L51), the `input_index` variable is declared as `CUDA_LONG`, which is [`int32_t`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/cu_inc/common.cuh#L677):

```cuda
CUDA_LONG input_index = 0;   // int32_t — overflows for large tensors!
// ...
input_index = input_block_index * input_block_size + idx * block_size.d_ + offset;
output_data[id] = input_data[input_index];  // illegal memory access when input_index overflows
```

When the full offset `input_block_index * input_block_size + idx * block_size.d_ + offset` exceeds `INT32_MAX`, it no longer fits in `int32_t`. The stored index becomes negative or otherwise wrapped around, and the subsequent read from `input_data` is out of bounds.
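The wrap-around can be illustrated on the host without a GPU. The sketch below is not the kernel code: it assumes that narrowing to `int32_t` keeps the low 32 bits (the usual two's-complement behaviour) and emulates that with `ctypes.c_int32`. The helper name `store_as_int32` is hypothetical, and `input_block_index` is assumed to be 0 for simplicity.

```python
import ctypes

INT32_MAX = 2**31 - 1


def store_as_int32(value: int) -> int:
    """Emulate storing a wide integer in an int32_t by keeping the low 32 bits."""
    return ctypes.c_int32(value).value


# Values for the last element of row 239674 in a [262144, 8960] tensor.
input_block_index = 0              # assumption: a single input block
input_block_size = 262144 * 8960
idx = 239674
block_size = 8960
offset = block_size - 1            # last column of the row

true_index = input_block_index * input_block_size + idx * block_size + offset
narrowed_index = store_as_int32(true_index)

print(true_index)       # 2147487999
print(narrowed_index)   # -2147479297
```

Under these assumptions the stored index is a large negative number, which is consistent with the illegal address reported by the kernel.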

## Exact Boundary

For a `[262144, 8960]` float32 tensor (2,348,810,240 elements, ~9.4 GB):

| Row index | Last element offset | Fits int32? | Result |
|-----------|-------------------|-------------|--------|
| 239,673 | 2,147,479,039 | ✅ Yes | OK |
| 239,674 | 2,147,487,999 | ❌ No (> 2,147,483,647) | **CRASH** |
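The boundary in the table follows directly from the row width. The following sketch recomputes it for a row width of 8960; the function name `last_safe_row` is introduced here for illustration only.

```python
INT32_MAX = 2**31 - 1


def last_safe_row(cols: int) -> int:
    """Largest row index whose last element offset still fits in int32."""
    return (INT32_MAX - (cols - 1)) // cols


cols = 8960
safe_row = last_safe_row(cols)
first_bad_row = safe_row + 1

# Inside the first bad row, columns up to this one still have offsets <= INT32_MAX.
last_safe_col = INT32_MAX - first_bad_row * cols

print(safe_row, safe_row * cols + cols - 1)            # 239673 2147479039
print(first_bad_row, first_bad_row * cols + cols - 1)  # 239674 2147487999
print(last_safe_col)                                   # 4607
```

Row 239,674 is therefore only partially affected: its first 4,608 columns still have offsets that fit in `int32_t`, and the remaining columns do not.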

## Minimal Reproduction

```python
import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# Build minimal model: Gather(data=[262144, 8960], indices=[N], axis=0)
data = helper.make_tensor_value_info("data", TensorProto.FLOAT, [262144, 8960])
indices = helper.make_tensor_value_info("indices", TensorProto.INT64, ["N"])
output = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["N", 8960])

gather_node = helper.make_node("Gather", inputs=["data", "indices"], outputs=["output"], axis=0)
graph = helper.make_graph([gather_node], "gather_int32_overflow", [data, indices], [output])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
onnx.save(model, "/tmp/gather_repro.onnx")

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

sess = ort.InferenceSession(
    "/tmp/gather_repro.onnx", opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
assert "CUDAExecutionProvider" in sess.get_providers()

data_np = np.zeros((262144, 8960), dtype=np.float32)

# Row 239673: last element offset fits int32, so this call succeeds
sess.run(None, {"data": data_np, "indices": np.array([239673], dtype=np.int64)})
print("Row 239673: OK")

# Row 239674: last element offset overflows int32, so this call crashes
sess.run(None, {"data": data_np, "indices": np.array([239674], dtype=np.int64)})
print("Row 239674: OK")  # never reached
```

**Requires:** ~9.4 GB GPU memory for the input tensor. Set `CUDA_LAUNCH_BLOCKING=1` to get synchronous error reporting.
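As a convenience, the memory figure and the environment variable can be handled from Python before running the reproduction. This is a small sketch, assuming the variable is set before `onnxruntime` is imported so that it is in place before CUDA is initialised.

```python
import os

# Set this before importing onnxruntime so it is visible when CUDA initialises.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

rows, cols = 262144, 8960
bytes_per_element = 4  # float32
required_bytes = rows * cols * bytes_per_element

print(f"{required_bytes / 1e9:.2f} GB")  # 9.40 GB
```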

## Real-World Impact

This bug affects the Gemma 4 multimodal model, which has a per-layer embedding table of shape `[262144, 8960]`. Token IDs for image boundary markers (`<|image>` = 255,999; `<image|>` = 258,882) are above the overflow threshold, causing the model to crash on CUDA during vision inference.

The same model works correctly on CPU.
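A quick check confirms that both marker tokens fall in the affected range. This sketch uses only the token IDs and the row width quoted above; the function name `row_overflows_int32` is hypothetical.

```python
INT32_MAX = 2**31 - 1
ROW_WIDTH = 8960

marker_tokens = {"<|image>": 255_999, "<image|>": 258_882}


def row_overflows_int32(token_id: int, row_width: int = ROW_WIDTH) -> bool:
    """True if the last element offset of this embedding row exceeds INT32_MAX."""
    return token_id * row_width + (row_width - 1) > INT32_MAX


affected = {name: row_overflows_int32(tid) for name, tid in marker_tokens.items()}
print(affected)  # {'<|image>': True, '<image|>': True}
```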

## Suggested Fix

Change `input_index` from `CUDA_LONG` (`int32_t`) to `int64_t` in `gather_impl.cu`:

```diff
- CUDA_LONG input_index = 0;
+ int64_t input_index = 0;
```

This is the same pattern used in other kernels that handle large tensors. The `N` parameter (total output elements) and the `id` variable should also be checked for the same overflow, which could occur when the output tensor exceeds 2^31 elements.
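For the output side, the size at which `N` and `id` would stop fitting in `int32_t` can be estimated the same way. The sketch below only does the arithmetic for a row width of 8960; whether the kernel actually fails at that size is not established in this report, and the function name `output_count_fits_int32` is hypothetical.

```python
INT32_MAX = 2**31 - 1


def output_count_fits_int32(num_indices: int, row_width: int) -> bool:
    """True if the total number of output elements is representable as int32."""
    return num_indices * row_width <= INT32_MAX


row_width = 8960
max_rows = INT32_MAX // row_width  # most gathered rows whose output count still fits

print(max_rows)                                          # 239674
print(output_count_fits_int32(max_rows, row_width))      # True
print(output_count_fits_int32(max_rows + 1, row_width))  # False
```

Gathering 239,675 or more rows in a single call would therefore produce an output with more elements than `int32_t` can count.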

## Environment

- **onnxruntime-gpu**: 1.24.4 (build 2d924974ef)
- **CUDA**: 13.0
- **cuDNN**: 9.x
- **GPU**: NVIDIA (driver 580.105.08)
- **OS**: Linux

