
Fix Potential Integer Truncation Leading to Heap Out-of-Bounds Read/Write#27544

Merged
chilo-ms merged 2 commits into main from chi/gather_overflow_fix on Mar 12, 2026

Conversation


@chilo-ms chilo-ms commented Mar 4, 2026

Description

This pull request refactors several tensor operation kernels (GatherND, ScatterND, and GatherGrad) to improve type safety and consistency in parallelized code execution. The main change is replacing int loop indices with ptrdiff_t so that the indices supplied by ThreadPool::TryParallelFor are never truncated to 32 bits.

Parallelization and Type Safety Improvements

  • Updated lambda functions and parallel loop indices in gather_nd.cc (GatherNDBase::PrepareForCompute, GatherND::GatherNumber, and GatherND::GatherString) to use ptrdiff_t instead of int64_t, and added explicit casts in index arithmetic to maintain correctness. [1] [2] [3]
  • Refactored scatter_nd.cc (ScatterNDDispatchTarget) to use ptrdiff_t for loop indices and index arithmetic in all reduction cases, ensuring consistent type usage in parallel execution.
  • Modified gather_grad.cc (GatherGrad::ComputeImpl) to use ptrdiff_t for parallel loop indices, aligning with the changes in other tensor kernels.

Motivation and Context

A similar issue was fixed in #27444.


Copilot AI left a comment


Pull request overview

This PR refactors several CPU tensor kernels to improve type safety in ThreadPool::TryParallelFor usage by replacing truncated 32-bit loop indices with ptrdiff_t, reducing the risk of overflow-driven heap OOB when iterating very large workloads.

Changes:

  • Update GatherND parallel loops and per-slice lambdas to use ptrdiff_t indices and adjust related pointer arithmetic casts.
  • Update ScatterND dispatch lambda and parallel loop to use ptrdiff_t indices consistently.
  • Update GatherGrad parallel loop to use ptrdiff_t indices (avoiding int truncation).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File                                                             Description
orttraining/orttraining/training_ops/cpu/tensor/gather_grad.cc   Uses ptrdiff_t for TryParallelFor loop indices in GatherGrad.
onnxruntime/core/providers/cpu/tensor/scatter_nd.cc              Uses ptrdiff_t for TryParallelFor loop indices and dispatch lambda in ScatterND.
onnxruntime/core/providers/cpu/tensor/gather_nd.cc               Uses ptrdiff_t for TryParallelFor loop indices and slice lambdas in GatherND.
Comments suppressed due to low confidence (4)

onnxruntime/core/providers/cpu/tensor/gather_nd.cc:122

  • This change is intended to prevent integer truncation in parallel iteration (e.g., when the total work item count exceeds 32-bit). There isn’t a regression test exercising very large num_slices / output sizes for GatherND similar to Gather_overflow_check for Gather; adding one would help prevent reintroducing truncation issues in the future (with appropriate 32-bit skips and memory considerations).
  concurrency::ThreadPool::TryParallelFor(
      tp, onnxruntime::narrow<size_t>(num_slices), static_cast<double>(num_slice_dims),
      [&lambda](ptrdiff_t first, ptrdiff_t last) {
        for (ptrdiff_t slice_idx = first, end = last; slice_idx < end; ++slice_idx) {
          lambda(slice_idx);

onnxruntime/core/providers/cpu/tensor/scatter_nd.cc:405

  • This change is intended to prevent integer truncation in the parallel loop iteration. There isn’t a stress/regression test for ScatterND that forces prepare.element_offsets.size() to exceed 32-bit and exercises the TryParallelFor path; adding one (skipping 32-bit platforms and being mindful of memory) would help ensure this doesn’t regress.
    concurrency::ThreadPool::TryParallelFor(
        tp, prepare.element_offsets.size(), static_cast<double>(prepare.element_to_copy),
        [&lambda](ptrdiff_t first, ptrdiff_t last) {
          for (ptrdiff_t i = first, end = last; i < end; ++i) {
            lambda(i);
          }
        });

orttraining/orttraining/training_ops/cpu/tensor/gather_grad.cc:98

  • This change avoids truncating first/last (ptrdiff_t) to int in the parallel loop. There is currently no regression test that exercises grad_size values beyond 32-bit to validate this fix; consider adding a stress test (with 32-bit skips/memory constraints) to prevent future reintroductions of the truncation pattern.
  concurrency::ThreadPool::TryParallelFor(tp, grad_size, static_cast<double>(block_size),
                                          [&lambda](ptrdiff_t first, ptrdiff_t last) {
                                            for (ptrdiff_t index = first, end = last; index < end; ++index) {
                                              lambda(index);
                                            }
                                          });

onnxruntime/core/providers/cpu/tensor/gather_nd.cc:107

  • err_index is written from inside the TryParallelFor worker lambda without any synchronization. If multiple threads encounter an invalid index, this is a data race (undefined behavior) and the final error value is nondeterministic. Consider using an atomic (e.g., store first failure), or computing validation sequentially / with thread-local errors and combining after the parallel loop.
      int64_t index = static_cast<int64_t>(slice_indices[dim_idx]);
      const auto upper_limit = input_shape[SafeInt<size_t>(batch_dims_) + dim_idx];
      const auto lower_limit = -upper_limit;
      if (index < lower_limit || index >= upper_limit) {
        err_index = index;


@tianleiwu
Contributor

AI Summary

This PR fixes integer truncation bugs in TryParallelFor loop bodies across three tensor operation kernels: GatherND, ScatterND, and GatherGrad. The TryParallelFor callback receives ptrdiff_t parameters (first, last), but the inner loops were casting them to int, a 32-bit type on all mainstream platforms. When the total work-item count exceeds INT_MAX (~2.1 billion), the truncation causes out-of-bounds memory access.

The fix is straightforward and consistent across all three files:

  1. Replace int loop variables with ptrdiff_t (matching the callback parameters)
  2. Change inner lambda parameters from int64_t to ptrdiff_t
  3. Add explicit static_cast<int64_t>() where ptrdiff_t is used in pointer arithmetic with uint64_t fields (to avoid signed/unsigned promotion issues)

Detailed Analysis

gather_nd.cc — 3 functions fixed

PrepareForCompute (line 96):

  • Lambda parameter: int64_t → ptrdiff_t
  • Pointer arithmetic: slice_idx * num_slice_dims → static_cast<int64_t>(slice_idx) * num_slice_dims
  • Inner loop: for (int slice_idx = static_cast<int>(first)...) → for (ptrdiff_t slice_idx = first...)

GatherNumber (line 192):

  • Lambda parameter: int64_t → ptrdiff_t
  • Pointer arithmetic: slice_idx * p.bytes_per_slice → static_cast<int64_t>(slice_idx) * p.bytes_per_slice
    • p.bytes_per_slice is uint64_t; the cast ensures signed×unsigned multiplication is well-defined
  • Inner loop: int → ptrdiff_t

GatherString (line 206):

  • Lambda parameter: int64_t → ptrdiff_t
  • Arithmetic: slice_idx * p.element_count_per_slice → static_cast<int64_t>(slice_idx) * p.element_count_per_slice
  • Inner loop: int → ptrdiff_t

scatter_nd.cc — 1 function fixed

ScatterNDDispatchTarget::operator() (line 359):

  • Lambda parameter: int64_t → ptrdiff_t
  • All 5 reduction branches: i * prepare.element_to_copy → static_cast<int64_t>(i) * prepare.element_to_copy
    • prepare.element_to_copy is uint64_t; cast is correct
  • Inner loop: for (int i = static_cast<int>(first)...) → for (ptrdiff_t i = first...)

gather_grad.cc — 1 loop fixed

GatherGrad::ComputeImpl (line 95):

  • Inner loop: for (int index = static_cast<int>(first)...) → for (ptrdiff_t index = first...)
  • Note: The inner lambda parameter type remains int64_t (line 80). This is safe because ptrdiff_t → int64_t is a widening conversion on 32-bit platforms and a same-width conversion on 64-bit platforms. However, for consistency with the other files, the lambda parameter should ideally also be ptrdiff_t.

Issues

1. gather_grad.cc Lambda Parameter Type Inconsistency (P3 — minor)

In gather_grad.cc, the inner loop now passes ptrdiff_t to lambda, but lambda still accepts int64_t:

auto lambda = [&](int64_t g) {  // ← still int64_t
  ...
};
// ...
for (ptrdiff_t index = first, end = last; index < end; ++index) {
  lambda(index);  // ptrdiff_t → int64_t implicit conversion
}

This is functionally safe (widening or same-size conversion), but inconsistent with gather_nd.cc and scatter_nd.cc where the lambda parameter was changed to ptrdiff_t. For consistency and to prevent future confusion, consider changing it to ptrdiff_t.

2. Pre-existing: err_index Data Race in PrepareForCompute (P2 — pre-existing, not introduced by this PR)

The err_index variable (line 88) is written from inside the TryParallelFor worker lambda without synchronization:

int64_t err_index = 0;
// ...
auto lambda = [&](ptrdiff_t slice_idx) {
  for (/* each slice dimension */ ...) {
    // ...
    if (index < lower_limit || index >= upper_limit) {
      err_index = index;  // ← unsynchronized write from multiple threads
      break;
    }
    // ...
  }
};

If multiple threads encounter invalid indices concurrently, this is a data race (undefined behavior per C++ standard). The final value is nondeterministic. This is a pre-existing issue not introduced by this PR, but worth noting. Consider using std::atomic<int64_t> with a compare-exchange or store-first-failure pattern.

3. No Regression Tests for Overflow Scenarios (P2 — missing coverage)

PR #27444 (the related Gather fix) included a regression test (Gather_overflow_check) that verifies large tensor dimensions exceeding 32-bit limits. This PR does not include corresponding tests for GatherND, ScatterND, or GatherGrad.

While testing overflow scenarios is challenging (requires allocating very large tensors, may need to skip on 32-bit platforms), at least one test per kernel would prevent future regressions. The existing Gather_overflow_check test in gather_op_test.cc provides a good template.

4. Only One Remaining Instance of the Pattern in Codebase (P3 — informational)

A codebase-wide search reveals one remaining instance of the for (int ... = static_cast<int>(first)...) pattern:

onnxruntime/core/providers/cpu/rnn/uni_directional_lstm.cc:36

This could be fixed in a follow-up PR for completeness.

5. Commit Message (P3 — cosmetic)

The single commit message is just "update" — not descriptive. A message like "Fix int truncation in GatherND/ScatterND/GatherGrad TryParallelFor loops" would better serve git history. This is a squash-merge repo, so the PR title will likely be used, which is good.


Code Quality Assessment

Strengths

  • Consistent fix pattern: All three files follow the same approach — change loop variable to ptrdiff_t, add explicit static_cast<int64_t>() for pointer arithmetic
  • Minimal diff: Pure type-change refactor with no logic changes, minimizing risk
  • Correct cast direction: The static_cast<int64_t>(ptrdiff_t_value) before multiplying with uint64_t fields ensures well-defined signed×unsigned arithmetic
  • Follows established precedent: matches the approach taken in merged PR #27444 (Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write)

Verification

  • All static_cast<int64_t>() additions are in pointer arithmetic expressions where the index is multiplied by a stride/size field (uint64_t). Without the cast, the usual arithmetic conversions would convert the signed index to uint64_t, which still yields the correct value for non-negative indices, but the explicit cast makes the intent clear.
  • The onnxruntime::narrow<size_t>() calls for vector indexing (e.g., p.slice_offsets[onnxruntime::narrow<size_t>(slice_idx)]) are unchanged and correct — narrow will throw if the value doesn't fit in size_t.

Summary Verdict

Priority  Issue                                                     Action
P2        No regression tests for overflow scenarios                Add tests similar to Gather_overflow_check
P2        Pre-existing err_index data race (not from this PR)       Track separately
P3        gather_grad.cc lambda parameter still int64_t             Change to ptrdiff_t for consistency
P3        One remaining truncation site in uni_directional_lstm.cc  Follow-up PR
P3        Commit message "update" not descriptive                   Cosmetic; PR title is good

Recommendation: Approve. This is a clean, low-risk security fix that addresses a real integer truncation vulnerability. The changes are mechanically correct and consistent with the established fix pattern from PR #27444. The missing regression tests (Issue 3) are the most notable gap but are not blocking given the straightforward nature of the type changes.


@tianleiwu tianleiwu left a comment


LGTM

@chilo-ms chilo-ms merged commit 32511df into main Mar 12, 2026
89 of 91 checks passed
@chilo-ms chilo-ms deleted the chi/gather_overflow_fix branch March 12, 2026 17:03