
Fix Potential Integer Truncation Leading to Heap Out-of-Bounds Read/Write#27544

Merged
chilo-ms merged 2 commits into main from chi/gather_overflow_fix on Mar 12, 2026

Conversation


@chilo-ms chilo-ms commented Mar 4, 2026

Description

This pull request refactors several tensor operation kernels (GatherND, ScatterND, and GatherGrad) to improve type safety and consistency in parallelized code execution. The main change is replacing int loop indices with ptrdiff_t so that the indices supplied by ThreadPool::TryParallelFor are never truncated to 32 bits.

Parallelization and Type Safety Improvements

  • Updated lambda functions and parallel loop indices in gather_nd.cc (GatherNDBase::PrepareForCompute, GatherND::GatherNumber, and GatherND::GatherString) to use ptrdiff_t instead of int64_t, and added explicit casts in index arithmetic to maintain correctness. [1] [2] [3]
  • Refactored scatter_nd.cc (ScatterNDDispatchTarget) to use ptrdiff_t for loop indices and index arithmetic in all reduction cases, ensuring consistent type usage in parallel execution.
  • Modified gather_grad.cc (GatherGrad::ComputeImpl) to use ptrdiff_t for parallel loop indices, aligning with the changes in other tensor kernels.

Motivation and Context

A similar issue was fixed in #27444.


Copilot AI left a comment


Pull request overview

This PR refactors several CPU tensor kernels to improve type safety in ThreadPool::TryParallelFor usage by replacing truncated 32-bit loop indices with ptrdiff_t, reducing the risk of overflow-driven heap OOB when iterating very large workloads.

Changes:

  • Update GatherND parallel loops and per-slice lambdas to use ptrdiff_t indices and adjust related pointer arithmetic casts.
  • Update ScatterND dispatch lambda and parallel loop to use ptrdiff_t indices consistently.
  • Update GatherGrad parallel loop to use ptrdiff_t indices (avoiding int truncation).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File                                                             Description
orttraining/orttraining/training_ops/cpu/tensor/gather_grad.cc   Uses ptrdiff_t for TryParallelFor loop indices in GatherGrad.
onnxruntime/core/providers/cpu/tensor/scatter_nd.cc              Uses ptrdiff_t for TryParallelFor loop indices and dispatch lambda in ScatterND.
onnxruntime/core/providers/cpu/tensor/gather_nd.cc               Uses ptrdiff_t for TryParallelFor loop indices and slice lambdas in GatherND.
Comments suppressed due to low confidence (4)

onnxruntime/core/providers/cpu/tensor/gather_nd.cc:122

  • This change is intended to prevent integer truncation in parallel iteration (e.g., when the total work item count exceeds 32-bit). There isn’t a regression test exercising very large num_slices / output sizes for GatherND similar to Gather_overflow_check for Gather; adding one would help prevent reintroducing truncation issues in the future (with appropriate 32-bit skips and memory considerations).
  concurrency::ThreadPool::TryParallelFor(
      tp, onnxruntime::narrow<size_t>(num_slices), static_cast<double>(num_slice_dims),
      [&lambda](ptrdiff_t first, ptrdiff_t last) {
        for (ptrdiff_t slice_idx = first, end = last; slice_idx < end; ++slice_idx) {
          lambda(slice_idx);

onnxruntime/core/providers/cpu/tensor/scatter_nd.cc:405

  • This change is intended to prevent integer truncation in the parallel loop iteration. There isn’t a stress/regression test for ScatterND that forces prepare.element_offsets.size() to exceed 32-bit and exercises the TryParallelFor path; adding one (skipping 32-bit platforms and being mindful of memory) would help ensure this doesn’t regress.
    concurrency::ThreadPool::TryParallelFor(
        tp, prepare.element_offsets.size(), static_cast<double>(prepare.element_to_copy),
        [&lambda](ptrdiff_t first, ptrdiff_t last) {
          for (ptrdiff_t i = first, end = last; i < end; ++i) {
            lambda(i);
          }
        });

orttraining/orttraining/training_ops/cpu/tensor/gather_grad.cc:98

  • This change avoids truncating first/last (ptrdiff_t) to int in the parallel loop. There is currently no regression test that exercises grad_size values beyond 32-bit to validate this fix; consider adding a stress test (with 32-bit skips/memory constraints) to prevent future reintroductions of the truncation pattern.
  concurrency::ThreadPool::TryParallelFor(tp, grad_size, static_cast<double>(block_size),
                                          [&lambda](ptrdiff_t first, ptrdiff_t last) {
                                            for (ptrdiff_t index = first, end = last; index < end; ++index) {
                                              lambda(index);
                                            }
                                          });

onnxruntime/core/providers/cpu/tensor/gather_nd.cc:107

  • err_index is written from inside the TryParallelFor worker lambda without any synchronization. If multiple threads encounter an invalid index, this is a data race (undefined behavior) and the final error value is nondeterministic. Consider using an atomic (e.g., store first failure), or computing validation sequentially / with thread-local errors and combining after the parallel loop.
      int64_t index = static_cast<int64_t>(slice_indices[dim_idx]);
      const auto upper_limit = input_shape[SafeInt<size_t>(batch_dims_) + dim_idx];
      const auto lower_limit = -upper_limit;
      if (index < lower_limit || index >= upper_limit) {
        err_index = index;


@tianleiwu
Contributor

AI Summary

This PR fixes integer truncation bugs in TryParallelFor loop bodies across three tensor operation kernels: GatherND, ScatterND, and GatherGrad. The TryParallelFor callback receives ptrdiff_t parameters (first, last), but the inner loops were casting them to int, a 32-bit type on all mainstream platforms. When the total work-item count exceeds INT_MAX (~2.1 billion), the truncation causes out-of-bounds memory access.

The fix is straightforward and consistent across all three files:

  1. Replace int loop variables with ptrdiff_t (matching the callback parameters)
  2. Change inner lambda parameters from int64_t to ptrdiff_t
  3. Add explicit static_cast<int64_t>() where ptrdiff_t is used in pointer arithmetic with uint64_t fields (to avoid signed/unsigned promotion issues)

Detailed Analysis

gather_nd.cc — 3 functions fixed

PrepareForCompute (line 96):

  • Lambda parameter: int64_t → ptrdiff_t
  • Pointer arithmetic: slice_idx * num_slice_dims → static_cast<int64_t>(slice_idx) * num_slice_dims
  • Inner loop: for (int slice_idx = static_cast<int>(first)...) → for (ptrdiff_t slice_idx = first...)

GatherNumber (line 192):

  • Lambda parameter: int64_t → ptrdiff_t
  • Pointer arithmetic: slice_idx * p.bytes_per_slice → static_cast<int64_t>(slice_idx) * p.bytes_per_slice
    • p.bytes_per_slice is uint64_t; the cast ensures signed×unsigned multiplication is well-defined
  • Inner loop: int → ptrdiff_t

GatherString (line 206):

  • Lambda parameter: int64_t → ptrdiff_t
  • Arithmetic: slice_idx * p.element_count_per_slice → static_cast<int64_t>(slice_idx) * p.element_count_per_slice
  • Inner loop: int → ptrdiff_t

scatter_nd.cc — 1 function fixed

ScatterNDDispatchTarget::operator() (line 359):

  • Lambda parameter: int64_t → ptrdiff_t
  • All 5 reduction branches: i * prepare.element_to_copy → static_cast<int64_t>(i) * prepare.element_to_copy
    • prepare.element_to_copy is uint64_t; cast is correct
  • Inner loop: for (int i = static_cast<int>(first)...) → for (ptrdiff_t i = first...)

gather_grad.cc — 1 loop fixed

GatherGrad::ComputeImpl (line 95):

  • Inner loop: for (int index = static_cast<int>(first)...) → for (ptrdiff_t index = first...)
  • Note: The inner lambda parameter type remains int64_t (line 80). This is safe because ptrdiff_t → int64_t is a widening conversion on 32-bit platforms and a same-width conversion on 64-bit platforms. However, for consistency with the other files, the lambda parameter should ideally also be ptrdiff_t.

Issues

1. gather_grad.cc Lambda Parameter Type Inconsistency (P3 — minor)

In gather_grad.cc, the inner loop now passes ptrdiff_t to lambda, but lambda still accepts int64_t:

auto lambda = [&](int64_t g) {  // ← still int64_t
  ...
};
// ...
for (ptrdiff_t index = first, end = last; index < end; ++index) {
  lambda(index);  // ptrdiff_t → int64_t implicit conversion
}

This is functionally safe (widening or same-size conversion), but inconsistent with gather_nd.cc and scatter_nd.cc where the lambda parameter was changed to ptrdiff_t. For consistency and to prevent future confusion, consider changing it to ptrdiff_t.

2. Pre-existing: err_index Data Race in PrepareForCompute (P2 — pre-existing, not introduced by this PR)

The err_index variable (line 88) is written from inside the TryParallelFor worker lambda without synchronization:

int64_t err_index = 0;
// ...
auto lambda = [&](ptrdiff_t slice_idx) {
  for (/* each slice dimension */ ...) {
    // ...
    if (index < lower_limit || index >= upper_limit) {
      err_index = index;  // ← unsynchronized write from multiple threads
      break;
    }
    // ...
  }
};

If multiple threads encounter invalid indices concurrently, this is a data race (undefined behavior per C++ standard). The final value is nondeterministic. This is a pre-existing issue not introduced by this PR, but worth noting. Consider using std::atomic<int64_t> with a compare-exchange or store-first-failure pattern.

3. No Regression Tests for Overflow Scenarios (P2 — missing coverage)

PR #27444 (the related Gather fix) included a regression test (Gather_overflow_check) that verifies large tensor dimensions exceeding 32-bit limits. This PR does not include corresponding tests for GatherND, ScatterND, or GatherGrad.

While testing overflow scenarios is challenging (requires allocating very large tensors, may need to skip on 32-bit platforms), at least one test per kernel would prevent future regressions. The existing Gather_overflow_check test in gather_op_test.cc provides a good template.

4. Only One Remaining Instance of the Pattern in Codebase (P3 — informational)

A codebase-wide search reveals one remaining instance of the for (int ... = static_cast<int>(first)...) pattern:

onnxruntime/core/providers/cpu/rnn/uni_directional_lstm.cc:36

This could be fixed in a follow-up PR for completeness.

5. Commit Message (P3 — cosmetic)

The single commit message is just "update" — not descriptive. A message like "Fix int truncation in GatherND/ScatterND/GatherGrad TryParallelFor loops" would better serve git history. This is a squash-merge repo, so the PR title will likely be used, which is good.


Code Quality Assessment

Strengths

  • Consistent fix pattern: All three files follow the same approach — change loop variable to ptrdiff_t, add explicit static_cast<int64_t>() for pointer arithmetic
  • Minimal diff: Pure type-change refactor with no logic changes, minimizing risk
  • Correct cast direction: The static_cast<int64_t>(ptrdiff_t_value) before multiplying with uint64_t fields ensures well-defined signed×unsigned arithmetic
  • Follows established precedent: matches the approach taken in merged PR #27444 (Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write)

Verification

  • All static_cast<int64_t>() additions are in pointer arithmetic expressions where the index is multiplied by a stride/size field (uint64_t). Without the cast, the usual arithmetic conversions would convert the signed index to uint64_t, which still yields the correct value for non-negative indices, but the explicit cast makes the intent clear.
  • The onnxruntime::narrow<size_t>() calls for vector indexing (e.g., p.slice_offsets[onnxruntime::narrow<size_t>(slice_idx)]) are unchanged and correct — narrow will throw if the value doesn't fit in size_t.

Summary Verdict

Priority  Issue                                                     Action
P2        No regression tests for overflow scenarios                Add tests similar to Gather_overflow_check
P2        Pre-existing err_index data race (not from this PR)       Track separately
P3        gather_grad.cc lambda parameter still int64_t             Change to ptrdiff_t for consistency
P3        One remaining truncation site in uni_directional_lstm.cc  Follow-up PR
P3        Commit message "update" not descriptive                   Cosmetic; PR title is good

Recommendation: Approve. This is a clean, low-risk security fix that addresses a real integer truncation vulnerability. The changes are mechanically correct and consistent with the established fix pattern from PR #27444. The missing regression tests (Issue 3) are the most notable gap but are not blocking given the straightforward nature of the type changes.


@tianleiwu tianleiwu left a comment


LGTM

@chilo-ms chilo-ms merged commit 32511df into main Mar 12, 2026
89 of 91 checks passed
@chilo-ms chilo-ms deleted the chi/gather_overflow_fix branch March 12, 2026 17:03