Fix two breaking CI failures: reshape-reshape rule test and logsumexp float16 tolerance#2907

Open
Copilot wants to merge 3 commits into main from copilot/fix-breaking-cis

Conversation

Contributor

Copilot AI commented May 2, 2026

Two unrelated CI failures: ORT-nightly incorrectly folds consecutive Reshape ops with allowzero=1, causing a shape mismatch between the original and rewritten models; torch-nightly exposes float16 precision gaps in logsumexp that exceed default tolerances.

Changes

  • _basic_rules_test.py, test_reshape_reshape_dynamic_rule: switch the numerical comparison to use_reference=True (the ONNX reference evaluator) instead of ORT. ORT's graph optimizer incorrectly folds Reshape(Reshape(x, s1), s2, allowzero=1) when the intermediate has zero-sized dimensions, producing wrong output shapes ((1,0,6,3) vs the correct (1,0,0,3)). The rewrite rule itself is semantically correct; ORT is the broken oracle.

  • ops_test_data.py — add tolerance={torch.float16: (2e-2, 1e-4)} to the logsumexp entry. Observed float16 drift in torch-nightly: absolute Δ ≈ 2.3e-5 (limit 1e-5), relative Δ ≈ 2.3e-3 (limit 1e-3).
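The relative drift above is unsurprising given float16 granularity. A minimal numpy sketch (illustrative only, not code from the test suite) shows that a single ulp of float16 is already close to the default 1e-3 relative tolerance, so a few ulps of accumulated error in logsumexp can exceed it:

```python
import numpy as np

# Machine epsilon of float16: the relative step between representable
# values near 1.0 is 2**-10 ~= 9.77e-4, so one rounding error alone can
# approach the 1e-3 default relative tolerance.
fp16_eps = float(np.finfo(np.float16).eps)
print(fp16_eps)  # 0.0009765625

# Round-tripping a float64 value through float16 loses precision on the
# order of eps; the observed ~2.3e-3 relative drift is only a few ulps.
x = 1.2345678
roundtrip = float(np.float16(x))
rel_err = abs(roundtrip - x) / x
print(rel_err)
```

This is why loosening the float16 tolerances for logsumexp, rather than treating the drift as a regression, is the appropriate fix.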

Copilot AI changed the title [WIP] Fix breaking CIs in ort-nightly and torch-nightly Fix two breaking CI failures: reshape-reshape rule test and logsumexp float16 tolerance May 2, 2026
Copilot AI requested a review from justinchuby May 2, 2026 04:01
@justinchuby
Copy link
Copy Markdown
Collaborator

@copilot create a PR on onnxruntime to fix the reshape bug

Contributor Author

Copilot AI commented May 2, 2026

@copilot create a PR on onnxruntime to fix the reshape bug

I'm unable to create PRs on external repositories — I can only push to the current microsoft/onnxscript repo. To file the fix in onnxruntime, someone would need to:

  1. Locate the graph optimizer pass that folds consecutive Reshape ops (likely in onnxruntime/core/optimizer/reshape_fusion.cc or a similar file).
  2. Add a guard: if either Reshape node has allowzero=1 and any of the shape values are 0, the fold should be skipped, because the semantic meaning of 0 changes depending on the context (copy-from-input vs literal zero).
  3. Add a test case covering Reshape(Reshape(x, shape_with_zeros, allowzero=0), shape_with_zeros, allowzero=1) with a zero-sized dimension to verify the correct output shape is produced.

The repro is: input shape (0, 6, 2) → first Reshape([3, 2, -1], allowzero=0) → second Reshape([0, 0, 3], allowzero=1). ORT currently gives (0, 6, 3) but the correct output is (0, 0, 3).
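The divergence comes down to how 0 entries in the target shape are interpreted. A simplified Python model of the Reshape semantics (a sketch, not ORT or onnx code) reproduces the expected shapes for this chain:

```python
import numpy as np

def reshape(x, shape, allowzero=0):
    # Simplified ONNX Reshape semantics: with allowzero=0 a 0 entry
    # copies the corresponding input dim; with allowzero=1 it is a
    # literal zero. A -1 entry is inferred from the remaining size.
    shape = list(shape)
    if not allowzero:
        shape = [x.shape[i] if s == 0 else s for i, s in enumerate(shape)]
    if -1 in shape:
        i = shape.index(-1)
        known = int(np.prod([s for s in shape if s != -1]))
        shape[i] = x.size // known if known else 0
    return x.reshape(shape)

x = np.zeros((0, 6, 2), dtype=np.float32)
mid = reshape(x, [3, 2, -1])              # allowzero=0 -> shape (3, 2, 0)
out = reshape(mid, [0, 0, 3], allowzero=1)
print(out.shape)  # (0, 0, 3), not ORT's incorrect (0, 6, 3)
```

Folding the two nodes into one allowzero=0 Reshape changes which tensor the 0 entries copy dims from, which is exactly the bug described above.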

@codecov

codecov Bot commented May 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.52%. Comparing base (79c0a99) to head (91bed69).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2907   +/-   ##
=======================================
  Coverage   72.52%   72.52%           
=======================================
  Files         258      258           
  Lines       31480    31480           
  Branches     2972     2972           
=======================================
  Hits        22831    22831           
  Misses       7640     7640           
  Partials     1009     1009           


@justinchuby justinchuby marked this pull request as ready for review May 5, 2026 01:41
justinchuby pushed a commit to microsoft/onnxruntime that referenced this pull request May 5, 2026
… dims (#28349)

### Description

`ReshapeFusion::FuseContiguousReshapes` collapses a chain of `Reshape` /
`Squeeze` / `Unsqueeze` nodes into a single `Reshape` whose shape data
is taken verbatim from the fully-inferred output shape of the last node
in the chain. The new node is created without an `allowzero` attribute,
so it defaults to `allowzero = 0`.

When that inferred shape contains a literal `0` dim (legitimate when the
original chain used `allowzero=1`, or when intermediate tensors had
zero-sized dimensions), the fused `Reshape` misinterprets the `0` as
"copy the corresponding dim from the input tensor" — but the input here
is the original input of the *first* reshape in the chain, with
unrelated dims. The result is a silently wrong output shape (and a
benign-looking `MergeShapeInfo` warning at graph load).

### Repro (before the fix)

```python
import numpy as np, onnx, onnxruntime as ort, onnx.reference
from onnx import helper, TensorProto

X  = helper.make_tensor_value_info("X", TensorProto.FLOAT, [0, 6, 2])
Y  = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [None, None, None])
s1 = helper.make_tensor("s1", TensorProto.INT64, [3], [3, 2, -1])
s2 = helper.make_tensor("s2", TensorProto.INT64, [3], [0, 0, 3])

n1 = helper.make_node("Reshape", ["X",   "s1"], ["mid"])
n2 = helper.make_node("Reshape", ["mid", "s2"], ["Y"], allowzero=1)
m  = helper.make_model(helper.make_graph([n1, n2], "g", [X], [Y], initializer=[s1, s2]),
                       opset_imports=[helper.make_opsetid("", 18)])

inp = np.random.default_rng(7).random((0, 6, 2), dtype=np.float32)
print("REF:", onnx.reference.ReferenceEvaluator(m).run(None, {"X": inp})[0].shape)
print("ORT:", ort.InferenceSession(m.SerializeToString(),
                                   providers=["CPUExecutionProvider"]).run(None, {"X": inp})[0].shape)
```

Output on `main` (`40c9f85f69`):

```
REF: (0, 0, 3)
[W ... graph.cc:122 MergeShapeInfo] Error merging shape info for output. 'Y' source:{0,6,3} target:{0,0,3}. Falling back to lenient merge.
ORT: (0, 6, 3)   ❌
```

### Fix

Setting `allowzero=1` on the fused node would also work, but that
requires opset >= 14, which this transformer cannot assume (it accepts
`Reshape` from opset 5 on). Instead, conservatively bail out of the
fusion when `shape_value` contains any literal `0` dim.
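The guard amounts to the following check (a Python sketch; the actual fix lives in the C++ `ReshapeFusion` transformer, and the function name here is illustrative):

```python
def can_fuse_contiguous_reshapes(shape_value):
    # Bail out if the inferred shape contains a literal 0 dim: a fused
    # Reshape defaulting to allowzero=0 would reinterpret that 0 as
    # "copy the corresponding input dim", changing semantics.
    return all(d != 0 for d in shape_value)

print(can_fuse_contiguous_reshapes([2, 1, 64, 32]))  # True  (happy path from #22494)
print(can_fuse_contiguous_reshapes([0, 0, 3]))       # False (bug repro shape)
```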

### Test

Adds `ReshapeFusionContiguousReshapesWithZeroDim` that builds the bug
repro programmatically and asserts:
- the two reshapes are NOT collapsed
- the inferred output shape stays `(0, 0, 3)`

The existing happy-path test `ReshapeFusion_Contiguous_Reshape` (added
in #22494) is unaffected — its inferred output shape `(2, 1, 64, 32)`
contains no zero dims, so the new guard does not trigger.

### Provenance

`FuseContiguousReshapes` was introduced in #22494 (Feb 2025). The bug
has been latent in `main` since then.

### Motivation and Context

Found while reviewing microsoft/onnxscript#2907
— the rewriter rule under test there is semantically correct, but its
numerical-equivalence check using ORT as the oracle fails because of
this fusion bug.

Fixes #28348.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@justinchuby justinchuby enabled auto-merge (squash) May 5, 2026 01:44

Successfully merging this pull request may close these issues.

Fix breaking CIs
