
Fix RemoveDuplicateCastTransformer incorrectly eliminating lossy Cast chains targeting bool #28102

Open

Copilot wants to merge 5 commits into main from copilot/fix-ort-enable-all-cast-issue

Conversation

Contributor

Copilot AI commented Apr 16, 2026

Description

RemoveDuplicateCastTransformer removes the first Cast when all of its consumers are also Cast nodes, but it did not check whether that first Cast is a lossy conversion feeding a bool destination. This caused Cast(float→int32) → Cast(int32→bool) to collapse into Cast(float→bool), changing truncation semantics:

float32(-0.1) → int32(0) → bool(false)   # correct: truncation zeros the value
float32(-0.1) → bool(true)               # wrong: non-zero maps to true

Fix: Block removal of the first Cast only when loss_precision_cast is true AND any kept child Cast targets bool. Bool conversion (non-zero → true, zero → false) is uniquely problematic because any lossy intermediate cast that maps non-zero values to zero changes the boolean result. For non-bool destinations, lossy intermediates may change specific boundary values but the overall behavior is acceptable.

This targeted constraint still allows optimization of lossy chains with non-bool destinations:

  • Cast(float→float16) → Cast(float16→int32) ⇒ Cast(float→int32) ✓ (optimized)
  • Cast(float→int64) → Cast(int64→int32) ⇒ Cast(float→int32) ✓ (optimized)
  • Cast(float→int32) → Cast(int32→bool) → blocked ✓ (preserved)

Changes:

  • onnxruntime/core/optimizer/insert_cast_transformer.cc — added any_child_casts_to_bool check using GetTypeGroup() + updated comments
  • onnxruntime/test/framework/insert_cast_transformer_test.cc — regression test CastFloatToIntToBoolNotFused + two tests verifying lossy chains with non-bool destinations are still optimized (LossyCastChainWithNonBoolDestIsOptimized, LossyCastFloatToInt64ToInt32IsOptimized)

Motivation and Context

The graph optimizer's cast deduplication assumed (high precision → low precision → lower precision) chains could always be collapsed. This is incorrect when the intermediate type introduces truncation that affects downstream bool semantics — specifically float→integer truncation before a bool cast, where small non-zero floats truncate to zero and then convert to false instead of true.

The fix is scoped to bool destinations because bool's zero/non-zero test is uniquely sensitive to truncation. Other destination types tolerate the small value differences from removing lossy intermediates (rounding differences, consistent saturation behavior).

@tianleiwu
Contributor

@copilot, please also consider cases like the ones below. Although the first Cast is a lossy conversion, each chain can still be optimized to Cast(float->int32).

Cast(float->float16)->Cast(float16->int32)
Cast(float->int64)->Cast(int64->int32)

Could you analyze all the type combinations to find rules that loosen the constraint?

Copilot AI changed the title from "[WIP] Fix ORT_ENABLE_ALL optimizer casting issue" to "Fix RemoveDuplicateCastTransformer incorrectly eliminating lossy Cast chains" on Apr 16, 2026
Copilot AI requested a review from tianleiwu April 16, 2026 22:26
Contributor Author

Copilot AI commented Apr 16, 2026

@copilot, please also consider cases like the ones below. Although the first Cast is a lossy conversion, each chain can still be optimized to Cast(float->int32).

Cast(float->float16)->Cast(float16->int32)
...

Refined the constraint in 7ebf3be. Now only blocks removal when loss_precision_cast && any_child_casts_to_bool.

Analysis: The bool destination is uniquely problematic because its conversion semantics (non-zero → true, zero → false) interact badly with any lossy intermediate cast that can map non-zero values to zero:

  • float(-0.1) → int32(0) → bool(false) vs float(-0.1) → bool(true) — WRONG
  • double(1e-46) → float(0) → bool(false) vs double(1e-46) → bool(true) — WRONG

For non-bool destinations, lossy intermediates may change specific boundary values but the overall behavior is acceptable (rounding differences, same saturation behavior):

  • Cast(float→float16) → Cast(float16→int32) ⇒ Cast(float→int32): allowed
  • Cast(float→int64) → Cast(int64→int32) ⇒ Cast(float→int32): allowed
  • Cast(float→int32) → Cast(int32→bool): blocked

Added tests for both the reviewer's examples (verifying they ARE optimized) and the original bug case (verifying it's blocked).

@tianleiwu
Contributor

Verified the refined constraint in 7ebf3be. Built and ran all 24 TransformerTest.* tests locally — all pass, including the three new regression tests.

Analysis of the three guard conditions and how they interact:

The RemoveDuplicateCastTransformer has three independent gates that can block removal of the first Cast:

  1. inconsistent_casts — blocks high→low→high bit-count patterns (e.g., float32→float16→float32). This is the pre-existing check.
  2. cross_ep — blocks fusion across different execution providers (added in #27291, "[WebGPU] Failed to find kernel for Cast(13) for WebGpuExecutionProvider").
  3. loss_precision_cast && any_child_casts_to_bool — blocks lossy first cast when any child targets bool (the new check in this PR).

Why Cast(float→float16)→Cast(float16→int32) is optimized:

This chain is not directly optimized by the RemoveDuplicateCastTransformer (the inconsistent_casts check blocks it because UnsafeCast(int32, float16) = true). Instead, the InsertCastTransformer first inserts a float16→float32 cast before Cast_2 (since Cast_2's float16 input needs CPU float32 handling), creating float→float16→float32→int32. Then RemoveDuplicateCastTransformer collapses the A→B→A pattern (float→float16→float32) and removes the dangling first Cast, leaving float→int32.

Why Cast(float→int64)→Cast(int64→int32) is optimized:

Here inconsistent_casts = false (because UnsafeCast(int32, int64) returns false — 32 > 64 is false). No float16 inputs, so InsertCast does nothing. The cast_nodes_to_keep path runs, and since any_child_casts_to_bool = false (int32 is not bool), the first Cast is removed.

Why Cast(float→int32)→Cast(int32→bool) is blocked:

inconsistent_casts = false, but loss_precision_cast = true (float→int) and any_child_casts_to_bool = true (bool destination). The guard !(loss_precision_cast && any_child_casts_to_bool) evaluates to false, correctly preventing float(-0.1)→int32(0)→bool(false) from collapsing to float(-0.1)→bool(true).

@tianleiwu tianleiwu marked this pull request as ready for review April 16, 2026 23:06
@tianleiwu tianleiwu requested a review from skottmckay April 16, 2026 23:19
Copilot AI changed the title from "Fix RemoveDuplicateCastTransformer incorrectly eliminating lossy Cast chains" to "Fix RemoveDuplicateCastTransformer incorrectly eliminating lossy Cast chains targeting bool" on Apr 16, 2026
Copilot AI requested a review from tianleiwu April 16, 2026 23:22
@skottmckay
Contributor

We can add this, but my concern is we're adding complexity for a contrived example. Has this issue ever affected a production model?

@tianleiwu
Contributor

concern is we're adding complexity for a contrived example. Has this issue ever affected a production model?

I searched Transformers modeling code and did not find a direct float->int->bool source pattern. I did find many production model families, especially DETR-style vision models, that do float mask interpolation followed by bool conversion. So float-to-bool mask conversion is common in real model code, but the specific intermediate integer Cast in #28089 is still only proven by the minimal repro unless we can show an exported ONNX graph containing that exact chain.

@skottmckay
Contributor

concern is we're adding complexity for a contrived example. Has this issue ever affected a production model?

I searched Transformers modeling code and did not find a direct float->int->bool source pattern. I did find many production model families, especially DETR-style vision models, that do float mask interpolation followed by bool conversion. So float-to-bool mask conversion is common in real model code, but the specific intermediate integer Cast in #28089 is still only proven by the minimal repro unless we can show an exported ONNX graph containing that exact chain.

I'll sign off and leave it up to you to decide if we need this. It's not too much new code so the binary size cost shouldn't be high, and I assume any future edits will be done by AI so the extra complexity shouldn't cost developer time to understand it.

@yuslepukhin
Member

Coverage gaps (nice-to-have, not blockers):

  • Fan-out case: float→int32 with two children, int32→bool and int32→int64. Should block removal since one child targets bool. Not tested.
  • double→float→bool: another lossy-to-bool pattern; would be a cheap additional test.
  • Non-lossy chain to bool: e.g., int32→int64→bool should be optimized (lossless upcast). Not tested directly.

Minor style nit

There are trailing whitespace characters on a few lines in the production code (lines ~442, ~475). These are cosmetic but may be flagged by lint CI:

// that can map non-zero values to zero, changing the semantics. (trailing spaces)
auto kept_dst_type = kept_node.OutputDefs()[0]->Type(); (trailing spaces)

@microsoft deleted a comment from azure-pipelines bot on Apr 17, 2026

Development

Successfully merging this pull request may close these issues.

ORT_ENABLE_ALL incorrectly eliminates Cast(float->int32)->Cast(int32->bool) and changes semantics
