fix(core): make appendcol row ordering deterministic on parallel engines by ahkcs · Pull Request #5474 · opensearch-project/sql

ahkcs · 2026-05-27T20:38:19Z

Description

appendcol zips a subsearch's columns onto the main search's rows by position. Its lowering (CalciteRelNodeVisitor.visitAppendCol) implements this as a FULL JOIN of two ROW_NUMBER() OVER () windows (empty PARTITION BY / ORDER BY) on _row_number_main_ = _row_number_subsearch_, with no trailing sort.

That positional zip is only correct on a serial, order-preserving executor: a bare ROW_NUMBER() OVER () assigns sequence numbers in input order, and the join preserves it. On a parallel/distributed backend the row-number assignment is arbitrary and the hash join drops ordering, so columns get zipped onto the wrong rows and a downstream head slices a non-deterministic subset.
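
The failure mode described above can be modelled without any query engine. The following is a minimal sketch in plain Java, not the Calcite lowering itself: `PositionalZipDemo`, `zipByArrival`, and the sample rows are hypothetical names and data chosen for illustration. It pairs rows purely by arrival position, which is what a join on row numbers from a bare ROW_NUMBER() OVER () amounts to.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionalZipDemo {
    /**
     * Pairs main[i] with sub[i]. This models a full join on row numbers
     * that were assigned in arrival order; a missing side becomes "null".
     */
    static List<String> zipByArrival(List<String> main, List<String> sub) {
        int n = Math.max(main.size(), sub.size());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String left = i < main.size() ? main.get(i) : "null";
            String right = i < sub.size() ? sub.get(i) : "null";
            out.add(left + "|" + right);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> main = List.of("F,AK", "F,AL", "M,AK");
        // Arrival order on a serial, order-preserving executor (illustrative data).
        List<String> subSerial = List.of("cnt(F)", "cnt(M)");
        // The same rows delivered in a different order, as a parallel scan might.
        List<String> subParallel = List.of("cnt(M)", "cnt(F)");

        System.out.println(zipByArrival(main, subSerial));
        System.out.println(zipByArrival(main, subParallel));
    }
}
```

The two printed lists differ even though both subsearch inputs hold the same rows, which is the sense in which the zip depends on implicit input order.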

This is currently masked on the serial v2/Calcite engine, but it is a latent correctness bug for any parallel backend (the analytics engine, and the Spark pushdown path — the verifyPPLToSparkSQL golden output bakes in the same non-deterministic ROW_NUMBER() OVER ()).

Root cause (observed)

Running the query below through a parallel backend returned rows out of sort order, with cnt attached to the wrong rows and M rows leaking into the top 10:

source=<idx> | stats sum(age) as sum by gender, state | sort gender, state
  | appendcol [ stats count(age) as cnt by gender | sort gender ]
  | fields gender, state, sum, cnt | head 10

A baseline ... | sort gender, state | head 10 (no appendcol) returned correctly ordered rows on the same backend, isolating the cause to the row-number join.

Fix

Make visitAppendCol independent of implicit input-order preservation:

1. Deterministic assignment: derive an explicit window ORDER BY from each child's collation (deriveCollationOrderKeys), so ROW_NUMBER follows the upstream sort. Falls back to the prior bare OVER () when the input carries no collation (positional correspondence is undefined without a sort).
2. Deterministic output order: add a trailing sort by the row-number columns after the join (NULLS LAST; extra subsearch-only rows sort last), the same pattern streamstats already uses, so output order no longer depends on how the backend executes the join.
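
The two steps can be sketched in plain Java as a model of the idea, under stated assumptions: this is not the Calcite plan the PR generates, `DeterministicAppendcolSketch`, `Joined`, and `appendcol` are hypothetical names, natural string order stands in for each child's upstream collation, and reversing the joined list stands in for a join that emits rows in arbitrary order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DeterministicAppendcolSketch {
    /** One joined row; a null row number means that side had no row at this position. */
    static final class Joined {
        final Integer mainRn;
        final Integer subRn;
        final String main;
        final String sub;

        Joined(Integer mainRn, Integer subRn, String main, String sub) {
            this.mainRn = mainRn;
            this.subRn = subRn;
            this.main = main;
            this.sub = sub;
        }
    }

    static List<String> appendcol(List<String> main, List<String> sub) {
        // Step 1: number each side under an explicit ordering, not arrival order.
        // Assumption: natural string order plays the role of the upstream sort.
        List<String> m = new ArrayList<>(main);
        List<String> s = new ArrayList<>(sub);
        Collections.sort(m);
        Collections.sort(s);

        // Full join on row number (position + 1 in each sorted list).
        List<Joined> joined = new ArrayList<>();
        for (int i = 0; i < Math.max(m.size(), s.size()); i++) {
            boolean hasMain = i < m.size();
            boolean hasSub = i < s.size();
            joined.add(new Joined(
                hasMain ? i + 1 : null,
                hasSub ? i + 1 : null,
                hasMain ? m.get(i) : null,
                hasSub ? s.get(i) : null));
        }
        // Model a join whose output order is not guaranteed.
        Collections.reverse(joined);

        // Step 2: trailing sort by the row-number columns, nulls last,
        // so subsearch-only rows end up after all main rows.
        Comparator<Integer> nullsLast = Comparator.nullsLast(Comparator.naturalOrder());
        joined.sort(Comparator.comparing((Joined j) -> j.mainRn, nullsLast)
            .thenComparing(j -> j.subRn, nullsLast));

        List<String> out = new ArrayList<>();
        for (Joined j : joined) {
            out.add(j.main + "|" + j.sub);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(appendcol(List.of("M,AK", "F,AL", "F,AK"), List.of("cnt(M)", "cnt(F)")));
    }
}
```

In this model the result no longer depends on the order in which either input arrives or on the order in which the join emits rows, because both are replaced by explicit sorts.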

No behavior change on the serial v2/Calcite engine; the lowering becomes correct on parallel backends.

Results

CalcitePPLAppendcolIT run against the analytics-engine route (force-routed, parquet-backed indices) before/after, and on the v2/Calcite path:

| Test | analytics route (before) | analytics route (after) | v2/Calcite |
| --- | --- | --- | --- |
| `testAppendCol` | ❌ | ✅ | ✅ |
| `testAppendColOverride` | ❌ | ✅ | ✅ |
| Total | 0/2 | 2/2 | 2/2 |

Testing

CalcitePPLAppendcolTest (5 unit tests) — updated expected logical plans + Spark SQL; all pass.
CalcitePPLAppendcolIT — 2/2 on the analytics-engine route and 2/2 on v2/Calcite.
NewAddedCommandsIT.testAppendcol — passes.
spotlessCheck clean on :core and :ppl.

Check List

New functionality includes testing.
New functionality has been documented (n/a — behavior-preserving fix).
Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions · 2026-05-27T20:39:20Z

PR Reviewer Guide 🔍

(Review updated until commit `41ef5f7`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Empty collation fallback

When `deriveCollationOrderKeys` returns an empty list (no collation available), `ROW_NUMBER() OVER (ORDER BY <empty>)` is generated. This is semantically equivalent to the old `ROW_NUMBER() OVER ()` and still non-deterministic on parallel engines. The fix only works if a collation is present; queries without one remain broken.

    private static List<RexNode> deriveCollationOrderKeys(CalcitePlanContext context) {
      RelBuilder relBuilder = context.relBuilder;
      List<RelCollation> collations =
          relBuilder.getCluster().getMetadataQuery().collations(relBuilder.peek());
      if (collations == null || collations.isEmpty()) {
        return List.of();
      }
      List<RexNode> orderKeys = new ArrayList<>();
      for (RelFieldCollation fieldCollation : collations.get(0).getFieldCollations()) {
        RexNode key = relBuilder.field(fieldCollation.getFieldIndex());
        if (fieldCollation.direction.isDescending()) {
          key = relBuilder.desc(key);
        }
        if (fieldCollation.nullDirection == RelFieldCollation.NullDirection.LAST) {
          key = relBuilder.nullsLast(key);
        } else if (fieldCollation.nullDirection == RelFieldCollation.NullDirection.FIRST) {
          key = relBuilder.nullsFirst(key);
        }
        orderKeys.add(key);
      }
      return orderKeys;
    }

appendcol lowers to a FULL JOIN of two ROW_NUMBER() OVER () windows (empty PARTITION BY / ORDER BY) on _row_number_main_ = _row_number_subsearch_, with no trailing sort. That positional zip is only correct on a serial, order-preserving executor: a bare ROW_NUMBER() OVER () assigns sequence numbers in input order and the join preserves it. On a parallel/distributed backend the row-number assignment is arbitrary and the hash join drops ordering, so columns get zipped onto the wrong rows and downstream `head` slices a non-deterministic subset.

Fix visitAppendCol to not depend on implicit input-order preservation:

- derive an explicit window ORDER BY from each child's collation (deriveCollationOrderKeys), so ROW_NUMBER assignment follows the upstream sort; falls back to the prior bare OVER () when the input has no collation (positional correspondence is undefined without a sort).
- add a trailing sort by the row-number columns after the join (NULLS LAST, same pattern as streamstats) so output order is deterministic regardless of how the backend executes the join.

No behavior change on the serial v2/Calcite engine; makes the lowering correct on parallel backends. Updates CalcitePPLAppendcolTest expected plans/SparkSQL.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

github-actions · 2026-05-27T22:12:57Z

Persistent review updated to latest commit 41ef5f7

ahkcs requested review from LantaoJin, RyanL1997, Swiddis, acarbonetto, anirudha, dai-chen, joshuali925, mengweieric, noCharger, penghuo, ps48, qianheng-aws, songkant-aws, vamsimanohar, ykmr1224 and yuancu as code owners May 27, 2026 20:38

ahkcs added the PPL (Piped processing language) and enhancement (New feature or request) labels May 27, 2026

ahkcs force-pushed the fix/appendcol-deterministic-ordering branch from 094e05e to 41ef5f7 Compare May 27, 2026 22:11

ahkcs wants to merge 1 commit into opensearch-project:main from ahkcs:fix/appendcol-deterministic-ordering
