Skip to content

[Calcite] Make containsNestedAggregator tolerate flat-leaf schemas#5423

Merged
ahkcs merged 2 commits into
opensearch-project:feature/mustang-ppl-integrationfrom
mch2:feature/mustang-ppl-integration
May 8, 2026
Merged

[Calcite] Make containsNestedAggregator tolerate flat-leaf schemas#5423
ahkcs merged 2 commits into
opensearch-project:feature/mustang-ppl-integrationfrom
mch2:feature/mustang-ppl-integration

Conversation

@mch2
Copy link
Copy Markdown
Member

@mch2 mch2 commented May 8, 2026

Description

The check splits each aggregator argument on '.' and looks up the
leading token as a top-level ARRAY column (the OpenSearch nested
marker). It used relBuilder.field(root), which throws when the column
doesn't exist.

That works for the classic path, which always exposes a top-level
column for object/nested parents. It breaks the analytics-engine
path, which emits only flat leaves ("city.name", "city.location.latitude")
— parent placeholders can't round-trip through Substrait. Any
aggregation against an object-field leaf (e.g. stats max(city.location.latitude))
crashed with "Field [city] not found".

Use RelDataType.getField, which returns null on miss. A null lookup
is the correct "not nested" answer. Behavior unchanged for relations
that do have a top-level ARRAY column.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

  The check splits each aggregator argument on '.' and looks up the
  leading token as a top-level ARRAY column (the OpenSearch `nested`
  marker). It used relBuilder.field(root), which throws when the column
  doesn't exist.

  That works for the classic path, which always exposes a top-level
  column for object/nested parents. It breaks the analytics-engine
  path, which emits only flat leaves ("city.name", "city.location.latitude")
  — parent placeholders can't round-trip through Substrait. Any
  aggregation against an object-field leaf (e.g. stats max(city.location.latitude))
  crashed with "Field [city] not found".

  Use RelDataType.getField, which returns null on miss. A null lookup
  is the correct "not nested" answer. Behavior unchanged for relations
  that do have a top-level ARRAY column.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

PR Reviewer Guide 🔍

(Review updated until commit ba80518)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

PR Code Suggestions ✨

Latest suggestions up to ba80518
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Add index bounds validation

Add bounds checking before accessing the field name by index. If r.getIndex()
exceeds the size of getFieldNames(), this will throw an IndexOutOfBoundsException.
Validate the index is within bounds before accessing.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [1551-1559]

 return aggCallRefs.stream()
+    .filter(r -> r.getIndex() < rowType.getFieldNames().size())
     .map(r -> rowType.getFieldNames().get(r.getIndex()))
     .map(name -> org.apache.commons.lang3.StringUtils.substringBefore(name, "."))
     .anyMatch(
         root -> {
           RelDataTypeField field =
               rowType.getField(root, /* caseSensitive= */ true, /* elideRecord= */ false);
           return field != null && field.getType().getSqlTypeName() == SqlTypeName.ARRAY;
         });
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential IndexOutOfBoundsException risk when accessing rowType.getFieldNames().get(r.getIndex()). Adding bounds checking is a valid defensive programming practice, though the likelihood of this occurring depends on the upstream data integrity guarantees.

Medium

Previous suggestions

Suggestions up to commit 772be11
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add index bounds validation

Add bounds checking before accessing the field name by index. If r.getIndex()
exceeds the size of getFieldNames(), this will throw an IndexOutOfBoundsException.
Consider validating the index or using a safer access pattern.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [1551-1557]

 return aggCallRefs.stream()
+    .filter(r -> r.getIndex() < rowType.getFieldCount())
     .map(r -> rowType.getFieldNames().get(r.getIndex()))
     .map(name -> org.apache.commons.lang3.StringUtils.substringBefore(name, "."))
     .anyMatch(root -> {
       RelDataTypeField field = rowType.getField(root, /*caseSensitive=*/ true, /*elideRecord=*/ false);
       return field != null && field.getType().getSqlTypeName() == SqlTypeName.ARRAY;
     });
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential IndexOutOfBoundsException risk when accessing rowType.getFieldNames().get(r.getIndex()). Adding bounds checking with filter(r -> r.getIndex() < rowType.getFieldCount()) is a valid defensive programming practice that improves robustness, though the likelihood of this occurring in practice depends on the upstream data integrity.

Medium

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit ba80518

@penghuo penghuo added the PPL Piped processing language label May 8, 2026
@ahkcs ahkcs merged commit 109375d into opensearch-project:feature/mustang-ppl-integration May 8, 2026
33 of 38 checks passed
ahkcs added a commit to ahkcs/sql that referenced this pull request May 11, 2026
Single squashed delivery of the long-running
feature/mustang-ppl-integration branch into main, consolidating 22
feature-branch PRs plus the conflict-resolved merge of current main.
Squashed because the feature branch's history includes commits with
missing or mismatched Signed-off-by trailers that block DCO at this
scope — the equivalent issue documented for the catch-up squashes
(opensearch-project#5397).

The feature branch f006e29 is retained for individual-commit lineage.

### What this delivers

Analytics-engine PPL integration — a new execution path that routes
Parquet-backed (non-Lucene) indices through an analytics engine while
keeping Lucene-backed indices on the existing v2 / Calcite paths.

Headline pieces:
- Query routing (opensearch-project#5267) — PPL queries against Parquet-backed indices
  hand off to the analytics-engine execution path; Lucene-backed indices
  continue through the legacy path
- Explain support (opensearch-project#5275) — EXPLAIN covers the analytics-engine path
- Profiling + UnifiedQueryParser (opensearch-project#5285) — migrates PPL parsing to the
  unified parser and wires profiling metrics through the analytics path
- extendedPlugins wiring (opensearch-project#5302) — analytics-engine attaches as an
  OpenSearch extension via SPI
- SQL REST endpoint integration (opensearch-project#5317) — same analytics-route fork
  applied to the SQL transport, plus delegateToV2Engine extraction in
  RestSqlAction
- Async QueryPlanExecutor (opensearch-project#5396) — async execution for analytics-engine
  plans + version bump to OpenSearch 3.7
- Optional dependency (opensearch-project#5403) — analytics-engine becomes an optional
  runtime dep so the SQL bundle is shippable without it
- Index-setting-based routing (opensearch-project#5429) — replaces the earlier
  table-name-prefix heuristic with an authoritative index-setting check

Supporting infrastructure:
- Gradle wrapper bump to 9.4.1 (opensearch-project#5406)
- Jar-hell exclusions for arrow-flight-rpc / httpcore5-h2 /
  httpcore5-reactive / httpclient5 (opensearch-project#5400, opensearch-project#5409)
- IT plumbing: CalciteEvalCommandIT / CalciteFieldFormatCommandIT
  carried through the helper-managed index path (opensearch-project#5407, opensearch-project#5417);
  CalciteReplaceCommandIT column-order-agnostic (opensearch-project#5415); @ignore'd
  Calcite ITs dropped from CalciteNoPushdownIT (opensearch-project#5416)
- plugins.calcite.enabled=true defaulted on the unified query path
  (opensearch-project#5413)
- PPL_REX_MAX_MATCH_LIMIT bridged into UnifiedQueryContext (opensearch-project#5418)
- Calcite tolerance fixes: array() default type (opensearch-project#5421),
  containsNestedAggregator flat-leaf schemas (opensearch-project#5423)
- Sandbox deps switched to analytics-api JDK 21 surface (opensearch-project#5426)

### Feature-branch commits squashed (22)

opensearch-project#5429, opensearch-project#5426, opensearch-project#5423, opensearch-project#5421, opensearch-project#5418, opensearch-project#5403, opensearch-project#5417, opensearch-project#5415, opensearch-project#5416, opensearch-project#5413,
opensearch-project#5407, opensearch-project#5409, opensearch-project#5406, opensearch-project#5400, opensearch-project#5396, opensearch-project#5317, opensearch-project#5302, opensearch-project#5285, opensearch-project#5275, opensearch-project#5267,
opensearch-project#5397, opensearch-project#5286

### Main commits absorbed via the merge (54)

Brings the branch up to current upstream/main (54 commits since the
last catch-up at opensearch-project#5397, divergence point 513e1b2). Highlights: opensearch-project#5419,
opensearch-project#5408, opensearch-project#5414, opensearch-project#5399, opensearch-project#5394, opensearch-project#5361, opensearch-project#5360, opensearch-project#5240, opensearch-project#5266, opensearch-project#5278, plus 44
others (bugfixes, doc updates, infra).

### Conflict resolutions (7)

Resolved during the merge of main into the feature branch. Resolution
kept the feature branch's analytics-engine-path semantics where main's
changes would have regressed them.

- api/.../UnifiedQueryContext.java
  Blank-line-only conflict; took main's tighter formatting.

- core/.../executor/QueryService.java
  Kept feature's CalciteClassLoaderHelper.withCalciteClassLoader(...)
  wrapping (required for analytics-engine classloader isolation) and
  the matching import.

- integ-test/build.gradle
  Kept feature's detailed root-cause comment on the Gradle 9.4.1
  TestEventReporterAsListener workaround; kept ASCII ordering of
  JSONRequestIT / JoinIT and SQLFunctionsIT / ShowIT / SourceFieldIT
  entries.

- integ-test/.../CalciteEvalCommandIT.java
  Kept feature's if (!TestUtils.isIndexExist(...)) idempotency guards
  on test_eval and test_eval_agent setup (needed for the helper-managed
  index analytics-engine compatibility run).

- legacy/.../RestSqlAction.java
  Kept feature's delegateToV2Engine(...) (extracted from the
  analytics-engine routing path). Both sides added handleException /
  getRestStatus / getRawErrorCode; removed the duplicate set git
  produced.

- plugin/.../SQLPlugin.java
  Took the union of imports: ExecutionEngine +
  ExecutionEngine.ExplainResponse + QueryType.

- plugin/.../transport/TransportPPLQueryAction.java
  Combined main's OpenSearchPluginModule(extensionsHolder.engines()) and
  feature's local pluginSettings / pluginSettingsRef wiring.

EngineExtensionsHolder.java is a new file from main (opensearch-project#5298) preserved
as-is.

### Compatibility / opt-in

The analytics-engine path is gated by the extendedPlugins extension
being installed (opensearch-project#5403 makes the dep optional). Clusters without
analytics-engine installed see no behavior change. Clusters with
analytics-engine installed route only Parquet-backed indices through
the new path (opensearch-project#5429 — by index setting).

### Verification

- ./gradlew :api:compileJava :core:compileJava :legacy:compileJava
  :opensearch-sql-plugin:compileJava :integ-test:compileTestJava passes
  locally

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added a commit to ahkcs/sql that referenced this pull request May 11, 2026
Single squashed delivery of the long-running
feature/mustang-ppl-integration branch into main, consolidating 22
feature-branch PRs plus the conflict-resolved merge of current main.
Squashed because the feature branch's history includes commits with
missing or mismatched Signed-off-by trailers that block DCO at this
scope — the equivalent issue documented for the catch-up squashes
(opensearch-project#5397).

The feature branch f006e29 is retained for individual-commit lineage.

### What this delivers

Analytics-engine PPL integration — a new execution path that routes
Parquet-backed (non-Lucene) indices through an analytics engine while
keeping Lucene-backed indices on the existing v2 / Calcite paths.

Headline pieces:
- Query routing (opensearch-project#5267) — PPL queries against Parquet-backed indices
  hand off to the analytics-engine execution path; Lucene-backed indices
  continue through the legacy path
- Explain support (opensearch-project#5275) — EXPLAIN covers the analytics-engine path
- Profiling + UnifiedQueryParser (opensearch-project#5285) — migrates PPL parsing to the
  unified parser and wires profiling metrics through the analytics path
- extendedPlugins wiring (opensearch-project#5302) — analytics-engine attaches as an
  OpenSearch extension via SPI
- SQL REST endpoint integration (opensearch-project#5317) — same analytics-route fork
  applied to the SQL transport, plus delegateToV2Engine extraction in
  RestSqlAction
- Async QueryPlanExecutor (opensearch-project#5396) — async execution for analytics-engine
  plans + version bump to OpenSearch 3.7
- Optional dependency (opensearch-project#5403) — analytics-engine becomes an optional
  runtime dep so the SQL bundle is shippable without it
- Index-setting-based routing (opensearch-project#5429, opensearch-project#5432) — replaces the earlier
  table-name-prefix heuristic with an authoritative index-setting check

Supporting infrastructure:
- Gradle wrapper bump to 9.4.1 (opensearch-project#5406)
- Jar-hell exclusions for arrow-flight-rpc / httpcore5-h2 /
  httpcore5-reactive / httpclient5 (opensearch-project#5400, opensearch-project#5409)
- IT plumbing: CalciteEvalCommandIT / CalciteFieldFormatCommandIT
  carried through the helper-managed index path (opensearch-project#5407, opensearch-project#5417);
  CalciteReplaceCommandIT column-order-agnostic (opensearch-project#5415); @ignore'd
  Calcite ITs dropped from CalciteNoPushdownIT (opensearch-project#5416)
- plugins.calcite.enabled=true defaulted on the unified query path
  (opensearch-project#5413)
- PPL_REX_MAX_MATCH_LIMIT bridged into UnifiedQueryContext (opensearch-project#5418)
- Calcite tolerance fixes: array() default type (opensearch-project#5421),
  containsNestedAggregator flat-leaf schemas (opensearch-project#5423)
- Sandbox deps switched to analytics-api JDK 21 surface (opensearch-project#5426)

### Feature-branch commits squashed (22)

opensearch-project#5432, opensearch-project#5429, opensearch-project#5426, opensearch-project#5423, opensearch-project#5421, opensearch-project#5418, opensearch-project#5403, opensearch-project#5417, opensearch-project#5415, opensearch-project#5416, opensearch-project#5413,
opensearch-project#5407, opensearch-project#5409, opensearch-project#5406, opensearch-project#5400, opensearch-project#5396, opensearch-project#5317, opensearch-project#5302, opensearch-project#5285, opensearch-project#5275, opensearch-project#5267,
opensearch-project#5397, opensearch-project#5286

### Main commits absorbed via the merge (54)

Brings the branch up to current upstream/main (54 commits since the
last catch-up at opensearch-project#5397, divergence point 513e1b2). Highlights: opensearch-project#5419,
opensearch-project#5408, opensearch-project#5414, opensearch-project#5399, opensearch-project#5394, opensearch-project#5361, opensearch-project#5360, opensearch-project#5240, opensearch-project#5266, opensearch-project#5278, plus 44
others (bugfixes, doc updates, infra).

### Conflict resolutions (7)

Resolved during the merge of main into the feature branch. Resolution
kept the feature branch's analytics-engine-path semantics where main's
changes would have regressed them.

- api/.../UnifiedQueryContext.java
  Blank-line-only conflict; took main's tighter formatting.

- core/.../executor/QueryService.java
  Kept feature's CalciteClassLoaderHelper.withCalciteClassLoader(...)
  wrapping (required for analytics-engine classloader isolation) and
  the matching import.

- integ-test/build.gradle
  Kept feature's detailed root-cause comment on the Gradle 9.4.1
  TestEventReporterAsListener workaround; kept ASCII ordering of
  JSONRequestIT / JoinIT and SQLFunctionsIT / ShowIT / SourceFieldIT
  entries.

- integ-test/.../CalciteEvalCommandIT.java
  Kept feature's if (!TestUtils.isIndexExist(...)) idempotency guards
  on test_eval and test_eval_agent setup (needed for the helper-managed
  index analytics-engine compatibility run).

- legacy/.../RestSqlAction.java
  Kept feature's delegateToV2Engine(...) (extracted from the
  analytics-engine routing path). Both sides added handleException /
  getRestStatus / getRawErrorCode; removed the duplicate set git
  produced.

- plugin/.../SQLPlugin.java
  Took the union of imports: ExecutionEngine +
  ExecutionEngine.ExplainResponse + QueryType.

- plugin/.../transport/TransportPPLQueryAction.java
  Combined main's OpenSearchPluginModule(extensionsHolder.engines()) and
  feature's local pluginSettings / pluginSettingsRef wiring.

EngineExtensionsHolder.java is a new file from main (opensearch-project#5298) preserved
as-is.

### Compatibility / opt-in

The analytics-engine path is gated by the extendedPlugins extension
being installed (opensearch-project#5403 makes the dep optional). Clusters without
analytics-engine installed see no behavior change. Clusters with
analytics-engine installed route only Parquet-backed indices through
the new path (opensearch-project#5429 — by index setting).

### Verification

- ./gradlew :api:compileJava :core:compileJava :legacy:compileJava
  :opensearch-sql-plugin:compileJava :integ-test:compileTestJava passes
  locally

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Co-authored-by: bowenlan-amzn <bowenlan23@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants