Skip to content

Fix CalcitePPLAggregationIT on the analytics-engine route (parquet testSimpleCount0 + APPROX_COUNT_DISTINCT name)#5525

Merged
ahkcs merged 2 commits into
opensearch-project:mainfrom
ahkcs:fix-simplecount0-parquet-index
Jun 8, 2026
Merged

Fix CalcitePPLAggregationIT on the analytics-engine route (parquet testSimpleCount0 + APPROX_COUNT_DISTINCT name)#5525
ahkcs merged 2 commits into
opensearch-project:mainfrom
ahkcs:fix-simplecount0-parquet-index

Conversation

@ahkcs

@ahkcs ahkcs commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Description

Two small fixes that make CalcitePPLAggregationIT pass on the analytics-engine route (-Dtests.analytics.parquet_indices=true). Both are no-ops on the v2 / Calcite path.

1. testSimpleCount0 — use a parquet-backed index. The test built an ad-hoc test index with raw _doc PUTs; a bare auto-created index isn't composite/parquet-backed, so RestUnifiedQueryAction.isAnalyticsIndex doesn't route it to the analytics engine. Switched to TEST_INDEX_BANK (loaded in init() via loadIndex, which injects the parquet settings when the flag is set, 7 docs), and dropped the now-unused Request import. Verified passing on both routes.

2. distinct_count_approx — emit the Substrait-standard operator name. PPLBuiltinOperators.DISTINCT_COUNT_APPROX created its SqlAggFunction with the runtime-resolution name "DISTINCT_COUNT_APPROX". The analytics-engine (DataFusion) backend resolves aggregates by the Calcite/Substrait-standard name APPROX_COUNT_DISTINCT, so distinct_count_approx() failed to bind on the analytics route. Emit APPROX_COUNT_DISTINCT instead (the Java field name and PPL function name are unchanged).

The OpenSearch V3 / Lucene path is unaffected: it overrides this operator via the external HyperLogLog registration in OpenSearchExecutionEngine (whose name is unchanged), so explain output and execution on that path are byte-identical — verified by the unchanged explain_*distinct_count* / explain_*dc* expected-output files and the CalciteExplainIT distinct-count tests still passing. The analytics-route binding (APPROX_COUNT_DISTINCT → DataFusion approx_distinct) is completed by opensearch-project/OpenSearch#22013.

Testing

Test route before after
testSimpleCount0 v2 / analytics ✅ / ❌ ✅ / ✅
CalciteExplainIT distinct-count/dc explain (5) v2 ✅ (unchanged)
testCountDistinctApprox / …WithAlias analytics ✅ once #22013 is in the analytics base¹

¹ The operator rename is verified to be a no-op on v2 (external HLL registration takes precedence). The analytics-route pass for distinct_count_approx is completed by OpenSearch#22013 (already merged on main); local verification of that leg requires an analytics cluster built on a base that includes #22013.

Diagnosis / analytics-engine side by Sandesh Kumar.

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 2a8ca85)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@ahkcs ahkcs changed the title Use a parquet-backed index in CalcitePPLAggregationIT.testSimpleCount0 Fix CalcitePPLAggregationIT on the analytics-engine route (parquet testSimpleCount0 + APPROX_COUNT_DISTINCT name) Jun 8, 2026
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit ce2a02a

@ahkcs ahkcs added the enhancement New feature or request label Jun 8, 2026
ahkcs added 2 commits June 8, 2026 15:24
A bare auto-created index isn't composite/parquet-backed, so on the analytics-engine
route it doesn't route to the analytics engine. Switch to TEST_INDEX_BANK (loaded via
loadIndex, which injects parquet settings when the flag is set, 7 docs) so the test is
meaningful on both routes. Diagnosis by Sandesh Kumar.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
distinct_count_approx() failed to bind on the analytics-engine (DataFusion) route
because the SqlAggFunction was named DISTINCT_COUNT_APPROX; the backend resolves
aggregates by the Calcite/Substrait-standard name APPROX_COUNT_DISTINCT. The Java
field name and PPL function name are unchanged. The OpenSearch V3 path is unaffected
(it overrides this via the external HLL registration). Analytics-route binding is
completed by opensearch-project/OpenSearch#22013. Per Sandesh Kumar.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the fix-simplecount0-parquet-index branch from ce2a02a to 2a8ca85 Compare June 8, 2026 22:26
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 2a8ca85

@ahkcs ahkcs merged commit 74cdf9e into opensearch-project:main Jun 8, 2026
40 checks passed
@ahkcs ahkcs deleted the fix-simplecount0-parquet-index branch June 8, 2026 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants