Skip to content

Carry CalciteFieldFormatCommandIT through the helper-managed index path#5417

Merged
ahkcs merged 1 commit intoopensearch-project:feature/mustang-ppl-integrationfrom
ahkcs:mustang/fieldformat-it-helper-managed-index
May 7, 2026
Merged

Carry CalciteFieldFormatCommandIT through the helper-managed index path#5417
ahkcs merged 1 commit intoopensearch-project:feature/mustang-ppl-integrationfrom
ahkcs:mustang/fieldformat-it-helper-managed-index

Conversation

@ahkcs
Copy link
Copy Markdown
Collaborator

@ahkcs ahkcs commented May 7, 2026

Summary

Same shape as #5407 for CalciteEvalCommandIT. The IT's init() previously created the in-test test_eval index via direct PUT /test_eval/_doc/N requests, relying on dynamic mapping. Two problems:

  1. The doc PUTs auto-create the index with whatever settings the cluster defaults to. The analytics-engine compatibility path (force-routing on; tests.analytics.parquet_indices=true) needs parquet-backed indices, which TestUtils.createIndexByRestClient applies via TestUtils.makeParquetBacked when the system property is set. Direct PUTs sidestep that helper, so test_eval lands as Lucene-backed and the analytics planner rejects it with No backend can scan all requested fields on index [test_eval]. All four otherwise-working tests fail at execution.
  2. init() runs before every @Test method. The doc PUTs are doc-level idempotent, so re-running was wasteful but not failing. Once we switch to createIndexByRestClient, the index-level PUT is no longer idempotent and re-running throws resource_already_exists_exception.

Both addressed in one change:

  • test_eval is created via TestUtils.createIndexByRestClient with an explicit mapping (name/title=keyword, age=long). The helper honours tests.analytics.parquet_indices=true and produces a parquet-backed index for the analytics-engine sweep; on the v2 path the helper is a no-op around the index PUT, so behaviour is unchanged.
  • The whole init body is guarded by TestUtils.isIndexExist — same idempotency idiom that loadIndex uses for predefined fixtures. First @Test method provisions; subsequent methods skip.

Also pins the projection order on testFieldFormatStringConcatenation. The original query had no | fields clause and relied on the implicit projection's column order — v2 returns Lucene-source insertion order, analytics returns parquet-storage order (alphabetical), so the assertion only matched on v2 by coincidence. Adding | fields name, title, age, greeting makes the assertion deterministic across paths; the existing expected rows already match this order, so v2 behaviour is preserved. The other four tests already have explicit | fields ... clauses.

No semantic change for the v2 path: the explicit mapping types (keyword, long, keyword) resolve to the same PPL types ("string", "bigint", "string") that dynamic mapping inferred, and fieldformat reads from _source either way.

Pass rate

CalciteFieldFormatCommandIT against a runTask cluster with analytics-engine + opensearch-sql-plugin installed, invoked via :integ-test:analyticsCompatibilityTest --tests 'org.opensearch.sql.calcite.remote.CalciteFieldFormatCommandIT':

Test Before After
testFieldFormatStringConcatenation No backend can scan all requested fields on index [test_eval]
testFieldFormatStringConcatenationWithNullField ❌ same
testFieldFormatStringConcatWithSuffix ❌ same
testFieldFormatStringConcatWithPrefixSuffix ❌ same
testFieldFormatStringConcatenationWithNullFieldToString ❌ same No backend supports scalar function [TOSTRING] among [datafusion]
Total (analytics path) 0 / 5 4 / 5
Total (v2 path) 5 / 5 5 / 5 (no regression)

The remaining tostring()-using test is blocked on a multi-mode UDF (binary / hex / commas / duration / duration_millis) that the analytics path doesn't yet wire — tracked as out of scope here, plausible follow-up estimated at ~1 day for a native Rust UDF + Substrait extension + Java adapter.

Test plan

  • ./gradlew :integ-test:integTest --tests 'org.opensearch.sql.calcite.remote.CalciteFieldFormatCommandIT'5 / 5 green (v2 path, no regression).
  • ./gradlew :integ-test:analyticsCompatibilityTest --tests 'org.opensearch.sql.calcite.remote.CalciteFieldFormatCommandIT' against a runTask cluster — 4 / 5 pass (was 0 / 5 before this PR).

Related

Note on base

This PR targets feature/mustang-ppl-integration rather than main so it lands alongside the rest of the analytics-engine compatibility scaffolding (analyticsCompatibilityTest task, tests.analytics.parquet_indices propagation, loadIndex parquet-aware variant) that the helper-managed pattern relies on. The change is purely additive over the mustang feature branch.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

PR Reviewer Guide 🔍

(Review updated until commit 96fec3e)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Null Age Handling

Document 3 sets "age": null explicitly. With the explicit mapping age as long, inserting a null value may behave differently across Lucene and parquet-backed indices. Verify that the analytics-engine path handles null long values correctly and that test assertions account for this null value in result rows.

request3.setJsonEntity("{\"name\": \"Charlie\", \"age\": null, \"title\": \"Analyst\"}");
client().performRequest(request3);
Missing Index Cleanup

The test_eval index is created once and guarded by isIndexExist, but there is no corresponding teardown (e.g., @AfterAll or deleteIndex) to remove it after the test class finishes. This could leave stale state that affects other test classes or test suite reruns if the index persists between runs.

if (!TestUtils.isIndexExist(client(), "test_eval")) {
  String testEvalMapping =
      "{\"mappings\":{\"properties\":{"
          + "\"name\":{\"type\":\"keyword\"},"
          + "\"age\":{\"type\":\"long\"},"
          + "\"title\":{\"type\":\"keyword\"}}}}";
  TestUtils.createIndexByRestClient(client(), "test_eval", testEvalMapping);

  // Create test data for string concatenation
  Request request1 = new Request("PUT", "/test_eval/_doc/1?refresh=true");
  request1.setJsonEntity("{\"name\": \"Alice\", \"age\": 25, \"title\": \"Engineer\"}");
  client().performRequest(request1);

  Request request2 = new Request("PUT", "/test_eval/_doc/2?refresh=true");
  request2.setJsonEntity("{\"name\": \"Bob\", \"age\": 30, \"title\": \"Manager\"}");
  client().performRequest(request2);

  Request request3 = new Request("PUT", "/test_eval/_doc/3?refresh=true");
  request3.setJsonEntity("{\"name\": \"Charlie\", \"age\": null, \"title\": \"Analyst\"}");
  client().performRequest(request3);
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

PR Code Suggestions ✨

Latest suggestions up to 96fec3e
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Ensure all documents are visible before tests

The documents are inserted without a ?refresh=true wait after the last document, but
more critically, there is no refresh call after all three documents are indexed to
ensure they are all visible before tests run. Since each request uses refresh=true
individually, the last document's refresh is fine, but if the index creation itself
is asynchronous, tests may run before all documents are searchable. Consider adding
a dedicated refresh request (e.g., POST /test_eval/_refresh) after all three
documents are inserted to guarantee consistency.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteFieldFormatCommandIT.java [34-54]

 if (!TestUtils.isIndexExist(client(), "test_eval")) {
   String testEvalMapping =
       "{\"mappings\":{\"properties\":{"
           + "\"name\":{\"type\":\"keyword\"},"
           + "\"age\":{\"type\":\"long\"},"
           + "\"title\":{\"type\":\"keyword\"}}}}";
   TestUtils.createIndexByRestClient(client(), "test_eval", testEvalMapping);
 
   // Create test data for string concatenation
   Request request1 = new Request("PUT", "/test_eval/_doc/1?refresh=true");
   request1.setJsonEntity("{\"name\": \"Alice\", \"age\": 25, \"title\": \"Engineer\"}");
   client().performRequest(request1);
 
   Request request2 = new Request("PUT", "/test_eval/_doc/2?refresh=true");
   request2.setJsonEntity("{\"name\": \"Bob\", \"age\": 30, \"title\": \"Manager\"}");
   client().performRequest(request2);
 
   Request request3 = new Request("PUT", "/test_eval/_doc/3?refresh=true");
   request3.setJsonEntity("{\"name\": \"Charlie\", \"age\": null, \"title\": \"Analyst\"}");
   client().performRequest(request3);
+
+  // Ensure all documents are visible before tests run
+  client().performRequest(new Request("POST", "/test_eval/_refresh"));
 }
Suggestion importance[1-10]: 2

__

Why: Each individual PUT request already uses ?refresh=true, which forces a refresh after each document is indexed, making them immediately searchable. Adding an extra _refresh call after all three documents are already individually refreshed provides no additional benefit and is redundant.

Low

Previous suggestions

Suggestions up to commit 2f052e0
CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure index refresh after all documents indexed

The document PUT requests do not use ?refresh=true in a way that guarantees all
three documents are visible before the test queries run, since each refresh only
covers that single document. Consider adding a final explicit refresh call (e.g.,
client().performRequest(new Request("POST", "/test_eval/_refresh"))) after all three
documents are indexed to ensure full index visibility before tests execute.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteFieldFormatCommandIT.java [34-54]

 if (!TestUtils.isIndexExist(client(), "test_eval")) {
   String testEvalMapping =
       "{\"mappings\":{\"properties\":{"
           + "\"name\":{\"type\":\"keyword\"},"
           + "\"age\":{\"type\":\"long\"},"
           + "\"title\":{\"type\":\"keyword\"}}}}";
   TestUtils.createIndexByRestClient(client(), "test_eval", testEvalMapping);
 
   // Create test data for string concatenation
-  Request request1 = new Request("PUT", "/test_eval/_doc/1?refresh=true");
+  Request request1 = new Request("PUT", "/test_eval/_doc/1");
   request1.setJsonEntity("{\"name\": \"Alice\", \"age\": 25, \"title\": \"Engineer\"}");
   client().performRequest(request1);
 
-  Request request2 = new Request("PUT", "/test_eval/_doc/2?refresh=true");
+  Request request2 = new Request("PUT", "/test_eval/_doc/2");
   request2.setJsonEntity("{\"name\": \"Bob\", \"age\": 30, \"title\": \"Manager\"}");
   client().performRequest(request2);
 
-  Request request3 = new Request("PUT", "/test_eval/_doc/3?refresh=true");
+  Request request3 = new Request("PUT", "/test_eval/_doc/3");
   request3.setJsonEntity("{\"name\": \"Charlie\", \"age\": null, \"title\": \"Analyst\"}");
   client().performRequest(request3);
+
+  client().performRequest(new Request("POST", "/test_eval/_refresh"));
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion to add a final _refresh call after all documents are indexed is technically valid, but each individual PUT request already uses ?refresh=true, which forces a refresh after each document is indexed. This means all documents are already visible before the next request is made, making the additional refresh call redundant. The improved_code removes ?refresh=true from individual requests and adds a single final refresh, which is a valid alternative approach but not strictly necessary given the existing per-document refresh.

Low

@ahkcs ahkcs added the enhancement New feature or request label May 7, 2026
@ahkcs ahkcs enabled auto-merge (squash) May 7, 2026 19:12
@ahkcs ahkcs disabled auto-merge May 7, 2026 19:13
Same shape as opensearch-project#5407 for CalciteEvalCommandIT. The IT's init() previously
created the in-test test_eval index via direct `PUT /test_eval/_doc/N`
requests, relying on dynamic mapping. Two problems:

  1. The doc PUTs auto-create the index with whatever settings the cluster
     defaults to. The analytics-engine compatibility path (force-routing on;
     tests.analytics.parquet_indices=true) needs parquet-backed indices,
     which TestUtils.createIndexByRestClient applies via
     TestUtils.makeParquetBacked when the system property is set. Direct
     PUTs sidestep that helper, so test_eval lands as Lucene-backed and
     the analytics planner rejects it with "No backend can scan all
     requested fields on index [test_eval]". All four working tests fail
     at execution.

  2. init() runs before every @test method. The doc PUTs are doc-level
     idempotent, so re-running was wasteful but not failing. Once we
     switch to createIndexByRestClient, the index-level PUT is no longer
     idempotent and re-running throws "resource_already_exists_exception".

Both addressed in one change:

  - test_eval is created via TestUtils.createIndexByRestClient with an
    explicit mapping (name/title=keyword, age=long). The helper honours
    tests.analytics.parquet_indices=true and produces a parquet-backed
    index for the analytics-engine sweep; on the v2 path the helper is a
    no-op around the index PUT, so behaviour is unchanged.

  - The whole init body is guarded by TestUtils.isIndexExist — same
    idempotency idiom that loadIndex uses for predefined fixtures. First
    @test method provisions; subsequent methods skip.

Also pins the projection order on testFieldFormatStringConcatenation. The
original query (`source=test_eval | fieldformat greeting = 'Hello ' +
name`) had no `| fields` clause and relied on the implicit projection's
column order — v2 returns Lucene-source insertion order, analytics returns
parquet-storage order (alphabetical), so the assertion only matched on v2
by coincidence. Adding `| fields name, title, age, greeting` makes the
assertion deterministic across paths; the existing expected rows
(`rows("Alice", "Engineer", 25, "Hello Alice")`) already match this
order, so v2 behaviour is preserved.

The other four tests already had explicit `| fields ...` clauses, so no
change there.

No semantic change for the v2 path: the explicit mapping types (keyword,
long, keyword) resolve to the same PPL types ("string", "bigint", "string")
that dynamic mapping inferred, and fieldformat reads from _source either
way.

Analytics-route compatibility goes from 1/5 to 4/5 (verified locally
against a runTask cluster with analytics-engine + opensearch-sql-plugin).
The remaining `testFieldFormatStringConcatenationWithNullFieldToString`
needs a `tostring()` UDF on the analytics path — a multi-mode UDF (binary
/ hex / commas / duration) tracked separately as out of scope.

Test plan:
- ./gradlew :integ-test:integTest --tests
  'org.opensearch.sql.calcite.remote.CalciteFieldFormatCommandIT'
  -> 5/5 green (v2 path, no regression).
- ./gradlew :integ-test:analyticsCompatibilityTest --tests
  'org.opensearch.sql.calcite.remote.CalciteFieldFormatCommandIT'
  -> 4/5 pass; the 5th fails on `tostring`'s missing capability registration,
  which is the documented out-of-scope category.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the mustang/fieldformat-it-helper-managed-index branch from 2f052e0 to 96fec3e Compare May 7, 2026 19:17
@ahkcs ahkcs changed the base branch from main to feature/mustang-ppl-integration May 7, 2026 19:17
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 96fec3e.

'Diff too large, requires skip by maintainers after manual review'


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 96fec3e

@ahkcs ahkcs merged commit 7da02d5 into opensearch-project:feature/mustang-ppl-integration May 7, 2026
68 of 74 checks passed
ahkcs added a commit to ahkcs/sql that referenced this pull request May 8, 2026
…ation

Brings the catch-up branch up to current upstream/main (4 commits since
this PR was opened) and current feature/mustang-ppl-integration (9
commits since this PR was opened), so the PR is mergeable into
feature/mustang-ppl-integration without conflicts.

Squashed (rather than two real merge commits) for the same DCO reason
the original commit was squashed: upstream commits authored by many
contributors with inconsistent or missing Signed-off-by trailers would
otherwise be brought into this PR's history.

Newer main commits absorbed (4):
- opensearch-project#5419 (LENGTH/REGEXP_REPLACE/DATE_TRUNC unified function spec)
- opensearch-project#5408 (datetime type normalization)
- opensearch-project#5414 (Gradle wrapper bump + @ignore exclusion)
- opensearch-project#5399 (FGAC-scoped SQL cursor continuation)

Newer feature commits absorbed (9):
- opensearch-project#5403 (analytics-engine optional dependency — major rewiring)
- opensearch-project#5407 (Carry CalciteEvalCommandIT through helper-managed index path)
- opensearch-project#5413 (Default plugins.calcite.enabled=true on unified path)
- opensearch-project#5415, opensearch-project#5416, opensearch-project#5417, opensearch-project#5409, opensearch-project#5400, opensearch-project#5406 (smaller carryovers + bumps)

Conflict resolutions (10 from main side, 3 from feature side):

api/spec/* (LanguageSpec, UnifiedFunctionSpec, UnifiedPplSpec,
UnifiedSqlSpec): took main. Main is a strict superset — adds
postAnalysisRules and preCompilationRules extension points, the new
FunctionSpecBuilder DSL, SCALAR category for length/regexp_replace/
date_trunc, the DatetimeExtension on PPL spec, and the CoreExtension
wiring on SQL spec. PR's RELEVANCE category is preserved unchanged.

api/UnifiedQueryPlanner.java, api/compiler/UnifiedQueryCompiler.java:
took main. Both adopt the new postAnalysisRules / preCompilationRules
hooks introduced in opensearch-project#5408 / opensearch-project#5419.

core/executor/QueryService.java: composed both sides — kept HEAD's
CalciteClassLoaderHelper.withCalciteClassLoader wrapper around main's
StageErrorHandler stage tracking. Same pattern as the original PR
resolution; both improvements are orthogonal.

legacy/plugin/RestSqlAction.java: took HEAD. The 3-way merge produced
a duplicated handleException/getRawErrorCode block; HEAD already
contained both the delegateToV2Engine refactor and the ErrorReport
unwrap from main, so HEAD is the correct superset.

integ-test/build.gradle: took feature. Both sides added the same
@ignore exclusion block; feature has alphabetical ordering and a more
detailed comment explaining the Gradle 9.4.1 TestEventReporterAsListener
cast bug.

integ-test/.../CalciteEvalCommandIT.java: composed both sides. Took
feature's helper-managed test_eval provisioning (createIndexByRestClient
+ isIndexExist guard, from opensearch-project#5407) so analytics-engine compatibility runs
get a parquet-backed index. Added back PR HEAD's test_eval_agent setup
(needed by the dotted-path eval tests for opensearch-project#5351) wrapped in its own
isIndexExist guard for the same parquet-aware idempotency.

plugin/.../TransportPPLQueryAction.java: took feature. PR opensearch-project#5403 made
analytics-engine an optional dependency by moving QueryPlanExecutor
from a required constructor parameter to an @Inject(optional=true)
setter. Feature's design supersedes our prior wiring.

plugin/.../SQLPlugin.java: took feature. The same opensearch-project#5403 simplification
removed loadExtensions/EngineExtensionsHolder/executionEngineExtensions
plumbing (no longer needed once analytics-engine is optionally bound).
Feature retains the createSqlAnalyticsRouter method this PR introduced.

plugin/.../config/EngineExtensionsHolder.java: deleted. Unreferenced
after taking feature's SQLPlugin/TransportPPLQueryAction; not present
on feature branch.

Build: :api, :core, :opensearch-sql-plugin, :legacy compileJava +
:integ-test compileTestJava all pass; unit tests pass; spotlessCheck
clean.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant