spath: parquet-backed test indices for analytics-engine route by ahkcs · Pull Request #5441 · opensearch-project/sql

ahkcs · 2026-05-14T18:07:57Z

Pairs with opensearch-project/OpenSearch#21664. Both PRs are required to move CalcitePPLSpathCommandIT off 0 / 16 on the analytics-engine route.

What the change does

CalcitePPLSpathCommandIT.init() was creating its four test indices by raw PUT /<idx>/_doc/N requests, which auto-creates the index via the default Lucene path. The analytics-engine compatibility run (-Dtests.analytics.parquet_indices=true) only injects parquet/composite settings inside TestUtils.createIndexByRestClient, so the raw-PUT indices were Lucene-only — and DataFusion fails with UnsupportedOperationException: acquireReader is not supported in EngineBackedIndexer on any Lucene-only index.

The fix is one line per index: create the empty index up-front through the helper so the parquet toggle gets a chance to inject its settings, then let the existing doc PUTs populate it via dynamic mapping.

Before: PUT /test_spath/_doc/1 {…}                     → auto-creates Lucene-backed index
After:  PUT /test_spath {}     (via helper, no mapping) → empty index inherits parquet settings
        PUT /test_spath/_doc/1 {…}                     → doc lands in parquet-backed index

No mapping is declared (null mapping argument) — DataFusion handles dynamic mapping on parquet-backed composite indices just fine. Same pattern as CalciteEvalCommandIT and CalciteFieldFormatCommandIT. No change for the v2 / Calcite path; the helper is a no-op when the parquet toggle isn't set.
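
As a rough illustration of why the explicit create call matters, the sketch below models a helper that attaches extra settings only when the toggle is on. It is not the real `TestUtils.createIndexByRestClient`: the class name, `settingsFor`, and the setting key are hypothetical placeholders, and only the `tests.analytics.parquet_indices` property name comes from this PR. An index that is auto-created by a document PUT never passes through such a code path, so it never receives the settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the create-index helper; not the real TestUtils code. */
final class ParquetToggleSketch {

  /** Placeholder key: the real parquet/composite settings are not shown in this PR. */
  static final String HYPOTHETICAL_SETTING = "index.hypothetical.parquet_backed";

  /**
   * Builds the settings sent with an explicit index creation. With the toggle off
   * the map is empty, so the index is created exactly as it was before.
   */
  static Map<String, Object> settingsFor(boolean parquetToggle) {
    Map<String, Object> settings = new LinkedHashMap<>();
    if (parquetToggle) {
      settings.put(HYPOTHETICAL_SETTING, true);
    }
    return settings;
  }

  public static void main(String[] args) {
    // Reads -Dtests.analytics.parquet_indices=true, the flag named in the PR description.
    boolean toggle = Boolean.getBoolean("tests.analytics.parquet_indices");
    System.out.println("create-index settings: " + settingsFor(toggle));
  }
}
```

Run without the system property, the sketch produces an empty settings map, which corresponds to the "no-op when the parquet toggle isn't set" behaviour described above.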

Pass rate

| IT | Route | Before | After |
|---|---|---|---|
| `CalcitePPLSpathCommandIT` | analytics-engine (`-Dtests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true`) | 0 / 16 | 16 / 16 |
| `CalcitePPLSpathCommandIT` | default v2 / Calcite (no flags) | 16 / 16 | 16 / 16 (no regression) |

github-actions · 2026-05-14T18:09:09Z

PR Reviewer Guide 🔍

(Review updated until commit `5eb2dd4`)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The `init()` method may be called multiple times across test runs or in parallel test execution scenarios. The `isIndexExist` checks prevent duplicate index creation within a single run, but if indices persist between runs (e.g., test cleanup fails), subsequent runs will skip index creation but still attempt to insert documents. This can lead to duplicate documents with the same IDs, potentially causing test flakiness. The original code's direct PUT approach was idempotent (same doc ID overwrites), but the new guarded approach only creates indices once while still executing PUTs every time.

    if (!TestUtils.isIndexExist(client(), "test_spath")) {
      TestUtils.createIndexByRestClient(client(), "test_spath", null);
      Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
      request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
      client().performRequest(request1);
      Request request2 = new Request("PUT", "/test_spath/_doc/2?refresh=true");
      request2.setJsonEntity("{\"doc\": \"{\\\"n\\\": 2}\"}");
      client().performRequest(request2);
      Request request3 = new Request("PUT", "/test_spath/_doc/3?refresh=true");
      request3.setJsonEntity("{\"doc\": \"{\\\"n\\\": 3}\"}");
      client().performRequest(request3);
    }

    // Auto-extract mode: flatten rules and edge cases (empty, malformed)
    if (!TestUtils.isIndexExist(client(), "test_spath_auto")) {
      TestUtils.createIndexByRestClient(client(), "test_spath_auto", null);
      Request autoExtractDoc = new Request("PUT", "/test_spath_auto/_doc/1?refresh=true");
      autoExtractDoc.setJsonEntity(
          "{\"nested_doc\": \"{\\\"user\\\":{\\\"name\\\":\\\"John\\\"}}\","
              + " \"array_doc\": \"{\\\"tags\\\":[\\\"java\\\",\\\"sql\\\"]}\","
              + " \"merge_doc\": \"{\\\"a\\\":{\\\"b\\\":1},\\\"a.b\\\":2}\","
              + " \"stringify_doc\": \"{\\\"n\\\":30,\\\"b\\\":true,\\\"x\\\":null}\","
              + " \"empty_doc\": \"{}\","
              + " \"malformed_doc\": \"{\\\"user\\\":{\\\"name\\\":\"}");
      client().performRequest(autoExtractDoc);
    }

    // Auto-extract mode: 2-doc index for spath + command (eval/where/stats/sort) tests
    if (!TestUtils.isIndexExist(client(), "test_spath_cmd")) {
      TestUtils.createIndexByRestClient(client(), "test_spath_cmd", null);
      Request cmdDoc1 = new Request("PUT", "/test_spath_cmd/_doc/1?refresh=true");
      cmdDoc1.setJsonEntity(
          "{\"doc\": \"{\\\"user\\\":{\\\"name\\\":\\\"John\\\",\\\"age\\\":30}}\"}");
      client().performRequest(cmdDoc1);
      Request cmdDoc2 = new Request("PUT", "/test_spath_cmd/_doc/2?refresh=true");
      cmdDoc2.setJsonEntity(
          "{\"doc\": \"{\\\"user\\\":{\\\"name\\\":\\\"Alice\\\",\\\"age\\\":25}}\"}");
      client().performRequest(cmdDoc2);
    }

    // Auto-extract mode: null input handling (doc 1 establishes mapping, doc 2 has null)
    if (!TestUtils.isIndexExist(client(), "test_spath_null")) {
      TestUtils.createIndexByRestClient(client(), "test_spath_null", null);
      Request nullDoc1 = new Request("PUT", "/test_spath_null/_doc/1?refresh=true");
      nullDoc1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
      client().performRequest(nullDoc1);
      Request nullDoc2 = new Request("PUT", "/test_spath_null/_doc/2?refresh=true");
      nullDoc2.setJsonEntity("{\"doc\": null}");
      client().performRequest(nullDoc2);
    }
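
The idempotency question raised in this review can be sketched with a small in-memory model. This is an illustration only, under the assumption that a PUT to an explicit document ID replaces any existing document with that ID; `IdempotentSetupSketch` and its methods are hypothetical and do not touch a real cluster. It models the variant where only index creation is guarded and the document PUTs always run.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a cluster; it only models "PUT by ID overwrites". */
final class IdempotentSetupSketch {
  private final Map<String, Map<String, String>> indices = new HashMap<>();

  boolean indexExists(String index) {
    return indices.containsKey(index);
  }

  /** Explicit creation fails if the index is already there, hence the guard in init(). */
  void createIndex(String index) {
    if (indices.putIfAbsent(index, new HashMap<>()) != null) {
      throw new IllegalStateException("index already exists: " + index);
    }
  }

  /** Storing under an existing ID replaces the old document instead of adding a second one. */
  void putDoc(String index, String id, String source) {
    indices.computeIfAbsent(index, k -> new HashMap<>()).put(id, source);
  }

  int docCount(String index) {
    Map<String, String> docs = indices.get(index);
    return docs == null ? 0 : docs.size();
  }

  /** Guard only the creation; always (re)write the three fixed-ID documents. */
  void init() {
    if (!indexExists("test_spath")) {
      createIndex("test_spath");
    }
    for (int n = 1; n <= 3; n++) {
      putDoc("test_spath", String.valueOf(n), "{\"doc\": \"{\\\"n\\\": " + n + "}\"}");
    }
  }

  public static void main(String[] args) {
    IdempotentSetupSketch cluster = new IdempotentSetupSketch();
    cluster.init();
    cluster.init(); // a second run must not fail or duplicate documents
    System.out.println(cluster.docCount("test_spath")); // prints 3
  }
}
```

In this model a repeated `init()` leaves exactly three documents, which is the property the "idempotent PUT" remarks in the review rely on.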

github-actions · 2026-05-14T18:09:37Z

PR Code Suggestions ✨

Latest suggestions up to 5eb2dd4

Explore these optional code suggestions:

Category: General
Suggestion: Handle index creation failures explicitly
Impact: Low

If `createIndexByRestClient` fails but the index check passes later, subsequent document insertions will execute without verifying index creation success. Add error handling to ensure index creation completes successfully before inserting documents.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java [37-42]

 if (!TestUtils.isIndexExist(client(), "test_spath")) {
-  TestUtils.createIndexByRestClient(client(), "test_spath", null);
+  try {
+    TestUtils.createIndexByRestClient(client(), "test_spath", null);
+  } catch (Exception e) {
+    throw new RuntimeException("Failed to create test_spath index", e);
+  }
   Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
   request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
   client().performRequest(request1);

Suggestion importance[1-10]: 5

Why: The suggestion correctly identifies that `createIndexByRestClient` failures should be handled. However, the `init()` method already declares `throws Exception`, so exceptions would propagate naturally. The explicit try-catch adds marginal value by providing a more descriptive error message, but doesn't fundamentally change error handling behavior.

Category: Possible issue
Suggestion: Prevent race conditions in index creation
Impact: Low

The index existence check may cause race conditions in parallel test execution. Consider using a synchronized block or test isolation mechanism to prevent multiple threads from attempting to create the same index simultaneously, which could lead to index creation conflicts.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java [37-51]

-if (!TestUtils.isIndexExist(client(), "test_spath")) {
-  TestUtils.createIndexByRestClient(client(), "test_spath", null);
+synchronized (CalcitePPLSpathCommandIT.class) {
+  if (!TestUtils.isIndexExist(client(), "test_spath")) {
+    TestUtils.createIndexByRestClient(client(), "test_spath", null);
 
-  Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
-  request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
-  client().performRequest(request1);
-  ...
+    Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
+    request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
+    client().performRequest(request1);
+    ...
+  }
 }

Suggestion importance[1-10]: 4

Why: While the suggestion addresses a valid concern about race conditions in parallel test execution, the `init()` method is typically called once per test instance in JUnit, making this scenario less likely. The suggestion may add unnecessary complexity without significant benefit in typical test execution contexts.

Previous suggestions

Suggestions up to commit c4d4a8c

Category: General
Suggestion: Ensure idempotent document creation for tests
Impact: Low

The index existence check may race with concurrent test execution or cleanup. If the index exists but is empty or partially populated, subsequent tests may fail. Consider adding validation that the expected documents exist, or use a test fixture that ensures idempotent setup.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java [37-51]

 if (!TestUtils.isIndexExist(client(), "test_spath")) {
   TestUtils.createIndexByRestClient(client(), "test_spath", null);
+}
+// Ensure documents are present (idempotent PUT with same IDs)
+Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
+request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
+client().performRequest(request1);
+...
-  Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
-  ...
-  client().performRequest(request1);
-  ...
-}
-

Suggestion importance[1-10]: 6

Why: The suggestion addresses a potential race condition where the index might exist but be empty. Moving document creation outside the `isIndexExist` check makes the setup more idempotent and robust, though the current approach is likely sufficient for most test scenarios.

Category: General
Suggestion: Use Class.forName for robust type checking
Impact: Low

The string comparison for class name matching is fragile and may fail if the Arrow library is relocated or shaded. Consider using `Class.forName()` with exception handling to check if the class exists and perform an `instanceof` check, which is more robust and handles class loading correctly.

core/src/main/java/org/opensearch/sql/data/model/ExprValueUtils.java [121-123]

 private static boolean isArrowText(Object o) {
-  return o != null && ARROW_TEXT_CLASS_NAME.equals(o.getClass().getName());
+  if (o == null) return false;
+  try {
+    Class<?> arrowTextClass = Class.forName(ARROW_TEXT_CLASS_NAME);
+    return arrowTextClass.isInstance(o);
+  } catch (ClassNotFoundException e) {
+    return false;
+  }
 }

Suggestion importance[1-10]: 4

Why: While `Class.forName()` with `isInstance()` is more robust for type checking, the current string comparison approach is intentional to avoid adding an Arrow dependency to the `core/` module, as explicitly stated in the PR comments. The suggestion contradicts the design goal but offers a valid alternative approach.

Suggestions up to commit 9f6aef8

Category: General
Suggestion: Use Class.forName for robust type checking
Impact: Medium

The string comparison for class name matching is fragile and may fail if the Arrow library uses different class loaders or if the class is relocated/shaded. Consider using `Class.forName()` with exception handling to check if the class exists and then use `instanceof` check, or cache the Class object during initialization for better performance and reliability.

core/src/main/java/org/opensearch/sql/data/model/ExprValueUtils.java [121-123]

-private static boolean isArrowText(Object o) {
-  return o != null && ARROW_TEXT_CLASS_NAME.equals(o.getClass().getName());
+private static final Class<?> ARROW_TEXT_CLASS;
+static {
+  Class<?> clazz = null;
+  try {
+    clazz = Class.forName(ARROW_TEXT_CLASS_NAME);
+  } catch (ClassNotFoundException e) {
+    // Arrow not on classpath, will remain null
+  }
+  ARROW_TEXT_CLASS = clazz;
 }
 
+private static boolean isArrowText(Object o) {
+  return o != null && ARROW_TEXT_CLASS != null && ARROW_TEXT_CLASS.isInstance(o);
+}
+

Suggestion importance[1-10]: 7

Why: The suggestion improves robustness by using `Class.forName()` and `isInstance()` instead of string comparison for class name matching. This approach is more reliable and handles class loader scenarios better, though the current implementation is functional for the stated goal of avoiding Arrow dependencies in `core/`.

Category: General
Suggestion: Prevent race condition in index creation
Impact: Low

The index existence check followed by creation and document insertion creates a race condition in parallel test execution. If multiple test instances run concurrently, they could both pass the existence check and attempt creation simultaneously, causing conflicts. Consider using a synchronized block or test isolation mechanism.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java [49-63]

-if (!TestUtils.isIndexExist(client(), "test_spath")) {
-  TestUtils.createIndexByRestClient(client(), "test_spath", SIMPLE_DOC_MAPPING);
+synchronized (CalcitePPLSpathCommandIT.class) {
+  if (!TestUtils.isIndexExist(client(), "test_spath")) {
+    TestUtils.createIndexByRestClient(client(), "test_spath", SIMPLE_DOC_MAPPING);
 
-  Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
-  request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
-  client().performRequest(request1);
-  ...
+    Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
+    request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
+    client().performRequest(request1);
+    ...
+  }
 }

Suggestion importance[1-10]: 6

Why: The suggestion addresses a potential race condition in parallel test execution by adding synchronization around index creation. While this improves test reliability, the impact depends on whether tests actually run in parallel, and the `init()` method may already have test framework guarantees about sequential execution.

Suggestions up to commit f5ea743

Category: General
Suggestion: Ensure documents exist on every run
Impact: Medium

The index creation and document insertion are not atomic. If the test fails between `createIndexByRestClient` and document insertion, subsequent runs will skip document creation because the index exists, leading to test failures with missing documents.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java [49-63]

 if (!TestUtils.isIndexExist(client(), "test_spath")) {
   TestUtils.createIndexByRestClient(client(), "test_spath", SIMPLE_DOC_MAPPING);
+}
+// Always ensure documents exist, even if index was created in a previous run
+Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
+request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
+client().performRequest(request1);
+...
-  Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
-  request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
-  client().performRequest(request1);
-  ...
-}
-

Suggestion importance[1-10]: 7

Why: The suggestion addresses a valid concern about test reliability when index creation and document insertion are not atomic. However, the proposed solution of always inserting documents could lead to duplicate document issues or unnecessary overhead. A better approach might involve checking document existence or using upsert operations.

Category: General
Suggestion: Optimize type-check ordering for performance
Impact: Low

The `isArrowText` check should be positioned earlier in the type-checking chain, before the `String` instanceof check, to avoid potential performance overhead from checking String first when Arrow Text objects are common in the analytics-engine route.

core/src/main/java/org/opensearch/sql/data/model/ExprValueUtils.java [155-169]

 } else if (isArrowText(o)) {
-  // Arrow MapVector / StructVector yields values as
-  // `org.apache.arrow.vector.util.Text` — a UTF-8 byte-buffer wrapper that
-  // does NOT implement CharSequence and therefore wouldn't match any of the
-  // typed branches above. `Text.toString()` decodes to a real Java String.
-  // Matched by FQN rather than instanceof so `core/` doesn't acquire an
-  // Arrow dependency for one type-system bridge. Without this branch the
-  // analytics-engine route surfaces `ExpressionEvaluationException:
-  // unsupported object class org.apache.arrow.vector.util.Text` from any
-  // UDF returning Map<Utf8, Utf8> (first such UDF is `json_extract_all`
-  // powering PPL `spath`).
   return stringValue(o.toString());
+} else if (o instanceof String) {
+  return stringValue((String) o);
 } else if (o instanceof Float f) {

Suggestion importance[1-10]: 3

Why: While the suggestion about ordering is theoretically valid for performance, the impact is minimal since `instanceof String` is a very fast operation. The current ordering (checking `String` before `isArrowText`) is more logical since `String` is a more common type in general usage, and the Arrow Text case is specific to the analytics-engine route.

Suggestions up to commit 1a07691

Category: Possible issue
Suggestion: Handle null map keys explicitly
Impact: Medium

The `String.valueOf(k)` call will produce `"null"` string literal when `k` is null, which pollutes the map with a synthetic key. Add an explicit null-check before the `instanceof String` branch to skip null keys or throw an exception, preventing silent data corruption.

core/src/main/java/org/opensearch/sql/data/model/ExprValueUtils.java [99-102]

 map.forEach(
-    (k, v) -> valueMap.put(
-        k instanceof String ? (String) k : String.valueOf(k),
-        v instanceof ExprValue ? (ExprValue) v : fromObjectValue(v)));
+    (k, v) -> {
+      if (k == null) {
+        return; // or throw new IllegalArgumentException("Map keys cannot be null");
+      }
+      valueMap.put(
+          k instanceof String ? (String) k : String.valueOf(k),
+          v instanceof ExprValue ? (ExprValue) v : fromObjectValue(v));
+    });

Suggestion importance[1-10]: 7

Why: Valid concern about `String.valueOf(null)` producing `"null"` literal. However, the suggestion assumes null keys are possible from Arrow MapVector, which may not occur in practice. The fix prevents potential data corruption but may be defensive programming for an edge case not demonstrated in the PR context.

Category: General
Suggestion: Guard against Arrow Text decoding failures
Impact: Low

The `o.toString()` call on Arrow Text objects may throw exceptions if the underlying UTF-8 buffer is malformed or corrupted. Wrap the `toString()` call in a try-catch block to handle potential decoding failures gracefully, returning a null value or logging the error instead of propagating the exception.

core/src/main/java/org/opensearch/sql/data/model/ExprValueUtils.java [164-175]

 } else if (isArrowText(o)) {
-  ...
-  return stringValue(o.toString());
+  try {
+    return stringValue(o.toString());
+  } catch (Exception e) {
+    // Log the error or return LITERAL_NULL for malformed UTF-8
+    return LITERAL_NULL;
+  }

Suggestion importance[1-10]: 6

Why: Error handling suggestion for `toString()` on Arrow Text objects. While defensive, the PR documentation doesn't indicate UTF-8 decoding failures are a known issue. The suggestion adds robustness but may be over-engineering for a stable Arrow library operation.

Category: General
Suggestion: Add error handling for index setup
Impact: Low

The index creation and document insertion logic lacks error handling for REST client failures. If `createIndexByRestClient` or `performRequest` throws an IOException, the test setup will fail silently or with unclear error messages. Wrap these operations in try-catch blocks or let the IOException propagate with descriptive context to aid debugging.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java [49-63]

 if (!TestUtils.isIndexExist(client(), "test_spath")) {
-  TestUtils.createIndexByRestClient(client(), "test_spath", SIMPLE_DOC_MAPPING);
-  Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
-  ...
+  try {
+    TestUtils.createIndexByRestClient(client(), "test_spath", SIMPLE_DOC_MAPPING);
+    Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true");
+    request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}");
+    client().performRequest(request1);
+    ...
+  } catch (IOException e) {
+    throw new RuntimeException("Failed to initialize test_spath index", e);
+  }
+}

Suggestion importance[1-10]: 4

Why: The suggestion adds error handling for REST client operations in test setup. However, the `init()` method already declares `throws Exception`, so IOExceptions will propagate naturally. The suggested try-catch with RuntimeException wrapping doesn't add significant value and may obscure the original exception type.
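
The null-key concern in the first suggestion of this batch is easy to reproduce in isolation. The sketch below is a standalone illustration, not the `ExprValueUtils` code: `NullKeySketch` and `stringifyKeys` are hypothetical names, and only the key-conversion expression is taken from the quoted diff.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class NullKeySketch {

  /** Same key conversion as in the quoted diff: non-String keys go through String.valueOf. */
  static Map<String, Object> stringifyKeys(Map<?, ?> input) {
    Map<String, Object> out = new LinkedHashMap<>();
    input.forEach((k, v) -> out.put(k instanceof String ? (String) k : String.valueOf(k), v));
    return out;
  }

  public static void main(String[] args) {
    Map<Object, Object> raw = new HashMap<>(); // HashMap permits a single null key
    raw.put(null, "orphan");
    raw.put("n", 1);

    Map<String, Object> converted = stringifyKeys(raw);
    // The null key has silently become the four-character string "null".
    System.out.println(converted.containsKey("null")); // prints true
  }
}
```

Whether Arrow can actually hand back a null map key is not established in this thread, so the sketch only shows what the conversion would do if one appeared.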

github-actions · 2026-05-14T18:47:01Z

Persistent review updated to latest commit f5ea743

penghuo · 2026-05-14T21:45:00Z

+  private static final String ARROW_TEXT_CLASS_NAME = "org.apache.arrow.vector.util.Text";
+
+  /**
+   * Whether {@code o} is an Arrow {@code Text} (the UTF-8 byte-buffer wrapper that arrow's Map /
+   * Struct / List vectors emit for string values). FQN match keeps {@code core/} free of an Arrow
+   * dependency.
+   */
+  private static boolean isArrowText(Object o) {
+    return o != null && ARROW_TEXT_CLASS_NAME.equals(o.getClass().getName());
+  }


ExprValueUtils should not know about Arrow data types. Why is ExprValueUtils being used on the execution code path?

ahkcs · 2026-05-14

Good catch, updated to remove the change.
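
For context on the approach discussed in this thread (and later removed), matching a type by its fully qualified name avoids a compile-time dependency but only matches the exact runtime class. The sketch below illustrates that trade-off with JDK classes standing in for Arrow's `Text`; `FqnMatchSketch` and `hasClassName` are hypothetical names, not code from the PR.

```java
final class FqnMatchSketch {

  /** True only when the runtime class of {@code o} is exactly {@code fqn}. */
  static boolean hasClassName(Object o, String fqn) {
    return o != null && fqn.equals(o.getClass().getName());
  }

  public static void main(String[] args) {
    Object sb = new StringBuilder("x");

    // Exact class name: matches, and no import of the target class is needed.
    System.out.println(hasClassName(sb, "java.lang.StringBuilder")); // prints true

    // Supertypes and interfaces do not match, unlike instanceof.
    System.out.println(hasClassName(sb, "java.lang.CharSequence")); // prints false
    System.out.println(sb instanceof CharSequence); // prints true
  }
}
```

The same exact-match behaviour is why the bot suggestions above call the string comparison fragile under relocation or shading: a renamed package changes the class name and the check stops matching.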

penghuo · 2026-05-14T21:49:47Z

+  private static final String AUTO_DOC_MAPPING =
+      "{\"mappings\":{\"properties\":{"
+          + "\"nested_doc\":{\"type\":\"keyword\"},"
+          + "\"array_doc\":{\"type\":\"keyword\"},"
+          + "\"merge_doc\":{\"type\":\"keyword\"},"
+          + "\"stringify_doc\":{\"type\":\"keyword\"},"
+          + "\"empty_doc\":{\"type\":\"keyword\"},"
+          + "\"malformed_doc\":{\"type\":\"keyword\"}}}}";


Why add AUTO_DOC_MAPPING? Because the analytics engine does not support dynamic mapping?

ahkcs · 2026-05-14

Updated to remove the explicit mapping. Currently our IT creates the index lazily, which makes it a default Lucene-backed index. The change creates the empty index up-front through the helper so the parquet toggle gets a chance to inject its settings.

`CalcitePPLSpathCommandIT.init()` was creating its four test indices by raw `PUT /<idx>/_doc/N` requests, which auto-creates the index via the default Lucene path. The analytics-engine compatibility run (`-Dtests.analytics.parquet_indices=true`) injects the parquet/composite settings *inside* `TestUtils.createIndexByRestClient`, so the raw-PUT indices were Lucene-only and DataFusion failed with `UnsupportedOperationException: acquireReader is not supported in EngineBackedIndexer` for every test on the analytics-engine route.

Fix: create the empty index up-front via `createIndexByRestClient(..., null)` so the toggle has a chance to inject parquet settings, then let the subsequent doc PUTs populate it via dynamic mapping.

No mapping is declared; DataFusion is fine with dynamic mapping on a parquet-backed composite index. Same pattern as `CalciteEvalCommandIT` and `CalciteFieldFormatCommandIT`. No change for the v2 / Calcite path (the helper is a no-op when the parquet toggle isn't set).

## Pass rate

Pairs with opensearch-project/OpenSearch#21664. Both PRs are required to move the analytics-engine route off 0 / 16.

| IT | Route | Before | After |
|---|---|---|---|
| `CalcitePPLSpathCommandIT` | analytics-engine (`-Dtests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true`) | 0 / 16 | **16 / 16** |
| `CalcitePPLSpathCommandIT` | default v2 / Calcite (no flags) | 16 / 16 | 16 / 16 (no regression) |

Signed-off-by: Kai Huang <ahkcs@amazon.com>

github-actions · 2026-05-14T22:19:44Z

Persistent review updated to latest commit 5eb2dd4

…gine route

Closes the analytics-engine gap for the PPL `spath` command. The path-mode variant (`spath path=...`) already worked via the existing `json_extract` wiring; this PR adds the auto-extract mode (`spath input=doc` → `JSON_EXTRACT_ALL` returning `MAP<VARCHAR, VARCHAR>`) and its downstream operators (ITEM lookup, WHERE on extracted values).

## Pass rate

| IT | Before | After |
|---|---|---|
| `sql/integ-test/.../CalcitePPLSpathCommandIT` (`-Dtests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true`) | 0 / 16 | 16 / 16 |
| `sql/integ-test/.../CalcitePPLSpathCommandIT` (default v2/Calcite route) | 16 / 16 | 16 / 16 (no regression) |
| `sandbox/qa/analytics-engine-rest/.../SpathCommandIT` (new) | n/a | 16 / 16 |

Baseline failure modes on the analytics-engine route:

- 15 tests: `OpenSearchProjectRule.annotateExpr` → `No backend supports scalar function [JSON_EXTRACT_ALL] among [datafusion]`.
- 1 test (`testSimpleSpath`): `EngineBackedIndexer.acquireReader` → `UnsupportedOperationException` (test-infra issue, fixed on the SQL plugin side in a paired PR).

## What's in this PR

1. **`json_extract_all` Rust UDF** (`sandbox/plugins/analytics-backend-datafusion/rust/src/udf/json_extract_all.rs`). ~550 lines + 16 unit tests. Returns Arrow `Map<Utf8, Utf8>`; mirrors `JsonExtractAllFunctionImpl`'s legacy contract (dot-path flatten, `{}` array marker, `[a, b, c]` merge format for duplicate keys / arrays, `"null"` literal for JSON nulls, malformed → empty map, top-level scalar → NULL).
2. **SPI enum additions** in `analytics-framework`:
   - `ScalarFunction.JSON_EXTRACT_ALL` enum constant.
   - `FieldType.MAP` enum constant + `case MAP -> FieldType.MAP` in `fromSqlTypeName`.
3. **Capability registrations** in `DataFusionAnalyticsBackendPlugin`:
   - New `MAP_RETURNING_PROJECT_OPS` set (mirrors `ARRAY_RETURNING_PROJECT_OPS`) registered with `FieldType.MAP`. Required because `OpenSearchProjectRule.resolveScalarViableBackends` keys on the call's return type, and JSON_EXTRACT_ALL's `MAP<VARCHAR, VARCHAR>` return wouldn't match `SUPPORTED_FIELD_TYPES`.
   - `STANDARD_FILTER_OPS` registered against `FieldType.MAP` so `where doc.user.name = 'John'` (which references the underlying MAP column through ITEM) survives the filter-rule's field-index-keyed viability check.
   - Adapter binding `ScalarFunction.JSON_EXTRACT_ALL → JsonExtractAllAdapter`.
4. **Substrait wiring**:
   - `opensearch_scalar_functions.yaml`: entries for `json_extract_all` and `map_extract`.
   - `DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS`: function mappings for both names.
   - `JsonFunctionAdapters.JsonExtractAllAdapter`: name-mapping adapter.
5. **`ITEM(Map, key)` dispatch** in `ArrayElementAdapter`. PPL's `result.user.name` lowers to `ITEM(JSON_EXTRACT_ALL(doc), 'user.name')`. Two transforms for the MAP-input branch:
   - Route to `map_extract` (DataFusion's native map accessor) instead of `array_element`. Since `map_extract` returns `List<value>` (maps permit duplicate keys), wrap the call in `array_element(..., 1)` to project the singleton list back to a scalar.
   - Coerce the lookup key (CHAR(N) literal) to VARCHAR before emission so it unifies with the substrait `any1` type-variable binding the YAML declares.
6. **`ArrowValues.MapVector` flattening** in `analytics-engine`. Arrow `MapVector` is laid out as `List<Struct{key, value}>`, so `MapVector.getObject(i)` returns a `JsonStringArrayList` of entry structs rather than a proper map. Reassemble into a `LinkedHashMap<String, Object>` (Text→String normalization on keys and values) so the SQL-plugin response marshaller sees the same shape as a legacy v2 `Map<String, Object>` column.
7. **`gradle/run.gradle`**: the `arrow-flight-rpc` plugin block now also sets `opensearch.experimental.feature.transport.stream.enabled=true`, so the analytics-engine + SQL-plugin co-install boots without the duplicate-PPL-transport-handler Guice failure.
8. **QA-side `SpathCommandIT`** under `sandbox/qa/analytics-engine-rest/...`. Mirrors `CalcitePPLSpathCommandIT` one test method to one, sends queries via `POST /_analytics/ppl`, no SQL-plugin dependency. Verifies the full spath surface end-to-end (both modes, ITEM-on-MAP eval / where / stats / sort, edge cases). Four small datasets under `resources/datasets/spath_{simple,auto,cmd,null}/`.

## Knock-on coverage

Every piece in this PR is reusable beyond `spath`:

- The MAP_RETURNING_PROJECT_OPS pattern + MAP filter capability are generic for any future PPL function emitting a Calcite MAP RelDataType.
- `ArrayElementAdapter`'s ITEM-on-MAP branch + the `map_extract` YAML entry handle every `result['key']` / `result.field` access on a map column, not just spath's.
- `ArrowValues.MapVector` flattening unblocks any UDF returning `Map<Utf8, Utf8>` from the analytics-engine route.

## Paired SQL plugin PR

The SQL plugin side has a test-infrastructure change to ensure the v2 / Calcite IT's test indices get parquet-backed for the analytics-engine compatibility run: opensearch-project/sql#5441.
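
The `ITEM(Map, key)` rewrite described in item 5 of the paired PR can be sketched in plain Java. This is a behavioural model only, not the adapter or DataFusion itself: `ItemOnMapSketch`, `mapExtract`, and `arrayElement` are hypothetical stand-ins, and the assumptions are that the map accessor returns an empty list for a missing key and that `array_element` is 1-based and yields null when the index is out of range.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ItemOnMapSketch {

  /** Stand-in for map_extract: the lookup result comes back wrapped in a list. */
  static List<String> mapExtract(Map<String, String> map, String key) {
    if (map.containsKey(key)) {
      return Collections.singletonList(map.get(key));
    }
    return Collections.emptyList(); // assumption: a missing key gives an empty list
  }

  /** Stand-in for array_element with a 1-based index; null when out of range. */
  static String arrayElement(List<String> list, int oneBasedIndex) {
    int i = oneBasedIndex - 1;
    return (i >= 0 && i < list.size()) ? list.get(i) : null;
  }

  /** ITEM(map, key) lowered as array_element(map_extract(map, key), 1). */
  static String item(Map<String, String> map, String key) {
    return arrayElement(mapExtract(map, key), 1);
  }

  public static void main(String[] args) {
    Map<String, String> extracted = Map.of("user.name", "John", "user.age", "30");
    System.out.println(item(extracted, "user.name")); // prints John
    System.out.println(item(extracted, "user.email")); // prints null
  }
}
```

Wrapping the list-returning accessor in an index-1 lookup is what turns the singleton list back into a scalar, which is the reason the description gives for the extra `array_element(..., 1)` call.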
## How to verify

```bash
# Start the cluster with all sandbox plugins
JAVA_HOME=/path/to/temurin-25 ./gradlew :run -Dsandbox.enabled=true \
  -PinstalledPlugins="['opensearch-job-scheduler:3.7.0.0-SNAPSHOT', \
                       'arrow-flight-rpc', 'analytics-engine', 'parquet-data-format', \
                       'analytics-backend-datafusion', 'analytics-backend-lucene', \
                       'composite-engine', 'opensearch-sql-plugin:3.7.0.0-SNAPSHOT']"

# QA-side IT (no SQL plugin needed)
./gradlew :sandbox:qa:analytics-engine-rest:integTest \
  -Dsandbox.enabled=true --tests "*SpathCommandIT"

# v2 / Calcite IT (in the SQL plugin checkout, with opensearch-project#5441 applied)
./gradlew :integ-test:integTestRemote \
  -Dtests.rest.cluster=localhost:9200 \
  -Dtests.cluster=localhost:9300 \
  -Dtests.clustername=runTask \
  -Dtests.analytics.force_routing=true \
  -Dtests.analytics.parquet_indices=true \
  --tests "org.opensearch.sql.calcite.remote.CalcitePPLSpathCommandIT"
```

Signed-off-by: Kai Huang <ahkcs@amazon.com>
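
The `MapVector` flattening described in item 6 of the paired PR amounts to turning a list of `{key, value}` entry structs into an insertion-ordered map. The sketch below models that with plain Java collections and no Arrow dependency: `MapFlattenSketch`, `Utf8Bytes` (a stand-in for Arrow's `Text`), `normalize`, and `flatten` are all hypothetical, and the entry field names `key` and `value` follow the layout quoted in the description.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapFlattenSketch {

  /** Stand-in for Arrow's Text: UTF-8 bytes that decode to a String via toString(). */
  static final class Utf8Bytes {
    private final byte[] bytes;

    Utf8Bytes(String s) {
      this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String toString() {
      return new String(bytes, StandardCharsets.UTF_8);
    }
  }

  /** Text-like wrappers become Strings; everything else passes through unchanged. */
  static Object normalize(Object o) {
    return o instanceof Utf8Bytes ? o.toString() : o;
  }

  /** Reassembles [{key, value}, ...] into a LinkedHashMap, keeping entry order. */
  static Map<String, Object> flatten(List<? extends Map<String, ?>> entries) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map<String, ?> entry : entries) {
      out.put(String.valueOf(normalize(entry.get("key"))), normalize(entry.get("value")));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map<String, Utf8Bytes>> entries = List.of(
        Map.of("key", new Utf8Bytes("user.name"), "value", new Utf8Bytes("John")),
        Map.of("key", new Utf8Bytes("user.age"), "value", new Utf8Bytes("30")));
    System.out.println(flatten(entries)); // prints {user.name=John, user.age=30}
  }
}
```

How the real code treats duplicate or null keys is not stated in the description; in this sketch a later entry with the same key simply overwrites the earlier one.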

ahkcs requested review from LantaoJin, RyanL1997, Swiddis, acarbonetto, anirudha, dai-chen, joshuali925, mengweieric, noCharger, penghuo, ps48, qianheng-aws, songkant-aws, vamsimanohar, ykmr1224 and yuancu as code owners May 14, 2026 18:07

ahkcs added the enhancement New feature or request label May 14, 2026

ahkcs force-pushed the feat/spath-analytics-route branch from 1a07691 to f5ea743 Compare May 14, 2026 18:45

ahkcs force-pushed the feat/spath-analytics-route branch from f5ea743 to 9f6aef8 Compare May 14, 2026 18:57

penghuo reviewed May 14, 2026


ahkcs force-pushed the feat/spath-analytics-route branch from 9f6aef8 to c4d4a8c Compare May 14, 2026 22:11

ahkcs force-pushed the feat/spath-analytics-route branch from c4d4a8c to 5eb2dd4 Compare May 14, 2026 22:18

ahkcs changed the title ~~spath: Arrow Map response marshalling + parquet-backed test indices~~ spath: parquet-backed test indices for analytics-engine route May 14, 2026

ahkcs mentioned this pull request May 14, 2026

[analytics-engine] Wire PPL spath end-to-end through the analytics-engine route opensearch-project/OpenSearch#21664

Open

penghuo approved these changes May 14, 2026


ahkcs wants to merge 1 commit into opensearch-project:main from ahkcs:feat/spath-analytics-route
