Use leastRestrictive for mvappend element-type widening #5424
ahkcs wants to merge 3 commits into opensearch-project:feature/mustang-ppl-integration
`MVAppendFunctionImpl.updateMostGeneralType` used strict {@code Object.equals}
to compare each operand's component type against the running "most general"
type, falling back to Calcite's {@code ANY} on any mismatch. That's too
aggressive: {@code Object.equals} returns false for type pairs that differ
only in nullability tag (e.g. {@code array(1, 2)} synthesizes INTEGER NULLABLE
for its component while literal {@code 3} is INTEGER NOT NULL), and for
straightforwardly-widenable numerics like INTEGER + DOUBLE. The PPL UDF result
would then be {@code ARRAY<ANY>}.
The Calcite engine's enumerable runtime tolerates {@code ANY} because
{@code MVAppendImplementor.eval} processes elements through {@code Object} —
the declared element type is unused at execution time. The analytics-engine
route is stricter: substrait can't serialize {@code ANY}, so isthmus throws
{@code UnsupportedOperationException: Unable to convert the type ANY} during
the substrait conversion phase.
Widen with {@link RelDataTypeFactory#leastRestrictive} — the same routine
{@code SqlLibraryOperators.ARRAY} uses for its return-type inference. Falls
back to ANY only when {@code leastRestrictive} returns null (genuinely
incompatible operand types like INT + VARCHAR), preserving the original
behavior on those queries.
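
The difference between the strict comparison and the widening described above can be sketched with a toy type model. This is only an illustration under stated assumptions: `WideningSketch`, `ElemType`, `Kind`, `strictMerge` and `widenMerge` are hypothetical names, and the `leastRestrictive` method here is a simplified stand-in, not Calcite's `RelDataTypeFactory#leastRestrictive`.

```java
import java.util.Objects;

// Toy model only: these types are illustrative and are NOT Calcite's RelDataType API.
final class WideningSketch {
  enum Kind { INTEGER, DOUBLE, VARCHAR, ANY }

  static final class ElemType {
    final Kind kind;
    final boolean nullable;

    ElemType(Kind kind, boolean nullable) {
      this.kind = kind;
      this.nullable = nullable;
    }

    @Override public boolean equals(Object o) {
      return o instanceof ElemType
          && ((ElemType) o).kind == kind
          && ((ElemType) o).nullable == nullable;
    }

    @Override public int hashCode() {
      return Objects.hash(kind, nullable);
    }

    @Override public String toString() {
      return kind + (nullable ? " NULLABLE" : " NOT NULL");
    }
  }

  // Strict comparison: any difference, including the nullability flag, collapses to ANY.
  static ElemType strictMerge(ElemType current, ElemType next) {
    return current.equals(next) ? current : new ElemType(Kind.ANY, true);
  }

  static boolean isNumeric(Kind k) {
    return k == Kind.INTEGER || k == Kind.DOUBLE;
  }

  // Simplified stand-in for a least-restrictive lookup: null means "no common type".
  static ElemType leastRestrictive(ElemType a, ElemType b) {
    boolean nullable = a.nullable || b.nullable;
    if (a.kind == b.kind) {
      return new ElemType(a.kind, nullable);
    }
    if (isNumeric(a.kind) && isNumeric(b.kind)) {
      return new ElemType(Kind.DOUBLE, nullable);
    }
    return null;
  }

  // Widen first; fall back to ANY only when no common type exists.
  static ElemType widenMerge(ElemType current, ElemType next) {
    ElemType widened = leastRestrictive(current, next);
    return widened != null ? widened : new ElemType(Kind.ANY, true);
  }

  public static void main(String[] args) {
    ElemType arrayComponent = new ElemType(Kind.INTEGER, true);  // component of array(1, 2)
    ElemType literal = new ElemType(Kind.INTEGER, false);        // literal 3
    System.out.println(strictMerge(arrayComponent, literal));    // ANY NULLABLE
    System.out.println(widenMerge(arrayComponent, literal));     // INTEGER NULLABLE
    System.out.println(widenMerge(literal, new ElemType(Kind.DOUBLE, false)));   // DOUBLE NOT NULL
    System.out.println(widenMerge(literal, new ElemType(Kind.VARCHAR, false)));  // ANY NULLABLE
  }
}
```

In this model the nullability-only mismatch and the INTEGER + DOUBLE pair both keep a concrete element type, and only the INTEGER + VARCHAR pair reaches the `ANY` fallback.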
# Test plan
* {@code :core:test --tests "*MVAppend*"} — passes (no existing test asserted
on the {@code ANY} fallback).
* Companion to opensearch-project/OpenSearch#21554 — unblocks 8+ tests in
{@code CalciteMVAppendFunctionIT} force-routed through the analytics-engine
path that previously failed with "Unable to convert the type ANY".
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Calling Calcite's leastRestrictive widens mixed numerics like INT + DECIMAL to a single common numeric type (e.g. DECIMAL(11,1)). The Calcite engine then casts each operand to that type at codegen: Integer(1) becomes BigDecimal with scale 1, which renders as 0.1 (or 0 after JSON round-trip), breaking testMvappendWithIntAndDouble, which expects mvappend(1, 2.5) to return [1, 2.5].
The original goal was just to bridge the nullability-tag mismatch that synthesizes an array's component as INTEGER NULLABLE versus a bare literal's INTEGER NOT NULL. Limit the widening to that case via equalSansNullability and keep the ANY fallback for genuinely different types, preserving the Calcite engine's heterogeneous-Object[] runtime semantics that pre-existing tests rely on.
Signed-off-by: Kai Huang <huangkaics@gmail.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
The previous nullability-only bridge fixed `array(1, 2) + literal 3` but left `mvappend(1, 2.5)` falling back to ARRAY<ANY>. ARRAY<ANY> is not substrait-serializable, so any analytics-engine query through that call fails at substrait conversion.
Aggressive `leastRestrictive` widening was the obvious next step but earlier triggered a runtime corruption (Integer 1 showed up as 0 in the response) because the Avatica result-set's ArrayAccessor uses element-type-specific accessors (e.g. `DoubleAccessor.getDouble` does `(Double) value`), and an Integer cell in a declared-DOUBLE list triggered a ClassCastException that the error path masked as `[0, 2.5]`.
Fix the corruption by pre-casting each scalar operand to the call's element Java class in `MVAppendImplementor` via `EnumUtils.convert`. The result list is now homogeneously typed at codegen, so Avatica's per-element cast succeeds.
Promote DECIMAL → DOUBLE on the way through `updateMostGeneralType` because `RowResponseCodec` maps DECIMAL cells to FloatingPoint(DOUBLE) anyway; an explicit DECIMAL element type triggers Calcite's element coercion to BigDecimal, which the JSON formatter renders inconsistently across paths.
For genuinely incompatible operand pairs (INT + VARCHAR, …) `leastRestrictive` returns null and the existing `ANY` fallback stands: heterogeneous mvappend output stays on the Calcite engine path; only the analytics-engine route can't emit substrait for those.
Local verification:
- `:core:test --tests *MVAppend*`: green
- `:integ-test:integTest --tests CalciteMVAppendFunctionIT`: 15/15
- `:integ-test:integTest --tests CalciteArrayFunctionIT`: 60/60
Signed-off-by: Kai Huang <huangkaics@gmail.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
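
The cast failure and the pre-cast described in this commit can be reproduced in plain Java. The sketch below is a minimal illustration, not the PR's code: `PreCastSketch`, `getDouble`, `preCastToDouble` and `readsCleanly` are hypothetical names. `getDouble` only imitates an accessor that trusts a declared DOUBLE element type, and `preCastToDouble` is an assumed stand-in for what the `EnumUtils.convert` pre-cast achieves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PreCastSketch {
  // Imitates an element accessor that trusts the declared DOUBLE type: (Double) value.
  static double getDouble(Object cell) {
    return (Double) cell;
  }

  // Hypothetical stand-in for the pre-cast: convert numeric scalars to the declared class.
  static List<Object> preCastToDouble(List<Object> operands) {
    List<Object> out = new ArrayList<>();
    for (Object o : operands) {
      out.add(o instanceof Number ? Double.valueOf(((Number) o).doubleValue()) : o);
    }
    return out;
  }

  // True when every cell can be read through the DOUBLE accessor without a cast failure.
  static boolean readsCleanly(List<Object> cells) {
    try {
      for (Object cell : cells) {
        getDouble(cell);
      }
      return true;
    } catch (ClassCastException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Integer and Double cells, as in mvappend(1, 2.5) with a declared DOUBLE element type.
    List<Object> raw = Arrays.<Object>asList(1, 2.5);
    System.out.println(readsCleanly(raw));                   // false: Integer is not a Double
    System.out.println(readsCleanly(preCastToDouble(raw)));  // true
    System.out.println(preCastToDouble(raw));                // [1.0, 2.5]
  }
}
```

The point of the design is that the list becomes homogeneously typed before any accessor sees it, so the declared element type and the runtime class of each cell agree.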
Description
`MVAppendFunctionImpl.updateMostGeneralType` used strict `Object.equals` to compare each operand's component type against the running "most general" type, falling back to Calcite's `ANY` on any mismatch. `Object.equals` returns false for type pairs that differ only in nullability tag (e.g. `array(1, 2)` synthesizes `INTEGER NULLABLE` for its component while literal `3` is `INTEGER NOT NULL`) and for straightforwardly-widenable numerics like INT + DECIMAL.
The Calcite engine's enumerable runtime tolerates `ANY` because `MVAppendImplementor.implement` processes elements through `Object[]`, so the declared element type is unused at execution time. The analytics-engine route is stricter: substrait can't serialize `ANY`, so isthmus throws `UnsupportedOperationException: Unable to convert the type ANY` during substrait conversion.
Two changes:
* `updateMostGeneralType` widens via `RelDataTypeFactory.leastRestrictive`, the same routine `SqlLibraryOperators.ARRAY` uses for its return-type inference. For genuinely incompatible operand types (INT + VARCHAR, …) `leastRestrictive` returns null; fall back to `ANY` there to preserve the existing in-process Calcite engine `Object[]` runtime semantics that the `mvappend(1, 'text', 2.5)`-style tests rely on. Promote DECIMAL → DOUBLE on the way through: `RowResponseCodec` maps DECIMAL cells to `FloatingPoint(DOUBLE)` anyway, and an explicit DECIMAL element type triggers Calcite's element coercion to BigDecimal, which the JSON formatter renders inconsistently across paths.
* `MVAppendImplementor.implement` pre-casts each scalar operand to the call's element Java class via `EnumUtils.convert`. Without this, Avatica's `AbstractCursor.ArrayAccessor` dispatches the per-element accessor by the declared SQL type (e.g. `DoubleAccessor.getDouble` does `(Double) value`) and would throw a runtime `ClassCastException` on an `Integer` cell when the call's element type widens to DOUBLE. Array operands pass through; their element-type alignment is the planner's responsibility.
A previous revision of this PR went straight to `leastRestrictive` widening and triggered a runtime corruption (`mvappend(1, 2.5)` → `[0, 2.5]`) for exactly the Avatica-cast reason above. The pre-cast in `MVAppendImplementor` resolves that without reintroducing the regression.
Test plan
* `./gradlew :core:test --tests "*MVAppend*"` → green.
* `./gradlew :integ-test:integTest --tests "org.opensearch.sql.calcite.remote.CalciteMVAppendFunctionIT"` (Calcite engine path) → 15/15 pass.
* `./gradlew :integ-test:integTest --tests "org.opensearch.sql.calcite.remote.CalciteArrayFunctionIT"` → 60/60 pass (no regression on the sister IT).
Analytics-engine compatibility
This change is a companion to opensearch-project/OpenSearch#21554, which onboards `mvappend` to the DataFusion route. Pass-rate projection for `CalciteMVAppendFunctionIT` once the backend-lucene + parquet-fixture infrastructure is in place (currently tracked separately under helper-managed-index migration):
* `testMvappendWithMultipleElements`: `ARRAY<INT>`
* `testMvappendWithSingleElement`: `ARRAY<INT>`
* `testMvappendWithArrayFlattening`: `ARRAY<INT>`, `ARRAY<INT>`; `ARRAY<INT>`
* `testMvappendWithMixedArrayAndScalar`: `ARRAY<INT>`, INT, INT; `leastRestrictive`
* `testMvappendWithStringValues`: `ARRAY<VARCHAR>`
* `testMvappendWithRealFields`: `ARRAY<VARCHAR>`
* `testMvappendWithNestedArrays`: `ARRAY<VARCHAR>`; `ARRAY<VARCHAR>`
* `testMvappendWithNumericArrays`: `ARRAY<DOUBLE>`, DOUBLE; `ARRAY<DOUBLE>`
* `testMvappendInWhereClause`
* `testMvappendWithComplexExpression`: `ARRAY<INT>`, `ARRAY<INT>`, INT; `leastRestrictive`
* `testMvappendWithIntAndDouble`
* `testMvappendWithMixedTypes`: `ARRAY<ANY>` (substrait can't encode)
* `testMvappendWithFieldsAndLiterals`: `ARRAY<ANY>`
* `testMvappendWithEmptyArray`: `ARRAY<VARCHAR>` (#5421 default), INT, INT; `ARRAY<ANY>`. The empty-array `array()` is bound to `empty_arr` via `eval`; at MVAppend's type-inference site we see the column reference, not the literal, so no per-call detection helps.
* `testMvappendWithNull`: `ARRAY<ANY>` (`nullif(1, 1)` is INTEGER-typed, not NULL-typed)
Projected: 11/15 on the analytics-engine route after this fix, up from 10/15. The deeper investigation behind this projection, covering why each remaining failure isn't fixable without architectural changes (Arrow has no Union-array stdlib, the analytics-engine planner's filter rule doesn't track per-leaf-call types, and `array()`'s element-type default is constrained by substrait-serializability), is summarized in the corresponding section of opensearch-project/OpenSearch#21554.
The 4 remaining failures are an architectural limitation, not a regression introduced here:
* `ANY` is a JVM `Object[]` pass-through with no substrait/Arrow/DataFusion equivalent.
* […] `datafusion-functions-array` doesn't operate on them.
* […] `Integer 1`, not `"1"`).