Reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries#5385
Merged
mengweieric merged 2 commits intoopensearch-project:feature/vector-search-p0from Apr 28, 2026
Conversation
Extend the subquery-boundary walker to also reject outer ORDER BY, OFFSET, GROUP BY, aggregation, and DISTINCT when separated from the scan by a subquery Project. Previously only outer WHERE was rejected, so these operators would be applied in memory after k-NN had already returned the top-k by vector distance and could silently yield zero rows or mis-ordered results. Outer LIMIT without OFFSET is still allowed and locked in by a positive test. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…score IT, reword pushdown phrasing Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
c06c472
into
opensearch-project:feature/vector-search-p0
36 checks passed
mengweieric
added a commit
to mengweieric/sql
that referenced
this pull request
Apr 29, 2026
- Move translation-failure in-memory fallback note from explicit `post` to the default (omitted) section; under explicit `post` the query errors instead. - Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY, aggregation, and DISTINCT over a vectorSearch() subquery (matches the expanded rejection landed via opensearch-project#5385); note that plain outer LIMIT without OFFSET is allowed. - Add engine/method caveat for default `efficient` filtering and soften "pre-filtering during ANN search" phrasing to "native efficient k-NN filtering". - Clarify that full-text predicates under WHERE act as filters, not as hybrid relevance scorers. - Rename Example 5 to "Post-filtering for predicates not supported by efficient mode". - Tighten explicit `efficient` wording to emphasize it fails closed. - Reword radial examples and supported option keys to say "matches / returns up to the specified LIMIT documents" instead of "returns all". - Add alias fan-out note under the `table` argument. - Sweep remaining em dashes in the file to plain text. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
mengweieric
added a commit
that referenced
this pull request
Apr 29, 2026
- Move translation-failure in-memory fallback note from explicit `post` to the default (omitted) section; under explicit `post` the query errors instead. - Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY, aggregation, and DISTINCT over a vectorSearch() subquery (matches the expanded rejection landed via #5385); note that plain outer LIMIT without OFFSET is allowed. - Add engine/method caveat for default `efficient` filtering and soften "pre-filtering during ANN search" phrasing to "native efficient k-NN filtering". - Clarify that full-text predicates under WHERE act as filters, not as hybrid relevance scorers. - Rename Example 5 to "Post-filtering for predicates not supported by efficient mode". - Tighten explicit `efficient` wording to emphasize it fails closed. - Reword radial examples and supported option keys to say "matches / returns up to the specified LIMIT documents" instead of "returns all". - Add alias fan-out note under the `table` argument. - Sweep remaining em dashes in the file to plain text. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Swiddis
pushed a commit
that referenced
this pull request
Apr 29, 2026
* [Feature] Add table function relation syntax to SQL grammar (#5318) * [Feature] Add table function relation to SQL grammar for vectorSearch() Add table function relation support to the SQL parser: - New `tableFunctionRelation` alternative in `relation` grammar rule - Named argument syntax: `key=value` (e.g., table='index', field='vec') - Alias is required by grammar (FROM func(...) AS alias) - AstBuilder emits existing TableFunction + SubqueryAlias AST nodes - 3 parser unit tests: basic parse, with WHERE/ORDER BY/LIMIT, alias required This is a pure grammar change — no execution support yet. Queries will parse successfully but fail at the Analyzer with "unsupported function". Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review feedback on table function relation grammar 1. Canonicalize argument names at parser boundary: unquoteIdentifier + toLowerCase(Locale.ROOT) in visitTableFunctionRelation so FIELD='x' and `field`='x' both produce argName="field" 2. Make AS keyword optional (AS? alias) for consistency with tableAsRelation and subqueryAsRelation grammar rules 3. Strengthen test coverage: - Full structural AST assertion for WHERE + ORDER BY + LIMIT (verifies Sort, Limit, Filter nodes, not just toString) - Argument reorder test proves names resolve by name not position - Case canonicalization test (TABLE= → table=) - Alias-without-AS test (FROM func(...) v) Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Apply spotless formatting Signed-off-by: Eric Wei <mengwei.eric@gmail.com> --------- Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add knn_vector as recognized MappingType in OpenSearchDataType Maps knn_vector fields to ExprCoreType.ARRAY so they appear in DESCRIBE output and can be referenced in projections. This is a visibility shim — not a full vector type. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Widen OpenSearchIndexScanBuilder constructor to public VectorSearchIndex.createScanBuilder() needs to construct an OpenSearchIndexScanBuilder with a custom VectorSearchQueryBuilder delegate. The existing constructor was protected (test-only). Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add vector search table function, query builder, and index Introduces the core execution pipeline for vectorsearch(): - VectorSearchTableFunctionResolver: registers vectorsearch with 4 STRING args - VectorSearchTableFunctionImplementation: parses named args, vector literal, options string, validates search mode (k/max_distance/min_score) - VectorSearchIndex: extends OpenSearchIndex with knn query seeding, score tracking, and WrapperQueryBuilder DSL construction - VectorSearchQueryBuilder: keeps knn in must (scoring) context, WHERE filters in filter (non-scoring) context Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Register VectorSearchTableFunctionResolver in OpenSearchStorageEngine Override getFunctions() to expose vectorsearch() table function to the query analysis pipeline. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add VectorSearchQueryBuilder unit tests Verifies knn query is placed in scoring (must) context, not wrapped in bool.filter when no WHERE clause is present. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review feedback: add validation guards and pushDownFilter test - Add pushDownFilter() unit test asserting knn stays in bool.must (scoring) and WHERE predicate goes to bool.filter (non-scoring) - Add option key allowlist (k, max_distance, min_score) to reject unknown/unsupported keys before they reach DSL generation - Add field name validation to reject characters that could corrupt the WrapperQueryBuilder JSON (allows alphanumeric, dots, underscores, hyphens) - Add named-arg type guard to reject non-NamedArgumentExpression args early with a clear error message Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Canonicalize option values as numeric types before DSL generation Parse k as integer, max_distance and min_score as double before they reach buildKnnQuery(). Rejects non-numeric and non-finite values with clear errors. This closes the residual JSON-injection path through option values without requiring full XContent migration. Also fixes toString() to be consistent with the named-arg guard (no longer blindly casts to NamedArgumentExpression). Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Harden input validation and add size=k default for top-k mode - parseOptions: reject malformed segments and duplicate keys - parseVector: wrap errors in ExpressionEvaluationException, reject non-finite floats (Infinity, NaN) - VectorSearchIndex: default requestedTotalSize to k via pushDownLimitToRequestTotal so queries without LIMIT return k results - Add 5 new tests: malformed option, duplicate key, empty vector, malformed vector component, non-finite vector component Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add null-arg-name guard and make storage engine test less brittle - validateNamedArgs() now rejects null/empty arg names defensively, closing a potential NPE if the shared table-function path is later wired into PPL - OpenSearchStorageEngineTest uses contains-check instead of exact collection size assertion - Add testNullArgNameThrows test Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Clean up dead code, fix misleading comment and test name - Remove unused VECTOR_OPTION constant from VectorSearchIndex - Clarify buildKnnQuery() comment: quoted fallback is for forward compatibility, all P0 values are already canonicalized as numeric - Rename testMissingSearchModeOptionThrows to testUnknownOptionKeyOnlyThrows to match what it actually tests Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add test for missing required option validation path Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add mutual exclusivity and k range validation - Enforce exactly one of k, max_distance, or min_score - Validate k is in [1, 10000] range - Add 6 tests: mutual exclusivity (3 combos), k too small, k too large, k boundary values (1 and 10000) Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add LIMIT > k rejection in top-k vector search mode VectorSearchQueryBuilder now accepts options map and rejects pushDownLimit when LIMIT exceeds k. Radial modes (max_distance, min_score) have no LIMIT restriction. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add comprehensive test coverage for vector search hardening - Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson() for top-k, max_distance, min_score, nested fields, multi-element and single-element vectors, numeric option rendering - Add edge case tests to VectorSearchTableFunctionImplementationTest: NaN vector component, empty option key/value, negative k, NaN for max_distance and min_score (6 new tests) - Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT, pushDownSort delegation to parent (2 new tests) - Extract buildKnnQueryJson() as package-private for direct testing Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add resolver argument count edge case tests Test too-many (5) and zero arguments paths in VectorSearchTableFunctionResolver to complement existing too-few (2) test. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add radial size policy, sort restriction, and integration tests - Cap radial mode (max_distance/min_score) results at maxResultWindow to prevent unbounded result sets - Reject ORDER BY on non-_score fields and _score ASC in vectorSearch since knn results are naturally sorted by _score DESC - Add 12 integration tests: 4 _explain DSL shape verification tests and 8 validation error path tests Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Fill test coverage gaps for vector search hardening - Add multi-sort expression test: ORDER BY _score DESC, name ASC correctly rejects the non-_score field (VectorSearchQueryBuilderTest) - Add case-insensitive argument name lookup test to verify TABLE='x' resolves same as table='x' (Implementation test) - Add non-numeric option fallback test: verifies string options are quoted in JSON output (VectorSearchIndexTest) - Add 4 integration tests: ORDER BY _score DESC succeeds, ORDER BY non-score rejects, ORDER BY _score ASC rejects, LIMIT within k succeeds (VectorSearchIT, now 16 tests) Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Preserve sort.getCount() limit pushdown contract in pushDownSort The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes sort.getCount() as a limit when non-zero. Our override validated _score DESC and returned true, but did not preserve this contract. SQL always sets count=0, so this was not reachable today, but PPL or future callers may set a non-zero count to combine sort+limit in one LogicalSort node. Preserve the behavior defensively. Add focused test: LogicalSort(count=7) with _score DESC verifies the count is pushed down as request size. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add compound predicate and radial+WHERE test coverage - Unit test: compound AND predicate survives pushdown into bool.filter - Integration test: compound WHERE (term + range) produces bool query - Integration test: radial max_distance with WHERE produces bool query Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Route pushDownSort count through LIMIT > k validation pushDownSort() called requestBuilder.pushDownLimit() directly, bypassing the LIMIT > k guard in pushDownLimit(). Extract validateLimitWithinK() helper and call it from both paths so the invariant holds when PPL or future callers set a non-zero sort count. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Split explain tests into dedicated VectorSearchExplainIT Move all explainQuery()-based DSL shape tests into a dedicated VectorSearchExplainIT suite. VectorSearchIT now contains only validation and error-path tests. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add FilterType enum for post|efficient filter placement Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add filter_type to allowed option keys with post|efficient validation Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Strip filter_type from options and pass as typed FilterType to VectorSearchIndex Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Collapse buildKnnQueryJson to accept optional filter clause Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Implement efficient filter pushdown branching and build-time validation in VectorSearchQueryBuilder Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Wire FilterType and rebuild callback through createScanBuilder Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add build-time validation and regression tests for LIMIT/sort under efficient mode Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add integration tests for filter_type=post|efficient and spotless formatting Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Reject radial vector search without LIMIT Radial search (max_distance or min_score) can return unbounded results. Add build-time validation that rejects radial queries without an explicit LIMIT clause, with a clear error message guiding the user. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Fix limitPushed not set when limit comes through pushDownSort count path pushDownSort with a non-zero sort.getCount() pushes a limit to requestBuilder directly, bypassing pushDownLimit() and leaving limitPushed=false. This causes build() to incorrectly reject radial vector search when the limit arrives via the sort-with-count path (e.g. PPL sort command). Set limitPushed=true in the sort.getCount() block alongside the existing requestBuilder.pushDownLimit() call. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Handle non-pushdownable WHERE with explicit filter_type pushDownFilter() did not catch ScriptQueryUnSupportedException, so non-pushdownable filters (e.g. struct-type fields) would propagate a raw internal exception instead of a clean SQL-layer error. With explicit filter_type: throw a clear error explaining the WHERE clause cannot be pushed down for the requested filter placement. Without explicit filter_type: return false to fall back to in-memory filtering, matching the base class behavior. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add defensive null guard for EFFICIENT mode rebuild callback Reject construction of VectorSearchQueryBuilder with FilterType.EFFICIENT and a null rebuildKnnWithFilter callback at construction time instead of deferring to an NPE in pushDownFilter. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review feedback: reword user-facing error, strengthen explain test, rename constructor comment - Reword filter_type error message to be user-friendly and actionable (no longer leaks internal ScriptQueryUnSupportedException text) - Strengthen efficient-mode explain IT: assert no bool/must (proves not post-filter shape), decode base64 knn payload to verify filter and predicate field are embedded inside knn query - Rename "Backward-compatible constructor" to clarify intent Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Rename remaining backward-compatible constructor comment for consistency Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Reject GROUP BY / aggregations on vectorSearch() relations Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review: reframe as SQL-preview constraint; add GROUP BY and bare-aggregate ITs; drop subquery workaround text Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Strengthen VectorSearchExplainIT with structural DSL assertions The explain suite previously relied on substring checks (contains("wrapper"), contains("bool"), contains("filter")) that could false-positive across distinct push-down shapes — notably, post-filter vs. efficient-mode filter_type output look similar at the substring level. Add shared helpers to (1) decode every base64 WrapperQueryBuilder payload into its inner k-NN JSON and (2) extract and unescape the outer sourceBuilder JSON from explain output. Strengthen the top-k, radial (max_distance, min_score), post-filter, compound-predicate, radial-with-WHERE, and filter_type=post tests to assert both the outer bool/must/filter shape and that WHERE predicates stay out of the inner k-NN payload. Refactor the efficient-mode test onto the same helpers so its no-outer-bool/must guard is verified against the unescaped sourceBuilder JSON rather than accidentally passing on escaped output. Leave the ORDER BY _score DESC and LIMIT <= k smoke tests untouched — those are guard-rail "does planning succeed" checks where shape is covered elsewhere. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review nits: outer-bool guard, helper comment, efficient+ORDER BY - Add assertFalse(contains("\"bool\"")) on the outer sourceBuilder JSON for the plain top-k and radial (max_distance, min_score) cases so a regression that wrapped the knn in an outer bool while preserving the inner payload is caught. - Document the SOURCE_BUILDER_JSON helper's test-only coupling to the surrounding "sourceBuilder=...", "pitId=" tokens in the request-string toString() output — if that format changes, update the regex anchors. - Strengthen testEfficientFilterWithOrderByScoreDescSucceeds onto the same helpers as testExplainFilterTypeEfficientProducesKnnWithFilter: verify no outer bool/must in the sourceBuilder JSON and that the WHERE predicate is embedded inside the decoded knn payload. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Fail fast when k-NN plugin is missing vectorSearch() currently lets queries proceed even when the k-NN plugin is not installed; the user sees a cryptic native OpenSearch error deep in execution. This change adds a lazy capability probe that runs on the first vectorSearch() invocation per JVM and throws a clear message when the plugin is absent. - KnnPluginCapability: probes Nodes Info once via NodeClient, caches the result. Successful results (positive or negative) are cached; probe failures (IO, timeout) are not, so subsequent calls retry. If no NodeClient is available (REST-client / standalone mode), the check is skipped and execution-time errors remain the signal. - The probe uses the canonical k-NN plugin class name rather than the artifact name so it survives repackaging variants. - Wired into VectorSearchTableFunctionImplementation.applyArguments() so the failure surfaces before any DSL is built. Resolver instantiates one shared capability per registration; the constructor remains backward-compatible (two-arg form delegates to the three-arg form). - Unit tests cover: node-client absent (skip), plugin present (pass), plugin absent (throw), positive/negative caching, and probe-failure non-caching. Plugin registration stays unconditional — failing at registration would break introspection (SHOW FUNCTIONS etc.) on clusters without k-NN. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Defer k-NN plugin probe to scan open() so _explain keeps working The original placement ran knnCapability.requireInstalled() at the very top of applyArguments(). That hook runs during analysis for both _query AND _explain, which regressed two existing contracts on clusters without the k-NN plugin: - VectorSearchExplainIT explicitly states "_explain only parses and plans the query and does NOT require the k-NN plugin". The check would have made every explain test fail with "k-NN plugin missing". - VectorSearchIT is the local-validation/error-path suite. Malformed queries on a pluginless cluster would have collapsed to the k-NN error instead of the existing SQL validation errors. Move the probe to VectorSearchIndexScan.open() — a new thin subclass of OpenSearchIndexScan whose only override calls requireInstalled() before super.open(). Scan open() runs only on the execute path; _explain walks the built plan without opening it, so explain stays functional. Also reorder applyArguments() — nothing to relocate now that the capability check has moved, but the comment documents the intent: local validation runs first so malformed-query errors are stable regardless of cluster state or probe outcome. Add two ITs on the default (pluginless) integ-test cluster: - testExecutionWithoutKnnPluginReturnsCapabilityError — verifies the clear "k-NN plugin not installed" error surfaces on execute. - testExplainWithoutKnnPluginStillWorks — guards the _explain contract so future probe relocations can't silently break it. Also drops the "SQL preview" framing from VectorSearchIndexScanBuilder's aggregation-rejection error per prior user instruction. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [BugFix] Reject positional args and tighten table-name validation in vectorSearch() Close the silent-no-op path where `vectorSearch('idx', field='f', ...)` parsed through the legacy engine and returned HTTP 200 with zero rows. The V2 SQL grammar now accepts `tableFunctionArg : ident EQUAL_SYMBOL functionArg | functionArg`, so the positional shape reaches AstBuilder, which throws a SemanticCheckException (not SyntaxCheckException) to prevent legacy fallback. Also: - Route arity mismatch through ExpressionEvaluationException for consistency with the other user-facing resolver errors. - Reject `_all`, `.`, and `..` as table names explicitly so the preview's single-index contract is not bypassed by wildcard or pathologic names. - Duplicate-name check in the Implementation layer uses Locale.ROOT when lower-casing so `TABLE` and `table` are always treated as the same key. - Update the resolver comment to describe exactly which checks the Implementation repeats as defense-in-depth (non-named and duplicate-name, not the unknown-name allowlist). - Add IT coverage for the real positional+named shape, case-insensitive duplicates, missing required args, and `_all` / `.` / `..` table names. - Add unit tests mirroring the new table-name rejections. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Drop PR-reference from Argument-shape section header Remove the "(PR B)" qualifier from the section header per the reviewer request. Code should not reference other PRs in the stack — the section header describes the tests' scope. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [BugFix] Lock in BETWEEN / NOT IN pushdown shapes for vectorSearch Round 2 manual testing confirmed that WHERE clauses with BETWEEN and NOT IN predicates on a vectorSearch() relation push down as native OpenSearch DSL (range / bool.must_not) rather than falling back to a serialized script query. These two shapes were previously unwitnessed in VectorSearchExplainIT. Add explain-plan integration tests that pin the current behaviour so any regression that causes BETWEEN or NOT IN to degrade to a script query, or to leak the predicate into the knn payload under POST filter mode, fails loudly in CI instead of silently regressing query performance. - testBetweenPushesAsRange: asserts BETWEEN desugars to native range DSL with from=50 / to=200 bounds pushed outside the knn payload. - testNotInPushesAsMustNotTerms: asserts NOT IN desugars to bool.must_not wrapping bool.should[term, term] on the keyword subfield, also outside the knn payload. Pure test addition; no production code changes. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Rewrite IS NOT NULL / IS NULL to native exists DSL in v2 filter pushdown The v2 FilterQueryBuilder's luceneQueries map did not register entries for BuiltinFunctionName.IS_NULL or IS_NOT_NULL, so visitFunction fell through to buildScriptQuery for these unary predicates. This produced a compounded script query in the pushed-down DSL, e.g.: {"script": {"script": {"source": "{\"langType\":\"v2\",\"script\":\"is not null(age)\"}", ...}}} instead of the native OpenSearch exists query. Script fallback is more expensive to evaluate, is not supported on AOSS / serverless, and diverges from what the Calcite path and downstream tooling already produce. This change adds a new ExistsQuery lucene query class that overrides the LuceneQuery.canSupport(FunctionExpression) / build(FunctionExpression) pair to accept a single ReferenceExpression argument (the standard base class contract is binary {ref, literal}). Two instances are registered in FilterQueryBuilder, one for each predicate sense, producing: IS NOT NULL col -> {"exists": {"field": "col"}} IS NULL col -> {"bool": {"must_not": [{"exists": {"field": "col"}}]}} matching the Calcite PredicateAnalyzer output for these predicates. Nested-field predicates are explicitly guarded: ExistsQuery.canSupport returns false when the argument is a nested() function, mirroring the equivalent guard in PredicateAnalyzer ("OpenSearch DSL does not handle IS_NULL / IS_NOT_NULL on nested fields correctly"). When canSupport returns false the existing default branch in FilterQueryBuilder.visitFunction falls through to the script query path, preserving the prior behavior for that edge case. This closes the tracker row "Pushdown coverage - IS NOT NULL pushes script instead of exists". The change affects all v2 SQL queries, not just vectorSearch table function queries. Test coverage: - FilterQueryBuilderTest: updated should_build_script_query_for_unsupported_lucene_query to two new assertions - should_build_exists_query_for_is_not_null and should_build_must_not_exists_query_for_is_null - pinning the new DSL shape. - ExistsPushdownIT (new): explain-plan integration tests asserting that WHERE col IS NOT NULL pushes as native exists DSL (no "script" key) and WHERE col IS NULL pushes as bool/must_not[exists]. - AggregationIT (existing) continues to pass, confirming result-level correctness is preserved for queries like "where int0 IS NOT NULL". Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Force script fallback for nested IS NULL / IS NOT NULL ExistsQuery.canSupport() already returns false for nested references, but the default branch of FilterQueryBuilder.visitFunction() has a second dispatch: when canSupport() is false, it checks isNestedPredicate() and, if true, routes to NestedQuery.buildNested(func, query). That helper reads func.getArguments().get(1) on the assumption of a binary predicate shape, so unary IS NULL / IS NOT NULL would throw at runtime. Override isNestedPredicate() to return false in ExistsQuery so the dispatch falls through to the compiled-script path for nested-field predicates. Add unit tests covering both IS NULL and IS NOT NULL to lock in the fallback. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Clearer error when vectorSearch() is used without a table alias Running `SELECT ... FROM vectorSearch(...)` without an AS <alias> previously produced an opaque parser error from the legacy SQL engine fallback (`ERROR. token : TABLE, pos : 32`). Make the alias optional in the grammar rule for `tableFunctionRelation` and reject the missing-alias case in AstBuilder with a SemanticCheckException that tells the user to add an alias, for example `FROM vectorSearch(...) AS v`. SemanticCheckException (not SyntaxCheckException) is used on purpose so the request does not fall back to the legacy engine, whose opaque error would otherwise mask this message. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Reject outer WHERE on vectorSearch() subqueries When vectorSearch() is wrapped in a subquery and the outer query has its own WHERE clause, for example: SELECT * FROM (SELECT v.firstname, v.state FROM vectorSearch(table='idx', field='emb', vector='[1,2]', option='k=5') AS v) t WHERE t.state = 'TX' the outer predicate never reaches the push-down contract. The plan shape after analysis is LogicalFilter over LogicalProject over the scan builder, so the PUSH_DOWN_FILTER rule (which matches filter directly above scanBuilder) does not fire. The filter is then applied in memory after k-NN has already returned top-k documents ranked by vector distance, which can silently yield zero rows with no explanation. This commit adds a post-optimization validation hook on TableScanBuilder (default no-op) and invokes it from Planner.plan() once all push-down rules have settled. VectorSearchIndexScanBuilder overrides the hook to walk the fully optimized plan and reject the outer-WHERE-over-subquery shape: a LogicalFilter whose descendant chain reaches the scan builder through one or more LogicalProject nodes. A filter directly above the scan builder (WHERE on vectorSearch() itself) is explicitly allowed because that path has already been exercised by push-down. The error message names the shape and tells the user to move the predicate inside the subquery so it is applied during the k-NN search. Tests: - VectorSearchIndexScanBuilderTest adds six cases covering the bad shape, nested subqueries, and the three positive controls (filter directly on scanBuilder, inner filter wrapped in outer project, no filter at all) plus a bare-scanBuilder defensive baseline. - VectorSearchSubqueryIT adds integration tests for the exact tracker query shape, the with-LIMIT variant, the _explain path, and positive controls for inner WHERE and no-WHERE subqueries. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Preserve subquery Project marker across inner Filter in walker The previous walker reset sawProjectSinceFilter to false when entering a new LogicalFilter, which missed the shape Filter(outer) -> Project(subquery) -> Filter(inner) -> scanBuilder. That shape is produced by outer WHERE + inner WHERE pairs such as SELECT * FROM ( SELECT v.* FROM vectorSearch(...) WHERE v.age > 10 ) t WHERE t.state = 'TX' and the outer predicate is still separated from the k-NN search by the subquery Project boundary regardless of the inner filter. Keep the project marker across nested LogicalFilter nodes so the walker still rejects outer WHERE over a subquery when both outer and inner WHERE are present. Add a unit test for filter(project(filter(scanBuilder))) and an IT for the outer+inner SQL shape. Also soften the remediation hint: "so it can participate in the vectorSearch WHERE pushdown contract" instead of "applied during the k-NN search", which more accurately describes why the inner WHERE is accepted. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Apply spotless formatting to walker Javadoc Google Java Format wraps the list-item continuation line at the 100-char mark; the previous commit landed a non-conforming break. Apply :opensearch:spotlessApply to keep the formatter happy. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [BugFix] Reject vectorSearch on index with user _score field vectorSearch() exposes a synthetic v._score column backed by the document metadata _score. If a user index also defines a stored field named _score, the response tuple builder collides the two values into the same key and fails with an opaque duplicate-key error from ImmutableMap.Builder. Probe result on OpenSearch 3.6: - _id: mapping creation is rejected by OpenSearch at index time, so no collision is possible. A lock-in IT guards that assumption. - _score: mapping creation is accepted. Without a guard, vectorSearch() against such an index fails opaquely at response time. Fix: reject the mapping in VectorSearchIndex.createScanBuilder() with a clear, user-facing SQL error before any request is sent. The guard runs during planning, so _explain also surfaces the problem without needing the k-NN plugin. The check lives in VectorSearchIndex to keep the AstBuilder and TableFunction resolver paths untouched. Tests: - VectorSearchIndexTest.createScanBuilderRejectsIndexWithScoreField covers the guard via a mocked mapping. - VectorSearchIT.testUserMappingWithIdFieldIsRejectedByOpenSearch locks in OpenSearch's _id-rejection behavior. - VectorSearchIT.testVectorSearchAgainstIndexWithScoreFieldRejects exercises the guard end-to-end via _explain (no k-NN plugin needed). Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review nits on synthetic _score collision guard - Use OpenSearchIndex.METADATA_FIELD_SCORE constant in place of the bare "_score" literal for the mapping-field check. - Drop "stored field" wording in the error message and code comment; the mapping does not use "store": true, so "user field" is what we actually mean. - Scope the block comment to the _id / _score cases this PR actually locks in; drop the broader "(and most other metadata names)" claim. - Shorten comments on the guard and both ITs to minimal WHY-only form. - Assert HTTP 400 in testVectorSearchAgainstIndexWithScoreFieldRejects so the IT surfaces an accidental status-code regression explicitly. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Reject OFFSET, WHERE on _score, and script subtrees under filter_type=efficient Three bug fixes in VectorSearchQueryBuilder that close silent-data-loss paths on vectorSearch() relations: 1. OFFSET on vectorSearch(): pushDownLimit now rejects any LogicalLimit with a non-zero offset. The parent path would push `from: <offset>` into the OpenSearch request, silently shifting the top-k window and dropping the highest-scoring results. LogicalSort carries only a count (no offset), so no change is needed on the sort path. 2. WHERE on _score: pushDownFilter walks the condition with an ExpressionNodeVisitor before compiling to DSL and rejects any ReferenceExpression whose attr matches "_score" case-insensitively. Previously the synthetic _score column compiled to a range query on a non-existent stored field, which returned zero rows at the cluster with no error. Users who want a score floor should use option='min_score=...'. 3. Script subtrees under filter_type=efficient: after building the WHERE QueryBuilder, the EFFICIENT branch recursively scans for any ScriptQueryBuilder (traversing BoolQueryBuilder's must/filter/should/ mustNot lists and ConstantScoreQueryBuilder wrappers). If found, refuse to embed it under knn.filter. AOSS and serverless vector collections reject script queries inside knn.filter; without this guard the user saw an opaque cluster-side failure instead of a clear plan-time error. Unit tests cover each rejection (including the compound predicate path for _score, the uppercase _SCORE bypass attempt, and the POST-mode negative case for script predicates). Integration tests assert 400 status and the user-facing message from the SQL endpoint, including the combined sort + limit + offset shape that exercises the pushDownSort() path. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Replace script-query blacklist with fail-closed allow-list for efficient filter The prior containsScriptQuery walker missed any wrapper type it didn't know about, which risked silently embedding unsupported queries (e.g., NestedQueryBuilder) under knn.filter. Switch to an allow-list validator over FilterQueryBuilder's actual leaf output (term/range/wildcard/match*/query_string/simple_query_string/match_bool_prefix/exists), recurse bool and constant_score, reject script and nested with targeted messages, and fail closed on unknown shapes. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [BugFix] Tighten vectorSearch() option and vector parsing error messages Polishes validation behavior for vectorSearch() so malformed options and vector literals surface a clear, actionable error instead of a generic one: - Reject empty option values and mid-segment empty values with a "expected key=value" message that names the offending segment. - Trim whitespace around option keys and values so natural spacing (e.g., "k = 5") works. - Reject unknown option keys with a message listing the supported set. - Detect the common non-comma separators (';', ':', '|') in vector literals and surface a dedicated message pointing to the correct comma-separated form, rather than a generic "Invalid vector component". - Reject negative max_distance and min_score values with a clear non-negative requirement, while continuing to accept zero. Adds unit coverage in VectorSearchTableFunctionImplementationTest and integration coverage in VectorSearchIT for the new error paths. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review feedback: reject empty vector / option segments, stabilize key order - parseVector now uses split(",", -1) and rejects empty components so malformed literals like "[1.0,]" or "[1.0,,2.0]" surface an explicit error instead of silently shrinking the vector to a lower dimension. - parseOptions now rejects empty option segments with an explicit "Malformed option segment" error; a wholly empty option string continues to flow to the "Missing required option" gate for the existing, clearer message. - ALLOWED_OPTION_KEYS is now a List so the unknown-key error renders the supported keys in a stable, user-friendly order (k, max_distance, min_score, filter_type) rather than Set iteration order. Adds unit coverage for trailing, leading, and consecutive empty segments in both vector and option parsing, and for stable key order in the unknown-option-key error message. Adds an IT for the trailing comma vector case. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Apply second-pass polish: return parsed numeric options, tighten key-order test - parseIntOption / parseDoubleOption now return the parsed value so validateOptions does not re-parse the canonicalized string solely to check numeric bounds. - The unknown-option-key stable-order test now matches the rendered list literal "[k, max_distance, min_score, filter_type]" rather than using indexOf("k"), which would match the "k" in "Unknown option key" and reduce the assertion to a tautology. - Adds a direct unit test pinning parseOptions("") -> empty map, so the wholly-empty option string contract (which flows to the "Missing required option" gate) is exercised explicitly. No functional change for user-visible paths. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [BugFix] Surface dedicated error for wildcard or multi-target table in vectorSearch() vectorSearch() requires a single concrete index because top-k semantics, dimension checks, and the embedded filter JSON are not defined across heterogeneous shards. Values like 'sql_vector_*' or 'idx_a,idx_b' were already rejected by the generic SAFE_FIELD_NAME character-class regex, but surfaced the fallback message "must contain only alphanumeric characters, dots, underscores, or hyphens", which does not tell the user what the actual constraint is. This change adds a dedicated check (before the regex) that flags '*' and ',' in the table name and surfaces: "Invalid table name '...': vectorSearch() requires a single concrete index or alias; wildcards ('*') and multi-target patterns (comma-separated) are not supported" No DSL or execution-path change. Accepted inputs are unchanged (simple names, dotted names, hyphens, underscores all still pass). Adds four unit tests (bare '*', trailing '*', mid-name '*', multi-target ',') and two integration tests pinning the new message on real queries. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Trim duplicated inline comment on wildcard-table check The inline block repeated the method-level Javadoc rationale for why vectorSearch() targets a single concrete index. Collapse to a one-line reference so readers do not see the same reasoning twice. No functional change. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Extend validateTableName Javadoc to cover wildcard/multi-target rejection The Javadoc previously described only the regex allow-list and the _all / . / .. rejections; update it to also mention the dedicated wildcard ('*') and comma-separated multi-target rejection branch introduced in this PR, so the doc matches the method body. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add happy-path execution tests for vectorSearch() VectorSearchIT covered only rejection paths; this adds positive end-to-end coverage for top-k, POST-filter, EFFICIENT-filter, and both radial modes (max_distance / min_score). The k-NN plugin is not provisioned by the default integ-test cluster, so each test guards init() with Assume.assumeTrue(isKnnPluginInstalled()) — tests skip cleanly when k-NN is absent and run when it is (e.g. locally via scripts/setup-knn-local.sh). Provisioning k-NN in CI is a separate follow-up tracked outside this PR. Test data is a 6-doc 2D knn_vector index with two well-separated clusters so filter correctness is assertable by document id. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Tighten VectorSearchExecutionIT per reviewer feedback Strengthen the happy-path IT so the efficient-filter test actually discriminates EFFICIENT from POST and so the other tests assert exact id sets instead of "at least one row, all in expected range". - Pin the test index to Lucene HNSW + L2 so efficient filtering is deterministic (k-NN only supports efficient filtering on lucene+hnsw and faiss+hnsw/ivf) and the L2 -> 1/(1+d) scoring used by min_score is well-defined. - Rework the efficient-filter test to query near TX with k=3 and WHERE state='CA'. A POST-filter implementation would return 0 rows here (the 3 nearest candidates are all TX and get filtered out), so the assertion on exactly {4,5,6} is discriminative between modes. - Tighten POST and radial tests to exact id-set assertions. - Drop the scripts/setup-knn-local.sh reference in the class javadoc (the helper is local-only working-tree tooling, not a tracked script). Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Fix efficient-filter test to satisfy LIMIT <= k contract The efficient-filter test used LIMIT 5 with option='k=3', which the branch rejects at push-down time via VectorSearchQueryBuilder#validateLimitWithinK. Change LIMIT to 3 so the query passes validation and actually exercises the k-NN execution path. The discriminator is preserved: query near TX with WHERE state='CA' and k=3 would return zero rows under POST filtering (the three nearest candidates are all TX and get filtered out) and returns exactly {4,5,6} under efficient filtering, which navigates HNSW toward the CA cluster. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add user doc page for vectorSearch() table function Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review: soften aggregation language; tighten filter_type section; drop subquery workaround; note case-sensitive option keys Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Correct vector-search doc to match current behavior Address review feedback on PR #5363: - filter_type=omitted section: describe the actual path (native DSL, then ScriptQueryBuilder fallback executed server-side, and only then in-memory evaluation when predicate translation itself fails), not "falls back to in-memory" as the general case. - Limitations: GROUP BY / aggregations are actively rejected by VectorSearchIndexScanBuilder.pushDownAggregation, not merely "not validated". Add outer-WHERE-on-subquery rejection, which is also enforced by the same scan builder's validatePlan walker. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Drop preview framing from Limitations section vectorSearch() is not being shipped as an experimental/preview feature, so reword the Limitations header to avoid "preview contract" framing. Addresses dai-chen's review question on #5363. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Tighten Limitations wording after dropping preview framing "Are not validated" was accurate only under the prior "preview contract" framing. Reword the section header and the JOIN / UNION / multi-call bullets to "not covered by tests", which is what we actually mean. GROUP BY and outer-WHERE-on-subquery remain explicitly rejected. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Align doc with current branch behavior after hardening PRs Update the doc to reflect restrictions that landed in later PRs on feature/vector-search-p0 and are now enforced in code: - OFFSET rejection on vectorSearch() (VectorSearchQueryBuilder) - _score rejected in WHERE; point users to option='min_score=...' - filter_type=efficient allowlist: native leaf queries + bool/ constant_score; scripts, nested predicates, and unknown shapes are rejected - table= accepts single concrete index or alias; wildcards, comma multi-targets, _all, ., and .. are rejected - k-NN plugin missing produces a dedicated error; _explain still works - Native k-NN tuning options are rejected as unknown option keys - Non-negative max_distance / min_score - Vector literal component validation (separator + empty component) - Alias required on table functions - Positional / mixed-positional arguments are rejected - Synthetic _score collision with a user-mapped _score field Also reword the Limitations header to "unsupported or not guaranteed" now that the preview framing is gone. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Align Limitations wording with house style docs/user/limitations/limitations.rst and docs/user/ppl/limitations/ limitations.md both use plain "is not supported" / "are not supported" phrasing. "Behavior not guaranteed" was not used anywhere else in the user docs, so replace it with the idiomatic "not supported" wording. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Polish doc wording to match existing user-doc house style Replace non-idiomatic phrasing (rejected, descriptive error, fail fast) with the standard 'is not supported' / 'returns an error' / 'fails with an error' forms used across docs/user/. No behavioral claims changed. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Strengthen no-kNN integration tests with exact message and scan-operator assertions Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Add integration test for vectorSearch() alias with multiple backing indices Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Remove em dashes from added comments Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Reject outer operators over vectorSearch() subqueries Extend the subquery-boundary walker to also reject outer ORDER BY, OFFSET, GROUP BY, aggregation, and DISTINCT when separated from the scan by a subquery Project. Previously only outer WHERE was rejected, so these operators would be applied in memory after k-NN had already returned the top-k by vector distance and could silently yield zero rows or mis-ordered results. Outer LIMIT without OFFSET is still allowed and locked in by a positive test. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address PR review: operator-specific error messages, inner ORDER BY _score IT, reword pushdown phrasing Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [Feature] Mark vectorSearch() experimental and default to efficient filtering Mark the vectorSearch() table function as experimental in the user doc following the repo convention (title [Experimental] suffix), and flip the default WHERE filter placement from post-filtering to efficient pre-filtering so a query without filter_type embeds the predicate under knn.filter for ANN-time pruning. Production code: filterType=null in VectorSearchIndex now resolves to FilterType.EFFICIENT, and VectorSearchQueryBuilder's full constructor defaults to EFFICIENT when passed null. The test-only 3-arg constructor stays pinned to POST because it does not wire a rebuildKnnWithFilter callback and EFFICIENT mode requires one. Allow-list error messages are reworded to neutral wording ("vectorSearch WHERE pre-filtering does not support...") so default-path users never see internal filter_type=efficient terminology and get a clear "set filter_type=post" fallback hint. Doc updates the Filtering section to describe Omitted=efficient as the default, with post framed as the opt-in fallback for predicates outside the efficient allow-list. Example 4 shows the default knn.filter shape; Example 5 shows filter_type=post for arithmetic predicates. Tests: BETWEEN / NOT IN regression guards pin filter_type=post explicitly so they continue to assert the post-filter DSL shape. testPostFilterReturnsOnlyMatchingDocs pins filter_type=post so the test name still reflects what it exercises. New default-shape IT coverage asserts knn.filter embeds the predicate and there is no outer bool wrapping. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [Doc] Apply reviewer feedback on vectorSearch() user doc - Move translation-failure in-memory fallback note from explicit `post` to the default (omitted) section; under explicit `post` the query errors instead. - Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY, aggregation, and DISTINCT over a vectorSearch() subquery (matches the expanded rejection landed via #5385); note that plain outer LIMIT without OFFSET is allowed. - Add engine/method caveat for default `efficient` filtering and soften "pre-filtering during ANN search" phrasing to "native efficient k-NN filtering". - Clarify that full-text predicates under WHERE act as filters, not as hybrid relevance scorers. - Rename Example 5 to "Post-filtering for predicates not supported by efficient mode". - Tighten explicit `efficient` wording to emphasize it fails closed. - Reword radial examples and supported option keys to say "matches / returns up to the specified LIMIT documents" instead of "returns all". - Add alias fan-out note under the `table` argument. - Sweep remaining em dashes in the file to plain text. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * [Doc] Apply second-pass review polish to vectorSearch() user doc - Introduction: add a sentence noting that the SQL layer translates vectorSearch() into an OpenSearch search request whose body is native k-NN query DSL, with the query vector parsed into a numeric array before emission. - Soften the multi-backing alias note: SQL validates the table string shape only; it does not prevalidate per-backing-index mapping, dimension, or engine compatibility. OpenSearch execution remains the source of truth for those checks. - Rewrite the full-text paragraph: placement now follows filter_type, so under default `efficient` full-text predicates are embedded under `knn.filter` (not only "alongside" the k-NN query). Keep the not-hybrid-scorer clarification. - Reword `post` bullet to describe Boolean filter placement (`bool.must(knn)` + `bool.filter(where)`) instead of "runs first"; explicitly contrast with the REST `post_filter` parameter, and note that selective filters can yield fewer than k rows. - Rename Example 4 to "Default efficient filtering (no filter_type)" and replace the remaining "pre-filtering" mention with "efficient filtering" to align with OpenSearch k-NN terminology. - Scoring section: use a concrete `v._score` example for readability alongside the `<alias>._score` form. - Limitations: replace "top-k rows" with "finite result set" to cover both top-k and radial modes. Signed-off-by: Eric Wei <mengwei.eric@gmail.com> --------- Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extends the vectorSearch() subquery-boundary walker to reject outer
ORDER BY,OFFSET,GROUP BY, aggregation, andDISTINCTin addition to outerWHERE. Previously these would be applied in memory after k-NN had already returned the top-k by vector distance, silently yielding zero rows or mis-ordered results.Outer
LIMITwithoutOFFSETremains allowed and is locked in by a positive test.Test plan
VectorSearchIndexScanBuilderTestfor outer Sort / Limit-with-offset / Limit-without-offset / AggregationVectorSearchSubqueryITfor outerORDER BY,OFFSET,LIMIT(positive),COUNT(*),GROUP BY,DISTINCT:opensearch:testand:integ-test:integTest -Dtests.class=...VectorSearchSubqueryITpass locallyspotlessCheckclean