Skip to content

Reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries#5385

Merged
mengweieric merged 2 commits intoopensearch-project:feature/vector-search-p0from
mengweieric:pr/outer-subquery-bypass
Apr 28, 2026
Merged

Reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries#5385
mengweieric merged 2 commits intoopensearch-project:feature/vector-search-p0from
mengweieric:pr/outer-subquery-bypass

Conversation

@mengweieric
Copy link
Copy Markdown
Collaborator

Summary

Extends the vectorSearch() subquery-boundary walker to reject outer ORDER BY, OFFSET, GROUP BY, aggregation, and DISTINCT in addition to outer WHERE. Previously these would be applied in memory after k-NN had already returned the top-k by vector distance, silently yielding zero rows or mis-ordered results.

Outer LIMIT without OFFSET remains allowed and is locked in by a positive test.

Test plan

  • New unit tests in VectorSearchIndexScanBuilderTest for outer Sort / Limit-with-offset / Limit-without-offset / Aggregation
  • New ITs in VectorSearchSubqueryIT for outer ORDER BY, OFFSET, LIMIT (positive), COUNT(*), GROUP BY, DISTINCT
  • :opensearch:test and :integ-test:integTest -Dtests.class=...VectorSearchSubqueryIT pass locally
  • spotlessCheck clean

Extend the subquery-boundary walker to also reject outer ORDER BY,
OFFSET, GROUP BY, aggregation, and DISTINCT when separated from the
scan by a subquery Project. Previously only outer WHERE was rejected,
so these operators would be applied in memory after k-NN had already
returned the top-k by vector distance and could silently yield zero
rows or mis-ordered results.

Outer LIMIT without OFFSET is still allowed and locked in by a
positive test.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric mengweieric added bug Something isn't working skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. SQL feature labels Apr 24, 2026
@mengweieric mengweieric changed the title [BugFix] vectorSearch(): reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries [BugFix] Reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries Apr 24, 2026
@mengweieric mengweieric changed the title [BugFix] Reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries Reject outer ORDER BY / OFFSET / aggregation / DISTINCT over subqueries Apr 24, 2026
…score IT, reword pushdown phrasing

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric mengweieric merged commit c06c472 into opensearch-project:feature/vector-search-p0 Apr 28, 2026
36 checks passed
mengweieric added a commit to mengweieric/sql that referenced this pull request Apr 29, 2026
- Move translation-failure in-memory fallback note from explicit `post`
  to the default (omitted) section; under explicit `post` the query
  errors instead.
- Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY,
  aggregation, and DISTINCT over a vectorSearch() subquery (matches the
  expanded rejection landed via opensearch-project#5385); note that plain outer LIMIT
  without OFFSET is allowed.
- Add engine/method caveat for default `efficient` filtering and
  soften "pre-filtering during ANN search" phrasing to "native efficient
  k-NN filtering".
- Clarify that full-text predicates under WHERE act as filters, not as
  hybrid relevance scorers.
- Rename Example 5 to "Post-filtering for predicates not supported by
  efficient mode".
- Tighten explicit `efficient` wording to emphasize it fails closed.
- Reword radial examples and supported option keys to say "matches /
  returns up to the specified LIMIT documents" instead of "returns all".
- Add alias fan-out note under the `table` argument.
- Sweep remaining em dashes in the file to plain text.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
mengweieric added a commit that referenced this pull request Apr 29, 2026
- Move translation-failure in-memory fallback note from explicit `post`
  to the default (omitted) section; under explicit `post` the query
  errors instead.
- Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY,
  aggregation, and DISTINCT over a vectorSearch() subquery (matches the
  expanded rejection landed via #5385); note that plain outer LIMIT
  without OFFSET is allowed.
- Add engine/method caveat for default `efficient` filtering and
  soften "pre-filtering during ANN search" phrasing to "native efficient
  k-NN filtering".
- Clarify that full-text predicates under WHERE act as filters, not as
  hybrid relevance scorers.
- Rename Example 5 to "Post-filtering for predicates not supported by
  efficient mode".
- Tighten explicit `efficient` wording to emphasize it fails closed.
- Reword radial examples and supported option keys to say "matches /
  returns up to the specified LIMIT documents" instead of "returns all".
- Add alias fan-out note under the `table` argument.
- Sweep remaining em dashes in the file to plain text.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Swiddis pushed a commit that referenced this pull request Apr 29, 2026
* [Feature] Add table function relation syntax to SQL grammar (#5318)

* [Feature] Add table function relation to SQL grammar for vectorSearch()

Add table function relation support to the SQL parser:
- New `tableFunctionRelation` alternative in `relation` grammar rule
- Named argument syntax: `key=value` (e.g., table='index', field='vec')
- Alias is required by grammar (FROM func(...) AS alias)
- AstBuilder emits existing TableFunction + SubqueryAlias AST nodes
- 3 parser unit tests: basic parse, with WHERE/ORDER BY/LIMIT, alias required

This is a pure grammar change — no execution support yet. Queries will
parse successfully but fail at the Analyzer with "unsupported function".

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review feedback on table function relation grammar

1. Canonicalize argument names at parser boundary:
   unquoteIdentifier + toLowerCase(Locale.ROOT) in visitTableFunctionRelation
   so FIELD='x' and `field`='x' both produce argName="field"

2. Make AS keyword optional (AS? alias) for consistency with
   tableAsRelation and subqueryAsRelation grammar rules

3. Strengthen test coverage:
   - Full structural AST assertion for WHERE + ORDER BY + LIMIT
     (verifies Sort, Limit, Filter nodes, not just toString)
   - Argument reorder test proves names resolve by name not position
   - Case canonicalization test (TABLE= → table=)
   - Alias-without-AS test (FROM func(...) v)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Apply spotless formatting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

---------

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add knn_vector as recognized MappingType in OpenSearchDataType

Maps knn_vector fields to ExprCoreType.ARRAY so they appear in
DESCRIBE output and can be referenced in projections. This is a
visibility shim — not a full vector type.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Widen OpenSearchIndexScanBuilder constructor to public

VectorSearchIndex.createScanBuilder() needs to construct an
OpenSearchIndexScanBuilder with a custom VectorSearchQueryBuilder
delegate. The existing constructor was protected (test-only).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add vector search table function, query builder, and index

Introduces the core execution pipeline for vectorsearch():
- VectorSearchTableFunctionResolver: registers vectorsearch with 4 STRING args
- VectorSearchTableFunctionImplementation: parses named args, vector literal,
  options string, validates search mode (k/max_distance/min_score)
- VectorSearchIndex: extends OpenSearchIndex with knn query seeding,
  score tracking, and WrapperQueryBuilder DSL construction
- VectorSearchQueryBuilder: keeps knn in must (scoring) context,
  WHERE filters in filter (non-scoring) context

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Register VectorSearchTableFunctionResolver in OpenSearchStorageEngine

Override getFunctions() to expose vectorsearch() table function
to the query analysis pipeline.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add VectorSearchQueryBuilder unit tests

Verifies knn query is placed in scoring (must) context, not wrapped
in bool.filter when no WHERE clause is present.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review feedback: add validation guards and pushDownFilter test

- Add pushDownFilter() unit test asserting knn stays in bool.must
  (scoring) and WHERE predicate goes to bool.filter (non-scoring)
- Add option key allowlist (k, max_distance, min_score) to reject
  unknown/unsupported keys before they reach DSL generation
- Add field name validation to reject characters that could corrupt
  the WrapperQueryBuilder JSON (allows alphanumeric, dots, underscores,
  hyphens)
- Add named-arg type guard to reject non-NamedArgumentExpression args
  early with a clear error message

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Canonicalize option values as numeric types before DSL generation

Parse k as integer, max_distance and min_score as double before they
reach buildKnnQuery(). Rejects non-numeric and non-finite values with
clear errors. This closes the residual JSON-injection path through
option values without requiring full XContent migration.

Also fixes toString() to be consistent with the named-arg guard
(no longer blindly casts to NamedArgumentExpression).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Harden input validation and add size=k default for top-k mode

- parseOptions: reject malformed segments and duplicate keys
- parseVector: wrap errors in ExpressionEvaluationException, reject
  non-finite floats (Infinity, NaN)
- VectorSearchIndex: default requestedTotalSize to k via
  pushDownLimitToRequestTotal so queries without LIMIT return k results
- Add 5 new tests: malformed option, duplicate key, empty vector,
  malformed vector component, non-finite vector component

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add null-arg-name guard and make storage engine test less brittle

- validateNamedArgs() now rejects null/empty arg names defensively,
  closing a potential NPE if the shared table-function path is later
  wired into PPL
- OpenSearchStorageEngineTest uses contains-check instead of exact
  collection size assertion
- Add testNullArgNameThrows test

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Clean up dead code, fix misleading comment and test name

- Remove unused VECTOR_OPTION constant from VectorSearchIndex
- Clarify buildKnnQuery() comment: quoted fallback is for forward
  compatibility, all P0 values are already canonicalized as numeric
- Rename testMissingSearchModeOptionThrows to
  testUnknownOptionKeyOnlyThrows to match what it actually tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add test for missing required option validation path

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add mutual exclusivity and k range validation

- Enforce exactly one of k, max_distance, or min_score
- Validate k is in [1, 10000] range
- Add 6 tests: mutual exclusivity (3 combos), k too small, k too
  large, k boundary values (1 and 10000)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add LIMIT > k rejection in top-k vector search mode

VectorSearchQueryBuilder now accepts options map and rejects
pushDownLimit when LIMIT exceeds k. Radial modes (max_distance,
min_score) have no LIMIT restriction.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add comprehensive test coverage for vector search hardening

- Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson()
  for top-k, max_distance, min_score, nested fields, multi-element
  and single-element vectors, numeric option rendering
- Add edge case tests to VectorSearchTableFunctionImplementationTest:
  NaN vector component, empty option key/value, negative k, NaN for
  max_distance and min_score (6 new tests)
- Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT,
  pushDownSort delegation to parent (2 new tests)
- Extract buildKnnQueryJson() as package-private for direct testing

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add resolver argument count edge case tests

Test too-many (5) and zero arguments paths in
VectorSearchTableFunctionResolver to complement existing
too-few (2) test.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add radial size policy, sort restriction, and integration tests

- Cap radial mode (max_distance/min_score) results at maxResultWindow
  to prevent unbounded result sets
- Reject ORDER BY on non-_score fields and _score ASC in vectorSearch
  since knn results are naturally sorted by _score DESC
- Add 12 integration tests: 4 _explain DSL shape verification tests
  and 8 validation error path tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Fill test coverage gaps for vector search hardening

- Add multi-sort expression test: ORDER BY _score DESC, name ASC
  correctly rejects the non-_score field (VectorSearchQueryBuilderTest)
- Add case-insensitive argument name lookup test to verify
  TABLE='x' resolves same as table='x' (Implementation test)
- Add non-numeric option fallback test: verifies string options
  are quoted in JSON output (VectorSearchIndexTest)
- Add 4 integration tests: ORDER BY _score DESC succeeds,
  ORDER BY non-score rejects, ORDER BY _score ASC rejects,
  LIMIT within k succeeds (VectorSearchIT, now 16 tests)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Preserve sort.getCount() limit pushdown contract in pushDownSort

The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes
sort.getCount() as a limit when non-zero. Our override validated
_score DESC and returned true, but did not preserve this contract.

SQL always sets count=0, so this was not reachable today, but PPL
or future callers may set a non-zero count to combine sort+limit
in one LogicalSort node. Preserve the behavior defensively.

Add focused test: LogicalSort(count=7) with _score DESC verifies
the count is pushed down as request size.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add compound predicate and radial+WHERE test coverage

- Unit test: compound AND predicate survives pushdown into bool.filter
- Integration test: compound WHERE (term + range) produces bool query
- Integration test: radial max_distance with WHERE produces bool query

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Route pushDownSort count through LIMIT > k validation

pushDownSort() called requestBuilder.pushDownLimit() directly, bypassing
the LIMIT > k guard in pushDownLimit(). Extract validateLimitWithinK()
helper and call it from both paths so the invariant holds when PPL or
future callers set a non-zero sort count.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Split explain tests into dedicated VectorSearchExplainIT

Move all explainQuery()-based DSL shape tests into a dedicated
VectorSearchExplainIT suite. VectorSearchIT now contains only
validation and error-path tests.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add FilterType enum for post|efficient filter placement

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add filter_type to allowed option keys with post|efficient validation

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Strip filter_type from options and pass as typed FilterType to VectorSearchIndex

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Collapse buildKnnQueryJson to accept optional filter clause

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Implement efficient filter pushdown branching and build-time validation in VectorSearchQueryBuilder

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Wire FilterType and rebuild callback through createScanBuilder

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add build-time validation and regression tests for LIMIT/sort under efficient mode

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add integration tests for filter_type=post|efficient and spotless formatting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Reject radial vector search without LIMIT

Radial search (max_distance or min_score) can return unbounded results.
Add build-time validation that rejects radial queries without an explicit
LIMIT clause, with a clear error message guiding the user.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Fix limitPushed not set when limit comes through pushDownSort count path

pushDownSort with a non-zero sort.getCount() pushes a limit to
requestBuilder directly, bypassing pushDownLimit() and leaving
limitPushed=false. This causes build() to incorrectly reject radial
vector search when the limit arrives via the sort-with-count path
(e.g. PPL sort command). Set limitPushed=true in the sort.getCount()
block alongside the existing requestBuilder.pushDownLimit() call.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Handle non-pushdownable WHERE with explicit filter_type

pushDownFilter() did not catch ScriptQueryUnSupportedException, so
non-pushdownable filters (e.g. struct-type fields) would propagate a
raw internal exception instead of a clean SQL-layer error.

With explicit filter_type: throw a clear error explaining the WHERE
clause cannot be pushed down for the requested filter placement.
Without explicit filter_type: return false to fall back to in-memory
filtering, matching the base class behavior.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add defensive null guard for EFFICIENT mode rebuild callback

Reject construction of VectorSearchQueryBuilder with
FilterType.EFFICIENT and a null rebuildKnnWithFilter callback at
construction time instead of deferring to an NPE in pushDownFilter.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review feedback: reword user-facing error, strengthen explain test, rename constructor comment

- Reword filter_type error message to be user-friendly and actionable
  (no longer leaks internal ScriptQueryUnSupportedException text)
- Strengthen efficient-mode explain IT: assert no bool/must (proves not
  post-filter shape), decode base64 knn payload to verify filter and
  predicate field are embedded inside knn query
- Rename "Backward-compatible constructor" to clarify intent

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Rename remaining backward-compatible constructor comment for consistency

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Reject GROUP BY / aggregations on vectorSearch() relations

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review: reframe as SQL-preview constraint; add GROUP BY and bare-aggregate ITs; drop subquery workaround text

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Strengthen VectorSearchExplainIT with structural DSL assertions

The explain suite previously relied on substring checks (contains("wrapper"),
contains("bool"), contains("filter")) that could false-positive across distinct
push-down shapes — notably, post-filter vs. efficient-mode filter_type output
look similar at the substring level.

Add shared helpers to (1) decode every base64 WrapperQueryBuilder payload into
its inner k-NN JSON and (2) extract and unescape the outer sourceBuilder JSON
from explain output. Strengthen the top-k, radial (max_distance, min_score),
post-filter, compound-predicate, radial-with-WHERE, and filter_type=post tests
to assert both the outer bool/must/filter shape and that WHERE predicates
stay out of the inner k-NN payload. Refactor the efficient-mode test onto the
same helpers so its no-outer-bool/must guard is verified against the unescaped
sourceBuilder JSON rather than accidentally passing on escaped output.

Leave the ORDER BY _score DESC and LIMIT <= k smoke tests untouched — those
are guard-rail "does planning succeed" checks where shape is covered elsewhere.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review nits: outer-bool guard, helper comment, efficient+ORDER BY

- Add assertFalse(contains("\"bool\"")) on the outer sourceBuilder JSON for the
  plain top-k and radial (max_distance, min_score) cases so a regression that
  wrapped the knn in an outer bool while preserving the inner payload is
  caught.
- Document the SOURCE_BUILDER_JSON helper's test-only coupling to the
  surrounding "sourceBuilder=...", "pitId=" tokens in the request-string
  toString() output — if that format changes, update the regex anchors.
- Strengthen testEfficientFilterWithOrderByScoreDescSucceeds onto the same
  helpers as testExplainFilterTypeEfficientProducesKnnWithFilter: verify no
  outer bool/must in the sourceBuilder JSON and that the WHERE predicate is
  embedded inside the decoded knn payload.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Fail fast when k-NN plugin is missing

vectorSearch() currently lets queries proceed even when the k-NN plugin
is not installed; the user sees a cryptic native OpenSearch error deep
in execution. This change adds a lazy capability probe that runs on the
first vectorSearch() invocation per JVM and throws a clear message when
the plugin is absent.

- KnnPluginCapability: probes Nodes Info once via NodeClient, caches the
  result. Successful results (positive or negative) are cached; probe
  failures (IO, timeout) are not, so subsequent calls retry. If no
  NodeClient is available (REST-client / standalone mode), the check is
  skipped and execution-time errors remain the signal.
- The probe uses the canonical k-NN plugin class name rather than the
  artifact name so it survives repackaging variants.
- Wired into VectorSearchTableFunctionImplementation.applyArguments() so
  the failure surfaces before any DSL is built. Resolver instantiates
  one shared capability per registration; the constructor remains
  backward-compatible (two-arg form delegates to the three-arg form).
- Unit tests cover: node-client absent (skip), plugin present (pass),
  plugin absent (throw), positive/negative caching, and probe-failure
  non-caching.

Plugin registration stays unconditional — failing at registration would
break introspection (SHOW FUNCTIONS etc.) on clusters without k-NN.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Defer k-NN plugin probe to scan open() so _explain keeps working

The original placement ran knnCapability.requireInstalled() at the
very top of applyArguments(). That hook runs during analysis for
both _query AND _explain, which regressed two existing contracts on
clusters without the k-NN plugin:

- VectorSearchExplainIT explicitly states "_explain only parses and
  plans the query and does NOT require the k-NN plugin". The check
  would have made every explain test fail with "k-NN plugin missing".
- VectorSearchIT is the local-validation/error-path suite. Malformed
  queries on a pluginless cluster would have collapsed to the k-NN
  error instead of the existing SQL validation errors.

Move the probe to VectorSearchIndexScan.open() — a new thin subclass
of OpenSearchIndexScan whose only override calls requireInstalled()
before super.open(). Scan open() runs only on the execute path;
_explain walks the built plan without opening it, so explain stays
functional.

Also reorder applyArguments() — nothing to relocate now that the
capability check has moved, but the comment documents the intent:
local validation runs first so malformed-query errors are stable
regardless of cluster state or probe outcome.

Add two ITs on the default (pluginless) integ-test cluster:
- testExecutionWithoutKnnPluginReturnsCapabilityError — verifies the
  clear "k-NN plugin not installed" error surfaces on execute.
- testExplainWithoutKnnPluginStillWorks — guards the _explain
  contract so future probe relocations can't silently break it.

Also drops the "SQL preview" framing from VectorSearchIndexScanBuilder's
aggregation-rejection error per prior user instruction.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [BugFix] Reject positional args and tighten table-name validation in vectorSearch()

Close the silent-no-op path where `vectorSearch('idx', field='f', ...)` parsed
through the legacy engine and returned HTTP 200 with zero rows. The V2 SQL
grammar now accepts `tableFunctionArg : ident EQUAL_SYMBOL functionArg |
functionArg`, so the positional shape reaches AstBuilder, which throws a
SemanticCheckException (not SyntaxCheckException) to prevent legacy fallback.

Also:
- Route arity mismatch through ExpressionEvaluationException for consistency
  with the other user-facing resolver errors.
- Reject `_all`, `.`, and `..` as table names explicitly so the preview's
  single-index contract is not bypassed by wildcard or pathologic names.
- Duplicate-name check in the Implementation layer uses Locale.ROOT when
  lower-casing so `TABLE` and `table` are always treated as the same key.
- Update the resolver comment to describe exactly which checks the
  Implementation repeats as defense-in-depth (non-named and duplicate-name,
  not the unknown-name allowlist).
- Add IT coverage for the real positional+named shape, case-insensitive
  duplicates, missing required args, and `_all` / `.` / `..` table names.
- Add unit tests mirroring the new table-name rejections.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Drop PR-reference from Argument-shape section header

Remove the "(PR B)" qualifier from the section header per the reviewer
request. Code should not reference other PRs in the stack — the section
header describes the tests' scope.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [BugFix] Lock in BETWEEN / NOT IN pushdown shapes for vectorSearch

Round 2 manual testing confirmed that WHERE clauses with BETWEEN and NOT IN
predicates on a vectorSearch() relation push down as native OpenSearch DSL
(range / bool.must_not) rather than falling back to a serialized script query.
These two shapes were previously unwitnessed in VectorSearchExplainIT.

Add explain-plan integration tests that pin the current behaviour so any
regression that causes BETWEEN or NOT IN to degrade to a script query, or
to leak the predicate into the knn payload under POST filter mode, fails
loudly in CI instead of silently regressing query performance.

- testBetweenPushesAsRange: asserts BETWEEN desugars to native range DSL with
  from=50 / to=200 bounds pushed outside the knn payload.
- testNotInPushesAsMustNotTerms: asserts NOT IN desugars to bool.must_not
  wrapping bool.should[term, term] on the keyword subfield, also outside
  the knn payload.

Pure test addition; no production code changes.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Rewrite IS NOT NULL / IS NULL to native exists DSL in v2 filter pushdown

The v2 FilterQueryBuilder's luceneQueries map did not register entries for
BuiltinFunctionName.IS_NULL or IS_NOT_NULL, so visitFunction fell through to
buildScriptQuery for these unary predicates. This produced a compounded script
query in the pushed-down DSL, e.g.:

    {"script": {"script": {"source": "{\"langType\":\"v2\",\"script\":\"is not null(age)\"}", ...}}}

instead of the native OpenSearch exists query. Script fallback is more
expensive to evaluate, is not supported on AOSS / serverless, and diverges
from what the Calcite path and downstream tooling already produce.

This change adds a new ExistsQuery lucene query class that overrides the
LuceneQuery.canSupport(FunctionExpression) / build(FunctionExpression) pair
to accept a single ReferenceExpression argument (the standard base class
contract is binary {ref, literal}). Two instances are registered in
FilterQueryBuilder, one for each predicate sense, producing:

    IS NOT NULL col  ->  {"exists": {"field": "col"}}
    IS NULL col      ->  {"bool": {"must_not": [{"exists": {"field": "col"}}]}}

matching the Calcite PredicateAnalyzer output for these predicates.

Nested-field predicates are explicitly guarded: ExistsQuery.canSupport
returns false when the argument is a nested() function, mirroring the
equivalent guard in PredicateAnalyzer ("OpenSearch DSL does not handle
IS_NULL / IS_NOT_NULL on nested fields correctly"). When canSupport returns
false the existing default branch in FilterQueryBuilder.visitFunction falls
through to the script query path, preserving the prior behavior for that
edge case.

This closes the tracker row "Pushdown coverage - IS NOT NULL pushes script
instead of exists". The change affects all v2 SQL queries, not just
vectorSearch table function queries.

Test coverage:
- FilterQueryBuilderTest: updated should_build_script_query_for_unsupported_lucene_query
  to two new assertions - should_build_exists_query_for_is_not_null and
  should_build_must_not_exists_query_for_is_null - pinning the new DSL shape.
- ExistsPushdownIT (new): explain-plan integration tests asserting that
  WHERE col IS NOT NULL pushes as native exists DSL (no "script" key) and
  WHERE col IS NULL pushes as bool/must_not[exists].
- AggregationIT (existing) continues to pass, confirming result-level
  correctness is preserved for queries like "where int0 IS NOT NULL".

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Force script fallback for nested IS NULL / IS NOT NULL

ExistsQuery.canSupport() already returns false for nested references, but
the default branch of FilterQueryBuilder.visitFunction() has a second
dispatch: when canSupport() is false, it checks isNestedPredicate() and, if
true, routes to NestedQuery.buildNested(func, query). That helper reads
func.getArguments().get(1) on the assumption of a binary predicate shape,
so unary IS NULL / IS NOT NULL would throw at runtime.

Override isNestedPredicate() to return false in ExistsQuery so the dispatch
falls through to the compiled-script path for nested-field predicates. Add
unit tests covering both IS NULL and IS NOT NULL to lock in the fallback.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Clearer error when vectorSearch() is used without a table alias

Running `SELECT ... FROM vectorSearch(...)` without an AS <alias>
previously produced an opaque parser error from the legacy SQL engine
fallback (`ERROR. token : TABLE, pos : 32`). Make the alias optional in
the grammar rule for `tableFunctionRelation` and reject the missing-alias
case in AstBuilder with a SemanticCheckException that tells the user to
add an alias, for example `FROM vectorSearch(...) AS v`.

SemanticCheckException (not SyntaxCheckException) is used on purpose so
the request does not fall back to the legacy engine, whose opaque error
would otherwise mask this message.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Reject outer WHERE on vectorSearch() subqueries

When vectorSearch() is wrapped in a subquery and the outer query has its
own WHERE clause, for example:

  SELECT * FROM (SELECT v.firstname, v.state
                 FROM vectorSearch(table='idx', field='emb',
                                   vector='[1,2]', option='k=5') AS v) t
  WHERE t.state = 'TX'

the outer predicate never reaches the push-down contract. The plan shape
after analysis is LogicalFilter over LogicalProject over the scan
builder, so the PUSH_DOWN_FILTER rule (which matches filter directly
above scanBuilder) does not fire. The filter is then applied in memory
after k-NN has already returned top-k documents ranked by vector
distance, which can silently yield zero rows with no explanation.

This commit adds a post-optimization validation hook on TableScanBuilder
(default no-op) and invokes it from Planner.plan() once all push-down
rules have settled. VectorSearchIndexScanBuilder overrides the hook to
walk the fully optimized plan and reject the outer-WHERE-over-subquery
shape: a LogicalFilter whose descendant chain reaches the scan builder
through one or more LogicalProject nodes. A filter directly above the
scan builder (WHERE on vectorSearch() itself) is explicitly allowed
because that path has already been exercised by push-down.

The error message names the shape and tells the user to move the
predicate inside the subquery so it is applied during the k-NN search.

Tests:
 - VectorSearchIndexScanBuilderTest adds six cases covering the bad
   shape, nested subqueries, and the three positive controls (filter
   directly on scanBuilder, inner filter wrapped in outer project, no
   filter at all) plus a bare-scanBuilder defensive baseline.
 - VectorSearchSubqueryIT adds integration tests for the exact tracker
   query shape, the with-LIMIT variant, the _explain path, and positive
   controls for inner WHERE and no-WHERE subqueries.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Preserve subquery Project marker across inner Filter in walker

The previous walker reset sawProjectSinceFilter to false when entering a
new LogicalFilter, which missed the shape
Filter(outer) -> Project(subquery) -> Filter(inner) -> scanBuilder. That
shape is produced by outer WHERE + inner WHERE pairs such as
  SELECT * FROM (
    SELECT v.* FROM vectorSearch(...) WHERE v.age > 10
  ) t WHERE t.state = 'TX'
and the outer predicate is still separated from the k-NN search by the
subquery Project boundary regardless of the inner filter.

Keep the project marker across nested LogicalFilter nodes so the walker
still rejects outer WHERE over a subquery when both outer and inner WHERE
are present. Add a unit test for filter(project(filter(scanBuilder))) and
an IT for the outer+inner SQL shape.

Also soften the remediation hint: "so it can participate in the
vectorSearch WHERE pushdown contract" instead of "applied during the k-NN
search", which more accurately describes why the inner WHERE is accepted.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Apply spotless formatting to walker Javadoc

Google Java Format wraps the list-item continuation line at the 100-char
mark; the previous commit landed a non-conforming break. Apply
:opensearch:spotlessApply to keep the formatter happy.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [BugFix] Reject vectorSearch on index with user _score field

vectorSearch() exposes a synthetic v._score column backed by the
document metadata _score. If a user index also defines a stored field
named _score, the response tuple builder collides the two values into
the same key and fails with an opaque duplicate-key error from
ImmutableMap.Builder.

Probe result on OpenSearch 3.6:
  - _id: mapping creation is rejected by OpenSearch at index time, so
    no collision is possible. A lock-in IT guards that assumption.
  - _score: mapping creation is accepted. Without a guard,
    vectorSearch() against such an index fails opaquely at response
    time.

Fix: reject the mapping in VectorSearchIndex.createScanBuilder() with
a clear, user-facing SQL error before any request is sent. The guard
runs during planning, so _explain also surfaces the problem without
needing the k-NN plugin. The check lives in VectorSearchIndex to keep
the AstBuilder and TableFunction resolver paths untouched.

Tests:
  - VectorSearchIndexTest.createScanBuilderRejectsIndexWithScoreField
    covers the guard via a mocked mapping.
  - VectorSearchIT.testUserMappingWithIdFieldIsRejectedByOpenSearch
    locks in OpenSearch's _id-rejection behavior.
  - VectorSearchIT.testVectorSearchAgainstIndexWithScoreFieldRejects
    exercises the guard end-to-end via _explain (no k-NN plugin needed).
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review nits on synthetic _score collision guard

- Use OpenSearchIndex.METADATA_FIELD_SCORE constant in place of the bare
  "_score" literal for the mapping-field check.
- Drop "stored field" wording in the error message and code comment;
  the mapping does not use "store": true, so "user field" is what we
  actually mean.
- Scope the block comment to the _id / _score cases this PR actually
  locks in; drop the broader "(and most other metadata names)" claim.
- Shorten comments on the guard and both ITs to minimal WHY-only form.
- Assert HTTP 400 in testVectorSearchAgainstIndexWithScoreFieldRejects
  so the IT surfaces an accidental status-code regression explicitly.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Reject OFFSET, WHERE on _score, and script subtrees under filter_type=efficient

Three bug fixes in VectorSearchQueryBuilder that close silent-data-loss paths
on vectorSearch() relations:

1. OFFSET on vectorSearch(): pushDownLimit now rejects any LogicalLimit with
   a non-zero offset. The parent path would push `from: <offset>` into the
   OpenSearch request, silently shifting the top-k window and dropping the
   highest-scoring results. LogicalSort carries only a count (no offset),
   so no change is needed on the sort path.

2. WHERE on _score: pushDownFilter walks the condition with an
   ExpressionNodeVisitor before compiling to DSL and rejects any
   ReferenceExpression whose attr matches "_score" case-insensitively.
   Previously the synthetic _score column compiled to a range query on a
   non-existent stored field, which returned zero rows at the cluster with
   no error. Users who want a score floor should use option='min_score=...'.

3. Script subtrees under filter_type=efficient: after building the WHERE
   QueryBuilder, the EFFICIENT branch recursively scans for any
   ScriptQueryBuilder (traversing BoolQueryBuilder's must/filter/should/
   mustNot lists and ConstantScoreQueryBuilder wrappers). If found, refuse
   to embed it under knn.filter. AOSS and serverless vector collections
   reject script queries inside knn.filter; without this guard the user
   saw an opaque cluster-side failure instead of a clear plan-time error.

Unit tests cover each rejection (including the compound predicate path for
_score, the uppercase _SCORE bypass attempt, and the POST-mode negative
case for script predicates). Integration tests assert 400 status and the
user-facing message from the SQL endpoint, including the combined sort +
limit + offset shape that exercises the pushDownSort() path.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Replace script-query blacklist with fail-closed allow-list for efficient filter

The prior containsScriptQuery walker missed any wrapper type it didn't know about, which risked silently embedding unsupported queries (e.g., NestedQueryBuilder) under knn.filter. Switch to an allow-list validator over FilterQueryBuilder's actual leaf output (term/range/wildcard/match*/query_string/simple_query_string/match_bool_prefix/exists), recurse bool and constant_score, reject script and nested with targeted messages, and fail closed on unknown shapes.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [BugFix] Tighten vectorSearch() option and vector parsing error messages

Polishes validation behavior for vectorSearch() so malformed options and
vector literals surface a clear, actionable error instead of a generic
one:

- Reject empty option values and mid-segment empty values with a
  "expected key=value" message that names the offending segment.
- Trim whitespace around option keys and values so natural spacing
  (e.g., "k = 5") works.
- Reject unknown option keys with a message listing the supported set.
- Detect the common non-comma separators (';', ':', '|') in vector
  literals and surface a dedicated message pointing to the correct
  comma-separated form, rather than a generic "Invalid vector
  component".
- Reject negative max_distance and min_score values with a clear
  non-negative requirement, while continuing to accept zero.

Adds unit coverage in VectorSearchTableFunctionImplementationTest and
integration coverage in VectorSearchIT for the new error paths.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review feedback: reject empty vector / option segments, stabilize key order

- parseVector now uses split(",", -1) and rejects empty components so
  malformed literals like "[1.0,]" or "[1.0,,2.0]" surface an explicit
  error instead of silently shrinking the vector to a lower dimension.
- parseOptions now rejects empty option segments with an explicit
  "Malformed option segment" error; a wholly empty option string
  continues to flow to the "Missing required option" gate for the
  existing, clearer message.
- ALLOWED_OPTION_KEYS is now a List so the unknown-key error renders
  the supported keys in a stable, user-friendly order
  (k, max_distance, min_score, filter_type) rather than Set iteration
  order.

Adds unit coverage for trailing, leading, and consecutive empty
segments in both vector and option parsing, and for stable key order
in the unknown-option-key error message. Adds an IT for the trailing
comma vector case.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Apply second-pass polish: return parsed numeric options, tighten key-order test

- parseIntOption / parseDoubleOption now return the parsed value so
  validateOptions does not re-parse the canonicalized string solely to
  check numeric bounds.
- The unknown-option-key stable-order test now matches the rendered
  list literal "[k, max_distance, min_score, filter_type]" rather
  than using indexOf("k"), which would match the "k" in "Unknown
  option key" and reduce the assertion to a tautology.
- Adds a direct unit test pinning parseOptions("") -> empty map, so
  the wholly-empty option string contract (which flows to the
  "Missing required option" gate) is exercised explicitly.

No functional change for user-visible paths.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [BugFix] Surface dedicated error for wildcard or multi-target table in vectorSearch()

vectorSearch() requires a single concrete index because top-k
semantics, dimension checks, and the embedded filter JSON are not
defined across heterogeneous shards. Values like 'sql_vector_*' or
'idx_a,idx_b' were already rejected by the generic SAFE_FIELD_NAME
character-class regex, but surfaced the fallback message
"must contain only alphanumeric characters, dots, underscores, or
hyphens", which does not tell the user what the actual constraint
is.

This change adds a dedicated check (before the regex) that flags
'*' and ',' in the table name and surfaces:

  "Invalid table name '...': vectorSearch() requires a single
   concrete index or alias; wildcards ('*') and multi-target
   patterns (comma-separated) are not supported"

No DSL or execution-path change. Accepted inputs are unchanged
(simple names, dotted names, hyphens, underscores all still pass).

Adds four unit tests (bare '*', trailing '*', mid-name '*',
multi-target ',') and two integration tests pinning the new
message on real queries.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Trim duplicated inline comment on wildcard-table check

The inline block repeated the method-level Javadoc rationale for why
vectorSearch() targets a single concrete index. Collapse to a one-line
reference so readers do not see the same reasoning twice.

No functional change.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Extend validateTableName Javadoc to cover wildcard/multi-target rejection

The Javadoc previously described only the regex allow-list and the
_all / . / .. rejections; update it to also mention the dedicated
wildcard ('*') and comma-separated multi-target rejection branch
introduced in this PR, so the doc matches the method body.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add happy-path execution tests for vectorSearch()

VectorSearchIT covered only rejection paths; this adds positive end-to-end
coverage for top-k, POST-filter, EFFICIENT-filter, and both radial modes
(max_distance / min_score).

The k-NN plugin is not provisioned by the default integ-test cluster, so
each test guards init() with Assume.assumeTrue(isKnnPluginInstalled()) —
tests skip cleanly when k-NN is absent and run when it is (e.g. locally
via scripts/setup-knn-local.sh). Provisioning k-NN in CI is a separate
follow-up tracked outside this PR.

Test data is a 6-doc 2D knn_vector index with two well-separated clusters
so filter correctness is assertable by document id.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Tighten VectorSearchExecutionIT per reviewer feedback

Strengthen the happy-path IT so the efficient-filter test actually
discriminates EFFICIENT from POST and so the other tests assert
exact id sets instead of "at least one row, all in expected range".

- Pin the test index to Lucene HNSW + L2 so efficient filtering is
  deterministic (k-NN only supports efficient filtering on lucene+hnsw
  and faiss+hnsw/ivf) and the L2 -> 1/(1+d) scoring used by min_score
  is well-defined.
- Rework the efficient-filter test to query near TX with k=3 and
  WHERE state='CA'. A POST-filter implementation would return 0 rows
  here (the 3 nearest candidates are all TX and get filtered out), so
  the assertion on exactly {4,5,6} is discriminative between modes.
- Tighten POST and radial tests to exact id-set assertions.
- Drop the scripts/setup-knn-local.sh reference in the class javadoc
  (the helper is local-only working-tree tooling, not a tracked script).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Fix efficient-filter test to satisfy LIMIT <= k contract

The efficient-filter test used LIMIT 5 with option='k=3', which the
branch rejects at push-down time via
VectorSearchQueryBuilder#validateLimitWithinK. Change LIMIT to 3 so the
query passes validation and actually exercises the k-NN execution path.

The discriminator is preserved: query near TX with WHERE state='CA' and
k=3 would return zero rows under POST filtering (the three nearest
candidates are all TX and get filtered out) and returns exactly {4,5,6}
under efficient filtering, which navigates HNSW toward the CA cluster.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add user doc page for vectorSearch() table function

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review: soften aggregation language; tighten filter_type section; drop subquery workaround; note case-sensitive option keys

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Correct vector-search doc to match current behavior

Address review feedback on PR #5363:

- filter_type=omitted section: describe the actual path (native DSL,
  then ScriptQueryBuilder fallback executed server-side, and only then
  in-memory evaluation when predicate translation itself fails), not
  "falls back to in-memory" as the general case.
- Limitations: GROUP BY / aggregations are actively rejected by
  VectorSearchIndexScanBuilder.pushDownAggregation, not merely
  "not validated". Add outer-WHERE-on-subquery rejection, which is
  also enforced by the same scan builder's validatePlan walker.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Drop preview framing from Limitations section

vectorSearch() is not being shipped as an experimental/preview feature,
so reword the Limitations header to avoid "preview contract" framing.
Addresses dai-chen's review question on #5363.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Tighten Limitations wording after dropping preview framing

"Are not validated" was accurate only under the prior "preview contract"
framing. Reword the section header and the JOIN / UNION / multi-call
bullets to "not covered by tests", which is what we actually mean.
GROUP BY and outer-WHERE-on-subquery remain explicitly rejected.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Align doc with current branch behavior after hardening PRs

Update the doc to reflect restrictions that landed in later PRs on
feature/vector-search-p0 and are now enforced in code:

- OFFSET rejection on vectorSearch() (VectorSearchQueryBuilder)
- _score rejected in WHERE; point users to option='min_score=...'
- filter_type=efficient allowlist: native leaf queries + bool/
  constant_score; scripts, nested predicates, and unknown shapes
  are rejected
- table= accepts single concrete index or alias; wildcards, comma
  multi-targets, _all, ., and .. are rejected
- k-NN plugin missing produces a dedicated error; _explain still works
- Native k-NN tuning options are rejected as unknown option keys
- Non-negative max_distance / min_score
- Vector literal component validation (separator + empty component)
- Alias required on table functions
- Positional / mixed-positional arguments are rejected
- Synthetic _score collision with a user-mapped _score field

Also reword the Limitations header to "unsupported or not guaranteed"
now that the preview framing is gone.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Align Limitations wording with house style

docs/user/limitations/limitations.rst and docs/user/ppl/limitations/
limitations.md both use plain "is not supported" / "are not supported"
phrasing. "Behavior not guaranteed" was not used anywhere else in the
user docs, so replace it with the idiomatic "not supported" wording.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Polish doc wording to match existing user-doc house style

Replace non-idiomatic phrasing (rejected, descriptive error, fail fast)
with the standard 'is not supported' / 'returns an error' / 'fails with
an error' forms used across docs/user/. No behavioral claims changed.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Strengthen no-kNN integration tests with exact message and scan-operator assertions

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Add integration test for vectorSearch() alias with multiple backing indices

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Remove em dashes from added comments

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Reject outer operators over vectorSearch() subqueries

Extend the subquery-boundary walker to also reject outer ORDER BY,
OFFSET, GROUP BY, aggregation, and DISTINCT when separated from the
scan by a subquery Project. Previously only outer WHERE was rejected,
so these operators would be applied in memory after k-NN had already
returned the top-k by vector distance and could silently yield zero
rows or mis-ordered results.

Outer LIMIT without OFFSET is still allowed and locked in by a
positive test.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address PR review: operator-specific error messages, inner ORDER BY _score IT, reword pushdown phrasing

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [Feature] Mark vectorSearch() experimental and default to efficient filtering

Mark the vectorSearch() table function as experimental in the user doc
following the repo convention (title [Experimental] suffix), and flip the
default WHERE filter placement from post-filtering to efficient
pre-filtering so a query without filter_type embeds the predicate under
knn.filter for ANN-time pruning.

Production code: filterType=null in VectorSearchIndex now resolves to
FilterType.EFFICIENT, and VectorSearchQueryBuilder's full constructor
defaults to EFFICIENT when passed null. The test-only 3-arg constructor
stays pinned to POST because it does not wire a rebuildKnnWithFilter
callback and EFFICIENT mode requires one.

Allow-list error messages are reworded to neutral wording
("vectorSearch WHERE pre-filtering does not support...") so default-path
users never see internal filter_type=efficient terminology and get a
clear "set filter_type=post" fallback hint.

Doc updates the Filtering section to describe Omitted=efficient as the
default, with post framed as the opt-in fallback for predicates outside
the efficient allow-list. Example 4 shows the default knn.filter shape;
Example 5 shows filter_type=post for arithmetic predicates.

Tests: BETWEEN / NOT IN regression guards pin filter_type=post
explicitly so they continue to assert the post-filter DSL shape.
testPostFilterReturnsOnlyMatchingDocs pins filter_type=post so the test
name still reflects what it exercises. New default-shape IT coverage
asserts knn.filter embeds the predicate and there is no outer bool
wrapping.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [Doc] Apply reviewer feedback on vectorSearch() user doc

- Move translation-failure in-memory fallback note from explicit `post`
  to the default (omitted) section; under explicit `post` the query
  errors instead.
- Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY,
  aggregation, and DISTINCT over a vectorSearch() subquery (matches the
  expanded rejection landed via #5385); note that plain outer LIMIT
  without OFFSET is allowed.
- Add engine/method caveat for default `efficient` filtering and
  soften "pre-filtering during ANN search" phrasing to "native efficient
  k-NN filtering".
- Clarify that full-text predicates under WHERE act as filters, not as
  hybrid relevance scorers.
- Rename Example 5 to "Post-filtering for predicates not supported by
  efficient mode".
- Tighten explicit `efficient` wording to emphasize it fails closed.
- Reword radial examples and supported option keys to say "matches /
  returns up to the specified LIMIT documents" instead of "returns all".
- Add alias fan-out note under the `table` argument.
- Sweep remaining em dashes in the file to plain text.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* [Doc] Apply second-pass review polish to vectorSearch() user doc

- Introduction: add a sentence noting that the SQL layer translates
  vectorSearch() into an OpenSearch search request whose body is native
  k-NN query DSL, with the query vector parsed into a numeric array
  before emission.
- Soften the multi-backing alias note: SQL validates the table string
  shape only; it does not prevalidate per-backing-index mapping,
  dimension, or engine compatibility. OpenSearch execution remains the
  source of truth for those checks.
- Rewrite the full-text paragraph: placement now follows filter_type,
  so under default `efficient` full-text predicates are embedded under
  `knn.filter` (not only "alongside" the k-NN query). Keep the
  not-hybrid-scorer clarification.
- Reword `post` bullet to describe Boolean filter placement
  (`bool.must(knn)` + `bool.filter(where)`) instead of "runs first";
  explicitly contrast with the REST `post_filter` parameter, and note
  that selective filters can yield fewer than k rows.
- Rename Example 4 to "Default efficient filtering (no filter_type)"
  and replace the remaining "pre-filtering" mention with "efficient
  filtering" to align with OpenSearch k-NN terminology.
- Scoring section: use a concrete `v._score` example for readability
  alongside the `<alias>._score` form.
- Limitations: replace "top-k rows" with "finite result set" to cover
  both top-k and radial modes.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

---------

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working feature skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. SQL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant