[Feature] Support SQL Vector Search #5394

Merged
Swiddis merged 80 commits into main from feature/vector-search-p0 on Apr 29, 2026
@mengweieric
Collaborator

Merge the feature/vector-search-p0 branch into main.

Adds the experimental vectorSearch() SQL table function (k-NN pushdown, efficient/post filtering, radial and top-k modes). See individual PRs in the stack for details.

mengweieric and others added 30 commits April 20, 2026 15:26
* [Feature] Add table function relation to SQL grammar for vectorSearch()

Add table function relation support to the SQL parser:
- New `tableFunctionRelation` alternative in `relation` grammar rule
- Named argument syntax: `key=value` (e.g., table='index', field='vec')
- Alias is required by grammar (FROM func(...) AS alias)
- AstBuilder emits existing TableFunction + SubqueryAlias AST nodes
- 3 parser unit tests: basic parse, with WHERE/ORDER BY/LIMIT, alias required

This is a pure grammar change — no execution support yet. Queries will
parse successfully but fail at the Analyzer with "unsupported function".

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review feedback on table function relation grammar

1. Canonicalize argument names at parser boundary:
   unquoteIdentifier + toLowerCase(Locale.ROOT) in visitTableFunctionRelation
   so FIELD='x' and `field`='x' both produce argName="field"

2. Make AS keyword optional (AS? alias) for consistency with
   tableAsRelation and subqueryAsRelation grammar rules

3. Strengthen test coverage:
   - Full structural AST assertion for WHERE + ORDER BY + LIMIT
     (verifies Sort, Limit, Filter nodes, not just toString)
   - Argument reorder test proves names resolve by name not position
   - Case canonicalization test (TABLE= → table=)
   - Alias-without-AS test (FROM func(...) v)
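
A minimal sketch of the argument-name canonicalization in item 1, assuming a simplified backtick-stripping stand-in for `unquoteIdentifier`. The class and method names here are hypothetical and are not taken from the PR. `Locale.ROOT` keeps lowercasing independent of the JVM default locale (under a Turkish locale, for example, `FIELD` would otherwise lowercase with a dotless i).

```java
import java.util.Locale;

/** Hypothetical helper; in the PR this logic lives in visitTableFunctionRelation. */
final class ArgNameCanonicalizer {
  private ArgNameCanonicalizer() {}

  /** Strips one pair of surrounding backticks, if present (stand-in for unquoteIdentifier). */
  static String unquote(String identifier) {
    if (identifier.length() >= 2 && identifier.startsWith("`") && identifier.endsWith("`")) {
      return identifier.substring(1, identifier.length() - 1);
    }
    return identifier;
  }

  /** Unquotes, then lowercases with a fixed locale so FIELD and `field` both map to "field". */
  static String canonicalize(String rawName) {
    return unquote(rawName).toLowerCase(Locale.ROOT);
  }
}
```

With this shape, `canonicalize("FIELD")` and ``canonicalize("`field`")`` yield the same argument name, which is what lets later lookups resolve arguments by name rather than by position or spelling.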

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Apply spotless formatting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

---------

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Maps knn_vector fields to ExprCoreType.ARRAY so they appear in
DESCRIBE output and can be referenced in projections. This is a
visibility shim — not a full vector type.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchIndex.createScanBuilder() needs to construct an
OpenSearchIndexScanBuilder with a custom VectorSearchQueryBuilder
delegate. The existing constructor was protected (test-only).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Introduces the core execution pipeline for vectorsearch():
- VectorSearchTableFunctionResolver: registers vectorsearch with 4 STRING args
- VectorSearchTableFunctionImplementation: parses named args, vector literal,
  options string, validates search mode (k/max_distance/min_score)
- VectorSearchIndex: extends OpenSearchIndex with knn query seeding,
  score tracking, and WrapperQueryBuilder DSL construction
- VectorSearchQueryBuilder: keeps knn in must (scoring) context,
  WHERE filters in filter (non-scoring) context

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Override getFunctions() to expose vectorsearch() table function
to the query analysis pipeline.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Verifies knn query is placed in scoring (must) context, not wrapped
in bool.filter when no WHERE clause is present.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add pushDownFilter() unit test asserting knn stays in bool.must
  (scoring) and WHERE predicate goes to bool.filter (non-scoring)
- Add option key allowlist (k, max_distance, min_score) to reject
  unknown/unsupported keys before they reach DSL generation
- Add field name validation to reject characters that could corrupt
  the WrapperQueryBuilder JSON (allows alphanumeric, dots, underscores,
  hyphens)
- Add named-arg type guard to reject non-NamedArgumentExpression args
  early with a clear error message
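
The field name check described above could look roughly like the following. This is a sketch under assumptions: the class name, the ASCII-only reading of "alphanumeric", and the use of `IllegalArgumentException` are illustrative and not confirmed by the PR.

```java
import java.util.regex.Pattern;

/** Hypothetical validator; names and exception type are illustrative only. */
final class FieldNameValidator {
  // ASCII letters, digits, dots, underscores, and hyphens (assumed reading of the allowlist).
  private static final Pattern SAFE_FIELD_NAME = Pattern.compile("[A-Za-z0-9._-]+");

  private FieldNameValidator() {}

  /** Returns the name unchanged when it is safe to embed in JSON, otherwise throws. */
  static String requireSafe(String fieldName) {
    if (fieldName == null || !SAFE_FIELD_NAME.matcher(fieldName).matches()) {
      throw new IllegalArgumentException("Invalid field name: " + fieldName);
    }
    return fieldName;
  }
}
```

Because quotes, braces, and whitespace fall outside the allowlist, a field name cannot terminate the JSON string it is embedded in.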

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Parse k as integer, max_distance and min_score as double before they
reach buildKnnQuery(). Rejects non-numeric and non-finite values with
clear errors. This closes the residual JSON-injection path through
option values without requiring full XContent migration.

Also fixes toString() to be consistent with the named-arg guard
(no longer blindly casts to NamedArgumentExpression).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- parseOptions: reject malformed segments and duplicate keys
- parseVector: wrap errors in ExpressionEvaluationException, reject
  non-finite floats (Infinity, NaN)
- VectorSearchIndex: default requestedTotalSize to k via
  pushDownLimitToRequestTotal so queries without LIMIT return k results
- Add 5 new tests: malformed option, duplicate key, empty vector,
  malformed vector component, non-finite vector component
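
A rough sketch of the parsing rules listed above. Several details are assumptions rather than facts from the PR: the options string is taken to be comma-separated `key=value` pairs, the vector literal is taken to be a bracketed comma-separated list, and `IllegalArgumentException` stands in for `ExpressionEvaluationException`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parsing helpers; the string formats are assumed, not confirmed. */
final class VectorSearchArgParsing {
  private VectorSearchArgParsing() {}

  /** Parses "k=5,filter_type=post" style input, rejecting malformed segments and duplicate keys. */
  static Map<String, String> parseOptions(String options) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String segment : options.split(",", -1)) {
      String[] kv = segment.split("=", -1);
      if (kv.length != 2 || kv[0].trim().isEmpty() || kv[1].trim().isEmpty()) {
        throw new IllegalArgumentException("Malformed option segment: '" + segment + "'");
      }
      String key = kv[0].trim();
      if (parsed.putIfAbsent(key, kv[1].trim()) != null) {
        throw new IllegalArgumentException("Duplicate option key: " + key);
      }
    }
    return parsed;
  }

  /** Parses "[0.1, 0.2]" style input, rejecting empty vectors and non-finite components. */
  static float[] parseVector(String literal) {
    String body = literal.trim();
    if (body.startsWith("[") && body.endsWith("]")) {
      body = body.substring(1, body.length() - 1).trim();
    }
    if (body.isEmpty()) {
      throw new IllegalArgumentException("Vector must not be empty");
    }
    String[] parts = body.split(",", -1);
    float[] vector = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      float value;
      try {
        value = Float.parseFloat(parts[i].trim());
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Malformed vector component: '" + parts[i] + "'", e);
      }
      // Float.parseFloat accepts "NaN" and "Infinity", so finiteness needs its own check.
      if (!Float.isFinite(value)) {
        throw new IllegalArgumentException("Non-finite vector component: '" + parts[i] + "'");
      }
      vector[i] = value;
    }
    return vector;
  }
}
```

The explicit `Float.isFinite` check matters because `Float.parseFloat` parses `NaN` and `Infinity` without error.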

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- validateNamedArgs() now rejects null/empty arg names defensively,
  closing a potential NPE if the shared table-function path is later
  wired into PPL
- OpenSearchStorageEngineTest uses contains-check instead of exact
  collection size assertion
- Add testNullArgNameThrows test

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Remove unused VECTOR_OPTION constant from VectorSearchIndex
- Clarify buildKnnQuery() comment: quoted fallback is for forward
  compatibility, all P0 values are already canonicalized as numeric
- Rename testMissingSearchModeOptionThrows to
  testUnknownOptionKeyOnlyThrows to match what it actually tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Enforce exactly one of k, max_distance, or min_score
- Validate k is in [1, 10000] range
- Add 6 tests: mutual exclusivity (3 combos), k too small, k too
  large, k boundary values (1 and 10000)
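
The search-mode rules above (exactly one of `k`, `max_distance`, `min_score`; `k` within [1, 10000]; numeric and finite values) can be sketched as follows. The class name, method name, and exception type are hypothetical, and the input is assumed to be an already-parsed option map.

```java
import java.util.Map;

/** Hypothetical validator for the search-mode options; not the PR's actual class. */
final class SearchModeValidator {
  static final int MAX_K = 10_000;

  private SearchModeValidator() {}

  /** Returns the single search-mode key that is present, or throws if the options are invalid. */
  static String validate(Map<String, String> options) {
    String mode = null;
    for (String key : new String[] {"k", "max_distance", "min_score"}) {
      if (options.containsKey(key)) {
        if (mode != null) {
          throw new IllegalArgumentException(
              "Only one of k, max_distance, or min_score may be specified");
        }
        mode = key;
      }
    }
    if (mode == null) {
      throw new IllegalArgumentException("One of k, max_distance, or min_score is required");
    }
    String raw = options.get(mode);
    try {
      if (mode.equals("k")) {
        int k = Integer.parseInt(raw);
        if (k < 1 || k > MAX_K) {
          throw new IllegalArgumentException("k must be in [1, " + MAX_K + "], got " + k);
        }
      } else {
        double value = Double.parseDouble(raw);
        if (!Double.isFinite(value)) {
          throw new IllegalArgumentException(mode + " must be finite, got " + raw);
        }
      }
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(mode + " must be numeric, got '" + raw + "'", e);
    }
    return mode;
  }
}
```

Parsing the values as `int` or `double` before they are rendered is also what keeps arbitrary strings out of the generated query JSON.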

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchQueryBuilder now accepts options map and rejects
pushDownLimit when LIMIT exceeds k. Radial modes (max_distance,
min_score) have no LIMIT restriction.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson()
  for top-k, max_distance, min_score, nested fields, multi-element
  and single-element vectors, numeric option rendering
- Add edge case tests to VectorSearchTableFunctionImplementationTest:
  NaN vector component, empty option key/value, negative k, NaN for
  max_distance and min_score (6 new tests)
- Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT,
  pushDownSort delegation to parent (2 new tests)
- Extract buildKnnQueryJson() as package-private for direct testing

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Test too-many (5) and zero arguments paths in
VectorSearchTableFunctionResolver to complement existing
too-few (2) test.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Cap radial mode (max_distance/min_score) results at maxResultWindow
  to prevent unbounded result sets
- Reject ORDER BY on non-_score fields and _score ASC in vectorSearch
  since knn results are naturally sorted by _score DESC
- Add 12 integration tests: 4 _explain DSL shape verification tests
  and 8 validation error path tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add multi-sort expression test: ORDER BY _score DESC, name ASC
  correctly rejects the non-_score field (VectorSearchQueryBuilderTest)
- Add case-insensitive argument name lookup test to verify
  TABLE='x' resolves same as table='x' (Implementation test)
- Add non-numeric option fallback test: verifies string options
  are quoted in JSON output (VectorSearchIndexTest)
- Add 4 integration tests: ORDER BY _score DESC succeeds,
  ORDER BY non-score rejects, ORDER BY _score ASC rejects,
  LIMIT within k succeeds (VectorSearchIT, now 16 tests)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes
sort.getCount() as a limit when non-zero. Our override validated
_score DESC and returned true, but did not preserve this contract.

SQL always sets count=0, so this was not reachable today, but PPL
or future callers may set a non-zero count to combine sort+limit
in one LogicalSort node. Preserve the behavior defensively.

Add focused test: LogicalSort(count=7) with _score DESC verifies
the count is pushed down as request size.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Unit test: compound AND predicate survives pushdown into bool.filter
- Integration test: compound WHERE (term + range) produces bool query
- Integration test: radial max_distance with WHERE produces bool query

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
pushDownSort() called requestBuilder.pushDownLimit() directly, bypassing
the LIMIT > k guard in pushDownLimit(). Extract validateLimitWithinK()
helper and call it from both paths so the invariant holds when PPL or
future callers set a non-zero sort count.
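
A simplified sketch of the shared guard described above. The class is a hypothetical stand-in for `VectorSearchQueryBuilder`: the real `pushDownLimit()` and `pushDownSort()` take logical plan nodes, whereas this version takes plain integers, and `requestSize` is an illustrative field. The exception type is also an assumption.

```java
import java.util.Map;

/** Hypothetical stand-in showing both pushdown paths sharing one LIMIT <= k check. */
final class LimitWithinKGuard {
  private final Map<String, String> options;
  private Integer requestSize; // null until a limit has been pushed down

  LimitWithinKGuard(Map<String, String> options) {
    this.options = options;
  }

  /** Rejects limits above k in top-k mode; radial modes (no k option) are not restricted here. */
  private void validateLimitWithinK(int limit) {
    String k = options.get("k");
    if (k != null && limit > Integer.parseInt(k)) {
      throw new IllegalArgumentException("LIMIT " + limit + " exceeds k=" + k);
    }
  }

  boolean pushDownLimit(int limit) {
    validateLimitWithinK(limit);
    requestSize = limit;
    return true;
  }

  /** A non-zero sort count acts as a limit, so it goes through the same guard. */
  boolean pushDownSort(int sortCount) {
    if (sortCount != 0) {
      validateLimitWithinK(sortCount);
      requestSize = sortCount;
    }
    return true;
  }

  Integer requestSize() {
    return requestSize;
  }
}
```

Routing both paths through one helper means a caller that combines sort and limit in a single node cannot bypass the check that applies to a plain LIMIT.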

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Move all explainQuery()-based DSL shape tests into a dedicated
VectorSearchExplainIT suite. VectorSearchIT now contains only
validation and error-path tests.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…SearchIndex

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…on in VectorSearchQueryBuilder

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…fficient mode

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…matting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…iltering

Mark the vectorSearch() table function as experimental in the user doc
following the repo convention (title [Experimental] suffix), and flip the
default WHERE filter placement from post-filtering to efficient
pre-filtering so a query without filter_type embeds the predicate under
knn.filter for ANN-time pruning.

Production code: filterType=null in VectorSearchIndex now resolves to
FilterType.EFFICIENT, and VectorSearchQueryBuilder's full constructor
defaults to EFFICIENT when passed null. The test-only 3-arg constructor
stays pinned to POST because it does not wire a rebuildKnnWithFilter
callback and EFFICIENT mode requires one.

Allow-list error messages are reworded to neutral wording
("vectorSearch WHERE pre-filtering does not support...") so default-path
users never see internal filter_type=efficient terminology and get a
clear "set filter_type=post" fallback hint.

Doc updates the Filtering section to describe Omitted=efficient as the
default, with post framed as the opt-in fallback for predicates outside
the efficient allow-list. Example 4 shows the default knn.filter shape;
Example 5 shows filter_type=post for arithmetic predicates.

Tests: BETWEEN / NOT IN regression guards pin filter_type=post
explicitly so they continue to assert the post-filter DSL shape.
testPostFilterReturnsOnlyMatchingDocs pins filter_type=post so the test
name still reflects what it exercises. New default-shape IT coverage
asserts knn.filter embeds the predicate and there is no outer bool
wrapping.
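
To make the two filter placements concrete, the sketch below builds both query shapes as nested maps. It is illustrative only: the class and method names are hypothetical, the field name and `term` predicate in the usage note are made up, and the PR itself constructs DSL through `VectorSearchIndex` and `WrapperQueryBuilder` rather than maps.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical builder contrasting the efficient and post filter shapes. */
final class KnnDslShapes {
  private KnnDslShapes() {}

  /** Builds {"knn": {field: {"vector": ..., "k": ..., "filter": ...}}}; filter may be null. */
  static Map<String, Object> knn(
      String field, List<Float> vector, int k, Map<String, Object> filter) {
    Map<String, Object> body = new LinkedHashMap<>();
    body.put("vector", vector);
    body.put("k", k);
    if (filter != null) {
      body.put("filter", filter);
    }
    return Map.of("knn", Map.of(field, body));
  }

  /** Default (efficient): the WHERE predicate is embedded under knn.filter, with no outer bool. */
  static Map<String, Object> efficient(
      String field, List<Float> vector, int k, Map<String, Object> where) {
    return knn(field, vector, k, where);
  }

  /** filter_type=post: knn stays in bool.must (scoring), WHERE goes to bool.filter (non-scoring). */
  static Map<String, Object> post(
      String field, List<Float> vector, int k, Map<String, Object> where) {
    Map<String, Object> bool = new LinkedHashMap<>();
    bool.put("must", List.of(knn(field, vector, k, null)));
    bool.put("filter", List.of(where));
    return Map.of("bool", bool);
  }
}
```

For example, `KnnDslShapes.efficient("embedding", List.of(0.1f, 0.2f), 5, Map.of("term", Map.of("category", "shoes")))` has `knn` as its only top-level key, while the `post` variant has `bool` at the top level with the unfiltered `knn` query under `must`.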

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Move translation-failure in-memory fallback note from explicit `post`
  to the default (omitted) section; under explicit `post` the query
  errors instead.
- Expand Limitations to cover outer ORDER BY, non-zero OFFSET, GROUP BY,
  aggregation, and DISTINCT over a vectorSearch() subquery (matches the
  expanded rejection landed via #5385); note that plain outer LIMIT
  without OFFSET is allowed.
- Add engine/method caveat for default `efficient` filtering and
  soften "pre-filtering during ANN search" phrasing to "native efficient
  k-NN filtering".
- Clarify that full-text predicates under WHERE act as filters, not as
  hybrid relevance scorers.
- Rename Example 5 to "Post-filtering for predicates not supported by
  efficient mode".
- Tighten explicit `efficient` wording to emphasize it fails closed.
- Reword radial examples and supported option keys to say "matches /
  returns up to the specified LIMIT documents" instead of "returns all".
- Add alias fan-out note under the `table` argument.
- Sweep remaining em dashes in the file to plain text.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Introduction: add a sentence noting that the SQL layer translates
  vectorSearch() into an OpenSearch search request whose body is native
  k-NN query DSL, with the query vector parsed into a numeric array
  before emission.
- Soften the multi-backing alias note: SQL validates the table string
  shape only; it does not prevalidate per-backing-index mapping,
  dimension, or engine compatibility. OpenSearch execution remains the
  source of truth for those checks.
- Rewrite the full-text paragraph: placement now follows filter_type,
  so under default `efficient` full-text predicates are embedded under
  `knn.filter` (not only "alongside" the k-NN query). Keep the
  not-hybrid-scorer clarification.
- Reword `post` bullet to describe Boolean filter placement
  (`bool.must(knn)` + `bool.filter(where)`) instead of "runs first";
  explicitly contrast with the REST `post_filter` parameter, and note
  that selective filters can yield fewer than k rows.
- Rename Example 4 to "Default efficient filtering (no filter_type)"
  and replace the remaining "pre-filtering" mention with "efficient
  filtering" to align with OpenSearch k-NN terminology.
- Scoring section: use a concrete `v._score` example for readability
  alongside the `<alias>._score` form.
- Limitations: replace "top-k rows" with "finite result set" to cover
  both top-k and radial modes.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric changed the title from "[Feature] Merge vectorSearch() P0 preview into main" to "[Feature] Support Vector Search" on Apr 29, 2026
@mengweieric added the SQL, feature, skip-diff-analyzer, and skip-diff-reviewer labels on Apr 29, 2026
@mengweieric changed the title from "[Feature] Support Vector Search" to "[Feature] Support SQL Vector Search" on Apr 29, 2026
@Swiddis merged commit d3bdca8 into main on Apr 29, 2026
81 of 84 checks passed
Labels

- feature
- skip-diff-analyzer: Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis.
- skip-diff-reviewer: Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis.
- SQL

3 participants