feat: native Elasticsearch vector search support #27111
joaopamaral wants to merge 10 commits into open-metadata:main
Conversation
- Add ElasticSearchVectorService mirroring OpenSearchVectorService using Rest5Client
- Add vector_search_index_es_native.json with dense_vector/dims/cosine mappings for en/jp/ru/zh locales
- Add VectorSearchQueryBuilder.buildNativeESQuery() for ES 8.x/9.x top-level knn query format
- Add SemanticSearchQueryBuilder for Elasticsearch (mirrors OpenSearch equivalent)
- Fix ElasticSearchIndexManager.extractMappingsJson() to extract mappings sub-object for putMapping
- Fix reformatVectorIndexWithDimension() to handle both "dims" (ES) and "dimension" (OpenSearch) keys
- Wire ElasticSearchVectorService into SearchRepository and ElasticSearchBulkSink
- Extend VectorSearchQueryBuilderTest and ElasticSearchIndexManagerTest with new coverage
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Initial results look good, but I've run a test only with ES 9.x and version 1.12.4 (not the one from main). I also need to double-check if OpenSearch is affected by this change. Also need to review some AI-resolved conflicts from version 1.12.4 with main. |
|
Thanks @joaopamaral this is great!!. Can you make it ready for review? and also address comments here #27111 (comment) |
|
Sure @harshach! I'll work on the bot review first before making it ready for review! 👍 |
…l 6 required args
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ce and fix compilation
- Expand VectorIndexService interface to declare getExistingFingerprint, getExistingFingerprintsBatch, executeGenericRequest, and VECTOR_INDEX_KEY constant so callers (VectorSearchResource, ElasticSearchBulkSink) can invoke these through the interface type
- Add default getIndexAlias() to interface, removing duplicate private getSearchAlias() in OpenSearchVectorService
- Fix ElasticSearchVectorService: add generateEmbeddingFields/updateEntityEmbedding implementations, correct search() signature to include 'from' param, remove spurious @Override annotations
- Add 'from' parameter to buildNativeESQuery (valid for ES KNN pagination)
- DRY appendFilterMustClauses with nestedTags boolean: ES-native index maps tags as nested type, OpenSearch entity indices use flat object
- Add semanticSearch field to searchRequest.json (required by SemanticSearchQueryBuilder)
- Fix test: update all search() and buildNativeESQuery() call sites to pass the new 'from' parameter
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pper
Use Jackson ObjectMapper to patch the 'dims' field in the vector index mapping template instead of exact string matching, which was fragile against whitespace variations. Extract into package-private patchDimension() and add 3 unit tests covering dimension replacement, preservation of other fields, and the no-space JSON variant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…stance
Assign to the volatile 'instance' field only after registerVectorEmbeddingHandler() completes, so a concurrent caller via getInstance() cannot observe a partially-initialized service.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
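The ordering this commit describes can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's code: `DemoVectorService` and `VectorServiceHolder` are hypothetical stand-ins, and the `synchronized` init is one possible way to serialize initialization.

```java
// Hypothetical stand-in for the vector service; not the PR's class.
final class DemoVectorService {
  private boolean handlerRegistered;

  void registerVectorEmbeddingHandler() {
    handlerRegistered = true;
  }

  boolean isHandlerRegistered() {
    return handlerRegistered;
  }
}

final class VectorServiceHolder {
  // volatile: a reader that sees a non-null reference also sees
  // every write made before the assignment below.
  private static volatile DemoVectorService instance;

  private VectorServiceHolder() {}

  static synchronized void init() {
    if (instance != null) {
      return; // already initialized
    }
    DemoVectorService service = new DemoVectorService();
    service.registerVectorEmbeddingHandler(); // finish setup on a local reference first
    instance = service; // publish last, so readers never see a half-built service
  }

  static DemoVectorService getInstance() {
    return instance;
  }
}
```

Working on a local variable and assigning the field as the final step is what keeps a concurrent `getInstance()` caller from seeing a service whose handler is not yet registered.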
….search()
ES search() was passing 'from' directly to the KNN query, which skips raw chunks rather than parent entities. Mirror the OpenSearch approach:
- Loop with rawOffset to collect from + size + 1 distinct parents
- Skip 'from' parents in application code after collection
- Return 4-arg VectorSearchResponse with totalHits and hasMore populated
Extract collectSearchHits() and extractTotalHits() private helpers (same pattern as OpenSearchVectorService). Update tests to use parentId (camelCase, matching VectorDocBuilder), use thenAnswer for fresh mock streams on each loop iteration, and add sequence mock for multi-page termination tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
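The parent-level pagination described in this commit might look roughly like the sketch below. It is an illustration only: `ParentPagination`, `Page`, and the `fetchChunks` callback (a stand-in for one KNN request that returns the parentId of each chunk hit in score order) are hypothetical names, and the real service works on search responses rather than plain lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

final class ParentPagination {
  /** One page of distinct parent IDs plus a flag telling whether another page exists. */
  record Page(List<String> parentIds, boolean hasMore) {}

  /** fetchChunks.apply(rawOffset, batchSize) is a hypothetical stand-in for one KNN request. */
  static Page search(
      BiFunction<Integer, Integer, List<String>> fetchChunks, int from, int size, int batchSize) {
    int wanted = from + size + 1; // one extra parent reveals whether a next page exists
    Set<String> parents = new LinkedHashSet<>(); // keeps first-seen (best score) order
    int rawOffset = 0;
    while (parents.size() < wanted) {
      List<String> hits = fetchChunks.apply(rawOffset, batchSize);
      if (hits.isEmpty()) {
        break; // nothing left to read
      }
      parents.addAll(hits); // several chunks of one parent collapse into one entry
      rawOffset += hits.size(); // the offset advances over raw chunks, not parents
      if (hits.size() < batchSize) {
        break; // short batch: the source is exhausted
      }
    }
    List<String> ordered = new ArrayList<>(parents);
    int start = Math.min(from, ordered.size());
    int end = Math.min(from + size, ordered.size());
    // Skip 'from' parents in application code, then take 'size' of them.
    return new Page(List.copyOf(ordered.subList(start, end)), ordered.size() > from + size);
  }
}
```

With chunk hits for parents `a, a, b, a, c, c, d` and a batch size of 3, asking for `from = 1, size = 2` yields parents `b` and `c` with `hasMore` set, whereas applying the offset to raw chunks would have skipped only the first chunk of `a`.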
…DocBuilder
bulkIndex() was reading 'parent_id' and 'chunk_index' (snake_case) to build
document IDs, but VectorDocBuilder stores both as camelCase ('parentId',
'chunkIndex'). This caused doc IDs to always be null-N or parentId-N (using
loop index instead of actual chunk index). Add test that captures the
BulkRequest and verifies the doc ID is parentId-chunkIndex.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
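A small sketch of the ID construction this commit fixes, assuming documents arrive as `Map<String, Object>` with the camelCase keys named above. `VectorDocIds` is a hypothetical helper, and failing fast on a missing field is a design choice of this sketch rather than something the commit states.

```java
import java.util.Map;

final class VectorDocIds {
  // Field names as the commit says VectorDocBuilder stores them (camelCase).
  static final String PARENT_ID = "parentId";
  static final String CHUNK_INDEX = "chunkIndex";

  private VectorDocIds() {}

  /** Builds "<parentId>-<chunkIndex>" and rejects documents missing either field. */
  static String docId(Map<String, Object> doc) {
    Object parentId = doc.get(PARENT_ID);
    Object chunkIndex = doc.get(CHUNK_INDEX);
    if (parentId == null || chunkIndex == null) {
      // Reading snake_case keys would land here instead of silently producing "null-N".
      throw new IllegalArgumentException(
          "vector doc is missing " + PARENT_ID + " or " + CHUNK_INDEX);
    }
    return parentId + "-" + chunkIndex;
  }
}
```

Rejecting a document without the expected keys turns a key-name mismatch into an immediate error instead of a malformed ID in the index.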
ElasticSearchBulkSink was casting VectorIndexService to ElasticSearchVectorService to call copyExistingVectorDocuments(), breaking the interface abstraction with a potential ClassCastException. Add the method to the interface with a default no-op (returns false), and remove the cast and import.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…yExistingVectorDocuments
Remove copyExistingVectorDocuments from VectorIndexService (it is ES-specific and has no meaningful default for other implementations). Use Java 21 pattern matching instanceof in ElasticSearchBulkSink so the call is explicit and safe without introducing a no-op default into the interface.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
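The pattern-matching approach from this commit can be illustrated as follows. All types here are simplified, hypothetical stand-ins, and the two-argument signature of `copyExistingVectorDocuments` is an assumption made for the example.

```java
// Hypothetical, simplified stand-ins for the PR's types.
interface DemoVectorIndexService {
  String name();
}

final class DemoEsVectorService implements DemoVectorIndexService {
  @Override
  public String name() {
    return "elasticsearch";
  }

  // ES-only operation; the signature is assumed for this sketch.
  boolean copyExistingVectorDocuments(String sourceIndex, String targetIndex) {
    return true;
  }
}

final class DemoOsVectorService implements DemoVectorIndexService {
  @Override
  public String name() {
    return "opensearch";
  }
}

final class BulkSinkSketch {
  private BulkSinkSketch() {}

  /** Returns true only when the service is the ES implementation and the copy succeeded. */
  static boolean copyIfSupported(DemoVectorIndexService service, String source, String target) {
    // Test and cast in one step: no ClassCastException, no no-op default on the interface.
    if (service instanceof DemoEsVectorService es) {
      return es.copyExistingVectorDocuments(source, target);
    }
    return false; // other backends have nothing to copy
  }
}
```

The interface stays free of an ES-specific method, and the call site makes it explicit that the copy step applies to one backend only.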
Code Review (Gitar) 👍 Approved with suggestions · 5 resolved / 6 findings
Adds native Elasticsearch vector search support with comprehensive test coverage and fixes to pagination, initialization ordering, and interface safety. Consider adding a type guard for the extractRestClient cast to Rest5ClientTransport to prevent runtime errors.
💡 Edge Case: extractRestClient cast to Rest5ClientTransport has no guard (at line 61)
✅ 5 resolved
✅ Bug: Test calls build() with 4 args but method requires 6; won't compile
✅ Edge Case: loadIndexMapping dimension replacement is brittle (exact string match)
✅ Edge Case: init() assigns instance before registerVectorEmbeddingHandler completes
✅ Bug: ES search pagination is broken vs OpenSearch implementation
✅ Quality: Unsafe downcast defeats purpose of VectorIndexService interface
|
Hi @harshach, I’ve addressed the bot review, but I still need to re-review the code after rebasing/merging with main and rerun the tests against a real server. So far, I’ve tested this PR with version 1.12.4 and ES 9.3.1. I still need to validate that everything continues to work correctly with OpenSearch and ES 8.x. I won’t be able to run tests for the next couple of days, but feel free to proceed with any testing on your side in the meantime. |
Pull request overview
This PR adds native Elasticsearch (8.x/9.x) vector search support to OpenMetadata, aiming to provide semantic/vector search capabilities on Elasticsearch deployments comparable to the existing OpenSearch implementation.
Changes:
- Added a new ElasticSearchVectorService plus wiring in SearchRepository/ElasticSearchBulkSink to initialize and use it when Elasticsearch is the configured backend.
- Introduced ES-native vector index mapping templates (vector_search_index_es_native.json) and extended query-building to emit Elasticsearch’s top-level knn query format.
- Added/updated tests around the ES-native query format and Elasticsearch vector service behavior.
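For readers unfamiliar with the two request shapes, the sketch below builds a minimal body for each: Elasticsearch's top-level `knn` section and OpenSearch's `knn` clause nested under `query`. It is an illustration, not the PR's `buildNativeESQuery()`: the class name, the field name `embedding`, and the parameter values are assumptions, and the field name is assumed to need no JSON escaping.

```java
final class KnnBodySketch {
  private KnnBodySketch() {}

  /** Renders a float vector as a JSON array, e.g. [0.5,1.0]. */
  static String vector(float[] v) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < v.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(v[i]);
    }
    return sb.append(']').toString();
  }

  /** Elasticsearch 8.x/9.x: knn is a top-level section of the search body. */
  static String elasticsearchBody(String field, float[] v, int k, int numCandidates) {
    return "{\"knn\":{\"field\":\"" + field + "\""
        + ",\"query_vector\":" + vector(v)
        + ",\"k\":" + k
        + ",\"num_candidates\":" + numCandidates
        + "},\"size\":" + k + "}";
  }

  /** OpenSearch: knn is a query clause keyed by the vector field name. */
  static String openSearchBody(String field, float[] v, int k) {
    return "{\"size\":" + k
        + ",\"query\":{\"knn\":{\"" + field + "\":{\"vector\":" + vector(v)
        + ",\"k\":" + k + "}}}}";
  }
}
```

Because the two shapes differ in both nesting and key names (`query_vector` versus `vector`), a body built for one backend cannot simply be reused for the other.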
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/search/searchRequest.json | Adds semanticSearch flag to the search request schema. |
| openmetadata-spec/src/main/resources/elasticsearch/en/vector_search_index_es_native.json | New ES-native vector index template (en). |
| openmetadata-spec/src/main/resources/elasticsearch/jp/vector_search_index_es_native.json | New ES-native vector index template (jp). |
| openmetadata-spec/src/main/resources/elasticsearch/ru/vector_search_index_es_native.json | New ES-native vector index template (ru). |
| openmetadata-spec/src/main/resources/elasticsearch/zh/vector_search_index_es_native.json | New ES-native vector index template (zh). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilder.java | Adds buildNativeESQuery and refactors filter emission for vector search queries. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilderTest.java | Adds coverage for ES-native top-level knn query structure and filter behavior. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorIndexService.java | Extends vector service interface and adds an alias helper. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java | Adjusts to use the new interface default alias method and annotates overrides. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java | New Elasticsearch vector service implementation using Rest5Client for generic requests. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/ElasticSearchVectorServiceTest.java | New tests for ES vector service result parsing, grouping, and dimension patching. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Initializes ES vector service when Elasticsearch backend is configured; mapping selection tweaks for ES-native template. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/RecreateWithEmbeddings.java | Attempts to include a vector “entity” key in recreate flow when vector search is enabled. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/SemanticSearchQueryBuilder.java | New builder for semantic/hybrid query composition on Elasticsearch. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java | Extracts mappings sub-object before calling putMapping. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManagerTest.java | Adds a test asserting updateIndex handles full index JSON by extracting mappings. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/search/VectorSearchResource.java | Switches to repository-provided VectorIndexService and adds a fingerprint endpoint. |
| openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java | Adds async vector-embedding task execution + migration path for ES indexing jobs. |
| openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSinkSimpleTest.java | Adds minimal coverage for vector-embedding helpers on the ES sink. |
| openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SemanticSearchTool.java | Uses repository VectorIndexService rather than OpenSearch-only implementation. |
| sb.append(",\"filter\":{\"bool\":{\"must\":["); | ||
| appendFilterMustClauses(sb, filters, true); | ||
| sb.append("]}}"); // close must array and bool |
buildNativeESQuery passes nestedTags=true, which makes the tags filter emit a nested query. However, in the existing entity index mappings (e.g., elasticsearch/en/table_index_mapping.json), tags is an object (no "type": "nested"), so a nested query on path tags will fail or return no results. Consider using the same flat tags.tagFQN terms filter as the OpenSearch path (or only emitting nested when the target index actually maps tags as nested).
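To make the difference between the two filter shapes concrete, here is a minimal sketch of a tags clause that can be emitted either flat or wrapped in a `nested` query. `TagFilterSketch` and `tagsClause` are hypothetical names, not the PR's `appendFilterMustClauses`, and the escaping covers only backslashes and double quotes.

```java
import java.util.List;
import java.util.stream.Collectors;

final class TagFilterSketch {
  private TagFilterSketch() {}

  /** Minimal JSON string escaping for this sketch: backslash and double quote only. */
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  /**
   * Flat form:   {"terms":{"tags.tagFQN":[...]}}
   * Nested form: {"nested":{"path":"tags","query":{"terms":{"tags.tagFQN":[...]}}}}
   * The nested form only matches when the index maps tags with "type": "nested".
   */
  static String tagsClause(List<String> tagFqns, boolean nestedTags) {
    String values =
        tagFqns.stream()
            .map(t -> "\"" + escape(t) + "\"")
            .collect(Collectors.joining(",", "[", "]"));
    String terms = "{\"terms\":{\"tags.tagFQN\":" + values + "}}";
    return nestedTags ? "{\"nested\":{\"path\":\"tags\",\"query\":" + terms + "}}" : terms;
  }
}
```

Choosing the flag per target index, rather than per backend, is one way to follow the reviewer's suggestion of emitting `nested` only where the mapping actually declares it.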
| List<Map<String, Object>> docs = VectorDocBuilder.fromEntity(entity, embeddingClient); | ||
| deleteByParentId(targetIndex, parentId); | ||
| bulkIndex(docs, targetIndex); | ||
| } catch (Exception e) { |
updateEntityEmbedding delegates to updateVectorEmbeddings, which deletes-by-query and then bulk-indexes new documents into targetIndex. But VectorEmbeddingHandler passes the entity index name (e.g., table_search_index), so this will create additional documents (IDs like parentId-chunkIndex) instead of updating the existing entity document (ID = entity UUID), corrupting the entity index contents. Elasticsearch implementation should mirror OpenSearchVectorService.partialUpdateEntity(...) and update the existing doc by _id with the embedding fields.
| String query = | ||
| "{\"size\":1,\"_source\":[\"fingerprint\"]," | ||
| + "\"query\":{\"term\":{\"parent_id\":\"" | ||
| + VectorSearchQueryBuilder.escape(parentId) | ||
| + "\"}}}"; | ||
| String response = executeGenericRequest("POST", "/" + indexName + "/_search", query); |
Fingerprint lookup queries parent_id, but OpenMetadata search documents are keyed by _id (entity UUID) and the OpenSearch implementation queries by _id / ids query. As written, Elasticsearch will almost always return null fingerprints, causing unnecessary recomputation and making delete/update-by-query paths ineffective. Recommend switching to the same _id-based query strategy as OpenSearchVectorService.getExistingFingerprint(...).
| String query = | ||
| "{\"size\":" | ||
| + parentIds.size() | ||
| + ",\"_source\":[\"parent_id\",\"fingerprint\"]" | ||
| + ",\"query\":{\"terms\":{\"parent_id\":" | ||
| + termsArray | ||
| + "}}" | ||
| + ",\"collapse\":{\"field\":\"parent_id\"}}"; | ||
|
|
Batch fingerprint lookup also uses parent_id terms + collapse on parent_id. If the intent is to reuse existing entity documents, this should use an ids query on _id (same as OpenSearch) to avoid relying on a separate parent_id field and to work across the existing entity indices behind the dataAssetEmbeddings alias.
| try (InputStream is = response.getEntity().getContent()) { | ||
| return new String(is.readAllBytes(), StandardCharsets.UTF_8); | ||
| } |
executeGenericRequest reads and returns the response body but never checks HTTP status codes. For Elasticsearch errors (4xx/5xx), this will return an error payload as if it were a success, and downstream code will fail later while parsing/processing. Consider checking response.getStatusLine() (or equivalent) and throwing an exception that includes the error body when status >= 400, similar to OpenSearchVectorService.executeGenericRequest.
| try (InputStream is = response.getEntity().getContent()) { | |
| return new String(is.readAllBytes(), StandardCharsets.UTF_8); | |
| } | |
| String responseBody = ""; | |
| if (response.getEntity() != null) { | |
| try (InputStream is = response.getEntity().getContent()) { | |
| responseBody = new String(is.readAllBytes(), StandardCharsets.UTF_8); | |
| } | |
| } | |
| int statusCode = response.getStatusLine().getStatusCode(); | |
| if (statusCode >= 400) { | |
| throw new RuntimeException( | |
| String.format( | |
| "Elasticsearch generic request failed: %s %s returned status %d with body: %s", | |
| method, endpoint, statusCode, responseBody)); | |
| } | |
| return responseBody; |
|
|
||
| @Override | ||
| public int getPendingVectorTaskCount() { | ||
| return Math.max(0, phaser.getUnarrivedParties() - 1); |
getPendingVectorTaskCount() computes pending tasks via phaser.getUnarrivedParties() - 1. After awaitVectorCompletion() calls phaser.arrive(), the main party is no longer "unarrived", so this undercounts pending tasks (can report 0 while tasks are still running) and makes timeout reporting inaccurate. Prefer tracking pending via phaser.getRegisteredParties() - 1 (or an explicit counter) so the count is correct regardless of whether the main party has arrived.
| return Math.max(0, phaser.getUnarrivedParties() - 1); | |
| return Math.max(0, phaser.getRegisteredParties() - 1); |
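The undercount can be reproduced with a bare `java.util.concurrent.Phaser`. The sketch assumes the usage pattern implied by the comment (one coordinating party, one extra registered party per in-flight task, and the coordinator calling `arrive()` when it starts waiting); `PhaserCountSketch` is a hypothetical name and no real tasks are run.

```java
import java.util.concurrent.Phaser;

final class PhaserCountSketch {
  private PhaserCountSketch() {}

  /** Returns {unarrived - 1, registered - 1}, each clamped at zero. */
  static int[] counts(Phaser phaser) {
    return new int[] {
      Math.max(0, phaser.getUnarrivedParties() - 1),
      Math.max(0, phaser.getRegisteredParties() - 1)
    };
  }

  /** Simulates 'tasks' in-flight tasks after the coordinating party has arrived. */
  static int[] afterMainArrives(int tasks) {
    Phaser phaser = new Phaser(1); // the single initial party is the coordinator
    for (int i = 0; i < tasks; i++) {
      phaser.register(); // one party per in-flight task
    }
    phaser.arrive(); // coordinator arrives; the tasks have not arrived yet
    return counts(phaser);
  }
}
```

With one task in flight the unarrived-based formula yields 0 while the registered-based formula yields 1, which matches the behaviour the comment describes.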
| "parent_id": { | ||
| "type": "keyword" | ||
| }, | ||
| "chunk_index": { | ||
| "type": "integer" | ||
| }, | ||
| "chunk_count": { |
The ES-native vector index mapping defines identifier fields in snake_case (parent_id, chunk_index, chunk_count). The embedding code and existing search mappings use camelCase (parentId, chunkIndex, chunkCount), and the vector search/filtering logic also expects parentId. This mismatch will prevent correct grouping/filtering and makes the mapping inconsistent with the rest of the search schema. Align the vector index mapping field names with the existing document schema (camelCase) or update the vector doc builder + query code consistently.
| "parent_id": { | |
| "type": "keyword" | |
| }, | |
| "chunk_index": { | |
| "type": "integer" | |
| }, | |
| "chunk_count": { | |
| "parentId": { | |
| "type": "keyword" | |
| }, | |
| "chunkIndex": { | |
| "type": "integer" | |
| }, | |
| "chunkCount": { |
| "parent_id": { | ||
| "type": "keyword" | ||
| }, | ||
| "chunk_index": { | ||
| "type": "integer" | ||
| }, | ||
| "chunk_count": { |
The ES-native vector index mapping defines identifier fields in snake_case (parent_id, chunk_index, chunk_count), but the embedding/search code uses camelCase (parentId, chunkIndex, chunkCount). This mismatch will break grouping/filtering and makes the mapping inconsistent with the rest of the search schema. Align the mapping field names with the existing document schema (camelCase) or update the indexing/query code consistently.
| "parent_id": { | |
| "type": "keyword" | |
| }, | |
| "chunk_index": { | |
| "type": "integer" | |
| }, | |
| "chunk_count": { | |
| "parentId": { | |
| "type": "keyword" | |
| }, | |
| "chunkIndex": { | |
| "type": "integer" | |
| }, | |
| "chunkCount": { |
|
Also need to review all after this refactor #26000 😢 |
Summary
Adds native Elasticsearch 8.x/9.x vector search support, mirroring the existing OpenSearch implementation. OpenMetadata deployments backed by Elasticsearch can now use the same semantic/vector search features as OpenSearch deployments.
Changes
- ElasticSearchVectorService (new): ES implementation of VectorIndexService, using Rest5Client for generic HTTP requests. Mirrors OpenSearchVectorService structure.
- vector_search_index_es_native.json (new, en/jp/ru/zh): ES-native index mappings using dense_vector / dims / cosine similarity (ES 8.x/9.x format, as opposed to OpenSearch's knn_vector / dimension / HNSW).
- VectorSearchQueryBuilder.buildNativeESQuery(): emits the ES 8.x/9.x top-level knn query format (distinct from OpenSearch's nested query.knn). Reference: https://www.elastic.co/docs/solutions/search/vector/knn
- SemanticSearchQueryBuilder for Elasticsearch package: mirrors the OpenSearch equivalent.
- ElasticSearchIndexManager.extractMappingsJson(): extracts the mappings sub-object before calling putMapping; ES rejects full index JSON (with settings/aliases) at the mappings API.
- reformatVectorIndexWithDimension(): handles both "dims" (ES native) and "dimension" (OpenSearch) keys so embedding dimension injection works for both backends.
- SearchRepository / ElasticSearchBulkSink: wired to initialize and use ElasticSearchVectorService when ES backend is configured.
- VectorSearchQueryBuilderTest, ElasticSearchIndexManagerTest, and new ElasticSearchVectorServiceTest.
Compatibility
- OpenSearchBulkSink / OpenSearchVectorService untouched.
Test plan
- mvn test -pl openmetadata-service -Dtest=VectorSearchQueryBuilderTest,ElasticSearchIndexManagerTest,ElasticSearchVectorServiceTest
- embeddingProvider in elasticSearchConfiguration, run Search Index app against an ES 8.x/9.x cluster, verify vector index is created and knn search returns results
References
🤖 Generated with Claude Code