Skip to content

feat: native Elasticsearch vector search support#27111

Open
joaopamaral wants to merge 10 commits intoopen-metadata:mainfrom
Automattic:feat/es-elasticsearch-vector-search-upstream
Open

feat: native Elasticsearch vector search support#27111
joaopamaral wants to merge 10 commits intoopen-metadata:mainfrom
Automattic:feat/es-elasticsearch-vector-search-upstream

Conversation

@joaopamaral
Copy link
Copy Markdown

Summary

Adds native Elasticsearch 8.x/9.x vector search support, mirroring the existing OpenSearch implementation. OpenMetadata deployments backed by Elasticsearch can now use the same semantic/vector search features as OpenSearch deployments.

Changes

  • ElasticSearchVectorService (new): ES implementation of VectorIndexService, using Rest5Client for generic HTTP requests. Mirrors OpenSearchVectorService structure.
  • vector_search_index_es_native.json (new, en/jp/ru/zh): ES-native index mappings using dense_vector / dims / cosine similarity (ES 8.x/9.x format, as opposed to OpenSearch's knn_vector / dimension / HNSW).
  • VectorSearchQueryBuilder.buildNativeESQuery(): emits the ES 8.x/9.x top-level knn query format (distinct from OpenSearch's nested query.knn). Reference: https://www.elastic.co/docs/solutions/search/vector/knn
  • SemanticSearchQueryBuilder for Elasticsearch package: mirrors the OpenSearch equivalent.
  • ElasticSearchIndexManager.extractMappingsJson(): extracts the mappings sub-object before calling putMapping — ES rejects full index JSON (with settings/aliases) at the mappings API.
  • reformatVectorIndexWithDimension(): handles both "dims" (ES native) and "dimension" (OpenSearch) keys so embedding dimension injection works for both backends.
  • SearchRepository / ElasticSearchBulkSink: wired to initialize and use ElasticSearchVectorService when ES backend is configured.
  • Tests: VectorSearchQueryBuilderTest, ElasticSearchIndexManagerTest, and new ElasticSearchVectorServiceTest.

Compatibility

  • OpenSearch path is unchanged — OpenSearchBulkSink / OpenSearchVectorService untouched.
  • No schema changes required for existing deployments.

Test plan

  • mvn test -pl openmetadata-service -Dtest=VectorSearchQueryBuilderTest,ElasticSearchIndexManagerTest,ElasticSearchVectorServiceTest
  • Integration: configure embeddingProvider in elasticSearchConfiguration, run Search Index app against an ES 8.x/9.x cluster, verify vector index is created and knn search returns results
  • Confirm OpenSearch deployments unaffected

References

🤖 Generated with Claude Code

- Add ElasticSearchVectorService mirroring OpenSearchVectorService using Rest5Client
- Add vector_search_index_es_native.json with dense_vector/dims/cosine mappings for en/jp/ru/zh locales
- Add VectorSearchQueryBuilder.buildNativeESQuery() for ES 8.x/9.x top-level knn query format
- Add SemanticSearchQueryBuilder for Elasticsearch (mirrors OpenSearch equivalent)
- Fix ElasticSearchIndexManager.extractMappingsJson() to extract mappings sub-object for putMapping
- Fix reformatVectorIndexWithDimension() to handle both "dims" (ES) and "dimension" (OpenSearch) keys
- Wire ElasticSearchVectorService into SearchRepository and ElasticSearchBulkSink
- Extend VectorSearchQueryBuilderTest and ElasticSearchIndexManagerTest with new coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@joaopamaral
Copy link
Copy Markdown
Author

joaopamaral commented Apr 7, 2026

Initial results look good, but I've run a test only with ES 9.x and version 1.12.4 (not the one from main). I also need to double-check if OpenSearch is affected by this change. Also need to review some AI-resolved conflicts from version 1.12.4 with main.

@harshach
Copy link
Copy Markdown
Collaborator

harshach commented Apr 7, 2026

Thanks @joaopamaral this is great!!. Can you make it ready for review? and also address comments here #27111 (comment)

@joaopamaral
Copy link
Copy Markdown
Author

Sure @harshach! I'll work on the bot review first before making it ready for review! 👍

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…l 6 required args

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…ce and fix compilation

- Expand VectorIndexService interface to declare getExistingFingerprint,
  getExistingFingerprintsBatch, executeGenericRequest, and VECTOR_INDEX_KEY
  constant so callers (VectorSearchResource, ElasticSearchBulkSink) can
  invoke these through the interface type
- Add default getIndexAlias() to interface, removing duplicate private
  getSearchAlias() in OpenSearchVectorService
- Fix ElasticSearchVectorService: add generateEmbeddingFields/
  updateEntityEmbedding implementations, correct search() signature to
  include 'from' param, remove spurious @OverRide annotations
- Add 'from' parameter to buildNativeESQuery (valid for ES KNN pagination)
- DRY appendFilterMustClauses with nestedTags boolean: ES-native index
  maps tags as nested type, OpenSearch entity indices use flat object
- Add semanticSearch field to searchRequest.json (required by
  SemanticSearchQueryBuilder)
- Fix test: update all search() and buildNativeESQuery() call sites to
  pass the new 'from' parameter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…pper

Use Jackson ObjectMapper to patch the 'dims' field in the vector index
mapping template instead of exact string matching, which was fragile
against whitespace variations. Extract into package-private patchDimension()
and add 3 unit tests covering dimension replacement, preservation of other
fields, and the no-space JSON variant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…stance

Assign to the volatile 'instance' field only after registerVectorEmbeddingHandler()
completes, so a concurrent caller via getInstance() cannot observe a
partially-initialized service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

….search()

ES search() was passing 'from' directly to the KNN query, which skips raw
chunks rather than parent entities. Mirror the OpenSearch approach:
- Loop with rawOffset to collect from + size + 1 distinct parents
- Skip 'from' parents in application code after collection
- Return 4-arg VectorSearchResponse with totalHits and hasMore populated

Extract collectSearchHits() and extractTotalHits() private helpers (same
pattern as OpenSearchVectorService). Update tests to use parentId (camelCase,
matching VectorDocBuilder), use thenAnswer for fresh mock streams on each
loop iteration, and add sequence mock for multi-page termination tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…DocBuilder

bulkIndex() was reading 'parent_id' and 'chunk_index' (snake_case) to build
document IDs, but VectorDocBuilder stores both as camelCase ('parentId',
'chunkIndex'). This caused doc IDs to always be null-N or parentId-N (using
loop index instead of actual chunk index). Add test that captures the
BulkRequest and verifies the doc ID is parentId-chunkIndex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

ElasticSearchBulkSink was casting VectorIndexService to ElasticSearchVectorService
to call copyExistingVectorDocuments(), breaking the interface abstraction with
a potential ClassCastException. Add the method to the interface with a default
no-op (returns false), and remove the cast and import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…yExistingVectorDocuments

Remove copyExistingVectorDocuments from VectorIndexService (it is ES-specific
and has no meaningful default for other implementations). Use Java 21 pattern
matching instanceof in ElasticSearchBulkSink so the call is explicit and safe
without introducing a no-op default into the interface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@joaopamaral joaopamaral marked this pull request as ready for review April 10, 2026 19:41
Copilot AI review requested due to automatic review settings April 10, 2026 19:41
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 10, 2026

Code Review 👍 Approved with suggestions 5 resolved / 6 findings

Adds native Elasticsearch vector search support with comprehensive test coverage and fixes to pagination, initialization ordering, and interface safety. Consider adding a type guard for the extractRestClient cast to Rest5ClientTransport to prevent runtime errors.

💡 Edge Case: extractRestClient cast to Rest5ClientTransport has no guard

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:60-63

At line 61, extractRestClient() performs an unchecked cast (Rest5ClientTransport) client._transport(). While the current codebase always creates the ES client with Rest5ClientTransport, a future refactor or different client construction path would produce a ClassCastException with no helpful message. A defensive check or better error message would improve debuggability.

Suggested fix
private static Rest5Client extractRestClient(ElasticsearchClient client) {
  if (!(client._transport() instanceof Rest5ClientTransport rest5)) {
    throw new IllegalArgumentException(
        "ElasticSearchVectorService requires Rest5ClientTransport, got: "
            + client._transport().getClass().getName());
  }
  return rest5.restClient();
}
✅ 5 resolved
Bug: Test calls build() with 4 args but method requires 6 — won't compile

📄 openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilderTest.java:864 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilder.java:19-25
The test testNativeESQueryAndOpenSearchQueryProduceSameFilters at line 864 calls VectorSearchQueryBuilder.build(vector, 10, 100, filters) with 4 arguments, but the build() method signature requires 6 parameters: (float[] vector, int size, int from, int k, Map<String, List<String>> filters, double threshold). This will fail to compile. The intent appears to be comparing OpenSearch and ES native queries with the same filters.

Edge Case: loadIndexMapping dimension replacement is brittle — exact string match

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:523-527 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:3060-3068
In ElasticSearchVectorService.loadIndexMapping(), the dimension placeholder is replaced via exact string matching: template.replace(""dims": 512", ""dims": " + dimension). This requires the JSON template to have exactly one space after the colon. If the template is reformatted (e.g., minified, or extra spaces), the replacement silently fails. The code does have a post-check that throws if no replacement happened, but the same brittleness exists in SearchRepository.reformatVectorIndexWithDimension() fallback (lines 3062-3068) which does multiple .replace() calls for both with-space and without-space variants — showing awareness of the problem but an incomplete fix.

Edge Case: init() assigns instance before registerVectorEmbeddingHandler completes

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:65-79
In ElasticSearchVectorService.init() (line 70), instance is assigned the new service object, and then instance.registerVectorEmbeddingHandler() is called on line 71. Since getInstance() is not synchronized, a concurrent caller could observe the instance in a partially-initialized state (before the handler is registered). The volatile keyword ensures the reference is visible but doesn't guarantee that registerVectorEmbeddingHandler() has completed.

This mirrors the existing OpenSearch pattern, but the ES version is new code where it can be fixed.

Bug: ES search pagination is broken vs OpenSearch implementation

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:95-109 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java:196-210
The ElasticSearchVectorService.search() passes from directly to the ES kNN query, which skips raw hits (chunks). However, the API contract expects from to skip parent entities (grouped results). The OpenSearch implementation correctly handles this: it fetches from + size + 1 parents via a loop, then skips from parents in application code.

With the current ES implementation, requesting from=5, size=10 skips 5 raw chunks (not 5 parents), leading to incorrect pagination results. Additionally, the ES version uses the 2-arg VectorSearchResponse constructor, so totalHits and hasMore are always null — callers lose pagination metadata.

Quality: Unsafe downcast defeats purpose of VectorIndexService interface

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java:778
ElasticSearchBulkSink casts vectorService to (ElasticSearchVectorService) to call copyExistingVectorDocuments(). This couples the sink back to the concrete class, undermining the new interface abstraction. If vectorService is ever a different implementation, this throws ClassCastException at runtime.

🤖 Prompt for agents
Code Review: Adds native Elasticsearch vector search support with comprehensive test coverage and fixes to pagination, initialization ordering, and interface safety. Consider adding a type guard for the extractRestClient cast to Rest5ClientTransport to prevent runtime errors.

1. 💡 Edge Case: extractRestClient cast to Rest5ClientTransport has no guard
   Files: openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java:60-63

   At line 61, `extractRestClient()` performs an unchecked cast `(Rest5ClientTransport) client._transport()`. While the current codebase always creates the ES client with `Rest5ClientTransport`, a future refactor or different client construction path would produce a `ClassCastException` with no helpful message. A defensive check or better error message would improve debuggability.

   Suggested fix:
   private static Rest5Client extractRestClient(ElasticsearchClient client) {
     if (!(client._transport() instanceof Rest5ClientTransport rest5)) {
       throw new IllegalArgumentException(
           "ElasticSearchVectorService requires Rest5ClientTransport, got: "
               + client._transport().getClass().getName());
     }
     return rest5.restClient();
   }

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@joaopamaral
Copy link
Copy Markdown
Author

Hi @harshach, ’ve addressed the bot review, but I still need to re-review the code after rebasing/merging with main and rerun the tests against a real server. So far, I’ve tested this PR with version 1.12.4 and ES 9.3.1. I still need to validate that everything continues to work correctly with OpenSearch and ES 8.x.

I won’t be able to run tests for the next couple of days, but feel free to proceed with any testing on your side in the meantime.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds native Elasticsearch (8.x/9.x) vector search support to OpenMetadata, aiming to provide semantic/vector search capabilities on Elasticsearch deployments comparable to the existing OpenSearch implementation.

Changes:

  • Added a new ElasticSearchVectorService plus wiring in SearchRepository / ElasticSearchBulkSink to initialize and use it when Elasticsearch is the configured backend.
  • Introduced ES-native vector index mapping templates (vector_search_index_es_native.json) and extended query-building to emit Elasticsearch’s top-level knn query format.
  • Added/updated tests around the ES-native query format and Elasticsearch vector service behavior.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
openmetadata-spec/src/main/resources/json/schema/search/searchRequest.json Adds semanticSearch flag to the search request schema.
openmetadata-spec/src/main/resources/elasticsearch/en/vector_search_index_es_native.json New ES-native vector index template (en).
openmetadata-spec/src/main/resources/elasticsearch/jp/vector_search_index_es_native.json New ES-native vector index template (jp).
openmetadata-spec/src/main/resources/elasticsearch/ru/vector_search_index_es_native.json New ES-native vector index template (ru).
openmetadata-spec/src/main/resources/elasticsearch/zh/vector_search_index_es_native.json New ES-native vector index template (zh).
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilder.java Adds buildNativeESQuery and refactors filter emission for vector search queries.
openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilderTest.java Adds coverage for ES-native top-level knn query structure and filter behavior.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorIndexService.java Extends vector service interface and adds an alias helper.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java Adjusts to use the new interface default alias method and annotates overrides.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java New Elasticsearch vector service implementation using Rest5Client for generic requests.
openmetadata-service/src/test/java/org/openmetadata/service/search/vector/ElasticSearchVectorServiceTest.java New tests for ES vector service result parsing, grouping, and dimension patching.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Initializes ES vector service when Elasticsearch backend is configured; mapping selection tweaks for ES-native template.
openmetadata-service/src/main/java/org/openmetadata/service/search/RecreateWithEmbeddings.java Attempts to include a vector “entity” key in recreate flow when vector search is enabled.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/SemanticSearchQueryBuilder.java New builder for semantic/hybrid query composition on Elasticsearch.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java Extracts mappings sub-object before calling putMapping.
openmetadata-service/src/test/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManagerTest.java Adds a test asserting updateIndex handles full index JSON by extracting mappings.
openmetadata-service/src/main/java/org/openmetadata/service/resources/search/VectorSearchResource.java Switches to repository-provided VectorIndexService and adds a fingerprint endpoint.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java Adds async vector-embedding task execution + migration path for ES indexing jobs.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSinkSimpleTest.java Adds minimal coverage for vector-embedding helpers on the ES sink.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SemanticSearchTool.java Uses repository VectorIndexService rather than OpenSearch-only implementation.

Comment on lines +94 to +96
sb.append(",\"filter\":{\"bool\":{\"must\":[");
appendFilterMustClauses(sb, filters, true);
sb.append("]}}"); // close must array and bool
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildNativeESQuery passes nestedTags=true, which makes the tags filter emit a nested query. However, in the existing entity index mappings (e.g., elasticsearch/en/table_index_mapping.json), tags is an object (no "type": "nested"), so a nested query on path tags will fail or return no results. Consider using the same flat tags.tagFQN terms filter as the OpenSearch path (or only emitting nested when the target index actually maps tags as nested).

Copilot uses AI. Check for mistakes.
Comment on lines +231 to +234
List<Map<String, Object>> docs = VectorDocBuilder.fromEntity(entity, embeddingClient);
deleteByParentId(targetIndex, parentId);
bulkIndex(docs, targetIndex);
} catch (Exception e) {
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateEntityEmbedding delegates to updateVectorEmbeddings, which deletes-by-query and then bulk-indexes new documents into targetIndex. But VectorEmbeddingHandler passes the entity index name (e.g., table_search_index), so this will create additional documents (IDs like parentId-chunkIndex) instead of updating the existing entity document (ID = entity UUID), corrupting the entity index contents. Elasticsearch implementation should mirror OpenSearchVectorService.partialUpdateEntity(...) and update the existing doc by _id with the embedding fields.

Copilot uses AI. Check for mistakes.
Comment on lines +280 to +285
String query =
"{\"size\":1,\"_source\":[\"fingerprint\"],"
+ "\"query\":{\"term\":{\"parent_id\":\""
+ VectorSearchQueryBuilder.escape(parentId)
+ "\"}}}";
String response = executeGenericRequest("POST", "/" + indexName + "/_search", query);
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fingerprint lookup queries parent_id, but OpenMetadata search documents are keyed by _id (entity UUID) and the OpenSearch implementation queries by _id / ids query. As written, Elasticsearch will almost always return null fingerprints, causing unnecessary recomputation and making delete/update-by-query paths ineffective. Recommend switching to the same _id-based query strategy as OpenSearchVectorService.getExistingFingerprint(...).

Copilot uses AI. Check for mistakes.
Comment on lines +318 to +326
String query =
"{\"size\":"
+ parentIds.size()
+ ",\"_source\":[\"parent_id\",\"fingerprint\"]"
+ ",\"query\":{\"terms\":{\"parent_id\":"
+ termsArray
+ "}}"
+ ",\"collapse\":{\"field\":\"parent_id\"}}";

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Batch fingerprint lookup also uses parent_id terms + collapse on parent_id. If the intent is to reuse existing entity documents, this should use an ids query on _id (same as OpenSearch) to avoid relying on a separate parent_id field and to work across the existing entity indices behind the dataAssetEmbeddings alias.

Copilot uses AI. Check for mistakes.
Comment on lines +201 to +203
try (InputStream is = response.getEntity().getContent()) {
return new String(is.readAllBytes(), StandardCharsets.UTF_8);
}
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeGenericRequest reads and returns the response body but never checks HTTP status codes. For Elasticsearch errors (4xx/5xx), this will return an error payload as if it were a success, and downstream code will fail later while parsing/processing. Consider checking response.getStatusLine() (or equivalent) and throwing an exception that includes the error body when status >= 400, similar to OpenSearchVectorService.executeGenericRequest.

Suggested change
try (InputStream is = response.getEntity().getContent()) {
return new String(is.readAllBytes(), StandardCharsets.UTF_8);
}
String responseBody = "";
if (response.getEntity() != null) {
try (InputStream is = response.getEntity().getContent()) {
responseBody = new String(is.readAllBytes(), StandardCharsets.UTF_8);
}
}
int statusCode = response.getStatusLine().getStatusCode();
if (statusCode >= 400) {
throw new RuntimeException(
String.format(
"Elasticsearch generic request failed: %s %s returned status %d with body: %s",
method, endpoint, statusCode, responseBody));
}
return responseBody;

Copilot uses AI. Check for mistakes.

@Override
public int getPendingVectorTaskCount() {
return Math.max(0, phaser.getUnarrivedParties() - 1);
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getPendingVectorTaskCount() computes pending tasks via phaser.getUnarrivedParties() - 1. After awaitVectorCompletion() calls phaser.arrive(), the main party is no longer "unarrived", so this undercounts pending tasks (can report 0 while tasks are still running) and makes timeout reporting inaccurate. Prefer tracking pending via phaser.getRegisteredParties() - 1 (or an explicit counter) so the count is correct regardless of whether the main party has arrived.

Suggested change
return Math.max(0, phaser.getUnarrivedParties() - 1);
return Math.max(0, phaser.getRegisteredParties() - 1);

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +76
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ES-native vector index mapping defines identifier fields in snake_case (parent_id, chunk_index, chunk_count). The embedding code and existing search mappings use camelCase (parentId, chunkIndex, chunkCount), and the vector search/filtering logic also expects parentId. This mismatch will prevent correct grouping/filtering and makes the mapping inconsistent with the rest of the search schema. Align the vector index mapping field names with the existing document schema (camelCase) or update the vector doc builder + query code consistently.

Suggested change
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
"parentId": {
"type": "keyword"
},
"chunkIndex": {
"type": "integer"
},
"chunkCount": {

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +76
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ES-native vector index mapping defines identifier fields in snake_case (parent_id, chunk_index, chunk_count), but the embedding/search code uses camelCase (parentId, chunkIndex, chunkCount). This mismatch will break grouping/filtering and makes the mapping inconsistent with the rest of the search schema. Align the mapping field names with the existing document schema (camelCase) or update the indexing/query code consistently.

Suggested change
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
"parentId": {
"type": "keyword"
},
"chunkIndex": {
"type": "integer"
},
"chunkCount": {

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +76
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ES-native vector index mapping defines identifier fields in snake_case (parent_id, chunk_index, chunk_count), but the embedding/search code uses camelCase (parentId, chunkIndex, chunkCount). This mismatch will break grouping/filtering and makes the mapping inconsistent with the rest of the search schema. Align the mapping field names with the existing document schema (camelCase) or update the indexing/query code consistently.

Suggested change
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
"parentId": {
"type": "keyword"
},
"chunkIndex": {
"type": "integer"
},
"chunkCount": {

Copilot uses AI. Check for mistakes.
Comment on lines +141 to +147
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ES-native vector index mapping defines identifier fields in snake_case (parent_id, chunk_index, chunk_count), but the embedding/search code uses camelCase (parentId, chunkIndex, chunkCount). This mismatch will break grouping/filtering and makes the mapping inconsistent with the rest of the search schema. Align the mapping field names with the existing document schema (camelCase) or update the indexing/query code consistently.

Suggested change
"parent_id": {
"type": "keyword"
},
"chunk_index": {
"type": "integer"
},
"chunk_count": {
"parentId": {
"type": "keyword"
},
"chunkIndex": {
"type": "integer"
},
"chunkCount": {

Copilot uses AI. Check for mistakes.
@joaopamaral
Copy link
Copy Markdown
Author

Also need to review all after this refactor #26000 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants