
Vector Embedding needs to respect payload size#26943

Merged
mohityadav766 merged 1 commit into 1.12.5 from vector-payload
Apr 2, 2026

Conversation


@mohityadav766 mohityadav766 commented Apr 1, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Vector bulk processing improvements:
    • Added payload size tracking and flushing to VectorBulkProcessor with scheduled periodic flushes
    • Implemented centralized bulk processor in OpenSearchVectorService to respect max payload limits
    • Refactored bulkIndex method to use VectorBulkProcessor instead of direct bulk operations
  • Code cleanup:
    • Removed unused VectorBulkProcessor field and import from OpenSearchBulkSink
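The payload-size tracking described above can be sketched in isolation. This is a minimal illustration, not the PR's actual code: the class name `PayloadAwareBuffer`, the byte limit, and the `flushedBatches` stand-in for the real bulk call are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flush the buffer before an add would push the
// pending request over the byte limit.
class PayloadAwareBuffer {
    private final long maxPayloadBytes; // assumed limit, e.g. the server's max content length
    private final List<String> buffer = new ArrayList<>();
    private long payloadBytes = 0;
    final List<List<String>> flushedBatches = new ArrayList<>(); // stand-in for real bulk calls

    PayloadAwareBuffer(long maxPayloadBytes) {
        this.maxPayloadBytes = maxPayloadBytes;
    }

    synchronized void add(String serializedDoc) {
        long size = serializedDoc.getBytes(StandardCharsets.UTF_8).length;
        // Flush first so no single request ever exceeds the limit.
        if (payloadBytes + size > maxPayloadBytes && !buffer.isEmpty()) {
            flush();
        }
        buffer.add(serializedDoc);
        payloadBytes += size;
    }

    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedBatches.add(new ArrayList<>(buffer));
        buffer.clear();
        payloadBytes = 0;
    }
}
```

Flushing before the add (rather than after) keeps every batch strictly under the limit, at the cost of an oversized single document still producing an over-limit one-element batch; real code would also need to handle that edge case.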


@mohityadav766 mohityadav766 self-assigned this Apr 1, 2026
@github-actions github-actions bot added the backend and safe to test labels Apr 1, 2026
Comment on lines 459 to +471
public void bulkIndex(List<Map<String, Object>> documents, String targetIndex) {
  if (documents == null || documents.isEmpty()) {
    return;
  }

  try {
    List<BulkOperation> operations = new ArrayList<>();
    for (int i = 0; i < documents.size(); i++) {
      Map<String, Object> doc = documents.get(i);
      String parentId = (String) doc.get("parent_id");
      int chunkIndex = doc.containsKey("chunk_index") ? (int) doc.get("chunk_index") : i;
      String docId = parentId + "-" + chunkIndex;

      operations.add(
          BulkOperation.of(
              op -> op.index(idx -> idx.index(targetIndex).id(docId).document(doc))));
    }
    VectorBulkProcessor processor = getOrCreateBulkProcessor(targetIndex);
    for (int i = 0; i < documents.size(); i++) {
      Map<String, Object> doc = documents.get(i);
      String parentId = (String) doc.get("parent_id");
      int chunkIndex = doc.containsKey("chunk_index") ? (int) doc.get("chunk_index") : i;
      String docId = parentId + "-" + chunkIndex;
      processor.addChunk(docId, doc);
    }
⚠️ Bug: Race condition: bulkIndex uses processor after it may be closed

bulkIndex() is not synchronized. It obtains a VectorBulkProcessor reference from getOrCreateBulkProcessor() (which IS synchronized), then iterates documents calling processor.addChunk() outside the lock. If another thread concurrently calls bulkIndex() with a different targetIndex, getOrCreateBulkProcessor will close the old processor and create a new one. The first thread then calls addChunk() on the closed processor — whose scheduler is already shut down. Because addChunk() never checks closed, chunks are added to the dead processor's buffer. These will only be flushed if shouldFlush triggers a size-based flush; otherwise they are silently lost.

The same race exists with flushBulkProcessor(): it closes and nullifies centralBulkProcessor while bulkIndex may still hold a stale reference and be mid-iteration.

Suggested fix:

Either synchronize the entire `bulkIndex` method on the same lock as `getOrCreateBulkProcessor`/`flushBulkProcessor`, or hold the lock for the full duration of the add loop:

public void bulkIndex(List<Map<String, Object>> documents, String targetIndex) {
  if (documents == null || documents.isEmpty()) return;
  synchronized (this) {
    VectorBulkProcessor processor = getOrCreateBulkProcessor(targetIndex);
    for (int i = 0; i < documents.size(); i++) {
      // ... addChunk ...
    }
  }
}
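A third option (purely illustrative; the class name, fields, and exception choice here are assumptions, not the PR's code) is to make addChunk defensive, so a closed processor rejects work instead of silently buffering it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reject adds after close() so chunks can never be
// buffered into a dead processor and silently lost.
class DefensiveProcessor {
    private final List<String> pending = new ArrayList<>();
    private boolean closed = false;

    synchronized void addChunk(String docId, String doc) {
        if (closed) {
            // The caller must re-fetch a live processor and retry.
            throw new IllegalStateException("processor already closed");
        }
        pending.add(docId);
    }

    synchronized void close() {
        closed = true;
        pending.clear(); // real code would flush remaining chunks first
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

This turns the silent data loss into a loud failure, but the caller then needs retry logic; synchronizing the whole add loop, as suggested above, avoids that.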


Comment on lines +51 to +53
this.scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(
this::flushIfNeeded, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);

💡 Bug: Scheduler thread pool is non-daemon, may delay JVM shutdown

Executors.newScheduledThreadPool(1) creates a thread pool with non-daemon threads. If close() is not called (e.g., due to an exception in the calling code), the scheduler thread will prevent JVM shutdown. Use a custom ThreadFactory with daemon threads.

Suggested fix:

ThreadFactory daemonFactory = r -> {
  Thread t = new Thread(r, "vector-bulk-flush");
  t.setDaemon(true);
  return t;
};
this.scheduler = Executors.newScheduledThreadPool(1, daemonFactory);
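The factory can be checked in isolation. A runnable sketch of the suggested pattern, where the thread name "vector-bulk-flush" comes from the suggestion and the rest is scaffolding:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;

// Daemon threads let the JVM exit even if close() is never called on the processor.
ThreadFactory daemonFactory = r -> {
    Thread t = new Thread(r, "vector-bulk-flush");
    t.setDaemon(true);
    return t;
};
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1, daemonFactory);
```

Naming the thread also makes it easy to spot in thread dumps when diagnosing a hung shutdown.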


@gitar-bot

gitar-bot bot commented Apr 1, 2026

Code Review ⚠️ Changes requested 0 resolved / 2 findings

Vector embedding now respects payload size limits, but introduces two concurrency issues: bulkIndex() can keep using a processor reference after another thread has closed it, and the scheduler thread pool uses non-daemon threads that could delay JVM shutdown.

⚠️ Bug: Race condition: bulkIndex uses processor after it may be closed

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java:459-471, 474-483 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/VectorBulkProcessor.java:183-188, 64-71

Details and suggested fix in the inline review comment above.

💡 Bug: Scheduler thread pool is non-daemon, may delay JVM shutdown

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/VectorBulkProcessor.java:51-53

Details and suggested fix in the inline review comment above.


@mohityadav766 mohityadav766 merged commit 6b27fab into 1.12.5 Apr 2, 2026
24 of 50 checks passed
@mohityadav766 mohityadav766 deleted the vector-payload branch April 2, 2026 13:33
