
Vector Embedding needs to respect payload size#26943

Merged
mohityadav766 merged 1 commit into 1.12.5 from vector-payload
Apr 2, 2026

Conversation


@mohityadav766 mohityadav766 commented Apr 1, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Vector bulk processing improvements:
    • Added payload size tracking and flushing to VectorBulkProcessor with scheduled periodic flushes
    • Implemented centralized bulk processor in OpenSearchVectorService to respect max payload limits
    • Refactored bulkIndex method to use VectorBulkProcessor instead of direct bulk operations
  • Code cleanup:
    • Removed unused VectorBulkProcessor field and import from OpenSearchBulkSink
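The payload-size tracking described above can be sketched in isolation. This is a minimal illustration, not the PR's actual code: the class name `PayloadAwareBuffer`, the byte limit, and the `flushedBatches` stand-in for the real bulk call are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flush the buffer before an add would push the
// pending request over the byte limit.
class PayloadAwareBuffer {
    private final long maxPayloadBytes; // assumed limit, e.g. the server's max content length
    private final List<String> buffer = new ArrayList<>();
    private long payloadBytes = 0;
    final List<List<String>> flushedBatches = new ArrayList<>(); // stand-in for real bulk calls

    PayloadAwareBuffer(long maxPayloadBytes) {
        this.maxPayloadBytes = maxPayloadBytes;
    }

    synchronized void add(String serializedDoc) {
        long size = serializedDoc.getBytes(StandardCharsets.UTF_8).length;
        // Flush first so no single request ever exceeds the limit.
        if (payloadBytes + size > maxPayloadBytes && !buffer.isEmpty()) {
            flush();
        }
        buffer.add(serializedDoc);
        payloadBytes += size;
    }

    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedBatches.add(new ArrayList<>(buffer));
        buffer.clear();
        payloadBytes = 0;
    }
}
```

Flushing before the add (rather than after) keeps every batch strictly under the limit, at the cost of an oversized single document still producing an over-limit one-element batch; real code would also need to handle that edge case.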


@mohityadav766 mohityadav766 self-assigned this Apr 1, 2026
@github-actions github-actions bot added the backend and safe to test labels Apr 1, 2026
Comment on lines 459 to +471
public void bulkIndex(List<Map<String, Object>> documents, String targetIndex) {
  if (documents == null || documents.isEmpty()) {
    return;
  }

  try {
    List<BulkOperation> operations = new ArrayList<>();
    for (int i = 0; i < documents.size(); i++) {
      Map<String, Object> doc = documents.get(i);
      String parentId = (String) doc.get("parent_id");
      int chunkIndex = doc.containsKey("chunk_index") ? (int) doc.get("chunk_index") : i;
      String docId = parentId + "-" + chunkIndex;

      operations.add(
          BulkOperation.of(
              op -> op.index(idx -> idx.index(targetIndex).id(docId).document(doc))));
    }
    VectorBulkProcessor processor = getOrCreateBulkProcessor(targetIndex);
    for (int i = 0; i < documents.size(); i++) {
      Map<String, Object> doc = documents.get(i);
      String parentId = (String) doc.get("parent_id");
      int chunkIndex = doc.containsKey("chunk_index") ? (int) doc.get("chunk_index") : i;
      String docId = parentId + "-" + chunkIndex;
      processor.addChunk(docId, doc);
    }
⚠️ Bug: Race condition: bulkIndex uses processor after it may be closed

bulkIndex() is not synchronized. It obtains a VectorBulkProcessor reference from getOrCreateBulkProcessor() (which IS synchronized), then iterates documents calling processor.addChunk() outside the lock. If another thread concurrently calls bulkIndex() with a different targetIndex, getOrCreateBulkProcessor will close the old processor and create a new one. The first thread then calls addChunk() on the closed processor — whose scheduler is already shut down. Because addChunk() never checks closed, chunks are added to the dead processor's buffer. These will only be flushed if shouldFlush triggers a size-based flush; otherwise they are silently lost.

The same race exists with flushBulkProcessor(): it closes and nullifies centralBulkProcessor while bulkIndex may still hold a stale reference and be mid-iteration.

Suggested fix:

Either synchronize the entire `bulkIndex` method on the same lock as `getOrCreateBulkProcessor`/`flushBulkProcessor`, or hold the lock for the full duration of the add loop:

public void bulkIndex(List<Map<String, Object>> documents, String targetIndex) {
  if (documents == null || documents.isEmpty()) return;
  synchronized (this) {
    VectorBulkProcessor processor = getOrCreateBulkProcessor(targetIndex);
    for (int i = 0; i < documents.size(); i++) {
      // ... addChunk ...
    }
  }
}
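A third option (purely illustrative; the class name, fields, and exception choice here are assumptions, not the PR's code) is to make addChunk defensive, so a closed processor rejects work instead of silently buffering it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reject adds after close() so chunks can never be
// buffered into a dead processor and silently lost.
class DefensiveProcessor {
    private final List<String> pending = new ArrayList<>();
    private boolean closed = false;

    synchronized void addChunk(String docId, String doc) {
        if (closed) {
            // The caller must re-fetch a live processor and retry.
            throw new IllegalStateException("processor already closed");
        }
        pending.add(docId);
    }

    synchronized void close() {
        closed = true;
        pending.clear(); // real code would flush remaining chunks first
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

This turns the silent data loss into a loud failure, but the caller then needs retry logic; synchronizing the whole add loop, as suggested above, avoids that.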


Comment on lines +51 to +53
this.scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(
this::flushIfNeeded, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);

💡 Bug: Scheduler thread pool is non-daemon, may delay JVM shutdown

Executors.newScheduledThreadPool(1) creates a thread pool with non-daemon threads. If close() is not called (e.g., due to an exception in the calling code), the scheduler thread will prevent JVM shutdown. Use a custom ThreadFactory with daemon threads.

Suggested fix:

ThreadFactory daemonFactory = r -> {
  Thread t = new Thread(r, "vector-bulk-flush");
  t.setDaemon(true);
  return t;
};
this.scheduler = Executors.newScheduledThreadPool(1, daemonFactory);
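The factory can be checked in isolation. A runnable sketch of the suggested pattern, where the thread name "vector-bulk-flush" comes from the suggestion and the rest is scaffolding:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;

// Daemon threads let the JVM exit even if close() is never called on the processor.
ThreadFactory daemonFactory = r -> {
    Thread t = new Thread(r, "vector-bulk-flush");
    t.setDaemon(true);
    return t;
};
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1, daemonFactory);
```

Naming the thread also makes it easy to spot in thread dumps when diagnosing a hung shutdown.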


@gitar-bot

gitar-bot bot commented Apr 1, 2026

Code Review ⚠️ Changes requested 0 resolved / 2 findings

Vector embedding now respects payload size limits, but introduces two concurrency issues: bulkIndex() can keep using a processor reference after another thread has closed it, and the scheduler thread pool uses non-daemon threads that could delay JVM shutdown.

⚠️ Bug: Race condition: bulkIndex uses processor after it may be closed

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java:459-471, 474-483 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/VectorBulkProcessor.java:183-188, 64-71

Details and suggested fix in the inline review comment above.

💡 Bug: Scheduler thread pool is non-daemon, may delay JVM shutdown

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/VectorBulkProcessor.java:51-53

Details and suggested fix in the inline review comment above.


@mohityadav766 mohityadav766 merged commit 6b27fab into 1.12.5 Apr 2, 2026
24 of 50 checks passed
@mohityadav766 mohityadav766 deleted the vector-payload branch April 2, 2026 13:33
