Fix Virtual Threads unbounded by mohityadav766 · Pull Request #26013 · open-metadata/OpenMetadata

mohityadav766 · 2026-02-20T10:04:29Z

Summary by Gitar

Virtual thread concurrency limiting:
- Added BoundedExecutorService wrapper enforcing semaphore-based concurrency limits on all async tasks
- Automatically resolves safe limits from environment variable, database pool size (1/3), or CPU cores (min 4)
- Prevents unbounded resource consumption during sustained background task execution
Graceful thread lifecycle management:
- DistributedJobParticipant and DistributedSearchIndexExecutor now properly track and interrupt long-running virtual threads with configurable timeouts
- Added interruptAndJoin() helper method centralizing thread shutdown logic
- Ensures clean termination on service stop without resource leaks
AsyncService shutdown integration:
- New execute() and submit() convenience methods for bounded async operations
- Added 30-second graceful shutdown timeout with forced termination fallback
- Called during application shutdown to coordinate executor service termination
Audit event migration:
- Migrated AuditLogRepository.writeAuthEvent() from unbounded Thread.startVirtualThread() to bounded AsyncService.execute()
- Preserves async audit trail writing while respecting concurrency limits
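
The shutdown behaviour summarized above (a graceful wait, then forced termination) matches the standard ExecutorService idiom. A minimal sketch, assuming a hypothetical helper named shutdownGracefully; only the 30-second figure comes from the summary, and the rest is illustrative rather than the PR's actual code:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not taken from the PR.
final class GracefulShutdown {
  private GracefulShutdown() {}

  /**
   * Stops accepting new tasks, waits up to the timeout for running tasks,
   * then falls back to shutdownNow(). Returns true if the executor
   * terminated within the graceful window.
   */
  static boolean shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
    executor.shutdown(); // previously submitted tasks keep running
    try {
      if (executor.awaitTermination(timeout, unit)) {
        return true;
      }
      List<Runnable> neverStarted = executor.shutdownNow(); // interrupts running tasks
      System.err.println("Forced shutdown; tasks never started: " + neverStarted.size());
      return false;
    } catch (InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newCachedThreadPool();
    executor.execute(() -> System.out.println("audit event written"));
    System.out.println("clean shutdown: " + shutdownGracefully(executor, 30, TimeUnit.SECONDS));
  }
}
```

Tasks that ignore interruption can outlive shutdownNow(), so the forced path is best-effort rather than a guarantee.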

gitar-bot · 2026-02-20T10:07:14Z

-                }
-              }
-            });
+    participantThread =


⚠️ Edge Case: Race condition: joinAndProcessJob can overwrite active thread ref

joinAndProcessJob() guards against concurrent calls with participating.compareAndSet(false, true), and onJobDiscovered() checks participating.get() before calling it. However, there's a race window in the finally block of the spawned thread (lines 215-216):

    finally {
        participantThread = null;     // line 215
        participating.set(false);     // line 216

Between setting participantThread = null and participating.set(false), the onJobDiscovered callback (running on a different thread) could see participating == true, skip the call, and miss a job notification. More concerning: if a new onJobDiscovered fires right after participating.set(false) on line 216 but before the virtual thread fully exits, the new call to joinAndProcessJob will set participantThread to a new thread while the old one is still in its finally block.

The more critical scenario is stop(). There the interrupt/join itself is safe: stop() copies the reference into a local variable first (Thread thread = participantThread, line 121), so it does not matter that the thread's finally block later nulls the field. What remains is the ordering of cleanup in the finally block: participating.set(false) should come last, after all other cleanup including resetting the notifier, to minimize the race window. The current code already clears participantThread and currentJobId before participating goes false, so this is mostly a theoretical concern, but the window exists.
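
The lifecycle pattern discussed here can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the field names mirror the comment, but the class name ParticipantSketch, the method bodies, and the interruptAndJoin signature are hypothetical, and a platform thread is used so the sketch also runs on JDKs without virtual threads.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: names mirror the review comment, but bodies and
// signatures are assumptions rather than the PR's implementation.
final class ParticipantSketch {
  private final AtomicBoolean participating = new AtomicBoolean(false);
  private volatile Thread participantThread;

  /** Returns false if a job is already being processed. */
  boolean joinAndProcessJob(Runnable job) {
    if (!participating.compareAndSet(false, true)) {
      return false; // another call won the race
    }
    Thread worker = new Thread(() -> {
      try {
        job.run();
      } finally {
        participantThread = null;  // clear all other state first
        participating.set(false);  // release the guard last
      }
    }, "participant");
    participantThread = worker; // assigned before start(), so the finally block cannot run first
    worker.start();
    return true;
  }

  /** Interrupts the worker and waits up to timeoutMillis; true if it has exited. */
  boolean stop(long timeoutMillis) {
    Thread thread = participantThread; // local copy: the field may be nulled concurrently
    return interruptAndJoin(thread, timeoutMillis);
  }

  static boolean interruptAndJoin(Thread thread, long timeoutMillis) {
    if (thread == null) {
      return true; // nothing running
    }
    thread.interrupt();
    try {
      thread.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !thread.isAlive();
  }

  boolean isParticipating() {
    return participating.get();
  }
}
```

Because the guard is released only after the field is cleared, a new joinAndProcessJob call that wins the compareAndSet cannot have its freshly assigned thread reference overwritten by the old worker's cleanup.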

gitar-bot · 2026-02-20T10:07:15Z

-                  .withTimestamp(System.currentTimeMillis());
-          write(changeEvent);
-        });
+    AsyncService.getInstance()


💡 Bug: Swallowed exceptions from writeAuthEvent after pool migration

The writeAuthEvent method now submits to the executor pool via .submit(), which returns a Future. If the task throws an exception (e.g., database write failure in write(changeEvent)), the exception is silently swallowed because the returned Future is never checked.

With the previous Thread.startVirtualThread(), uncaught exceptions would at least be logged by the default uncaught exception handler. With ExecutorService.submit(), exceptions are captured in the Future and only thrown when Future.get() is called — which never happens here.

Consider using .execute() instead of .submit() so that the virtual thread factory's uncaught exception handler can log failures, or wrap the Runnable with try/catch and explicit logging.

Suggested fix:

    AsyncService.getInstance()
        .getExecutorService()
        .execute(
            () -> {
              ChangeEvent changeEvent =
                  new ChangeEvent()
                      .withId(UUID.randomUUID())
                      .withEventType(eventType)
                      .withEntityType(Entity.USER)
                      .withEntityId(userId)
                      .withUserName(userName)
                      .withTimestamp(System.currentTimeMillis());
              write(changeEvent);
            });
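
The alternative mentioned above, wrapping the Runnable with try/catch and explicit logging, might look like the sketch below. LoggedTasks, logged, and runFailingTask are hypothetical names introduced for illustration; the error sink stands in for whatever logger the real code uses.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of the PR.
final class LoggedTasks {
  private LoggedTasks() {}

  /** Wraps a task so failures reach the sink even when the Future is discarded. */
  static Runnable logged(Runnable task, Consumer<Throwable> errorSink) {
    return () -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        errorSink.accept(e); // in real code this would be a logger call
      }
    };
  }

  /** Runs a failing task through submit() and returns the errors the sink observed. */
  static List<Throwable> runFailingTask(boolean wrap) {
    List<Throwable> seen = new CopyOnWriteArrayList<>();
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Runnable failing = () -> { throw new IllegalStateException("database write failed"); };
    try {
      Future<?> future = executor.submit(wrap ? logged(failing, seen::add) : failing);
      try {
        future.get(5, TimeUnit.SECONDS); // an unwrapped failure only surfaces here
      } catch (ExecutionException e) {
        // Deliberately ignored to mimic a caller that never inspects the Future.
      }
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
    return seen;
  }
}
```

With wrap set to false nothing reaches the sink, which is the silent-failure case the comment describes; with wrap set to true the same failure is reported once.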

gitar-bot · 2026-02-20T15:00:32Z

+    }
+
+    @Override
+    public void execute(Runnable command) {


⚠️ Edge Case: Semaphore acquired after thread creation doesn't bound thread count

The BoundedExecutorService.execute() method calls delegate.execute(wrappedTask) which creates a new virtual thread immediately, and the semaphore.acquire() only happens inside that already-spawned thread. Under burst submission (e.g., a flood of audit events), this allows unbounded virtual thread creation — threads are spawned eagerly and then park on the semaphore.

While parked virtual threads are lightweight (they don't consume OS threads), this defeats the PR's stated goal of preventing "unbounded virtual thread creation." The approach does effectively bound the number of concurrently running tasks, which protects downstream resources like DB connections.

To actually bound thread creation, the semaphore should be acquired before delegating to the underlying executor:

    public void execute(Runnable command) {
      try {
        semaphore.acquire(); // block the CALLER, not the spawned thread
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted waiting for concurrency permit", e);
      }
      delegate.execute(() -> {
        try {
          command.run();
        } finally {
          semaphore.release();
        }
      });
    }

This would apply back-pressure to callers and truly bound the number of created threads. The trade-off is that callers (like the audit log writer) would block until a permit is available, which may be acceptable since they're already running on virtual threads.

Suggested fix:

    @Override
    public void execute(Runnable command) {
      try {
        semaphore.acquire();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted waiting for concurrency permit", e);
      }
      delegate.execute(
          () -> {
            try {
              command.run();
            } finally {
              semaphore.release();
            }
          });
    }
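
To check the caller-blocking variant in isolation, a self-contained sketch can measure how many tasks ever run at once. CallerBlockingExecutor and measurePeak are hypothetical names, and a cached platform-thread pool stands in for the virtual-thread executor so the example also runs on JDKs older than 21. Unlike the suggestion above, it also returns the permit if the delegate rejects the task, which is an addition of this sketch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch; the PR's BoundedExecutorService is not reproduced here.
final class CallerBlockingExecutor implements Executor {
  private final Executor delegate;
  private final Semaphore permits;

  CallerBlockingExecutor(Executor delegate, int maxConcurrency) {
    this.delegate = delegate;
    this.permits = new Semaphore(maxConcurrency);
  }

  @Override
  public void execute(Runnable command) {
    try {
      permits.acquire(); // blocks the caller, so no task is handed over until a permit is free
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RejectedExecutionException("Interrupted waiting for concurrency permit", e);
    }
    try {
      delegate.execute(() -> {
        try {
          command.run();
        } finally {
          permits.release();
        }
      });
    } catch (RuntimeException e) {
      permits.release(); // the delegate rejected the task, so the task will never release
      throw e;
    }
  }

  /** Submits short jobs and returns the highest number observed running at once. */
  static int measurePeak(int maxConcurrency, int tasks) {
    // Assumption: a cached pool stands in for a virtual-thread-per-task executor.
    ExecutorService pool = Executors.newCachedThreadPool();
    CallerBlockingExecutor bounded = new CallerBlockingExecutor(pool, maxConcurrency);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(tasks);
    try {
      for (int i = 0; i < tasks; i++) {
        bounded.execute(() -> {
          peak.accumulateAndGet(running.incrementAndGet(), Math::max);
          try {
            Thread.sleep(5);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            running.decrementAndGet();
            done.countDown();
          }
        });
      }
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }
}
```

For example, CallerBlockingExecutor.measurePeak(3, 40) should never report more than 3, because each task increments the counter only while it holds a permit.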

gitar-bot · 2026-02-20T15:00:33Z

+  }
+
+  private static int resolveMaxConcurrency() {
+    String env = System.getenv("ASYNC_SERVICE_MAX_CONCURRENCY");


💡 Quality: Invalid env var ASYNC_SERVICE_MAX_CONCURRENCY silently ignored

When ASYNC_SERVICE_MAX_CONCURRENCY is set but contains an invalid value (non-numeric string, zero, or negative number), the code silently falls through to the CPU/DB-pool heuristic at lines 46-47. This can confuse operators who set the env var expecting explicit control but get different behavior with no feedback.

A log warning would help operators detect misconfiguration quickly, especially since this is the highest-priority configuration source.

Suggested fix:

    String env = System.getenv("ASYNC_SERVICE_MAX_CONCURRENCY");
    if (env != null) {
      try {
        int value = Integer.parseInt(env.trim());
        if (value > 0) {
          return value;
        }
        LOG.warn("ASYNC_SERVICE_MAX_CONCURRENCY must be positive, got {}; falling back to auto", value);
      } catch (NumberFormatException e) {
        LOG.warn("ASYNC_SERVICE_MAX_CONCURRENCY is not a valid integer: '{}'; falling back to auto", env);
      }
    }

gitar-bot · 2026-02-20T15:23:33Z

🔍 CI failure analysis for ac0ebf5: After semaphore-based concurrency limiting: 2 search index timeout failures. Tests expect entities to appear in 1 minute but semaphore may throttle async indexing tasks, causing delays.

Issue

CI failures after new commit (43d1cba) with BoundedExecutorService:

2 tests failed out of 10,714 (0.02% rate):

DomainResourceIT.java:889: Domain not found in search index
TableResourceIT.java:1: Entity didn't appear in search index within 1 minute

Both are search index synchronization timeouts.

Root Cause

Semaphore-based concurrency limiting may be throttling async indexing operations.

New commit introduced:

BoundedExecutorService wrapper with semaphore on ALL execution paths
Every task acquires permit before running, releases after
Max concurrency: min(CPU_budget, database_pool_size / 3) with minimum 4
Environment override: ASYNC_SERVICE_MAX_CONCURRENCY

Critical calculation for tests:

If database_pool_size = 10:
  maxConcurrency = max(4, min(CPU_budget, 10/3)) = max(4, min(CPU_budget, 3)) = 4, whatever the CPU budget
  
With only 4 concurrent async slots:
  10,714 tests → thousands of entities → indexing queue
  Tests asserting within 1 minute → timeouts while queue drains
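
The limit resolution described above can be restated as a small, testable function. This is a sketch under assumptions: ConcurrencyLimit, fromHeuristic, and resolve are hypothetical names, the source does not define how CPU_budget is computed (it is passed in as a plain int here), and main uses availableProcessors() and a pool size of 10 purely as example inputs.

```java
// Hypothetical restatement of the limit resolution; not the PR's code.
final class ConcurrencyLimit {
  private ConcurrencyLimit() {}

  /** Heuristic from the analysis: max(4, min(cpuBudget, poolSize / 3)). */
  static int fromHeuristic(int cpuBudget, int poolSize) {
    return Math.max(4, Math.min(cpuBudget, poolSize / 3)); // integer division, as in the worked example
  }

  /** A valid positive override wins; anything else falls back to the heuristic. */
  static int resolve(String envValue, int cpuBudget, int poolSize) {
    if (envValue != null) {
      try {
        int value = Integer.parseInt(envValue.trim());
        if (value > 0) {
          return value;
        }
        System.err.println("Override must be positive, got " + value + "; using heuristic");
      } catch (NumberFormatException e) {
        System.err.println("Override is not an integer: '" + envValue + "'; using heuristic");
      }
    }
    return fromHeuristic(cpuBudget, poolSize);
  }

  public static void main(String[] args) {
    int cpus = Runtime.getRuntime().availableProcessors(); // assumed definition of the CPU budget
    String env = System.getenv("ASYNC_SERVICE_MAX_CONCURRENCY");
    System.out.println("limit = " + resolve(env, cpus, 10));
  }
}
```

With a pool size of 10 the heuristic yields 4 for any CPU budget, while an override of 32 is returned unchanged.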

Details

Background errors (non-fatal):

OpenSearch version conflicts on chart/domain indices
Search query failed: search_phase_execution_exception: all shards failed
Domain FQN update conflicts

Key observations:

Different failure pattern: Previous runs had version conflicts; this run has timeouts
Timing-sensitive: Search indexing is async via executor
Semaphore enforcement: Now blocks ALL paths (execute, submit)
Test expectations: Still expect 1-minute completion

Relevance Assessment

HIGH CONFIDENCE THIS IS RELATED TO SEMAPHORE:

Evidence:

Timing failures (not version conflicts like before)
Search operations affected (async background tasks)
Conservative limits (pool_size / 3 formula)
Tests have fixed timeouts (1 minute wait)

How semaphore causes delays:

Background indexing tasks queue up waiting for permits
With low concurrency (4-8), queue drains slowly
Tests assert entity presence before indexing completes
1-minute timeout expires → test fails

Example scenario:

Test creates entity → DB write succeeds immediately
→ Async indexing task submitted
→ Semaphore has 0 available permits (4 other tasks running)
→ Task waits in queue
→ Test polls search index every few seconds
→ 60 seconds elapse
→ Task still hasn't acquired permit / hasn't indexed
→ Test fails

Recommendation

PRIORITY: Increase Concurrency for Test Environment

Option 1: Environment variable (quickest)

export ASYNC_SERVICE_MAX_CONCURRENCY=32  # in test CI config

Option 2: Test-specific calculation

if (isTestEnvironment()) {
  return Math.max(16, cpuBudget);  // Higher for tests
}
return Math.max(4, Math.min(cpuBudget, poolSize / 3));  // Conservative for prod

Option 3: Non-blocking for indexing

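// In execute() method
try {
  if (!semaphore.tryAcquire(5, TimeUnit.SECONDS)) {
    LOG.warn("Semaphore timeout, running task anyway");
    delegate.execute(command);  // Bypass semaphore
    return;
  }
} catch (InterruptedException e) {  // tryAcquire with a timeout is interruptible
  Thread.currentThread().interrupt();
  throw new RejectedExecutionException("Interrupted waiting for permit", e);
}
// ... rest of wrapped execution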

Option 4: Increase test timeouts (treating symptom)
Change 1-minute waits to 2-3 minutes.

Assessment

The semaphore approach is architecturally sound but needs tuning:

Production: Conservative limits prevent resource exhaustion ✅
Tests: Need higher limits or separate config ⚠️
Search indexing: May need non-blocking or higher priority ⚠️

Recommended fix: Set ASYNC_SERVICE_MAX_CONCURRENCY=32 in test CI configuration or adjust formula to be less conservative for test workloads.

Code Review ⚠️ Changes requested 3 resolved / 6 findings

The semaphore-based bounded executor is a good improvement over the previous fixed-pool approach — it's idiomatic for virtual threads. The remaining previous findings (race condition in joinAndProcessJob, silent env var handling, swallowed exceptions in writeAuthEvent) are still present but low-impact.

⚠️ Edge Case: Race condition: joinAndProcessJob can overwrite active thread ref

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobParticipant.java:207 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobParticipant.java:215

joinAndProcessJob() guards against concurrent calls with participating.compareAndSet(false, true), and onJobDiscovered() checks participating.get() before calling it. However, there's a race window in the finally block of the spawned thread (lines 215-216):

finally {
    participantThread = null;     // line 215
    participating.set(false);      // line 216

Between setting participantThread = null and participating.set(false), the onJobDiscovered callback (running on a different thread) could see participating == true, skip the call, and miss a job notification. More concerning: if a new onJobDiscovered fires right after participating.set(false) on line 216 but before the virtual thread fully exits, the new call to joinAndProcessJob will set participantThread to a new thread while the old one is still in its finally block.

The more critical scenario is stop(). There the interrupt/join itself is safe: stop() copies the reference into a local variable first (Thread thread = participantThread, line 121), so it does not matter that the thread's finally block later nulls the field. What remains is the ordering of cleanup in the finally block: participating.set(false) should come last, after all other cleanup including resetting the notifier, to minimize the race window. The current code already clears participantThread and currentJobId before participating goes false, so this is mostly a theoretical concern, but the window exists.

💡 Quality: Invalid env var ASYNC_SERVICE_MAX_CONCURRENCY silently ignored

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:39

When ASYNC_SERVICE_MAX_CONCURRENCY is set but contains an invalid value (non-numeric string, zero, or negative number), the code silently falls through to the CPU/DB-pool heuristic at lines 46-47. This can confuse operators who set the env var expecting explicit control but get different behavior with no feedback.

A log warning would help operators detect misconfiguration quickly, especially since this is the highest-priority configuration source.

Suggested fix

    String env = System.getenv("ASYNC_SERVICE_MAX_CONCURRENCY");
    if (env != null) {
      try {
        int value = Integer.parseInt(env.trim());
        if (value > 0) {
          return value;
        }
        LOG.warn("ASYNC_SERVICE_MAX_CONCURRENCY must be positive, got {}; falling back to auto", value);
      } catch (NumberFormatException e) {
        LOG.warn("ASYNC_SERVICE_MAX_CONCURRENCY is not a valid integer: '{}'; falling back to auto", env);
      }
    }

💡 Bug: Swallowed exceptions from writeAuthEvent after pool migration

📄 openmetadata-service/src/main/java/org/openmetadata/service/audit/AuditLogRepository.java:126

The writeAuthEvent method now submits to the executor pool via .submit(), which returns a Future. If the task throws an exception (e.g., database write failure in write(changeEvent)), the exception is silently swallowed because the returned Future is never checked.

With the previous Thread.startVirtualThread(), uncaught exceptions would at least be logged by the default uncaught exception handler. With ExecutorService.submit(), exceptions are captured in the Future and only thrown when Future.get() is called — which never happens here.

Consider using .execute() instead of .submit() so that the virtual thread factory's uncaught exception handler can log failures, or wrap the Runnable with try/catch and explicit logging.

Suggested fix

    AsyncService.getInstance()
        .getExecutorService()
        .execute(
            () -> {
              ChangeEvent changeEvent =
                  new ChangeEvent()
                      .withId(UUID.randomUUID())
                      .withEventType(eventType)
                      .withEntityType(Entity.USER)
                      .withEntityId(userId)
                      .withUserName(userName)
                      .withTimestamp(System.currentTimeMillis());
              write(changeEvent);
            });

✅ 3 resolved

✅ Performance: Fixed pool of 20 shared by 20+ callers risks thread starvation

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:21 📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:25
The AsyncService executor is shared by at least 20+ call sites across the application: EntityResource (5 uses), ColumnResource (2), SearchResource (2), LineageResource (2), AuditLogResource, AppResource, GlossaryTermResource, TestSuiteRepository, UserRepository, EntityCsv (2), and now AuditLogRepository.

A fixed pool of 20 threads means that under load, long-running tasks (e.g., CSV exports in EntityCsv, search operations in SearchResource) can block all 20 threads, preventing short tasks like writeAuthEvent() from executing. This is a classic thread starvation problem — the bounded pool addresses unbounded thread creation but introduces a new failure mode.

Additionally, Executors.newFixedThreadPool uses an unbounded LinkedBlockingQueue, so the queue itself can grow without limit when all 20 threads are busy, which ironically doesn't fully solve the resource exhaustion concern — it just shifts it from thread count to queue memory.

Suggestions:

Make POOL_SIZE configurable (e.g., via environment variable or application config) so deployments can tune it based on their workload.

Consider using a ThreadPoolExecutor with a bounded queue and a rejection policy (e.g., CallerRunsPolicy) to provide backpressure instead of unbounded queuing.

At minimum, document why 20 was chosen and the trade-offs involved.
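
The bounded-queue suggestion above can be sketched with the standard ThreadPoolExecutor constructor. BackpressurePool, create, and countCallerRuns are hypothetical names; the pool and queue sizes are arbitrary example values, not recommendations.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the suggestion; not code from the PR.
final class BackpressurePool {
  private BackpressurePool() {}

  /** Fixed-size pool with a bounded queue; overflow tasks run on the submitting thread. */
  static ThreadPoolExecutor create(int threads, int queueCapacity) {
    return new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        new ThreadPoolExecutor.CallerRunsPolicy());
  }

  /** Submits a burst and returns how many tasks ended up running on the caller's thread. */
  static int countCallerRuns(int threads, int queueCapacity, int tasks) {
    ThreadPoolExecutor pool = create(threads, queueCapacity);
    Thread caller = Thread.currentThread();
    AtomicInteger callerRuns = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(tasks);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          if (Thread.currentThread() == caller) {
            callerRuns.incrementAndGet(); // the rejection policy ran this task inline
          }
          try {
            Thread.sleep(50);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            done.countDown();
          }
        });
      }
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return callerRuns.get();
  }
}
```

Once the worker threads are busy and the queue is full, CallerRunsPolicy makes the submitter execute the task itself, which slows submission down instead of letting the queue grow.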

✅ Quality: Using fixed pool defeats the purpose of virtual threads

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:25
Executors.newFixedThreadPool(20, Thread.ofVirtual().factory()) creates a fixed pool where virtual threads are reused like platform threads — this effectively negates the key benefit of virtual threads (cheap creation/destruction, no pooling needed). Virtual threads are designed to be created per-task and are not meant to be pooled.

The original concern of "unbounded virtual threads" is largely a non-issue for virtual threads because they don't consume OS threads 1:1. The real concern would be unbounded task submission (memory for task objects), not thread count.

A better approach to bound concurrency with virtual threads would be:

Use Executors.newVirtualThreadPerTaskExecutor() (original) with a Semaphore to limit concurrency

Or use Executors.newFixedThreadPool(20) with platform threads if true pooling is desired

The current approach works correctly but is architecturally contradictory — it's a pattern mismatch that may confuse future maintainers.
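
The first option above, one thread per task with a Semaphore acquired inside the task, can be sketched as follows. InTaskSemaphoreDemo and its members are hypothetical names, and a cached platform-thread pool with a counting ThreadFactory stands in for Executors.newVirtualThreadPerTaskExecutor() so the sketch runs on JDKs older than 21. It reports both the number of threads the delegate created and the peak number of tasks holding a permit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch, not the PR's code.
final class InTaskSemaphoreDemo {
  /** Threads created by the delegate, and the peak number of tasks holding a permit. */
  static final class Stats {
    final int threadsCreated;
    final int peakRunning;

    Stats(int threadsCreated, int peakRunning) {
      this.threadsCreated = threadsCreated;
      this.peakRunning = peakRunning;
    }
  }

  static Stats run(int maxConcurrency, int tasks) {
    AtomicInteger created = new AtomicInteger();
    ThreadFactory counting = r -> {
      created.incrementAndGet();
      return new Thread(r);
    };
    // Assumption: stands in for a virtual-thread-per-task executor (Java 21+).
    ExecutorService perTask = Executors.newCachedThreadPool(counting);
    Semaphore permits = new Semaphore(maxConcurrency);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(tasks);
    try {
      for (int i = 0; i < tasks; i++) {
        perTask.execute(() -> {
          try {
            permits.acquire(); // the task's own thread waits here, after it was created
            try {
              peak.accumulateAndGet(running.incrementAndGet(), Math::max);
              Thread.sleep(50);
            } finally {
              running.decrementAndGet();
              permits.release();
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            done.countDown();
          }
        });
      }
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      perTask.shutdown();
    }
    return new Stats(created.get(), peak.get());
  }
}
```

The number of tasks doing work stays within the limit, while the number of threads created can exceed it, since waiting tasks already own a thread. That is cheap for virtual threads but is the distinction the later semaphore finding turns on.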

✅ Semaphore acquired after thread creation doesn't bound thread count

sonarqubecloud · 2026-02-20T16:20:35Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

* Fix Virtual Threads unbounded
* Bound all AsyncService paths with semaphore-wrapped executor

---------

Co-authored-by: Adrià Manero <adria.estivill@getcollate.io>

Fix Virtual Threads unbounded

fc1b2f5

mohityadav766 self-assigned this Feb 20, 2026

mohityadav766 temporarily deployed to test February 20, 2026 10:04 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test February 20, 2026 10:04 — with GitHub Actions Failure

github-actions Bot added the backend and safe to test labels Feb 20, 2026

gitar-bot Bot reviewed Feb 20, 2026

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java Outdated

gitar-bot Bot reviewed Feb 20, 2026

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java Outdated

manerow reviewed Feb 20, 2026

Bound all AsyncService paths with semaphore-wrapped executor

43d1cba

manerow had a problem deploying to test February 20, 2026 14:55 — with GitHub Actions Error

gitar-bot Bot reviewed Feb 20, 2026

Merge branch 'main' into fix-virtual-unbound'

ac0ebf5

mohityadav766 temporarily deployed to test February 20, 2026 15:17 — with GitHub Actions Inactive

pmbrull approved these changes Feb 20, 2026

pmbrull merged commit 1f97248 into main Feb 20, 2026
34 of 35 checks passed

pmbrull deleted the fix-virtual-unbound' branch February 20, 2026 16:40

pmbrull pushed a commit that referenced this pull request Feb 20, 2026

Fix Virtual Threads unbounded (#26013)

247bf20

mohityadav766 mentioned this pull request Apr 21, 2026

[Search Indexing] Follow-ups from #26154 review: stats init race, missing finally cleanup, pre-alias-swap deletion #27586

Closed

3 participants
