
Fix Virtual Threads unbounded #26013

Merged
pmbrull merged 3 commits into main from fix-virtual-unbound'
Feb 20, 2026

Conversation


@mohityadav766 mohityadav766 commented Feb 20, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Virtual thread concurrency limiting:
    • Added BoundedExecutorService wrapper enforcing semaphore-based concurrency limits on all async tasks
    • Automatically resolves safe limits from environment variable, database pool size (1/3), or CPU cores (min 4)
    • Prevents unbounded resource consumption during sustained background task execution
  • Graceful thread lifecycle management:
    • DistributedJobParticipant and DistributedSearchIndexExecutor now properly track and interrupt long-running virtual threads with configurable timeouts
    • Added interruptAndJoin() helper method centralizing thread shutdown logic
    • Ensures clean termination on service stop without resource leaks
  • AsyncService shutdown integration:
    • New execute() and submit() convenience methods for bounded async operations
    • Added 30-second graceful shutdown timeout with forced termination fallback
    • Called during application shutdown to coordinate executor service termination
  • Audit event migration:
    • Migrated AuditLogRepository.writeAuthEvent() from unbounded Thread.startVirtualThread() to bounded AsyncService.execute()
    • Preserves async audit trail writing while respecting concurrency limits
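The lifecycle bullets above describe two shutdown patterns: interrupting and joining a tracked thread, and draining an executor with a timeout before forcing termination. A minimal sketch of both follows. The class name `LifecycleHelpers`, the method `shutdownGracefully`, and the `interruptAndJoin` signature are illustrative assumptions rather than code taken from the PR diff, and platform threads stand in for virtual threads so the sketch also runs on JDKs older than 21.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LifecycleHelpers {

  /**
   * Interrupts the thread and waits up to timeoutMillis (must be > 0, since
   * join(0) waits forever). Returns true if the thread is no longer alive.
   */
  static boolean interruptAndJoin(Thread thread, long timeoutMillis) {
    if (thread == null || !thread.isAlive()) {
      return true;
    }
    thread.interrupt();
    try {
      thread.join(timeoutMillis);
    } catch (InterruptedException e) {
      // Preserve the caller's interrupt status instead of swallowing it.
      Thread.currentThread().interrupt();
    }
    return !thread.isAlive();
  }

  /** Orderly shutdown first; falls back to shutdownNow() if the timeout elapses. */
  static boolean shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
    executor.shutdown(); // stop accepting new tasks, let queued ones finish
    try {
      if (executor.awaitTermination(timeout, unit)) {
        return true;
      }
      executor.shutdownNow(); // interrupt whatever is still running
      return executor.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Starts a long sleeper and stops it with interruptAndJoin. */
  static boolean demoInterruptAndJoin() {
    Thread sleeper = new Thread(() -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        // Interrupted: fall through and let the thread exit.
      }
    });
    sleeper.start();
    return interruptAndJoin(sleeper, 2_000);
  }

  /** Shuts down an executor that only ran a trivial task. */
  static boolean demoShutdown() {
    ExecutorService executor = Executors.newCachedThreadPool();
    executor.execute(() -> {});
    return shutdownGracefully(executor, 30, TimeUnit.SECONDS);
  }

  public static void main(String[] args) {
    System.out.println(demoInterruptAndJoin());
    System.out.println(demoShutdown());
  }
}
```

Both helpers restore the interrupt flag rather than swallowing `InterruptedException`, so a caller that is itself being shut down still sees the interruption.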


}
}
});
participantThread =

⚠️ Edge Case: Race condition: joinAndProcessJob can overwrite active thread ref

joinAndProcessJob() guards against concurrent calls with participating.compareAndSet(false, true), and onJobDiscovered() checks participating.get() before calling it. However, there's a race window in the finally block of the spawned thread (line 215-216):

finally {
    participantThread = null;     // line 215
    participating.set(false);      // line 216

Between setting participantThread = null and participating.set(false), the onJobDiscovered callback (running on a different thread) could see participating == true, skip the call, and miss a job notification. More concerning: if a new onJobDiscovered fires right after participating.set(false) on line 216 but before the virtual thread fully exits, the new call to joinAndProcessJob will set participantThread to a new thread while the old one is still in its finally block.

The stop() path is less exposed than it first appears. stop() copies participantThread into a local variable (Thread thread = participantThread, line 121) before the thread's finally block can null the field, so the interrupt/join itself is safe. To keep the race window small, participating.set(false) should be the last cleanup step in the finally block, after all other cleanup, including resetting the notifier. The current code already clears participantThread and currentJobId before participating goes false, which is the right relative order. So this is mostly a theoretical concern, but it is worth noting that the window exists.
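The guard and cleanup ordering discussed above can be sketched in isolation. This is a simplified stand-in, not the actual DistributedJobParticipant: the class name `ParticipantGuard` and its method signatures are assumptions, and a platform thread is used in place of a virtual thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class ParticipantGuard {
  private final AtomicBoolean participating = new AtomicBoolean(false);
  private volatile Thread participantThread;

  /** Starts the job on a new thread unless one is already running. */
  boolean joinAndProcess(Runnable job) {
    // Only one caller can flip false -> true; everyone else backs off.
    if (!participating.compareAndSet(false, true)) {
      return false;
    }
    Thread worker = new Thread(() -> {
      try {
        job.run();
      } finally {
        participantThread = null; // clear shared state first
        participating.set(false); // release the guard last
      }
    });
    participantThread = worker; // published before the worker can run
    worker.start();
    return true;
  }

  /** Captures the thread reference once, then interrupts and joins it. */
  void stop(long timeoutMillis) throws InterruptedException {
    Thread thread = participantThread; // local copy survives the worker nulling the field
    if (thread != null) {
      thread.interrupt();
      thread.join(timeoutMillis);
    }
  }

  /** Shows that a second join is refused while the first is still active. */
  static String demo() throws InterruptedException {
    ParticipantGuard guard = new ParticipantGuard();
    CountDownLatch gate = new CountDownLatch(1);
    boolean first = guard.joinAndProcess(() -> {
      try {
        gate.await(); // blocks until interrupted by stop()
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    boolean second = guard.joinAndProcess(() -> {});
    guard.stop(2_000);
    boolean third = guard.joinAndProcess(() -> {});
    guard.stop(2_000);
    return first + "," + second + "," + third;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(demo()); // prints "true,false,true"
  }
}
```

Because the guard is released last, any caller that wins the next compareAndSet observes a fully cleaned-up participant.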


.withTimestamp(System.currentTimeMillis());
write(changeEvent);
});
AsyncService.getInstance()

💡 Bug: Swallowed exceptions from writeAuthEvent after pool migration

The writeAuthEvent method now submits to the executor pool via .submit(), which returns a Future. If the task throws an exception (e.g., database write failure in write(changeEvent)), the exception is silently swallowed because the returned Future is never checked.

With the previous Thread.startVirtualThread(), uncaught exceptions would at least be logged by the default uncaught exception handler. With ExecutorService.submit(), exceptions are captured in the Future and only thrown when Future.get() is called — which never happens here.

Consider using .execute() instead of .submit() so that the virtual thread factory's uncaught exception handler can log failures, or wrap the Runnable with try/catch and explicit logging.

Suggested fix:

    AsyncService.getInstance()
        .getExecutorService()
        .execute(
            () -> {
              ChangeEvent changeEvent =
                  new ChangeEvent()
                      .withId(UUID.randomUUID())
                      .withEventType(eventType)
                      .withEntityType(Entity.USER)
                      .withEntityId(userId)
                      .withUserName(userName)
                      .withTimestamp(System.currentTimeMillis());
              write(changeEvent);
            });
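The difference between the two submission methods can be reproduced outside the service. The sketch below is illustrative only: `SubmitVsExecute` and `whereFailureSurfaces` are hypothetical names, and a cached platform-thread pool with a custom thread factory stands in for the virtual-thread executor so the example runs on older JDKs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class SubmitVsExecute {

  /** Runs a failing task via submit() or execute() and reports where the failure surfaced. */
  static String whereFailureSurfaces(boolean useSubmit) throws InterruptedException {
    CountDownLatch handlerFired = new CountDownLatch(1);
    // Every worker thread gets a handler that records uncaught exceptions.
    ThreadFactory factory = runnable -> {
      Thread thread = new Thread(runnable);
      thread.setUncaughtExceptionHandler((t, error) -> handlerFired.countDown());
      return thread;
    };
    ExecutorService executor = Executors.newCachedThreadPool(factory);
    Runnable failing = () -> {
      throw new IllegalStateException("write failed");
    };
    try {
      if (useSubmit) {
        Future<?> future = executor.submit(failing);
        try {
          future.get();
        } catch (ExecutionException e) {
          // The failure is visible here only because get() was called.
        }
        // The task's exception was captured by the Future, so the handler stays silent.
        return handlerFired.await(200, TimeUnit.MILLISECONDS) ? "handler" : "future-only";
      }
      executor.execute(failing);
      // With execute(), the exception escapes the worker and reaches the handler.
      return handlerFired.await(5, TimeUnit.SECONDS) ? "handler" : "lost";
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(whereFailureSurfaces(true));  // prints "future-only"
    System.out.println(whereFailureSurfaces(false)); // prints "handler"
  }
}
```

If the Future from submit() is discarded, as in fire-and-forget audit writes, nothing ever reports the failure.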


}

@Override
public void execute(Runnable command) {

⚠️ Edge Case: Semaphore acquired after thread creation doesn't bound thread count

The BoundedExecutorService.execute() method calls delegate.execute(wrappedTask) which creates a new virtual thread immediately, and the semaphore.acquire() only happens inside that already-spawned thread. Under burst submission (e.g., a flood of audit events), this allows unbounded virtual thread creation — threads are spawned eagerly and then park on the semaphore.

While parked virtual threads are lightweight (they don't consume OS threads), this defeats the PR's stated goal of preventing "unbounded virtual thread creation." The approach does effectively bound the number of concurrently running tasks, which protects downstream resources like DB connections.

To actually bound thread creation, the semaphore should be acquired before delegating to the underlying executor:

public void execute(Runnable command) {
    try {
        semaphore.acquire();  // block the CALLER, not the spawned thread
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted waiting for concurrency permit", e);
    }
    delegate.execute(() -> {
        try {
            command.run();
        } finally {
            semaphore.release();
        }
    });
}

This would apply back-pressure to callers and truly bound the number of created threads. The trade-off is that callers (like the audit log writer) would block until a permit is available, which may be acceptable since they're already running on virtual threads.

Suggested fix:

    @Override
    public void execute(Runnable command) {
      try {
        semaphore.acquire();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted waiting for concurrency permit", e);
      }
      delegate.execute(
          () -> {
            try {
              command.run();
            } finally {
              semaphore.release();
            }
          });
    }
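To check that acquiring the permit on the caller's side really caps the number of in-flight tasks, the pattern can be exercised with a small harness. This is a sketch under stated assumptions: `CallerSideBound`, `executeBounded`, and `peakConcurrency` are hypothetical names, and a cached platform-thread pool stands in for the virtual-thread delegate. It also releases the permit if the delegate rejects the task, a case the suggested fix does not cover.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerSideBound {

  /** Blocks the caller until a permit is free, then hands the task to the delegate. */
  static void executeBounded(Semaphore semaphore, Executor delegate, Runnable command)
      throws InterruptedException {
    semaphore.acquire();
    try {
      delegate.execute(() -> {
        try {
          command.run();
        } finally {
          semaphore.release(); // permit returns once the task has finished
        }
      });
    } catch (RejectedExecutionException e) {
      semaphore.release(); // do not leak the permit if the task never starts
      throw e;
    }
  }

  /** Submits `tasks` short jobs and returns the highest number seen running at once. */
  static int peakConcurrency(int permits, int tasks) throws InterruptedException {
    Semaphore semaphore = new Semaphore(permits);
    ExecutorService delegate = Executors.newCachedThreadPool();
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(tasks);
    try {
      for (int i = 0; i < tasks; i++) {
        executeBounded(semaphore, delegate, () -> {
          try {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.sleep(10); // simulate a short unit of work
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            running.decrementAndGet();
            done.countDown();
          }
        });
      }
      done.await();
      return peak.get();
    } finally {
      delegate.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("peak = " + peakConcurrency(3, 20)); // never exceeds 3
  }
}
```

Since a task only runs between acquire() and release(), the observed peak cannot exceed the permit count, and the submitting loop itself slows down when all permits are taken.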


}

private static int resolveMaxConcurrency() {
String env = System.getenv("ASYNC_SERVICE_MAX_CONCURRENCY");

💡 Quality: Invalid env var ASYNC_SERVICE_MAX_CONCURRENCY silently ignored

When ASYNC_SERVICE_MAX_CONCURRENCY is set but contains an invalid value (non-numeric string, zero, or negative number), the code silently falls through to the CPU/DB-pool heuristic at lines 46-47. This can confuse operators who set the env var expecting explicit control but get different behavior with no feedback.

A log warning would help operators detect misconfiguration quickly, especially since this is the highest-priority configuration source.

Suggested fix:

    String env = System.getenv("ASYNC_SERVICE_MAX_CONCURRENCY");
    if (env != null) {
      try {
        int value = Integer.parseInt(env.trim());
        if (value > 0) {
          return value;
        }
        LOG.warn("ASYNC_SERVICE_MAX_CONCURRENCY must be positive, got {}; falling back to auto", value);
      } catch (NumberFormatException e) {
        LOG.warn("ASYNC_SERVICE_MAX_CONCURRENCY is not a valid integer: '{}'; falling back to auto", env);
      }
    }



gitar-bot Bot commented Feb 20, 2026

🔍 CI failure analysis for ac0ebf5: After semaphore-based concurrency limiting: 2 search index timeout failures. Tests expect entities to appear in 1 minute but semaphore may throttle async indexing tasks, causing delays.

Issue

CI failures after new commit (43d1cba) with BoundedExecutorService:

2 tests failed out of 10,714 (0.02% rate):

  1. DomainResourceIT.java:889: Domain not found in search index
  2. TableResourceIT.java:1: Entity didn't appear in search index within 1 minute

Both are search index synchronization timeouts.

Root Cause

Semaphore-based concurrency limiting may be throttling async indexing operations.

New commit introduced:

  • BoundedExecutorService wrapper with semaphore on ALL execution paths
  • Every task acquires permit before running, releases after
  • Max concurrency: min(CPU_budget, database_pool_size / 3) with minimum 4
  • Environment override: ASYNC_SERVICE_MAX_CONCURRENCY

Critical calculation for tests:

If database_pool_size = 10:
  maxConcurrency = max(4, min(CPU_budget, 10/3)) = max(4, min(CPU_budget, 3)) = 4
  
With only 4 concurrent async slots:
  10,714 tests → thousands of entities → indexing queue
  Tests asserting within 1 minute → timeouts while queue drains
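The arithmetic above can be checked with a small standalone function. This is a sketch of the resolution order as described in this thread (environment override first, then the CPU/pool heuristic with a floor of 4). `ConcurrencyResolver` and its parameters are hypothetical, and the environment value is passed in as an argument so the logic can be tested without touching real environment variables.

```java
public class ConcurrencyResolver {

  /**
   * Resolves the concurrency limit.
   *
   * @param env value of ASYNC_SERVICE_MAX_CONCURRENCY, or null when unset
   * @param cpuBudget CPU-derived budget (assumed to be computed elsewhere)
   * @param poolSize database connection pool size
   */
  static int resolve(String env, int cpuBudget, int poolSize) {
    if (env != null) {
      try {
        int value = Integer.parseInt(env.trim());
        if (value > 0) {
          return value; // an explicit, valid override wins
        }
      } catch (NumberFormatException e) {
        // Invalid override: fall through to the heuristic.
      }
    }
    // A third of the DB pool, capped by the CPU budget, never below 4.
    return Math.max(4, Math.min(cpuBudget, poolSize / 3));
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, 8, 10));   // 10 / 3 = 3, floor of 4 applies -> 4
    System.out.println(resolve(null, 16, 30));  // min(16, 10) = 10 -> 10
    System.out.println(resolve("32", 8, 10));   // explicit override -> 32
    System.out.println(resolve("abc", 16, 60)); // invalid override, min(16, 20) -> 16
  }
}
```

With a pool of 10 the result is 4 regardless of the CPU budget, which matches the calculation above.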

Details

Background errors (non-fatal):

  • OpenSearch version conflicts on chart/domain indices
  • Search query failed: search_phase_execution_exception: all shards failed
  • Domain FQN update conflicts

Key observations:

  1. Different failure pattern: Previous runs had version conflicts; this run has timeouts
  2. Timing-sensitive: Search indexing is async via executor
  3. Semaphore enforcement: Now blocks ALL paths (execute, submit)
  4. Test expectations: Still expect 1-minute completion

Relevance Assessment

HIGH CONFIDENCE THIS IS RELATED TO SEMAPHORE:

Evidence:

  • Timing failures (not version conflicts like before)
  • Search operations affected (async background tasks)
  • Conservative limits (pool_size / 3 formula)
  • Tests have fixed timeouts (1 minute wait)

How semaphore causes delays:

  1. Background indexing tasks queue up waiting for permits
  2. With low concurrency (4-8), queue drains slowly
  3. Tests assert entity presence before indexing completes
  4. 1-minute timeout expires → test fails

Example scenario:

Test creates entity → DB write succeeds immediately
→ Async indexing task submitted
→ Semaphore has 0 available permits (4 other tasks running)
→ Task waits in queue
→ Test polls search index every few seconds
→ 60 seconds elapse
→ Task still hasn't acquired permit / hasn't indexed
→ Test fails

Recommendation

PRIORITY: Increase Concurrency for Test Environment

Option 1: Environment variable (quickest)

export ASYNC_SERVICE_MAX_CONCURRENCY=32  # in test CI config

Option 2: Test-specific calculation

if (isTestEnvironment()) {
  return Math.max(16, cpuBudget);  // Higher for tests
}
return Math.max(4, Math.min(cpuBudget, poolSize / 3));  // Conservative for prod

Option 3: Non-blocking for indexing

// In execute() method
if (!semaphore.tryAcquire(5, TimeUnit.SECONDS)) {
  LOG.warn("Semaphore timeout, running task anyway");
  delegate.execute(command);  // Bypass semaphore
  return;
}
// ... rest of wrapped execution
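A fuller version of the Option 3 fragment might look like the sketch below. `TimedPermit`, `executeWithTimeout`, and `demo` are hypothetical names. Note that `Semaphore.tryAcquire(long, TimeUnit)` throws `InterruptedException`, which the fragment above does not handle, and that every bypassed task runs outside the concurrency bound.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class TimedPermit {

  /**
   * Waits up to the timeout for a permit. Returns true if the task ran under
   * the bound, false if the wait timed out and the task bypassed the semaphore.
   */
  static boolean executeWithTimeout(
      Semaphore semaphore, Executor delegate, Runnable command, long timeout, TimeUnit unit) {
    boolean acquired;
    try {
      acquired = semaphore.tryAcquire(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RejectedExecutionException("Interrupted waiting for concurrency permit", e);
    }
    if (!acquired) {
      delegate.execute(command); // bypass: this task is not counted against the limit
      return false;
    }
    try {
      delegate.execute(() -> {
        try {
          command.run();
        } finally {
          semaphore.release();
        }
      });
    } catch (RejectedExecutionException e) {
      semaphore.release(); // the task never started, so hand the permit back
      throw e;
    }
    return true;
  }

  /** Runs a no-op task on the calling thread and reports which path was taken. */
  static String demo(int permits) {
    Semaphore semaphore = new Semaphore(permits);
    Executor direct = Runnable::run; // runs the task synchronously, for illustration only
    boolean bounded = executeWithTimeout(semaphore, direct, () -> {}, 10, TimeUnit.MILLISECONDS);
    return (bounded ? "bounded" : "bypassed") + ":" + semaphore.availablePermits();
  }

  public static void main(String[] args) {
    System.out.println(demo(1)); // prints "bounded:1"
    System.out.println(demo(0)); // prints "bypassed:0"
  }
}
```

The trade-off is that under sustained load the timeout path reintroduces the unbounded behavior the semaphore was meant to prevent, so the timeout value effectively decides how strict the limit is.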

Option 4: Increase test timeouts (treating symptom)
Change 1-minute waits to 2-3 minutes.

Assessment

The semaphore approach is architecturally sound but needs tuning:

  • Production: Conservative limits prevent resource exhaustion ✅
  • Tests: Need higher limits or separate config ⚠️
  • Search indexing: May need non-blocking or higher priority ⚠️

Recommended fix: Set ASYNC_SERVICE_MAX_CONCURRENCY=32 in test CI configuration or adjust formula to be less conservative for test workloads.

Code Review ⚠️ Changes requested 3 resolved / 6 findings

The semaphore-based bounded executor is a good improvement over the previous fixed-pool approach — it's idiomatic for virtual threads. The remaining previous findings (race condition in joinAndProcessJob, silent env var handling, swallowed exceptions in writeAuthEvent) are still present but low-impact.

⚠️ Edge Case: Race condition: joinAndProcessJob can overwrite active thread ref

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobParticipant.java:207 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedJobParticipant.java:215


💡 Quality: Invalid env var ASYNC_SERVICE_MAX_CONCURRENCY silently ignored

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:39

💡 Bug: Swallowed exceptions from writeAuthEvent after pool migration

📄 openmetadata-service/src/main/java/org/openmetadata/service/audit/AuditLogRepository.java:126

✅ 3 resolved

Performance: Fixed pool of 20 shared by 20+ callers risks thread starvation

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:21 📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:25

The AsyncService executor is shared by at least 20+ call sites across the application: EntityResource (5 uses), ColumnResource (2), SearchResource (2), LineageResource (2), AuditLogResource, AppResource, GlossaryTermResource, TestSuiteRepository, UserRepository, EntityCsv (2), and now AuditLogRepository.

A fixed pool of 20 threads means that under load, long-running tasks (e.g., CSV exports in EntityCsv, search operations in SearchResource) can block all 20 threads, preventing short tasks like writeAuthEvent() from executing. This is a classic thread starvation problem — the bounded pool addresses unbounded thread creation but introduces a new failure mode.

Additionally, Executors.newFixedThreadPool uses an unbounded LinkedBlockingQueue, so the queue itself can grow without limit when all 20 threads are busy, which ironically doesn't fully solve the resource exhaustion concern — it just shifts it from thread count to queue memory.

Suggestions:

  1. Make POOL_SIZE configurable (e.g., via environment variable or application config) so deployments can tune it based on their workload.
  2. Consider using a ThreadPoolExecutor with a bounded queue and a rejection policy (e.g., CallerRunsPolicy) to provide backpressure instead of unbounded queuing.
  3. At minimum, document why 20 was chosen and the trade-offs involved.

Quality: Using fixed pool defeats the purpose of virtual threads

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/AsyncService.java:25

Executors.newFixedThreadPool(20, Thread.ofVirtual().factory()) creates a fixed pool where virtual threads are reused like platform threads — this effectively negates the key benefit of virtual threads (cheap creation/destruction, no pooling needed). Virtual threads are designed to be created per-task and are not meant to be pooled.

The original concern of "unbounded virtual threads" is largely a non-issue for virtual threads because they don't consume OS threads 1:1. The real concern would be unbounded task submission (memory for task objects), not thread count.

A better approach to bound concurrency with virtual threads would be:

  • Use Executors.newVirtualThreadPerTaskExecutor() (original) with a Semaphore to limit concurrency
  • Or use Executors.newFixedThreadPool(20) with platform threads if true pooling is desired

The current approach works correctly but is architecturally contradictory — it's a pattern mismatch that may confuse future maintainers.

  • ✅ Semaphore acquired after thread creation doesn't bound thread count



@pmbrull pmbrull merged commit 1f97248 into main Feb 20, 2026
34 of 35 checks passed
@pmbrull pmbrull deleted the fix-virtual-unbound' branch February 20, 2026 16:40
pmbrull pushed a commit that referenced this pull request Feb 20, 2026
* Fix Virtual Threads unbounded

* Bound all AsyncService paths with semaphore-wrapped executor

---------

Co-authored-by: Adrià Manero <adria.estivill@getcollate.io>

Labels

backend, safe to test
