fix(reindex): unwedge orchestrator and recover stale Quartz job on crash by mohityadav766 · Pull Request #28699 · open-metadata/OpenMetadata

mohityadav766 · 2026-06-04T10:59:19Z

Problem

A distributed reindex that crashes or wedges mid-run could leave SearchIndexApp permanently un-retriggerable — every attempt rejected with "Job is already running, please wait for it to complete." — until a manual pod restart, even though the job row was already FAILED and the distributed lock released.

Observed in production (nbn-dev): the OM pod restarted mid-reindex during a cluster upgrade; afterward the job showed FAILED but retriggers were blocked, and only a pod restart cleared it.

Root cause — two independent paths

1. Orchestrator hang (local, AppScheduler:308).
SearchIndexApp.execute() runs the reindex synchronously on the Quartz worker thread, which parked on an unbounded workerLatch.await(). When a worker wedged on a degraded search backend (the DeadlineTimeoutException in the incident), it never counted down the latch, so the thread never returned and getCurrentlyExecutingJobs() kept reporting the app as running. The DB-side lock/recovery (JobRecoveryManager) correctly marked the job FAILED and released the lock, but it can't kill a wedged JVM thread — hence status said "free" while Quartz said "running."

2. Stale Quartz entry (cross-pod, AppScheduler:333).
The on-demand job is non-durable, so a crash leaves a persisted QRTZ_* JobDetail. Because the store is clustered, a retrigger from any pod then throws ObjectAlreadyExistsException. The old code rethrew unconditionally without checking whether the app was actually running, so a stale entry blocked retriggers indefinitely.

Fix

DistributedSearchIndexExecutor.awaitWorkers() replaces the unbounded await with a 5s poll loop. While the job keeps progressing it is never terminal, so it simply keeps waiting — no wall-clock cap (a healthy reindex can legitimately run for hours). The moment the job goes terminal/STOPPING, it forces stop() (which shutdownNow()-interrupts wedged workers) and returns; the existing finally performs the bounded drain. PartitionWorker was already stop-aware between batches.

AppScheduler.scheduleOnDemandJob() recovers a stale Quartz entry: on ObjectAlreadyExistsException it consults the DB-backed AppRunRecord (cross-pod truth). Genuinely active runs are rethrown; stale entries are cleared (deleteJob + unscheduleJob) and rescheduled once. Fail-safe: if the run record can't be read, it treats the app as active so a live job is never wrongly cleared. This is generic — it helps all non-concurrent on-demand apps.

Tests

DistributedSearchIndexExecutorTest (+2): orchestrator unwinds when the job is terminal even if a worker never finishes; isJobTerminalOrStopping state coverage.
AppSchedulerTest (+2): stale entry (terminal run record) → cleared and rescheduled; genuinely active run → rethrown, not cleared.

All green: DistributedSearchIndexExecutorTest 37/37, AppSchedulerTest 14/14. mvn spotless:apply run.

🤖 Generated with Claude Code

A distributed reindex that crashes or wedges mid-run could leave the SearchIndexApp permanently un-retriggerable ("Job is already running, please wait for it to complete.") until a manual pod restart, even though the job row was already FAILED and the distributed lock released. Two independent causes, both fixed: 1. Orchestrator hang (local path, AppScheduler:308). execute() runs the reindex synchronously on the Quartz worker thread, which parked on an unbounded workerLatch.await(). A worker wedged on a degraded search backend never counted down the latch, so the thread never returned and getCurrentlyExecutingJobs() kept reporting the app as running. Fix: awaitWorkers() polls job state every 5s; on terminal/STOPPING it forces stop() (shutdownNow interrupts wedged workers) and returns, letting the existing finally do the bounded drain. No wall-clock cap — a healthy multi-hour reindex is never terminal, so it keeps waiting. 2. Stale Quartz entry (cross-pod path, AppScheduler:333). The on-demand job is non-durable, so a crash leaves a persisted QRTZ_* JobDetail; because the store is clustered, a retrigger from any pod then throws ObjectAlreadyExistsException even when nothing runs. The old code rethrew unconditionally without checking whether the app was running. Fix: scheduleOnDemandJob() consults the DB-backed AppRunRecord (cross-pod truth) — genuinely active runs rethrow; stale entries are cleared and rescheduled once. Fail-safe: if the run record can't be read, treat as active so a live job is never wrongly cleared. Tests: DistributedSearchIndexExecutorTest (+2), AppSchedulerTest (+2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR improves resilience of distributed reindex orchestration by (1) preventing the Quartz execution thread from hanging indefinitely when a worker wedges, and (2) recovering from stale Quartz JobDetail/trigger entries that can block retriggers after a pod crash in a clustered Quartz setup.

Changes:

Replace unbounded CountDownLatch.await() with a poll loop that can unwind and force stop() once the job becomes terminal/STOPPING.
Add stale Quartz entry recovery on ObjectAlreadyExistsException by consulting the latest DB-backed AppRunRecord and clearing/rescheduling when the run is not active.
Add unit tests covering both the orchestrator unwind behavior and the stale Quartz-entry recovery behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java	Adds stale Quartz-entry recovery logic for on-demand app scheduling based on latest `AppRunRecord`.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java	Prevents indefinite orchestrator waits by polling latch completion and checking job terminal/stopping state.
openmetadata-service/src/test/java/org/openmetadata/service/apps/scheduler/AppSchedulerTest.java	Adds tests validating stale Quartz entry cleanup vs. rethrow when a run is genuinely active.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutorTest.java	Adds tests ensuring the orchestrator unwinds on terminal job state even if workers never finish.

+    } catch (ObjectAlreadyExistsException ex) {
+      if (hasActiveAppRun(application)) {
+        throw ex;
+      }


…ng, log clarity - ACTIVE_ERROR is a terminal AppRunRecord status (per OmAppJobListener), so the hand-rolled active-status set wrongly treated it as active and would leave retriggers wedged. Reuse OmAppJobListener.isTerminalStatus (now public) as the single source of truth instead. - Gate stale Quartz-entry recovery to non-concurrent jobs only. Concurrent jobs use a unique identity per run, so a collision is not a stale entry and the app-wide latest run record is not a reliable signal. - On a recovery reschedule that collides again (cross-pod race), let the ObjectAlreadyExistsException propagate to the standard "already running" message rather than disrupting the job another pod just scheduled. - awaitWorkers now returns whether workers drained normally; the caller logs "All workers completed" only on a normal drain, and a distinct warning on forced unwind, so stuck workers are easier to diagnose. Tests: AppSchedulerTest (+2: ACTIVE_ERROR stale recovery, concurrent collision rethrow without clearing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-04T14:04:46Z

🔴 Playwright Results — 1 failure(s), 14 flaky

✅ 4264 passed · ❌ 1 failed · 🟡 14 flaky · ⏭️ 89 skipped

Shard	Passed	Failed	Flaky	Skipped
🟡 Shard 1	299	0	2	4
🟡 Shard 2	803	0	1	9
🟡 Shard 3	801	0	4	8
🟡 Shard 4	854	0	1	12
🟡 Shard 5	720	0	1	47
🔴 Shard 6	787	1	5	9

Genuine Failures (failed on all attempts)

❌ Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6)

Error: �[2mexpect(�[22m�[31mlocator�[39m�[2m).�[22mtoBeVisible�[2m(�[22m�[2m)�[22m failed

Locator: locator('[data-testid="success-badge"]').or(locator('[data-testid="warning-badge"]'))
Expected: visible
Timeout: 210000ms
Error: element(s) not found

Call log:
�[2m  - Expect "toBeVisible" with timeout 210000ms�[22m
�[2m  - waiting for locator('[data-testid="success-badge"]').or(locator('[data-testid="warning-badge"]'))�[22m

🟡 14 flaky test(s) (passed on retry)

Flow/Tour.spec.ts › Tour should work from help section (shard 1, 2 retries)
Pages/SearchSettings.spec.ts › Preview config reflects reverted n-gram weight after save (shard 1, 1 retry)
Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Advanced Blocks (shard 3, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/Tasks/TaskNavigation.spec.ts › navigating to /table/TASK-XXXXX should show 404 (invalid URL pattern) (shard 3, 1 retry)
Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3, 1 retry)
Pages/CustomProperties.spec.ts › Email (shard 4, 1 retry)
Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Follow-up review caught that reusing OmAppJobListener.isTerminalStatus (which classifies ACTIVE_ERROR as terminal for run-timing purposes) was unsafe here: ACTIVE_ERROR is an in-flight status — apps set it while still progressing (CacheWarmupApp) and jobWasExecuted only normalizes it to FAILED when the run actually finishes; crash recovery (markRunningEntriesFailed*) only flips 'running', never 'activeError'. Treating it as terminal could make a retrigger delete a job another pod is genuinely running. Use a dedicated TERMINAL_RUN_STATUSES set {SUCCESS, FAILED, STOPPED, COMPLETED}; any other status (incl. ACTIVE_ERROR) counts as a live run we must not clear. Erring toward "active" is the safe direction — a stale entry is recoverable, deleting a live job is not. Revert isTerminalStatus back to private. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

+              .map(
+                  run ->
+                      run.getStatus() != null && !TERMINAL_RUN_STATUSES.contains(run.getStatus()))
+              .orElse(false);


+  /** Statuses that mean a run has finished; anything else (incl. ACTIVE_ERROR) is in-flight. */
+  private static final Set<AppRunRecord.Status> TERMINAL_RUN_STATUSES =
+      Set.of(
+          AppRunRecord.Status.SUCCESS,
+          AppRunRecord.Status.FAILED,
+          AppRunRecord.Status.STOPPED,
+          AppRunRecord.Status.COMPLETED);


+   * <p>"Active" is defined by {@link #TERMINAL_RUN_STATUSES}: any non-terminal status (including
+   * {@code ACTIVE_ERROR}, which is in-flight — set by apps that are still progressing and only
+   * normalized to {@code FAILED} when the run actually finishes) is treated as a live run we must
+   * not clear. Erring toward "active" is deliberate: leaving a stale entry is recoverable, while
+   * deleting a job another pod is genuinely running risks a duplicate/disrupted execution.


+  private boolean awaitWorkers(CountDownLatch workerLatch, UUID jobId) throws InterruptedException {
+    boolean drained = false;
+    boolean done = false;
+    while (!done) {
+      drained = workerLatch.await(LATCH_POLL_INTERVAL_SECONDS, TimeUnit.SECONDS);
+      if (drained) {
+        done = true;
+      } else if (isJobTerminalOrStopping(jobId)) {
+        LOG.warn(
+            "Job {} is terminal/stopping but workers have not drained; forcing executor "
+                + "shutdown so the orchestrator can unwind",
+            jobId);
+        stop();
+        done = true;
+      }
+    }
+    return drained;
+  }


sonarqubecloud · 2026-06-04T17:26:09Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

…rrors The orchestrator's awaitWorkers loop polls coordinator.getJob() every cycle to detect a terminal/STOPPING transition. The previous workerLatch.await() was DB-independent; the polling added thousands of getJob() reads on the orchestrator's critical path over a multi-hour reindex, any one of which could throw (connection reset, pool exhaustion) and tear the job down via the finally block. - isJobTerminalOrStopping now wraps the read in try/catch and treats a read failure as non-terminal, so a transient DB blip keeps the orchestrator waiting and the wedge unwinds on the next clean poll (mirrors the hasActiveAppRun fail-safe). - Widen the re-check cadence 5s -> 15s (3x less steady DB load; unwind is not latency-critical) and make it an injectable instance field so tests stay fast instead of slowing 3x. Tests: +2 (transient-read keeps waiting then unwinds; isJobTerminalOrStopping treats an unreadable state as non-terminal); existing unwind test injects a 1s interval. All 54 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gitar-bot · 2026-06-05T00:27:25Z

Code Review ✅ Approved 3 resolved / 3 findings

Implements a fail-safe poll loop for the reindex orchestrator and a robust stale-job recovery mechanism, resolving thread-hanging and cross-pod Quartz blocking issues.

✅ 3 resolved

✅ Edge Case: Stale-job recovery is a non-atomic check-then-act across pods

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java:362-376
scheduleOnDemandJob() recovers from ObjectAlreadyExistsException by reading the DB-backed run record (hasActiveAppRun) and, if not active, calling deleteJob + unscheduleJob + scheduleJob. This check-then-act sequence is not atomic with respect to other pods (the store is clustered) nor with a job that starts executing in the window between the read and the delete.

Concrete race: two pods both retrigger; both fail the first scheduleJob, both call hasActiveAppRun() and read the same stale terminal record (FAILED) so both proceed. Pod A then deleteJob+scheduleJob and its trigger fires the job. Pod B, still acting on its earlier read, executes deleteJob — now deleting/disrupting the job A just (re)scheduled — and schedules its own. The result can be a disrupted run or a second concurrent execution of an app declared allowConcurrentExecution=false. The earlier guard in triggerOnDemandApplication (getCurrentlyExecutingJobs()) only inspects the local pod, so it does not close this cross-pod window.

The window is narrow (requires near-simultaneous cross-pod retriggers), but since the whole purpose of this path is crash recovery in a clustered deployment, it is worth hardening. Consider re-checking hasActiveAppRun() immediately before the reschedule, and/or catching a second ObjectAlreadyExistsException from the recovery scheduleJob and re-deriving active state rather than letting it surface as the misleading "Job is already running" message.

✅ Bug: awaitWorkers still hangs if a wedged worker leaves job non-terminal

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java:946-960 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java:676-685
awaitWorkers() only breaks out of the poll loop when isJobTerminalOrStopping(jobId) becomes true (job null / terminal / STOPPING). The unwedge guarantee therefore depends entirely on some external party flipping the job to a terminal state. A worker wedged inside PartitionWorker.processPartition() on a degraded backend does not itself change job status — runWorkerLoop only re-checks status between partitions (line 679), not during a single blocked partition. Meanwhile the lock-refresh thread (runLockRefreshLoop) and heartbeat thread (runPartitionHeartbeatLoop) keep running on separate threads, so the distributed lock is not lost and JobRecoveryManager may not consider the run stale. In that case the job stays RUNNING, isJobTerminalOrStopping stays false, and awaitWorkers polls forever — the exact orchestrator-hang this PR set out to remove.

This may be acceptable if you are confident that a wedged worker always eventually triggers lock loss/recovery, but that assumption is not enforced by this code. Consider a defensive escape that also accounts for "no partition has made progress in N intervals" (heartbeats/partition completion stalled) rather than relying solely on the job-status transition.

✅ Edge Case: ACTIVE_ERROR run treated as stale can clear a live cross-pod job

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java:383-397 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/OmAppJobListener.java:82-87 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/OmAppJobListener.java:211-222
This commit replaces the explicit ACTIVE_RUN_STATUSES set (which listed ACTIVE_ERROR as active) with !OmAppJobListener.isTerminalStatus(status) in hasActiveAppRun (AppScheduler.java:389-392). isTerminalStatus classifies ACTIVE_ERROR as terminal (OmAppJobListener.java:84). However, ACTIVE_ERROR is an in-flight status: jobWasExecuted rewrites it to FAILED when a run actually completes (OmAppJobListener.java:211-222), and CacheWarmupApp sets ACTIVE_ERROR on a run that is still progressing/retriable (CacheWarmupApp.java:447-453). A persisted latest-run record with ACTIVE_ERROR therefore most often means a run is genuinely executing (commonly on another pod, since this is the clustered-store recovery path), not a stale crash remnant.

Consequence: a retrigger from pod B hits ObjectAlreadyExistsException, reads the latest run as ACTIVE_ERROR, treats it as stale, and deleteJob + unscheduleJob + reschedules — clearing/duplicating the Quartz job that pod A is actively running. This directly contradicts the PR's stated fail-safe that "a live job is never wrongly cleared," and unlike RUNNING/STARTED (which remain protected because they are non-terminal), ACTIVE_ERROR was flipped to the unsafe side. The new test testNonConcurrentApp_treatsActiveErrorRunAsStale encodes this behavior as intentional, but the semantics of ACTIVE_ERROR make it risky.

Suggest using a dedicated 'active run' predicate that keeps ACTIVE_ERROR (and any other in-flight status) as active rather than reusing the terminal-timings classifier, which serves a different purpose.

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Copilot AI review requested due to automatic review settings June 4, 2026 10:59

Copilot started reviewing on behalf of mohityadav766 June 4, 2026 10:59 View session

github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 4, 2026

Copilot AI reviewed Jun 4, 2026

View reviewed changes

gitar-bot Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java

gitar-bot Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread ...penmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java Outdated

mohityadav766 added the To release Will cherry-pick this PR into the release branch label Jun 4, 2026

gitar-bot Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java

mohityadav766 temporarily deployed to test June 4, 2026 11:25 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test June 4, 2026 11:25 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test June 4, 2026 11:25 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings June 4, 2026 16:14

Copilot started reviewing on behalf of mohityadav766 June 4, 2026 16:14 View session

Merge branch 'main' into fix/reindex-orchestrator-wedge-stale-quartz-job

ae5938f

Copilot AI reviewed Jun 4, 2026

View reviewed changes

mohityadav766 temporarily deployed to test June 4, 2026 16:28 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test June 4, 2026 16:28 — with GitHub Actions Failure

harshach temporarily deployed to test June 5, 2026 00:37 — with GitHub Actions Inactive

harshach had a problem deploying to test June 5, 2026 00:37 — with GitHub Actions Failure

harshach temporarily deployed to test June 5, 2026 00:37 — with GitHub Actions Inactive

mohityadav766 mentioned this pull request Jun 5, 2026

Add external run Tests #28637

Open

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(reindex): unwedge orchestrator and recover stale Quartz job on crash#28699

fix(reindex): unwedge orchestrator and recover stale Quartz job on crash#28699
mohityadav766 wants to merge 5 commits into
mainfrom
fix/reindex-orchestrator-wedge-stale-quartz-job

mohityadav766 commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

sonarqubecloud Bot commented Jun 4, 2026

Uh oh!

gitar-bot Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mohityadav766 commented Jun 4, 2026

Problem

Root cause — two independent paths

Fix

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔴 Playwright Results — 1 failure(s), 14 flaky

Genuine Failures (failed on all attempts)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

sonarqubecloud Bot commented Jun 4, 2026

Quality Gate passed for 'open-metadata-ingestion'

Uh oh!

gitar-bot Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jun 4, 2026 •

edited

Loading