Skip to content

fix(reindex): unwedge orchestrator and recover stale Quartz job on crash#28699

Open
mohityadav766 wants to merge 5 commits into
mainfrom
fix/reindex-orchestrator-wedge-stale-quartz-job
Open

fix(reindex): unwedge orchestrator and recover stale Quartz job on crash#28699
mohityadav766 wants to merge 5 commits into
mainfrom
fix/reindex-orchestrator-wedge-stale-quartz-job

Conversation

@mohityadav766
Copy link
Copy Markdown
Member

Problem

A distributed reindex that crashes or wedges mid-run could leave SearchIndexApp permanently un-retriggerable — every attempt rejected with "Job is already running, please wait for it to complete." — until a manual pod restart, even though the job row was already FAILED and the distributed lock released.

Observed in production (nbn-dev): the OM pod restarted mid-reindex during a cluster upgrade; afterward the job showed FAILED but retriggers were blocked, and only a pod restart cleared it.

Root cause — two independent paths

1. Orchestrator hang (local, AppScheduler:308).
SearchIndexApp.execute() runs the reindex synchronously on the Quartz worker thread, which parked on an unbounded workerLatch.await(). When a worker wedged on a degraded search backend (the DeadlineTimeoutException in the incident), it never counted down the latch, so the thread never returned and getCurrentlyExecutingJobs() kept reporting the app as running. The DB-side lock/recovery (JobRecoveryManager) correctly marked the job FAILED and released the lock, but it can't kill a wedged JVM thread — hence status said "free" while Quartz said "running."

2. Stale Quartz entry (cross-pod, AppScheduler:333).
The on-demand job is non-durable, so a crash leaves a persisted QRTZ_* JobDetail. Because the store is clustered, a retrigger from any pod then throws ObjectAlreadyExistsException. The old code rethrew unconditionally without checking whether the app was actually running, so a stale entry blocked retriggers indefinitely.

Fix

DistributedSearchIndexExecutor.awaitWorkers() replaces the unbounded await with a 5s poll loop. While the job keeps progressing it is never terminal, so it simply keeps waiting — no wall-clock cap (a healthy reindex can legitimately run for hours). The moment the job goes terminal/STOPPING, it forces stop() (which shutdownNow()-interrupts wedged workers) and returns; the existing finally performs the bounded drain. PartitionWorker was already stop-aware between batches.

AppScheduler.scheduleOnDemandJob() recovers a stale Quartz entry: on ObjectAlreadyExistsException it consults the DB-backed AppRunRecord (cross-pod truth). Genuinely active runs are rethrown; stale entries are cleared (deleteJob + unscheduleJob) and rescheduled once. Fail-safe: if the run record can't be read, it treats the app as active so a live job is never wrongly cleared. This is generic — it helps all non-concurrent on-demand apps.

Tests

  • DistributedSearchIndexExecutorTest (+2): orchestrator unwinds when the job is terminal even if a worker never finishes; isJobTerminalOrStopping state coverage.
  • AppSchedulerTest (+2): stale entry (terminal run record) → cleared and rescheduled; genuinely active run → rethrown, not cleared.

All green: DistributedSearchIndexExecutorTest 37/37, AppSchedulerTest 14/14. mvn spotless:apply run.

🤖 Generated with Claude Code

A distributed reindex that crashes or wedges mid-run could leave the
SearchIndexApp permanently un-retriggerable ("Job is already running,
please wait for it to complete.") until a manual pod restart, even though
the job row was already FAILED and the distributed lock released.

Two independent causes, both fixed:

1. Orchestrator hang (local path, AppScheduler:308). execute() runs the
   reindex synchronously on the Quartz worker thread, which parked on an
   unbounded workerLatch.await(). A worker wedged on a degraded search
   backend never counted down the latch, so the thread never returned and
   getCurrentlyExecutingJobs() kept reporting the app as running.
   Fix: awaitWorkers() polls job state every 5s; on terminal/STOPPING it
   forces stop() (shutdownNow interrupts wedged workers) and returns,
   letting the existing finally do the bounded drain. No wall-clock cap —
   a healthy multi-hour reindex is never terminal, so it keeps waiting.

2. Stale Quartz entry (cross-pod path, AppScheduler:333). The on-demand
   job is non-durable, so a crash leaves a persisted QRTZ_* JobDetail;
   because the store is clustered, a retrigger from any pod then throws
   ObjectAlreadyExistsException even when nothing runs. The old code
   rethrew unconditionally without checking whether the app was running.
   Fix: scheduleOnDemandJob() consults the DB-backed AppRunRecord
   (cross-pod truth) — genuinely active runs rethrow; stale entries are
   cleared and rescheduled once. Fail-safe: if the run record can't be
   read, treat as active so a live job is never wrongly cleared.

Tests: DistributedSearchIndexExecutorTest (+2), AppSchedulerTest (+2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 4, 2026 10:59
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves resilience of distributed reindex orchestration by (1) preventing the Quartz execution thread from hanging indefinitely when a worker wedges, and (2) recovering from stale Quartz JobDetail/trigger entries that can block retriggers after a pod crash in a clustered Quartz setup.

Changes:

  • Replace unbounded CountDownLatch.await() with a poll loop that can unwind and force stop() once the job becomes terminal/STOPPING.
  • Add stale Quartz entry recovery on ObjectAlreadyExistsException by consulting the latest DB-backed AppRunRecord and clearing/rescheduling when the run is not active.
  • Add unit tests covering both the orchestrator unwind behavior and the stale Quartz-entry recovery behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java Adds stale Quartz-entry recovery logic for on-demand app scheduling based on latest AppRunRecord.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java Prevents indefinite orchestrator waits by polling latch completion and checking job terminal/stopping state.
openmetadata-service/src/test/java/org/openmetadata/service/apps/scheduler/AppSchedulerTest.java Adds tests validating stale Quartz entry cleanup vs. rethrow when a run is genuinely active.
openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutorTest.java Adds tests ensuring the orchestrator unwinds on terminal job state even if workers never finish.

Comment on lines +367 to +370
} catch (ObjectAlreadyExistsException ex) {
if (hasActiveAppRun(application)) {
throw ex;
}
@mohityadav766 mohityadav766 added the To release Will cherry-pick this PR into the release branch label Jun 4, 2026
…ng, log clarity

- ACTIVE_ERROR is a terminal AppRunRecord status (per OmAppJobListener),
  so the hand-rolled active-status set wrongly treated it as active and
  would leave retriggers wedged. Reuse OmAppJobListener.isTerminalStatus
  (now public) as the single source of truth instead.
- Gate stale Quartz-entry recovery to non-concurrent jobs only. Concurrent
  jobs use a unique identity per run, so a collision is not a stale entry
  and the app-wide latest run record is not a reliable signal.
- On a recovery reschedule that collides again (cross-pod race), let the
  ObjectAlreadyExistsException propagate to the standard "already running"
  message rather than disrupting the job another pod just scheduled.
- awaitWorkers now returns whether workers drained normally; the caller
  logs "All workers completed" only on a normal drain, and a distinct
  warning on forced unwind, so stuck workers are easier to diagnose.

Tests: AppSchedulerTest (+2: ACTIVE_ERROR stale recovery, concurrent
collision rethrow without clearing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

🔴 Playwright Results — 1 failure(s), 14 flaky

✅ 4264 passed · ❌ 1 failed · 🟡 14 flaky · ⏭️ 89 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 299 0 2 4
🟡 Shard 2 803 0 1 9
🟡 Shard 3 801 0 4 8
🟡 Shard 4 854 0 1 12
🟡 Shard 5 720 0 1 47
🔴 Shard 6 787 1 5 9

Genuine Failures (failed on all attempts)

Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6)
Error: �[2mexpect(�[22m�[31mlocator�[39m�[2m).�[22mtoBeVisible�[2m(�[22m�[2m)�[22m failed

Locator: locator('[data-testid="success-badge"]').or(locator('[data-testid="warning-badge"]'))
Expected: visible
Timeout: 210000ms
Error: element(s) not found

Call log:
�[2m  - Expect "toBeVisible" with timeout 210000ms�[22m
�[2m  - waiting for locator('[data-testid="success-badge"]').or(locator('[data-testid="warning-badge"]'))�[22m

🟡 14 flaky test(s) (passed on retry)
  • Flow/Tour.spec.ts › Tour should work from help section (shard 1, 2 retries)
  • Pages/SearchSettings.spec.ts › Preview config reflects reverted n-gram weight after save (shard 1, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Advanced Blocks (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Tasks/TaskNavigation.spec.ts › navigating to /table/TASK-XXXXX should show 404 (invalid URL pattern) (shard 3, 1 retry)
  • Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Email (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Follow-up review caught that reusing OmAppJobListener.isTerminalStatus
(which classifies ACTIVE_ERROR as terminal for run-timing purposes) was
unsafe here: ACTIVE_ERROR is an in-flight status — apps set it while still
progressing (CacheWarmupApp) and jobWasExecuted only normalizes it to
FAILED when the run actually finishes; crash recovery
(markRunningEntriesFailed*) only flips 'running', never 'activeError'.
Treating it as terminal could make a retrigger delete a job another pod
is genuinely running.

Use a dedicated TERMINAL_RUN_STATUSES set {SUCCESS, FAILED, STOPPED,
COMPLETED}; any other status (incl. ACTIVE_ERROR) counts as a live run we
must not clear. Erring toward "active" is the safe direction — a stale
entry is recoverable, deleting a live job is not. Revert isTerminalStatus
back to private.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 4, 2026 16:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

.map(
run ->
run.getStatus() != null && !TERMINAL_RUN_STATUSES.contains(run.getStatus()))
.orElse(false);
Comment on lines +390 to +396
/** Statuses that mean a run has finished; anything else (incl. ACTIVE_ERROR) is in-flight. */
private static final Set<AppRunRecord.Status> TERMINAL_RUN_STATUSES =
Set.of(
AppRunRecord.Status.SUCCESS,
AppRunRecord.Status.FAILED,
AppRunRecord.Status.STOPPED,
AppRunRecord.Status.COMPLETED);
Comment on lines +362 to +366
* <p>"Active" is defined by {@link #TERMINAL_RUN_STATUSES}: any non-terminal status (including
* {@code ACTIVE_ERROR}, which is in-flight — set by apps that are still progressing and only
* normalized to {@code FAILED} when the run actually finishes) is treated as a live run we must
* not clear. Erring toward "active" is deliberate: leaving a stale entry is recoverable, while
* deleting a job another pod is genuinely running risks a duplicate/disrupted execution.
Comment on lines +955 to +972
private boolean awaitWorkers(CountDownLatch workerLatch, UUID jobId) throws InterruptedException {
boolean drained = false;
boolean done = false;
while (!done) {
drained = workerLatch.await(LATCH_POLL_INTERVAL_SECONDS, TimeUnit.SECONDS);
if (drained) {
done = true;
} else if (isJobTerminalOrStopping(jobId)) {
LOG.warn(
"Job {} is terminal/stopping but workers have not drained; forcing executor "
+ "shutdown so the orchestrator can unwind",
jobId);
stop();
done = true;
}
}
return drained;
}
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 4, 2026

…rrors

The orchestrator's awaitWorkers loop polls coordinator.getJob() every cycle
to detect a terminal/STOPPING transition. The previous workerLatch.await()
was DB-independent; the polling added thousands of getJob() reads on the
orchestrator's critical path over a multi-hour reindex, any one of which
could throw (connection reset, pool exhaustion) and tear the job down via
the finally block.

- isJobTerminalOrStopping now wraps the read in try/catch and treats a read
  failure as non-terminal, so a transient DB blip keeps the orchestrator
  waiting and the wedge unwinds on the next clean poll (mirrors the
  hasActiveAppRun fail-safe).
- Widen the re-check cadence 5s -> 15s (3x less steady DB load; unwind is
  not latency-critical) and make it an injectable instance field so tests
  stay fast instead of slowing 3x.

Tests: +2 (transient-read keeps waiting then unwinds; isJobTerminalOrStopping
treats an unreadable state as non-terminal); existing unwind test injects a
1s interval. All 54 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Jun 5, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Implements a fail-safe poll loop for the reindex orchestrator and a robust stale-job recovery mechanism, resolving thread-hanging and cross-pod Quartz blocking issues.

✅ 3 resolved
Edge Case: Stale-job recovery is a non-atomic check-then-act across pods

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java:362-376
scheduleOnDemandJob() recovers from ObjectAlreadyExistsException by reading the DB-backed run record (hasActiveAppRun) and, if not active, calling deleteJob + unscheduleJob + scheduleJob. This check-then-act sequence is not atomic with respect to other pods (the store is clustered) nor with a job that starts executing in the window between the read and the delete.

Concrete race: two pods both retrigger; both fail the first scheduleJob, both call hasActiveAppRun() and read the same stale terminal record (FAILED) so both proceed. Pod A then deleteJob+scheduleJob and its trigger fires the job. Pod B, still acting on its earlier read, executes deleteJob — now deleting/disrupting the job A just (re)scheduled — and schedules its own. The result can be a disrupted run or a second concurrent execution of an app declared allowConcurrentExecution=false. The earlier guard in triggerOnDemandApplication (getCurrentlyExecutingJobs()) only inspects the local pod, so it does not close this cross-pod window.

The window is narrow (requires near-simultaneous cross-pod retriggers), but since the whole purpose of this path is crash recovery in a clustered deployment, it is worth hardening. Consider re-checking hasActiveAppRun() immediately before the reschedule, and/or catching a second ObjectAlreadyExistsException from the recovery scheduleJob and re-deriving active state rather than letting it surface as the misleading "Job is already running" message.

Bug: awaitWorkers still hangs if a wedged worker leaves job non-terminal

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java:946-960 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java:676-685
awaitWorkers() only breaks out of the poll loop when isJobTerminalOrStopping(jobId) becomes true (job null / terminal / STOPPING). The unwedge guarantee therefore depends entirely on some external party flipping the job to a terminal state. A worker wedged inside PartitionWorker.processPartition() on a degraded backend does not itself change job status — runWorkerLoop only re-checks status between partitions (line 679), not during a single blocked partition. Meanwhile the lock-refresh thread (runLockRefreshLoop) and heartbeat thread (runPartitionHeartbeatLoop) keep running on separate threads, so the distributed lock is not lost and JobRecoveryManager may not consider the run stale. In that case the job stays RUNNING, isJobTerminalOrStopping stays false, and awaitWorkers polls forever — the exact orchestrator-hang this PR set out to remove.

This may be acceptable if you are confident that a wedged worker always eventually triggers lock loss/recovery, but that assumption is not enforced by this code. Consider a defensive escape that also accounts for "no partition has made progress in N intervals" (heartbeats/partition completion stalled) rather than relying solely on the job-status transition.

Edge Case: ACTIVE_ERROR run treated as stale can clear a live cross-pod job

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/AppScheduler.java:383-397 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/OmAppJobListener.java:82-87 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/scheduler/OmAppJobListener.java:211-222
This commit replaces the explicit ACTIVE_RUN_STATUSES set (which listed ACTIVE_ERROR as active) with !OmAppJobListener.isTerminalStatus(status) in hasActiveAppRun (AppScheduler.java:389-392). isTerminalStatus classifies ACTIVE_ERROR as terminal (OmAppJobListener.java:84). However, ACTIVE_ERROR is an in-flight status: jobWasExecuted rewrites it to FAILED when a run actually completes (OmAppJobListener.java:211-222), and CacheWarmupApp sets ACTIVE_ERROR on a run that is still progressing/retriable (CacheWarmupApp.java:447-453). A persisted latest-run record with ACTIVE_ERROR therefore most often means a run is genuinely executing (commonly on another pod, since this is the clustered-store recovery path), not a stale crash remnant.

Consequence: a retrigger from pod B hits ObjectAlreadyExistsException, reads the latest run as ACTIVE_ERROR, treats it as stale, and deleteJob + unscheduleJob + reschedules — clearing/duplicating the Quartz job that pod A is actively running. This directly contradicts the PR's stated fail-safe that "a live job is never wrongly cleared," and unlike RUNNING/STARTED (which remain protected because they are non-terminal), ACTIVE_ERROR was flipped to the unsafe side. The new test testNonConcurrentApp_treatsActiveErrorRunAsStale encodes this behavior as intentional, but the semantics of ACTIVE_ERROR make it risky.

Suggest using a dedicated 'active run' predicate that keeps ACTIVE_ERROR (and any other in-flight status) as active rather than reusing the terminal-timings classifier, which serves a different purpose.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants