
Support Optimistic Execution in Scheduler #70

Merged
liobrasil merged 1 commit into main from navid/scheduler-optimistic-execution-fix
Mar 13, 2026

Conversation

@nvdtf nvdtf commented Mar 13, 2026

Summary

  • Check for failed requests on every scheduler run; previously this check only ran when pending > 0.
  • Fixes a crash recovery race condition caused by FlowTransactionScheduler's optimistic execution status updates.

Problem

The FlowTransactionScheduler uses an optimistic status update pattern where scheduled transactions are marked as Executed before the handler code actually runs. This is done to enable concurrent execution (see FlowTransactionScheduler.cdc lines 1315-1321):

// after pending execution event is emitted we set the transaction as executed because we
// must rely on execution node to actually execute it. Execution of the transaction is
// done in a separate transaction that calls executeTransaction(id) function.
// Executing the transaction can not update the status of transaction or any other shared state,
// since that blocks concurrent transaction execution.
// Therefore an optimistic update to executed is made here to avoid race condition.
tx.setStatus(newStatus: Status.Executed)

This caused _checkForFailedWorkerRequests to incorrectly mark workers as failed when:

  1. Worker transactions were picked up for execution (status = Executed)
  2. But the actual handler code hadn't finished running yet
  3. The scheduler saw Executed status + entry still in scheduledRequests = assumed panic
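The misfire can be modeled in a few lines of Python (an illustrative sketch, not the Cadence source; `naive_failed_check` and its parameters are hypothetical names standing in for the old `_checkForFailedWorkerRequests` condition):

```python
# Illustrative model of why the old check misfired: the scheduler flips a
# transaction's status to Executed *before* the handler actually runs.

def naive_failed_check(status: str, in_scheduled_requests: bool) -> bool:
    # Old logic: "Executed but still tracked in scheduledRequests" was
    # read as evidence that the handler panicked before cleaning up.
    return status == "Executed" and in_scheduled_requests

# Healthy worker right after the optimistic flip: the handler has not
# finished yet, so its entry is still tracked -- the check misfires.
assert naive_failed_check("Executed", True)        # false positive
# Once the handler completes and removes its entry, the check passes.
assert not naive_failed_check("Executed", False)
```

The false positive exists purely because status and cleanup happen in separate transactions; no amount of reordering inside the check fixes it without a time component.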

Solution

  1. Added configurable grace period (crashRecoveryGracePeriod, default 10 seconds) before checking for failed workers
  2. Moved _checkForFailedWorkerRequests to the beginning of execute() for clearer execution flow
  3. Added admin function setCrashRecoveryGracePeriod() to adjust the grace period if needed

The crash recovery logic now waits until scheduledTimestamp + gracePeriod before checking whether a worker failed, giving the handler enough time to complete execution and remove itself from scheduledRequests.
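A minimal Python sketch of the fixed check, assuming a helper with these hypothetical parameter names (the real logic is in Cadence); the default mirrors the PR's 10-second grace period:

```python
# Sketch of the grace-period guard: within the window, an
# Executed-but-tracked entry is assumed to be a handler still running.

CRASH_RECOVERY_GRACE_PERIOD = 10.0  # seconds, matching the new default

def looks_failed(status: str, in_scheduled_requests: bool,
                 scheduled_timestamp: float, now: float,
                 grace_period: float = CRASH_RECOVERY_GRACE_PERIOD) -> bool:
    # Inside the grace window: never flag, regardless of status.
    if now <= scheduled_timestamp + grace_period:
        return False
    return status == "Executed" and in_scheduled_requests

# Just after the optimistic status flip: inside the window, not failed.
assert not looks_failed("Executed", True, scheduled_timestamp=100.0, now=105.0)
# Entry still present well past the window: treated as a failed worker.
assert looks_failed("Executed", True, scheduled_timestamp=100.0, now=120.0)
```

The guard turns the ambiguous "Executed but still tracked" state into a timed one: only entries that outlive the window are treated as crashes.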

Changes

  • Added crashRecoveryGracePeriod configuration variable (default: 10.0 seconds)
  • Added Admin.setCrashRecoveryGracePeriod() function
  • Moved _checkForFailedWorkerRequests call from _runScheduler to execute()
  • Updated grace period check: currentTimestamp <= scheduledTimestamp + gracePeriod
  • Increased SCHEDULER_BASE_EFFORT from 700 to 1200
  • Updated test config SCHEDULER_WAKEUP_INTERVAL from 2 to 1
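One review comment notes that setCrashRecoveryGracePeriod() has no upper bound, so a misconfigured value could silently disable recovery. A hypothetical bounded setter, sketched in Python (the cap, class, and snake_case names are illustrative assumptions, not part of the PR):

```python
# Hypothetical bounded setter addressing the review concern that an
# oversized grace period (e.g. 86400s) would effectively disable
# crash recovery. MAX_GRACE_PERIOD is an assumed cap, not from the PR.

MAX_GRACE_PERIOD = 60.0  # assumed; the PR itself sets no upper bound

class SchedulerConfig:
    def __init__(self) -> None:
        self.crash_recovery_grace_period = 10.0  # default from this PR

    def set_crash_recovery_grace_period(self, seconds: float) -> None:
        # Reject non-positive or oversized values up front.
        if not 0.0 < seconds <= MAX_GRACE_PERIOD:
            raise ValueError(
                f"grace period must be in (0, {MAX_GRACE_PERIOD}] seconds")
        self.crash_recovery_grace_period = seconds

cfg = SchedulerConfig()
cfg.set_crash_recovery_grace_period(15.0)
assert cfg.crash_recovery_grace_period == 15.0
```

In Cadence the same guard would typically live in a pre-condition on the Admin function.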

claude bot commented Mar 13, 2026

PR Review: Support Optimistic Execution in Scheduler

The core fix moves _checkForFailedWorkerRequests() to run unconditionally at the top of every scheduler tick, instead of only inside _runScheduler when fetchCount > 0. This is correct: failed requests were previously stuck in PROCESSING indefinitely when no new work arrived.

Issues:

  1. Stale docstring on _runScheduler still lists crash recovery as step 1; it should be removed or renumbered.
  2. No upper bound on crashRecoveryGracePeriod: a misconfiguration like 86400s would silently disable recovery.
  3. The SCHEDULER_BASE_EFFORT bump from 700 to 1200 is not justified in comments.

The grace period logic is correct: checkAfterTimestamp = workerScheduledTimestamp + gracePeriod properly accounts for optimistic execution, and the 10s default is reasonable against the 1s wakeup interval. The test coverage gap for _checkForFailedWorkerRequests is pre-existing.

Overall: a correct and well-motivated fix.

@nvdtf nvdtf requested a review from liobrasil March 13, 2026 20:29
@liobrasil liobrasil left a comment

LGTM

@liobrasil liobrasil merged commit 38e5cc4 into main Mar 13, 2026
6 checks passed