feat: optimistic scheduling #2867
Conversation
…elanger/optimistic-scheduling-2
abelanger5
left a comment
Need to address some comments; additionally, we should add the optimistic scheduling paths to the testing matrix.
```go
func (d *DispatcherImpl) handleRetries(
	ctx context.Context,
	tenantId string,
	toRetry []*sqlcv1.V1Task,
```
maybe this should also be a []*taskWithPayload for consistency? when I initially did this, I was mostly trying to remove the sqlcv1.V1Task from everywhere I could to make it less confusing what was what
```go
dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId], bulkAssigned.QueueItem.TaskID)
if !bulkAssigned.IsAssignedLocally {
```
how much risk is there here that we forget to check this flag somewhere and end up with duplicate assigns?
this should be the only place we instantiate the tasks which get sent to the dispatchers
Pull request overview
This pull request implements optimistic scheduling, enabling up to 3x performance improvements (24ms → 8ms for single-task workflows) by processing task creation and assignment within a single transaction when workers are connected to the same gRPC session.
Changes:
- Adds optimistic scheduling with semaphore-based concurrency control (configurable via env var, default 10 slots)
- Implements new `OptimisticTx` transaction wrapper with post-commit hooks
- Refactors trigger logic to support both transactional and non-transactional paths
- Integrates optimistic scheduling into ingestor and admin services for local task dispatch
- Updates SQL triggers to conditionally insert into concurrency/dag tables
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/repository/trigger.go | Refactored to extract reusable prepareTriggerFrom* and triggerWorkflows methods supporting both transaction modes |
| pkg/repository/optimistic_tx.go | New transaction wrapper with post-commit hook support for optimistic path |
| pkg/repository/scheduler_optimistic.go | New repository implementation for transactional trigger+scheduling operations |
| pkg/scheduling/v1/pool.go | Adds semaphore-based slot management for optimistic scheduling |
| pkg/scheduling/v1/queuer.go | Implements optimistic queue processing within transactions |
| pkg/scheduling/v1/tenant_manager.go | Orchestrates optimistic scheduling with transaction coordination |
| internal/services/ingestor/ingestor_v1.go | Attempts optimistic scheduling for events before falling back to message queue |
| internal/services/admin/v1/server.go | Attempts optimistic scheduling for workflow triggers before falling back to message queue |
| internal/services/dispatcher/dispatcher_v1.go | Adds HandleLocalAssignments for synchronous local worker dispatch |
| cmd/hatchet-migrate/migrate/migrations/20260127201500_v1_0_72.sql | Adds conditional inserts to task trigger to avoid unnecessary queries |
```go
for _, f := range o.postCommit {
	f()
}
```
The post-commit hooks are executed synchronously after the transaction is committed. If any of these hooks panics or takes a long time, it will block the caller. Since post-commit hooks are typically used for side effects that should not affect transaction success (like sending messages or updating metrics), consider running them in a goroutine or adding error recovery. Additionally, there's no mechanism to handle errors from post-commit hooks - they're fire-and-forget which could lead to silent failures.
```go
for _, qr := range allQueueResults {
	t.resultsCh <- qr
}
```
If the transaction commits successfully but sending results to t.resultsCh blocks or panics (line 360), the function will not return the scheduled tasks. The channel write is unbuffered and could block indefinitely if the receiver is slow or not consuming. Consider making this a non-blocking select with a default case or running it in a goroutine to avoid blocking the optimistic scheduling return path.
```go
// we don't run this in a transaction because workflow versions won't change during the course of this operation
workflowVersionsByNames, err := r.queries.ListWorkflowsByNames(ctx, r.pool, sqlcv1.ListWorkflowsByNamesParams{
```
The comment on line 2111 states "we don't run this in a transaction" but this code can now be called with either r.pool or a transaction tx. When called from prepareTriggerFromWorkflowNames with a transaction, this ListWorkflowsByNames query incorrectly uses r.pool instead of the passed tx parameter, breaking transaction isolation. This should use the tx parameter that is passed to the function.
```diff
- // we don't run this in a transaction because workflow versions won't change during the course of this operation
- workflowVersionsByNames, err := r.queries.ListWorkflowsByNames(ctx, r.pool, sqlcv1.ListWorkflowsByNamesParams{
+ // use the provided executor (transaction or pool); workflow versions won't change during the course of this operation
+ workflowVersionsByNames, err := r.queries.ListWorkflowsByNames(ctx, tx, sqlcv1.ListWorkflowsByNamesParams{
```
```go
tenantIdWorkflowNameCache := expirable.NewLRU(10000, func(key string, value *sqlcv1.ListWorkflowsByNamesRow) {}, 5*time.Second)
stepsInWorkflowVersionCache := expirable.NewLRU(10000, func(key string, value []*sqlcv1.ListStepsByWorkflowVersionIdsRow) {}, 5*time.Second)
```
The cache TTLs for the new LRU caches are inconsistent with their usage patterns. tenantIdWorkflowNameCache and stepsInWorkflowVersionCache use 5 second expiry, which is very short and may result in excessive cache misses, especially given these are used in critical performance paths for optimistic scheduling. Consider using a longer TTL (e.g., 5 minutes like stepIdLabelsCache) or making these configurable. The short TTL may undermine the performance benefits of optimistic scheduling.
```diff
- tenantIdWorkflowNameCache := expirable.NewLRU(10000, func(key string, value *sqlcv1.ListWorkflowsByNamesRow) {}, 5*time.Second)
- stepsInWorkflowVersionCache := expirable.NewLRU(10000, func(key string, value []*sqlcv1.ListStepsByWorkflowVersionIdsRow) {}, 5*time.Second)
+ tenantIdWorkflowNameCache := expirable.NewLRU(10000, func(key string, value *sqlcv1.ListWorkflowsByNamesRow) {}, 5*time.Minute)
+ stepsInWorkflowVersionCache := expirable.NewLRU(10000, func(key string, value []*sqlcv1.ListStepsByWorkflowVersionIdsRow) {}, 5*time.Minute)
```
Pull request overview
Copilot reviewed 41 out of 41 changed files in this pull request and generated 9 comments.
```go
go func() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
```
The goroutine spawned to signal optimistic scheduling results uses context.Background() with a 30-second timeout instead of using the original context. This means if the parent context is cancelled, these signals will still attempt to send for up to 30 seconds. Consider using context.WithTimeout(ctx, 30*time.Second) to respect parent cancellation.
```diff
  migrate-strategy: ["latest"]
- rabbitmq-enabled: ["true", "false"]
  pg-version: ["15-alpine"]
+ rabbitmq-enabled: ["true"]
```
The test matrix removed rabbitmq-enabled: "false" option and only tests with RabbitMQ enabled. This reduces test coverage and could miss issues that occur specifically when RabbitMQ is disabled. The previous configuration tested both scenarios.
```go
if !bulkAssigned.IsAssignedLocally {
	dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId], bulkAssigned.QueueItem.TaskID)
}
```
In the optimistic scheduling path, tasks assigned locally skip sending messages to the scheduler via !bulkAssigned.IsAssignedLocally. However, there's no validation that these locally assigned tasks are actually being dispatched. If HandleLocalAssignments fails silently or doesn't cover all error cases, tasks could be "lost" without any retry mechanism.
```diff
- if !bulkAssigned.IsAssignedLocally {
- 	dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId], bulkAssigned.QueueItem.TaskID)
- }
+ dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(
+ 	dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId],
+ 	bulkAssigned.QueueItem.TaskID,
+ )
```
```go
func doCallback(f func()) {
	go func() {
		defer func() {
			recover() // nolint: errcheck
```
The doCallback function uses a bare recover() that silently swallows all panics without logging. This makes debugging very difficult if something goes wrong in the callback. Consider at least logging the recovered value before discarding it.
```go
go func() {
	defer func() {
		if r := recover(); r != nil {
			if l != nil {
				l.Error().Interface("panic", r).Msg("panic in callback")
			}
		}
	}()
}()
```
The callback execution is moved into a goroutine but the panic recovery is also moved into it. This changes the behavior - previously the panic recovery protected the calling code, now it only protects the goroutine. If the callback panics before the goroutine is scheduled, it could crash the caller. Consider whether this behavior change is intentional.
```go
// store in the cache
k := fmt.Sprintf("%s:%s", tenantId, workflowVersion.WorkflowName)

r.tenantIdWorkflowNameCache.Add(k, workflowVersion)
```
The cache lookups now use .Get() which returns (value, bool), but the cache insertion uses .Add() instead of .Set(). The expirable.LRU's .Add() method returns a boolean indicating if an eviction occurred. If this behavior change is significant (e.g., for monitoring cache pressure), it should be handled. Otherwise, this is just a minor API difference to be aware of.
mrkaye97
left a comment
lgtm, would be great to run the python tests locally with / without optimistic mode on to see how it looks + just confirm
```yaml
required: true
name: Release
jobs:
# load:
```
do we want to uncomment/delete these?
* placeholder
* feat: db tables for user events (#2862)
* feat: db tables for user events
* move event payloads to payloads table, fix env var loading
* fix: address pr review comments
* missed save
* feat: optimistic scheduling (#2867)
* feat: db tables for user events
* move event payloads to payloads table, fix env var loading
* refactor: small changes to prepare optimistic txs
* feat: optimistic scheduling
* address pr review comments
* rm comments
* fix: rampup test race condition
* fix: goleak
* feat: grpc-side triggers
* fix: config and sem logic
* fix: respect optimistic scheduling env var
* add optimistic to testing matrix, remove pg-only mode
* fix cleanup of pubbuffers
* merge migrations
* last testing fixes
Description
New implementation to replace #2258
Adds support for "optimistic" scheduling, meaning that we can create tasks from the gRPC engine with transactional safety and schedule tasks on workers which are connected to the current gRPC session (these are two separate concepts, referred to in code by `localScheduler` and `localDispatcher`). We allocate a small set of semaphores for that.

Features:

- `task-completed`. We can similarly add `task-failed` and `task-cancelled` in the future.

Drawbacks:
Limitations:
If `n` engines are horizontally scaled, the chances of optimistic scheduling reduce by `1/n`: we only use local schedulers when they have a lease on a tenant. We will need to build out a sticky load balancing strategy to take advantage of optimistic scheduling in HA setups if we'd like to tackle that.

Type of change