
feat: optimistic scheduling #2867

Merged
abelanger5 merged 14 commits into belanger/beta-events-api from belanger/optimistic-scheduling-2
Feb 1, 2026

Conversation

Contributor

@abelanger5 abelanger5 commented Jan 27, 2026

Description

New implementation to replace #2258

Adds support for "optimistic" scheduling: when possible, we create tasks from the gRPC engine with transactional safety and schedule them on workers connected to the current gRPC session (these are two separate concepts, referred to in code as localScheduler and localDispatcher). We allocate a small set of semaphores for this.

Features:

  • Up to a 3x speedup in scheduling performance, from 24ms -> 8ms for single-task workflows
  • Reduces overall pressure on the message queue and downstream components, as there are fewer messages being passed for scheduling purposes
  • TENTATIVE: Improved listening for the task-completed event of a single-task workflow by hooking into an existing tenant-wide task-completed message. We can similarly add task-failed and task-cancelled in the future.
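The fast path described above can be sketched as a non-blocking slot pool per gRPC process. This is a minimal sketch with hypothetical names, not the actual Hatchet implementation: if a slot is free, the trigger is handled optimistically in-process; otherwise it falls back to the regular message-queue path.

```go
package main

import "fmt"

// slotPool models the small set of per-process "scheduling slots"
// described above. Names are hypothetical; this is not the PR's code.
type slotPool struct {
	slots chan struct{}
}

func newSlotPool(n int) *slotPool {
	p := &slotPool{slots: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		p.slots <- struct{}{}
	}
	return p
}

// tryAcquire is non-blocking: if no slot is free, the caller falls
// back to the regular queue-based scheduling path instead of waiting.
func (p *slotPool) tryAcquire() bool {
	select {
	case <-p.slots:
		return true
	default:
		return false
	}
}

func (p *slotPool) release() { p.slots <- struct{}{} }

func main() {
	pool := newSlotPool(2)
	fmt.Println(pool.tryAcquire()) // true
	fmt.Println(pool.tryAcquire()) // true
	fmt.Println(pool.tryAcquire()) // false: fall back to the queue path
	pool.release()
	fmt.Println(pool.tryAcquire()) // true
}
```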

Drawbacks:

  • Increases the complexity of scheduling: the optimistic paths are quite different from the regular path, since we protect everything with a single transaction. I've tried to minimize the complexity here and made a number of improvements over #2258 to avoid maintenance hell.
  • Can increase pressure on the engines. I've tried to avoid major issues by only allocating 10 "scheduling slots" to each gRPC process (configurable via an env var).

Limitations:

  • Scheduling child workflows is still significantly slower than scheduling non-child workflows, because we have ~6ms of latency due to how we're checking idempotency on the child workflow trigger
    • I think we could improve this a lot with idempotency keys, it really should only be a single database transaction to insert/lookup the idempotency keys
  • This won't be turned on in HA mode, and as engines are horizontally scaled to n instances the chance of optimistic scheduling drops to roughly 1/n, because we only use a local scheduler when it holds a lease on the tenant. To take advantage of optimistic scheduling in HA setups, we would need to build out a sticky load-balancing strategy.

Type of change

  • New feature (non-breaking change which adds functionality)


@abelanger5 abelanger5 changed the base branch from belanger/beta-events-api to belanger/db-implementation January 27, 2026 21:20
Base automatically changed from belanger/db-implementation to belanger/beta-events-api January 27, 2026 21:47
Contributor Author

@abelanger5 abelanger5 left a comment


Need to address some comments, additionally we should add the optimistic scheduling paths to the testing matrix.

Comment thread internal/services/controllers/olap/signal/signal.go
Comment thread cmd/hatchet-migrate/migrate/migrations/20260127201500_v1_0_72.sql
Comment thread internal/services/dispatcher/dispatcher_v1.go
Comment thread internal/services/scheduler/v1/scheduler.go Outdated
Comment thread pkg/repository/scheduler_queue.go Outdated
Comment thread pkg/scheduling/v1/queuer.go Outdated
Comment thread internal/services/dispatcher/dispatcher_v1.go
Comment thread .golangci.yml
Comment thread pkg/repository/trigger.go
Comment thread pkg/repository/scheduler_optimistic.go
@abelanger5 abelanger5 requested a review from mrkaye97 January 28, 2026 02:37
Comment thread internal/services/controllers/olap/signal/signal.go
Comment thread internal/services/dispatcher/dispatcher_v1.go Outdated
Comment thread internal/services/dispatcher/dispatcher_v1.go
Comment thread internal/services/dispatcher/dispatcher_v1.go
func (d *DispatcherImpl) handleRetries(
	ctx context.Context,
	tenantId string,
	toRetry []*sqlcv1.V1Task,
Contributor


maybe this should also be a []*taskWithPayload for consistency? when I initially did this, I was mostly trying to remove the sqlcv1.V1Task from everywhere I could to make it less confusing what was what

}

dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId], bulkAssigned.QueueItem.TaskID)
if !bulkAssigned.IsAssignedLocally {
Contributor


how much risk is there here that we forget to check this flag somewhere and end up with duplicate assigns?

Contributor Author


this should be the only place we instantiate the tasks which get sent to the dispatchers

Comment thread pkg/repository/idempotency.go
Comment thread pkg/repository/trigger.go
Comment thread pkg/repository/trigger.go Outdated
Comment thread pkg/scheduling/v1/queuer.go Outdated
@abelanger5 abelanger5 marked this pull request as ready for review January 28, 2026 17:27
@abelanger5 abelanger5 requested a review from Copilot January 28, 2026 17:27
@abelanger5 abelanger5 changed the title from "wip: optimistic scheduling, take 2" to "feat: optimistic scheduling" Jan 28, 2026
@abelanger5 abelanger5 mentioned this pull request Jan 28, 2026
1 task
Contributor

Copilot AI left a comment


Pull request overview

This pull request implements optimistic scheduling, enabling up to 3x performance improvements (24ms → 8ms for single-task workflows) by processing task creation and assignment within a single transaction when workers are connected to the same gRPC session.

Changes:

  • Adds optimistic scheduling with semaphore-based concurrency control (configurable via env var, default 10 slots)
  • Implements new OptimisticTx transaction wrapper with post-commit hooks
  • Refactors trigger logic to support both transactional and non-transactional paths
  • Integrates optimistic scheduling into ingestor and admin services for local task dispatch
  • Updates SQL triggers to conditionally insert into concurrency/dag tables
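The post-commit hook idea can be sketched roughly as follows. This is a hypothetical simplification with assumed names; the real OptimisticTx wraps a database transaction, but the contract is the same: hooks registered during the transaction run only after a successful commit, so side effects like publishing messages never fire for rolled-back work.

```go
package main

import "fmt"

// optimisticTx is a hypothetical sketch of a transaction wrapper with
// post-commit hooks, in the spirit of the OptimisticTx listed above.
type optimisticTx struct {
	postCommit []func()
	committed  bool
}

// OnPostCommit registers a side effect (e.g. publishing a message) that
// must only run if the transaction commits successfully.
func (o *optimisticTx) OnPostCommit(f func()) {
	o.postCommit = append(o.postCommit, f)
}

func (o *optimisticTx) Commit() error {
	// ... commit the underlying database transaction here ...
	o.committed = true
	for _, f := range o.postCommit {
		f() // hooks never fire for rolled-back work
	}
	return nil
}

func main() {
	tx := &optimisticTx{}
	tx.OnPostCommit(func() { fmt.Println("notify local dispatcher") })
	_ = tx.Commit()
}
```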

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pkg/repository/trigger.go Refactored to extract reusable prepareTriggerFrom* and triggerWorkflows methods supporting both transaction modes
pkg/repository/optimistic_tx.go New transaction wrapper with post-commit hook support for optimistic path
pkg/repository/scheduler_optimistic.go New repository implementation for transactional trigger+scheduling operations
pkg/scheduling/v1/pool.go Adds semaphore-based slot management for optimistic scheduling
pkg/scheduling/v1/queuer.go Implements optimistic queue processing within transactions
pkg/scheduling/v1/tenant_manager.go Orchestrates optimistic scheduling with transaction coordination
internal/services/ingestor/ingestor_v1.go Attempts optimistic scheduling for events before falling back to message queue
internal/services/admin/v1/server.go Attempts optimistic scheduling for workflow triggers before falling back to message queue
internal/services/dispatcher/dispatcher_v1.go Adds HandleLocalAssignments for synchronous local worker dispatch
cmd/hatchet-migrate/migrate/migrations/20260127201500_v1_0_72.sql Adds conditional inserts to task trigger to avoid unnecessary queries


Comment thread internal/services/admin/v1/server.go
Comment thread internal/services/admin/server_v1.go
Comment thread internal/services/ingestor/ingestor_v1.go
Comment on lines +43 to +45
for _, f := range o.postCommit {
	f()
}

Copilot AI Jan 28, 2026


The post-commit hooks are executed synchronously after the transaction is committed. If any of these hooks panics or takes a long time, it will block the caller. Since post-commit hooks are typically used for side effects that should not affect transaction success (like sending messages or updating metrics), consider running them in a goroutine or adding error recovery. Additionally, there's no mechanism to handle errors from post-commit hooks - they're fire-and-forget which could lead to silent failures.
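One way to harden the hooks along the lines this comment suggests (a sketch, not the PR's actual code) is to run each hook with panic recovery and surface the recovered value as an error rather than discarding it:

```go
package main

import "fmt"

// runHook executes a single post-commit hook with panic recovery, so a
// misbehaving hook cannot crash the caller and the recovered value is
// reported instead of silently swallowed.
func runHook(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("post-commit hook panicked: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	if err := runHook(func() { panic("boom") }); err != nil {
		fmt.Println(err) // caller keeps running; the panic is surfaced
	}
}
```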

Comment on lines +359 to +361
for _, qr := range allQueueResults {
	t.resultsCh <- qr
}

Copilot AI Jan 28, 2026


If the transaction commits successfully but sending results to t.resultsCh blocks or panics (line 360), the function will not return the scheduled tasks. The channel write is unbuffered and could block indefinitely if the receiver is slow or not consuming. Consider making this a non-blocking select with a default case or running it in a goroutine to avoid blocking the optimistic scheduling return path.
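The non-blocking send the comment suggests can be sketched with a select/default (illustrative only; the real channel carries queue results, not ints):

```go
package main

import "fmt"

// trySend attempts the channel write and reports failure instead of
// blocking when the receiver is not ready, so the optimistic return
// path can never stall on a slow consumer.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // receiver busy or buffer full; caller can fall back
	}
}

func main() {
	ch := make(chan int, 1)
	fmt.Println(trySend(ch, 1)) // true: buffer has room
	fmt.Println(trySend(ch, 2)) // false: buffer full, send skipped
}
```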

Comment thread pkg/repository/trigger.go Outdated
Comment on lines +2111 to +2112
// we don't run this in a transaction because workflow versions won't change during the course of this operation
workflowVersionsByNames, err := r.queries.ListWorkflowsByNames(ctx, r.pool, sqlcv1.ListWorkflowsByNamesParams{

Copilot AI Jan 28, 2026


The comment on line 2111 states "we don't run this in a transaction" but this code can now be called with either r.pool or a transaction tx. When called from prepareTriggerFromWorkflowNames with a transaction, this ListWorkflowsByNames query incorrectly uses r.pool instead of the passed tx parameter, breaking transaction isolation. This should use the tx parameter that is passed to the function.

Suggested change
// we don't run this in a transaction because workflow versions won't change during the course of this operation
workflowVersionsByNames, err := r.queries.ListWorkflowsByNames(ctx, r.pool, sqlcv1.ListWorkflowsByNamesParams{
// use the provided executor (transaction or pool); workflow versions won't change during the course of this operation
workflowVersionsByNames, err := r.queries.ListWorkflowsByNames(ctx, tx, sqlcv1.ListWorkflowsByNamesParams{

Comment thread pkg/repository/shared.go Outdated
Comment on lines +68 to +69
tenantIdWorkflowNameCache := expirable.NewLRU(10000, func(key string, value *sqlcv1.ListWorkflowsByNamesRow) {}, 5*time.Second)
stepsInWorkflowVersionCache := expirable.NewLRU(10000, func(key string, value []*sqlcv1.ListStepsByWorkflowVersionIdsRow) {}, 5*time.Second)

Copilot AI Jan 28, 2026


The cache TTLs for the new LRU caches are inconsistent with their usage patterns. tenantIdWorkflowNameCache and stepsInWorkflowVersionCache use 5 second expiry, which is very short and may result in excessive cache misses, especially given these are used in critical performance paths for optimistic scheduling. Consider using a longer TTL (e.g., 5 minutes like stepIdLabelsCache) or making these configurable. The short TTL may undermine the performance benefits of optimistic scheduling.

Suggested change
tenantIdWorkflowNameCache := expirable.NewLRU(10000, func(key string, value *sqlcv1.ListWorkflowsByNamesRow) {}, 5*time.Second)
stepsInWorkflowVersionCache := expirable.NewLRU(10000, func(key string, value []*sqlcv1.ListStepsByWorkflowVersionIdsRow) {}, 5*time.Second)
tenantIdWorkflowNameCache := expirable.NewLRU(10000, func(key string, value *sqlcv1.ListWorkflowsByNamesRow) {}, 5*time.Minute)
stepsInWorkflowVersionCache := expirable.NewLRU(10000, func(key string, value []*sqlcv1.ListStepsByWorkflowVersionIdsRow) {}, 5*time.Minute)

Comment thread cmd/hatchet-migrate/migrate/migrations/20260127201500_v1_0_72.sql
Contributor

@mrkaye97 mrkaye97 left a comment


🥳

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 41 out of 41 changed files in this pull request and generated 9 comments.



Comment on lines +20 to +22
go func() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

Copilot AI Jan 30, 2026


The goroutine spawned to signal optimistic scheduling results uses context.Background() with a 30-second timeout instead of using the original context. This means if the parent context is cancelled, these signals will still attempt to send for up to 30 seconds. Consider using context.WithTimeout(ctx, 30*time.Second) to respect parent cancellation.

migrate-strategy: ["latest"]
rabbitmq-enabled: ["true", "false"]
pg-version: ["15-alpine"]
rabbitmq-enabled: ["true"]

Copilot AI Jan 30, 2026


The test matrix removed rabbitmq-enabled: "false" option and only tests with RabbitMQ enabled. This reduces test coverage and could miss issues that occur specifically when RabbitMQ is disabled. The previous configuration tested both scenarios.

Comment on lines +418 to +420
if !bulkAssigned.IsAssignedLocally {
	dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId], bulkAssigned.QueueItem.TaskID)
}

Copilot AI Jan 30, 2026


In the optimistic scheduling path, tasks assigned locally skip sending messages to the scheduler via !bulkAssigned.IsAssignedLocally. However, there's no validation that these locally assigned tasks are actually being dispatched. If HandleLocalAssignments fails silently or doesn't cover all error cases, tasks could be "lost" without any retry mechanism.

Suggested change
if !bulkAssigned.IsAssignedLocally {
dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId], bulkAssigned.QueueItem.TaskID)
}
dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId] = append(
	dispatcherIdToWorkerIdsToStepRuns[dispatcherId][workerId],
	bulkAssigned.QueueItem.TaskID,
)

Comment thread .github/workflows/test.yml
Comment thread .github/workflows/release.yaml
Comment thread sql/schema/v1-core.sql
Comment on lines +54 to +57
func doCallback(f func()) {
	go func() {
		defer func() {
			recover() // nolint: errcheck

Copilot AI Jan 30, 2026


The doCallback function uses a bare recover() that silently swallows all panics without logging. This makes debugging very difficult if something goes wrong in the callback. Consider at least logging the recovered value before discarding it.

Comment on lines +29 to +36
go func() {
	defer func() {
		if r := recover(); r != nil {
			if l != nil {
				l.Error().Interface("panic", r).Msg("panic in callback")
			}
		}
	}()
}()

Copilot AI Jan 30, 2026


The callback execution is moved into a goroutine but the panic recovery is also moved into it. This changes the behavior - previously the panic recovery protected the calling code, now it only protects the goroutine. If the callback panics before the goroutine is scheduled, it could crash the caller. Consider whether this behavior change is intentional.

Comment thread pkg/repository/trigger.go
// store in the cache
k := fmt.Sprintf("%s:%s", tenantId, workflowVersion.WorkflowName)

r.tenantIdWorkflowNameCache.Add(k, workflowVersion)

Copilot AI Jan 30, 2026


The cache lookups now use .Get() which returns (value, bool), but the cache insertion uses .Add() instead of .Set(). The expirable.LRU's .Add() method returns a boolean indicating if an eviction occurred. If this behavior change is significant (e.g., for monitoring cache pressure), it should be handled. Otherwise, this is just a minor API difference to be aware of.

Contributor

@mrkaye97 mrkaye97 left a comment


lgtm, would be great to run the python tests locally with / without optimistic mode on to see how it looks + just confirm

required: true
name: Release
jobs:
# load:
Contributor


do we want to uncomment/delete these?

@abelanger5 abelanger5 merged commit beaced1 into belanger/beta-events-api Feb 1, 2026
46 checks passed
@abelanger5 abelanger5 deleted the belanger/optimistic-scheduling-2 branch February 1, 2026 01:27
@abelanger5 abelanger5 mentioned this pull request Feb 2, 2026
1 task
abelanger5 added a commit that referenced this pull request Feb 2, 2026
* placeholder

* feat: db tables for user events (#2862)

* feat: db tables for user events

* move event payloads to payloads table, fix env var loading

* fix: address pr review comments

* missed save

* feat: optimistic scheduling (#2867)

* feat: db tables for user events

* move event payloads to payloads table, fix env var loading

* refactor: small changes to prepare optimistic txs

* feat: optimistic scheduling

* address pr review comments

* rm comments

* fix: rampup test race condition

* fix: goleak

* feat: grpc-side triggers

* fix: config and sem logic

* fix: respect optimistic scheduling env var

* add optimistic to testing matrix, remove pg-only mode

* fix cleanup of pubbuffers

* merge migrations

* last testing fixes
