Remove dispatch backlog, replace with timeout lock acquisition by juliusgeo · Pull Request #3290 · hatchet-dev/hatchet

juliusgeo · 2026-03-16T15:33:56Z

Description

Prior to this change, there was a fixed backlog of roughly 20 simultaneous worker dispatches before further dispatches started erroring out to be rescheduled. This was suboptimal because it added another buffer on top of the gRPC and TCP flow control buffers that already exist internally. The SendMsg call only blocks once both 1) the TCP buffer and 2) the gRPC buffer are exhausted, so the additional worker backlog only kicked in after 19 more worker sends had queued behind both of those buffers. With this change, a timeout controls whether we send tasks back to the scheduler: we check how long it takes to acquire the lock surrounding SendMsg.

Fixes # (issue)

Type of change

Breaking change (fix or feature that would cause existing functionality to not work as expected)

What's Changed

Removes maxWorkerBacklogSize
Adds WorkerLockAcquisitionTimeout

vercel · 2026-03-16T15:34:15Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hatchet-docs	Ready	Preview, Comment	Mar 17, 2026 7:54pm

promptless-for-oss · 2026-03-16T21:03:49Z

📝 Documentation updates detected!

New suggestion: Update flow control env var from backlog size to lock acquisition timeout

abelanger5 · 2026-03-16T21:09:43Z

+	stopTime := time.Now().Add(timeout)
+	for time.Now().Before(stopTime) {
+		if worker.sendMu.TryLock() {
+			return true
+		}
+		time.Sleep(5 * time.Millisecond) // small backoff to avoid busy spinning
+	}


I don't love this pattern for a few reasons:

It doesn't respect ordering, so workers can be crowded out if we happen to call TryLock from a different task send

It feels unpredictable what sort of CPU load we'll see from TryLock running this often

Is there a way we can implement the worker mutex as a semaphore on a channel instead, and have a channel call using time.After, and then have something like:

select { <- worker.sendSemaphore: <- time.After: }

(Not real code but hopefully clear what I mean)

The semaphore can be guarded on either send or receive.

This should be efficient and also respect ordering?

That makes sense. I changed it so it uses a chan as a semaphore.
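
A minimal sketch of the receive-guarded variant of this suggestion, for comparison with the send-guarded version that appears later in the review. The `tokenLock` type, its methods, and `demoTokenLock` are hypothetical names for illustration, not code from the PR.

```go
package main

import "time"

// tokenLock is a hypothetical one-slot lock built on a buffered channel.
// The channel starts out holding one token; whoever receives it owns the lock.
type tokenLock struct {
	tokens chan struct{}
}

func newTokenLock() *tokenLock {
	l := &tokenLock{tokens: make(chan struct{}, 1)}
	l.tokens <- struct{}{} // the lock starts out free
	return l
}

// acquire waits up to timeout for the token and reports whether it got it.
func (l *tokenLock) acquire(timeout time.Duration) bool {
	select {
	case <-l.tokens:
		return true
	case <-time.After(timeout):
		return false
	}
}

// release puts the token back so that a waiting caller can proceed.
func (l *tokenLock) release() {
	l.tokens <- struct{}{}
}

// demoTokenLock acquires twice without releasing, then once more after releasing.
func demoTokenLock() (first, second, third bool) {
	l := newTokenLock()
	first = l.acquire(10 * time.Millisecond)
	second = l.acquire(10 * time.Millisecond) // token is taken, so this times out
	l.release()
	third = l.acquire(10 * time.Millisecond)
	return first, second, third
}
```

Blocked callers park inside the `select` instead of polling, so there is no sleep loop to tune. Guarding on send instead only flips which side of the channel means "held".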

abelanger5 · 2026-03-16T21:11:04Z

-	GRPCWorkerStreamMaxBacklogSize int `mapstructure:"grpcWorkerStreamMaxBacklogSize" json:"grpcWorkerStreamMaxBacklogSize,omitempty" default:"20"`
+	// GRPCWorkerMaxLockAcquisitionTimeMS is the maximum number of milliseconds that the dispatcher will wait while attempting
+	// to send messages to workers. If it waits longer, the request will be rejected. Default is 250
+	GRPCWorkerMaxWorkerLockAcquisitionTimeMS int `mapstructure:"grpcWorkerMaxLockAcquisitionTimeMS" json:"grpcWorkerMaxLockAcquisitionTimeMS,omitempty" default:"250"`


Would it make more sense for this to be a time.Duration? I can't actually remember if viper supports unmarshalling into a time.Duration directly or if we need to parse a string. I think it might have the advantage of making the config slightly more readable

Looks like Viper does support time.Duration unmarshalling. Fixed!
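
To illustrate the readability argument, here is a small sketch using only the standard library's `time.ParseDuration`. It does not go through Viper, and `parseLockTimeout` is a hypothetical helper, not something the PR adds.

```go
package main

import (
	"fmt"
	"time"
)

// parseLockTimeout is a hypothetical helper that turns a human-readable
// setting such as "250ms" into a time.Duration and rejects bad values.
func parseLockTimeout(raw string) (time.Duration, error) {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid lock acquisition timeout %q: %w", raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("lock acquisition timeout must be positive, got %s", d)
	}
	return d, nil
}
```

With a duration, the unit travels with the value ("250ms", "1s"), whereas a bare integer such as "250" is rejected by `time.ParseDuration` because it has no unit.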

github-actions · 2026-03-17T14:54:26Z

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/internal/msgqueue/rabbitmq
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
                              │ /tmp/old.txt │            /tmp/new.txt            │
                              │    sec/op    │    sec/op     vs base              │
CompressPayloads_1x10KiB-8      75.62µ ±  1%   76.59µ ±  2%       ~ (p=0.065 n=6)
CompressPayloads_10x10KiB-8     874.7µ ±  2%   882.9µ ±  2%       ~ (p=0.310 n=6)
CompressPayloads_10x100KiB-8    10.29m ±  2%   10.33m ±  3%       ~ (p=0.589 n=6)
CompressPayloads_Concurrent-8   65.20µ ± 26%   69.26µ ± 19%       ~ (p=0.132 n=6)
geomean                         459.0µ         469.0µ        +2.18%

                              │ /tmp/old.txt │            /tmp/new.txt            │
                              │     B/op     │     B/op      vs base              │
CompressPayloads_1x10KiB-8      11.23Ki ± 3%   11.03Ki ± 3%       ~ (p=0.329 n=6)
CompressPayloads_10x10KiB-8     110.0Ki ± 2%   111.8Ki ± 2%       ~ (p=0.132 n=6)
CompressPayloads_10x100KiB-8    2.920Mi ± 0%   2.920Mi ± 1%       ~ (p=0.554 n=6)
CompressPayloads_Concurrent-8   54.23Ki ± 0%   54.33Ki ± 0%       ~ (p=0.180 n=6)
geomean                         119.0Ki        119.0Ki       -0.00%

                              │ /tmp/old.txt │            /tmp/new.txt            │
                              │  allocs/op   │ allocs/op   vs base                │
CompressPayloads_1x10KiB-8        5.000 ± 0%   5.000 ± 0%       ~ (p=1.000 n=6) ¹
CompressPayloads_10x10KiB-8       32.00 ± 0%   32.00 ± 0%       ~ (p=1.000 n=6) ¹
CompressPayloads_10x100KiB-8      63.00 ± 0%   63.00 ± 2%       ~ (p=1.000 n=6)
CompressPayloads_Concurrent-8     17.00 ± 0%   17.00 ± 0%       ~ (p=1.000 n=6) ¹
geomean                           20.35        20.35       +0.00%
¹ all samples are equal

pkg: github.com/hatchet-dev/hatchet/internal/services/dispatcher
                  │ /tmp/new.txt │
                  │    sec/op    │
LockAcquisition-8   602.5n ± 76%

                  │ /tmp/new.txt │
                  │     B/op     │
LockAcquisition-8   375.0 ± 153%

                  │ /tmp/new.txt │
                  │  allocs/op   │
LockAcquisition-8    4.000 ± 50%

pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
              │ /tmp/old.txt │            /tmp/new.txt             │
              │    sec/op    │    sec/op     vs base               │
RateLimiter-8    40.55µ ± 5%   44.86µ ± 10%  +10.62% (p=0.015 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.152 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (aebad9e)

abelanger5

nice!

abelanger5 · 2026-03-17T16:44:54Z

+func (worker *subscribedWorker) tryAcquireSendLockWithTimeout(timeout time.Duration) bool {
+	select {
+	// attempt to send to the semaphore, blocks on contention because it has a buffer of 1
+	case worker.sendSemaphore <- struct{}{}:
+		return true
+	// timing out dequeues the semaphore send
+	case <-time.After(timeout):
+		return false
+	}
+}
+
+func (worker *subscribedWorker) releaseSendLock() {
+	<-worker.sendSemaphore
+}


this looks much cleaner, nice!
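
To see how a send-guarded semaphore like this behaves under contention, here is a small self-contained sketch. `sendLock`, `contend`, and `demoContend` are hypothetical names, and the sleep is only a stand-in for the stream send.

```go
package main

import (
	"sync"
	"time"
)

// sendLock mirrors the shape of the semaphore in the diff above:
// a channel with a buffer of 1, where a successful send means "held".
type sendLock struct{ sem chan struct{} }

func newSendLock() *sendLock { return &sendLock{sem: make(chan struct{}, 1)} }

func (l *sendLock) tryAcquire(timeout time.Duration) bool {
	select {
	case l.sem <- struct{}{}:
		return true
	case <-time.After(timeout):
		return false
	}
}

func (l *sendLock) release() { <-l.sem }

// contend starts n goroutines that each try to take the lock once and hold it
// for hold. It returns how many acquired it, how many timed out, and the
// largest number of simultaneous holders observed.
func contend(n int, hold, timeout time.Duration) (acquired, timedOut, maxHolders int) {
	l := newSendLock()
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // protects the counters
		holders int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if !l.tryAcquire(timeout) {
				mu.Lock()
				timedOut++
				mu.Unlock()
				return
			}
			mu.Lock()
			acquired++
			holders++
			if holders > maxHolders {
				maxHolders = holders
			}
			mu.Unlock()

			time.Sleep(hold) // stand-in for the stream send

			mu.Lock()
			holders--
			mu.Unlock()
			l.release()
		}()
	}
	wg.Wait()
	return acquired, timedOut, maxHolders
}

// demoContend runs four senders against a slow holder and a short timeout.
func demoContend() (acquired, timedOut, maxHolders int) {
	return contend(4, 50*time.Millisecond, 5*time.Millisecond)
}
```

Because the hold time is much longer than the timeout, most callers in `demoContend` should give up rather than queue, which is the behavior the PR relies on to hand tasks back to the scheduler.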

Copilot

Pull request overview

This PR replaces the dispatcher’s per-worker “backlog size” flow control with a timeout-based lock acquisition mechanism to prevent unbounded queuing and reject sends when the worker stream is congested.

Changes:

Replace backlog tracking (backlogSize/maxBacklogSize) with a per-worker send semaphore + configurable lock acquisition timeout.
Add runtime/server config wiring for the new lock acquisition timeout setting and plumb it through dispatcher options.
Update dispatcher worker construction to pass the new timeout (with some remaining hard-coded behavior in the v1 Listen endpoint).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
pkg/config/server/server.go	Replaces backlog-size runtime setting with a duration-based lock acquisition timeout and binds new env var.
internal/services/dispatcher/subscribed_worker_v1.go	Implements send lock acquisition with timeout and removes backlog counters in worker send paths.
internal/services/dispatcher/subscribed_worker.go	Updates `subscribedWorker` to use a semaphore + timeout instead of a mutex/backlog counters.
internal/services/dispatcher/server.go	Passes lock acquisition timeout when constructing subscribed workers (v1 `Listen` currently hard-coded).
internal/services/dispatcher/dispatcher.go	Renames/plumbs dispatcher option from backlog size to lock acquisition timeout (duration).
cmd/hatchet-engine/engine/run.go	Wires server runtime config into dispatcher via the new option.

Comments suppressed due to low confidence (3)

internal/services/dispatcher/subscribed_worker_v1.go:160

sendToWorker releases the send lock via defer worker.releaseSendLock() in the parent goroutine, but the actual stream.SendMsg happens in a spawned goroutine. If ctx.Done() fires first, sendToWorker returns and releases the lock while the send goroutine may still be running, allowing concurrent sends on the same gRPC stream (not thread-safe). Move the lock release into the send goroutine so it is held until SendMsg completes (similar to the CancelTask pattern).

	defer worker.releaseSendLock()
	defer lockSpan.End()

	telemetry.WithAttributes(span, telemetry.AttributeKV{
		Key:   "lock.duration_ms",
		Value: time.Since(lockBegin).Milliseconds(),
	})

	_, streamSpan := telemetry.NewSpan(ctx, "send-worker-stream")
	defer streamSpan.End()

	sendMsgBegin := time.Now()

	sentCh := make(chan error, 1)

	go func() {
		defer close(sentCh)
		err = worker.stream.SendMsg(msg)

		if err != nil {
			span.RecordError(err)
		}

		if time.Since(sendMsgBegin) > 50*time.Millisecond {
			span.SetStatus(codes.Error, "flow control detected")
			span.RecordError(fmt.Errorf("send took too long, we may be in flow control: %s", time.Since(sendMsgBegin)))
		}

		sentCh <- err
	}()

	select {
	case <-ctx.Done():
		return fmt.Errorf("context done before send could complete: %w", ctx.Err())
	case err = <-sentCh:
		return err
	}
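
One way this suggestion could look, as a sketch under assumptions rather than the fix that was actually applied: the lock is released inside the send goroutine, and the error travels only through the channel, so no variable is shared between the two goroutines. `guardedStream`, its `send` field, and `demoEarlyReturn` are hypothetical stand-ins for the worker, `stream.SendMsg`, and a test harness.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// guardedStream is a hypothetical stand-in for a worker with a send semaphore.
type guardedStream struct {
	sem  chan struct{}          // buffer of 1; a successful send means "held"
	send func(msg string) error // stand-in for stream.SendMsg
}

func (g *guardedStream) sendWithLock(ctx context.Context, msg string, timeout time.Duration) error {
	select {
	case g.sem <- struct{}{}:
	case <-time.After(timeout):
		return fmt.Errorf("could not acquire send lock within %s", timeout)
	}

	sentCh := make(chan error, 1) // buffered so the goroutine never blocks on it
	go func() {
		// Release here, not in the caller, so the lock outlives an early return.
		defer func() { <-g.sem }()
		sentCh <- g.send(msg)
	}()

	select {
	case <-ctx.Done():
		return fmt.Errorf("context done before send could complete: %w", ctx.Err())
	case err := <-sentCh:
		return err
	}
}

// demoEarlyReturn cancels the context while the send is still blocked and
// reports whether the lock is still held right after sendWithLock returns.
func demoEarlyReturn() (returnedErr, stillLocked bool) {
	unblock := make(chan struct{})
	g := &guardedStream{
		sem:  make(chan struct{}, 1),
		send: func(string) error { <-unblock; return nil },
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the caller gives up before the send finishes

	err := g.sendWithLock(ctx, "task", 10*time.Millisecond)
	stillLocked = len(g.sem) == 1
	close(unblock) // let the send goroutine finish and release the lock
	return err != nil, stillLocked
}
```

The caller still returns promptly on cancellation, but a second sender cannot touch the stream until the first send has actually finished.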

internal/services/dispatcher/subscribed_worker_v1.go:153

There is a data race on the outer err variable: the send goroutine assigns to err while the parent goroutine can also assign/read it (case err = <-sentCh). Use a goroutine-local variable (e.g., sendErr) and only communicate it through sentCh (and use that value for RecordError).

	go func() {
		defer close(sentCh)
		err = worker.stream.SendMsg(msg)

		if err != nil {
			span.RecordError(err)
		}

		if time.Since(sendMsgBegin) > 50*time.Millisecond {
			span.SetStatus(codes.Error, "flow control detected")
			span.RecordError(fmt.Errorf("send took too long, we may be in flow control: %s", time.Since(sendMsgBegin)))
		}

		sentCh <- err
	}()

internal/services/dispatcher/subscribed_worker_v1.go:130

The lock acquisition telemetry is currently misleading: lockBegin is set after the lock is already acquired, and lockSpan.End() is deferred until the whole send completes, so neither the span nor lock.duration_ms reflect acquisition time. Start timing/span before attempting to acquire the lock, then end the acquisition span immediately after it is acquired (recording the actual wait duration).

	if !worker.tryAcquireSendLockWithTimeout(worker.sendLockAcquisitionTimeout) {
		err = fmt.Errorf("could not acquire worker send mutex, flow control is active")
		span.RecordError(err)
		span.SetStatus(codes.Error, "flow control is active")
		return err
	}

	lockBegin := time.Now()

	_, lockSpan := telemetry.NewSpan(ctx, "acquire-worker-stream-lock")

	defer worker.releaseSendLock()
	defer lockSpan.End()

	telemetry.WithAttributes(span, telemetry.AttributeKV{
		Key:   "lock.duration_ms",
		Value: time.Since(lockBegin).Milliseconds(),
	})
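
A sketch of the ordering this comment asks for, with the clock started before contending for the lock. `acquireTimed` and `demoAcquireTimed` are hypothetical; real code would attach the measured wait to the span instead of returning it.

```go
package main

import "time"

// acquireTimed tries to take a send-guarded semaphore and reports both the
// outcome and how long the caller actually waited.
func acquireTimed(sem chan struct{}, timeout time.Duration) (ok bool, waited time.Duration) {
	begin := time.Now() // start timing before the acquisition attempt
	select {
	case sem <- struct{}{}:
		ok = true
	case <-time.After(timeout):
		ok = false
	}
	return ok, time.Since(begin)
}

// demoAcquireTimed has another goroutine hold the lock briefly and checks
// that the measured wait reflects that hold time.
func demoAcquireTimed() (ok, waitReflectsHold bool) {
	sem := make(chan struct{}, 1)
	sem <- struct{}{} // someone else holds the lock
	hold := 20 * time.Millisecond
	go func() {
		time.Sleep(hold)
		<-sem // release
	}()
	ok, waited := acquireTimed(sem, time.Second)
	// Compare against half the hold time to leave room for scheduling jitter.
	return ok, waited >= hold/2
}
```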

abelanger5 · 2026-03-17T16:55:14Z

+	select {
+	// attempt to send to the semaphore, blocks on contention because it has a buffer of 1
+	case worker.sendSemaphore <- struct{}{}:
+		return true
+	// timing out dequeues the semaphore send
+	case <-time.After(timeout):


This advice might be outdated:

"Before Go 1.23, this documentation warned that the underlying Timer would not be recovered by the garbage collector until the timer fired, and that if efficiency was a concern, code should use NewTimer instead and call Timer.Stop if the timer is no longer needed. As of Go 1.23, the garbage collector can recover unreferenced, unstopped timers. There is no reason to prefer NewTimer when After will do."

But perhaps it's not a bad idea for the worker to manage a single timer and to cleanup the timer when we return from this method

Would that work with multiple goroutines at the same time? It's slightly unclear from the documentation.

hmm, yeah I'm not entirely sure. can we try writing a Go benchmark to see how it performs? if we can do at least ~50k/second without much issue I think we'll be fine
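
For reference, a sketch of the explicit-timer alternative discussed here. It creates one timer per call, so concurrent callers never share timer state; a single timer shared by the whole worker would need extra coordination that this sketch deliberately avoids. `tryAcquireWithTimer` and `demoTimerAcquire` are hypothetical names, not code from the PR.

```go
package main

import "time"

// tryAcquireWithTimer takes a send-guarded semaphore (buffer of 1) or gives
// up after timeout, stopping its timer on every return path.
func tryAcquireWithTimer(sem chan struct{}, timeout time.Duration) bool {
	// Fast path: when the lock is free, take it without creating a timer.
	select {
	case sem <- struct{}{}:
		return true
	default:
	}

	timer := time.NewTimer(timeout) // owned by this call only
	defer timer.Stop()

	select {
	case sem <- struct{}{}:
		return true
	case <-timer.C:
		return false
	}
}

// demoTimerAcquire acquires once, then tries again while the lock is held.
func demoTimerAcquire() (first, second bool) {
	sem := make(chan struct{}, 1)
	first = tryAcquireWithTimer(sem, 10*time.Millisecond)
	second = tryAcquireWithTimer(sem, 10*time.Millisecond) // held, so this times out
	return first, second
}
```

Whether this is measurably faster than `time.After` on a recent Go version is exactly what a benchmark would need to show; the sketch only demonstrates that the pattern is safe to call from many goroutines at once.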

promptless-for-oss · 2026-03-17T20:13:16Z

📝 Documentation updates detected!

Updated existing suggestion: Update flow control env var from backlog size to lock acquisition timeout

juliusgeo added 2 commits March 13, 2026 13:14

initial commit

6b7b78c

use polling lock

96c6410

vercel Bot deployed to Preview March 16, 2026 15:36 View deployment

clean up diff

1abf24a

vercel Bot deployed to Preview March 16, 2026 15:42 View deployment

Merge branch 'main' into better_dispatch_backlog

f072cc3

vercel Bot deployed to Preview March 16, 2026 15:50 View deployment

juliusgeo added 2 commits March 16, 2026 12:12

tune timeouts

daed95d

parameterize the wait time

34e51dc

vercel Bot deployed to Preview March 16, 2026 18:04 View deployment

juliusgeo added 2 commits March 16, 2026 15:54

use int not int64

4213773

Merge branch 'main' into better_dispatch_backlog

041f4c2

juliusgeo marked this pull request as ready for review March 16, 2026 21:00

juliusgeo requested a review from abelanger5 March 16, 2026 21:00

vercel Bot deployed to Preview March 16, 2026 21:04 View deployment

abelanger5 reviewed Mar 16, 2026

View reviewed changes

use time.duration

183e884

vercel Bot deployed to Preview March 16, 2026 22:14 View deployment

juliusgeo added 2 commits March 16, 2026 21:05

use channel semaphore

beaf3fb

remove references to ms

aa64b9e

vercel Bot deployed to Preview March 17, 2026 13:30 View deployment

tweak argument comment

21fb267

vercel Bot deployed to Preview March 17, 2026 13:38 View deployment

got the order wrong, need to send not block to acquire lock

76343ed

vercel Bot deployed to Preview March 17, 2026 14:55 View deployment

juliusgeo requested a review from abelanger5 March 17, 2026 16:27

abelanger5 approved these changes Mar 17, 2026

View reviewed changes

abelanger5 requested a review from Copilot March 17, 2026 16:45

Copilot started reviewing on behalf of abelanger5 March 17, 2026 16:45 View session

Copilot AI reviewed Mar 17, 2026

View reviewed changes

fix some naming, verbiage

76d62c6

vercel Bot deployed to Preview March 17, 2026 17:04 View deployment

add some tests/benchmarks

a4d3ab8

vercel Bot deployed to Preview March 17, 2026 19:36 View deployment

change tests because they're flaky

d658def

vercel Bot deployed to Preview March 17, 2026 19:54 View deployment

abelanger5 approved these changes Mar 17, 2026

View reviewed changes

juliusgeo merged commit 86b25fe into main Mar 17, 2026
50 of 51 checks passed

juliusgeo deleted the better_dispatch_backlog branch March 17, 2026 20:10

juliusgeo mentioned this pull request Mar 22, 2026

Improve error message when failing to send task to worker #3350

Merged

10 tasks


4 participants
