Remove dispatch backlog, replace with timeout lock acquisition #3290

Merged
juliusgeo merged 16 commits into main from better_dispatch_backlog
Mar 17, 2026

Conversation

@juliusgeo juliusgeo commented Mar 16, 2026

Description

Prior to this change, there was a fixed backlog of roughly 20 simultaneous worker dispatches, after which dispatches would start erroring out to be rescheduled. This was suboptimal because it added another buffer on top of the gRPC and TCP flow control buffers that already exist internally. The SendMsg call only blocks once both (1) the TCP buffer and (2) the gRPC buffer are exhausted, so the additional worker backlog only activated once 19 additional worker sends had queued up after both of those buffers were full. With this change, a timeout controls whether we send tasks back to the scheduler: we check how long it takes to acquire the lock surrounding SendMsg.

Fixes # (issue)

Type of change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

What's Changed

  • Removes maxWorkerBacklogSize
  • Adds WorkerLockAcquisitionTimeout

@juliusgeo juliusgeo marked this pull request as ready for review March 16, 2026 21:00
@juliusgeo juliusgeo requested a review from abelanger5 March 16, 2026 21:00

Comment on lines +23 to +29
stopTime := time.Now().Add(timeout)
for time.Now().Before(stopTime) {
	if worker.sendMu.TryLock() {
		return true
	}
	time.Sleep(5 * time.Millisecond) // small backoff to avoid busy spinning
}

Contributor

I don't love this pattern for a few reasons:

  1. It doesn't respect ordering, so workers can be crowded out if we happen to call TryLock from a different task send
  2. It feels unpredictable what sort of CPU load we'll see from TryLock running this often

Is there a way we can implement the worker mutex as a semaphore on a channel instead, and have a channel call using time.After, and then have something like:

select {
case <-worker.sendSemaphore:
case <-time.After(timeout):
}

(Not real code but hopefully clear what I mean)

The semaphore can be guarded on either send or receive.

This should be efficient and also respect ordering?

Contributor Author

That makes sense. I changed it so it uses a chan as a semaphore.

Comment thread pkg/config/server/server.go Outdated
GRPCWorkerStreamMaxBacklogSize int `mapstructure:"grpcWorkerStreamMaxBacklogSize" json:"grpcWorkerStreamMaxBacklogSize,omitempty" default:"20"`
// GRPCWorkerMaxWorkerLockAcquisitionTimeMS is the maximum number of milliseconds that the dispatcher will wait while attempting
// to send messages to workers. If it waits longer, the request will be rejected. Default is 250.
GRPCWorkerMaxWorkerLockAcquisitionTimeMS int `mapstructure:"grpcWorkerMaxLockAcquisitionTimeMS" json:"grpcWorkerMaxLockAcquisitionTimeMS,omitempty" default:"250"`

Contributor

Would it make more sense for this to be a time.Duration? I can't actually remember if viper supports unmarshalling into a time.Duration directly or if we need to parse a string. I think it might have the advantage of making the config slightly more readable.

Contributor Author

Looks like Viper does support time.Duration unmarshalling. Fixed!

github-actions Bot commented Mar 17, 2026

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/internal/msgqueue/rabbitmq
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
                              │ /tmp/old.txt │            /tmp/new.txt            │
                              │    sec/op    │    sec/op     vs base              │
CompressPayloads_1x10KiB-8      75.62µ ±  1%   76.59µ ±  2%       ~ (p=0.065 n=6)
CompressPayloads_10x10KiB-8     874.7µ ±  2%   882.9µ ±  2%       ~ (p=0.310 n=6)
CompressPayloads_10x100KiB-8    10.29m ±  2%   10.33m ±  3%       ~ (p=0.589 n=6)
CompressPayloads_Concurrent-8   65.20µ ± 26%   69.26µ ± 19%       ~ (p=0.132 n=6)
geomean                         459.0µ         469.0µ        +2.18%

                              │ /tmp/old.txt │            /tmp/new.txt            │
                              │     B/op     │     B/op      vs base              │
CompressPayloads_1x10KiB-8      11.23Ki ± 3%   11.03Ki ± 3%       ~ (p=0.329 n=6)
CompressPayloads_10x10KiB-8     110.0Ki ± 2%   111.8Ki ± 2%       ~ (p=0.132 n=6)
CompressPayloads_10x100KiB-8    2.920Mi ± 0%   2.920Mi ± 1%       ~ (p=0.554 n=6)
CompressPayloads_Concurrent-8   54.23Ki ± 0%   54.33Ki ± 0%       ~ (p=0.180 n=6)
geomean                         119.0Ki        119.0Ki       -0.00%

                              │ /tmp/old.txt │            /tmp/new.txt            │
                              │  allocs/op   │ allocs/op   vs base                │
CompressPayloads_1x10KiB-8        5.000 ± 0%   5.000 ± 0%       ~ (p=1.000 n=6) ¹
CompressPayloads_10x10KiB-8       32.00 ± 0%   32.00 ± 0%       ~ (p=1.000 n=6) ¹
CompressPayloads_10x100KiB-8      63.00 ± 0%   63.00 ± 2%       ~ (p=1.000 n=6)
CompressPayloads_Concurrent-8     17.00 ± 0%   17.00 ± 0%       ~ (p=1.000 n=6) ¹
geomean                           20.35        20.35       +0.00%
¹ all samples are equal

pkg: github.com/hatchet-dev/hatchet/internal/services/dispatcher
                  │ /tmp/new.txt │
                  │    sec/op    │
LockAcquisition-8   602.5n ± 76%

                  │ /tmp/new.txt │
                  │     B/op     │
LockAcquisition-8   375.0 ± 153%

                  │ /tmp/new.txt │
                  │  allocs/op   │
LockAcquisition-8    4.000 ± 50%

pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
              │ /tmp/old.txt │            /tmp/new.txt             │
              │    sec/op    │    sec/op     vs base               │
RateLimiter-8    40.55µ ± 5%   44.86µ ± 10%  +10.62% (p=0.015 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.152 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (aebad9e)

@abelanger5 abelanger5 left a comment

nice!

Comment on lines +22 to +35
func (worker *subscribedWorker) tryAcquireSendLockWithTimeout(timeout time.Duration) bool {
	select {
	// attempt to send to the semaphore, blocks on contention because it has a buffer of 1
	case worker.sendSemaphore <- struct{}{}:
		return true
	// timing out dequeues the semaphore send
	case <-time.After(timeout):
		return false
	}
}

func (worker *subscribedWorker) releaseSendLock() {
	<-worker.sendSemaphore
}

Contributor

this looks much cleaner, nice!

Copilot AI left a comment

Pull request overview

This PR replaces the dispatcher’s per-worker “backlog size” flow control with a timeout-based lock acquisition mechanism to prevent unbounded queuing and reject sends when the worker stream is congested.

Changes:

  • Replace backlog tracking (backlogSize/maxBacklogSize) with a per-worker send semaphore + configurable lock acquisition timeout.
  • Add runtime/server config wiring for the new lock acquisition timeout setting and plumb it through dispatcher options.
  • Update dispatcher worker construction to pass the new timeout (with some remaining hard-coded behavior in the v1 Listen endpoint).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • pkg/config/server/server.go: Replaces backlog-size runtime setting with a duration-based lock acquisition timeout and binds new env var.
  • internal/services/dispatcher/subscribed_worker_v1.go: Implements send lock acquisition with timeout and removes backlog counters in worker send paths.
  • internal/services/dispatcher/subscribed_worker.go: Updates subscribedWorker to use a semaphore + timeout instead of a mutex/backlog counters.
  • internal/services/dispatcher/server.go: Passes lock acquisition timeout when constructing subscribed workers (v1 Listen currently hard-coded).
  • internal/services/dispatcher/dispatcher.go: Renames/plumbs dispatcher option from backlog size to lock acquisition timeout (duration).
  • cmd/hatchet-engine/engine/run.go: Wires server runtime config into dispatcher via the new option.

Comments suppressed due to low confidence (3)

internal/services/dispatcher/subscribed_worker_v1.go:160

  • sendToWorker releases the send lock via defer worker.releaseSendLock() in the parent goroutine, but the actual stream.SendMsg happens in a spawned goroutine. If ctx.Done() fires first, sendToWorker returns and releases the lock while the send goroutine may still be running, allowing concurrent sends on the same gRPC stream (not thread-safe). Move the lock release into the send goroutine so it is held until SendMsg completes (similar to the CancelTask pattern).
	defer worker.releaseSendLock()
	defer lockSpan.End()

	telemetry.WithAttributes(span, telemetry.AttributeKV{
		Key:   "lock.duration_ms",
		Value: time.Since(lockBegin).Milliseconds(),
	})

	_, streamSpan := telemetry.NewSpan(ctx, "send-worker-stream")
	defer streamSpan.End()

	sendMsgBegin := time.Now()

	sentCh := make(chan error, 1)

	go func() {
		defer close(sentCh)
		err = worker.stream.SendMsg(msg)

		if err != nil {
			span.RecordError(err)
		}

		if time.Since(sendMsgBegin) > 50*time.Millisecond {
			span.SetStatus(codes.Error, "flow control detected")
			span.RecordError(fmt.Errorf("send took too long, we may be in flow control: %s", time.Since(sendMsgBegin)))
		}

		sentCh <- err
	}()

	select {
	case <-ctx.Done():
		return fmt.Errorf("context done before send could complete: %w", ctx.Err())
	case err = <-sentCh:
		return err
	}

internal/services/dispatcher/subscribed_worker_v1.go:153

  • There is a data race on the outer err variable: the send goroutine assigns to err while the parent goroutine can also assign/read it (case err = <-sentCh). Use a goroutine-local variable (e.g., sendErr) and only communicate it through sentCh (and use that value for RecordError).
	go func() {
		defer close(sentCh)
		err = worker.stream.SendMsg(msg)

		if err != nil {
			span.RecordError(err)
		}

		if time.Since(sendMsgBegin) > 50*time.Millisecond {
			span.SetStatus(codes.Error, "flow control detected")
			span.RecordError(fmt.Errorf("send took too long, we may be in flow control: %s", time.Since(sendMsgBegin)))
		}

		sentCh <- err
	}()

internal/services/dispatcher/subscribed_worker_v1.go:130

  • The lock acquisition telemetry is currently misleading: lockBegin is set after the lock is already acquired, and lockSpan.End() is deferred until the whole send completes, so neither the span nor lock.duration_ms reflect acquisition time. Start timing/span before attempting to acquire the lock, then end the acquisition span immediately after it is acquired (recording the actual wait duration).
	if !worker.tryAcquireSendLockWithTimeout(worker.sendLockAcquisitionTimeout) {
		err = fmt.Errorf("could not acquire worker send mutex, flow control is active")
		span.RecordError(err)
		span.SetStatus(codes.Error, "flow control is active")
		return err
	}

	lockBegin := time.Now()

	_, lockSpan := telemetry.NewSpan(ctx, "acquire-worker-stream-lock")

	defer worker.releaseSendLock()
	defer lockSpan.End()

	telemetry.WithAttributes(span, telemetry.AttributeKV{
		Key:   "lock.duration_ms",
		Value: time.Since(lockBegin).Milliseconds(),
	})

Comment on lines +23 to +28
select {
// attempt to send to the semaphore, blocks on contention because it has a buffer of 1
case worker.sendSemaphore <- struct{}{}:
	return true
// timing out dequeues the semaphore send
case <-time.After(timeout):

Contributor

This advice might be outdated:

Before Go 1.23, this documentation warned that the underlying Timer would not be recovered by the garbage collector until the timer fired, and that if efficiency was a concern, code should use NewTimer instead and call Timer.Stop if the timer is no longer needed. As of Go 1.23, the garbage collector can recover unreferenced, unstopped timers. There is no reason to prefer NewTimer when After will do.

Contributor

But perhaps it's not a bad idea for the worker to manage a single timer and to clean up the timer when we return from this method.

Contributor Author

Would that work with multiple goroutines at the same time? It's slightly unclear from the documentation.

Contributor

hmm, yeah I'm not entirely sure. can we try writing a Go benchmark to see how it performs? if we can do at least ~50k/second without much issue I think we'll be fine

Comment thread pkg/config/server/server.go Outdated
Comment thread internal/services/dispatcher/server.go Outdated
Comment thread internal/services/dispatcher/subscribed_worker_v1.go Outdated
@juliusgeo juliusgeo merged commit 86b25fe into main Mar 17, 2026
50 of 51 checks passed
@juliusgeo juliusgeo deleted the better_dispatch_backlog branch March 17, 2026 20:10
4 participants