Fix fillToHead deadlock by BitWonka · Pull Request #264 · migalabs/goteth

BitWonka · 2026-05-07T16:19:02Z

Motivation

After a long historical backfill, fillToHead returns the original headSlot it queried at the start and hands off to runHead. If the chain has moved forward more than SlotsPerEpoch during the backfill, runHead starts with a gap larger than the processer pool can safely absorb.

runHead's tight loop tries to enqueue every slot between the old nextSlotDownload and the current head SSE event:

for nextSlotDownload <= event.HeadEvent.Slot {
    if s.processerBook.NumFreePages() > 0 {
        s.downloadTaskChan <- nextSlotDownload
        nextSlotDownload++
    }
}

When the gap is hundreds or thousands of slots wide, every page in processerBook ends up held by ProcessBlock/ProcessStateTransitionMetrics goroutines that are blocked on BlockHistory.Wait for cross-epoch dependencies (state metrics for epoch E need blocks from epoch E-1, which are still in flight). No page ever frees, the loop spins forever, and the head channel is never drained. Goteth deadlocks until restart.

The symptom is repeated "Waiting for too long to acquire page slot=N" and "Waiting for spec.AgnosticBlock M" warnings on the same slots over many minutes.

Keeping the historical-to-head handoff gap below one epoch keeps the page demand bounded so the deadlock cannot form.

Related links:

Description

Wrap the runHistorical call in fillToHead with a loop that re-queries RequestCurrentHead after each pass. If the new head is more than SlotsPerEpoch ahead of the previous headSlot, run another historical pass for the gap. Once the gap is within one epoch, return and let runHead take over.
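
For readers who want the shape of the loop without opening the diff, here is a minimal sketch of the idea. It is not the code in this PR: `headSource`, `scriptedHead`, `runDemo`, the function-valued `runHistorical` parameter and the start slot 14230000 are hypothetical stand-ins, and only the `RequestCurrentHead` name and the SlotsPerEpoch threshold come from the description above. The scripted head values are taken from the slot ranges in the log under Proof of Success, except the last one, which is an assumed value under one epoch ahead.

```go
package main

import "fmt"

// slotsPerEpoch is the handoff threshold named in the description (32 slots).
const slotsPerEpoch uint64 = 32

// headSource is a hypothetical stand-in for the beacon client.
type headSource interface {
	RequestCurrentHead() uint64
}

// fillToHead repeats historical passes until the live head is at most one
// epoch ahead of the last slot covered, then returns that slot.
func fillToHead(cli headSource, runHistorical func(from, to uint64), next uint64) uint64 {
	headSlot := cli.RequestCurrentHead()
	for {
		runHistorical(next, headSlot)
		next = headSlot + 1
		newHead := cli.RequestCurrentHead()
		if newHead <= headSlot+slotsPerEpoch {
			return headSlot // gap is small enough for the head loop
		}
		fmt.Printf("head moved %d slots during catch-up, looping historical\n", newHead-headSlot)
		headSlot = newHead
	}
}

// scriptedHead replays a fixed sequence of head slots (test double).
type scriptedHead struct {
	heads []uint64
	i     int
}

func (s *scriptedHead) RequestCurrentHead() uint64 {
	h := s.heads[s.i]
	if s.i < len(s.heads)-1 {
		s.i++
	}
	return h
}

// runDemo drives the loop with the scripted heads and records each pass.
func runDemo() (uint64, [][2]uint64) {
	cli := &scriptedHead{heads: []uint64{14230452, 14231518, 14231641, 14231655}}
	var passes [][2]uint64
	record := func(from, to uint64) { passes = append(passes, [2]uint64{from, to}) }
	last := fillToHead(cli, record, 14230000) // start slot is an assumption
	return last, passes
}

func main() {
	last, passes := runDemo()
	fmt.Println("passes:", passes)
	fmt.Println("handoff after slot:", last)
}
```

With these scripted values the sketch runs three passes and hands off after slot 14231641, the same shape as the convergence in the production log.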

Type of change

Bug fix (non-breaking)
New feature (non-breaking)
Breaking change (CLI flag rename, schema change, behavior change)
Documentation only
Refactor / internal cleanup
Performance improvement

Tasks

Add loop around runHistorical
Re-query head after each pass
Handoff threshold of SlotsPerEpoch
Verified deployed: logs show convergence (large gap, smaller gap, handoff)

Testing

go build ./...

End-to-end verified on production. The deadlock requires real backpressure on processerBook plus chain advancement during backfill, so unit-testing means mocking the beacon client, SSE stream, and processer pool together.

Reproduction steps (for bug fixes)

Run goteth, then stop it for a few hours.
Restart goteth.
While runHistorical runs, the chain advances by another K slots beyond the start-time head.
When runHistorical returns, runHead takes over with a K-slot gap.
If K is large enough (in practice, more than the processer pool size, ~hundreds of slots), the loop in runHead saturates processerBook and deadlocks. Logs show repeated Waiting for too long to acquire page slot=N warnings on the same slots over many minutes with no progress.

Mitigation options considered

Resize the processer pool: papers over the deadlock without fixing the root cause. A larger pool just shifts the threshold.
Single re-check, no loop: insufficient when the historical pass itself takes long enough that the chain moves another epoch during the second pass.

Proof of Success

Real run on 2026-05-01 showing the loop converging and handing off cleanly:

time="2026-05-01T02:24:13Z" level=info msg="head moved 1066 slots during catch-up, looping historical" module=analyzer
time="2026-05-01T02:24:13Z" level=info msg="Switch to historical mode: 14230453 - 14231518" module=analyzer
time="2026-05-01T02:48:49Z" level=info msg="historical mode: all download tasks sent" module=analyzer
time="2026-05-01T02:48:49Z" level=info msg="head moved 123 slots during catch-up, looping historical" module=analyzer
time="2026-05-01T02:48:49Z" level=info msg="Switch to historical mode: 14231519 - 14231641" module=analyzer
time="2026-05-01T02:51:46Z" level=info msg="historical mode: all download tasks sent" module=analyzer
time="2026-05-01T02:51:46Z" level=info msg="waiting for remaining historical blocks (14231481 to 14231641) to complete..." module=analyzer
time="2026-05-01T02:52:30Z" level=info msg="Switch to head mode: following chain head" module=analyzer

Three iterations: 1066-slot gap, 123-slot gap, then under one epoch so the handoff happens.
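
The gap sizes follow directly from how long each pass took. As a rough check, assuming 12-second slots (consistent with the "32 slots, ~6.4 minutes" figure in the reviewer notes), the timestamps in the log above can be converted to slots. `slotsDuring` is a hypothetical helper written for this check, not part of goteth.

```go
package main

import (
	"fmt"
	"time"
)

// secondsPerSlot is assumed from "32 slots, ~6.4 minutes" in the reviewer notes.
const secondsPerSlot = 12

// slotsDuring returns the number of whole slots between two RFC 3339 timestamps.
func slotsDuring(start, end string) (uint64, error) {
	t0, err := time.Parse(time.RFC3339, start)
	if err != nil {
		return 0, err
	}
	t1, err := time.Parse(time.RFC3339, end)
	if err != nil {
		return 0, err
	}
	return uint64(t1.Sub(t0) / (secondsPerSlot * time.Second)), nil
}

func main() {
	// Timestamps copied from the 2026-05-01 log lines above.
	first, _ := slotsDuring("2026-05-01T02:24:13Z", "2026-05-01T02:48:49Z")
	second, _ := slotsDuring("2026-05-01T02:48:49Z", "2026-05-01T02:51:46Z")
	fmt.Println("slots elapsed during first logged pass:", first)   // 123
	fmt.Println("slots elapsed during second logged pass:", second) // 14
}
```

The 1066-slot pass took 24m36s, which is exactly 123 slots and matches the second "head moved" line. The 123-slot pass took under three minutes, about 14 slots, so the gap fell below one epoch and the handoff happened.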

Pre-fix run that deadlocked (2026-04-28) shows the failure mode for comparison:

time="2026-04-28T05:57:24Z" level=warning msg="Waiting for spec.AgnosticBlock 14210688..." module=analyzer
time="2026-04-28T05:57:25Z" level=warning msg="Waiting for too long to acquire page slot=14210913..." bookTag=processer module=utils
time="2026-04-28T05:57:25Z" level=warning msg="Waiting for spec.AgnosticBlock 14210910..." module=analyzer
... (same warnings on the same slots repeating for 15+ minutes, no progress)

Documentation

README.md updated (if user-facing flag, install, or run change)
docs/tables.md updated (if persisted schema change)
Inline comments added where the why is non-obvious

Backwards compatibility

No

Reviewer notes

Handoff threshold is SlotsPerEpoch (32 slots, ~6.4 minutes). Small enough that runHead cannot saturate the processer pool, large enough to avoid edge cases where head moves a slot or two during the head re-query itself.
The wait-group Add(1) is inside the loop because runHistorical does defer s.wgMainRoutine.Done() on entry. Each iteration is a self-contained add/done pair.
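
To illustrate the wait-group note, here is a small sketch of the pairing, assuming only what is stated above: `runHistorical` defers `Done()` on entry, so each call needs its own `Add(1)`. The `analyzer` struct, `runPasses` and the `passes` counter are hypothetical names for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// analyzer is a hypothetical stand-in holding only the wait group.
type analyzer struct {
	wgMainRoutine sync.WaitGroup
	passes        int
}

// runHistorical defers Done on entry, as described in the reviewer notes.
func (s *analyzer) runHistorical(from, to uint64) {
	defer s.wgMainRoutine.Done()
	s.passes++ // placeholder for the real download work
}

// runPasses calls Add(1) once per iteration so every Done has a matching Add.
func (s *analyzer) runPasses(ranges [][2]uint64) {
	for _, r := range ranges {
		s.wgMainRoutine.Add(1)
		s.runHistorical(r[0], r[1])
	}
}

func main() {
	s := &analyzer{}
	s.runPasses([][2]uint64{{1, 10}, {11, 20}, {21, 30}})
	s.wgMainRoutine.Wait() // counter is back at zero, so this returns at once
	fmt.Println("passes completed:", s.passes)
}
```

If `Add(1)` were hoisted out of the loop, the second `Done()` would drive the counter negative and `sync.WaitGroup` would panic, which is why the add stays inside.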

Zyra-V21 · 2026-05-11T08:26:14Z

Thanks for tracking this one down — the deadlock pattern is real, the analysis is right, and the reproduction is convincing. The fix is correct in what it does: it enforces an invariant on entry to runHead (the handoff gap is bounded), and that invariant happens to be exactly what runHead's inner enqueue loop needs in order not to deadlock against processerBook. Happy with the wgMainRoutine balancing and the convergence behavior shown in the log output.

A couple of nits worth tightening before merge:

Threshold has no headroom. SlotsPerEpoch is exactly processerBook's capacity, so the last loop iteration can leave a gap that immediately saturates the pool the moment runHead starts dispatching. Worth dropping the threshold a bit (or deriving it from the pool size with margin) so the first burst of head events doesn't sit on the edge.
The inline comment could spell out the link between the threshold and the pool size — right now a future reader sees SlotsPerEpoch and has to reverse-engineer why that number specifically. One extra line explaining "matches processerBook capacity so runHead's enqueue burst cannot saturate the pool" would save someone the archaeology later.

Bigger picture though: this PR addresses the symptom (the handoff is the path that triggers the deadlock today) but the underlying fragility lives deeper in runHead's event handler — specifically in how the inner dispatch loop interacts with processerBook under saturation. The handoff invariant keeps the bug out of the hot path, which is fine for shipping. But if you have the appetite, it would be top-tier to follow this up with a more complete RCA on runHead's saturation behavior — i.e. what happens when the pool fills for reasons other than handoff (slow ClickHouse flush, SSE burst after reconnection, etc.). The shape of the deadlock you describe isn't unique to the handoff; it's reachable any time the pool stays full long enough for cross-epoch dependencies to lock the chain. The same patch wouldn't help in those scenarios. Worth a look.

Not asking you to expand the PR — land this as the tactical fix, it does its job. But if you want to file a follow-up issue with the broader saturation pattern (and what would need to change in the event loop to handle it robustly), that would be the higher-leverage contribution.

leobago

I believe there is a bug introduced with this PR: no s.stop check in the outer loop

If a shutdown is signalled while we're between iterations, runHistorical returns immediately (it checks s.stop at the top of its inner loop), but fillToHead's new outer loop has no s.stop guard. Since the chain keeps advancing regardless, RequestCurrentHead() will keep returning a value above handoffThreshold, and the loop will spin — calling runHistorical over and over, each returning immediately. This is an infinite busy-loop on shutdown. Fix:

for {
    s.wgMainRoutine.Add(1)
    s.runHistorical(nextSlotDownload, headSlot)

    if s.stop {
        return headSlot
    }

    nextSlotDownload = headSlot + 1
    newHead := s.cli.RequestCurrentHead()
    ...

@Zyra-V21 please fix this and add the changes you requested before merging.

Two issues addressed on top of the existing outer-loop change:

1. Shutdown busy-loop. The outer `for` loop did not check `s.stop` between iterations. `runHistorical` returns immediately on `s.stop`, but since the chain keeps advancing the new outer loop re-queried `RequestCurrentHead` and called `runHistorical` again, producing a tight CPU-bound spin on shutdown. Add an explicit `if s.stop { return headSlot }` guard right after `runHistorical`.

2. Handoff threshold sits exactly on the pool capacity. The previous threshold was `SlotsPerEpoch` (32 slots), which is also the size of `processerBook` (`utils.NewRoutineBook(32, ...)` in chain_analyzer.go). Returning with a 32-slot gap lets `runHead`'s first enqueue burst fill every page in the pool; if any of those slots hit a cross-epoch `BlockHistory.Wait` dependency, the pool deadlocks, which is the failure mode this loop was added to avoid in the first place. Drop the threshold to `SlotsPerEpoch / 2` so there is room for the cross-epoch dependencies to land without the first dispatch burst sitting on the edge of the pool.

The threshold change adds at most one or two extra iterations near the end of catch-up (each iteration is bounded by `runHistorical` draining its slot range) and removes the only path that can leave `runHead` starting in an immediately-saturated state.
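
A minimal sketch of how the shutdown guard and the halved threshold fit together, under stated assumptions: the `analyzer` type, the simulated `requestCurrentHead` (which advances 100 slots per query so the gap never closes) and the use of `atomic.Bool` for the stop flag are all hypothetical choices for the example. The real `s.stop` field and client call may look different.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const slotsPerEpoch uint64 = 32

// handoffThreshold is half the pool capacity (32 pages), per the change above.
const handoffThreshold = slotsPerEpoch / 2

// analyzer is a hypothetical stand-in with a simulated chain head.
type analyzer struct {
	stop  atomic.Bool // assumption: the real flag may be a plain bool
	head  uint64
	calls int
}

// requestCurrentHead simulates a chain that always moves 100 slots per query.
func (s *analyzer) requestCurrentHead() uint64 {
	s.head += 100
	return s.head
}

// runHistorical simulates a shutdown signal arriving during the second pass.
func (s *analyzer) runHistorical(from, to uint64) {
	s.calls++
	if s.calls == 2 {
		s.stop.Store(true)
	}
}

func (s *analyzer) fillToHead(next uint64) uint64 {
	headSlot := s.requestCurrentHead()
	for {
		s.runHistorical(next, headSlot)
		if s.stop.Load() {
			return headSlot // without this guard the loop below never exits
		}
		next = headSlot + 1
		newHead := s.requestCurrentHead()
		if newHead <= headSlot+handoffThreshold {
			return headSlot
		}
		headSlot = newHead
	}
}

func main() {
	s := &analyzer{}
	last := s.fillToHead(0)
	fmt.Println("returned at slot", last, "after", s.calls, "passes")
}
```

Because the simulated head always jumps past the threshold, the only exit in this run is the stop guard; removing it turns the example into the busy-loop described in item 1.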

Zyra-V21 · 2026-06-01T09:14:24Z

Done! @leobago

leobago and others added 12 commits February 17, 2026 16:29

Merge pull request migalabs#219 from migalabs/dev

a848734

Fix Lighthouse v8.1.0 SSE race condition and reward calculation bugs

Merge pull request migalabs#221 from migalabs/dev

c45331c

fix: historical deadlock, attestation flag, concurrent map race, orphan duties

Merge pull request migalabs#230 from migalabs/dev

82a6cee

fix: propagate block changes to dependent epochs after reorg + v3.8.0

Merge pull request migalabs#233 from migalabs/dev

f2326f2

Fix RoutineBook.Acquire deadlock causing missing block rewards

Merge pull request migalabs#236 from migalabs/dev

80a735e

fix: transaction value uint64 overflow and Float32 precision loss

Merge pull request migalabs#241 from migalabs/dev

0e680b2

Fix ProcessSlashings accumulation + ManualReward race condition + block rewards validation

Merge pull request migalabs#246 from migalabs/dev

3b64392

fix: prevent Wait() deadlocks, remove dead relays, add relay circuit breaker

Merge pull request migalabs#252 from migalabs/dev

72c16a0

fix(relay): remove securerpc and wenmerge mainnet relays

Merge pull request migalabs#257 from migalabs/dev

0ccb525

v3.8.1

feat: loop fillToHead until live head gap is within one epoch

6c7c3f0

fix: return correct slot in loop

063f310

trim verbose comment

e5db1b7

BitWonka changed the title ~~Fix fillToHead~~ Fix fillToHead deadlock May 7, 2026

Zyra-V21 mentioned this pull request May 12, 2026

fix: prevent goteth stalls on networks with large validator sets #269

Merged

3 tasks

leobago requested changes Jun 1, 2026

leobago merged commit 7bc6f27 into migalabs:dev Jun 1, 2026

leobago merged 13 commits into migalabs:dev from BitWonka:feat-fill-to-head-loop
