Fix fillToHead deadlock#264
Conversation
Fix Lighthouse v8.1.0 SSE race condition and reward calculation bugs
fix: historical deadlock, attestation flag, concurrent map race, orphan duties
fix: propagate block changes to dependent epochs after reorg + v3.8.0
Fix RoutineBook.Acquire deadlock causing missing block rewards
fix: transaction value uint64 overflow and Float32 precision loss
Fix ProcessSlashings accumulation + ManualReward race condition + block rewards validation
fix: prevent Wait() deadlocks, remove dead relays, add relay circuit breaker
fix(relay): remove securerpc and wenmerge mainnet relays
|
Thanks for tracking this one down — the deadlock pattern is real, the analysis is right, and the reproduction is convincing. The fix is correct in what it does: it enforces an invariant on entry to A couple of nits worth tightening before merge:
Bigger picture though: this PR addresses the symptom (the handoff is the path that triggers the deadlock today) but the underlying fragility lives deeper in Not asking you to expand the PR — land this as the tactical fix, it does its job. But if you want to file a follow-up issue with the broader saturation pattern (and what would need to change in the event loop to handle it robustly), that would be the higher-leverage contribution. |
leobago
left a comment
There was a problem hiding this comment.
I believe there is a bug introduced with this PR: no s.stop check in the outer loop
If a shutdown is signalled while we're between iterations, runHistorical returns immediately (it checks s.stop at the top of its inner loop), but fillToHead's new outer loop has no s.stop guard. Since the chain keeps advancing regardless, RequestCurrentHead() will keep returning a value above handoffThreshold, and the loop will spin — calling runHistorical over and over, each returning immediately. This is an infinite busy-loop on shutdown. Fix:
for {
s.wgMainRoutine.Add(1)
s.runHistorical(nextSlotDownload, headSlot)
if s.stop {
return headSlot
}
nextSlotDownload = headSlot + 1
newHead := s.cli.RequestCurrentHead()
...
@Zyra-V21 please fix this and add the changes you requested before merging.
Two issues addressed on top of the existing outer-loop change:
1. Shutdown busy-loop. The outer `for` loop did not check `s.stop`
between iterations. `runHistorical` returns immediately on
`s.stop`, but since the chain keeps advancing the new outer loop
re-queried `RequestCurrentHead` and called `runHistorical` again,
producing a tight CPU-bound spin on shutdown. Add an explicit
`if s.stop { return headSlot }` guard right after `runHistorical`.
2. Handoff threshold sits exactly on the pool capacity. The previous
threshold was `SlotsPerEpoch` (32 slots), which is also the size
of `processerBook` (`utils.NewRoutineBook(32, ...)` in
chain_analyzer.go). Returning with a 32-slot gap lets `runHead`'s
first enqueue burst fill every page in the pool; if any of those
slots hit a cross-epoch `BlockHistory.Wait` dependency, the pool
deadlocks — the failure mode this loop was added to avoid in the
first place. Drop the threshold to `SlotsPerEpoch / 2` so there
is room for the cross-epoch dependencies to land without the
first dispatch burst sitting on the edge of the pool.
The threshold change adds at most one or two extra iterations near
the end of catch-up (each iteration is bounded by `runHistorical`
draining its slot range) and removes the only path that can leave
`runHead` starting in an immediately-saturated state.
|
Done! @leobago |
Motivation
After a long historical backfill,
fillToHeadreturns the originalheadSlotit queried at the start and hands off torunHead. If the chain has moved forward more thanSlotsPerEpochduring the backfill,runHeadstarts with a gap larger than the processer pool can safely absorb.runHead's tight loop tries to enqueue every slot between the oldnextSlotDownloadand the current head SSE event:When the gap is hundreds or thousands of slots wide, every page in
processerBookends up held byProcessBlock/ProcessStateTransitionMetricsgoroutines that are blocked onBlockHistory.Waitfor cross-epoch dependencies (state metrics for epoch E need blocks from epoch E-1, which are still in flight). No page ever frees, the loop spins forever, and the head channel is never drained. Goteth deadlocks until restart.The symptom is repeated Waiting for too long to acquire page slot=N and Waiting for spec.AgnosticBlock M warnings on the same slots over many minutes.
Keeping the historical-to-head handoff gap below one epoch keeps the page demand bounded so the deadlock cannot form.
Related links:
Description
Wrap the
runHistoricalcall infillToHeadwith a loop that re-queriesRequestCurrentHeadafter each pass. If the new head is more than SlotsPerEpoch ahead of the previous headSlot, run another historical pass for the gap. Once the gap is within one epoch, return and let runHead take over.Type of change
Tasks
Testing
go build ./...End-to-end verified on production. The deadlock requires real backpressure on processerBook plus chain advancement during backfill, so unit-testing means mocking the beacon client, SSE stream, and processer pool together.
Reproduction steps (for bug fixes)
Waiting for too long to acquire page slot=N warningson the same slots over many minutes with no progress.Mitigation options considered
Proof of Success
Real run on 2026-05-01 showing the loop converging and handing off cleanly:
Three iterations: 1066-slot gap, 123-slot gap, then under one epoch so the handoff happens.
Pre-fix run that deadlocked (2026-04-28) shows the failure mode for comparison:
Documentation
README.mdupdated (if user-facing flag, install, or run change)docs/tables.mdupdated (if persisted schema change)Backwards compatibility
No
Reviewer notes
Add(1)is inside the loop because runHistorical does defers.wgMainRoutine.Done()on entry. Each iteration is a self-contained add/done pair.