fix: prevent Wait() deadlocks, remove dead relays, add relay circuit breaker #246

Merged: leobago merged 6 commits into master from dev on Apr 7, 2026

@Zyra-V21 (Collaborator) commented Mar 23, 2026

Summary

Three fixes for goteth stability and relay data quality:

fix #245 — StateHistory.Wait deadlock after reorg + AdvanceFinalized

Add ensureDependencyStates() that re-downloads missing states (E, E-1, E-2) before ProcessStateTransitionMetrics in AdvanceFinalized. States evicted by a prior CleanUpTo call caused StateHistory.Wait() to block forever.

fix #248 — Block downloads lost during historical-to-head transition

Wait for ALL blocks in the historical range before switching to head mode. With parallel workers, blocks download out of order — the last slot can complete while intermediate slots are still in-flight, causing BlockHistory.Wait() deadlock.

fix #247 — Dead relays removed + circuit breaker

  • Remove 4 permanently dead mainnet relays (securerpc, wenmerge, titan global/regional) that each wasted a 10s timeout per epoch
  • Add circuit breaker to RelayClient: after 3 consecutive failures, relay is skipped for 2-minute cooldown. After cooldown, one probe is allowed (half-open). Successful probe resets; failed probe re-opens.
  • Refactor RelayClient to use pointer receivers and store address directly

Changes

  • pkg/analyzer/reorg.go: ensureDependencyStates() helper
  • pkg/analyzer/routines.go: full-range wait before head mode switch
  • pkg/relay/constants.go: remove 4 dead relays
  • pkg/relay/relay_bid_trace.go: circuit breaker, pointer receivers

Test plan

Fixes #245, Fixes #247, Fixes #248

…AdvanceFinalized

ProcessStateTransitionMetrics(E) calls StateHistory.Wait for epochs E,
E-1 and E-2. After a reorg, AdvanceFinalized may need to reprocess an
epoch whose dependency states were already evicted by a prior CleanUpTo
call. StateHistory.Wait then blocks forever because nobody re-downloads
the missing state.

Add ensureDependencyStates() that checks Available() for all three
dependency epochs and re-downloads any missing states (and their blocks)
before calling ProcessStateTransitionMetrics.

Fixes #245
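
The dependency check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from pkg/analyzer/reorg.go: `stateStore`, its `Available` method, and the `download` callback are hypothetical stand-ins for StateHistory and the real download path, and block re-download is left out.

```go
package main

import "fmt"

// stateStore is a hypothetical stand-in for StateHistory: it only tracks
// which epochs currently have a state held in memory.
type stateStore struct {
	states map[uint64]bool
}

// Available reports whether the state for epoch is still cached.
func (s *stateStore) Available(epoch uint64) bool { return s.states[epoch] }

// ensureDependencyStates re-downloads any of the states E, E-1 and E-2 that
// are missing, so that a later blocking wait on them cannot hang forever.
// download is a hypothetical callback that fetches the state of one epoch.
func ensureDependencyStates(s *stateStore, epoch uint64, download func(uint64) error) ([]uint64, error) {
	var fetched []uint64
	for back := uint64(0); back <= 2; back++ {
		if back > epoch {
			break // do not underflow below epoch 0
		}
		dep := epoch - back
		if s.Available(dep) {
			continue
		}
		if err := download(dep); err != nil {
			return fetched, fmt.Errorf("re-download state for epoch %d: %w", dep, err)
		}
		s.states[dep] = true
		fetched = append(fetched, dep)
	}
	return fetched, nil
}

func main() {
	// Epoch 10 is cached; epochs 9 and 8 were evicted by an earlier cleanup.
	store := &stateStore{states: map[uint64]bool{10: true}}
	fetched, err := ensureDependencyStates(store, 10, func(uint64) error { return nil })
	fmt.Println(fetched, err)
}
```

Checking availability first keeps the common path cheap: when no reorg has evicted anything, the helper does no downloads at all.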

@leobago (Member) left a comment

LGTM
But let's first test it on eth-arch before merging

…d mode

When transitioning from historical to head mode, runHead() only waited
for the last slot (headSlot) to be downloaded. With parallel download
workers, blocks are fetched out of order — the last slot can complete
while intermediate slots are still in-flight. The switch to head mode
then discards those pending downloads, causing BlockHistory.Wait() to
deadlock forever in the processer.

Now wait for every slot in the historical range [initSlot, headSlot]
before switching to head mode.

Fixes #248
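
A rough sketch of the full-range wait follows. It assumes a per-slot completion channel; `slotTracker`, `MarkDone`, and `WaitRange` are hypothetical names chosen for illustration and are not the BlockHistory API.

```go
package main

import (
	"fmt"
	"sync"
)

// slotTracker is a hypothetical stand-in for BlockHistory: each slot gets a
// channel that is closed once its block has been downloaded.
type slotTracker struct {
	mu   sync.Mutex
	done map[uint64]chan struct{}
}

func newSlotTracker() *slotTracker {
	return &slotTracker{done: make(map[uint64]chan struct{})}
}

// ch returns the completion channel for slot, creating it on first use.
func (t *slotTracker) ch(slot uint64) chan struct{} {
	t.mu.Lock()
	defer t.mu.Unlock()
	c, ok := t.done[slot]
	if !ok {
		c = make(chan struct{})
		t.done[slot] = c
	}
	return c
}

// MarkDone signals that the block for slot is stored. It must be called
// at most once per slot, because closing a channel twice panics.
func (t *slotTracker) MarkDone(slot uint64) { close(t.ch(slot)) }

// WaitRange blocks until every slot in [from, to] is done, not just the last.
func (t *slotTracker) WaitRange(from, to uint64) {
	for s := from; s <= to; s++ {
		<-t.ch(s)
	}
}

func main() {
	t := newSlotTracker()
	const initSlot, headSlot = 100, 104

	var wg sync.WaitGroup
	// Start workers from the last slot backwards to mimic out-of-order downloads.
	for s := uint64(headSlot); s >= initSlot; s-- {
		wg.Add(1)
		go func(slot uint64) {
			defer wg.Done()
			t.MarkDone(slot)
		}(s)
	}

	t.WaitRange(initSlot, headSlot)
	wg.Wait()
	fmt.Println("all historical slots downloaded; safe to switch to head mode")
}
```

Waiting on each slot in order costs nothing extra when downloads finish in order, and it closes the window in which headSlot completes before an intermediate slot.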

@Zyra-V21 changed the title from "fix: prevent StateHistory.Wait deadlock after reorg + AdvanceFinalized" to "fix: prevent Wait() deadlocks in AdvanceFinalized and historical-to-head transition" on Mar 24, 2026

…lays

Remove 4 permanently unreachable mainnet relays that each wasted 10s of
timeout per epoch query:
- mainnet-relay.securerpc.com (Manifold): DNS no such host
- relay.wenmerge.com: HTTP 403 Cloudflare block
- global.titanrelay.xyz: DNS timeout / HTTP 408
- regional.titanrelay.xyz: DNS timeout / HTTP 408

Add a circuit breaker to RelayClient: after 3 consecutive failures, the
relay is skipped for a 2-minute cooldown instead of blocking on timeout.
After cooldown, one probe request is allowed (half-open state). A
successful probe fully resets the breaker; a failed probe re-opens it.

This prevents dead or temporarily overloaded relays from blocking the
entire epoch's relay query via WaitGroup.Wait(), which was causing
relay data loss during historical catchup.

Also refactor RelayClient to store the address string directly instead
of calling client.Address() at log time, and use pointer receivers for
mutable circuit breaker state.

Fixes #247
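
The breaker states described in this commit (closed, open, half-open) can be sketched as below. This is an illustration under stated assumptions, not the RelayClient code: the `breaker` type and its `Allow` and `Record` methods are hypothetical, the clock is passed in explicitly to make the behaviour easy to follow, and the sketch is not safe for concurrent use (a real client would guard the fields with a mutex).

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxFailures = 3               // consecutive failures before the breaker opens
	cooldown    = 2 * time.Minute // how long an open breaker skips the relay
)

// breaker is a hypothetical single-goroutine circuit breaker.
type breaker struct {
	failures  int       // consecutive failures seen so far
	openUntil time.Time // end of the current cooldown
	probing   bool      // true while the single half-open probe is in flight
}

// Allow reports whether a request may be sent at time now.
func (b *breaker) Allow(now time.Time) bool {
	if b.failures < maxFailures {
		return true // closed: requests flow normally
	}
	if now.Before(b.openUntil) {
		return false // open: still cooling down, skip the relay
	}
	if b.probing {
		return false // half-open: a probe is already in flight
	}
	b.probing = true // half-open: let exactly one probe through
	return true
}

// Record stores the outcome of a request that Allow let through.
func (b *breaker) Record(ok bool, now time.Time) {
	b.probing = false
	if ok {
		b.failures = 0 // success fully resets the breaker
		return
	}
	b.failures++
	if b.failures >= maxFailures {
		b.openUntil = now.Add(cooldown) // open, or re-open after a failed probe
	}
}

func main() {
	b := &breaker{}
	start := time.Unix(0, 0)
	for i := 0; i < maxFailures; i++ {
		b.Record(false, start)
	}
	fmt.Println("allowed during cooldown:", b.Allow(start.Add(time.Minute)))
	fmt.Println("probe allowed after cooldown:", b.Allow(start.Add(cooldown)))
}
```

Skipping an open relay returns immediately instead of waiting out the request timeout, which is what keeps one bad relay from stalling the whole epoch query.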

@Zyra-V21 changed the title from "fix: prevent Wait() deadlocks in AdvanceFinalized and historical-to-head transition" to "fix: prevent Wait() deadlocks, remove dead relays, add relay circuit breaker" on Mar 24, 2026

Add processerBook.WaitUntilInactive barrier in processBlockRewards
to ensure ProcessBlock has finished appending transactions before
BlockGasFees() reads them. Same synchronization pattern used in
reorg.go for the ManualReward fix (#242).

Fixes #249
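
The barrier idea can be illustrated with a small condition-variable sketch. The `book` type and its `Acquire` and `Release` methods are assumptions made for this example; only the name WaitUntilInactive comes from the commit message, and its real signature in processerBook may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// book is a hypothetical stand-in for processerBook: it tracks which keys
// (for example slots) are still being processed.
type book struct {
	mu     sync.Mutex
	cond   *sync.Cond
	active map[string]bool
}

func newBook() *book {
	b := &book{active: make(map[string]bool)}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Acquire marks key as being processed.
func (b *book) Acquire(key string) {
	b.mu.Lock()
	b.active[key] = true
	b.mu.Unlock()
}

// Release marks key as finished and wakes every waiter.
func (b *book) Release(key string) {
	b.mu.Lock()
	delete(b.active, key)
	b.mu.Unlock()
	b.cond.Broadcast()
}

// WaitUntilInactive blocks until key is no longer being processed.
// It returns immediately for a key that was never acquired.
func (b *book) WaitUntilInactive(key string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.active[key] {
		b.cond.Wait()
	}
}

func main() {
	b := newBook()
	var txs []int

	// Acquire before starting the writer so the reader cannot race ahead.
	b.Acquire("slot-100")
	go func() {
		defer b.Release("slot-100")
		txs = append(txs, 1, 2, 3) // stands in for ProcessBlock appending transactions
	}()

	b.WaitUntilInactive("slot-100") // barrier: the writer has finished
	fmt.Println("transactions visible to reader:", len(txs))
}
```

Because the writer releases the key under the same mutex the reader waits on, the appended transactions are guaranteed to be visible once the barrier returns.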
…250)

processEpochMetrics writes for epoch N-1 while processPoolMetrics
writes for epoch N-2. With only 3 epochs of lookback on restart,
the epoch_metrics write for the boundary epoch is skipped, leaving
epochs with pool_summary data but no epoch_metrics.

Fixes #250

@leobago (Member) commented Mar 27, 2026

Please remove the relay part of this PR, as we now know this was not the case.

Reverts the relay removal from ac21d57 — these relays were timing out
but are still operational. The circuit breaker added in the same commit
will handle transient failures gracefully. Restores:
- Titan Global (global.titanrelay.xyz)
- Titan Regional (regional.titanrelay.xyz)
- Manifold/SecureRPC (mainnet-relay.securerpc.com)
- Wenmerge (relay.wenmerge.com)

@Zyra-V21 (Collaborator, Author) commented

> Please remove the relay part of this PR, as we now know this was not the case.

Done, relays have been restored

@leobago (Member) left a comment

LGTM

@leobago merged commit 3b64392 into master on Apr 7, 2026