…AdvanceFinalized ProcessStateTransitionMetrics(E) calls StateHistory.Wait for epochs E, E-1 and E-2. After a reorg, AdvanceFinalized may need to reprocess an epoch whose dependency states were already evicted by a prior CleanUpTo call. StateHistory.Wait then blocks forever because nobody re-downloads the missing state. Add ensureDependencyStates() that checks Available() for all three dependency epochs and re-downloads any missing states (and their blocks) before calling ProcessStateTransitionMetrics. Fixes #245
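The check described in this commit message could look roughly like the sketch below. It is not the code from the PR: the `stateStore` interface, the `mapStore` type, and the `downloader` callback are hypothetical stand-ins for StateHistory and the download routine, and only the `Available()` check over epochs E, E-1 and E-2 is taken from the description.

```go
package main

import "fmt"

// stateStore is a hypothetical stand-in for the part of StateHistory used here.
type stateStore interface {
	Available(epoch uint64) bool
}

// mapStore is a toy in-memory store used only for illustration.
type mapStore map[uint64]bool

func (m mapStore) Available(epoch uint64) bool { return m[epoch] }

// downloader is a hypothetical callback that re-fetches the state
// (and its blocks) for a single epoch.
type downloader func(epoch uint64) error

// ensureDependencyStates checks epochs E, E-1 and E-2 and re-downloads
// every state that is no longer available. It returns the epochs it
// had to fetch, so the caller can log them.
func ensureDependencyStates(store stateStore, fetch downloader, epoch uint64) ([]uint64, error) {
	var fetched []uint64
	for back := uint64(0); back <= 2; back++ {
		if back > epoch {
			break // avoid unsigned underflow close to genesis
		}
		dep := epoch - back
		if store.Available(dep) {
			continue
		}
		if err := fetch(dep); err != nil {
			return fetched, fmt.Errorf("re-download state for epoch %d: %w", dep, err)
		}
		fetched = append(fetched, dep)
	}
	return fetched, nil
}
```

Called before ProcessStateTransitionMetrics, a helper of this shape guarantees that a later wait on any of the three epochs has something to wait for, which is the property the fix relies on.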
leobago
approved these changes
Mar 23, 2026
Member
leobago
left a comment
LGTM
But let's first test it on eth-arch before merging
…d mode When transitioning from historical to head mode, runHead() only waited for the last slot (headSlot) to be downloaded. With parallel download workers, blocks are fetched out of order — the last slot can complete while intermediate slots are still in-flight. The switch to head mode then discards those pending downloads, causing BlockHistory.Wait() to deadlock forever in the processer. Now wait for every slot in the historical range [initSlot, headSlot] before switching to head mode. Fixes #248
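As an illustration of the full-range wait described in this commit message, here is a minimal sketch. The `slotTracker` type is hypothetical and only mimics a blocking `Wait` per slot; it is not goteth's BlockHistory. The point is that waiting on every slot in [initSlot, headSlot] is safe even when workers finish out of order.

```go
package main

import "sync"

// slotTracker is a hypothetical stand-in for a block history that
// lets callers block until a given slot has been downloaded.
type slotTracker struct {
	mu   sync.Mutex
	cond *sync.Cond
	done map[uint64]bool
}

func newSlotTracker() *slotTracker {
	t := &slotTracker{done: make(map[uint64]bool)}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// MarkDone records a finished download and wakes all waiters.
func (t *slotTracker) MarkDone(slot uint64) {
	t.mu.Lock()
	t.done[slot] = true
	t.mu.Unlock()
	t.cond.Broadcast()
}

// Wait blocks until the given slot has been marked done.
func (t *slotTracker) Wait(slot uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for !t.done[slot] {
		t.cond.Wait()
	}
}

// waitForRange blocks until every slot in [initSlot, headSlot] is done,
// instead of waiting for headSlot alone.
func waitForRange(t *slotTracker, initSlot, headSlot uint64) {
	for slot := initSlot; slot <= headSlot; slot++ {
		t.Wait(slot)
	}
}
```

Waiting slot by slot costs nothing extra when downloads are already complete, since each `Wait` returns immediately for a finished slot.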
…lays Remove 4 permanently unreachable mainnet relays that each wasted 10s of timeout per epoch query:
- mainnet-relay.securerpc.com (Manifold): DNS no such host
- relay.wenmerge.com: HTTP 403 Cloudflare block
- global.titanrelay.xyz: DNS timeout / HTTP 408
- regional.titanrelay.xyz: DNS timeout / HTTP 408

Add a circuit breaker to RelayClient: after 3 consecutive failures, the relay is skipped for a 2-minute cooldown instead of blocking on timeout. After cooldown, one probe request is allowed (half-open state). A successful probe fully resets the breaker; a failed probe re-opens it. This prevents dead or temporarily overloaded relays from blocking the entire epoch's relay query via WaitGroup.Wait(), which was causing relay data loss during historical catchup.

Also refactor RelayClient to store the address string directly instead of calling client.Address() at log time, and use pointer receivers for mutable circuit breaker state.

Fixes #247
Member
Please remove the relay part of this PR, as we now know this was not the case.
Reverts the relay removal from ac21d57. These relays were timing out but are still operational. The circuit breaker added in the same commit will handle transient failures gracefully.

Restores:
- Titan Global (global.titanrelay.xyz)
- Titan Regional (regional.titanrelay.xyz)
- Manifold/SecureRPC (mainnet-relay.securerpc.com)
- Wenmerge (relay.wenmerge.com)
Collaborator
Author
Done, relays have been restored.
Summary
Three fixes for goteth stability and relay data quality:

fix #245: StateHistory.Wait deadlock after reorg + AdvanceFinalized
- Add ensureDependencyStates(), which re-downloads missing states (E, E-1, E-2) before ProcessStateTransitionMetrics in AdvanceFinalized. States evicted by a prior CleanUpTo call caused StateHistory.Wait() to block forever.

fix #248: Block downloads lost during historical-to-head transition
- Wait for ALL blocks in the historical range before switching to head mode. With parallel workers, blocks download out of order: the last slot can complete while intermediate slots are still in-flight, causing a BlockHistory.Wait() deadlock.

fix #247: Dead relays removed + circuit breaker
- Circuit breaker in RelayClient: after 3 consecutive failures, the relay is skipped for a 2-minute cooldown. After cooldown, one probe is allowed (half-open). A successful probe resets the breaker; a failed probe re-opens it.
- Refactor RelayClient to use pointer receivers and store the address directly.

Changes
- pkg/analyzer/reorg.go: ensureDependencyStates() helper
- pkg/analyzer/routines.go: full-range wait before head mode switch
- pkg/relay/constants.go: remove 4 dead relays
- pkg/relay/relay_bid_trace.go: circuit breaker, pointer receivers

Test plan
- go build passes

Fixes #245, Fixes #247, Fixes #248