feat(l1): weighted peer selection and better tolerance in header downloads #6428
Conversation
🤖 Kimi Code Review

Critical Issues
1. Log message inconsistency in …
2. Error handling in …

Medium Issues
3. Potential integer underflow in weight calculation
4. Unnecessary clone of …

Minor Issues
5. Aggressive timeout reduction

Positive Notes

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Codex Code Review

Assumption: this review is from static inspection of the diff and touched code. I couldn't run …

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary

This PR improves full sync stability by addressing three interconnected issues with peer selection and failure tracking, plus reduces timeout latency.

Confidence Score: 5/5

Safe to merge; all changes are targeted reliability improvements with no correctness issues found. The logic for weighted selection, consecutive-failure reset, and header scoring additions is all sound. The only finding is a stale log string in snap_sync.rs ("retrying in 5s" vs actual 2s sleep), which is cosmetic and does not affect runtime behavior.

crates/networking/p2p/sync/snap_sync.rs — stale "retrying in 5s" log message (cosmetic only)
| Filename | Overview |
|---|---|
| crates/networking/p2p/peer_table.rs | Replaces uniform-random peer selection with weighted selection using WeightedIndex; score is mapped from [-150, 50] to weights [1, 201], ensuring no peer weight reaches 0 and the distribution is always valid (see the sketch after this table). |
| crates/networking/p2p/peer_handler.rs | Adds record_success and record_failure calls on the peer table after header download results — fixes the previously noted discrepancy between logs and actual score updates. |
| crates/networking/p2p/sync/full.rs | Resets attempts counter to 0 on each successful header fetch so the limit now tracks only consecutive failures; retry sleep correctly reduced to 2s and log updated to match. |
| crates/networking/p2p/sync/snap_sync.rs | Same consecutive-failure reset applied as in full.rs; retry sleep reduced to 2s but the warning log still says "retrying in 5s" — minor inconsistency. |
| crates/networking/p2p/snap/constants.rs | PEER_REPLY_TIMEOUT reduced from 15s to 5s and MAX_HEADER_FETCH_ATTEMPTS raised from 5 to 10 to compensate for the new consecutive-reset semantics; comment updated accordingly. |
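For intuition, here is a minimal, self-contained sketch of the score-to-weight mapping the peer_table.rs row describes, using rand's WeightedIndex (rand 0.8 paths). It assumes a simplified `(peer, score)` slice and the [-150, 50] bounds named above; it is not ethrex's actual code.

```rust
use rand::distributions::WeightedIndex;
use rand::prelude::*;

// Assumed score bounds, taken from the [-150, 50] range described above.
const MIN_SCORE: i64 = -150;
const MAX_SCORE: i64 = 50;

/// Pick one peer at random, biased toward higher scores.
fn pick_weighted<T>(peers: &[(T, i64)]) -> Option<&T> {
    if peers.is_empty() {
        return None;
    }
    // Map score in [-150, 50] to weight in [1, 201]. Clamping first keeps
    // every weight >= 1, so no peer is unselectable and WeightedIndex::new
    // cannot fail on an all-zero weight set.
    let weights: Vec<u64> = peers
        .iter()
        .map(|(_, score)| ((*score).clamp(MIN_SCORE, MAX_SCORE) - MIN_SCORE + 1) as u64)
        .collect();
    let dist = WeightedIndex::new(&weights).ok()?;
    Some(&peers[dist.sample(&mut rand::thread_rng())].0)
}
```

A peer at the score floor (-150) gets weight 1 rather than 0, so it is still occasionally tried and can recover its score, while a peer at +50 is 201 times more likely to be picked.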
Sequence Diagram

```mermaid
sequenceDiagram
    participant Sync as sync_cycle_full / sync_cycle_snap
    participant PH as PeerHandler
    participant PT as PeerTable
    Sync->>PT: get_random_peer(capabilities)
    Note over PT: WeightedIndex by score<br/>maps [-150,50] → weight [1,201]
    PT-->>Sync: (peer_id, connection)
    Sync->>PH: request_block_headers(peer_id)
    PH->>PT: make_request (timeout: 5s)
    alt Headers valid & chained
        PT-->>PH: block_headers
        PH->>PT: record_success(peer_id) [score +1]
        PH-->>Sync: Some(block_headers)
        Note over Sync: attempts = 0 (reset)
    else Empty / unchained
        PH->>PT: record_failure(peer_id) [score -1]
        PH-->>Sync: None
        Note over Sync: attempts += 1
    else Timeout
        PH->>PT: record_failure(peer_id) [score -1]
        PH-->>Sync: None
        Note over Sync: attempts += 1
    end
    alt attempts > MAX_HEADER_FETCH_ATTEMPTS (10)
        Sync->>Sync: abort, wait for newer CL sync head
    else
        Sync->>Sync: sleep 2s, retry
    end
```
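To spell out the retry loop the diagram describes, here is a hedged sketch of the consecutive-failure semantics. `fetch_batch` is a hypothetical stand-in for the peer request, and the constant value follows the PR; this is not ethrex's actual sync code.

```rust
use std::future::Future;
use std::time::Duration;

const MAX_HEADER_FETCH_ATTEMPTS: u64 = 10;

/// Drives header fetching until caught up or the peer set fails too many
/// times in a row. `fetch_batch` stands in for `request_block_headers`;
/// here, `Some(vec![])` signals "caught up".
async fn download_headers<F, Fut>(mut fetch_batch: F) -> Result<(), &'static str>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Option<Vec<u8>>>,
{
    let mut attempts: u64 = 0;
    loop {
        match fetch_batch().await {
            Some(batch) if batch.is_empty() => return Ok(()), // caught up
            Some(_headers) => {
                // Success resets the counter, so the limit now bounds
                // *consecutive* failures instead of total failures.
                attempts = 0;
                // ... store headers and advance to the next batch ...
            }
            None => {
                attempts += 1;
                if attempts > MAX_HEADER_FETCH_ATTEMPTS {
                    return Err("aborting; wait for a newer CL sync head");
                }
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
        }
    }
}
```

Under the old cumulative policy, a long download would eventually hit 5 total failures even from healthy peers; with the reset, only a genuinely unresponsive stretch of 10 failures in a row aborts the cycle.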
```diff
@@ -145,9 +145,11 @@ pub async fn sync_cycle_snap(
     warn!(
         "Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"
```

**Stale log message after sleep reduction**

The sleep duration was updated to 2s, but the log message still claims "retrying in 5s". The equivalent line in `sync/full.rs` was correctly updated to "2s". This will mislead operators reading logs.

```suggestion
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 2s"
```
🤖 Claude Code Review

Now I have enough context to write a thorough review.

PR Review: feat(l1): weighted peer selection and better tolerance in header downloads

Overall: The motivation is sound and the changes are well-scoped. A few correctness issues need attention before merge.

Critical: Potential integer overflow in weighted peer selection

```rust
let weights: Vec<u64> = peers
    .iter()
    .map(|(_, _, score)| (score - MIN_SCORE_CRITICAL + 1) as u64)
    .collect();
```

The comment states the range is …

```rust
.map(|(_, _, score)| (score.max(MIN_SCORE_CRITICAL) - MIN_SCORE_CRITICAL + 1) as u64)
```

This is worth verifying: does …

Minor: Stale log message in …
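A quick illustration of the cast hazard this review flags, assuming an i64 score and MIN_SCORE_CRITICAL = -150 (whether scores can actually fall below that floor is exactly the open question above):

```rust
fn main() {
    const MIN_SCORE_CRITICAL: i64 = -150;
    for score in [-150i64, -151, -152] {
        // `as u64` on a negative i64 wraps instead of saturating.
        let weight = (score - MIN_SCORE_CRITICAL + 1) as u64;
        println!("score {score} -> weight {weight}");
        // -150 -> 1
        // -151 -> 0                     (zero weight: peer unselectable)
        // -152 -> 18446744073709551615  (wraps: peer dwarfs all others)
    }
    // The clamped form proposed above pins out-of-range scores to weight 1:
    let score: i64 = -152;
    assert_eq!((score.max(MIN_SCORE_CRITICAL) - MIN_SCORE_CRITICAL + 1) as u64, 1);
}
```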
Brings in main commits since the prior merge:

- #6516 EIP-8025 compliance (Electra-aligned ExecutionRequests typed container in NewPayloadRequest, MAX_CONSOLIDATION_REQUESTS_PER_PAYLOAD corrected from 1 to 2, to_encoded_requests() helper for EIP-7685 bytes, removal of ExecutionPayloadHeader/NewPayloadRequestHeader, new byte-oriented execution_program entrypoint that decodes the wire format internally and returns valid: false instead of erroring on post-decode failures)
- #6463 BAL withdrawal reverse check (DB->BAL direction so a malicious builder can't omit a withdrawal recipient from the BAL)
- #6505 Kademlia k-bucket revert (PeerTableServer::spawn no longer takes a node_id)
- plus snap-sync observability + dashboards (#6470), pivot-update crash fix (#6475), weighted peer selection (#6428), txpool_contentFrom/txpool_inspect RPC (#6446), block-by-block exec fallback (#6464), Amsterdam EELS branch pin (#6495), and rollup store SQLite v9->v10 migration (#6514)

Conflict resolutions:

- crates/common/types/stateless_ssz.rs: this branch had already moved the EIP-8025 SSZ types out of crates/common/types/eip8025_ssz.rs into stateless_ssz.rs and tucked the native-rollup containers below them. Kept that layout, applied #6516's content updates to the EIP-8025 section (renamed spec-limit constants, ExecutionRequests typed container with to_encoded_requests, dropped header types and their tests), pulled in the EncodedRequests import, and kept both the new test_execution_requests_to_encoded_bytes and the branch's stateless round-trip tests.
- crates/guest-program/src/l1/program.rs: adopted #6516's new execution_program(bytes: &[u8], crypto) API with the internal decode_eip8025 call, the validate_eip8025_execution helper, and the decode-failure test. Rewrote all `eip-8025` feature gates as `experimental-devnet` and all `eip8025_ssz::` paths as `stateless_ssz::` to match this branch's renames.
- crates/guest-program/bin/{sp1,risc0,zisk,openvm}/src/main.rs: applied #6516's simplification (drop decode_eip8025 import, pass &input straight to execution_program) under the experimental-devnet feature gate. Also flipped the rkyv::rancor::Error import gate from the old `eip-8025` name to `experimental-devnet` so the non-devnet build still has the import it needs (see the sketch below).
- crates/prover/src/backend/exec.rs: kept #6516's updated comment ("raw input bytes" instead of "(NewPayloadRequest, ExecutionWitness)") under the experimental-devnet feature gate.

Auto-merged regions checked: crates/vm/backends/levm/mod.rs picked up all of #6463's Part B (DB->BAL) reverse check intact, and cmd/ethrex/l2/initializers.rs picked up #6505's PeerTableServer::spawn signature change.

Verified cargo fmt --all clean, cargo check --workspace clean, cargo check --workspace --tests clean, and cargo check -p ethrex-guest-program --features experimental-devnet --tests clean.
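As a small illustration of the gate rename mentioned above (hedged: the exact item under the gate is assumed from the description, not copied from the diff):

```rust
// Before the rename (old branch's feature name):
#[cfg(feature = "eip-8025")]
use rkyv::rancor::Error;

// After, matching this branch's feature name:
#[cfg(feature = "experimental-devnet")]
use rkyv::rancor::Error;
```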
…loads (#6428)

Full sync was very unstable because of peer selection and intolerance issues.

- Peers were previously selected randomly. We now weight peer selection to preserve randomness while preferring nodes with better scores.
- Peer score was not modified on header download success or failure, despite the logs saying so. Added both failure and success score updates. Score is only decreased on a non-empty but unchained response, which is invalid; empty responses are legitimate and are not penalized (see the chaining sketch at the end of this section).
- The failure policy was cumulative: instead of requiring a peer to fail 5 consecutive times (which would indicate unresponsiveness), we counted 5 total failures regardless of successes in between. Success now resets the failure count. This is also fixed for snap sync.
- Made failure detection faster (5-second timeout instead of 15).
Motivation
Full sync was very unstable because of peer selection and intolerance issues.
Changes
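To make the unchained-response rule above concrete, here is a minimal sketch of the kind of check it implies. The `BlockHeader` fields are simplified stand-ins, not ethrex's real type:

```rust
/// Simplified header: only the fields the chaining check needs.
struct BlockHeader {
    hash: [u8; 32],
    parent_hash: [u8; 32],
}

/// A response "chains" when each header's parent_hash is the previous
/// header's hash.
fn are_headers_chained(headers: &[BlockHeader]) -> bool {
    headers.windows(2).all(|w| w[1].parent_hash == w[0].hash)
}

/// How the scoring rules above would classify a response.
fn classify(headers: &[BlockHeader]) -> &'static str {
    if headers.is_empty() {
        "empty: legitimate, no penalty"
    } else if are_headers_chained(headers) {
        "chained: record_success"
    } else {
        "unchained: record_failure"
    }
}
```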