
feat(l1): weighted peer selection and better tolerance in header downloads #6428

Merged
Arkenan merged 6 commits into main from fullsync-improvement on Apr 21, 2026

Conversation

@Arkenan
Collaborator

@Arkenan Arkenan commented Mar 31, 2026

Motivation

Full sync was very unstable because of peer selection and intolerance issues.

Changes

  • Peers were previously selected randomly. We now weight peer selection to conserve randomness while preferring nodes with better scores (see the sketch after this list).
  • Peer scores were not being updated on header download success or failure, despite the logs claiming otherwise. Both success and failure scoring updates are now in place. The score is only decreased for a non-empty but unchained response, which is invalid; empty responses are legitimate and are not penalized.
  • The failure policy was cumulative. That is, instead of requiring a peer to fail 5 consecutive times (which would indicate unresponsiveness), we counted a lifetime total of 5 failures regardless of any successes in between. We changed this so success resets the failure count. The same fix is applied to snap sync.
  • Made failure detection faster (5-second reply timeout instead of 15).
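
As a rough illustration of the weighted selection described above, here is a minimal sketch; the function, types, and constant value are illustrative stand-ins, not the actual ethrex peer-table code:

    use rand::distributions::{Distribution, WeightedIndex};
    use rand::rngs::OsRng;

    // Illustrative minimum score; the real constant lives in the peer table.
    const MIN_SCORE_CRITICAL: i64 = -150;

    // Map scores into strictly positive weights so every peer stays selectable,
    // then sample with OsRng so the choice stays random but biased toward
    // better-scored peers.
    fn pick_weighted_peer(peers: &[(u64, i64)]) -> Option<u64> {
        let weights: Vec<u64> = peers
            .iter()
            .map(|&(_, score)| (score.max(MIN_SCORE_CRITICAL) - MIN_SCORE_CRITICAL + 1) as u64)
            .collect();
        let dist = WeightedIndex::new(&weights).ok()?;
        let idx = dist.sample(&mut OsRng);
        Some(peers[idx].0)
    }

The minimum weight of 1 is what keeps even the worst-scored peer selectable (just unlikely), so the node never cuts itself off from the network if every peer temporarily misbehaves.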

@Arkenan Arkenan requested a review from a team as a code owner March 31, 2026 10:01
@Arkenan Arkenan changed the title Fullsync improvement feat(l1): fullsync peer selection and header download improvments Mar 31, 2026
@github-actions github-actions Bot added the L1 Ethereum client label Mar 31, 2026
@Arkenan Arkenan changed the title feat(l1): fullsync peer selection and header download improvments feat(l1): weighted peer selection and better tolerance in header downloads Mar 31, 2026
@github-actions

🤖 Kimi Code Review

Critical Issues

1. Log message inconsistency in snap_sync.rs

  • File: crates/networking/p2p/sync/snap_sync.rs
  • Lines: 146-148
  • The warning message states "retrying in 5s" but the actual sleep duration was changed to 2 seconds. This mismatch will mislead operators during debugging.
  • Fix: Update the log message to say "retrying in 2s" to match the Duration::from_secs(2) on line 148.

2. Error handling in peer_handler.rs may abort sync unnecessarily

  • File: crates/networking/p2p/peer_handler.rs
  • Lines: 443, 453, 457
  • Using .await? on record_success and record_failure means that if the peer table operation fails (e.g., lock contention, channel closed), the entire header fetch operation returns an error, even when headers were successfully received.
  • Recommendation: Log peer scoring failures but don't fail the sync operation. Consider:
    if let Err(e) = self.peer_table.record_success(&peer_id).await {
        warn!("Failed to record peer success: {}", e);
    }

Medium Issues

3. Potential integer underflow in weight calculation

  • File: crates/networking/p2p/peer_table.rs
  • Line: 1085
  • The cast (score - MIN_SCORE_CRITICAL + 1) as u64 assumes score >= MIN_SCORE_CRITICAL. If a peer's score drops below this threshold (e.g., due to bugs or edge cases), the subtraction yields a negative i64, which when cast to u64 becomes a very large positive number (near u64::MAX), dominating the weighted distribution.
  • Recommendation: Use saturating_sub or clamp the score before calculation:
    .map(|(_, _, score)| score.saturating_sub(MIN_SCORE_CRITICAL).saturating_add(1).max(1) as u64)

4. Unnecessary clone of PeerConnection

  • File: crates/networking/p2p/peer_table.rs
  • Lines: 1075, 1091
  • The connection is cloned twice: once when building the peers vector (line 1075) and again when returning (line 1091).
  • Recommendation: Store references in the temporary vector to avoid the double clone, or use swap_remove if order doesn't matter (a minimal sketch follows).
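
A minimal sketch of that recommendation, assuming the weighted index has already been sampled; Conn is a stand-in for the real PeerConnection type:

    // Select by index first, then clone only the winning entry.
    fn take_selected<Conn: Clone>(entries: &[(u64, Conn, i64)], chosen: usize) -> Option<(u64, Conn)> {
        let entry = entries.get(chosen)?;
        // A single clone of the chosen connection, instead of cloning every
        // candidate while building the temporary vector and again on return.
        Some((entry.0, entry.1.clone()))
    }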

Minor Issues

5. Aggressive timeout reduction

  • File: crates/networking/p2p/snap/constants.rs
  • Line: 66
  • Reducing PEER_REPLY_TIMEOUT from 15s to 5s may cause excessive timeouts on high-latency networks or during temporary congestion. Ensure this is acceptable for the target network conditions.

Positive Notes

  • The weighted peer selection using WeightedIndex correctly uses OsRng for cryptographic randomness.
  • Resetting attempts = 0 on success (lines 78 in full.rs and 151 in snap_sync.rs) correctly implements the "consecutive failures" semantics described in the constant documentation.
  • The score mapping logic (shifting by MIN_SCORE_CRITICAL) ensures bad peers remain selectable but unlikely, preventing total network isolation if all peers temporarily misbehave.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

  1. PEER_REPLY_TIMEOUT is advertised as a snap-only tuning, but this constant is re-exported and used by the generic ETH request path too. Changing it from 15s to 5s in snap/constants.rs:66 also shortens the timeout for GetBlockHeaders / GetBlockBodies in peer_handler.rs:29, peer_handler.rs:71, peer_handler.rs:431, and peer_handler.rs:520. That broadens the behavior change well beyond snap sync and can turn slow-but-valid peers into false timeouts, especially for large body/header responses, which then feeds the new scoring/selection logic. This should either stay scoped to snap requests or use separate timeouts for ETH header/body paths.

  2. request_block_headers_from_hash now penalizes an empty header response the same way as invalid data in peer_handler.rs:440 and peer_handler.rs:449. For full sync, though, an empty response is already treated by the caller as a normal “this peer doesn’t know that head yet” retry case in full.rs:59 and full.rs:63. With the new global peer scoring, this will downgrade honest but slightly lagging peers during ordinary head churn/reorg conditions, and that score then affects unrelated future peer selection. I’d only penalize malformed/non-chained data here, not block_headers.is_empty() (a minimal sketch of this distinction follows the list).

  3. The new “consecutive failures” semantics are off by one: both loops abort only when attempts > MAX_HEADER_FETCH_ATTEMPTS, so with MAX_HEADER_FETCH_ATTEMPTS = 10 they actually allow 11 consecutive failures in full.rs:63 and snap_sync.rs:138. In the snap path there is also a stale log message saying “retrying in 5s” while the code sleeps 2s in snap_sync.rs:146 and snap_sync.rs:148.
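
A minimal sketch of the distinction point 2 asks for; the PeerScore type and the ±1 deltas are illustrative placeholders, not the actual ethrex scoring values:

    // Sketch only: `PeerScore` stands in for the real peer-table entry.
    struct PeerScore(i64);

    fn score_header_response(score: &mut PeerScore, header_count: usize, chained: bool) {
        if header_count == 0 {
            // Empty responses are legitimate ("this peer doesn't know that head yet"):
            // no penalty, so honest but lagging peers are not downgraded.
            return;
        }
        if chained {
            score.0 += 1; // valid, chained headers: reward the peer
        } else {
            score.0 -= 1; // non-empty but unchained data is invalid: penalize
        }
    }

In the real code these updates go through record_success / record_failure on the peer table, and, as the Kimi review notes above, bookkeeping errors there are better logged than propagated.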

Assumption: this review is from static inspection of the diff and touched code. I couldn’t run cargo test in the sandbox because rustup attempted to write to a read-only location outside the workspace.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented Mar 31, 2026

Greptile Summary

This PR improves full sync stability by addressing three interconnected issues with peer selection and failure tracking, and it reduces timeout and retry latency.

  • Weighted peer selection (peer_table.rs): get_random_peer now uses WeightedIndex to bias selection toward higher-scored peers. Scores are mapped from [-150, 50] to weights [1, 201], so even the worst peer remains eligible but is ~200× less likely to be chosen than the best peer.
  • Header download scoring (peer_handler.rs): record_success and record_failure calls were missing after header fetches despite the log messages claiming otherwise. Both are now correctly wired up.
  • Consecutive failure counter (sync/full.rs, sync/snap_sync.rs): The attempts counter is reset to 0 on each successful header response, so MAX_HEADER_FETCH_ATTEMPTS (now 10) truly reflects consecutive failures rather than a lifetime total.
  • Timeout / latency (snap/constants.rs): PEER_REPLY_TIMEOUT drops from 15s → 5s and the inter-attempt retry sleep drops from 5s → 2s, making stalled syncs recover much faster.

Confidence Score: 5/5

Safe to merge; all changes are targeted reliability improvements with no correctness issues found.

The logic for weighted selection, consecutive-failure reset, and header scoring additions is all sound. The only finding is a stale log string in snap_sync.rs ("retrying in 5s" vs actual 2s sleep), which is cosmetic and does not affect runtime behavior.

crates/networking/p2p/sync/snap_sync.rs — stale "retrying in 5s" log message (cosmetic only)

Important Files Changed

Filename and overview:

crates/networking/p2p/peer_table.rs: Replaces uniform-random peer selection with weighted selection using WeightedIndex; the score is mapped from [-150, 50] to weights [1, 201], ensuring no peer weight reaches 0 and the distribution is always valid.
crates/networking/p2p/peer_handler.rs: Adds record_success and record_failure calls on the peer table after header download results, fixing the previously noted discrepancy between logs and actual score updates.
crates/networking/p2p/sync/full.rs: Resets the attempts counter to 0 on each successful header fetch so the limit now tracks only consecutive failures; the retry sleep is correctly reduced to 2s and the log updated to match.
crates/networking/p2p/sync/snap_sync.rs: Applies the same consecutive-failure reset as full.rs; the retry sleep is reduced to 2s but the warning log still says "retrying in 5s" (minor inconsistency).
crates/networking/p2p/snap/constants.rs: PEER_REPLY_TIMEOUT reduced from 15s to 5s and MAX_HEADER_FETCH_ATTEMPTS raised from 5 to 10 to compensate for the new consecutive-reset semantics; comment updated accordingly.

Sequence Diagram

sequenceDiagram
    participant Sync as sync_cycle_full / sync_cycle_snap
    participant PH as PeerHandler
    participant PT as PeerTable

    Sync->>PT: get_random_peer(capabilities)
    Note over PT: WeightedIndex by score<br/>maps [-150,50] → weight [1,201]
    PT-->>Sync: (peer_id, connection)

    Sync->>PH: request_block_headers(peer_id)
    PH->>PT: make_request (timeout: 5s)
    alt Headers valid & chained
        PT-->>PH: block_headers
        PH->>PT: record_success(peer_id)  [score +1]
        PH-->>Sync: Some(block_headers)
        Note over Sync: attempts = 0 (reset)
    else Empty / unchained
        PH->>PT: record_failure(peer_id)  [score -1]
        PH-->>Sync: None
        Note over Sync: attempts += 1
    else Timeout
        PH->>PT: record_failure(peer_id)  [score -1]
        PH-->>Sync: None
        Note over Sync: attempts += 1
    end

    alt attempts > MAX_HEADER_FETCH_ATTEMPTS (10)
        Sync->>Sync: abort, wait for newer CL sync head
    else
        Sync->>Sync: sleep 2s, retry
    end

@github-actions

github-actions Bot commented Mar 31, 2026

Lines of code report

Total lines added: 19
Total lines removed: 0
Total lines changed: 19

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 596   | +6   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_table.rs     | 1067  | +11  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs      | 249   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1149  | +1   |
+------------------------------------------------+-------+------+

Comment thread crates/networking/p2p/sync/snap_sync.rs Outdated
@@ -145,9 +145,11 @@ pub async fn sync_cycle_snap(
warn!(
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"


P2 Stale log message after sleep reduction

The sleep duration was updated to 2s, but the log message still claims "retrying in 5s". The equivalent line in sync/full.rs was correctly updated to "2s". This will mislead operators reading logs.

Suggested change
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 2s"

@github-actions

🤖 Claude Code Review

Now I have enough context to write a thorough review.


PR Review: feat(l1): weighted peer selection and better tolerance in header downloads

Overall: The motivation is sound and the changes are well-scoped. A few correctness issues need attention before merge.


Critical: Potential integer overflow in weighted peer selection

peer_table.rs, lines 1085–1088

let weights: Vec<u64> = peers
    .iter()
    .map(|(_, _, score)| (score - MIN_SCORE_CRITICAL + 1) as u64)
    .collect();

MIN_SCORE_CRITICAL = MIN_SCORE * 3 = -150. The formula is score + 151, cast to u64. If a peer's score drops below -151, the i64 value becomes negative, and the as u64 cast wraps to a near-u64::MAX value — that peer would then receive an astronomically high selection weight, completely defeating the purpose of the scoring system.

The comment states the range is [-150, 50], but there's no clamping enforcing this at the call site. Defensively, you should clamp before casting:

.map(|(_, _, score)| (score.max(MIN_SCORE_CRITICAL) - MIN_SCORE_CRITICAL + 1) as u64)

This is worth verifying: does record_failure clamp the score, or are peers whose score drops below MIN_SCORE_CRITICAL disconnected before reaching this point? If it's reliably enforced upstream the risk is lower, but since this is consensus-adjacent networking code, a one-line guard is cheap insurance.


Minor: Stale log message in snap_sync.rs

snap_sync.rs, line 146 (post-diff context, line ~146 in the file)

"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"

The sleep was changed to 2s (Duration::from_secs(2)) but the log message still says "retrying in 5s". The equivalent message in full.rs was correctly updated to "retrying in 2s".


Correctness: attempts increment/check ordering

full.rs and snap_sync.rs

The current structure is:

if attempts > MAX_HEADER_FETCH_ATTEMPTS {
    // abort
}
attempts += 1;

With attempts = 0 on success and MAX_HEADER_FETCH_ATTEMPTS = 10, this means up to 11 failures are tolerated before aborting (the check fires when attempts > 10, so the 11th increment takes it to 11 and the 12th iteration hits the guard). This is a pre-existing pattern, not introduced here, but worth noting. Consider >= instead of > if you want exactly 10 attempts.
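
A tightened variant is easy to express; this is a sketch with illustrative names, not the actual full.rs / snap_sync.rs code:

    // Increment on each failure, then abort once the count reaches the limit,
    // so exactly `max_attempts` consecutive failures are tolerated.
    // On a successful fetch the caller resets the counter to 0.
    fn register_failure_and_should_abort(attempts: &mut u64, max_attempts: u64) -> bool {
        *attempts += 1;
        *attempts >= max_attempts
    }

Whether the limit means 10 or 11 tolerated failures is a policy choice; the important part is that the comparison matches what the constant's documentation promises.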


Correctness: WeightedIndex::new silent failure

peer_table.rs, line 1089

let dist = WeightedIndex::new(&weights).ok()?;

If construction fails (e.g., due to all-zero weights from the overflow case above), this returns None, silently behaving as if no peers are available. A fallback to uniform selection or at least a warn!() log would make this failure mode observable.
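
A sketch of that fallback, assuming the crate's existing tracing-based warn! logging and that a uniform pick is an acceptable degraded mode:

    use rand::distributions::{Distribution, WeightedIndex};
    use rand::Rng;
    use tracing::warn;

    // Surface WeightedIndex construction failures instead of silently behaving
    // as if no peers were available.
    fn pick_index<R: Rng>(weights: &[u64], rng: &mut R) -> Option<usize> {
        if weights.is_empty() {
            return None;
        }
        match WeightedIndex::new(weights) {
            Ok(dist) => Some(dist.sample(rng)),
            Err(err) => {
                warn!("weighted peer selection failed ({}), falling back to uniform", err);
                Some(rng.gen_range(0..weights.len()))
            }
        }
    }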


Positive observations

  • Calling record_success / record_failure in peer_handler.rs is a clear bug fix — the scoring infrastructure existed but was never invoked for header downloads.
  • Resetting attempts = 0 on success to track consecutive failures rather than cumulative is semantically correct and the docstring update on MAX_HEADER_FETCH_ATTEMPTS reflects this accurately.
  • The weight formula score + 151 is mathematically reasonable — it preserves some selection probability for bad peers (weight 1 at minimum valid score) while making good peers ~200x more likely than worst-case peers.
  • Timeout reduction (15s → 5s, 5s → 2s) is consistent with the "fail fast" direction of the PR.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-project-automation github-project-automation Bot moved this to In Review in ethrex_l1 Mar 31, 2026
@Arkenan Arkenan enabled auto-merge April 21, 2026 13:21
@Arkenan Arkenan added this pull request to the merge queue Apr 21, 2026
Merged via the queue into main with commit 06ad6f5 Apr 21, 2026
52 checks passed
@Arkenan Arkenan deleted the fullsync-improvement branch April 21, 2026 14:58
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Apr 21, 2026
avilagaston9 added a commit that referenced this pull request Apr 27, 2026
Brings in main commits since the prior merge: #6516 EIP-8025 compliance
(Electra-aligned ExecutionRequests typed container in NewPayloadRequest,
MAX_CONSOLIDATION_REQUESTS_PER_PAYLOAD corrected from 1 to 2,
to_encoded_requests() helper for EIP-7685 bytes, removal of
ExecutionPayloadHeader/NewPayloadRequestHeader, new byte-oriented
execution_program entrypoint that decodes the wire format internally and
returns valid: false instead of erroring on post-decode failures), #6463
BAL withdrawal reverse check (DB->BAL direction so a malicious builder
can't omit a withdrawal recipient from the BAL), #6505 Kademlia k-bucket
revert (PeerTableServer::spawn no longer takes a node_id), plus snap-sync
observability + dashboards (#6470), pivot-update crash fix (#6475),
weighted peer selection (#6428), txpool_contentFrom/txpool_inspect RPC
(#6446), block-by-block exec fallback (#6464), Amsterdam EELS branch pin
(#6495), and rollup store SQLite v9->v10 migration (#6514).

Conflict resolutions:
- crates/common/types/stateless_ssz.rs: this branch had already moved
  the EIP-8025 SSZ types out of crates/common/types/eip8025_ssz.rs into
  stateless_ssz.rs and tucked the native-rollup containers below them.
  Kept that layout, applied #6516's content updates to the EIP-8025
  section (renamed spec-limit constants, ExecutionRequests typed
  container with to_encoded_requests, dropped header types and their
  tests), pulled in the EncodedRequests import, and kept both the new
  test_execution_requests_to_encoded_bytes and the branch's stateless
  round-trip tests.
- crates/guest-program/src/l1/program.rs: adopted #6516's new
  execution_program(bytes: &[u8], crypto) API with the internal
  decode_eip8025 call, the validate_eip8025_execution helper, and the
  decode-failure test. Rewrote all `eip-8025` feature gates as
  `experimental-devnet` and all `eip8025_ssz::` paths as
  `stateless_ssz::` to match this branch's renames.
- crates/guest-program/bin/{sp1,risc0,zisk,openvm}/src/main.rs: applied
  #6516's simplification (drop decode_eip8025 import, pass &input
  straight to execution_program) under the experimental-devnet feature
  gate. Also flipped the rkyv::rancor::Error import gate from the old
  `eip-8025` name to `experimental-devnet` so the non-devnet build still
  has the import it needs.
- crates/prover/src/backend/exec.rs: kept #6516's updated comment ("raw
  input bytes" instead of "(NewPayloadRequest, ExecutionWitness)") under
  the experimental-devnet feature gate.

Auto-merged regions checked: crates/vm/backends/levm/mod.rs picked up
all of #6463's Part B (DB->BAL) reverse check intact, and
cmd/ethrex/l2/initializers.rs picked up #6505's PeerTableServer::spawn
signature change. Verified cargo fmt --all clean, cargo check --workspace
clean, cargo check --workspace --tests clean, and cargo check -p
ethrex-guest-program --features experimental-devnet --tests clean.
Arkenan added a commit that referenced this pull request Apr 30, 2026
…loads (#6428)

Full sync was very unstable because of peer selection and intolerance
issues.

- Peers were previously selected randomly. We now weight the peer
selection to conserve randomness but prefer nodes with better scores.
- Peer score was not modified with header download success and failure
despite the logs saying so. Added both failure and success peer scoring
modifications. Peer score is only decreased when there is a non-empty
but unchained response, which is invalid. Empty responses are legitimate
and are not penalized.
- Failure policy was cumulative. That is, we were not requiring a peer to
fail 5 consecutive times (which would mean unresponsiveness), we were
counting a total failure count of 5, regardless of success. We changed
this so success resets the failure count. This is also fixed for snap
sync.
- Made failure detection faster (5-second timeout instead of 15).

Labels

L1 Ethereum client

Projects

Status: Done

6 participants