
feat(l1): weighted peer selection and better tolerance in header downloads #6428

Merged
Arkenan merged 6 commits into main from fullsync-improvement on Apr 21, 2026

Conversation

@Arkenan
Collaborator

@Arkenan Arkenan commented Mar 31, 2026

Motivation

Full sync was very unstable because of peer selection and intolerance issues.

Changes

  • Peers were previously selected randomly. We now weight peer selection to conserve randomness while preferring nodes with better scores (see the sketch after this list).
  • Peer scores were not being updated on header download success or failure, despite the logs claiming otherwise. Both success and failure scoring updates are now in place. The score is only decreased for a non-empty but unchained response, which is invalid; empty responses are legitimate and are not penalized.
  • The failure policy was cumulative. That is, instead of requiring a peer to fail 5 consecutive times (which would indicate unresponsiveness), we counted a lifetime total of 5 failures regardless of any successes in between. We changed this so success resets the failure count. The same fix is applied to snap sync.
  • Made failure detection faster (5-second reply timeout instead of 15).
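
As a rough illustration of the weighted selection described above, here is a minimal sketch; the function, types, and constant value are illustrative stand-ins, not the actual ethrex peer-table code:

    use rand::distributions::{Distribution, WeightedIndex};
    use rand::rngs::OsRng;

    // Illustrative minimum score; the real constant lives in the peer table.
    const MIN_SCORE_CRITICAL: i64 = -150;

    // Map scores into strictly positive weights so every peer stays selectable,
    // then sample with OsRng so the choice stays random but biased toward
    // better-scored peers.
    fn pick_weighted_peer(peers: &[(u64, i64)]) -> Option<u64> {
        let weights: Vec<u64> = peers
            .iter()
            .map(|&(_, score)| (score.max(MIN_SCORE_CRITICAL) - MIN_SCORE_CRITICAL + 1) as u64)
            .collect();
        let dist = WeightedIndex::new(&weights).ok()?;
        let idx = dist.sample(&mut OsRng);
        Some(peers[idx].0)
    }

The minimum weight of 1 is what keeps even the worst-scored peer selectable (just unlikely), so the node never cuts itself off from the network if every peer temporarily misbehaves.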

@Arkenan Arkenan requested a review from a team as a code owner March 31, 2026 10:01
@Arkenan Arkenan changed the title Fullsync improvement feat(l1): fullsync peer selection and header download improvments Mar 31, 2026
@github-actions github-actions Bot added the L1 Ethereum client label Mar 31, 2026
@Arkenan Arkenan changed the title feat(l1): fullsync peer selection and header download improvments feat(l1): weighted peer selection and better tolerance in header downloads Mar 31, 2026
@github-actions

🤖 Kimi Code Review

Critical Issues

1. Log message inconsistency in snap_sync.rs

  • File: crates/networking/p2p/sync/snap_sync.rs
  • Lines: 146-148
  • The warning message states "retrying in 5s" but the actual sleep duration was changed to 2 seconds. This mismatch will mislead operators during debugging.
  • Fix: Update the log message to say "retrying in 2s" to match the Duration::from_secs(2) on line 148.

2. Error handling in peer_handler.rs may abort sync unnecessarily

  • File: crates/networking/p2p/peer_handler.rs
  • Lines: 443, 453, 457
  • Using .await? on record_success and record_failure means that if the peer table operation fails (e.g., lock contention, channel closed), the entire header fetch operation returns an error, even when headers were successfully received.
  • Recommendation: Log peer scoring failures but don't fail the sync operation. Consider:
    if let Err(e) = self.peer_table.record_success(&peer_id).await {
        warn!("Failed to record peer success: {}", e);
    }

Medium Issues

3. Potential integer underflow in weight calculation

  • File: crates/networking/p2p/peer_table.rs
  • Line: 1085
  • The cast (score - MIN_SCORE_CRITICAL + 1) as u64 assumes score >= MIN_SCORE_CRITICAL. If a peer's score drops below this threshold (e.g., due to bugs or edge cases), the subtraction yields a negative i64, which when cast to u64 becomes a very large positive number (near u64::MAX), dominating the weighted distribution.
  • Recommendation: Use saturating_sub or clamp the score before calculation:
    .map(|(_, _, score)| score.saturating_sub(MIN_SCORE_CRITICAL).saturating_add(1).max(1) as u64)

4. Unnecessary clone of PeerConnection

  • File: crates/networking/p2p/peer_table.rs
  • Lines: 1075, 1091
  • The connection is cloned twice: once when building the peers vector (line 1075) and again when returning (line 1091).
  • Recommendation: Store references in the temporary vector to avoid the double clone, or use swap_remove if order doesn't matter (a minimal sketch follows).
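
A minimal sketch of that recommendation, assuming the weighted index has already been sampled; Conn is a stand-in for the real PeerConnection type:

    // Select by index first, then clone only the winning entry.
    fn take_selected<Conn: Clone>(entries: &[(u64, Conn, i64)], chosen: usize) -> Option<(u64, Conn)> {
        let entry = entries.get(chosen)?;
        // A single clone of the chosen connection, instead of cloning every
        // candidate while building the temporary vector and again on return.
        Some((entry.0, entry.1.clone()))
    }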

Minor Issues

5. Aggressive timeout reduction

  • File: crates/networking/p2p/snap/constants.rs
  • Line: 66
  • Reducing PEER_REPLY_TIMEOUT from 15s to 5s may cause excessive timeouts on high-latency networks or during temporary congestion. Ensure this is acceptable for the target network conditions.

Positive Notes

  • The weighted peer selection using WeightedIndex correctly uses OsRng for cryptographic randomness.
  • Resetting attempts = 0 on success (lines 78 in full.rs and 151 in snap_sync.rs) correctly implements the "consecutive failures" semantics described in the constant documentation.
  • The score mapping logic (shifting by MIN_SCORE_CRITICAL) ensures bad peers remain selectable but unlikely, preventing total network isolation if all peers temporarily misbehave.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

  1. PEER_REPLY_TIMEOUT is advertised as a snap-only tuning, but this constant is re-exported and used by the generic ETH request path too. Changing it from 15s to 5s in snap/constants.rs:66 also shortens the timeout for GetBlockHeaders / GetBlockBodies in peer_handler.rs:29, peer_handler.rs:71, peer_handler.rs:431, and peer_handler.rs:520. That broadens the behavior change well beyond snap sync and can turn slow-but-valid peers into false timeouts, especially for large body/header responses, which then feeds the new scoring/selection logic. This should either stay scoped to snap requests or use separate timeouts for ETH header/body paths.

  2. request_block_headers_from_hash now penalizes an empty header response the same way as invalid data in peer_handler.rs:440 and peer_handler.rs:449. For full sync, though, an empty response is already treated by the caller as a normal “this peer doesn’t know that head yet” retry case in full.rs:59 and full.rs:63. With the new global peer scoring, this will downgrade honest but slightly lagging peers during ordinary head churn/reorg conditions, and that score then affects unrelated future peer selection. I’d only penalize malformed/non-chained data here, not block_headers.is_empty() (a minimal sketch of this distinction follows the list).

  3. The new “consecutive failures” semantics are off by one: both loops abort only when attempts > MAX_HEADER_FETCH_ATTEMPTS, so with MAX_HEADER_FETCH_ATTEMPTS = 10 they actually allow 11 consecutive failures in full.rs:63 and snap_sync.rs:138. In the snap path there is also a stale log message saying “retrying in 5s” while the code sleeps 2s in snap_sync.rs:146 and snap_sync.rs:148.
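
A minimal sketch of the distinction point 2 asks for; the PeerScore type and the ±1 deltas are illustrative placeholders, not the actual ethrex scoring values:

    // Sketch only: `PeerScore` stands in for the real peer-table entry.
    struct PeerScore(i64);

    fn score_header_response(score: &mut PeerScore, header_count: usize, chained: bool) {
        if header_count == 0 {
            // Empty responses are legitimate ("this peer doesn't know that head yet"):
            // no penalty, so honest but lagging peers are not downgraded.
            return;
        }
        if chained {
            score.0 += 1; // valid, chained headers: reward the peer
        } else {
            score.0 -= 1; // non-empty but unchained data is invalid: penalize
        }
    }

In the real code these updates go through record_success / record_failure on the peer table, and, as the Kimi review notes above, bookkeeping errors there are better logged than propagated.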

Assumption: this review is from static inspection of the diff and touched code. I couldn’t run cargo test in the sandbox because rustup attempted to write to a read-only location outside the workspace.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented Mar 31, 2026

Greptile Summary

This PR improves full sync stability by addressing three interconnected issues with peer selection and failure tracking, and it reduces timeout and retry latency.

  • Weighted peer selection (peer_table.rs): get_random_peer now uses WeightedIndex to bias selection toward higher-scored peers. Scores are mapped from [-150, 50] to weights [1, 201], so even the worst peer remains eligible but is ~200× less likely to be chosen than the best peer.
  • Header download scoring (peer_handler.rs): record_success and record_failure calls were missing after header fetches despite the log messages claiming otherwise. Both are now correctly wired up.
  • Consecutive failure counter (sync/full.rs, sync/snap_sync.rs): The attempts counter is reset to 0 on each successful header response, so MAX_HEADER_FETCH_ATTEMPTS (now 10) truly reflects consecutive failures rather than a lifetime total.
  • Timeout / latency (snap/constants.rs): PEER_REPLY_TIMEOUT drops from 15s → 5s and the inter-attempt retry sleep drops from 5s → 2s, making stalled syncs recover much faster.

Confidence Score: 5/5

Safe to merge; all changes are targeted reliability improvements with no correctness issues found.

The logic for weighted selection, consecutive-failure reset, and header scoring additions is all sound. The only finding is a stale log string in snap_sync.rs ("retrying in 5s" vs actual 2s sleep), which is cosmetic and does not affect runtime behavior.

crates/networking/p2p/sync/snap_sync.rs — stale "retrying in 5s" log message (cosmetic only)

Important Files Changed

Filename and overview:

crates/networking/p2p/peer_table.rs: Replaces uniform-random peer selection with weighted selection using WeightedIndex; the score is mapped from [-150, 50] to weights [1, 201], ensuring no peer weight reaches 0 and the distribution is always valid.
crates/networking/p2p/peer_handler.rs: Adds record_success and record_failure calls on the peer table after header download results, fixing the previously noted discrepancy between logs and actual score updates.
crates/networking/p2p/sync/full.rs: Resets the attempts counter to 0 on each successful header fetch so the limit now tracks only consecutive failures; the retry sleep is correctly reduced to 2s and the log updated to match.
crates/networking/p2p/sync/snap_sync.rs: Applies the same consecutive-failure reset as full.rs; the retry sleep is reduced to 2s but the warning log still says "retrying in 5s" (minor inconsistency).
crates/networking/p2p/snap/constants.rs: PEER_REPLY_TIMEOUT reduced from 15s to 5s and MAX_HEADER_FETCH_ATTEMPTS raised from 5 to 10 to compensate for the new consecutive-reset semantics; comment updated accordingly.

Sequence Diagram

sequenceDiagram
    participant Sync as sync_cycle_full / sync_cycle_snap
    participant PH as PeerHandler
    participant PT as PeerTable

    Sync->>PT: get_random_peer(capabilities)
    Note over PT: WeightedIndex by score<br/>maps [-150,50] → weight [1,201]
    PT-->>Sync: (peer_id, connection)

    Sync->>PH: request_block_headers(peer_id)
    PH->>PT: make_request (timeout: 5s)
    alt Headers valid & chained
        PT-->>PH: block_headers
        PH->>PT: record_success(peer_id)  [score +1]
        PH-->>Sync: Some(block_headers)
        Note over Sync: attempts = 0 (reset)
    else Empty / unchained
        PH->>PT: record_failure(peer_id)  [score -1]
        PH-->>Sync: None
        Note over Sync: attempts += 1
    else Timeout
        PH->>PT: record_failure(peer_id)  [score -1]
        PH-->>Sync: None
        Note over Sync: attempts += 1
    end

    alt attempts > MAX_HEADER_FETCH_ATTEMPTS (10)
        Sync->>Sync: abort, wait for newer CL sync head
    else
        Sync->>Sync: sleep 2s, retry
    end

@github-actions

github-actions Bot commented Mar 31, 2026

Lines of code report

Total lines added: 19
Total lines removed: 0
Total lines changed: 19

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 596   | +6   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_table.rs     | 1067  | +11  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs      | 249   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1149  | +1   |
+------------------------------------------------+-------+------+

Comment thread crates/networking/p2p/sync/snap_sync.rs Outdated
@@ -145,9 +145,11 @@ pub async fn sync_cycle_snap(
warn!(
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"


P2 Stale log message after sleep reduction

The sleep duration was updated to 2s, but the log message still claims "retrying in 5s". The equivalent line in sync/full.rs was correctly updated to "2s". This will mislead operators reading logs.

Suggested change
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"
"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 2s"

@github-actions

🤖 Claude Code Review

Now I have enough context to write a thorough review.


PR Review: feat(l1): weighted peer selection and better tolerance in header downloads

Overall: The motivation is sound and the changes are well-scoped. A few correctness issues need attention before merge.


Critical: Potential integer overflow in weighted peer selection

peer_table.rs, lines 1085–1088

let weights: Vec<u64> = peers
    .iter()
    .map(|(_, _, score)| (score - MIN_SCORE_CRITICAL + 1) as u64)
    .collect();

MIN_SCORE_CRITICAL = MIN_SCORE * 3 = -150. The formula is score + 151, cast to u64. If a peer's score drops below -151, the i64 value becomes negative, and the as u64 cast wraps to a near-u64::MAX value — that peer would then receive an astronomically high selection weight, completely defeating the purpose of the scoring system.

The comment states the range is [-150, 50], but there's no clamping enforcing this at the call site. Defensively, you should clamp before casting:

.map(|(_, _, score)| (score.max(MIN_SCORE_CRITICAL) - MIN_SCORE_CRITICAL + 1) as u64)

This is worth verifying: does record_failure clamp the score, or are peers whose score drops below MIN_SCORE_CRITICAL disconnected before reaching this point? If it's reliably enforced upstream the risk is lower, but since this is consensus-adjacent networking code, a one-line guard is cheap insurance.


Minor: Stale log message in snap_sync.rs

snap_sync.rs, line 146 (post-diff context, line ~146 in the file)

"Failed to fetch headers for sync head (attempt {attempts}/{MAX_HEADER_FETCH_ATTEMPTS}), retrying in 5s"

The sleep was changed to 2s (Duration::from_secs(2)) but the log message still says "retrying in 5s". The equivalent message in full.rs was correctly updated to "retrying in 2s".


Correctness: attempts increment/check ordering

full.rs and snap_sync.rs

The current structure is:

if attempts > MAX_HEADER_FETCH_ATTEMPTS {
    // abort
}
attempts += 1;

With attempts = 0 on success and MAX_HEADER_FETCH_ATTEMPTS = 10, this means up to 11 failures are tolerated before aborting (the check fires when attempts > 10, so the 11th increment takes it to 11 and the 12th iteration hits the guard). This is a pre-existing pattern, not introduced here, but worth noting. Consider >= instead of > if you want exactly 10 attempts.
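
A tightened variant is easy to express; this is a sketch with illustrative names, not the actual full.rs / snap_sync.rs code:

    // Increment on each failure, then abort once the count reaches the limit,
    // so exactly `max_attempts` consecutive failures are tolerated.
    // On a successful fetch the caller resets the counter to 0.
    fn register_failure_and_should_abort(attempts: &mut u64, max_attempts: u64) -> bool {
        *attempts += 1;
        *attempts >= max_attempts
    }

Whether the limit means 10 or 11 tolerated failures is a policy choice; the important part is that the comparison matches what the constant's documentation promises.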


Correctness: WeightedIndex::new silent failure

peer_table.rs, line 1089

let dist = WeightedIndex::new(&weights).ok()?;

If construction fails (e.g., due to all-zero weights from the overflow case above), this returns None, silently behaving as if no peers are available. A fallback to uniform selection or at least a warn!() log would make this failure mode observable.
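
A sketch of that fallback, assuming the crate's existing tracing-based warn! logging and that a uniform pick is an acceptable degraded mode:

    use rand::distributions::{Distribution, WeightedIndex};
    use rand::Rng;
    use tracing::warn;

    // Surface WeightedIndex construction failures instead of silently behaving
    // as if no peers were available.
    fn pick_index<R: Rng>(weights: &[u64], rng: &mut R) -> Option<usize> {
        if weights.is_empty() {
            return None;
        }
        match WeightedIndex::new(weights) {
            Ok(dist) => Some(dist.sample(rng)),
            Err(err) => {
                warn!("weighted peer selection failed ({}), falling back to uniform", err);
                Some(rng.gen_range(0..weights.len()))
            }
        }
    }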


Positive observations

  • Calling record_success / record_failure in peer_handler.rs is a clear bug fix — the scoring infrastructure existed but was never invoked for header downloads.
  • Resetting attempts = 0 on success to track consecutive failures rather than cumulative is semantically correct and the docstring update on MAX_HEADER_FETCH_ATTEMPTS reflects this accurately.
  • The weight formula score + 151 is mathematically reasonable — it preserves some selection probability for bad peers (weight 1 at minimum valid score) while making good peers ~200x more likely than worst-case peers.
  • Timeout reduction (15s → 5s, 5s → 2s) is consistent with the "fail fast" direction of the PR.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-project-automation github-project-automation Bot moved this to In Review in ethrex_l1 Mar 31, 2026
@Arkenan Arkenan enabled auto-merge April 21, 2026 13:21
@Arkenan Arkenan added this pull request to the merge queue Apr 21, 2026
Merged via the queue into main with commit 06ad6f5 Apr 21, 2026
52 checks passed
@Arkenan Arkenan deleted the fullsync-improvement branch April 21, 2026 14:58
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Apr 21, 2026
avilagaston9 added a commit that referenced this pull request Apr 27, 2026
Brings in main commits since the prior merge: #6516 EIP-8025 compliance
(Electra-aligned ExecutionRequests typed container in NewPayloadRequest,
MAX_CONSOLIDATION_REQUESTS_PER_PAYLOAD corrected from 1 to 2,
to_encoded_requests() helper for EIP-7685 bytes, removal of
ExecutionPayloadHeader/NewPayloadRequestHeader, new byte-oriented
execution_program entrypoint that decodes the wire format internally and
returns valid: false instead of erroring on post-decode failures), #6463
BAL withdrawal reverse check (DB->BAL direction so a malicious builder
can't omit a withdrawal recipient from the BAL), #6505 Kademlia k-bucket
revert (PeerTableServer::spawn no longer takes a node_id), plus snap-sync
observability + dashboards (#6470), pivot-update crash fix (#6475),
weighted peer selection (#6428), txpool_contentFrom/txpool_inspect RPC
(#6446), block-by-block exec fallback (#6464), Amsterdam EELS branch pin
(#6495), and rollup store SQLite v9->v10 migration (#6514).

Conflict resolutions:
- crates/common/types/stateless_ssz.rs: this branch had already moved
  the EIP-8025 SSZ types out of crates/common/types/eip8025_ssz.rs into
  stateless_ssz.rs and tucked the native-rollup containers below them.
  Kept that layout, applied #6516's content updates to the EIP-8025
  section (renamed spec-limit constants, ExecutionRequests typed
  container with to_encoded_requests, dropped header types and their
  tests), pulled in the EncodedRequests import, and kept both the new
  test_execution_requests_to_encoded_bytes and the branch's stateless
  round-trip tests.
- crates/guest-program/src/l1/program.rs: adopted #6516's new
  execution_program(bytes: &[u8], crypto) API with the internal
  decode_eip8025 call, the validate_eip8025_execution helper, and the
  decode-failure test. Rewrote all `eip-8025` feature gates as
  `experimental-devnet` and all `eip8025_ssz::` paths as
  `stateless_ssz::` to match this branch's renames.
- crates/guest-program/bin/{sp1,risc0,zisk,openvm}/src/main.rs: applied
  #6516's simplification (drop decode_eip8025 import, pass &input
  straight to execution_program) under the experimental-devnet feature
  gate. Also flipped the rkyv::rancor::Error import gate from the old
  `eip-8025` name to `experimental-devnet` so the non-devnet build still
  has the import it needs.
- crates/prover/src/backend/exec.rs: kept #6516's updated comment ("raw
  input bytes" instead of "(NewPayloadRequest, ExecutionWitness)") under
  the experimental-devnet feature gate.

Auto-merged regions checked: crates/vm/backends/levm/mod.rs picked up
all of #6463's Part B (DB->BAL) reverse check intact, and
cmd/ethrex/l2/initializers.rs picked up #6505's PeerTableServer::spawn
signature change. Verified cargo fmt --all clean, cargo check --workspace
clean, cargo check --workspace --tests clean, and cargo check -p
ethrex-guest-program --features experimental-devnet --tests clean.
Arkenan added a commit that referenced this pull request Apr 30, 2026
…loads (#6428)

Full sync was very unstable because of peer selection and intolerance
issues.

- Peers were previously selected randomly. We now weight the peer
selection to conserve randomness but prefer nodes with better scores.
- Peer score was not modified with header download success and failure
despite the logs saying so. Added both failure and success peer scoring
modifications. Peer score is only decreased when there is a non-empty
but unchained response, which is invalid. Empty responses are legitimate
and are not penalized.
- Failure policy was cumulative. That is, we were not requiring a peer to
fail 5 consecutive times (which would mean unresponsiveness), we were
counting a total failure count of 5, regardless of success. We changed
this so success resets the failure count. This is also fixed for snap
sync.
- Made failure detection faster (5-second timeout instead of 15).

Labels

L1 Ethereum client

Projects

Status: Done

6 participants