
feat(p2p): use BlocksByRange for long-range sync #351

Open
dicethedev wants to merge 8 commits into lambdaclass:main from dicethedev:feat/blocks-by-range-long-range-sync

Conversation

@dicethedev
Contributor

🗒️ Description / Motivation

This PR closes #347 by wiring the BlocksByRange protocol added in #348 into the status-response sync path.

Previously, when a peer's Status response revealed it was ahead of our local head, we had no mechanism to backfill the gap. Now, when the gap exceeds a configurable threshold (LONG_RANGE_SYNC_THRESHOLD = 2 slots), we request the missing range using BlocksByRange instead of relying on gossip or individual BlocksByRoot fetches.

For small gaps (1–2 slots), we defer to the existing FetchBlock path since roots are typically already available from gossip and BlocksByRoot is more precise for that case.


What Changed

lib.rs

  • Added LONG_RANGE_SYNC_THRESHOLD: u64 = 2 constant

req_resp/handlers.rs

  • Updated handle_status_response to branch on gap size (sketched below):
    • gap > LONG_RANGE_SYNC_THRESHOLD → request_blocks_by_range_from_peer
    • gap ≤ LONG_RANGE_SYNC_THRESHOLD → defer to gossip / FetchBlock
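
Concretely, the branch amounts to something like the following sketch. The constant and the helper name come from this PR; the surrounding signature and the SyncAction enum are illustrative stand-ins, not the actual handlers.rs code:

```rust
/// From this PR: gaps above this many slots use BlocksByRange.
const LONG_RANGE_SYNC_THRESHOLD: u64 = 2;

/// Illustrative stand-in for the two sync paths the handler can take.
enum SyncAction {
    /// Backfill via request_blocks_by_range_from_peer (batched internally).
    BlocksByRange { start_slot: u64, count: u64 },
    /// Small gap: roots usually arrive via gossip, so FetchBlock/BlocksByRoot suffices.
    DeferToGossip,
}

fn on_status_response(peer_head_slot: u64, local_head_slot: u64) -> SyncAction {
    let gap = peer_head_slot.saturating_sub(local_head_slot);
    if gap > LONG_RANGE_SYNC_THRESHOLD {
        SyncAction::BlocksByRange { start_slot: local_head_slot + 1, count: gap }
    } else {
        SyncAction::DeferToGossip
    }
}
```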

Correctness / Behavior Guarantees

  • request_blocks_by_range_from_peer already batches internally at MAX_REQUEST_BLOCKS (1024), so nodes thousands of slots behind are handled correctly across multiple requests with no additional changes (see the batching sketch after this list)
  • handle_blocks_by_range_response (added in #348, "feat(p2p): add inbound BlocksByRange req/resp support") already forwards each block to the blockchain layer — the response path is complete
  • BlocksByRoot behavior for individual missing blocks (FetchBlock, retry/backoff logic) is unchanged
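
For reference, the batching claim corresponds to splitting the range as in the sketch below. Only MAX_REQUEST_BLOCKS = 1024 is taken from the code; the helper is a stand-in for the internal loop of request_blocks_by_range_from_peer, which also owns the swarm request plumbing:

```rust
const MAX_REQUEST_BLOCKS: u64 = 1024;

/// Splits (start_slot, count) into batches of at most MAX_REQUEST_BLOCKS blocks.
/// A peer 3000 slots ahead yields (s, 1024), (s + 1024, 1024), (s + 2048, 952).
fn batches(start_slot: u64, count: u64) -> impl Iterator<Item = (u64, u64)> {
    (0..count.div_ceil(MAX_REQUEST_BLOCKS)).map(move |i| {
        let offset = i * MAX_REQUEST_BLOCKS;
        (start_slot + offset, MAX_REQUEST_BLOCKS.min(count - offset))
    })
}
```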

Tests Added / Run

No new tests required. The range response handling and canonical block selection are covered by the test added in #348 (blocks_by_range_returns_canonical_blocks_in_requested_order).

Related Issues / PRs

✅ Verification Checklist

  • Ran make fmt — clean
  • Ran make lint (clippy with -D warnings) — clean
  • Ran cargo test --workspace --release — all passing

@greptile-apps
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR wires the previously-added BlocksByRange protocol into the status-response sync path: when a peer's Status reveals it is more than LONG_RANGE_SYNC_THRESHOLD (2) slots ahead, the node now requests the missing range in batches of up to 1024 blocks instead of relying solely on gossip.

  • handle_status_response now computes the slot gap and dispatches request_blocks_by_range_from_peer, which loops ⌈gap / 1024⌉ times sending batched requests; the gap is taken directly from the peer's untrusted Status message with no upper bound.
  • canonical_blocks_by_range serves inbound requests by walking the canonical chain backwards from store.head() until reaching start_slot; traversal cost is O(head_slot − start_slot), not O(count), making it vulnerable to cheap deep-history requests.
  • handle_blocks_by_range_response forwards received blocks to the blockchain layer with no slot-range validation and no retry path wired into the OutboundFailure event handler.

Confidence Score: 3/5

The codec, messages, and module wiring changes are safe, but handlers.rs has three defects on the changed sync path that should be addressed before merging.

The gap passed to the batch-request loop is taken directly from a peer's unauthenticated Status message with no ceiling, so a malicious peer can send an astronomically large head slot and pin the node in a near-infinite async loop. Separately, BlocksByRange request IDs are never registered in request_id_map, so every outbound-failure event silently drops the failure with no retry, leaving the node permanently behind if any batch times out. Additionally, the inbound handler walks the canonical chain from head to the requested start_slot, making its cost proportional to chain depth rather than the requested block count — a cheap way to induce heavy work on the responding node.

crates/net/p2p/src/req_resp/handlers.rs warrants close attention across handle_status_response, request_blocks_by_range_from_peer, canonical_blocks_by_range, and the OutboundFailure branch of handle_req_resp_message.

Important Files Changed

  • crates/net/p2p/src/req_resp/handlers.rs: core file with the most significant changes. Adds handle_status_response gap-based branching, canonical_blocks_by_range chain traversal, and the BlocksByRange request/response handlers; flagged for an uncapped gap loop, no OutboundFailure recovery, and an O(head_slot) traversal vulnerability.
  • crates/net/p2p/src/req_resp/codec.rs: cleanly refactors decode_blocks_by_root_response into a shared decode_blocks_response helper and wires BlocksByRange into the codec; no issues found.
  • crates/net/p2p/src/req_resp/messages.rs: adds the BlocksByRangeRequest struct, the BlocksByRange response payload variant, and the MAX_REQUEST_BLOCKS constant, and removes the dead_code attribute from error_message; straightforward and correct.
  • crates/net/p2p/src/lib.rs: adds the LONG_RANGE_SYNC_THRESHOLD constant and registers the BlocksByRange protocol with ProtocolSupport::Full in the swarm builder (see the sketch after this list); no issues found.
  • crates/net/p2p/src/req_resp/mod.rs: re-exports the new BlocksByRange symbols; trivial change with no issues.
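
For context on the lib.rs row, registering a request/response protocol with ProtocolSupport::Full in libp2p typically looks like the sketch below. The codec bound and the protocol string are assumptions here, not taken from the diff:

```rust
use libp2p::request_response::{self, Codec, ProtocolSupport};
use libp2p::StreamProtocol;

/// Hypothetical helper: build the BlocksByRange behaviour for the swarm.
/// `C` stands in for the crate's own SSZ-snappy codec; the protocol id is assumed.
fn blocks_by_range_behaviour<C>(codec: C) -> request_response::Behaviour<C>
where
    C: Codec<Protocol = StreamProtocol> + Send + Clone + 'static,
{
    request_response::Behaviour::with_codec(
        codec,
        // Full = this node both serves inbound range requests and sends its own.
        [(
            StreamProtocol::new("/eth2/beacon_chain/req/beacon_blocks_by_range/1/ssz_snappy"),
            ProtocolSupport::Full,
        )],
        request_response::Config::default(),
    )
}
```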

Sequence Diagram

```mermaid
sequenceDiagram
    participant Local as Local Node
    participant Peer as Remote Peer
    participant Blockchain as Blockchain Layer

    Local->>Peer: Status Request
    Peer-->>Local: Status Response (head_slot, finalized)

    Local->>Local: "gap = peer.head_slot - our_head_slot"

    alt "gap <= LONG_RANGE_SYNC_THRESHOLD (2)"
        Local->>Local: rely on gossip / FetchBlock (BlocksByRoot)
    else "gap > LONG_RANGE_SYNC_THRESHOLD"
        loop ceil(gap / MAX_REQUEST_BLOCKS) batches
            Local->>Peer: "BlocksByRange Request (start_slot, count<=1024, step=1)"
            Peer-->>Local: BlocksByRange Response [blocks...]
            Local->>Blockchain: new_block() for each block
        end
    end

    note over Local,Peer: Inbound path (serving requests)
    Peer->>Local: BlocksByRange Request
    Local->>Local: canonical_blocks_by_range() walk chain from head to start_slot
    Local-->>Peer: BlocksByRange Response [canonical blocks]
```

Code Review Issues (4)

### Issue 1 of 4
crates/net/p2p/src/req_resp/handlers.rs:141-151
**Uncapped gap triggers a near-infinite request loop**

`gap` is directly used as the `count` passed to `request_blocks_by_range_from_peer`, which loops `⌈gap / MAX_REQUEST_BLOCKS⌉` times sending requests. A malicious peer sending `Status { head.slot = u64::MAX }` would cause the loop to iterate ~1.8 × 10¹⁶ times — effectively hanging the node until the swarm channel closes (whose capacity determines how long that takes). Even a "legitimate" peer claiming to be 10 million slots ahead would immediately queue ~9,766 requests. There is no upper bound on how many batches are dispatched in a single call.
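
A concise fix is to clamp the count before dispatching batches and let the next Status exchange drive further rounds; a sketch follows (the ceiling's name and value are made up, not from the codebase):

```rust
const MAX_REQUEST_BLOCKS: u64 = 1024;
// Hypothetical ceiling: at most 64 batches per sync round. The next Status
// response reveals whatever gap remains, so sync still converges incrementally.
const MAX_SLOTS_PER_SYNC_ROUND: u64 = 64 * MAX_REQUEST_BLOCKS;

fn capped_count(gap: u64) -> u64 {
    gap.min(MAX_SLOTS_PER_SYNC_ROUND)
}
```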

### Issue 2 of 4
crates/net/p2p/src/req_resp/handlers.rs:93-105
**BlocksByRange outbound failures are silently discarded with no recovery**

`request_id_map` is only populated for `BlocksByRoot` requests (see `fetch_block_from_peer`). When an `OutboundFailure` fires for a `BlocksByRange` request, the `if let Some(root) = server.request_id_map.remove(&request_id)` branch is never taken, so the failure is logged but nothing else happens. If any batch in a long-range sync fails (network error, peer disconnect, timeout), the sync silently stops with no retry or fallback. The node remains stuck behind its peers with no automatic recovery until the next Status message happens to arrive.
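
A minimal shape for the fix, assuming `request_id_map` is generalized from roots to a small per-request enum (every name here except `request_id_map` is hypothetical):

```rust
use std::collections::HashMap;

/// Hypothetical: record what each in-flight outbound request was for, so an
/// OutboundFailure can be mapped back to a retry instead of being dropped.
enum Pending {
    BlocksByRoot { root: [u8; 32] },
    BlocksByRange { start_slot: u64, count: u64 },
}

struct Server {
    /// Keyed by the outbound request id (u64 stands in for libp2p's id type).
    request_id_map: HashMap<u64, Pending>,
}

impl Server {
    fn on_outbound_failure(&mut self, request_id: u64) {
        match self.request_id_map.remove(&request_id) {
            Some(Pending::BlocksByRoot { root: _root }) => {
                // existing retry/backoff path for single-block fetches
            }
            Some(Pending::BlocksByRange { start_slot: _, count: _ }) => {
                // re-dispatch the failed batch, ideally to a different peer
            }
            None => {
                // unknown request id: log and move on, as today
            }
        }
    }
}
```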

### Issue 3 of 4
crates/net/p2p/src/req_resp/handlers.rs:221-267
**O(head_slot) chain traversal for old `start_slot` requests**

`canonical_blocks_by_range` always starts walking from `store.head()` and traverses backwards one header at a time until `header.slot < start_slot`. If a peer requests `start_slot = 0, count = 1024` on a chain whose head is at slot 1,000,000, the loop performs 1,000,000 `store.get_block_header` calls before collecting any of the 1024 requested blocks. Since `count` is bounded by `MAX_REQUEST_BLOCKS` but `start_slot` is not validated against the local chain, this becomes an unbounded-work request handler. A peer can exploit this for a cheap DoS by repeatedly requesting from `start_slot = 0`. One mitigation is sketched below.
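
A sketch of the O(count) alternative, assuming the store gains a slot-to-canonical-root index (the `Store` trait here is hypothetical; absent such an index, at least bounding how far below the finalized slot a request may reach limits the damage):

```rust
/// Hypothetical store API with a slot -> canonical root index, which makes
/// serving a range O(count) instead of O(head_slot - start_slot).
trait Store {
    fn head_slot(&self) -> u64;
    fn canonical_root_at_slot(&self, slot: u64) -> Option<[u8; 32]>;
}

fn canonical_roots_by_range(store: &impl Store, start_slot: u64, count: u64) -> Vec<[u8; 32]> {
    // Clamp the window to the local head; empty slots contribute no block.
    let end = start_slot
        .saturating_add(count)
        .min(store.head_slot().saturating_add(1));
    (start_slot..end)
        .filter_map(|slot| store.canonical_root_at_slot(slot))
        .collect()
}
```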

### Issue 4 of 4
crates/net/p2p/src/req_resp/handlers.rs:328-342
**Range-fetched blocks are forwarded without slot validation**

`handle_blocks_by_range_response` does not verify that the returned blocks' slots fall within the requested range. A misbehaving peer can inject arbitrary blocks (from a different range or a different chain) and they will be forwarded unconditionally to the blockchain layer. At minimum, the slot of each block should be cross-checked against the requested `[start_slot, start_slot + count)` window, which is available from the request context (a concrete check is sketched after the suggestion below).

```suggestion
    if let Some(ref blockchain) = server.blockchain {
        for block in blocks {
            let block_root = block.message.hash_tree_root();
            let slot = block.message.slot;
            // TODO: validate block.message.slot is within the originally requested range.
            let _ = blockchain.new_block(block).inspect_err(|err| {
                error!(
                    %peer,
                    %slot,
                    block_root = %ethlambda_types::ShortRoot(&block_root.0),
                    %err,
                    "Failed to forward range-fetched block to blockchain"
                )
            });
        }
    }
```
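
The TODO above could be discharged with a check along these lines, assuming the `(start_slot, count)` of the originating request is retrievable from the request context; blocks failing it would be dropped (and could count against the peer's score) rather than forwarded to `new_block`:

```rust
/// True iff `slot` falls inside the requested [start_slot, start_slot + count) window.
fn in_requested_window(slot: u64, start_slot: u64, count: u64) -> bool {
    slot >= start_slot && slot < start_slot.saturating_add(count)
}
```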

Reviews (1): Last reviewed commit: "fix(clippy): use is_multiple_of for slot..."

