
refactor(storage): dedupe PayloadBuffer proofs by subsumption #313

Open
MegaRedHand wants to merge 1 commit into main from payload-buffer-antichain-dedupe

Conversation

@MegaRedHand
Collaborator

Summary

Replace the equality-only dedup in PayloadBuffer::push with a subset-based dedup that maintains each data_root's proof list as an antichain. On push, if the incoming proof's participants are a subset (including equal) of any existing proof, the push is skipped; otherwise every existing proof whose participants are a strict subset of the incoming proof is removed before the insert.
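The push logic described above can be sketched in isolation. This is a simplified model, not the PR's code: `Proof` here is a hypothetical stand-in carrying only its participant list, and the real implementation keys entries by `data_root` and tracks `total_proofs` for FIFO eviction, which is omitted.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-in for an aggregated proof: only the
// participant indices matter for the subsumption check.
#[derive(Debug, Clone, PartialEq)]
struct Proof {
    participants: Vec<u64>,
}

impl Proof {
    fn participant_set(&self) -> HashSet<u64> {
        self.participants.iter().copied().collect()
    }
}

/// Push `new` into `proofs`, keeping the list an antichain under the
/// subset relation. Returns false if the push was skipped because the
/// incoming proof is subsumed by (subset of, incl. equal to) an
/// existing one.
fn antichain_push(proofs: &mut Vec<Proof>, new: Proof) -> bool {
    let new_set = new.participant_set();
    let mut to_remove: Vec<usize> = Vec::new();
    for (i, p) in proofs.iter().enumerate() {
        let existing = p.participant_set();
        // Incoming adds no coverage: skip the push entirely.
        if new_set.is_subset(&existing) {
            return false;
        }
        // Existing proof is strictly subsumed by the incoming one.
        if existing.is_subset(&new_set) {
            to_remove.push(i);
        }
    }
    // Remove highest indices first so lower indices stay valid.
    for i in to_remove.into_iter().rev() {
        proofs.swap_remove(i);
    }
    proofs.push(new);
    true
}
```

A superset push absorbs all the child proofs it subsumes in one call, while incomparable proofs coexist untouched.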

Why

After an aggregator produces a new proof for a data_root, the child proofs merged into it are still in new_payloads with participant sets that are strict subsets of the new proof's. They add no coverage and just consume PayloadBuffer capacity, accelerating FIFO eviction of other data_roots. The same kind of redundancy accumulates in known_payloads through drain_new_to_known and block ingestion.

The leanSpec models this at leanSpec/src/lean_spec/subspecs/forkchoice/store.py:1048-1062 by wholesale-replacing latest_new_aggregated_payloads[D] with {new_proof}. We can't mimic that synchronously because (a) aggregation runs on a spawn_blocking worker while the actor keeps writing new_payloads via gossip ingestion, and (b) update_head reads known_payloads only, so any known-side pruning has to be carefully timed to avoid a fork-choice coverage gap.

Dedupe-on-push sidesteps both constraints:

  • Push never shrinks validator coverage: it only removes proofs whose validators are a strict subset of the added one's. Any update_head reading known_payloads between pushes sees monotonically-growing coverage, so there's no fork-choice timing risk.
  • Concurrent gossip-received aggregates that are incomparable with our locally produced aggregate are preserved; ones that are subsumed are absorbed.
  • No separate cleanup passes or wiring: apply_aggregated_group, on_gossip_aggregated_attestation, on_block_core, and promote_new_aggregated_payloads all benefit transparently.

Cost

O(n·v) per push (n = proofs currently in the entry, v = participants per proof) via HashSet<u64> membership. n is bounded by the max-antichain size per data_root (typically 1 in steady state, small when concurrent aggregates overlap). If profiling later shows push in a hot path, a bitwise-native subset op on AggregationBits is a natural follow-up.
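The bitwise follow-up mentioned above can be sketched on a plain u64 bitmask. This is illustrative only: the real `AggregationBits` type would operate word-by-word over a longer bit array, and its API is not shown in this PR.

```rust
// Subset test on bitmasks: a ⊆ b iff every bit set in `a` is also set
// in `b`, i.e. `a` has no bits outside `b`. No allocation, O(words).
fn is_subset(a: u64, b: u64) -> bool {
    a & !b == 0
}
```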

Tests

  • 8 new unit tests on PayloadBuffer::push: superset removes subset, subset skipped, equal-participants skipped, incomparable proofs coexist, superset absorbs multiple singletons in one push, mixed kept/removed under same data_root, cross-data-root independence, FIFO eviction still uses total_proofs.
  • All 4 existing PayloadBuffer tests keep passing unchanged (they use pairwise-incomparable participant sets).

Test plan

  • cargo test -p ethlambda-storage — 30 tests pass
  • cargo test -p ethlambda-blockchain — 20 unit tests pass
  • cargo test -p ethlambda-blockchain --test forkchoice_spectests --release — 70 spec tests pass
  • cargo fmt --all clean
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • Local devnet smoke (post-review): with --is-aggregator, verify lean_latest_new_aggregated_payloads and lean_latest_known_aggregated_payloads stay bounded across interval 2 / interval 4 instead of growing monotonically

Maintain each `PayloadBuffer` entry as an antichain under the subset
relation on participants. On push:

- skip if incoming participants are a subset (incl. equal) of any
  existing proof (adds no coverage);
- otherwise remove existing proofs whose participants are a strict
  subset of the incoming one, then insert.

This generalises the prior equality-only dedup and eliminates
redundant proofs that otherwise accumulate after aggregation
produces a superset of its child proofs, or when a block embeds a
proof subsumed by one already in `known_payloads`. Validator
coverage is monotonic across pushes, so `update_head`'s reads of
`known_payloads` are never temporarily degraded.

Affects every insert path transparently — `apply_aggregated_group`,
`on_gossip_aggregated_attestation`, `on_block_core`, and
`promote_new_aggregated_payloads` — without touching their call
sites.
@github-actions

🤖 Kimi Code Review

Review: PR #313 - Antichain invariant for PayloadBuffer

The implementation correctly maintains an antichain of participant sets per data_root, ensuring no redundant subsets are stored. Logic is sound and well-tested.

Correctness & Logic

  • Lines 143-162: The subset checks correctly implement the antichain property. Early return on new_set.is_subset(&existing_set) (line 154) properly handles equality and strict superset cases.
  • Lines 165-168: Reverse-iteration with swap_remove safely removes subsumed proofs without invalidating indices.
  • Line 167: total_proofs accounting is correct (decrement per removal, increment on final push).

Performance

  • Line 143, 149: Allocating HashSet<u64> for every comparison is O(n*m) where n = existing proofs, m = participants. For large validator sets, consider using bitwise operations on AggregationBits directly to avoid allocations, though the FIFO capacity limits mitigate worst-case impact.

Consensus Safety

  • The antichain invariant is appropriate for attestation aggregation: superseding proofs replace subsets, preventing storage bloat from partial aggregations while preserving incomparable attestations for potential future aggregation.

Tests

  • Comprehensive coverage (lines 1792-2000) verifies subset absorption, skipping of subsumed (subset/equal) pushes, incomparable coexistence, and FIFO eviction boundaries.

Nits

  • Line 157: Comment claims "strict subset" but code uses is_subset. This is correct because equality is filtered by line 154, but consider is_strict_subset if available for clarity.
  • Line 162: to_remove could use SmallVec<[usize; 4]> or similar to avoid heap allocation for the common case of few removals.

Verdict: LGTM. The logic is consensus-safe and the test suite validates the antichain behavior thoroughly.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

  • Medium: The new antichain check allocates a fresh HashSet<u64> for the incoming proof and another one for every existing proof on each insert at crates/storage/src/store.rs:145 and crates/storage/src/store.rs:150. This sits on the aggregated-attestation ingest path at crates/blockchain/src/store.rs:333, so every valid peer-fed aggregate now pays repeated heap allocation and hash work proportional to stored_proofs * participants. In consensus code that is a real DoS surface. I’d avoid materializing sets here and do subset checks directly on AggregationBits, or cache a canonical participant set/bitmask once per stored proof.

No correctness or safety regression stood out beyond that. The subset/superset pruning itself is consistent with how these proofs are later used for fork-choice weight extraction, block building, and recursive aggregation.

I could not run the Rust tests here because cargo/rustup tried to write under read-only /home/runner/.rustup/tmp.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Claude Code Review



Review: refactor(storage): dedupe PayloadBuffer proofs by subsumption on push

Overview

Replaces the equality-only dedup in PayloadBuffer::push with an antichain-based dedup: on push, incoming proofs that are subsumed by an existing proof are skipped; incoming proofs that supersede existing ones cause the subsumed proofs to be removed before insert. 8 new unit tests are added.


Algorithm Correctness

The core algorithm is correct. A few invariants worth verifying explicitly:

The early return is always safe (nothing left half-done)

When new_set.is_subset(&existing_set) fires the early return, to_remove is provably empty at that point. If to_remove were non-empty — meaning some prior proof X had X_set ⊆ new_set — then X_set ⊆ new_set ⊆ existing_set, i.e., X ⊆ existing, which violates the antichain invariant on entry.proofs. Since the invariant is maintained inductively (base case: empty is trivially an antichain; inductive step: every push preserves it), the scenario cannot arise. The comment in the loop handles the mechanics but this invariant isn't mentioned anywhere. Consider a one-line comment at the top of the if let Some(entry) block.

Reverse-order swap_remove is correct

to_remove is built in ascending index order; rev() processes it highest-first. swap_remove(i) moves the last element to position i and shrinks the Vec by one. Processing in descending order ensures positions 0..i are untouched when index i is removed, making the original indices still valid for all subsequent (lower) removals. ✓
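The descending-order removal can be demonstrated in isolation. `remove_marked` is a hypothetical helper mirroring the shape of the loop under review, not code from the PR:

```rust
// Remove the elements at the marked indices using swap_remove.
// Processing in descending order is what keeps the original indices
// valid: swap_remove(i) only disturbs position i and the last slot,
// so positions below i are untouched when i is removed first.
fn remove_marked(v: &mut Vec<u32>, mut marked: Vec<usize>) {
    marked.sort_unstable(); // ascending, as the scan loop builds it
    for i in marked.into_iter().rev() {
        v.swap_remove(i); // O(1): last element fills slot i
    }
}
```

Note that element order is not preserved (the last element is swapped into the vacated slot), which is acceptable here because the proof list has no ordering significance.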

FIFO eviction skipped on early return

When the incoming proof is subsumed (return before the eviction loop), total_proofs is unchanged. Since the eviction loop in every prior push leaves total_proofs <= capacity, skipping eviction here is correct. ✓

Antichain invariant maintained inductively

After the removal loop, all surviving proofs satisfy existing ⊄ new (otherwise they'd be in to_remove) and new ⊄ existing (otherwise we'd have returned early). Combined with the pre-push antichain invariant on the survivors, the post-push state is a valid antichain. ✓
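The invariant itself is cheap to state as a brute-force check, e.g. usable in a debug assertion. `is_antichain` is a hypothetical helper, not part of the PR:

```rust
use std::collections::HashSet;

// Pairwise incomparability: no proof's participant set is a subset of
// another's. Equal sets fail the check in both directions, so
// duplicates also violate the antichain invariant, as intended.
fn is_antichain(sets: &[HashSet<u64>]) -> bool {
    for (i, a) in sets.iter().enumerate() {
        for (j, b) in sets.iter().enumerate() {
            if i != j && a.is_subset(b) {
                return false;
            }
        }
    }
    true
}
```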


Issues

1. to_remove allocated on every push to an existing data_root

let mut to_remove: Vec<usize> = Vec::new();

Vec::new() doesn't heap-allocate until elements are pushed, so this is free when to_remove stays empty (the common steady-state case). It's fine as-is, but if this ever shows up in profiles, smallvec::SmallVec<[usize; 4]> or inline removal (tracking a write cursor) would eliminate the allocation entirely.

2. HashSet<u64> rebuilt per proof in the inner loop

The inner loop allocates a fresh HashSet<u64> for each existing proof:

for (i, p) in entry.proofs.iter().enumerate() {
    let existing_set: HashSet<u64> = p.participant_indices().collect();  // O(v) per proof

This is O(n × v) per push (n = antichain size, v = participants per proof). Acceptable given the antichain typically has n=1 in steady state, and acknowledged in the PR description. The natural follow-up — bitwise subset via AggregationBits — is already noted; no action required now.

3. Missing test: empty-participant proof

make_proof_for_validators(&[]) would push a proof with no participants. An empty new_set satisfies HashSet::is_subset against any set, so for an existing data_root the empty proof is always skipped; for a new data_root it is inserted. The behavior is probably correct (an empty proof is useless and should be subsumed by anything), but this edge case is untested. Low priority given the protocol likely never generates zero-participant proofs.
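The empty-set behavior follows directly from `HashSet::is_subset`, which is vacuously true when the receiver is empty. A small check, with a hypothetical helper name:

```rust
use std::collections::HashSet;

// A zero-participant proof yields an empty participant set; is_subset
// then returns true against any existing proof's set, so such a push
// is always skipped once the data_root has an entry.
fn empty_proof_is_always_subsumed(existing: &HashSet<u64>) -> bool {
    let empty: HashSet<u64> = HashSet::new();
    empty.is_subset(existing)
}
```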

4. Eviction of the just-inserted data_root

Not currently tested, but can happen with capacity=1: push data_root A, push data_root B → total_proofs=2 > 1 → evict A, leaving only B. If the capacity is set to a value smaller than the expected antichain size per entry, a data_root could be evicted immediately after insertion if there are many other entries. This is a pre-existing design constraint rather than a bug, but worth a note in the capacity constants.


Style / Convention

  • The new make_proof_for_validators test helper is clean and reusable. The only nit: max + 1 panics if max is usize::MAX, but test helpers with u64 validator IDs in range [0, ~128] make this moot.
  • Log/tracing fields are not involved here.
  • The doc comment on push is thorough and accurate.

Test Coverage

The 8 new tests are well-named and cover the important paths. Particular highlights:

  • payload_buffer_push_superset_absorbs_multiple_subsets verifies order.len() == 1, protecting against FIFO state corruption after multi-proof removal. Good catch.
  • payload_buffer_push_fifo_eviction_uses_total_proofs correctly uses total_proofs (not data.len()) as the eviction trigger, matching the implementation.

All 4 pre-existing PayloadBuffer tests continue to pass unchanged (they use pairwise-incomparable sets throughout, so the new logic is a no-op for them). ✓


Summary

The implementation is correct and well-tested. The antichain invariant is sound, swap_remove in reverse order is the right pattern, and the FIFO accounting remains consistent. The main suggestion is a short comment documenting why the early-return path cannot leave to_remove non-empty (a non-obvious invariant that depends on the antichain property). Everything else is minor or optional.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR replaces the equality-only duplicate check in PayloadBuffer::push with a full antichain (subset/superset) dedup, ensuring that when a proof with a larger participant set arrives for a given data_root, all proofs it subsumes are removed and only the maximal proofs are retained. Eight new unit tests verify all key cases; existing tests pass unchanged.

Confidence Score: 5/5

Safe to merge; implementation is correct and well-tested.

The antichain invariant is correctly maintained: early-return when new_set ⊆ existing_set is sound (the antichain property guarantees no strict-subset peers exist simultaneously), and swap_remove in reverse index order provably keeps all lower indices at their original positions. total_proofs accounting is accurate across all branches. The only finding is a P2 style note about computing new_set before the branch where it is sometimes unused.

No files require special attention.

Important Files Changed

Filename Overview
crates/storage/src/store.rs Replaces equality-only dedup in PayloadBuffer::push with a full antichain/subsumption dedup; adds 8 targeted unit tests covering supersets, subsets, equality, incomparable pairs, multi-absorb, mixed keep/remove, cross-data-root independence, and FIFO eviction. Logic and accounting are correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[push called] --> B{data_root already\nin buffer?}
    B -- No --> C[Insert new entry\nadd to order\ntotal_proofs += 1]
    B -- Yes --> D[Build new_set from\nproof.participant_indices]
    D --> E{For each existing proof\nbuild existing_set}
    E --> F{new_set ⊆ existing_set?}
    F -- Yes --> G[Return: incoming is\nsubsumed, skip]
    F -- No --> H{existing_set ⊆ new_set?}
    H -- Yes --> I[Mark index for removal]
    H -- No --> E
    I --> E
    E -- loop done --> J[swap_remove marked indices\nin reverse order\ntotal_proofs -= removed count]
    J --> K[Push new proof\ntotal_proofs += 1]
    C --> L{total_proofs > capacity?}
    K --> L
    L -- Yes --> M[Pop oldest data_root\nfrom order, remove entry\ntotal_proofs -= entry.len]
    M --> L
    L -- No --> N[Done]

Comment on lines 144 to +145
let (data_root, att_data) = hashed.into_parts();
let new_set: HashSet<u64> = proof.participant_indices().collect();
Contributor


P2 Unconditional new_set allocation for new data_roots

new_set is built before the if let branch, so it's allocated and populated even when data_root isn't in self.data (the else path), where it's never read. Moving the allocation inside the if let Some(entry) arm would avoid this work for fresh insertions.

Suggested change
- let (data_root, att_data) = hashed.into_parts();
- let new_set: HashSet<u64> = proof.participant_indices().collect();
+ let (data_root, att_data) = hashed.into_parts();

Then compute new_set at the start of the if let Some(entry) block only.

