Skip to content

fix: subscribe non-aggregator validators to their attestation subnets#293

Merged
MegaRedHand merged 5 commits intomainfrom
fix-nonaggregator-attestation-subscription
Apr 16, 2026
Merged

fix: subscribe non-aggregator validators to their attestation subnets#293
MegaRedHand merged 5 commits intomainfrom
fix-nonaggregator-attestation-subscription

Conversation

@MegaRedHand
Copy link
Copy Markdown
Collaborator

@MegaRedHand MegaRedHand commented Apr 16, 2026

Summary

  • Restore spec-compliant attestation-subnet subscription: every validator subscribes to its own subnet regardless of the --is-aggregator flag, so non-aggregator validator nodes join the gossipsub mesh and forward attestations instead of relying on fanout alone.
  • Process but don't store gossip signatures in non-aggregator nodes.
  • Restore the lean_attestation_committee_subnet metric to derive from validator subnets only, matching its pre-fix: do not subscribe to subnets when not aggregating #265 semantics (the metric reports this node's committee membership, not aggregator-only subscriptions).

Motivation

On a multi-client devnet (ethlambda_0 aggregator + ethlambda_1 + grandine_0 + gean_0, single attestation subnet), finalization stalled at slot ~50 with validators voting three different attestation targets. gean's attestations never reached ethlambda_0 once peer churn thinned the attestation subnet mesh: gean was publishing via fanout, but with only one subscriber (ethlambda_0) and transient peer drops, the mesh collapsed and could not recover.

libp2p-gossipsub (including the v1.2/v1.3 features in our fork) couples subscription with mesh participation. A node that does not subscribe() to a topic cannot be GRAFTed by peers and therefore cannot relay messages for it. flood_publish is publisher-side only and does not change this.

leanSpec (src/lean_spec/__main__.py) prescribes: any node with registered validators subscribes to its validator-derived subnets, so all validator peers contribute to the mesh. This PR aligns ethlambda with that rule.

Non-aggregator validator nodes were not joining the attestation mesh,
so their attestations could only propagate via gossipsub fanout. In
small devnets this collapses the mesh whenever peer churn thins out
the single subscriber — observed as a permanent finalization stall
with three validators voting three different targets.

Align with leanSpec (src/lean_spec/__main__.py): every validator
subscribes to its own subnet; aggregators additionally subscribe to
explicit aggregate-subnet-ids with a fallback to subnet 0 when they
have no validators of their own.

Also restores the pre-regression semantics of the
lean_attestation_committee_subnet metric, which should reflect this
node's committee membership and therefore derive from validator
subnets only, not from aggregator-only subscriptions.

Regression introduced by #265; prior correct behavior was in #249.
@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

Overall this is a clean refactor that aligns the implementation with the lean specification, but there are two issues to address:

1. Logic order bug in aggregator fallback (Line ~218-226)

The fallback to subnet 0 now occurs before adding explicit aggregate subnet IDs, which changes behavior from the original code:

// Current (new) logic:
if config.is_aggregator {
    if subscription_subnets.is_empty() {  // Check happens first
        subscription_subnets.insert(0);
    }
    if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
        subscription_subnets.extend(explicit_ids);  // Added after
    }
}

Previous behavior: An aggregator with no validators but with explicit aggregate_subnet_ids would not subscribe to subnet 0 (since the set wasn't empty after extending).

New behavior: The same aggregator now subscribes to subnet 0 plus the explicit subnets.

If the intent is to only fallback when there are neither validators nor explicit subnets, move the emptiness check after the extend:

if config.is_aggregator {
    if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
        subscription_subnets.extend(explicit_ids);
    }
    if subscription_subnets.is_empty() {
        subscription_subnets.insert(0);
    }
}

2. Division by zero risk (Line ~210)

vid % config.attestation_committee_count will panic if attestation_committee_count is 0. While likely validated elsewhere, defensive coding suggests either:

  • Using checked_rem with explicit error handling
  • Adding a debug assertion or comment noting that attestation_committee_count is guaranteed non-zero by config validation

Minor notes:

  • Behavioral change acknowledged: Non-aggregators now subscribe to their attestation subnets (previously they relied on gossipsub fanout). The comment correctly notes this is per leanSpec.
  • Metrics isolation: The metric calculation now correctly isolates validator membership from aggregator duties, preventing metric pollution from fallback/explicit aggregator subnets.
  • Resource safety: Using HashSet for subscription_subnets correctly prevents duplicate subscriptions when a validator's subnet overlaps with an explicit aggregator subnet.

The refactoring improves maintainability by consolidating the subscription logic into a single iteration loop.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have enough context for a thorough review.


PR #293 Review — fix: subscribe non-aggregator validators to their attestation subnets

Overview

This PR restores spec-compliant gossipsub mesh participation by unconditionally subscribing all nodes with registered validators to their attestation subnets, rather than limiting subscriptions to aggregators only. It also fixes the lean_attestation_committee_subnet metric to reflect only validator-derived subnets. The PR description is excellent — the regression history and leanSpec rationale are clearly documented.


Code Quality

Correctness: Good

The refactored logic is sound. The old code had a subtle inconsistency: attestation_topics was built from validator_ids (and aggregate_subnet_ids) for all nodes, but subscribe() was only called for aggregators. This meant non-aggregator nodes held topic handles for subnets they hadn't joined — which was harmless for publishing (fanout handles it), but broke inbound mesh participation. The new code makes subscriptions and topic cache consistent: they're built together from a single subscription_subnets set.

Aggregator fallback (subnet 0) is preserved correctly

Old: fallback if aggregate_subnets.is_empty() — which is true iff validator_ids is empty and no explicit IDs are given.
New: fallback if subscription_subnets.is_empty() — identical condition since subscription_subnets starts as validator_subnets. The semantics are preserved.

Metric fix is correct

Computing metric_subnet from validator_subnets (before the aggregator-specific additions) ensures aggregator-only subnets don't skew the reported committee. Moving the metric assignment before subscriptions is fine — no correctness dependency on ordering here.


Issues

Pre-existing: division by zero if attestation_committee_count == 0

Line 216:

.map(|vid| vid % config.attestation_committee_count)

If attestation_committee_count is 0 (misconfigured), this panics. This existed before and is not introduced by this PR, but since the PR touches this exact line, it's a good opportunity to add a guard or assert upstream in SwarmConfig construction:

assert!(config.attestation_committee_count > 0, "attestation_committee_count must be > 0");

Potential: explicit_ids may contain subnets out of committee range

aggregate_subnet_ids are taken from CLI config and extended directly without a modulo check. An out-of-range ID would cause subscription to a non-existent topic. This is pre-existing and low risk (operator-controlled config), but worth noting.

Minor: aggregate_subnet_ids no longer added to attestation_topics for non-aggregators

In the old code, aggregate_subnet_ids always populated attestation_topics regardless of is_aggregator. In the new code they only do so for aggregators. This change is intentional and correct — but it's a behavior change that isn't called out explicitly in the PR. If a future code path consults attestation_topics for non-aggregators expecting these entries, it would silently miss them.


Style / Idioms

Everything aligns with CLAUDE.md conventions:

  • inspect_err not needed here (no error side-effects to log)
  • No unnecessary comments; the block comment correctly explains the why
  • HashSet iteration order is non-deterministic, which affects only log ordering — not correctness

Test Coverage

All unit, fork-choice, and STF spec tests pass. The signature spec test failure is confirmed pre-existing and unrelated. The critical validation — multi-client devnet finalization past slot 50 — is marked pending, which is appropriate given the networking nature of this fix. That test should be completed before merge.


Summary

The logic is correct and the refactor improves consistency. The only actionable items are:

  • The attestation_committee_count == 0 panic guard (hardening, not a regression).
  • Pending multi-client devnet validation.

The PR is otherwise clean and ready to merge once the devnet test is confirmed.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 16, 2026

Greptile Summary

This PR restores spec-compliant attestation-subnet subscription: every node with registered validators subscribes to its derived subnets unconditionally, enabling gossipsub mesh participation for non-aggregator peers. It also fixes the lean_attestation_committee_subnet metric to read from validator_subnets only, not the full subscription set.

Confidence Score: 4/5

Safe to merge; the core fix is correct and the one remaining finding is a P2 edge-case in the fallback ordering.

The primary change (subscribing non-aggregators to their validator subnets) is correct and well-motivated. The only finding is a minor ordering inconsistency in the aggregator fallback: the subnet-0 safety net fires before explicit aggregate_subnet_ids are merged, so an aggregator with no validators but with explicit subnet IDs picks up an extra spurious subscription to subnet 0. Harmless in practice but a minor semantic regression.

crates/net/p2p/src/lib.rs lines 225–231 (fallback ordering)

Important Files Changed

Filename Overview
crates/net/p2p/src/lib.rs Unconditionally subscribes validators to their derived attestation subnets; aggregators additionally extend with explicit IDs. Minor ordering issue: subnet-0 fallback fires before explicit aggregate_subnet_ids are merged, causing an extra subscription in the edge case of an aggregator with no validators but explicit subnet IDs configured.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build_swarm called] --> B[Compute validator_subnets\nfrom validator_ids % committee_count]
    B --> C[Set metric_subnet = min validator_subnets\nunwrap_or 0]
    C --> D{is_aggregator?}
    D -- No --> G[subscription_subnets = validator_subnets]
    D -- Yes --> E{subscription_subnets empty?}
    E -- Yes --> F[Insert subnet 0]
    E -- No --> H[Extend with aggregate_subnet_ids]
    F --> H
    G --> I[For each subnet_id in subscription_subnets]
    H --> I
    I --> J[subscribe to gossipsub topic\nbuild attestation_topics cache\nlog Subscribed to attestation subnet]
    J --> K[Return BuiltSwarm]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/net/p2p/src/lib.rs
Line: 225-231

Comment:
**Fallback to subnet 0 fires before explicit IDs are merged**

The empty-check on `subscription_subnets` runs before `aggregate_subnet_ids` are folded in, so an aggregator configured with no `validator_ids` but with explicit `aggregate_subnet_ids` (e.g. `[5]`) will subscribe to **both** subnet 0 and subnet 5 instead of just subnet 5. The old code deferred the fallback until after extending with explicit IDs, so it only triggered when the set was still empty after both sources were considered. Moving the explicit-ID extension before the fallback check restores that behaviour:

```suggestion
    if config.is_aggregator {
        if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
            subscription_subnets.extend(explicit_ids);
        }
        // Aggregator with no validators and no explicit subnets: fallback to subnet 0
        if subscription_subnets.is_empty() {
            subscription_subnets.insert(0);
        }
    }
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: subscribe non-aggregator validators..." | Re-trigger Greptile

Comment thread crates/net/p2p/src/lib.rs
Comment on lines 225 to 231
if config.is_aggregator {
let mut aggregate_subnets: HashSet<u64> = config
.validator_ids
.iter()
.map(|vid| vid % config.attestation_committee_count)
.collect();
if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
aggregate_subnets.extend(explicit_ids);
}
// Aggregator with no validators and no explicit subnets: fallback to subnet 0
if aggregate_subnets.is_empty() {
aggregate_subnets.insert(0);
if subscription_subnets.is_empty() {
subscription_subnets.insert(0);
}
for &subnet_id in &aggregate_subnets {
let topic = attestation_subnet_topic(subnet_id);
swarm.behaviour_mut().gossipsub.subscribe(&topic)?;
info!(subnet_id, "Subscribed to attestation subnet");
if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
subscription_subnets.extend(explicit_ids);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fallback to subnet 0 fires before explicit IDs are merged

The empty-check on subscription_subnets runs before aggregate_subnet_ids are folded in, so an aggregator configured with no validator_ids but with explicit aggregate_subnet_ids (e.g. [5]) will subscribe to both subnet 0 and subnet 5 instead of just subnet 5. The old code deferred the fallback until after extending with explicit IDs, so it only triggered when the set was still empty after both sources were considered. Moving the explicit-ID extension before the fallback check restores that behaviour:

Suggested change
if config.is_aggregator {
let mut aggregate_subnets: HashSet<u64> = config
.validator_ids
.iter()
.map(|vid| vid % config.attestation_committee_count)
.collect();
if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
aggregate_subnets.extend(explicit_ids);
}
// Aggregator with no validators and no explicit subnets: fallback to subnet 0
if aggregate_subnets.is_empty() {
aggregate_subnets.insert(0);
if subscription_subnets.is_empty() {
subscription_subnets.insert(0);
}
for &subnet_id in &aggregate_subnets {
let topic = attestation_subnet_topic(subnet_id);
swarm.behaviour_mut().gossipsub.subscribe(&topic)?;
info!(subnet_id, "Subscribed to attestation subnet");
if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
subscription_subnets.extend(explicit_ids);
}
if config.is_aggregator {
if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
subscription_subnets.extend(explicit_ids);
}
// Aggregator with no validators and no explicit subnets: fallback to subnet 0
if subscription_subnets.is_empty() {
subscription_subnets.insert(0);
}
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/net/p2p/src/lib.rs
Line: 225-231

Comment:
**Fallback to subnet 0 fires before explicit IDs are merged**

The empty-check on `subscription_subnets` runs before `aggregate_subnet_ids` are folded in, so an aggregator configured with no `validator_ids` but with explicit `aggregate_subnet_ids` (e.g. `[5]`) will subscribe to **both** subnet 0 and subnet 5 instead of just subnet 5. The old code deferred the fallback until after extending with explicit IDs, so it only triggered when the set was still empty after both sources were considered. Moving the explicit-ID extension before the fallback check restores that behaviour:

```suggestion
    if config.is_aggregator {
        if let Some(ref explicit_ids) = config.aggregate_subnet_ids {
            subscription_subnets.extend(explicit_ids);
        }
        // Aggregator with no validators and no explicit subnets: fallback to subnet 0
        if subscription_subnets.is_empty() {
            subscription_subnets.insert(0);
        }
    }
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good bot

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

  1. Medium: this change makes non-aggregator validators subscribe to attestation subnets, but the receive path still fully decompresses/decodes those messages and forwards them into blockchain code that immediately warns and drops them. See crates/net/p2p/src/lib.rs:224, crates/net/p2p/src/gossipsub/handler.rs:93, and crates/blockchain/src/lib.rs:473. On any validator-only node this becomes per-attestation log spam plus avoidable SSZ/snappy work, and it widens the DoS surface now that all validators join these meshes. If the intent is “subscribe for mesh health only,” the handler should short-circuit before decode/forward for !is_aggregator, or the blockchain side should silently ignore instead of warning.

  2. Low: publish_ignore_no_peers is still gated on server.is_aggregator, but after this PR non-aggregator validators are also subscribed to their own subnet. See crates/net/p2p/src/lib.rs:209 and crates/net/p2p/src/gossipsub/handler.rs:141. That means a validator-only node can now hit the same expected NoPeersSubscribedToTopic case that aggregators already suppress, but it will still be logged as a warning via crates/net/p2p/src/swarm_adapter.rs:152. The suppression condition should follow “publishing on a subscribed subnet” rather than “is aggregator”.

No consensus-core logic changed here, so I did not find fork-choice / finalization / state-transition issues in this PR. Main residual risk is missing coverage for the new non-aggregator subscription mode.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

…gregators

Non-aggregator nodes subscribe to their attestation subnet so they
participate in the gossipsub mesh. They will therefore receive
unaggregated attestations from peers, which is expected — only
aggregators need to store and aggregate them. The warning was
high-volume noise; the early return stays, but the log is removed.
leanSpec's on_gossip_attestation validates the attestation data and
verifies the XMSS signature on every node that receives the message.
Only the signature storage step (used later for aggregation at
interval 2) is gated by is_aggregator. Non-aggregators validate, verify,
and drop.

Previous commit short-circuited before validation to save XMSS cost;
aligning with the spec so bad signatures are still caught at the edge.
The is_aggregator flag now flows into store::on_gossip_attestation.
An aggregator with no validators but with --aggregate-subnet-ids was
picking up subnet 0 in addition to its explicit subnets, because the
emptiness check fired before aggregate_subnet_ids were folded in. Move
the fallback to run after the extend so it only applies when the set
is truly empty, matching the pre-#265 semantics.
@MegaRedHand MegaRedHand merged commit c9025b7 into main Apr 16, 2026
1 of 2 checks passed
@MegaRedHand MegaRedHand deleted the fix-nonaggregator-attestation-subscription branch April 16, 2026 21:27
MegaRedHand added a commit that referenced this pull request Apr 16, 2026
Since every validator now subscribes to its own attestation subnet
(#293), the transient "no remote peer knows of this subnet yet"
condition is no longer part of the expected publish path. If it
surfaces, it's a real signal (partition, misconfigured peer) and
deserves the warning.

Removes:
- SwarmHandle::publish_ignore_no_peers and the ignore_no_peers flag
  on SwarmCommand::Publish
- The is_aggregator field on BuiltSwarm / P2PServer (only used to
  pick between the two publish paths)
- The aggregator-only branch in publish_attestation

Publish errors are now logged uniformly via inspect_err on the
single publish path in the swarm adapter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants