api, node: add runtime aggregator role toggle #636

Merged
tcoratger merged 1 commit into leanEthereum:main from ch4r10t33r:feat/api-aggregator-toggle
Apr 17, 2026

Conversation

ch4r10t33r (Collaborator) commented Apr 17, 2026

Summary

Adds a new admin endpoint /lean/v0/admin/aggregator that allows operators to activate or deactivate a node's aggregator role at runtime. This enables rotating aggregation duties across nodes without restarting when an active aggregator becomes unhealthy / unrecoverable.

Endpoints

  • GET /lean/v0/admin/aggregator — returns {"is_aggregator": bool} reflecting the current role.
  • POST /lean/v0/admin/aggregator with body {"enabled": bool} — toggles the role and returns {"is_aggregator": <new>, "previous": <old>}.

Both endpoints return 503 Service Unavailable when no controller is wired (e.g. in tests that stand up a bare ApiServer). The POST handler returns 400 Bad Request for a missing body, a missing enabled field, or a non-boolean enabled value.
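
The 400 rules above can be sketched as a pure validation function. This is an illustration only, not the handler from this PR: `parse_enabled` and `BadRequest` are hypothetical names, and the exact error messages are made up. The one subtle point is that JSON `0`/`1` decode to Python `int`, so the check has to test for `bool` specifically.

```python
import json


class BadRequest(Exception):
    """Hypothetical stand-in for an HTTP 400 response."""


def parse_enabled(raw_body: bytes) -> bool:
    """Extract the `enabled` flag from a POST body, or raise BadRequest."""
    if not raw_body:
        raise BadRequest("missing body")
    try:
        payload = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        raise BadRequest("body is not valid JSON") from exc
    if not isinstance(payload, dict) or "enabled" not in payload:
        raise BadRequest("missing 'enabled' field")
    enabled = payload["enabled"]
    # JSON 0/1 decode to int, which is not an instance of bool, so they fail here.
    if not isinstance(enabled, bool):
        raise BadRequest("'enabled' must be a boolean")
    return enabled
```

A real handler would map `BadRequest` to whatever 400 response its framework provides; only the status code is expected to match across clients.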

Implementation

  • New AggregatorController (in src/lean_spec/subspecs/api/aggregator_controller.py) holds references to SyncService and NetworkService and serializes toggles under an asyncio.Lock. Both flags are flipped together so observers see a consistent view.
  • Route registration in routes.py gains an ADMIN_ROUTES list for non-GET verbs. server.py registers both GET and admin routes via web.route(method, path, handler).
  • ApiServer grows an optional aggregator_controller field, mirroring the existing store_getter pattern.
  • Node.from_genesis builds the controller from sync_service and network_service and passes it to the API server when the API is enabled.
  • Docstring on NodeConfig.is_aggregator updated to remove the aspirational claim that the ENR advertises aggregator capability (no code path writes that key today) and to document the new runtime-toggle seeding semantics.
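
As a rough picture of the controller described in the first bullet, the sketch below flips two flags under an `asyncio.Lock`. It is not the code in `aggregator_controller.py`: `StubService`, `AggregatorControllerSketch`, the `is_aggregator` attribute name on both services, and the method names are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubService:
    """Hypothetical stand-in for SyncService / NetworkService."""

    is_aggregator: bool = False


@dataclass
class AggregatorControllerSketch:
    sync_service: StubService
    network_service: StubService
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    def is_enabled(self) -> bool:
        return self.sync_service.is_aggregator

    async def set_enabled(self, enabled: bool) -> bool:
        """Set the role on both services and return the previous value."""
        async with self._lock:
            previous = self.sync_service.is_aggregator
            # No await between the two writes, so no other task can observe
            # one service updated and the other not.
            self.sync_service.is_aggregator = enabled
            self.network_service.is_aggregator = enabled
            return previous


async def demo() -> list[bool]:
    # Build the controller inside the running loop, then race three toggles.
    controller = AggregatorControllerSketch(StubService(), StubService())
    return await asyncio.gather(
        controller.set_enabled(True),
        controller.set_enabled(True),
        controller.set_enabled(False),
    )
```

Running `asyncio.run(demo())` returns the `previous` value seen by each toggle. Returning `previous` from inside the lock is what lets the POST handler report a consistent `{is_aggregator, previous}` pair.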

Why this works at runtime

The gossip handlers SyncService.on_gossip_attestation / on_gossip_aggregated_attestation and Store.tick_interval all read the is_aggregator flag fresh on every event or tick, so a toggle takes effect from the next tick, with no restart, no gossip re-subscribe, and no handshake churn.
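
The "read fresh on every event" property can be illustrated with a toy handler. `GossipHandlerSketch` is hypothetical, and the assumption that a non-aggregator simply skips gossip attestations is a simplification made for the example, not a statement about the real SyncService.

```python
class GossipHandlerSketch:
    """Hypothetical handler that consults the role flag on every event."""

    def __init__(self) -> None:
        self.is_aggregator = False
        self.imported: list[str] = []

    def on_gossip_attestation(self, attestation: str) -> bool:
        # The flag is read here, at event time, not cached at construction.
        if not self.is_aggregator:
            return False
        self.imported.append(attestation)
        return True


handler = GossipHandlerSketch()
before = handler.on_gossip_attestation("att-1")  # role off: skipped
handler.is_aggregator = True  # runtime toggle
after = handler.on_gossip_attestation("att-2")  # role on: imported
```

Had the handler copied the flag into a local at startup, the toggle would have required a restart to be seen.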

Scope and known limitations

This PR deliberately keeps the change small. It is explicitly not:

  • Persistent across restarts. The CLI --is-aggregator flag seeds the initial value on each start. Runtime toggles are in-process only.
  • Reflected in gossip subnet subscriptions. Subscriptions are decided once at startup in run_node() based on the CLI flag. Toggling on at runtime only imports gossip for subnets the node is already subscribed to.
    • Operational implication: standby aggregators should start with --is-aggregator (so subscriptions are in place) and be left OFF via this API until needed — hot-standby model.
  • Reflected in the local ENR. The live node does not build or sign a local ENR today; DiscoveryService exists as a library but is not wired into the runtime node path. ENR.is_aggregator is currently read-only — no writer exists. Even if re-publication were added, no peer-side logic in src/lean_spec currently reads a remote peer's is_aggregator ENR bit to make decisions (peer selection, gossip routing, req/resp policy), so advertisement is decorative until a consumer is added. Tracked as a follow-up.
  • Authenticated. Admin endpoints are currently unauthenticated, consistent with the rest of the HTTP API. If an auth layer is desired it should be applied uniformly, not just to this endpoint.
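
For the hot-standby model described above, an operator script might build the two POST requests like this. It is a sketch under assumptions: the host names and port in the usage note are placeholders, the `Content-Type` header may or may not be required by the server, and the order (activate the standby first, then demote the old node) is one reasonable choice rather than anything this PR prescribes.

```python
import json
import urllib.request

ADMIN_PATH = "/lean/v0/admin/aggregator"


def build_toggle_request(base_url: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) a POST that sets the aggregator role."""
    body = json.dumps({"enabled": enabled}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + ADMIN_PATH,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def rotate(unhealthy_url: str, standby_url: str) -> list[urllib.request.Request]:
    """Requests to send, in order: activate the standby, then demote the old node."""
    return [
        build_toggle_request(standby_url, True),
        build_toggle_request(unhealthy_url, False),
    ]
```

Each request can be sent with `urllib.request.urlopen(req, timeout=5)`, for example for `rotate("http://old-node:5052", "http://standby-node:5052")`. If the old node is truly unreachable the second call will fail, which is acceptable since the standby is already active. Because the endpoint is unauthenticated, such a script should only be run against an admin interface that is not publicly exposed.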

Test plan

  • uv run pytest --no-cov — 3240 passed
  • uvx tox -e all-checks — ruff + ty + codespell + mdformat clean
  • New unit tests in tests/lean_spec/subspecs/api/test_aggregator_controller.py cover read, activate, deactivate, idempotence, and concurrent toggles.
  • New endpoint tests in tests/lean_spec/subspecs/api/test_server.py (TestAggregatorAdminEndpoint) exercise 503, GET status, POST activate/deactivate, bad-request paths, and concurrent POSTs.
  • New conformance tests in tests/api/endpoints/test_aggregator.py verify baseline 503 behavior and 405 for unsupported verbs against the default conformance server.
  • Manual smoke test on a devnet node (to be done after PR review).

Open questions for reviewers

  • Route placement under /lean/v0/admin/ (alternatives: /lean/v0/validator/aggregator, /lean/v0/node/aggregator).
  • Whether aggregate_subnet_ids should also be mutable at runtime (requires gossip subscribe/unsubscribe plumbing — scoped out here).
  • Whether to open a follow-up issue to build a real ENR lifecycle (signed local ENR, ENR.mutate/re-sign helper, handshake refresh hook, and a peer-side consumer of the is_aggregator bit) so advertisement would actually influence peer behavior.

ch4r10t33r marked this pull request as ready for review April 17, 2026 10:44
ch4r10t33r requested review from tcoratger and unnawut April 17, 2026 10:44

tcoratger (Collaborator) left a comment

Can we have test vectors as well for the client teams to consume?

Like this https://github.com/leanEthereum/leanSpec/tree/main/tests/consensus/devnet/api

ch4r10t33r (Collaborator, Author) commented

@tcoratger good call — pushed 364c8cc which adds api_endpoint test vectors for the new admin endpoint under tests/consensus/devnet/api/test_api_endpoints.py, following the same filler pattern as the existing health / fork-choice / finalized-state vectors.

To support a stateful endpoint, I extended the api_endpoint fixture format with three new fields (all optional, camelCased in JSON):

  • method — HTTP verb, defaults to "GET" for backward compatibility.
  • requestBody — JSON body for non-GET methods.
  • initialIsAggregator — seeds the node's aggregator role before the request is replayed. Clients should configure their node to start with this value (CLI flag or controller) before issuing the request.
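
A client-side fixture runner could resolve the three optional fields roughly as follows. Only the three JSON keys and the `"GET"` default come from the description above. `request_spec` is a hypothetical helper, the rest of the fixture schema is not shown, and treating an absent `requestBody` or `initialIsAggregator` as `None` is an assumption.

```python
from typing import Any


def request_spec(fixture: dict[str, Any]) -> dict[str, Any]:
    """Resolve the three optional fields of an api_endpoint fixture entry."""
    return {
        # Older fixtures carry no method and are plain GETs.
        "method": fixture.get("method", "GET"),
        # Only meaningful for non-GET methods; None means "send no body".
        "request_body": fixture.get("requestBody"),
        # None means the fixture does not constrain the seeded role.
        "initial_is_aggregator": fixture.get("initialIsAggregator"),
    }


legacy = request_spec({})
toggle = request_spec(
    {"method": "POST", "requestBody": {"enabled": True}, "initialIsAggregator": False}
)
```

Here `legacy` stands for an existing GET-only fixture that predates the new fields, and `toggle` for an activate-style vector that seeds the role to off before replaying the POST.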

Six new fixtures land under fixtures/consensus/api_endpoint/devnet/api/test_api_endpoints/:

  • test_aggregator_status_disabled / test_aggregator_status_enabled — GET returns the current role for both seed values.
  • test_aggregator_toggle_activate / test_aggregator_toggle_deactivate — POST flips the role in each direction and returns {is_aggregator, previous}.
  • test_aggregator_toggle_idempotent_enable / test_aggregator_toggle_idempotent_disable — POST with the same value is a no-op and echoes previous unchanged.

Regenerate locally with uv run fill --fork=Devnet --clean -n auto -k test_api_endpoints.

I deliberately kept error paths (400 on missing/non-bool enabled, 503 with no controller, 405 for unsupported verbs) out of the vectors since the exact error body is framework-specific (aiohttp's HTTPException vs whatever a Zig / Rust / Go server produces) and checking only the status code there isn't worth a fixture entry. Those cases stay covered by the in-tree unit and conformance tests. Happy to add them if you'd rather have them in the vector suite.

ch4r10t33r requested a review from tcoratger April 17, 2026 12:12
tcoratger force-pushed the feat/api-aggregator-toggle branch 2 times, most recently from 7b21e2a to 97f327d April 17, 2026 22:18
Adds a new admin endpoint /lean/v0/admin/aggregator that allows
operators to activate or deactivate a node's aggregator role at
runtime, enabling rotation of aggregation duties across nodes
without restarting when an active aggregator becomes unhealthy.

- GET /lean/v0/admin/aggregator returns the current role.
- POST /lean/v0/admin/aggregator with {"enabled": bool} toggles
  the role and returns the previous value.

A new AggregatorController serializes toggles under an asyncio
lock and keeps the sync and network services in sync. The Node
wires the controller into ApiServer when API is enabled.

When no controller is wired, the endpoints return 503.

Includes consensus test vectors for the aggregator admin endpoint
so client teams can validate their implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tcoratger force-pushed the feat/api-aggregator-toggle branch from 97f327d to 48a4132 April 17, 2026 22:21
tcoratger merged commit 9b89651 into leanEthereum:main Apr 17, 2026
13 checks passed
MegaRedHand added a commit to lambdaclass/ethlambda that referenced this pull request Apr 21, 2026
## Summary

Ports [leanSpec PR
#636](leanEthereum/leanSpec#636). Adds two admin
endpoints that let operators rotate aggregation duties across nodes at
runtime without a restart.

- `GET /lean/v0/admin/aggregator` returns `{"is_aggregator": bool}`
- `POST /lean/v0/admin/aggregator` with body `{"enabled": bool}` returns
`{"is_aggregator": <new>, "previous": <old>}`

Returns `503` when the controller is not wired, `400` on malformed body
or non-boolean `enabled` (JSON ints `0`/`1` are explicitly rejected,
matching the spec).

## Design

The CLI flag `--is-aggregator` now seeds a shared `AggregatorController`
(a thin `Arc<AtomicBool>` in `ethlambda-types`) instead of a plain bool.
The blockchain actor reads the flag fresh on every tick and every gossip
attestation, so runtime toggles take effect from the next tick with no
protocol-message churn.

`axum::Extension` is used to thread the controller into admin handlers
so existing store-backed routes don't need to change their signatures.

The `lean_is_aggregator` metric is updated from the actor's tick read
(with a cached `last_seen_aggregator`) rather than from the HTTP
handler, so the gauge reflects what the actor actually acted on.

## Scope limitations (intentional, matching spec)

| Scope | Status |
|---|---|
| Persistence across restart | ❌ in-process only |
| Gossip subnet re-subscription | ❌ frozen at startup |
| ENR advertisement | ❌ no ENR lifecycle |
| Authentication | ❌ consistent with rest of HTTP API |

**Operational model**: standby aggregators must boot with
`--is-aggregator=true` so gossip subscriptions are in place, then use
the admin endpoint to rotate duties (hot-standby model). A node booted
with `--is-aggregator=false` and toggled on later will have no extra
subnets to aggregate. This is documented in the CLI flag's docstring and
the `CLAUDE.md` gotcha section.