Initial multicast support #847
base: master
Also pushes on the requisite extensions for us to fill in
This implements IPv4 and IPv6 multicast packet forwarding with three
replication modes (External, Underlay, All) for rack-wide multicast
delivery across VPCs.
Includes:
- M2P (Multicast-to-Physical) mappings with admin-scoped IPv6 underlay
- Per-port multicast group subscriptions for local delivery
- Multicast forwarding table with configurable replication strategies
- Geneve multicast option encoding for delivery mode signaling
- RX path loop prevention (packets marked Underlay skip re-relay)
- TX/RX path integration with flow table and encapsulation
- DTrace probes for multicast delivery observability
- API addition: set_mcast_fwd/clear_mcast_fwd for forwarding table management
- API addition: mcast_subscribe/mcast_unsubscribe for port group membership
- API addition: dump_mcast_fwd for observability
- Testing: XDE integration tests covering all replication modes, validation,
and edge cases
- Testing: oxide-vpc integration tests for Geneve encapsulation and parsing
- Enforce DEFAULT_MULTICAST_VNI (77) for all multicast traffic (groups
are fleet-wide/cross-VPC) and validate admin-scoped underlay
addresses (ff04::/16, ff05::/16, ff08::/16).
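The VNI and underlay-address checks in the last item can be illustrated with a minimal sketch. It assumes only the prefixes and the VNI value quoted above; the function names `is_valid_mcast_underlay` and `validate_m2p` are hypothetical and are not taken from the OPTE sources.

```rust
use std::net::Ipv6Addr;

/// VNI that the description reserves for all multicast traffic.
pub const DEFAULT_MULTICAST_VNI: u32 = 77;

/// Returns true if `addr` lies in one of the underlay prefixes named in the
/// description: ff04::/16, ff05::/16, or ff08::/16.
/// (Hypothetical helper, not the OPTE implementation.)
pub fn is_valid_mcast_underlay(addr: &Ipv6Addr) -> bool {
    // A /16 match only needs the first 16-bit segment of the address.
    matches!(addr.segments()[0], 0xff04 | 0xff05 | 0xff08)
}

/// Hypothetical combined check for an (underlay address, VNI) pair.
pub fn validate_m2p(addr: &Ipv6Addr, vni: u32) -> Result<(), String> {
    if vni != DEFAULT_MULTICAST_VNI {
        return Err(format!(
            "multicast VNI must be {}, got {}",
            DEFAULT_MULTICAST_VNI, vni
        ));
    }
    if !is_valid_mcast_underlay(addr) {
        return Err(format!("{} is not an accepted multicast underlay address", addr));
    }
    Ok(())
}
```

Under these assumptions, `ff04::1` with VNI 77 is accepted, while a link-local group such as `ff02::1` or any other VNI is rejected.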
FelixMcFelix
left a comment
Thanks for the work so far. I haven't looked at the new multicast integration tests yet, but I have a bunch of questions from what I've looked at thus far.
Updates all-around for IPv4/IPv6 multicast support with control-plane APIs,
kernel TX/RX implementation, dtrace script, and documentation semantics.
Includes:
- Delivery semantics (leaf-node):
- Remove multicast relay logic; OPTE is always a leaf node in the
replication tree
- Same-sled delivery happens unconditionally on TX for local subscribers
- RX-path only handles packets destined to this sled (no forwarding)
- Perf (avoid management_lock in datapath):
- Move mcast_fwd lookups to per-entry state instead of hitting
exclusive management lock during TX replication
- Clone Arc references from per-CPU caches instead of holding per-port
RwLock guards across packet processing
- Use state.devs.read() for concurrent dataplane access
- Hold per-CPU copies of mcast_fwd for duration of TX replication
- Arbitrary VNI handling:
- Use DEFAULT_MULTICAST_VNI (77) for fleet-wide multicast delivery
- Remove per-VPC VNI checks in xde.rs; delegate validation to overlay layer
- Packets with VNI 77 delivered to all subscribers regardless of VPC
- Replication flag clarification:
- Replication enum specifies switch behavior on marked packets:
- External: Switch replicates to front panel ports (leaving underlay)
- Underlay: Switch replicates to sleds (within underlay)
- Both: Switch does both replications
- Used only on TX-path to inform switch behavior, not for RX-path
- Routing and MACs:
- Set the correct next hop and route for TX replication (the
switch unicast address)
- Use derived IPv6 multicast MAC for outer destination
- Route lookup determines underlay port selection via next_hop
- Simplified underlay routing for admin-scoped (ff04::/16)
addresses, matching what Omicron currently does
- Test infra:
- MulticastGroup: RAII cleanup for M2P/forwarding entries
- SnoopGuard: Prevent leaked snoop processes from holding DLPI devices
- Geneve packet verification with replication flag validation
- three_node_topology for multi-subscriber scenarios
- Proactive zone cleanup
- Standardized around updated semantics
- Additional refinements:
- Updated DTrace script (opte-mcast-delivery.d)
- Improved opteadm output formatting for multicast commands
- Added anyhow dependency to opte-test-utils
- Updated documentation clarifying multicast architecture
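The replication flag semantics listed above can be sketched as a small enum. The variant names come from the list; the helper methods (`to_front_panel`, `to_sleds`, `merge`) are hypothetical and only illustrate how the three values relate to one another.

```rust
/// Which replication the switch should perform for a marked packet.
/// Variant names follow the description; everything else is illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Replication {
    /// Replicate to front panel ports (leaving the underlay).
    External,
    /// Replicate to sleds (within the underlay).
    Underlay,
    /// Perform both replications.
    Both,
}

impl Replication {
    /// True if the switch should replicate to front panel ports.
    pub fn to_front_panel(self) -> bool {
        matches!(self, Self::External | Self::Both)
    }

    /// True if the switch should replicate to sleds.
    pub fn to_sleds(self) -> bool {
        matches!(self, Self::Underlay | Self::Both)
    }

    /// Hypothetical helper: combine two requests into the smallest mode
    /// that satisfies both.
    pub fn merge(self, other: Self) -> Self {
        let external = self.to_front_panel() || other.to_front_panel();
        let underlay = self.to_sleds() || other.to_sleds();
        match (external, underlay) {
            (true, true) => Self::Both,
            (true, false) => Self::External,
            (false, true) => Self::Underlay,
            // Every variant requests at least one kind of replication.
            (false, false) => unreachable!(),
        }
    }
}
```

Because the flag is described as TX-only, a sketch like this would be consulted when building the outgoing Geneve option and not when handling received packets.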
FelixMcFelix
left a comment
I haven't taken the time to go through the test harness changes yet, so sorry on that front. I think it generally feels like we're doing the right thing in most of the packet processing work, aside from the lock doctrine. And, apologies about some of the essay-length comments...
Implements end-to-end multicast networking across Omicron's control plane and sled-agent, integrated with IP pool extensions from #9084. Closes #8242.
TL;DR:
> Implements fleet-wide multicast groups across the control plane and sled-agent, integrated with IP pool extensions (#9084). Adds a reconciliation worker (RPW), inventory-based sled→switch-port mapping, a multi-switch multicast dataplane trait, and paired external/underlay groups for NAT and Source-Specific Multicast (SSM). Introduces fleet-scoped auth and a 3-state membership lifecycle; requires schema `v209` and sled-agent API `v7`; feature is disabled by default.
*Highlights*:
- An RPW for reconciling groups and instance members (ensuring dataplane state matches DB)
- Inventory-based sled→switch-port mapping with validation tests
- A multicast-focused dataplane trait separating control plane logic from Dendrite/DPD; works across multiple switches
- Bifurcated architecture with paired external/underlay groups for NAT-based forwarding
- 3-state instance member lifecycle ("Joining" → "Joined" → "Left") with reactivation support
- Fleet-scoped authorization model allowing cross-project multicast
- New DB tables: multicast_group, underlay_multicast_group, multicast_group_member
- External groups: Customer-facing IPv4/IPv6 addresses from IP pools with SSM support
- Underlay groups: Admin-scoped IPv6 (ff04::/16); default allocation from fixed ff04::/64 for internal rack forwarding
- Feature flag and reconciler/cache settings exist and default to disabled/safe values
- Member states: "Joining"/"Joined"/"Left" with soft-delete/mark-for-removal for instance lifecycle
- Group states: "Creating"/"Active"/"Deleting"/"Deleted" for RPW processing
- sled-agent: API `v7` with multicast join/leave endpoints
- Multicast-aware instance management
- Network interface configuration for multicast traffic
- OPTE integration stubbed pending oxidecomputer/opte#847
- Inventory / Port correlation
- Validates baseboard identifiers match between sleds and SPs
- Required for multicast reconciler to map `sled_id` → rear switch-ports (backplane) for instances
- `mvlan`: External groups support an optional Multicast VLAN for (eventual) upstream egress
- Updates to instance sagas as Nexus passes memberships to sled-agent via `InstanceSledLocalConfig.multicast_groups`
*API Endpoints*:
- GET /v1/multicast-groups: List fleet multicast groups
- POST /v1/multicast-groups: Create multicast group
- GET /v1/multicast-groups/{group}: View group details
- PUT /v1/multicast-groups/{group}: Update group (name, sources)
- DELETE /v1/multicast-groups/{group}: Delete group
- GET /v1/multicast-groups/{group}/members: List group members
- POST /v1/multicast-groups/{group}/members: Add instance to group
- DELETE /v1/multicast-groups/{group}/members/{instance}: Remove instance from group
- GET /v1/instances/{instance}/multicast-groups: List groups for an instance
- PUT /v1/instances/{instance}/multicast-groups/{group}: Join instance to group
- DELETE /v1/instances/{instance}/multicast-groups/{group}: Leave group
- GET /v1/system/multicast-groups/by-ip/{address}: Lookup group by IP address
The instance-scoped endpoints provide an alternative interface for the same join/leave operations, and there's also the system-level IP lookup endpoint.
*New Sagas*:
- `multicast_group_dpd_ensure`: Ties together external/underlay creation of groups on all switches
- `multicast_group_dpd_update`: Updates group configuration across switches
*Breaking Changes*:
- sled-agent API version bump from `v6` to `v7`
- New required configuration in Nexus (multicast.enabled flag, reconciler period, and cache TTL settings)
- Schema migration required (`v208` → `v209`)
*Migration Notes*:
- Multicast as a feature is disabled by default for safe rollout
- Multicast endpoints are marked as "experimental"
*References*:
- RFD 488: https://rfd.shared.oxide.computer/rfd/488
- IP Pool extensions: #9084
- Dendrite PRs (based on recency):
- oxidecomputer/dendrite#132
- oxidecomputer/dendrite#109
- oxidecomputer/dendrite#14
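The 3-state member lifecycle ("Joining" → "Joined" → "Left", with reactivation) mentioned in the highlights can be sketched as a transition function. The state names come from the description; the event names and the exact set of permitted transitions (in particular, that reactivation moves "Left" back to "Joining") are assumptions made for illustration and are not confirmed by the Omicron sources.

```rust
/// Member states named in the description.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberState {
    Joining,
    Joined,
    Left,
}

/// Hypothetical events that drive the lifecycle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberEvent {
    /// The reconciler confirmed that dataplane state matches the DB.
    DataplaneConfirmed,
    /// The instance stopped or the membership was marked for removal.
    Removed,
    /// A previously departed member is brought back.
    Reactivated,
}

/// Returns the next state, or `None` if the event does not apply to `state`.
pub fn next_state(state: MemberState, event: MemberEvent) -> Option<MemberState> {
    use MemberEvent::*;
    use MemberState::*;
    match (state, event) {
        (Joining, DataplaneConfirmed) => Some(Joined),
        (Joining, Removed) | (Joined, Removed) => Some(Left),
        // Assumption: reactivation re-enters the lifecycle at Joining.
        (Left, Reactivated) => Some(Joining),
        _ => None,
    }
}
```

Returning `Option` keeps invalid transitions explicit, which suits a reconciler that should skip, and not panic on, a stale event.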
Updates:
- Removes unnecessary gateway outbound rules
- Moves VNI validation to `DecapAction`
- Refactors similar code around router predicates, removing unnecessary
checks, etc
- `mcast_fwd` (`KRwLock<Arc<McastForwardingTable>>`) now lives
in `XdeDev` (Tx path) only
- Reverted from Arc-cloning pattern to holding read locks:
* Tx: Acquire and hold per-port `DevMap` and `mcast_fwd` read locks for
duration of packet processing (lazy acquisition on first multicast packet)
* Rx: Hold per-CPU `DevMap` mutex for duration of packet processing
* `refresh_maps()` acquires write locks to update all per-port and per-CPU
snapshots, blocking until no Tx/Rx context holds read locks, ensuring
safe port teardown
* Removed per-CPU cache clearing in `clear_xde_underlay()`
* Removed EBUSY retry logic from test teardown (`Xde::drop`)
- Use `MulticastUnderlay` newtype across the codebase for mcast underlay
address types
- Refactored `find_mcast_option_offset()` to use ingot parsing:
* Uses `ValidGeneve::parse()`, `OxideOptions::from_raw()` instead of manual
byte parsing
* Implemented `HeaderLen` for `GeneveOptionParse` to enable
`opt.packet_length()`
- Use AF_INET and AF_INET6 constants in DTrace probes instead of hardcoded
values (2usize, 26usize)
- opteadm subscription management commands (mcast-subscribe,
mcast-unsubscribe, mcast-unsubscribe-all) with clap integration
- Added `McastUnsubscribeAll` API command and ioctl
- Documentation updates
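The two locking patterns discussed in this update (cloning an `Arc` out of the lock versus holding the read lock for the duration of packet processing) can be contrasted in a user-space sketch. It uses `std::sync::RwLock` as a stand-in for the kernel `KRwLock`, and the `Dev` type, its methods, and the table layout are hypothetical.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;
use std::sync::{Arc, RwLock};

/// Illustrative table: underlay group address -> next hops.
pub type McastFwdTable = BTreeMap<Ipv6Addr, Vec<Ipv6Addr>>;

/// Stand-in for a device holding `KRwLock<Arc<McastForwardingTable>>`.
pub struct Dev {
    mcast_fwd: RwLock<Arc<McastFwdTable>>,
}

impl Dev {
    pub fn new(table: McastFwdTable) -> Self {
        Dev { mcast_fwd: RwLock::new(Arc::new(table)) }
    }

    /// Arc-cloning pattern: take the read lock briefly, clone the Arc, and
    /// release. The caller may then observe a stale table after a refresh.
    pub fn snapshot(&self) -> Arc<McastFwdTable> {
        let guard = self.mcast_fwd.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Held-lock pattern: the read lock stays held while `f` runs, so a
    /// writer cannot swap the table until processing completes.
    pub fn with_table<R>(&self, f: impl FnOnce(&McastFwdTable) -> R) -> R {
        let guard = self.mcast_fwd.read().unwrap();
        f(&**guard)
    }

    /// Writer side: blocks until no reader holds the lock, then swaps.
    pub fn refresh(&self, table: McastFwdTable) {
        *self.mcast_fwd.write().unwrap() = Arc::new(table);
    }
}
```

In this sketch a snapshot taken before `refresh` keeps the old table alive, whereas `with_table` guarantees that `refresh` waits for in-flight readers, which is the property the update relies on for safe port teardown.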
FelixMcFelix
left a comment
Thanks, Zeeshan. I really appreciate the expanded use of the xde-tests work -- is everything running in parallel? And I guess as a side effect, is this why all of the test topologies need unique names for their zones? (If so, that's worth documenting...)
Includes:
- geneve_verify: Add `assert_geneve_packet` helper and multi-packet parsing
- geneve: Fix `packet_length()` for known options where body is consumed
- Add `HeaderLen` supertrait bound to `OptionCast`
- Implement `HeaderLen` for `Known<T>` delegating to T for known variants
- `GeneveOptionParse::packet_length()` now uses `option.packet_length()`
instead of relying solely on `body_remainder` which is empty for
known options after parsing consumes the body
- oxide-vpc geneve: Add `HeaderLen` impl for `ValidOxideOption`
- overlay: Rewrite inner dest MAC to RFC-compliant multicast MAC for Tx
- dhcpv6: Compute proper UDP checksum for IPv6
- ip: Add `multicast_mac()` methods with RFC 1112/2464 citations
- opteadm: Add `set-m2p`/`clear-m2p` commands for multicast-to-physical mappings
- xde-tests: Use typed `Ipv4Addr`/`Ipv6Addr` instead of `String` in dualstack setup
- xde-tests: Simplify topology helpers and naming conventions (tests run single-threaded)
- test.sh: Exercise driver teardown with `rem_drv` after tests
complete
- xde: Normalize inner dst MAC on Rx
- drop non‑multicast inner
- add mcast_rx_bad_inner_dst stat
- xde: Initialize `mcast_fwd` from token; add `RefreshScope` and
use scoped refresh across create/delete/subscribe/unsubscribe
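The multicast MAC derivations cited above (RFC 1112 for IPv4, RFC 2464 for IPv6) can be sketched with standard-library types. The free functions below are illustrative and are not the `multicast_mac()` methods added in this change; they return `None` for non-multicast input, which is a choice made for the sketch.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// RFC 1112 mapping: 01:00:5e followed by the low 23 bits of the IPv4
/// group address.
pub fn ipv4_multicast_mac(addr: Ipv4Addr) -> Option<[u8; 6]> {
    if !addr.is_multicast() {
        return None;
    }
    let o = addr.octets();
    // Mask the top bit of the second octet: only 23 bits are carried over.
    Some([0x01, 0x00, 0x5e, o[1] & 0x7f, o[2], o[3]])
}

/// RFC 2464 mapping: 33:33 followed by the low 32 bits of the IPv6
/// group address.
pub fn ipv6_multicast_mac(addr: Ipv6Addr) -> Option<[u8; 6]> {
    if !addr.is_multicast() {
        return None;
    }
    let o = addr.octets();
    Some([0x33, 0x33, o[12], o[13], o[14], o[15]])
}
```

For example, 224.0.0.251 maps to 01:00:5e:00:00:fb and ff04::1:2 maps to 33:33:00:01:00:02. Because IPv4 keeps only 23 bits, distinct groups such as 224.1.1.1 and 225.1.1.1 share a MAC.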
Closes: #760 (among other needs).
XXX currently sketching out the bare minimum requirements around frame delivery, insertion of Geneve options, etc.
TODO:
- Ipv6Addr->BTreeMap<(NextHopV6, Replication)>).
- Mcast2Phys and above table via ioctl.
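A forwarding table of the shape hinted at in the TODO (group address mapped to next hops, each with a replication mode) might look like the following sketch. The type and method names are hypothetical, `Ipv6Addr` stands in for `NextHopV6`, and the `Replication` variants follow the names used elsewhere in this PR; the sketch is not taken from the OPTE sources.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;

/// Replication mode attached to each next hop.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Replication {
    External,
    Underlay,
    Both,
}

/// Illustrative table: underlay group -> (next hop -> replication mode).
#[derive(Default)]
pub struct McastFwdTable {
    entries: BTreeMap<Ipv6Addr, BTreeMap<Ipv6Addr, Replication>>,
}

impl McastFwdTable {
    /// Replace the next-hop set for `group`.
    pub fn set(
        &mut self,
        group: Ipv6Addr,
        next_hops: impl IntoIterator<Item = (Ipv6Addr, Replication)>,
    ) {
        self.entries.insert(group, next_hops.into_iter().collect());
    }

    /// Remove `group`; returns true if an entry existed.
    pub fn clear(&mut self, group: &Ipv6Addr) -> bool {
        self.entries.remove(group).is_some()
    }

    /// Next hops for `group`, in address order; empty if the group is unknown.
    pub fn next_hops(&self, group: &Ipv6Addr) -> Vec<(Ipv6Addr, Replication)> {
        self.entries
            .get(group)
            .map(|hops| hops.iter().map(|(h, r)| (*h, *r)).collect())
            .unwrap_or_default()
    }
}
```

Keying the inner map by next hop means setting the same hop twice keeps a single entry with the most recent replication mode.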