Skip to content

Add state duration to DDM FSM#689

Merged
taspelund merged 10 commits into
mainfrom
trey/ddm-track-session-duration
May 21, 2026
Merged

Add state duration to DDM FSM#689
taspelund merged 10 commits into
mainfrom
trey/ddm-track-session-duration

Conversation

@taspelund
Copy link
Copy Markdown
Contributor

@taspelund taspelund commented Mar 30, 2026

At its heart, this is a re-implementation of #180 which has been updated for the modern codebase and offers some cleanup in addition to timestamps.

First and foremost, this PR adds duration tracking for DDM FSM states. It does so by updating the FSM state enum variants to hold a std::time::Instant representing the timestamp when the FSM entered that state, and calculating a Duration using .elapsed() when the FSM state is queried (i.e. when ddmadm get-peers is called).

Part of this work was to cleanup the implementation of peer tracking: namely getting rid of the Db.peers table and instead tracking a peer in each FSM. Since the FSMs are instantiated per interface, this naturally gives us the interface-to-peer mapping structure so we don't actually need to maintain a separate table for the peers. Instead, we update/query the peers directly from the FSMs.

There are also a couple quality of life improvements: putting the state duration in a separate column and sorting the output of ddmadm get-peers.

Fixes: #54

@taspelund taspelund requested a review from rcgoodfellow March 30, 2026 15:58
@taspelund taspelund self-assigned this Mar 30, 2026
@taspelund taspelund added ddm Delay Driven Multipath rust Pull requests that update rust code labels Mar 30, 2026
@taspelund
Copy link
Copy Markdown
Contributor Author

I've manually tested this in a4x2 and proven that everything still works as expected. I've also manually verified that ddmadm expire-peer still works as expected as well.

The open question that I still have here: is PeerStatus::Expired worth keeping or should I remove it?
We do not maintain state for Expired peers (setting the Option to None when peer expires) and right now it's only really referenced in a From impl. I think we can safely remove it, but I didn't want to yank it out without discussion in case there were plans for it.

@taspelund
Copy link
Copy Markdown
Contributor Author

One other thing: I am also happy to add an FSM history buffer for DDM that mirrors what we have in BGP, especially if we don't believe that this PR fully addresses the visibility issue described in #54

taspelund added 4 commits May 18, 2026 13:56
Db.peers was only used as a cache for reporting peers to ddm-api clients
calling GET /peers, and the contents of the cache were a bit misleading.
For example, there is no such thing as a peer state in DDM. DDM FSMs
have states, but they are instantiated per interface not per peer. Not
only does the state reported by GET /peers not actually map to a peer,
but it doesn't even directly map to the FSM of the peer's interface.
Instead, that state was populated by manual db updates scattered
throughout the FSM implementation. This removes Db.peers and introduces
a new shared InterfaceState struct that consolidates peer/FSM info for a
given interface. GET /peers now reads from each InterfaceState and
passes along any peers it finds along with interface state.

Also fixes a pre-existing bug in oxstats where interface names
were always empty due to reading from a stale SmContext clone.
After adding PeerIdentity, the peer_address field of SessionStats is now
redundant (and also isn't a stat) so this removes it. The peer address
is tracked in PeerIdentity, so we can just rely on that field behind a
single mutex instead of locking/juggling both stats and identity.
Also adds a couple helper methods for transitioning FSM state and
updating the peer state.
@taspelund taspelund force-pushed the trey/ddm-track-session-duration branch from 67c8745 to e5795f3 Compare May 18, 2026 19:57
@taspelund
Copy link
Copy Markdown
Contributor Author

rebased atop main after #724

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Comment thread ddm/src/admin.rs Outdated
Copy link
Copy Markdown
Collaborator

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @taspelund comments follow.

Something I'd like to make sure we test with this is ddmd restart through svcadm restart mg-ddm to make sure that all the peers come back up how we expect. That was the original use case for the peer database. When we started putting interfaces in SMF that may completely obviate the peers part of the database, but I'd like to be sure with ddm-restart testing. The trio/quartet tests may actually cover this already, but I'd like to be sure.

Comment thread ddm/src/oxstats.rs Outdated
Comment thread ddm/src/sm.rs Outdated
Comment thread ddmadm/src/main.rs Outdated
taspelund added 4 commits May 19, 2026 19:54
Drop the local seconds-precision format_duration helper in favor of the
shared helper already used by mgadm for BGP FSM durations, so ddmadm and
mgadm display durations consistently.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Encapsulate the paired writes to if_index and if_name behind a single
method on InterfaceState so the two fields stay in sync at the one site
that updates them.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
The produce() loop locked peer.iface.if_name twelve times per peer per
scrape. Take the lock once at the top of the loop body and reuse the
cloned String for each sample.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
peer_status() internally locks fsm_state and last_fsm_state_change. Call
it before acquiring if_index and peer_identity in get_peers so it remains
deadlock-safe if it ever takes additional InterfaceState mutexes.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
@taspelund
Copy link
Copy Markdown
Contributor Author

@rcgoodfellow your feedback has been integrated. I will run this through a4x2 to explicitly test out the smf restart

Resolves conflicts from main's split of ddm/src/sm.rs into sm/mod.rs and
sm/state.rs, and ddm/src/discovery.rs into discovery/mod.rs and
discovery/runtime.rs. The branch's InterfaceState/FsmState/PeerIdentity
types and the iface field on SmContext now live in sm/mod.rs alongside
the other shared definitions; the state-machine transition calls and the
discovery::handler signature change move into the new submodules.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
@taspelund
Copy link
Copy Markdown
Contributor Author

Manual testing of svcadm restart in a4x2 was successful:

root@oxz_switch:~# ddmadm get-prefixes
Destination            Next Hop                Path
fdb0:a840:2500:1::/64  fe80::aa40:25ff:fe00:1  g0
fd00:17:1:1::/64       fe80::aa40:25ff:fe00:1  g0
fd00:17:1:d01::/64     fe80::aa40:25ff:fe00:1  g0
fdb0:a840:2500:3::/64  fe80::aa40:25ff:fe00:3  g1
fd00:17:1:2::/64       fe80::aa40:25ff:fe00:3  g1
fd00:17:1:d02::/64     fe80::aa40:25ff:fe00:3  g1
fdb0:a840:2500:7::/64  fe80::aa40:25ff:fe00:7  g3
fd00:17:1:d04::/64     fe80::aa40:25ff:fe00:7  g3
fd00:17:1:3::/64       fe80::aa40:25ff:fe00:5  g2
fd00:17:1:d03::/64     fe80::aa40:25ff:fe00:5  g2
fdb0:a840:2500:5::/64  fe80::aa40:25ff:fe00:5  g2
root@oxz_switch:~# svcadm restart mg-ddm
root@oxz_switch:~# ddmadm get-prefixes
Destination            Next Hop                Path
fdb0:a840:2500:5::/64  fe80::aa40:25ff:fe00:5  g2
fd00:17:1:d03::/64     fe80::aa40:25ff:fe00:5  g2
fd00:17:1:3::/64       fe80::aa40:25ff:fe00:5  g2
fd00:17:1:d04::/64     fe80::aa40:25ff:fe00:7  g3
fdb0:a840:2500:7::/64  fe80::aa40:25ff:fe00:7  g3
fd00:17:1:2::/64       fe80::aa40:25ff:fe00:3  g1
fdb0:a840:2500:3::/64  fe80::aa40:25ff:fe00:3  g1
fd00:17:1:d02::/64     fe80::aa40:25ff:fe00:3  g1
fdb0:a840:2500:1::/64  fe80::aa40:25ff:fe00:1  g0
fd00:17:1:1::/64       fe80::aa40:25ff:fe00:1  g0
fd00:17:1:d01::/64     fe80::aa40:25ff:fe00:1  g0
root@oxz_switch:~# ddmadm get-prefixes
Destination            Next Hop                Path
fdb0:a840:2500:7::/64  fe80::aa40:25ff:fe00:7  g3
fd00:17:1:d04::/64     fe80::aa40:25ff:fe00:7  g3
fd00:17:1:d03::/64     fe80::aa40:25ff:fe00:5  g2
fd00:17:1:3::/64       fe80::aa40:25ff:fe00:5  g2
fdb0:a840:2500:5::/64  fe80::aa40:25ff:fe00:5  g2
fdb0:a840:2500:3::/64  fe80::aa40:25ff:fe00:3  g1
fd00:17:1:2::/64       fe80::aa40:25ff:fe00:3  g1
fd00:17:1:d02::/64     fe80::aa40:25ff:fe00:3  g1
fdb0:a840:2500:1::/64  fe80::aa40:25ff:fe00:1  g0
fd00:17:1:d01::/64     fe80::aa40:25ff:fe00:1  g0
fd00:17:1:1::/64       fe80::aa40:25ff:fe00:1  g0
root@oxz_switch:~# ddmadm get-peers
Interface  Host  Address                 Kind    Status    Duration
5          g0    fe80::aa40:25ff:fe00:1  Server  Exchange  4s 590ms
6          g1    fe80::aa40:25ff:fe00:3  Server  Exchange  4s 589ms
7          g2    fe80::aa40:25ff:fe00:5  Server  Exchange  4s 589ms
8          g3    fe80::aa40:25ff:fe00:7  Server  Exchange  4s 588ms

@taspelund taspelund merged commit 561931c into main May 21, 2026
16 checks passed
@taspelund taspelund deleted the trey/ddm-track-session-duration branch May 21, 2026 17:23
zeeshanlakhani added a commit that referenced this pull request May 22, 2026
Brings in main, which carries #689 (PEER_DURATIONS at DDM API v2) from zl/mrib.

MULTICAST_SUPPORT shifts from v2 to v3. PEER_DURATIONS landed first and
owns v2 so the PeerInfo conversion chain is v3→v2→v1.

Peer state now comes from #689's per-interface InterfaceState (FSM-
driven) rather than the manually-populated `Db.peers` cache. GET /peers
reads identity and `if_name` directly off the per-interface state.

We drop the PUT /peer endpoint as omicron #10346 drives its DDM tests through
an in-process dropshot sim (DdmInstance with a set_peers back-door), not a
real ddmd, so the endpoint had no consumer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ddm Delay Driven Multipath rust Pull requests that update rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ddm: track peer establishment time

2 participants