We observed failures during the initial manufacturing MUPdate of sled 22 in a rack. I'm leaving the issue title open-ended since I am relatively unfamiliar with the system. We'd like (1) a mechanism to understand why a link (e.g. rear22/0) is in a Faulted state and (2) a mechanism (e.g. logs) to determine this retrospectively, for example after an ignition cycle when the links are back up.
See internal ticket for details here: https://github.com/oxidecomputer/mfg-troubleshooting/issues/926.
/staff/adam/dendrite-fault-rear22/ contains the output of swadm link history and the dendrite service logs.
adam@atrium ~ $ find /staff/adam/dendrite-fault-rear22/ -type f
/staff/adam/dendrite-fault-rear22/XXXXXXXX/swadm-link-history-rear220.txt
/staff/adam/dendrite-fault-rear22/XXXXXXXX/oxide-dendrite:default.log
/staff/adam/dendrite-fault-rear22/YYYYYYYY/oxide-dendrite:default.log
/staff/adam/dendrite-fault-rear22/YYYYYYYY/swadm-link-history-rear220.txt
MUPdate was hung downloading installinator because both links were down:
XXXXXXXX # dladm
LINK CLASS MTU STATE BRIDGE OVER
cxgbe0 phys 1500 down -- --
cxgbe1 phys 1500 down -- --
bootstrap_stub0 etherstub 9000 up -- --
bootstrap0 vnic 1500 up -- bootstrap_stub0
XXXXXXXX #
swadm link ls in the switch zones showed rear22/0 in a Faulted state.
support@oxz_switch0:~$ swadm link ls
...
rear22/0 Copper 100G RS true true Faulted xx:xx:xx:xx:xx:xx
...
We ignition cycled multiple times and XXXXXXXX / 22 rebooted with all links up:
XXXXXXXX # dladm
LINK CLASS MTU STATE BRIDGE OVER
cxgbe0 phys 9000 up -- --
cxgbe1 phys 9000 up -- --
XXXXXXXX #
Likewise, the Faulted state was no longer visible in swadm link ls:
support@oxz_switch0:~$ swadm link ls
...
rear22/0 Copper 100G RS true true Up xx:xx:xx:xx:xx:xx
...
dendrite logs show the link transitioning out of Faulted:
20:17:41.822Z DEBG dpd: Link update
link_id = 0
new = Up
old = Faulted
port_id = rear22
state = LinkUp
unit = callback_handler
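As a stopgap for finding these transitions after the fact, the collected dendrite logs could be scanned for "Link update" records that mention Faulted. The following is only a sketch: it assumes the pretty-printed layout shown in the excerpt above (a header line followed by `key = value` lines), and the helper names `parse_records` and `faulted_transitions` are hypothetical. If the on-disk logs use a different format (for example JSON), the parsing would need to change.

```python
import re

# Header line, e.g. "20:17:41.822Z DEBG dpd: Link update" (layout assumed from the excerpt).
HEADER = re.compile(r"^(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>[^:]+):\s+(?P<msg>.*)$")
# Field line, e.g. "old = Faulted".
FIELD = re.compile(r"^\s*(?P<key>\w+)\s*=\s*(?P<value>.*)$")


def parse_records(lines):
    """Group each header line with the key = value lines that follow it."""
    records = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        header = HEADER.match(line)
        if header:
            current = {
                "time": header.group("time"),
                "level": header.group("level"),
                "msg": header.group("msg").strip(),
                "fields": {},
            }
            records.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current["fields"][field.group("key")] = field.group("value").strip()
    return records


def faulted_transitions(records):
    """Return "Link update" records where the old or new state is Faulted."""
    return [
        r for r in records
        if r["msg"] == "Link update"
        and "Faulted" in (r["fields"].get("old"), r["fields"].get("new"))
    ]


sample = """\
20:17:41.822Z DEBG dpd: Link update
    link_id = 0
    new = Up
    old = Faulted
    port_id = rear22
    state = LinkUp
    unit = callback_handler
"""

hits = faulted_transitions(parse_records(sample.splitlines()))
for r in hits:
    f = r["fields"]
    print(r["time"], f.get("port_id"), f.get("link_id"), f.get("old"), "->", f.get("new"))
```

This only recovers when a link entered or left Faulted. It does not explain why the fault occurred, which is the gap this issue is asking about.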
swadm link history showed loops of:
14241983 PortFSM Idle -
14242053 PortFSM Abort -
14247600 PortFSM WaitAutoNegLinkTrainingDone -
14250248 PortFSM WaitAutoNegDone -
until ignition cycles resulted in the link going up:
Time Class Subclass Channel Details
3746629 PortFSM LinkUp -
3746639 PortFSM LinkDown -
3747009 PortFSM WaitAutoNegLinkTrainingDone -
4197483 PortFSM WaitAutoNegDone -
4198486 PortFSM Idle -
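To quantify the looping from a saved history file, the PortFSM states can be tallied. This is a minimal sketch, assuming rows are whitespace-separated in the column order shown in the header (Time, Class, Subclass, ...); `parse_history` and `count_port_fsm_states` are hypothetical helper names, not part of swadm.

```python
from collections import Counter


def parse_history(text):
    """Return (time, class, subclass) tuples from whitespace-separated history rows."""
    events = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and any line that does not start with a numeric timestamp.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        events.append((int(fields[0]), fields[1], fields[2]))
    return events


def count_port_fsm_states(events):
    """Count how many times each PortFSM subclass (state) appears."""
    return Counter(sub for _, cls, sub in events if cls == "PortFSM")


sample = """\
Time Class Subclass Channel Details
14241983 PortFSM Idle -
14242053 PortFSM Abort -
14247600 PortFSM WaitAutoNegLinkTrainingDone -
14250248 PortFSM WaitAutoNegDone -
"""

events = parse_history(sample)
counts = count_port_fsm_states(events)
for state, n in counts.most_common():
    print(state, n)
```

Run against the full swadm-link-history-rear220.txt, a high Abort count with no LinkUp entries would be consistent with the autonegotiation loop described above.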