
Faulted link observed during MUPdate with limited retrospective debugging #253

@adamlouis

We observed failures during the initial manufacturing MUPdate of sled 22 in a rack. I'm leaving the issue title open-ended since I am relatively unfamiliar with the system. We'd like (1) a mechanism to understand why a link (e.g. rear22/0) is in a Faulted state and (2) a mechanism (e.g. logs) to retrieve that information retrospectively, for example after an ignition cycle, when the links are back up.

See the internal ticket for details: https://github.com/oxidecomputer/mfg-troubleshooting/issues/926

/staff/adam/dendrite-fault-rear22/ contains the output of swadm link history and the dendrite service logs.

adam@atrium ~ $ find /staff/adam/dendrite-fault-rear22/ -type f
/staff/adam/dendrite-fault-rear22/XXXXXXXX/swadm-link-history-rear220.txt
/staff/adam/dendrite-fault-rear22/XXXXXXXX/oxide-dendrite:default.log
/staff/adam/dendrite-fault-rear22/YYYYYYYY/oxide-dendrite:default.log
/staff/adam/dendrite-fault-rear22/YYYYYYYY/swadm-link-history-rear220.txt

MUPdate hung while downloading installinator because both links were down:

XXXXXXXX # dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
cxgbe0      phys      1500   down     --         --
cxgbe1      phys      1500   down     --         --
bootstrap_stub0 etherstub 9000 up     --         --
bootstrap0  vnic      1500   up       --         bootstrap_stub0
XXXXXXXX #

swadm link ls in the switch zones showed rear22/0 in a Faulted state.

support@oxz_switch0:~$ swadm link ls
...
rear22/0   Copper  100G   RS    true     true     Faulted  xx:xx:xx:xx:xx:xx
...

We ignition-cycled multiple times, and XXXXXXXX / 22 rebooted with all links up:

XXXXXXXX # dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
cxgbe0      phys      9000   up       --         --
cxgbe1      phys      9000   up       --         --
XXXXXXXX #

Likewise, swadm link ls no longer showed the Faulted state:

support@oxz_switch0:~$ swadm link ls
...
rear22/0   Copper  100G   RS    true     true     Up       xx:xx:xx:xx:xx:xx
...

The dendrite logs show the link transitioning from Faulted to Up:

20:17:41.822Z DEBG dpd: Link update
    link_id = 0
    new = Up
    old = Faulted
    port_id = rear22
    state = LinkUp
    unit = callback_handler
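
As a stopgap for retrospective debugging, the saved dendrite logs can be scanned for every "Link update" record that enters or leaves Faulted. The following is a minimal sketch, not part of dendrite or swadm. It assumes the log text has the pretty-printed shape of the excerpt above (a header line followed by indented key = value lines); the function names are hypothetical, and a log stored as raw JSON would need a different parser.

```python
import re

# Assumed record shape: "<time> <LEVEL> dpd: Link update" followed by
# indented "key = value" lines, as in the excerpt from this issue.
HEADER = re.compile(r"^(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+dpd: Link update\s*$")
FIELD = re.compile(r"^\s+(?P<key>\w+)\s*=\s*(?P<value>.*\S)\s*$")


def link_updates(lines):
    """Yield one dict per 'Link update' record found in the log lines."""
    record = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            if record is not None:
                yield record
            record = {"time": header.group("time")}
            continue
        field = FIELD.match(line) if record is not None else None
        if field:
            record[field.group("key")] = field.group("value")
        elif record is not None:
            # Any line that is not an indented field ends the current record.
            yield record
            record = None
    if record is not None:
        yield record


def faulted_transitions(lines, port_id=None):
    """Return records where the link entered or left the Faulted state."""
    hits = []
    for rec in link_updates(lines):
        if port_id is not None and rec.get("port_id") != port_id:
            continue
        if "Faulted" in (rec.get("old"), rec.get("new")):
            hits.append(rec)
    return hits
```

Passing the lines of one of the saved oxide-dendrite:default.log files with port_id="rear22" would list the timestamps at which the link entered or left Faulted. This only recovers when the transitions happened, not why the link faulted, which is the gap this issue is about.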

swadm link history showed the PortFSM looping through:

14241983  PortFSM  Idle                         -
14242053  PortFSM  Abort                        -
14247600  PortFSM  WaitAutoNegLinkTrainingDone  -
14250248  PortFSM  WaitAutoNegDone              -

until ignition cycles resulted in the link going up:

Time      Class    Subclass                     Channel  Details
3746629   PortFSM  LinkUp                       -
3746639   PortFSM  LinkDown                     -
3747009   PortFSM  WaitAutoNegLinkTrainingDone  -
4197483   PortFSM  WaitAutoNegDone              -
4198486   PortFSM  Idle                         -
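
The saved swadm-link-history-rear220.txt files can be summarized the same way. The sketch below assumes rows shaped like the output above (a numeric Time, then Class and Subclass, separated by whitespace) and counts PortFSM states. The helper names are hypothetical, and treating "Abort entries with no LinkUp" as the signature of the looping case is an interpretation of the excerpts in this issue, not documented swadm behavior.

```python
from collections import Counter


def parse_history(lines):
    """Parse rows shaped like 'Time Class Subclass Channel Details'."""
    events = []
    for line in lines:
        parts = line.split()
        # Skip blank lines, the header row, and anything without a numeric time.
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        events.append((int(parts[0]), parts[1], parts[2]))
    return events


def summarize_port_fsm(lines):
    """Count PortFSM states and report whether LinkUp was ever recorded."""
    states = [sub for _, cls, sub in parse_history(lines) if cls == "PortFSM"]
    counts = Counter(states)
    return counts, "LinkUp" in counts
```

Run against the two captures, the looping history would report Abort entries and no LinkUp, while the post-ignition-cycle history would report LinkUp.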
