-
Notifications
You must be signed in to change notification settings - Fork 212
Description
Ry noticed odd behavior on Berlin, where BRM42220011 and BRM06240029 would sometimes not respond to state requests.
pilot -r berlin sp status BRM42220011
BRM42220011 ERROR: faux-mgs failure: run "/usr/bin/faux-mgs --interface berlin_sw0tp0 --discovery-addr [fe80::aa40:25ff:fe04:201]:11111 power-state": exit code 1: Nov 20 03:42:02.008 INFO creating SP handle on interface berlin_sw0tp0, component: faux-mgs
Nov 20 03:42:12.013 INFO initial discovery failed, error: RPC call failed (gave up after 5 attempts), addr: [fe80::aa40:25ff:fe04:201%12]:11111, interface: berlin_sw0tp0, socket: control-plane-agent, component: faux-mgs
Error: some operations failed
I saw the same behavior from the switch zone on BRM42220011, where it took three Discover messages to find the SP. This is interesting, because it eliminates the lab switches on the way: the path is just Switch Zone → Tofino → VSC7448 → [mess of PHYs] → SP
BRM42220011 # zlogin oxz_switch
[Connected to zone 'oxz_switch' pts/3]
Last login: Thu Nov 20 13:48:07 on pts/3
The illumos Project helios-2.0.23613 October 2025
root@oxz_switch0:~# pilot sp ls
CUBBY SERIAL TYPE IMAGE IP
0 - - - -
1 - - - -
2 - - - -
3 - - - -
4 - - - -
5 - - - -
6 - - - -
7 - - - -
8 - - - -
9 - - - -
10 - - - -
11 - - - -
12 - - - -
13 - - - -
14 BRM42220011 gimlet b495f9661cb8a009 fe80::aa40:25ff:fe04:201
15 BRM44220007 gimlet b495f9661cb8a009 fe80::aa40:25ff:fe04:482
16 BRM42220082 gimlet b495f9661cb8a009 fe80::aa40:25ff:fe04:180
17 BRM06240029 gimlet a92043cbfb801299 fe80::aa40:25ff:fe04:b44
18 - - - -
19 - - - -
20 - - - -
21 - - - -
22 - - - -
23 - - - -
24 - - - -
25 - - - -
26 - - - -
27 - - - -
28 - - - -
29 - - - -
30 - - - -
31 - - - -
root@oxz_switch0:~# faux-mgs --log-level=trace --interface=gimlet14 --discovery-addr='[fe80::aa40:25ff:fe04:201]:11111' state
Nov 20 13:56:56.678 INFO creating SP handle on interface gimlet14, component: faux-mgs
Nov 20 13:56:56.678 DEBG attempting initial SP discovery, discovery_addr: [fe80::aa40:25ff:fe04:201]:11111, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:56:56.679 TRCE sending request to SP, attempt: 1, request: Message { header: Header { version: 21, message_id: 1 }, kind: MgsRequest(Discover) }, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:56:56.679 DEBG refreshed scope ID for SP interface, discovery_addr: [fe80::aa40:25ff:fe04:201%50]:11111, interface: gimlet14, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:56:58.680 TRCE sending request to SP, attempt: 2, request: Message { header: Header { version: 21, message_id: 1 }, kind: MgsRequest(Discover) }, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:00.681 TRCE sending request to SP, attempt: 3, request: Message { header: Header { version: 21, message_id: 1 }, kind: MgsRequest(Discover) }, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:01.679 TRCE received response from SP, response: Discover(DiscoverResponse { sp_port: One }), header: Header { version: 21, message_id: 1 }, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:01.679 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:201%50]:11111, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:01.679 TRCE sending request to SP, attempt: 1, request: Message { header: Header { version: 21, message_id: 2 }, kind: MgsRequest(SpState) }, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:01.679 DEBG ignoring unexpected response, peer: [fe80::aa40:25ff:fe04:201%50]:11111, id: 1, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:01.680 TRCE sending request to SP, attempt: 2, request: Message { header: Header { version: 21, message_id: 2 }, kind: MgsRequest(SpState) }, interface: gimlet14, socket: control-plane-agent, component: faux-mgs
Nov 20 13:57:01.690 TRCE received response from SP, response: SpStateV2(SpStateV2 { hubris_archive_id: [180, 149, 249, 102, 28, 184, 160, 9], serial_number: [66, 82, 77, 52, 50, 50, 50, 48, 48, 49, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], model: [57, 49, 51, 45, 48, 48, 48, 48, 48, 49, 57, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], revision: 6, base_mac_address: [168, 64, 37, 4, 2, 1], power_state: A0, rot: Ok(RotStateV2 { active: A, persistent_boot_preference: A, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([201, 38, 171, 16, 31, 97, 178, 116, 67, 33, 108, 95, 6, 139, 242, 20, 52, 241, 22, 7, 167, 195, 189, 182, 90, 154, 240, 250, 149, 188, 201, 23]), slot_b_sha3_256_digest: Some([199, 73, 138, 93, 79, 203, 253, 141, 202, 61, 61, 211, 109, 72, 13, 175, 77, 83, 108, 29, 129, 66, 87, 70, 250, 45, 198, 129, 175, 244, 56, 245]) }) }), header: Header { version: 21, message_id: 2 }, interfacehubris archive: b495f9661cb8a009
:serial number: BRM42220011
gimlet14model: 913-0000019
revision: 6
, base MAC address: a8:40:25:04:02:01
socketpower state: A0
:rot: Ok(RotStateV2 {
active: A,
persistent_boot_preference: A,
pending_persistent_boot_preference: None,
transient_boot_preference: None,
slot_a_sha3_256_digest: Some("c926ab101f61b27443216c5f068bf21434f11607a7c3bdb65a9af0fa95bcc917"),
slot_b_sha3_256_digest: Some("c7498a5d4fcbfd8dca3d3dd36d480daf4d536c1d81425746fa2dc681aff438f5"),
}
)
control-plane-agent, component: faux-mgs
Nov 20 13:57:01.690 INFO V2(SpStateV2 { hubris_archive_id: [180, 149, 249, 102, 28, 184, 160, 9], serial_number: [66, 82, 77, 52, 50, 50, 50, 48, 48, 49, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], model: [57, 49, 51, 45, 48, 48, 48, 48, 48, 49, 57, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], revision: 6, base_mac_address: [168, 64, 37, 4, 2, 1], power_state: A0, rot: Ok(RotStateV2 { active: A, persistent_boot_preference: A, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([201, 38, 171, 16, 31, 97, 178, 116, 67, 33, 108, 95, 6, 139, 242, 20, 52, 241, 22, 7, 167, 195, 189, 182, 90, 154, 240, 250, 149, 188, 201, 23]), slot_b_sha3_256_digest: Some([199, 73, 138, 93, 79, 203, 253, 141, 202, 61, 61, 211, 109, 72, 13, 175, 77, 83, 108, 29, 129, 66, 87, 70, 250, 45, 198, 129, 175, 244, 56, 245]) }) }), component: faux-mgs
Note that it discovers the SP eventually, so it's not a permanent issue.
Getting the ringbufs, I see the following:
matt@castle ~ $ humility -i fe80::aa40:25ff:fe04:201%12 -a /staff/mfg/prod/media/images/gimlet/c/sp/build-gimlet-c-image-default-v1.0.49.zip ringbuf control_plane_agent::__RINGBUF
humility: connecting to fe80::aa40:25ff:fe04:201%12
humility: ring buffer task_control_plane_agent::__RINGBUF in control_plane_agent:
TOTAL VARIANT
84425658 Rx
75374571 MgsMessage(ComponentDetails)
2745191 MgsMessage(SpState)
709523 MgsMessage(Inventory)
312984 MgsMessage(HostPhase2Data)
308790 MgsMessage(UpdateChunk)
138538 MgsMessage(VersionedRotBootInfo)
53506 MgsMessage(ReadRotPage)
26905 MgsMessage(GetHostFlashHash)
13394 MgsMessage(ComponentGetActiveSlot)
9937 MgsMessage(Discovery)
1859 MgsMessage(UpdateStatus)
147 MgsMessage(GetPowerState)
100 MgsMessage(SerialConsoleWrite)
98 MgsMessage(SerialConsoleKeepAlive)
24 MgsMessage(StartHostFlashHash)
16 MgsMessage(SetPowerState)
9 MgsMessage(UpdatePrepare)
7 MgsMessage(SetStartupOptions)
6 MgsMessage(SerialConsoleAttach)
6 MgsMessage(SerialConsoleDetach)
4 MgsMessage(ComponentSetActiveSlot)
4 MgsMessage(SetIpccKeyValue)
1406689 SendError
312984 IpcRequest(FetchHostPhase2Data)
312984 IpcRequest(GetHostPhase2Data)
2 IpcRequest(GetInstallinatorImageId)
70 SerialConsoleSend
NDX LINE GEN COUNT PAYLOAD
10 906 29206 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x55, 0x35, 0x32, 0x34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
11 572 29206 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x1f, 0x2 ])), port: 0x56ce, size: 0x1e, vid: Sidecar2 })
12 906 29206 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x4a, 0x31, 0x39, 0x34, 0x2f, 0x55, 0x31, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
13 572 29206 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x1f, 0x2 ])), port: 0x56ce, size: 0x1e, vid: Sidecar2 })
14 906 29206 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x4d, 0x31, 0x35, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
15 572 29206 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x1f, 0x2 ])), port: 0x56ce, size: 0x1e, vid: Sidecar2 })
0 906 29207 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x55, 0x35, 0x36, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
1 572 29207 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x1f, 0x2 ])), port: 0x56ce, size: 0x1e, vid: Sidecar2 })
2 906 29207 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x4d, 0x33, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
3 572 29207 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x1f, 0x2 ])), port: 0x56ce, size: 0x1e, vid: Sidecar2 })
4 906 29207 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x4d, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
5 572 29207 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x1f, 0x2 ])), port: 0x56ce, size: 0xa, vid: Sidecar2 })
6 125 29207 1 MgsMessage(SpState)
7 572 29207 1 Rx(UdpMetadata { addr: Ipv6(Ipv6Address([ 0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xaa, 0x40, 0x25, 0xff, 0xfe, 0x5, 0x35, 0x2 ])), port: 0x56ce, size: 0x1e, vid: Sidecar1 })
8 906 29207 1 MgsMessage(ComponentDetails { component: SpComponent { id: [ 0x4a, 0x32, 0x31, 0x35, 0x2f, 0x55, 0x38, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ] } })
9 530 29207 1 SendError(QueueFull)
Taking multiple samples, someone is requesting inventory data about 100-200× per second. At first, I thought this was DOS'ing the SP, but I'm no longer as confident in that analysis.
More suspiciously, the ringbuf shows 1406689 SendError events! We don't break out specific SendError counters, but it turns out the only variants are QueueFull and ServerDied, and task generations are all at zero, so this must all be QueueFull messages.
At this point, I hopped over into the other switch zone (cubby 16, BRM42220082). From this switch zone, I could talk to the BRM42220011 reliably.
Unfortunately, the problem is sporadic and seems to have resolved itself: I retested on both castle and the BRM42220011 switch zone, and both were working fine. This means I can't say whether the BRM42220082 switch zone was always working, or whether the issue had fixed itself by the time I tested from there.
Seeing QueueFull errors reminds me of the watchdog we added in #1293 (to avoid smoltcp-rs/smoltcp#594). I'm going to add ringbuf traces for when this watchdog fires, so that we get a little more information, but don't have any other concrete ideas.