sled-agent doesn't always restart the switch zone after the Tofino restarts #10187

@jgallagher

Today on madrid, @Nieuwejaar ran into an issue where, upon restarting the Tofino, sled-agent tore down the switch zone (as expected) but never restarted it (definitely not expected). Looking at the logs, we see that on startup, the HardwareMonitor was notified that a Tofino was available, so it started the switch zone:

00:01:28.281Z INFO SledAgent (HardwareMonitor): tofino present and policy allows switch zone; will activate it
    file = sled-agent/src/hardware_monitor.rs:265
00:01:28.281Z INFO SledAgent (ServiceManager): Ensuring scrimlet services (enabling services)
    file = sled-agent/src/services.rs:3279
00:01:28.281Z INFO SledAgent (ServiceManager): Re-enabling running switch zone (new address)

And when the Tofino went away, the tofino-monitor process (from #9181) noticed and shut down the switch zone, and HardwareMonitor received a TofinoUnavailable message:

Mar 30 17:00:12.630 DEBG back from contract ioctl, unit: tofino-monitor
Mar 30 17:00:12.632 INFO Got tofino removed notification, unit: tofino-monitor
Mar 30 17:00:12.650 INFO halting the switch zone, unit: tofino-monitor
Mar 30 17:00:14.369 INFO acknowledging the remove event, unit: tofino-monitor
Mar 30 17:00:16.369 DEBG entering contract ioctl, unit: tofino-monitor
Mar 30 17:00:16.369 DEBG back from contract ioctl, unit: tofino-monitor
Mar 30 17:00:16.369 INFO closing out the device contract, unit: tofino-monitor
Mar 30 17:00:16.370 INFO tofino monitor exiting, unit: tofino-monitor
17:00:16.374Z INFO SledAgent (HardwareManager): child exited with code: Some(0)
    file = sled-hardware/src/illumos/mod.rs:730
17:00:16.375Z INFO SledAgent (HardwareMonitor): Received hardware update message
    file = sled-agent/src/hardware_monitor.rs:154
    update = Ok(TofinoUnavailable)
17:00:16.375Z INFO SledAgent (HardwareMonitor): Hardware monitor got TofinoUnavailable message
    file = sled-agent/src/hardware_monitor.rs:201
17:00:16.375Z INFO SledAgent (ServiceManager): Disabling switch zone (was running)

However, when the Tofino came back, tofino-monitor noticed, but HardwareMonitor never got a TofinoAvailable message, so it never restarted the switch zone:

Mar 30 17:00:21.383 INFO tofino monitor online, unit: tofino-monitor
Mar 30 17:00:21.414 INFO tofino device found, unit: tofino-monitor
Mar 30 17:00:21.414 DEBG entering contract ioctl, unit: tofino-monitor

Most events HardwareMonitor receives come from the internal sled_hardware task that periodically polls devinfo. That task emits logs either noting "nothing changed" or a list of updates:

if updates.is_empty() {
    debug!(log, "No updates from polling device tree");
}
for update in updates.into_iter() {
    info!(log, "Update from polling device tree: {:?}", update);
    let _ = tx.send(update);
}

I initially assumed from the "Hardware monitor got TofinoUnavailable message" log that this polling had noticed the Tofino had gone away, but that was wrong:

  1. There was no corresponding "Update from polling device tree:" log
  2. That TofinoUnavailable message was sent by the tofino-monitor machinery itself:
    let _ = tx.send(HardwareUpdate::TofinoUnavailable);

We did confirm via dtrace that the polling task was still polling every 5 seconds and did see that the Tofino was available:

BRM42220081 # dtrace -n 'slog15282:::debug { printf("%s\n", copyinstr(arg0)); }' -q | grep sled_hardware
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:55:57.396110244Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:55:58.385531396Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:03.415312964Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:04.365675256Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:09.395331786Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:10.305673451Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:15.333921636Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:16.324512269Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:21.355846860Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:22.362137839Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}

My theory here is:

  • The polling task never completed a poll where the Tofino was unavailable (I'm not sure about the exact mechanics here - maybe it was a timing issue since we only poll every 5 seconds and the Tofino wasn't gone for very long, or maybe polling devinfo itself is affected somehow by the Tofino going away and coming back?)
  • Therefore, it never sent a TofinoAvailable message: from its point of view the Tofino never went away, because it never sent a TofinoUnavailable message itself
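
To make the theory concrete, here is a minimal, self-contained sketch of the suspected interaction. It is not the sled-agent code: `Poller`, `Monitor`, and `simulate_missed_outage` are hypothetical names invented for illustration, and `HardwareUpdate` is a two-variant stand-in for the real type. It assumes the poller only reports changes relative to what it saw on its previous poll, and that the tofino-monitor path delivers TofinoUnavailable straight to the consumer.

```rust
/// Hypothetical stand-in for the real update type; only the two Tofino
/// variants named in this issue are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HardwareUpdate {
    TofinoAvailable,
    TofinoUnavailable,
}

/// Hypothetical edge-triggered poller: it remembers what it saw on the
/// previous poll and only reports a change.
struct Poller {
    last_seen_available: bool,
}

impl Poller {
    fn poll(&mut self, available_now: bool) -> Vec<HardwareUpdate> {
        let mut updates = Vec::new();
        if available_now != self.last_seen_available {
            updates.push(if available_now {
                HardwareUpdate::TofinoAvailable
            } else {
                HardwareUpdate::TofinoUnavailable
            });
            self.last_seen_available = available_now;
        }
        updates
    }
}

/// Hypothetical consumer: starts or stops the switch zone purely in
/// response to the messages it receives.
struct Monitor {
    switch_zone_running: bool,
}

impl Monitor {
    fn handle(&mut self, update: HardwareUpdate) {
        match update {
            HardwareUpdate::TofinoAvailable => self.switch_zone_running = true,
            HardwareUpdate::TofinoUnavailable => self.switch_zone_running = false,
        }
    }
}

/// Returns (updates the poller emitted after the outage, whether the
/// switch zone ended up running).
fn simulate_missed_outage() -> (usize, bool) {
    let mut poller = Poller { last_seen_available: false };
    let mut monitor = Monitor { switch_zone_running: false };

    // Startup: the first poll sees the Tofino and the zone is started.
    for update in poller.poll(true) {
        monitor.handle(update);
    }

    // The Tofino disappears and returns between two polls. Only the
    // out-of-band path notices, and it bypasses the poller entirely.
    monitor.handle(HardwareUpdate::TofinoUnavailable);

    // Every later poll sees the Tofino present, which matches the
    // poller's remembered state, so there is nothing to report.
    let mut updates_after_outage = 0;
    for _ in 0..3 {
        let updates = poller.poll(true);
        updates_after_outage += updates.len();
        for update in updates {
            monitor.handle(update);
        }
    }
    (updates_after_outage, monitor.switch_zone_running)
}
```

In this model, `simulate_missed_outage()` returns `(0, false)`: the poller emits nothing after the outage and the switch zone stays down, which matches the "No updates from polling device tree" lines seen via dtrace.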

I'm very hesitant to suggest any quick fixes here, because this feels a bit like we've got two different paths of control that are interfering. Some ideas:

  • Should the tofino-monitor path send a TofinoAvailable message, akin to how it sends TofinoUnavailable?
  • Should the tofino-monitor path coordinate more closely with the polling path? (E.g., if the tofino-monitor path told the polling path that the tofino was unavailable instead of bypassing it to talk to HardwareMonitor directly, a future poll that saw the tofino would be able to know enough to send a TofinoAvailable message)
  • Should the poller -> HardwareMonitor communication be reworked to be level-triggered instead of edge triggered? (This might also require something like the previous bullet.)
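
As a rough illustration of the level-triggered idea only, the sketch below has the consumer reconcile against the currently observed state on every poll instead of waiting for a change notification. All names (`TofinoState`, `SwitchZone`, `reconcile`, `simulate_level_triggered`) are hypothetical and not taken from sled-agent; it also assumes that starting an already-running zone can be made a no-op, which the real service manager may or may not guarantee.

```rust
/// Hypothetical observed state of the Tofino, reported on every poll
/// rather than only when it changes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TofinoState {
    Available,
    Unavailable,
}

/// Hypothetical switch zone bookkeeping; `starts` counts how many times
/// the zone was actually started.
struct SwitchZone {
    running: bool,
    starts: u32,
}

impl SwitchZone {
    /// Level-triggered: converge on the observed state, no matter which
    /// path reported it or what was reported before.
    fn reconcile(&mut self, observed: TofinoState) {
        match (observed, self.running) {
            (TofinoState::Available, false) => {
                self.running = true;
                self.starts += 1;
            }
            (TofinoState::Unavailable, true) => {
                self.running = false;
            }
            // Already in the desired state: nothing to do.
            _ => {}
        }
    }
}

/// Returns (whether the zone ended up running, number of starts).
fn simulate_level_triggered() -> (bool, u32) {
    let mut zone = SwitchZone { running: false, starts: 0 };

    // Startup poll sees the Tofino.
    zone.reconcile(TofinoState::Available);

    // The out-of-band path reports the removal.
    zone.reconcile(TofinoState::Unavailable);

    // Later polls keep reporting the current level; the first one
    // restarts the zone and the rest are no-ops.
    for _ in 0..3 {
        zone.reconcile(TofinoState::Available);
    }
    (zone.running, zone.starts)
}
```

Here `simulate_level_triggered()` returns `(true, 2)`: the zone is restarted by the first poll after the outage even though the poller itself never observed the Tofino as missing. The cost is that the consumer has to handle a report on every poll, so the reconcile step has to be cheap and idempotent.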

I don't think this is a release blocker, but it does mean we're back in the "we can't ship updates containing sidecar FPGA changes because online-updating to that might knock out the switch zones" world we were in before #9181 landed.
