sled-agent doesn't always restart the switch zone after the Tofino restarts

Today on madrid, @Nieuwejaar ran into an issue where, upon restarting the Tofino, `sled-agent` tore down the switch zone (as expected) but never restarted it (definitely not expected). Looking at the logs, we see that on startup, the `HardwareMonitor` was notified that a Tofino was available, so it started the switch zone:

```
00:01:28.281Z INFO SledAgent (HardwareMonitor): tofino present and policy allows switch zone; will activate it
    file = sled-agent/src/hardware_monitor.rs:265
00:01:28.281Z INFO SledAgent (ServiceManager): Ensuring scrimlet services (enabling services)
    file = sled-agent/src/services.rs:3279
00:01:28.281Z INFO SledAgent (ServiceManager): Re-enabling running switch zone (new address)
```

And when the Tofino went away, the `tofino-monitor` process (from #9181) noticed and shut down the switch zone, and `HardwareMonitor` received a `TofinoUnavailable` message:

```
Mar 30 17:00:12.630 DEBG back from contract ioctl, unit: tofino-monitor
Mar 30 17:00:12.632 INFO Got tofino removed notification, unit: tofino-monitor
Mar 30 17:00:12.650 INFO halting the switch zone, unit: tofino-monitor
Mar 30 17:00:14.369 INFO acknowledging the remove event, unit: tofino-monitor
Mar 30 17:00:16.369 DEBG entering contract ioctl, unit: tofino-monitor
Mar 30 17:00:16.369 DEBG back from contract ioctl, unit: tofino-monitor
Mar 30 17:00:16.369 INFO closing out the device contract, unit: tofino-monitor
Mar 30 17:00:16.370 INFO tofino monitor exiting, unit: tofino-monitor
17:00:16.374Z INFO SledAgent (HardwareManager): child exited with code: Some(0)
    file = sled-hardware/src/illumos/mod.rs:730
17:00:16.375Z INFO SledAgent (HardwareMonitor): Received hardware update message
    file = sled-agent/src/hardware_monitor.rs:154
    update = Ok(TofinoUnavailable)
17:00:16.375Z INFO SledAgent (HardwareMonitor): Hardware monitor got TofinoUnavailable message
    file = sled-agent/src/hardware_monitor.rs:201
17:00:16.375Z INFO SledAgent (ServiceManager): Disabling switch zone (was running)
```

However, when the Tofino came back, `tofino-monitor` noticed, but `HardwareMonitor` never got a `TofinoAvailable` message, so it never restarted the switch zone:

```
Mar 30 17:00:21.383 INFO tofino monitor online, unit: tofino-monitor
Mar 30 17:00:21.414 INFO tofino device found, unit: tofino-monitor
Mar 30 17:00:21.414 DEBG entering contract ioctl, unit: tofino-monitor
```

Most events `HardwareMonitor` receives come from the internal `sled_hardware` task that periodically polls devinfo. That task emits logs either noting "nothing changed" or a list of updates: https://github.com/oxidecomputer/omicron/blob/dfcd9ac2d3321f7d286c6bde832faf0306b1e535/sled-hardware/src/illumos/mod.rs#L681-L688

I initially assumed from the `Hardware monitor got TofinoUnavailable message` log that this polling had noticed the Tofino had gone away, but that was wrong:

1. There was no corresponding `Update from polling device tree:` log
2. That `TofinoUnavailable` message was sent by the `tofino-monitor` machinery itself: https://github.com/oxidecomputer/omicron/blob/dfcd9ac2d3321f7d286c6bde832faf0306b1e535/sled-hardware/src/illumos/mod.rs#L732

We did confirm via `dtrace` that the polling system was polling every 5 seconds and did know that the Tofino was available:

```
BRM42220081 # dtrace -n 'slog15282:::debug { printf("%s\n", copyinstr(arg0)); }' -q | grep sled_hardware
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:55:57.396110244Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:55:58.385531396Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:03.415312964Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:04.365675256Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:09.395331786Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:10.305673451Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:15.333921636Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:16.324512269Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":369},"level":"DEBUG","timestamp":"2026-03-30T17:56:21.355846860Z","message":"Found tofino node, with asic available","kv":{"component":"HardwareManager"}
{"ok":{"location":{"module":"sled_hardware::illumos","file":"sled-hardware/src/illumos/mod.rs","line":682},"level":"DEBUG","timestamp":"2026-03-30T17:56:22.362137839Z","message":"No updates from polling device tree","kv":{"component":"HardwareManager"}}}
```

My theory here is:

* The polling task never completed a poll where the Tofino was _unavailable_ (I'm not sure about the exact mechanics here - maybe it was a timing issue since we only poll every 5 seconds and the Tofino wasn't gone for very long, or maybe polling devinfo itself is affected somehow by the Tofino going away and coming back?)
* Therefore, it never sent a `TofinoAvailable` message, because it _itself_ never sent a `TofinoUnavailable` message
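
To make the theory concrete, here is a minimal model of the suspected interaction. Everything in it (`Update`, `Poller`, `Monitor`, `replay_incident`) is a hypothetical stand-in written for this issue, not the real `sled-hardware` or `sled-agent` types. It assumes the poller only emits an update when a poll differs from its own previous poll, and that the `tofino-monitor` path reports directly to the consumer without touching the poller's state:

```rust
// Hypothetical model of the two notification paths; not the real omicron types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Update {
    TofinoAvailable,
    TofinoUnavailable,
}

/// Edge-triggered poller: it only reports when a poll result differs from
/// the result of its own previous poll.
struct Poller {
    last_seen_present: bool,
}

impl Poller {
    fn poll(&mut self, present_now: bool) -> Option<Update> {
        if present_now == self.last_seen_present {
            // Corresponds to "No updates from polling device tree".
            return None;
        }
        self.last_seen_present = present_now;
        Some(if present_now {
            Update::TofinoAvailable
        } else {
            Update::TofinoUnavailable
        })
    }
}

/// Stand-in for the consumer (`HardwareMonitor` in the real system).
#[derive(Default)]
struct Monitor {
    switch_zone_running: bool,
}

impl Monitor {
    fn handle(&mut self, update: Update) {
        self.switch_zone_running = update == Update::TofinoAvailable;
    }
}

/// Replays the suspected timeline and returns whether the switch zone is
/// running at the end.
fn replay_incident() -> bool {
    let mut poller = Poller { last_seen_present: false };
    let mut monitor = Monitor::default();

    // Startup: a poll sees the Tofino, so the zone is started.
    if let Some(u) = poller.poll(true) {
        monitor.handle(u);
    }

    // The Tofino goes away between polls. The side channel tells the
    // monitor directly; the poller's `last_seen_present` is untouched.
    monitor.handle(Update::TofinoUnavailable);

    // The Tofino is back before the next poll. From the poller's point of
    // view nothing changed, so no `TofinoAvailable` is ever sent.
    if let Some(u) = poller.poll(true) {
        monitor.handle(u);
    }

    monitor.switch_zone_running
}
```

In this model `replay_incident()` ends with the switch zone stopped, which matches what we saw on madrid. Whether the real poller missed the gap because of timing or for some other reason is still an open question.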

I'm very hesitant to suggest any quick fixes here, because this feels a bit like we've got two different paths of control that are interfering. Some ideas:

* Should the `tofino-monitor` path send a `TofinoAvailable` message, akin to how it sends `TofinoUnavailable`?
* Should the `tofino-monitor` path coordinate more closely with the polling path? (E.g., if the `tofino-monitor` path told the polling path that the Tofino was unavailable instead of bypassing it to talk to `HardwareMonitor` directly, a future poll that saw the Tofino would know enough to send a `TofinoAvailable` message.)
* Should the poller -> `HardwareMonitor` communication be reworked to be level-triggered instead of edge-triggered? (This might also require something like the previous bullet.)
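
As a rough illustration of the level-triggered idea only (not a proposed patch), the sketch below has every observer report the current state rather than a change, and lets a single reconciler decide whether anything needs to happen. `Action`, `Reconciler`, and `replay_with_levels` are hypothetical names invented for this example, and the sketch assumes both the poller and the `tofino-monitor` path feed the same `observe` call:

```rust
// Hypothetical sketch of level-triggered reconciliation; names are invented
// for illustration and do not exist in omicron.
#[derive(Debug, PartialEq)]
enum Action {
    StartSwitchZone,
    StopSwitchZone,
}

/// Tracks what we last did with the switch zone and compares it against
/// each reported level.
struct Reconciler {
    zone_running: bool,
}

impl Reconciler {
    /// Called with the observed level ("is the Tofino present right now?")
    /// on every poll, and also by the side channel when it sees a removal.
    fn observe(&mut self, tofino_present: bool) -> Option<Action> {
        match (tofino_present, self.zone_running) {
            (true, false) => {
                self.zone_running = true;
                Some(Action::StartSwitchZone)
            }
            (false, true) => {
                self.zone_running = false;
                Some(Action::StopSwitchZone)
            }
            // Observed level already matches what we did; nothing to do.
            _ => None,
        }
    }
}

/// Replays the same timeline as the incident: poll sees the Tofino, the
/// side channel reports removal, then two later polls see it present again.
fn replay_with_levels() -> Vec<Action> {
    let mut reconciler = Reconciler { zone_running: false };
    let observations = [true, false, true, true];
    observations
        .iter()
        .filter_map(|&present| reconciler.observe(present))
        .collect()
}
```

Under these assumptions the first poll after the Tofino returns restarts the zone, because the reconciler compares against what it last did rather than against what the poller last saw. The sketch ignores ordering between a poll and an in-progress teardown by `tofino-monitor`, which a real change would have to think about.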

I don't think this is a release blocker, but it does mean we're back in the "we can't ship updates containing sidecar FPGA changes because online-updating to that might knock out the switch zones" world we were in before #9181 landed.

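For reference, the polling task's logging linked above (`sled-hardware/src/illumos/mod.rs#L681-L688`):

	if updates.is_empty() {
		debug!(log, "No updates from polling device tree");
	}

	for update in updates.into_iter() {
		info!(log, "Update from polling device tree: {:?}", update);
		let _ = tx.send(update);
	}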

sled-agent doesn't always restart the switch zone after the Tofino restarts #10187
