HardwareManager: Rework to use a watch channel instead of broadcasting updates by jgallagher · Pull Request #10194 · oxidecomputer/omicron

jgallagher · 2026-03-31T00:58:45Z

This is an attempt to make hardware monitoring somewhat more level-triggered and somewhat less edge-triggered. In #10187, we had a case where the two different sources of `HardwareUpdate` notifications got out of sync, resulting in the polling source failing to send a `TofinoAvailable` update even though it realized the Tofino was, in fact, available.

On this branch, we keep the current `HardwareView` in a watch channel; any changes made to it will result in a `.changed()` notification firing, removing the possibility of mismatched updates.

(Hopefully) fixes #10187 - I'll get this on a racklette for some testing before merging.
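
To illustrate the level-triggered idea, here is a minimal, std-only sketch of watch-style semantics. It is not the omicron code and not a real watch channel implementation: `View`, `channel`, `Sender`, and `Receiver` are hypothetical stand-ins, with method names chosen to mirror `send_if_modified()`, `has_changed()`, and `borrow_and_update()` on a tokio-style watch channel. The point is that a receiver learns only that the shared value changed and then reads the latest state, so intermediate updates coalesce instead of having to be delivered one by one.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical, heavily simplified stand-in for `HardwareView`.
#[derive(Clone, Debug, PartialEq)]
struct View {
    tofino_available: bool,
    disk_count: usize,
}

/// Latest value plus a version that is bumped on every real change.
struct Shared<T> {
    value: T,
    version: u64,
}

/// Writer half; cloning it yields another writer to the same value.
#[derive(Clone)]
struct Sender<T>(Arc<Mutex<Shared<T>>>);

/// Reader half; remembers the last version it observed.
struct Receiver<T> {
    shared: Arc<Mutex<Shared<T>>>,
    seen: u64,
}

fn channel<T>(init: T) -> (Sender<T>, Receiver<T>) {
    let shared = Arc::new(Mutex::new(Shared { value: init, version: 0 }));
    (Sender(Arc::clone(&shared)), Receiver { shared, seen: 0 })
}

impl<T> Sender<T> {
    /// Run `modify` on the value; bump the version only if it reports a change.
    fn send_if_modified(&self, modify: impl FnOnce(&mut T) -> bool) -> bool {
        let mut guard = self.0.lock().unwrap();
        let changed = modify(&mut guard.value);
        if changed {
            guard.version += 1;
        }
        changed
    }
}

impl<T: Clone> Receiver<T> {
    /// True if the value changed since this receiver last looked.
    fn has_changed(&self) -> bool {
        self.shared.lock().unwrap().version != self.seen
    }

    /// Read the latest value and mark it as seen.
    fn borrow_and_update(&mut self) -> T {
        let guard = self.shared.lock().unwrap();
        self.seen = guard.version;
        guard.value.clone()
    }
}

/// Set the Tofino flag, reporting whether that was an actual change.
fn set_tofino(view: &mut View, available: bool) -> bool {
    let changed = view.tofino_available != available;
    view.tofino_available = available;
    changed
}

/// Two changes and one no-op, then a single read of the latest state.
fn coalesced_updates() -> (bool, View, bool) {
    let (tx, mut rx) = channel(View { tofino_available: true, disk_count: 10 });
    let other_tx = tx.clone();
    other_tx.send_if_modified(|v| set_tofino(v, false)); // change
    tx.send_if_modified(|v| set_tofino(v, true)); // change
    tx.send_if_modified(|v| set_tofino(v, true)); // no-op, no notification
    let pending = rx.has_changed();
    let latest = rx.borrow_and_update();
    (pending, latest, rx.has_changed())
}
```

Calling `coalesced_updates()` reports one pending change, the latest view (Tofino available), and nothing pending after the read. A real watch channel adds async waiting via `.changed()`, which this sketch leaves out.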

jgallagher · 2026-03-31T01:01:56Z

            Err(e) => error!(&log, "failed to collect exit status: {e:?}"),
            Ok(s) => info!(&log, "child exited with code: {:?}", s.code()),
        }
-        let _ = tx.send(HardwareUpdate::TofinoUnavailable);


This is the main behavioral fix in the PR. Prior to this change, this `tx.send(_)` went directly to sled-agent's `HardwareMonitor` and bypassed the `hardware_tracking_task()` thread created in this module, leaving open a window where `hardware_tracking_task()` might never see the Tofino be unavailable (and therefore never send an update back to `HardwareMonitor` that it became available again).

Now, we still send a change here that `HardwareMonitor` picks up, but it's in a watch channel we share with `hardware_tracking_task()`, so the next time it runs, it will see that we marked the Tofino as unavailable - if it's become available again, it will correctly update the channel contents and fire another `.changed()` event to `HardwareMonitor`.
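
The window described above can be modeled in a few lines. This is a simplified simulation under stated assumptions, not the sled-hardware code: `EdgePoller`, `SharedView`, `apply`, `old_design`, and `new_design` are hypothetical names, and `Update` only borrows its variant names from the old `HardwareUpdate` events. In the old flow the poller diffs against a private copy that the exit handler never touches; in the new flow both writers compare against one shared value.

```rust
/// Simplified stand-in for the old `HardwareUpdate` events.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Update {
    TofinoAvailable,
    TofinoUnavailable,
}

/// Old design: the poller diffs each poll against its own private copy.
struct EdgePoller {
    last_seen_available: bool,
}

impl EdgePoller {
    /// Emit an event only when the poll result differs from the private copy.
    fn poll(&mut self, available_now: bool) -> Option<Update> {
        if available_now == self.last_seen_available {
            return None;
        }
        self.last_seen_available = available_now;
        Some(if available_now {
            Update::TofinoAvailable
        } else {
            Update::TofinoUnavailable
        })
    }
}

/// Apply an event, if any, to the consumer's belief about the Tofino.
fn apply(consumer_available: &mut bool, update: Option<Update>) {
    match update {
        Some(Update::TofinoAvailable) => *consumer_available = true,
        Some(Update::TofinoUnavailable) => *consumer_available = false,
        None => {}
    }
}

/// Old design: returns what the consumer believes after the race.
fn old_design() -> bool {
    let mut poller = EdgePoller { last_seen_available: false };
    let mut consumer_available = false;
    apply(&mut consumer_available, poller.poll(true)); // poll: available
    // Exit handler sends directly to the consumer, bypassing the poller.
    apply(&mut consumer_available, Some(Update::TofinoUnavailable));
    // The Tofino is back before the next poll; the poller sees no edge.
    apply(&mut consumer_available, poller.poll(true));
    consumer_available
}

/// New design: one shared value that both writers compare against.
struct SharedView {
    tofino_available: bool,
    notifications: u32,
}

impl SharedView {
    /// Store the value; count a notification only if it actually changed.
    fn set(&mut self, available: bool) {
        if self.tofino_available != available {
            self.tofino_available = available;
            self.notifications += 1;
        }
    }
}

/// New design: returns the final shared state and the notification count.
fn new_design() -> (bool, u32) {
    let mut view = SharedView { tofino_available: false, notifications: 0 };
    view.set(true); // poll: available
    view.set(false); // exit handler marks it unavailable in the shared view
    view.set(true); // next poll compares against the shared view: a change
    (view.tofino_available, view.notifications)
}
```

In this model `old_design()` leaves the consumer believing the Tofino is unavailable, while `new_design()` ends with it available after three notifications.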

jgallagher · 2026-03-31T01:03:46Z

-                    self.raw_disks_tx
-                        .add_or_update_raw_disk(disk.into(), &self.log);
-                }
-                HardwareUpdate::DiskRemoved(disk) => {


Losing the fine-grained updates is a bit of a bummer - now we only get notified when the entire set of `(tofino, all_disks)` changes in any way, so we go through the more heavyweight `check_latest_hardware_snapshot()` on any change. But (a) I think it's easier to understand, (b) the methods `check_latest_hardware_snapshot()` talks to are already themselves prepared to handle "nothing actually changed - don't do anything", and (c) we expect this event to fire very infrequently.
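
Point (b) is what makes the coarse notification safe: if the consumer reconciles its own state against the full snapshot, re-running it with nothing changed does no work. A rough sketch of that shape, assuming disks are identified by plain strings; `DiskTracker`, `Actions`, `reconcile`, and `set` are hypothetical names and not the sled-agent API:

```rust
use std::collections::BTreeSet;

/// What a reconcile pass decided to do; empty when nothing changed.
#[derive(Debug, Default, PartialEq)]
struct Actions {
    added: Vec<String>,
    removed: Vec<String>,
}

/// Hypothetical consumer-side record of the disks it already knows about.
struct DiskTracker {
    known: BTreeSet<String>,
}

impl DiskTracker {
    /// Diff the full snapshot against known state and adopt the snapshot.
    fn reconcile(&mut self, snapshot: &BTreeSet<String>) -> Actions {
        let added: Vec<String> =
            snapshot.difference(&self.known).cloned().collect();
        let removed: Vec<String> =
            self.known.difference(snapshot).cloned().collect();
        if !added.is_empty() || !removed.is_empty() {
            self.known = snapshot.clone();
        }
        Actions { added, removed }
    }
}

/// Convenience constructor for examples.
fn set(items: &[&str]) -> BTreeSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}
```

Reconciling `{a, b}` against a snapshot of `{b, c}` yields one addition and one removal; reconciling against the same snapshot again yields an empty `Actions`, so a spurious wakeup costs only the comparison.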

smklein · 2026-03-31T15:56:34Z

+        tofino: polled_tofino,
+        disks: polled_disks,
+        baseboard: polled_baseboard,
+    } = polled_hw.clone();


Could we destructure `polled_hw` without cloning, and log these inner fields directly down below?

(I think that would help us skip an unnecessary clone)

I thought I needed the clone because the `send_if_modified()` takes ownership of these, but it only does that if there have actually been changes. 2c4d11b removes this clone, and adds some clones inside `send_if_modified()` (only in the "there are changes" path).
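
The "clone only when something changed" pattern can be sketched like this. It is an illustration under assumptions, not the code from 2c4d11b: `View`, `Modified`, and `apply_poll` are hypothetical, and the view is reduced to two fields. The polled data is borrowed for the comparison, and a field is cloned into the current view only on the path where it differs.

```rust
/// Hypothetical, simplified view; the real `HardwareView` has more fields.
#[derive(Clone, Debug, PartialEq)]
struct View {
    tofino_available: bool,
    disks: Vec<String>,
}

/// Which parts of the view a poll modified.
#[derive(Debug, PartialEq)]
struct Modified {
    tofino: bool,
    disks: bool,
}

impl Modified {
    /// True if any part of the view was modified.
    fn any(&self) -> bool {
        self.tofino || self.disks
    }
}

/// Compare by reference; clone a field only when it actually differs.
fn apply_poll(current: &mut View, polled: &View) -> Modified {
    let tofino = current.tofino_available != polled.tofino_available;
    if tofino {
        current.tofino_available = polled.tofino_available;
    }
    let disks = current.disks != polled.disks;
    if disks {
        current.disks = polled.disks.clone(); // only on the "changed" path
    }
    Modified { tofino, disks }
}
```

With a watch channel, a closure along the lines of `|current| apply_poll(current, &polled).any()` is the kind of thing that could be handed to `send_if_modified()`, leaving `polled` intact for logging afterwards.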

smklein · 2026-03-31T15:57:55Z

            Err(e) => error!(&log, "failed to collect exit status: {e:?}"),
            Ok(s) => info!(&log, "child exited with code: {:?}", s.code()),
        }
-        let _ = tx.send(HardwareUpdate::TofinoUnavailable);


makes sense, a watch channel seems like a more correct primitive here.

smklein · 2026-03-31T16:47:28Z

                    states[index].present = true;
                    true
                }
-                Operation::Remove(index) => {


Was this dead code? I was looking for a caller constructing an `Operation::Remove` but couldn't find one.

No, this was constructed by a proptest, but since we got rid of `remove_raw_disk()` there's no way to implement this branch (nor any reason to try to proptest explicit removals).

smklein · 2026-03-31T16:49:28Z

-                    self.raw_disks_tx
-                        .add_or_update_raw_disk(disk.into(), &self.log);
-                }
-                HardwareUpdate::DiskRemoved(disk) => {


Agreed, the performance hit here seems really low (since changes are so uncommon) and the cost of missing a change is much worse.

jgallagher · 2026-03-31T19:11:20Z

Testing this on dublin looks good. sled-agent startup shows us finding the tofino and all the disks:

00:00:45.422Z INFO SledAgent (HardwareManager): Completed poll of device tree
    did_modify_baseboard = true
    did_modify_disks = true
    did_modify_tofino = true
    file = sled-hardware/src/illumos/mod.rs:594
00:00:45.422Z INFO SledAgent (HardwareManager): Updated tofino
    file = sled-hardware/src/illumos/mod.rs:601
    tofino = TofinoSnapshot { exists: true, available: true }
00:00:45.422Z INFO SledAgent (HardwareManager): Updated baseboard
    baseboard = Gimlet { identifier: "BRM23230018", model: "913-0000019", revision: 11 }
    file = sled-hardware/src/illumos/mod.rs:604
00:00:45.422Z INFO SledAgent (HardwareManager): Updated disks
    disks = { ... all the disks ... }

Later, I power cycled the Tofino. We see this sequence:

Last poll before the power cycle; no changes (switch zone is still up):

19:07:23.080Z INFO SledAgent (HardwareManager): Completed poll of device tree
    did_modify_baseboard = false
    did_modify_disks = false
    did_modify_tofino = false
    file = sled-hardware/src/illumos/mod.rs:594

tofino-monitor notices the Tofino go away; it shuts down the switch zone and updates the watch channel, which triggers `HardwareMonitor` to shut down the switch zone:

Mar 31 19:07:24.383 DEBG back from contract ioctl, unit: tofino-monitor
Mar 31 19:07:24.383 INFO Got tofino removed notification, unit: tofino-monitor
Mar 31 19:07:24.396 INFO halting the switch zone, unit: tofino-monitor
Mar 31 19:07:25.805 INFO acknowledging the remove event, unit: tofino-monitor
Mar 31 19:07:27.805 DEBG entering contract ioctl, unit: tofino-monitor
Mar 31 19:07:27.806 DEBG back from contract ioctl, unit: tofino-monitor
Mar 31 19:07:27.806 INFO closing out the device contract, unit: tofino-monitor
Mar 31 19:07:27.806 INFO tofino monitor exiting, unit: tofino-monitor
19:07:27.808Z INFO SledAgent (HardwareManager): child exited with code: Some(0)
    file = sled-hardware/src/illumos/mod.rs:653
19:07:27.808Z INFO SledAgent (HardwareMonitor): Received notification hardware view has changed
    file = sled-agent/src/hardware_monitor.rs:152
19:07:27.809Z INFO SledAgent (HardwareMonitor): Checking current full hardware snapshot
    file = sled-agent/src/hardware_monitor.rs:251
    snapshot = HardwareView { tofino: Real(TofinoSnapshot { exists: true, available: false }), disks: {... all the disks ...}, baseboard: Some(Gimlet { identifier: "BRM23230018", model: "913-0000019", revision: 11 }), online_processor_count: 128, usable_physical_pages: 265285376, usable_physical_ram_bytes: 1086608900096, cpu_family: AmdMilan }
19:07:27.809Z INFO SledAgent (ServiceManager): Disabling switch zone (was running)
    file = sled-agent/src/services.rs:4059

The next time `HardwareManager` polls the device tree, the Tofino is available again. This confirms we are hitting the race this branch fixes: the Tofino went away and came back in between two poll events. But now, since tofino-monitor updated the shared watch channel, the polling thread realizes there's been a change, which triggers `HardwareMonitor` to start up the switch zone:

19:07:29.114Z INFO SledAgent (HardwareManager): Completed poll of device tree
    did_modify_baseboard = false
    did_modify_disks = false
    did_modify_tofino = true
    file = sled-hardware/src/illumos/mod.rs:594
19:07:29.114Z INFO SledAgent (HardwareManager): Updated tofino
    file = sled-hardware/src/illumos/mod.rs:601
    tofino = TofinoSnapshot { exists: true, available: true }
19:07:29.114Z INFO SledAgent (HardwareMonitor): Received notification hardware view has changed
    file = sled-agent/src/hardware_monitor.rs:152
19:07:29.114Z INFO SledAgent (HardwareMonitor): Checking current full hardware snapshot
    file = sled-agent/src/hardware_monitor.rs:251
    snapshot = HardwareView { tofino: Real(TofinoSnapshot { exists: true, available: true }), disks: {... all the disks ...}, baseboard: Some(Gimlet { identifier: "BRM23230018", model: "913-0000019", revision: 11 }), online_processor_count: 128, usable_physical_pages: 265285376, usable_physical_ram_bytes: 1086608900096, cpu_family: AmdMilan }
19:07:29.114Z INFO SledAgent (HardwareMonitor): tofino present and policy allows switch zone; will activate it
    file = sled-agent/src/hardware_monitor.rs:208
19:07:29.114Z INFO SledAgent (ServiceManager): Ensuring scrimlet services (enabling services)
    file = sled-agent/src/services.rs:3257
19:07:29.114Z INFO SledAgent (ServiceManager): Enabling switch zone (new)
    file = sled-agent/src/services.rs:3611
19:07:29.114Z INFO SledAgent (ServiceManager): Starting switch zone
    file = sled-agent/src/services.rs:4110

jgallagher requested review from Nieuwejaar, andrewjstone and smklein March 31, 2026 00:58

smklein approved these changes Mar 31, 2026

Commit 2c4d11b: only clone if there have been changes

jgallagher enabled auto-merge (squash) March 31, 2026 19:48

jgallagher merged commit 313419c into main Mar 31, 2026
16 checks passed

jgallagher deleted the john/hardware-manager-split-brain branch March 31, 2026 20:17

jgallagher merged 2 commits into main from john/hardware-manager-split-brain