HardwareManager: Rework to use a watch channel instead of broadcasting updates #10194

Merged
jgallagher merged 2 commits into main from john/hardware-manager-split-brain
Mar 31, 2026

Conversation

@jgallagher
Contributor

This is an attempt to make hardware monitoring somewhat more level-triggered and somewhat less edge-triggered. In #10187, we had a case where the two different sources of HardwareUpdate notifications got out of sync, resulting in the polling source failing to send a TofinoAvailable update even though it realized the Tofino was, in fact, available.

On this branch, we keep the current HardwareView in a watch channel; any changes made to it will result in a .changed() notification firing, removing the possibility of mismatched updates.
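To make the level-triggered idea concrete, here is a minimal, std-only sketch of a watch-style cell. It is a stand-in written for illustration, not the channel type used on this branch; `WatchCell`, `Observer`, `set_if_changed`, and `latest_if_changed` are hypothetical names. The point it shows is that an observer which missed intermediate writes still ends up looking at the latest state.

```rust
/// Hypothetical, std-only stand-in for a watch channel: one current value
/// plus a version counter that is bumped on every real change.
#[derive(Debug)]
pub struct WatchCell<T> {
    value: T,
    version: u64,
}

/// Receiver-side bookkeeping: the last version this observer has seen.
#[derive(Debug, Default)]
pub struct Observer {
    seen: u64,
}

impl<T: Clone + PartialEq> WatchCell<T> {
    pub fn new(value: T) -> Self {
        Self { value, version: 0 }
    }

    /// Store `new` and bump the version only if it differs from the
    /// current value. Returns whether observers should be woken.
    pub fn set_if_changed(&mut self, new: T) -> bool {
        if self.value == new {
            return false;
        }
        self.value = new;
        self.version += 1;
        true
    }

    /// Level-triggered read: if anything changed since this observer last
    /// looked, return the latest value, however many writes happened.
    pub fn latest_if_changed(&self, obs: &mut Observer) -> Option<T> {
        if obs.seen == self.version {
            return None;
        }
        obs.seen = self.version;
        Some(self.value.clone())
    }
}
```

With this shape, two writes between reads collapse into one "something changed" signal, and the observer reconciles against whatever the current value is instead of replaying individual events.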

(Hopefully) fixes #10187 - I'll get this on a racklette for some testing before merging.

Err(e) => error!(&log, "failed to collect exit status: {e:?}"),
Ok(s) => info!(&log, "child exited with code: {:?}", s.code()),
}
let _ = tx.send(HardwareUpdate::TofinoUnavailable);
Contributor Author

This is the main behavioral fix in the PR. Prior to this change, this tx.send(_) went directly to sled-agent's HardwareMonitor and bypassed the hardware_tracking_task() thread created in this module, leaving open a window where hardware_tracking_task() might never see the Tofino be unavailable (and therefore never send an update back to HardwareMonitor that it became available again).

Now, we still send a change here that HardwareMonitor picks up, but it's in a watch channel we share with hardware_tracking_task(), so the next time it runs, it will see that we marked the Tofino as unavailable - if it's become available again, it will correctly update the channel contents and fire another .changed() event to HardwareMonitor.
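The window described above can be reproduced in a few lines. This is a simplified model, assuming a single boolean "Tofino available" state; `poll_edge`, `write_if_changed`, and both driver functions are hypothetical names for illustration and do not mirror the real module's code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    TofinoAvailable,
    TofinoUnavailable,
}

/// Old shape: the poller emits an event only when a poll disagrees with
/// its own private cache of the last polled state.
fn poll_edge(cache: &mut bool, polled: bool) -> Option<Event> {
    if *cache == polled {
        return None;
    }
    *cache = polled;
    Some(if polled { Event::TofinoAvailable } else { Event::TofinoUnavailable })
}

/// Events the consumer receives when the monitor bypasses the poller.
pub fn edge_triggered_events() -> Vec<Event> {
    let mut events = Vec::new();
    let mut poller_cache = true; // last poll saw the Tofino available

    // The monitor reports the removal straight to the consumer; the
    // poller's cache is not updated.
    events.push(Event::TofinoUnavailable);

    // The Tofino returns before the next poll, so the poll sees no change
    // relative to its cache and emits nothing.
    events.extend(poll_edge(&mut poller_cache, true));
    events
}

/// New shape: both writers update one shared value.
fn write_if_changed(shared: &mut bool, new: bool) -> bool {
    if *shared == new {
        return false;
    }
    *shared = new;
    true
}

/// Values the consumer observes, one per change notification.
pub fn shared_state_notifications() -> Vec<bool> {
    let mut shared = true;
    let mut notifications = Vec::new();

    // Monitor marks the Tofino unavailable in the shared state.
    if write_if_changed(&mut shared, false) {
        notifications.push(shared);
    }
    // Next poll sees it available and compares against the shared state.
    if write_if_changed(&mut shared, true) {
        notifications.push(shared);
    }
    notifications
}
```

In the first function the consumer's last word is `TofinoUnavailable`; in the second it is told about both transitions, because the poller compares against the state the monitor just wrote.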

Collaborator

makes sense, a watch channel seems like a more correct primitive here.

self.raw_disks_tx
.add_or_update_raw_disk(disk.into(), &self.log);
}
HardwareUpdate::DiskRemoved(disk) => {
Contributor Author

Losing the fine-grained updates is a bit of a bummer - now we only get notified when the entire set of (tofino, all_disks) changes in any way, so we go through the more heavyweight check_latest_hardware_snapshot() on any change. But (a) I think it's easier to understand, (b) the methods check_latest_hardware_snapshot() talks to are already themselves prepared to handle "nothing actually changed - don't do anything", and (c) we expect this event to fire very infrequently.
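As a rough sketch of why a coarse "something changed" signal is workable, the consumer can reconcile against the whole snapshot and do nothing when it already matches. The types and names below (`Snapshot`, `Managed`, `Action`, `reconcile`) are hypothetical and much smaller than the real `HardwareView`; the actual `check_latest_hardware_snapshot()` is not shown in this thread.

```rust
use std::collections::BTreeSet;

/// Hypothetical, reduced stand-in for the full hardware snapshot.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub tofino_available: bool,
    pub disks: BTreeSet<String>,
}

/// What the consumer currently has running or registered.
#[derive(Debug, Default)]
pub struct Managed {
    switch_zone_running: bool,
    disks: BTreeSet<String>,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    StartSwitchZone,
    StopSwitchZone,
    AddDisk(String),
    RemoveDisk(String),
}

impl Managed {
    /// Bring managed state in line with `snap`, returning the actions taken.
    /// Calling this again with the same snapshot returns no actions.
    pub fn reconcile(&mut self, snap: &Snapshot) -> Vec<Action> {
        let mut actions = Vec::new();
        if snap.tofino_available != self.switch_zone_running {
            self.switch_zone_running = snap.tofino_available;
            actions.push(if snap.tofino_available {
                Action::StartSwitchZone
            } else {
                Action::StopSwitchZone
            });
        }
        // Disks present in the snapshot but not yet managed.
        for d in snap.disks.difference(&self.disks) {
            actions.push(Action::AddDisk(d.clone()));
        }
        // Disks we manage that are no longer in the snapshot.
        for d in self.disks.difference(&snap.disks) {
            actions.push(Action::RemoveDisk(d.clone()));
        }
        self.disks = snap.disks.clone();
        actions
    }
}
```

Because the reconcile step is idempotent, an extra wakeup costs a comparison, while a missed wakeup is corrected by the next one.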

Collaborator

Agreed, the performance hit here seems really low (since changes are so uncommon) and the cost of missing a change is much worse.

Comment thread sled-hardware/src/illumos/mod.rs Outdated
tofino: polled_tofino,
disks: polled_disks,
baseboard: polled_baseboard,
} = polled_hw.clone();
Collaborator

Could we destructure polled_hw without cloning, and log these inner fields directly down below?

(I think that would help us skip an unnecessary clone)

Contributor Author

I thought I needed the clone because the send_if_modified() takes ownership of these, but it only does that if there have actually been changes. 2c4d11b removes this clone, and adds some clones inside send_if_modified() (only in the "there are changes" path).
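The "clone only when something changed" pattern might look roughly like the sketch below. It assumes the update runs in a callback that gets mutable access to the current value and returns whether it modified it; `View` and `merge_polled` are hypothetical names, and the real `HardwareView` has more fields.

```rust
/// Hypothetical, reduced stand-in for the shared hardware view.
#[derive(Debug, Clone, PartialEq)]
pub struct View {
    pub tofino_available: bool,
    pub disks: Vec<String>,
}

/// Merge a freshly polled view into the current one.
///
/// `polled` is only borrowed: fields are compared by reference, and a clone
/// happens only for a field that actually differs. Returns `true` if
/// `current` was modified, which is the signal to notify watchers.
pub fn merge_polled(current: &mut View, polled: &View) -> bool {
    let mut modified = false;
    if current.tofino_available != polled.tofino_available {
        current.tofino_available = polled.tofino_available;
        modified = true;
    }
    if current.disks != polled.disks {
        current.disks = polled.disks.clone();
        modified = true;
    }
    modified
}
```

Since the caller keeps ownership of `polled`, it can still log the polled fields after the merge without an up-front clone.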

states[index].present = true;
true
}
Operation::Remove(index) => {
Collaborator

Was this dead code? I was looking for a caller constructing an Operation::Remove but couldn't find one

Contributor Author

No, this was constructed by a proptest, but since we got rid of remove_raw_disk() there's no way to implement this branch (nor any reason to try to proptest explicit removals).

@jgallagher
Contributor Author

Testing this on dublin looks good. sled-agent startup shows us finding the tofino and all the disks:

00:00:45.422Z INFO SledAgent (HardwareManager): Completed poll of device tree
    did_modify_baseboard = true
    did_modify_disks = true
    did_modify_tofino = true
    file = sled-hardware/src/illumos/mod.rs:594
00:00:45.422Z INFO SledAgent (HardwareManager): Updated tofino
    file = sled-hardware/src/illumos/mod.rs:601
    tofino = TofinoSnapshot { exists: true, available: true }
00:00:45.422Z INFO SledAgent (HardwareManager): Updated baseboard
    baseboard = Gimlet { identifier: "BRM23230018", model: "913-0000019", revision: 11 }
    file = sled-hardware/src/illumos/mod.rs:604
00:00:45.422Z INFO SledAgent (HardwareManager): Updated disks
    disks = { ... all the disks ... }

Later, I power cycled the Tofino. We see this sequence:

  1. Last poll before the power cycle; no changes (switch zone is still up)
19:07:23.080Z INFO SledAgent (HardwareManager): Completed poll of device tree
    did_modify_baseboard = false
    did_modify_disks = false
    did_modify_tofino = false
    file = sled-hardware/src/illumos/mod.rs:594
  2. tofino-monitor notices the Tofino go away; it shuts down the switch zone and updates the watch channel, which triggers HardwareMonitor to shut down the switch zone:
Mar 31 19:07:24.383 DEBG back from contract ioctl, unit: tofino-monitor
Mar 31 19:07:24.383 INFO Got tofino removed notification, unit: tofino-monitor
Mar 31 19:07:24.396 INFO halting the switch zone, unit: tofino-monitor
Mar 31 19:07:25.805 INFO acknowledging the remove event, unit: tofino-monitor
Mar 31 19:07:27.805 DEBG entering contract ioctl, unit: tofino-monitor
Mar 31 19:07:27.806 DEBG back from contract ioctl, unit: tofino-monitor
Mar 31 19:07:27.806 INFO closing out the device contract, unit: tofino-monitor
Mar 31 19:07:27.806 INFO tofino monitor exiting, unit: tofino-monitor
19:07:27.808Z INFO SledAgent (HardwareManager): child exited with code: Some(0)
    file = sled-hardware/src/illumos/mod.rs:653
19:07:27.808Z INFO SledAgent (HardwareMonitor): Received notification hardware view has changed
    file = sled-agent/src/hardware_monitor.rs:152
19:07:27.809Z INFO SledAgent (HardwareMonitor): Checking current full hardware snapshot
    file = sled-agent/src/hardware_monitor.rs:251
    snapshot = HardwareView { tofino: Real(TofinoSnapshot { exists: true, available: false }), disks: {... all the disks ...}, baseboard: Some(Gimlet { identifier: "BRM23230018", model: "913-0000019", revision: 11 }), online_processor_count: 128, usable_physical_pages: 265285376, usable_physical_ram_bytes: 1086608900096, cpu_family: AmdMilan }
19:07:27.809Z INFO SledAgent (ServiceManager): Disabling switch zone (was running)
    file = sled-agent/src/services.rs:4059
  3. The next time HardwareManager polls the device tree, the Tofino is available again. This confirms we are hitting the race this branch fixes: the Tofino went away and came back in between two poll events. But now since the tofino-monitor updated the shared watch channel, the polling thread realizes there's been a change, triggering HardwareMonitor to start up the switch zone:
19:07:29.114Z INFO SledAgent (HardwareManager): Completed poll of device tree
    did_modify_baseboard = false
    did_modify_disks = false
    did_modify_tofino = true
    file = sled-hardware/src/illumos/mod.rs:594
19:07:29.114Z INFO SledAgent (HardwareManager): Updated tofino
    file = sled-hardware/src/illumos/mod.rs:601
    tofino = TofinoSnapshot { exists: true, available: true }
19:07:29.114Z INFO SledAgent (HardwareMonitor): Received notification hardware view has changed
    file = sled-agent/src/hardware_monitor.rs:152
19:07:29.114Z INFO SledAgent (HardwareMonitor): Checking current full hardware snapshot
    file = sled-agent/src/hardware_monitor.rs:251
    snapshot = HardwareView { tofino: Real(TofinoSnapshot { exists: true, available: true }), disks: {... all the disks ...}, baseboard: Some(Gimlet { identifier: "BRM23230018", model: "913-0000019", revision: 11 }), online_processor_count: 128, usable_physical_pages: 265285376, usable_physical_ram_bytes: 1086608900096, cpu_family: AmdMilan }
19:07:29.114Z INFO SledAgent (HardwareMonitor): tofino present and policy allows switch zone; will activate it
    file = sled-agent/src/hardware_monitor.rs:208
19:07:29.114Z INFO SledAgent (ServiceManager): Ensuring scrimlet services (enabling services)
    file = sled-agent/src/services.rs:3257
19:07:29.114Z INFO SledAgent (ServiceManager): Enabling switch zone (new)
    file = sled-agent/src/services.rs:3611
19:07:29.114Z INFO SledAgent (ServiceManager): Starting switch zone
    file = sled-agent/src/services.rs:4110

@jgallagher jgallagher enabled auto-merge (squash) March 31, 2026 19:48
@jgallagher jgallagher merged commit 313419c into main Mar 31, 2026
16 checks passed
@jgallagher jgallagher deleted the john/hardware-manager-split-brain branch March 31, 2026 20:17
Development

Successfully merging this pull request may close these issues.

sled-agent doesn't always restart the switch zone after the Tofino restarts
