HardwareManager: Rework to use a watch channel instead of broadcasting updates#10194
HardwareManager: Rework to use a watch channel instead of broadcasting updates#10194jgallagher merged 2 commits intomainfrom
Conversation
…g updates This is an attempt to make hardware monitoring somewhat more level-triggered and somewhat less edge-triggered. In #10187, we had a case where the two different sources of `HardwareUpdate` notifications got out of sync, resulting in the polling source failing to send a `TofinoAvailable` update even though it realized the Tofino was, in fact, available. On this branch, we keep the current `HardwareView` in a watch channel; any changes made to it will result in a `.changed()` notification firing, removing the possibility of mismatched updates.
| Err(e) => error!(&log, "failed to collect exit status: {e:?}"), | ||
| Ok(s) => info!(&log, "child exited with code: {:?}", s.code()), | ||
| } | ||
| let _ = tx.send(HardwareUpdate::TofinoUnavailable); |
There was a problem hiding this comment.
This is the main behavioral fix in the PR. Prior to this change, this tx.send(_) went directly to sled-agent's HardwareMonitor and bypassed the hardware_tracking_task() thread created in this module, leaving open a window where hardware_tracking_task() might never see the Tofino be unavailable (and therefore never send an update back to HardwareMonitor that it became available again).
Now, we still send a change here that HardwareMonitor picks up, but it's in a watch channel we share with hardware_tracking_task(), so the next time it runs, it will see that we marked the Tofino as unavailable - if it's become available again, it will correctly update the channel contents and fire another .changed() event to HardwareMonitor.
There was a problem hiding this comment.
makes sense, a watch channel seems like a more correct primitive here.
| self.raw_disks_tx | ||
| .add_or_update_raw_disk(disk.into(), &self.log); | ||
| } | ||
| HardwareUpdate::DiskRemoved(disk) => { |
There was a problem hiding this comment.
Losing the fine-grained updates is a bit of a bummer - now we only get notified when the entire set of (tofino, all_disks) changes in any way, so we go through the more heavyweight check_latest_hardware_snapshot() on any change. But (a) I think it's easier to understand, (b) the methods check_latest_hardware_snapshot() talks to are already themselves prepared to handle "nothing actually changed - don't do anything", and (c) we expect this event to fire very infrequently.
There was a problem hiding this comment.
Agreed, the performance hit here seems really low (since changes are so uncommon) and the cost of missing a change is much worse.
| tofino: polled_tofino, | ||
| disks: polled_disks, | ||
| baseboard: polled_baseboard, | ||
| } = polled_hw.clone(); |
There was a problem hiding this comment.
Could we destructure polled_hw without cloning, and log these inner fields directly down below?
(I think that would help us skip an unnecessary clone)
There was a problem hiding this comment.
I thought I needed the clone because the send_if_modified() takes ownership of these, but it only does that if there have actually been changes. 2c4d11b removes this clone, and adds some clones inside send_if_modified() (only in the "there are changes" path).
| Err(e) => error!(&log, "failed to collect exit status: {e:?}"), | ||
| Ok(s) => info!(&log, "child exited with code: {:?}", s.code()), | ||
| } | ||
| let _ = tx.send(HardwareUpdate::TofinoUnavailable); |
There was a problem hiding this comment.
makes sense, a watch channel seems like a more correct primitive here.
| states[index].present = true; | ||
| true | ||
| } | ||
| Operation::Remove(index) => { |
There was a problem hiding this comment.
Was this dead code? I was looking for a caller constructing an Operation::Remove but couldn't find one
There was a problem hiding this comment.
No, this was constructed by a proptest, but since we got rid of remove_raw_disk() there's no way to implement this branch (nor any reason to try to proptest explicit removals).
| self.raw_disks_tx | ||
| .add_or_update_raw_disk(disk.into(), &self.log); | ||
| } | ||
| HardwareUpdate::DiskRemoved(disk) => { |
There was a problem hiding this comment.
Agreed, the performance hit here seems really low (since changes are so uncommon) and the cost of missing a change is much worse.
|
Testing this on dublin looks good. sled-agent startup shows us finding the tofino and all the disks: Later, I power cycled the Tofino. We see this sequence:
|
This is an attempt to make hardware monitoring somewhat more level-triggered and somewhat less edge-triggered. In #10187, we had a case where the two different sources of
HardwareUpdatenotifications got out of sync, resulting in the polling source failing to send aTofinoAvailableupdate even though it realized the Tofino was, in fact, available.On this branch, we keep the current
HardwareViewin a watch channel; any changes made to it will result in a.changed()notification firing, removing the possibility of mismatched updates.(Hopefully) fixes #10187 - I'll get this on a racklette for some testing before merging.