
execution driver for SP updates #7977

Merged
davepacheco merged 22 commits into main from dap/mgs-updater on Apr 16, 2025

Conversation

@davepacheco
Collaborator

Depends on #7975. See #7976 for the bigger picture.

This PR adds an MgsUpdateDriver:

  • input: a watch::Receiver<PendingMgsUpdates> (PendingMgsUpdates being the struct added to blueprints in #7975, "add information about SP-related updates to blueprint", to describe the set of pending SP updates that happen via MGS)
  • input: some helpers (for getting MGS backends, artifacts, etc.)
  • exposes: watch::Receiver<DriverStatus> which shows the status of ongoing updates
  • behavior: performs SP updates so that live SP state matches the configured PendingMgsUpdates

This also adds reconfigurator-sp-updater, an interactive REPL for playing around with this.
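
To make that interface concrete, here is a minimal, self-contained sketch of the watch-in/watch-out shape described above. The PendingMgsUpdates and DriverStatus types below are toy stand-ins (the real ones come from #7975 and this PR), and the driver body is a placeholder loop, not the actual MgsUpdateDriver logic:

    // Toy stand-ins for the real types; the actual PendingMgsUpdates and
    // DriverStatus in this PR look different.
    use tokio::sync::watch;

    #[derive(Clone, Debug, Default)]
    struct PendingMgsUpdates(Vec<String>);

    #[derive(Clone, Debug, Default)]
    struct DriverStatus {
        in_progress: usize,
    }

    // Toy driver loop: snapshot the config, "drive" it, publish status, and
    // wake up whenever the planner changes the configuration.
    async fn run_toy_driver(
        mut pending_rx: watch::Receiver<PendingMgsUpdates>,
        status_tx: watch::Sender<DriverStatus>,
    ) {
        loop {
            let pending = pending_rx.borrow_and_update().clone();
            let _ = status_tx.send(DriverStatus { in_progress: pending.0.len() });
            if pending_rx.changed().await.is_err() {
                break; // config sender dropped; shut down
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (pending_tx, pending_rx) =
            watch::channel(PendingMgsUpdates::default());
        let (status_tx, mut status_rx) = watch::channel(DriverStatus::default());
        tokio::spawn(run_toy_driver(pending_rx, status_tx));

        // A consumer can change the configuration and watch the status evolve.
        pending_tx
            .send(PendingMgsUpdates(vec![String::from("one pending update")]))
            .unwrap();
        while status_rx.borrow_and_update().in_progress == 0 {
            status_rx.changed().await.unwrap();
        }
        println!("{:?}", *status_rx.borrow());
    }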

@davepacheco
Collaborator Author

Testing for this was all manual, using reconfigurator-sp-updater. I want to add some automated tests but that will require work to sp-sim (#7913).

Test plan:

  • basic success case: completed update
  • other success case: no-action-needed case
  • other success case: took over an update started by another updater
  • other success case: watched an existing one (involves concurrent update attempts)
  • precondition failed (fails fast, reports error, retries on the next round)
  • failure to fetch artifact (reports error, retries on the next round)
  • induced failures
    • reset SP in the middle of this (correctly detected and reported, retried on the next round)
    • have an update get stuck (requiring abort) (correctly detected and reported, retried on the next round)
    • pstop MGS in the middle of this (reports timeout, tries the other one)
    • restart MGS in the middle of this (reports problem, retries)

I also wanted to test having our update aborted at the SP in the middle but this proved hard. I wasn't able to get it to do anything wrong but I'm not convinced it ever really saw this condition. I tried to test this by pstop'ing both MGS's and the updater, then aborting the update, but after that, the SP reported the update as complete (even via faux-mgs update-status). If that's what the updater saw, then it would have just waited for it and looked at it as a success.

One thing I observed in all this is that because of the preconditions, if something goes wrong that leaves the inactive slot in a bad state, the preconditions need to change. But that won't happen automatically. That will involve a reconfigurator planner lap to fix. This isn't awesome but I think it's the right tradeoff because the state of the SP in this case is indistinguishable from being in the middle of a successful update. I guess we could consider relaxing the "inactive slot" constraint to say that if it's invalid for an extended period we ignore that precondition? But I don't think this problem is bad enough to warrant that.

Contributor
@jgallagher left a comment

This looks great. Mostly nits, with one more serious concern about cancel-safety.

Comment on lines +255 to +260
    fn precheck<'a>(
        &'a self,
        log: &'a slog::Logger,
        mgs_clients: &'a mut MgsClients,
        update: &'a PendingMgsUpdate,
    ) -> BoxFuture<'a, Result<PrecheckStatus, PrecheckError>>;
Contributor

I think we don't need to box these anymore?

Suggested change
    fn precheck<'a>(
        &'a self,
        log: &'a slog::Logger,
        mgs_clients: &'a mut MgsClients,
        update: &'a PendingMgsUpdate,
    ) -> BoxFuture<'a, Result<PrecheckStatus, PrecheckError>>;

    fn precheck(
        &self,
        log: &slog::Logger,
        mgs_clients: &mut MgsClients,
        update: &PendingMgsUpdate,
    ) -> impl Future<Output = Result<PrecheckStatus, PrecheckError>>;

This would make the trait no longer dyn compatible, so the helper functions in driver_update.rs would have to take a generic T: SpComponentUpdateHelper instead of a &dyn SpComponentUpdateHelper, but that seems okay? The implementors would get to write

    async fn precheck(
        &self,
        log: &slog::Logger,
        mgs_clients: &mut MgsClients,
        update: &PendingMgsUpdate,
    ) -> Result<PrecheckStatus, PrecheckError> {

instead of having to worry about async move { ... }.boxed(). (The latter is the only real concrete reason I'd suggest this; I don't think in practice the boxing vs generics matters much.)
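
For reference, a self-contained sketch of what that shape would look like; the types here are toy stand-ins for the ones in this PR, and do_precheck is a hypothetical helper, not something from driver_update.rs:

    // With async fn in the trait, the trait is no longer dyn-compatible, so
    // helpers take a generic T: SpComponentUpdateHelper rather than a trait
    // object. Toy stand-in types below.
    struct MgsClients;
    struct PendingMgsUpdate;
    struct PrecheckStatus;
    struct PrecheckError;

    trait SpComponentUpdateHelper {
        // Desugars to `-> impl Future<...>`; implementors get to write plain
        // async fn bodies instead of `async move { ... }.boxed()`.
        async fn precheck(
            &self,
            log: &slog::Logger,
            mgs_clients: &mut MgsClients,
            update: &PendingMgsUpdate,
        ) -> Result<PrecheckStatus, PrecheckError>;
    }

    // A helper in the style of driver_update.rs would then be generic:
    async fn do_precheck<T: SpComponentUpdateHelper>(
        helper: &T,
        log: &slog::Logger,
        mgs_clients: &mut MgsClients,
        update: &PendingMgsUpdate,
    ) -> Result<PrecheckStatus, PrecheckError> {
        helper.precheck(log, mgs_clients, update).await
    }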

Collaborator Author

I slightly prefer trait objects here, though I don't feel strongly about it. I'm happy to revisit this later.

    let caboose = mgs_clients
        .try_all_serially(log, move |mgs_client| async move {
            mgs_client
                .sp_component_caboose_get(
Contributor

I think this is fine and/or unavoidable given the current MGS API, but do we care or are we concerned about the multiple MGS checks in this function not being collected atomically? (E.g., it's impractical but not technically impossible that the device changes after we check it above but before we do this check?)

Collaborator Author

I believe the behavior should be okay:

  • I think any case where the caller is taking no action is fine because it will be re-checked again later
    • this covers the error cases and PrecheckStatus::UpdateComplete
  • That leaves PrecheckStatus::ReadyForUpdate. The process was designed so that two updaters could come in and try to proceed at the same time: only one will be able to create the update, and the other will wait for it to complete.

It wouldn't hurt to model this more formally, particularly for more exotic sequences (e.g., a whole other update completes between doing precheck() and starting the update). I expect it's possible for two consumers with different configs to wind up bouncing back and forth a bit, but as long as both are updating their configs, they will converge on the new thing.

Certainly it would be nice if the SP provided a richer API (e.g., returning everything all at once, and allowing updates to be started (and resets done) conditional on the current state).

Collaborator Author

A potentially worse case is the use of precheck() to determine when the update is done. You could imagine:

  • A and B do precheck
  • A starts update
  • B sees it, enters wait_for_update_done(), invoking precheck() in a loop
  • A goes out to lunch
  • B takes over after 2 minutes and decides to send the reset
  • C comes in and starts doing an update
  • B sends the reset

You could definitely wind up having some transient failures and disruptions. But I believe this should always converge because we handle things like the SP resetting during the update.

    }
    MgsClients::from_clients(backends.iter().map(
        |(backend_name, backend)| {
            gateway_client::Client::new(
Contributor

I vaguely remember progenitor clients being relatively expensive to construct. Should the qorb pool be handing out already-constructed (and cached-in-the-pool) clients instead of just the addresses?

Collaborator Author

I'd like to take this as a follow-up.

@smklein Right now I'm having this thing accept a watch::Receiver<AllBackends> (what qorb resolvers provide) and then constructing the clients each time this function is called. Is there some other interface I could be using here to maintain a pool of clients?
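
For concreteness, a hedged sketch of one possible follow-up shape: cache constructed clients keyed by backend name and reconcile the cache whenever the resolver snapshot changes. The BackendName/Backend types and the make_client factory below are simplified stand-ins, not the actual qorb or gateway_client APIs:

    use std::collections::BTreeMap;
    use std::net::SocketAddr;

    type BackendName = String; // stand-in for the qorb backend name type
    struct Backend {
        address: SocketAddr, // stand-in for the qorb backend type
    }

    struct ClientCache<C> {
        clients: BTreeMap<BackendName, C>,
        make_client: fn(&SocketAddr) -> C,
    }

    impl<C> ClientCache<C> {
        fn new(make_client: fn(&SocketAddr) -> C) -> Self {
            ClientCache { clients: BTreeMap::new(), make_client }
        }

        // Drop clients for backends that disappeared; construct clients only
        // for backends we haven't seen before.
        fn update(&mut self, backends: &BTreeMap<BackendName, Backend>) {
            self.clients.retain(|name, _| backends.contains_key(name));
            let make_client = self.make_client;
            for (name, backend) in backends {
                self.clients
                    .entry(name.clone())
                    .or_insert_with(|| make_client(&backend.address));
            }
        }

        fn clients(&self) -> impl Iterator<Item = &C> {
            self.clients.values()
        }
    }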

Base automatically changed from dap/sp-blueprint to main April 16, 2025 04:26
@davepacheco davepacheco merged commit e652934 into main Apr 16, 2025
18 checks passed
@davepacheco davepacheco deleted the dap/mgs-updater branch April 16, 2025 05:28