
execution driver for SP updates #7977

Merged
davepacheco merged 22 commits into main from dap/mgs-updater on Apr 16, 2025

Conversation

@davepacheco
Collaborator

Depends on #7975. See #7976 for the bigger picture.

This PR adds an MgsUpdateDriver:

  • input: a watch::Receiver<PendingMgsUpdates> (PendingMgsUpdates being the struct added to blueprints in #7975, "add information about SP-related updates to blueprint", to describe the set of pending SP updates that happen via MGS)
  • input: some helpers (for getting MGS backends, artifacts, etc.)
  • exposes: watch::Receiver<DriverStatus> which shows the status of ongoing updates
  • behavior: performs SP updates so that live SP state matches the configured PendingMgsUpdates

This also adds reconfigurator-sp-updater, an interactive REPL for playing around with this.
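
To make that interface concrete, here is a minimal, self-contained sketch of the watch-in/watch-out shape described above. The PendingMgsUpdates and DriverStatus types below are toy stand-ins (the real ones come from #7975 and this PR), and the driver body is a placeholder loop, not the actual MgsUpdateDriver logic:

    // Toy stand-ins for the real types; the actual PendingMgsUpdates and
    // DriverStatus in this PR look different.
    use tokio::sync::watch;

    #[derive(Clone, Debug, Default)]
    struct PendingMgsUpdates(Vec<String>);

    #[derive(Clone, Debug, Default)]
    struct DriverStatus {
        in_progress: usize,
    }

    // Toy driver loop: snapshot the config, "drive" it, publish status, and
    // wake up whenever the planner changes the configuration.
    async fn run_toy_driver(
        mut pending_rx: watch::Receiver<PendingMgsUpdates>,
        status_tx: watch::Sender<DriverStatus>,
    ) {
        loop {
            let pending = pending_rx.borrow_and_update().clone();
            let _ = status_tx.send(DriverStatus { in_progress: pending.0.len() });
            if pending_rx.changed().await.is_err() {
                break; // config sender dropped; shut down
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (pending_tx, pending_rx) =
            watch::channel(PendingMgsUpdates::default());
        let (status_tx, mut status_rx) = watch::channel(DriverStatus::default());
        tokio::spawn(run_toy_driver(pending_rx, status_tx));

        // A consumer can change the configuration and watch the status evolve.
        pending_tx
            .send(PendingMgsUpdates(vec![String::from("one pending update")]))
            .unwrap();
        while status_rx.borrow_and_update().in_progress == 0 {
            status_rx.changed().await.unwrap();
        }
        println!("{:?}", *status_rx.borrow());
    }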

@davepacheco
Collaborator Author

Testing for this was all manual, using reconfigurator-sp-updater. I want to add some automated tests but that will require work to sp-sim (#7913).

Test plan:

  • basic success case: completed update
  • other success case: no-action-needed case
  • other success case: took over an update started by another updater
  • other success case: watched an existing one (involves concurrent update attempts)
  • precondition failed (fails fast, reports error, retries on the next round)
  • failure to fetch artifact (reports error, retries on the next round)
  • induced failures
    • reset SP in the middle of this (correctly detected and reported, retried on the next round)
    • have an update get stuck (requiring abort) (correctly detected and reported, retried on the next round)
    • pstop MGS in the middle of this (reports timeout, tries the other one)
    • restart MGS in the middle of this (reports problem, retries)

I also wanted to test having our update aborted at the SP in the middle but this proved hard. I wasn't able to get it to do anything wrong but I'm not convinced it ever really saw this condition. I tried to test this by pstop'ing both MGS's and the updater, then aborting the update, but after that, the SP reported the update as complete (even via faux-mgs update-status). If that's what the updater saw, then it would have just waited for it and looked at it as a success.

One thing I observed in all this is that because of the preconditions, if something goes wrong that leaves the inactive slot in a bad state, the preconditions need to change. But that won't happen automatically. That will involve a reconfigurator planner lap to fix. This isn't awesome but I think it's the right tradeoff because the state of the SP in this case is indistinguishable from being in the middle of a successful update. I guess we could consider relaxing the "inactive slot" constraint to say that if it's invalid for an extended period we ignore that precondition? But I don't think this problem is bad enough to warrant that.

Contributor
@jgallagher left a comment

This looks great. Mostly nits, with one more serious concern about cancel-safety.

Comment on lines +255 to +260
    fn precheck<'a>(
        &'a self,
        log: &'a slog::Logger,
        mgs_clients: &'a mut MgsClients,
        update: &'a PendingMgsUpdate,
    ) -> BoxFuture<'a, Result<PrecheckStatus, PrecheckError>>;
Contributor

I think we don't need to box these anymore?

Suggested change
    fn precheck<'a>(
        &'a self,
        log: &'a slog::Logger,
        mgs_clients: &'a mut MgsClients,
        update: &'a PendingMgsUpdate,
    ) -> BoxFuture<'a, Result<PrecheckStatus, PrecheckError>>;

    fn precheck(
        &self,
        log: &slog::Logger,
        mgs_clients: &mut MgsClients,
        update: &PendingMgsUpdate,
    ) -> impl Future<Output = Result<PrecheckStatus, PrecheckError>>;

This would make the trait no longer dyn compatible, so the helper functions in driver_update.rs would have to take a generic T: SpComponentUpdateHelper instead of a &dyn SpComponentUpdateHelper, but that seems okay? The implementors would get to write

    async fn precheck(
        &self,
        log: &slog::Logger,
        mgs_clients: &mut MgsClients,
        update: &PendingMgsUpdate,
    ) -> Result<PrecheckStatus, PrecheckError> {

instead of having to worry about async move { ... }.boxed(). (The latter is the only real concrete reason I'd suggest this; I don't think in practice the boxing vs generics matters much.)
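
For reference, a self-contained sketch of what that shape would look like; the types here are toy stand-ins for the ones in this PR, and do_precheck is a hypothetical helper, not something from driver_update.rs:

    // With async fn in the trait, the trait is no longer dyn-compatible, so
    // helpers take a generic T: SpComponentUpdateHelper rather than a trait
    // object. Toy stand-in types below.
    struct MgsClients;
    struct PendingMgsUpdate;
    struct PrecheckStatus;
    struct PrecheckError;

    trait SpComponentUpdateHelper {
        // Desugars to `-> impl Future<...>`; implementors get to write plain
        // async fn bodies instead of `async move { ... }.boxed()`.
        async fn precheck(
            &self,
            log: &slog::Logger,
            mgs_clients: &mut MgsClients,
            update: &PendingMgsUpdate,
        ) -> Result<PrecheckStatus, PrecheckError>;
    }

    // A helper in the style of driver_update.rs would then be generic:
    async fn do_precheck<T: SpComponentUpdateHelper>(
        helper: &T,
        log: &slog::Logger,
        mgs_clients: &mut MgsClients,
        update: &PendingMgsUpdate,
    ) -> Result<PrecheckStatus, PrecheckError> {
        helper.precheck(log, mgs_clients, update).await
    }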

Collaborator Author

I slightly prefer trait objects here, though I don't feel strongly about it. I'm happy to revisit this later.

    let caboose = mgs_clients
        .try_all_serially(log, move |mgs_client| async move {
            mgs_client
                .sp_component_caboose_get(
Contributor

I think this is fine and/or unavoidable given the current MGS API, but do we care or are we concerned about the multiple MGS checks in this function not being collected atomically? (E.g., it's impractical but not technically impossible that the device changes after we check it above but before we do this check?)

Collaborator Author

I believe the behavior should be okay:

  • I think any case where the caller is taking no action is fine because it will be re-checked again later
    • this covers the error cases and PrecheckStatus::UpdateComplete
  • That leaves PrecheckStatus::ReadyForUpdate. The process was designed so that two updaters could come in and try to proceed at the same time: only one will be able to create the update, and the other will wait for it to complete.

It wouldn't hurt to model this more formally, particularly for more exotic sequences (e.g., a whole other update completes between doing precheck() and starting the update). I expect it's possible for two consumers with different configs to wind up bouncing back and forth a bit, but as long as both are updating their configs, they will converge on the new thing.

Certainly it would be nice if the SP provided a richer API (e.g., returning everything all at once, and allowing updates to be started (and resets done) conditional on the current state).

Collaborator Author

A potentially worse case is the use of precheck() to determine when the update is done. You could imagine:

  • A and B do precheck
  • A starts update
  • B sees it, enters wait_for_update_done(), invoking precheck() in a loop
  • A goes out to lunch
  • B takes over after 2 minutes and decides to send the reset
  • C comes in and starts doing an update
  • B sends the reset

You could definitely wind up having some transient failures and disruptions. But I believe this should always converge because we handle things like the SP resetting during the update.

    }
    MgsClients::from_clients(backends.iter().map(
        |(backend_name, backend)| {
            gateway_client::Client::new(
Contributor

I vaguely remember progenitor clients being relatively expensive to construct. Should the qorb pool be handing out already-constructed (and cached-in-the-pool) clients instead of just the addresses?

Collaborator Author

I'd like to take this as a follow-up.

@smklein Right now I'm having this thing accept a watch::Receiver<AllBackends> (what qorb resolvers provide) and then constructing the clients each time this function is called. Is there some other interface I could be using here to maintain a pool of clients?
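
For concreteness, a hedged sketch of one possible follow-up shape: cache constructed clients keyed by backend name and reconcile the cache whenever the resolver snapshot changes. The BackendName/Backend types and the make_client factory below are simplified stand-ins, not the actual qorb or gateway_client APIs:

    use std::collections::BTreeMap;
    use std::net::SocketAddr;

    type BackendName = String; // stand-in for the qorb backend name type
    struct Backend {
        address: SocketAddr, // stand-in for the qorb backend type
    }

    struct ClientCache<C> {
        clients: BTreeMap<BackendName, C>,
        make_client: fn(&SocketAddr) -> C,
    }

    impl<C> ClientCache<C> {
        fn new(make_client: fn(&SocketAddr) -> C) -> Self {
            ClientCache { clients: BTreeMap::new(), make_client }
        }

        // Drop clients for backends that disappeared; construct clients only
        // for backends we haven't seen before.
        fn update(&mut self, backends: &BTreeMap<BackendName, Backend>) {
            self.clients.retain(|name, _| backends.contains_key(name));
            let make_client = self.make_client;
            for (name, backend) in backends {
                self.clients
                    .entry(name.clone())
                    .or_insert_with(|| make_client(&backend.address));
            }
        }

        fn clients(&self) -> impl Iterator<Item = &C> {
            self.clients.values()
        }
    }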

Base automatically changed from dap/sp-blueprint to main April 16, 2025 04:26
@davepacheco davepacheco merged commit e652934 into main Apr 16, 2025
18 checks passed
@davepacheco davepacheco deleted the dap/mgs-updater branch April 16, 2025 05:28