Conversation
Testing for this was all manual, using this test plan: …

I also wanted to test having our update aborted at the SP in the middle, but this proved hard. I wasn't able to get it to do anything wrong, but I'm not convinced it ever really saw this condition. I tried to test this by pstop'ing both MGS's and the updater, then aborting the update, but after that, the SP reported the update as complete (even via …).

One thing I observed in all this: because of the preconditions, if something goes wrong that leaves the inactive slot in a bad state, the preconditions need to change. But that won't happen automatically; it will take a reconfigurator planner lap to fix. This isn't awesome, but I think it's the right tradeoff, because the state of the SP in this case is indistinguishable from being in the middle of a successful update. I guess we could consider relaxing the "inactive slot" constraint to say that if it's invalid for an extended period we ignore that precondition? But I don't think this problem is bad enough to warrant that.
jgallagher left a comment:
This looks great. Mostly nits, with one more serious concern about cancel-safety.
```rust
fn precheck<'a>(
    &'a self,
    log: &'a slog::Logger,
    mgs_clients: &'a mut MgsClients,
    update: &'a PendingMgsUpdate,
) -> BoxFuture<'a, Result<PrecheckStatus, PrecheckError>>;
```
I think we don't need to box these anymore?
Suggested change:

```diff
-fn precheck<'a>(
-    &'a self,
-    log: &'a slog::Logger,
-    mgs_clients: &'a mut MgsClients,
-    update: &'a PendingMgsUpdate,
-) -> BoxFuture<'a, Result<PrecheckStatus, PrecheckError>>;
+fn precheck(
+    &self,
+    log: &slog::Logger,
+    mgs_clients: &mut MgsClients,
+    update: &PendingMgsUpdate,
+) -> impl Future<Output = Result<PrecheckStatus, PrecheckError>>;
```
This would make the trait no longer dyn compatible, so the helper functions in `driver_update.rs` would have to take a generic `T: SpComponentUpdateHelper` instead of a `&dyn SpComponentUpdateHelper`, but that seems okay? The implementors would get to write
```rust
async fn precheck(
    &self,
    log: &slog::Logger,
    mgs_clients: &mut MgsClients,
    update: &PendingMgsUpdate,
) -> Result<PrecheckStatus, PrecheckError> {
```

instead of having to worry about `async move { ... }.boxed()`. (The latter is the only real concrete reason I'd suggest this; I don't think in practice the boxing vs. generics matters much.)
I slightly prefer trait objects here, though I don't feel strongly about it. I'm happy to revisit this later.
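For illustration only, here's a rough sketch (not the actual `driver_update.rs` code) of what a helper taking a generic `T: SpComponentUpdateHelper` would look like if we went the non-boxed route; the function name and body are hypothetical:

```rust
// Hypothetical helper: generic over the update helper rather than taking a
// `&dyn SpComponentUpdateHelper`, which dropping the boxed future would
// require since the trait would no longer be dyn-compatible. Assumes the
// crate's existing types (MgsClients, PendingMgsUpdate, etc.) are in scope.
async fn precheck_until_ready<T: SpComponentUpdateHelper>(
    helper: &T,
    log: &slog::Logger,
    mgs_clients: &mut MgsClients,
    update: &PendingMgsUpdate,
) -> Result<PrecheckStatus, PrecheckError> {
    // In the real driver this would loop/retry; a single call keeps the
    // sketch focused on the signature change.
    helper.precheck(log, mgs_clients, update).await
}
```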
```rust
let caboose = mgs_clients
    .try_all_serially(log, move |mgs_client| async move {
        mgs_client
            .sp_component_caboose_get(
```
I think this is fine and/or unavoidable given the current MGS API, but do we care that the multiple MGS checks in this function aren't collected atomically? (E.g., it's impractical but not technically impossible that the device changes after we check it above but before we do this check.)
I believe the behavior should be okay:

- I think any case where the caller is taking no action is fine, because it will be re-checked again later. This covers the error cases and `PrecheckStatus::UpdateComplete`.
- That leaves `PrecheckStatus::ReadyForUpdate`. The process was intended so that two updaters could come in and try to proceed at the same time: only one will be able to create the update, and the other will wait for it to complete.

It wouldn't hurt to model this more formally, particularly for more exotic sequences (e.g., a whole other update completes between doing `precheck()` and starting the update). I expect it's possible for two consumers with different configs to wind up bouncing back and forth a bit, but as long as both are updating their configs, they will converge to the new thing.

Certainly if the SP provided a richer API (e.g., return everything all at once, and also be able to start updates (and do resets) conditional on the current state), that'd be nice.
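For illustration only, a rough sketch of what such a richer, conditional SP API might look like; none of these types exist in MGS today, and every name here is made up:

```rust
// Hypothetical types only -- nothing like this exists in the MGS API today.

/// Everything precheck() currently gathers across several MGS calls,
/// returned atomically in one response.
pub struct SpUpdateSnapshot {
    pub active_slot_version: String,
    pub inactive_slot_version: Option<String>,
    pub pending_persistent_boot_preference: Option<u16>,
    pub update_in_progress: Option<String>,
}

/// Start an update only if the SP still matches the snapshot we observed
/// (roughly a compare-and-swap); resets could be made conditional the same way.
pub struct ConditionalUpdateRequest {
    pub expected: SpUpdateSnapshot,
    pub artifact_hash: String,
    pub target_slot: u16,
}

pub enum ConditionalUpdateResponse {
    Started { update_id: String },
    /// The SP's state no longer matched `expected`; the caller should re-check.
    PreconditionFailed { actual: SpUpdateSnapshot },
}
```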
A potentially worse case is the use of `precheck()` to determine when the update is done. You could imagine:

- A and B do precheck
- A starts the update
- B sees it, enters `wait_for_update_done()`, invoking `precheck()` in a loop
- A goes out to lunch
- B takes over after 2 minutes and decides to send the reset
- C comes in and starts doing an update
- B sends the reset

You could definitely wind up having some transient failures and disruptions. But I believe this should always converge, because we handle things like the SP resetting during the update.
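A rough sketch of the wait/takeover shape described above, with hypothetical names and a made-up 2-minute timeout; the real `driver_update.rs` logic may differ:

```rust
use std::time::{Duration, Instant};

// Hypothetical timeout; the real code's value and structure may differ.
const TAKEOVER_TIMEOUT: Duration = Duration::from_secs(120);

// Sketch of "B" waiting on an update that "A" started: poll precheck() in a
// loop, and if the update hasn't completed within TAKEOVER_TIMEOUT, take over
// and finish it ourselves (including the reset) -- which is where the race
// with a third updater ("C") can appear.
async fn wait_for_update_done<T: SpComponentUpdateHelper>(
    helper: &T,
    log: &slog::Logger,
    mgs_clients: &mut MgsClients,
    update: &PendingMgsUpdate,
) -> Result<(), PrecheckError> {
    let started = Instant::now();
    loop {
        if let Ok(PrecheckStatus::UpdateComplete) =
            helper.precheck(log, mgs_clients, update).await
        {
            return Ok(());
        }
        if started.elapsed() > TAKEOVER_TIMEOUT {
            // The other updater appears stuck ("out to lunch"); stop waiting.
            break;
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
    // ...take over: finish the update ourselves and send the reset (elided).
    Ok(())
}
```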
```rust
}
MgsClients::from_clients(backends.iter().map(
    |(backend_name, backend)| {
        gateway_client::Client::new(
```
I vaguely remember progenitor clients being relatively expensive to construct. Should the qorb pool be handing out already-constructed (and cached-in-the-pool) clients instead of just the addresses?
I'd like to take this as a follow-up.
@smklein Right now I'm having this thing accept a `watch::Receiver<AllBackends>` (what qorb resolvers provide) and then constructing the clients each time this function is called. Is there some other interface I could be using here to maintain a pool of clients?
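Not a real qorb interface, just a sketch of one way a cache of constructed clients could work, rebuilt from the resolver's backend map on each call; `AllBackends` is approximated here by a plain map, and the client type and constructor are left generic since their exact signatures aren't shown above:

```rust
use std::collections::BTreeMap;
use std::net::SocketAddr;

use tokio::sync::watch;

// Sketch only: cache constructed clients keyed by backend name, and rebuild
// only the entries that changed since the last call.
struct ClientCache<C> {
    backends_rx: watch::Receiver<BTreeMap<String, SocketAddr>>,
    make_client: Box<dyn Fn(SocketAddr) -> C + Send + Sync>,
    clients: BTreeMap<String, C>,
}

impl<C: Clone> ClientCache<C> {
    /// Reconcile the cache against the latest backends and return the clients.
    fn current(&mut self) -> Vec<C> {
        let backends = self.backends_rx.borrow().clone();
        // Drop clients whose backends went away.
        self.clients.retain(|name, _| backends.contains_key(name));
        // Construct clients only for backends we haven't seen before.
        for (name, addr) in &backends {
            self.clients
                .entry(name.clone())
                .or_insert_with(|| (self.make_client)(*addr));
        }
        self.clients.values().cloned().collect()
    }
}
```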
Depends on #7975. See #7976 for the bigger picture.
This PR adds an `MgsUpdateDriver`:

- a `watch::Receiver<PendingMgsUpdates>` (with `PendingMgsUpdates` being the struct added to blueprints in "add information about SP-related updates to blueprint" #7975) describing a set of pending SP updates (that happen via MGS)
- a `watch::Receiver<DriverStatus>` showing the status of ongoing updates

This also adds an interactive REPL, `reconfigurator-sp-updater`, for playing around with this.
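For a rough sense of how a consumer might use those channels (a sketch only; the driver's real constructor isn't shown here, and this assumes `DriverStatus` implements `Debug`):

```rust
use tokio::sync::watch;

// Rough consumer-side sketch of the watch-channel plumbing described above.
// A planner (or the reconfigurator-sp-updater REPL) would send new
// `PendingMgsUpdates` values on the corresponding `watch::Sender`, and the
// driver would publish progress on this status channel.
async fn watch_driver_status(mut status_rx: watch::Receiver<DriverStatus>) {
    while status_rx.changed().await.is_ok() {
        let status = status_rx.borrow();
        println!("driver status updated: {:?}", *status);
    }
}
```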