Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reconfigurator] turn expunged sleds into expunged zones #5493

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Apr 10, 2024

It actually looks like this bit falls out nicely. Turns out we were already
disregarding Nexus zones on expunged sleds, so more of it works than I thought
it did!

Still to do:

  • Fix and add to blueprint diff tests
  • Split up handling of internal and external IPs while constructing a BlueprintBuilder
  • Add tests for this

Depends on #5488 and #5555.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
@sunshowers sunshowers marked this pull request as draft April 10, 2024 09:03
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers marked this pull request as ready for review April 12, 2024 06:04
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-turn-expunged-sleds-into-expunged-zones to main April 12, 2024 06:08
@sunshowers sunshowers marked this pull request as draft April 14, 2024 06:32
Created using spr 1.3.6-beta.1
@morlandi7 morlandi7 added this to the 8 milestone Apr 16, 2024
Created using spr 1.3.6-beta.1
@@ -353,30 +366,26 @@ impl<'a> BlueprintBuilder<'a> {
// of Nexus instances), but wouldn't be ideal if we have many resources
// we need to skip. We could do something smarter here based on the sets
// of used resources we built above if needed.
let nexus_v4_ips = Box::new(
let nexus_v4_ips = AvailableIterator::new(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up not using the extra functionality provided by AvailableIterator, but this will be essential for testing #5552.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the extra functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets you look at which resources were considered in-use at the time the iterator was constructed.

@sunshowers sunshowers marked this pull request as ready for review April 17, 2024 20:27
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

nexus/reconfigurator/planning/src/blueprint_builder.rs Outdated Show resolved Hide resolved
nexus/reconfigurator/planning/src/planner.rs Outdated Show resolved Hide resolved
nexus/types/src/deployment/planning_input.rs Outdated Show resolved Hide resolved
nexus/types/src/deployment/planning_input.rs Outdated Show resolved Hide resolved
///
/// This method does not check whether the sled actually should be expunged
/// -- that is the responsibility of higher-level code.
pub fn build_pending_expunge_for_sled(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this separated into two steps? I'm wondering if that makes it possible to misuse it by accident.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question -- the main reason is that I wanted to be able to unify the set of zones that should be expunged across all the possible sources of zone expungement, so it can be done in one go. The current ones are:

  • sled expungement => all zones on the sled
  • disk expungement => crucible zones backed by the disk

So the idea would be: first, build up the set of zones that should be expunged, then actually perform the expungement. It's a pattern I've generally found to be very good for introspection and testing against.

re misuse, one way is to add #[must_use] to a pending expunge, and I'll do so. But in general this should be caught pretty quickly by tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'd get the same behavior from a function that was just let summary = builder.expunge_sled(sled_id, reason) because as far as I can tell, the caller always:

  • calls build_pending_expunge_for_sled() 0 or more times
  • for each returned value, call apply_pending_expunge

i.e., there's no merging or other processing of the returned values before calling apply().

I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.

It makes sense to use #[must_use] to ensure that the caller doesn't forget to call apply(). There's also the risk that they try to call it more than once. I think this is impossible today because apply() takes ownership and PendingExpunge doesn't impl Clone (though I can see somebody wanting to add Clone for some other reason and then introducing this risk). I think what I'm most worried about is actually that separating the computation of PendingExpunge from applying it means there's a window in which a caller could do any number of other operations that invalidate the PendingExpunge. What if they use other BlueprintBuilder functions to deploy more zones to that sled in the meantime? Or use sled_ensure_disks() or something like that. Or it might be easy to do something like:

// there are 3 nexus zones, one of which is on $sled_id
builder.build_pending_expunge_for_sled($sled_id, ...)
// go calculate how many Nexus zones should exist and erroneously include the one on $sled_id because it hasn't actually been expunged yet
builder.apply_pending_expunge()

Of course we can add validation to various functions to prevent this stuff but it just seems more complicated than structurally ensuring that external callers can't do anything in the meantime because they have no opportunity to.

That's my pitch but this isn't that important either way and I'm fine if you want to keep it as is!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.

I think I would vote for this, with the caveat that it doesn't look like we currently have any tests these two in isolation, so maybe the split might be premature even for this case vs joining them into one function that returns a summary as you suggested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e., there's no merging or other processing of the returned values before calling apply().

So... this is how the code is today, but that's only because the only source of zone expungement is sled expungement. In the very near future (release 9 I believe), we'll have a second source of zone expungement: individual disk expungement. In that case, it is possible that a zone is expunged for multiple reasons -- e.g. for a crucible zone, more than one of the backing disks was expunged.

In that case, there are two ways to address this:

  1. Compute and apply each expungement directly to the builder, relying on the fact that expungement is idempotent.
  2. Build a set of pending expunges, merge all of them building up the reasons for each, and then apply all of them together.

From experience, absent other overriding concerns the solution I generally tend to gravitate towards in these situations is 2. 1 is definitely harder to misuse, but 2 allows for many things, including dry-runs and easier testing. Many, many times in my life I've rewritten code from 1 to 2, and so I just start from pattern 2 these days. (In this case, one of the immediate things pattern 2 does is make us able to produce better comments for the blueprint.)


But let me try and make the case along the same lines for pattern 1:

  • A blueprint itself is pattern 2, a dry-run. A blueprint builder is a dry-run of a dry-run. This is a dry-run^3, which maybe is too much dry-run.
  • Maybe the solution to providing better comments is to actually do the merging within the comment layer. So instead of storing raw strings, we store an operation log that we can merge intelligently, and then resolve to the list of comments while finishing up the blueprint.

That is on top of the undeniable argument that pattern 1 is harder to misuse. So given that both of you feel this way, I think I'll switch to pattern 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched to pattern 1 in this case, though (as outlined above) I think it'll have some repercussions down the road that we'll have to deal with.

@@ -353,30 +366,26 @@ impl<'a> BlueprintBuilder<'a> {
// of Nexus instances), but wouldn't be ideal if we have many resources
// we need to skip. We could do something smarter here based on the sets
// of used resources we built above if needed.
let nexus_v4_ips = Box::new(
let nexus_v4_ips = AvailableIterator::new(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the extra functionality?

nexus/reconfigurator/planning/src/blueprint_builder.rs Outdated Show resolved Hide resolved
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the base branch from main to sunshowers/spr/main.reconfigurator-turn-expunged-sleds-into-expunged-zones April 18, 2024 08:13
@sunshowers
Copy link
Contributor Author

Want to add a few more tests tomorrow, but hopefully we can get some cycles going with this.

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
nexus/reconfigurator/planning/src/planner.rs Outdated Show resolved Hide resolved
///
/// This method does not check whether the sled actually should be expunged
/// -- that is the responsibility of higher-level code.
pub fn build_pending_expunge_for_sled(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.

I think I would vote for this, with the caveat that it doesn't look like we currently have any tests these two in isolation, so maybe the split might be premature even for this case vs joining them into one function that returns a summary as you suggested?

self.zones.iter_mut().find(|z| z.zone.id == *zone_id)
{
// Just check that the zone is still expungeable.
is_already_expunged(&zone.zone, zone.state)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't entirely follow what this is guarding here. In particular, what would a return value of Ok(true) mean? We already expunged the zone but forgot to mark the state as Modified somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is making sure that any zones passed in at this level are in a state where marking them as expunged is valid. This is an unavoidable check that happens a second time. The list of zones to expunge has to be passed in -- it cannot be constructed here, because a BuilderZonesConfig is only created after we've decided to change the sled. If there are no zones to expunge then a BuilderZonesConfig shouldn't be created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but am I right that Ok(true) would be a nonsensical return value? Would it make sense to separate "validate that this zone is expungeable" from "has the zone already been marked expunged", and only call the first one here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok(true) indicates a no-op (idempotence). I changed the definition of is_already_expunged slightly, and especially with this new definition -- but even with the old one -- I don't think it makes sense to split them out. The simplest definition of a validate_expungeable function would be to just call is_already_expunged()?; and drop the return value.

(I went through this series of arguments myself while writing this code. Options I considered included renaming is_already_expunged to also include validation -- I decided that was too wordy, and already indicated by the fact that it returns a Result.)

sunshowers added a commit that referenced this pull request Apr 19, 2024
In #5493 we'd like to track invariants like: a zone should not be added
and expunged in the same blueprint. In order to do that, we need to track
this state. (There are probably other ways to do it, but this is the most
explicit method and I really like that.)

This lives in a submodule because I don't want the rest of the blueprint
builder to reach into the internals here.

I split this from #5493 because it became somewhat complex in its own
right, with its own tests.
labbott and others added 2 commits April 19, 2024 16:22
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-turn-expunged-sleds-into-expunged-zones to main April 19, 2024 23:38
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @sunshowers. I tested this on a4x2 and it seems to properly mark zones as expunged. I suggest we get it in asap and build upon it.

@sunshowers
Copy link
Contributor Author

Thanks @andrewjstone! Going to land this and we can take care of any remaining comments in followups.

@sunshowers sunshowers merged commit ffd72fd into main Apr 20, 2024
26 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/reconfigurator-turn-expunged-sleds-into-expunged-zones branch April 20, 2024 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants