
Conversation

@jgallagher (Contributor) commented on Nov 19, 2024

Builds on and is staged on top of #7104. This introduces a BlueprintStorageEditor that handles disks and datasets together, and attempts to address the API issues described in #7080:

  • Adding a zone also adds its datasets
  • Adding a disk also adds its Debug and Zone Root datasets
  • Expunging a disk expunges any datasets that were on it
  • Expunging a dataset expunges any zones that were on it

This should allow BlueprintBuilder clients who don't or can't call sled_ensure_{disks,datasets} (like reconfigurator-cli and some tests) to construct valid blueprints.

Changes the API style to be more imperative. One behavioral change is
that "generation 1" is only ever the set of empty disks; if any disks
are added, that becomes "generation 2". This is consistent with how we
define the OmicronZonesConfig generation, but is a change from what's on
main.
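
For readers skimming the description, here is a minimal sketch of the coupling described in the bullets above; the types and method names are simplified stand-ins invented for illustration, not the actual BlueprintStorageEditor API:

```rust
// Simplified sketch of the disk/dataset/zone coupling described above.
// All types here are invented stand-ins; the real editor operates on the
// omicron blueprint types.
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ZpoolId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum DatasetKind {
    Debug,
    ZoneRoot,
}

#[derive(Default)]
struct StorageEditor {
    // pool -> datasets on that pool
    datasets: BTreeMap<ZpoolId, Vec<DatasetKind>>,
    // zone id -> pool its filesystem lives on
    zones: BTreeMap<u32, ZpoolId>,
}

impl StorageEditor {
    /// Adding a disk also adds its Debug and Zone Root datasets.
    fn add_disk(&mut self, pool: ZpoolId) {
        self.datasets
            .entry(pool)
            .or_default()
            .extend([DatasetKind::Debug, DatasetKind::ZoneRoot]);
    }

    /// Expunging a disk expunges its datasets and any zones on that pool.
    fn expunge_disk(&mut self, pool: ZpoolId) {
        self.datasets.remove(&pool);
        self.zones.retain(|_, zone_pool| *zone_pool != pool);
    }
}

fn main() {
    let mut editor = StorageEditor::default();
    editor.add_disk(ZpoolId(1));
    editor.zones.insert(100, ZpoolId(1));
    editor.expunge_disk(ZpoolId(1));
    assert!(editor.datasets.is_empty());
    assert!(editor.zones.is_empty());
}
```
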
Comment on lines 859 to 874
for (zone, _) in self.zones.current_sled_zones(
    sled_id,
    BlueprintZoneFilter::ShouldBeRunning,
) {
    if let Some(fs_zpool) = &zone.filesystem_pool {
        if *fs_zpool == expunged_zpool {
            zones_to_expunge.insert(zone.id);
        }
    }
    if let Some(dataset) = zone.zone_type.durable_dataset() {
        if dataset.dataset.pool_name == expunged_zpool {
            zones_to_expunge.insert(zone.id);
        }
    }
}
Collaborator:

Should this be factored out, to an expunge_zones_using_pool function? Seems like it could be a method on self.zones.

Contributor Author:

Hmm, not super easily, I don't think? This gets to storage and zones being separate things, which I'd like to fix but is more work.

I did factor out a chunk of this into a zones_using_zpool() method in 73b209f
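
A rough sketch of what a zones_using_zpool()-style helper looks like, using invented stand-in types rather than the real BlueprintZoneConfig; the actual method added in 73b209f lives in the builder:

```rust
// Illustrative only: stand-in types approximating the zones_using_zpool()
// helper mentioned above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ZpoolName(u32);

struct Zone {
    id: u32,
    filesystem_pool: Option<ZpoolName>,
    durable_pool: Option<ZpoolName>,
}

/// Return the IDs of zones that depend on `pool`, either for their
/// filesystem dataset or for their durable dataset.
fn zones_using_zpool<'a>(
    zones: &'a [Zone],
    pool: ZpoolName,
) -> impl Iterator<Item = u32> + 'a {
    zones.iter().filter_map(move |z| {
        let uses_pool =
            z.filesystem_pool == Some(pool) || z.durable_pool == Some(pool);
        uses_pool.then_some(z.id)
    })
}

fn main() {
    let zones = vec![
        Zone { id: 1, filesystem_pool: Some(ZpoolName(7)), durable_pool: None },
        Zone { id: 2, filesystem_pool: None, durable_pool: Some(ZpoolName(9)) },
    ];
    let hit: Vec<u32> = zones_using_zpool(&zones, ZpoolName(7)).collect();
    assert_eq!(hit, vec![1]);
}
```
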

// Expunging a zpool necessarily requires also expunging any zones that
// depended on it.
for zone_id in zones_to_expunge {
    self.sled_expunge_zone(sled_id, zone_id)?;
Collaborator:

It's a little weird that we could be expunging zones, but that doesn't propagate through the EnsureMultiple result because that's assumed to be referring to "disks", right?

Contributor Author:

Yeah; this got major rework, mostly in 8573d08

    self.sled_expunge_zone(sled_id, zone_id)?;
}

if added == 0 && removed == 0 {
Collaborator:

Suggested change:
-    if added == 0 && removed == 0 {
+    if added == 0 && removed == 0 && updated == 0 {

(Maybe EnsureMultiple should have a constructor that automatically makes the NotNeeded variant if all zeroes are passed)
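
A minimal sketch of the suggested constructor; the variant and field names are assumptions for illustration, not the real EnsureMultiple type:

```rust
// Sketch of the constructor idea above; names are invented stand-ins.
#[derive(Debug, PartialEq, Eq)]
enum EnsureMultiple {
    Changed { added: usize, updated: usize, removed: usize },
    NotNeeded,
}

impl EnsureMultiple {
    /// Collapse an all-zero change set into `NotNeeded` automatically.
    fn new(added: usize, updated: usize, removed: usize) -> Self {
        if added == 0 && updated == 0 && removed == 0 {
            EnsureMultiple::NotNeeded
        } else {
            EnsureMultiple::Changed { added, updated, removed }
        }
    }
}

fn main() {
    assert_eq!(EnsureMultiple::new(0, 0, 0), EnsureMultiple::NotNeeded);
    assert!(matches!(
        EnsureMultiple::new(1, 0, 0),
        EnsureMultiple::Changed { added: 1, .. }
    ));
}
```
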

Contributor Author:

Reworked in 8573d08

for zone in removed_zones {
    sled_storage.expunge_zone_datasets(zone);
}
mem::drop(sled_storage);
Collaborator:

(Opinion, feel free to argue with me)

WDYT about adding a method that consumes self on sled_storage, and documents a little more clearly what we're doing here? Something like fn commit_to_blueprint(self), or fn bump_blueprint_generation_numbers(self)?

The explicit drop calls are dropping the Dataset / Disk editors, and their drop impls bump generation numbers if anything has changed, but that's kinda only "easy to see" because I just reviewed that code. Outside of that context, it requires jumping through a few files to infer what's happening here, and why this call is necessary.

Same for the other mem::drop callsite
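
A small sketch of the consuming-method idea, with invented stand-in types; the real editors wrap blueprint disk/dataset configs and track more state than this:

```rust
// Sketch of an explicit "consume the editor" method, instead of relying
// on Drop to bump the generation. Types are invented for illustration.
struct DisksConfig {
    generation: u64,
}

struct SledDisksEditor<'a> {
    config: &'a mut DisksConfig,
    changed: bool,
}

impl<'a> SledDisksEditor<'a> {
    /// Consume the editor, bumping the config generation if anything
    /// changed. Making this explicit documents why the borrow ends here.
    fn finalize(self) {
        if self.changed {
            self.config.generation += 1;
        }
    }
}

fn main() {
    let mut config = DisksConfig { generation: 1 };
    let editor = SledDisksEditor { config: &mut config, changed: true };
    editor.finalize(); // ends the mutable borrow and bumps the generation
    assert_eq!(config.generation, 2);
}
```
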

Contributor Author:

FWIW, I am not thinking of these drop calls as behavioral (i.e., bumping generation numbers); I think it would be fine if that didn't happen right here and happened later when this would naturally be dropped. The drop is here to appease the borrow checker: we can't move on and call self.record_operation while we still have a mutable borrow back into self.storage (which sled_storage is).

That said - I added a finalize() method to the datasets editor based on your feedback in #7104, so I'm certainly not opposed to it! Let me take a stab at this in addition to the other comments around Ensure.

Comment on lines 113 to 125
pub fn ensure_disk(&mut self, disk: BlueprintPhysicalDiskConfig) -> Ensure {
    let zpool = ZpoolName::new_external(disk.pool_id);

    let result = self.disks.ensure_disk(disk);

    // We ignore the result of possibly adding or updating the disk's
    // dataset. Do we care to log if they've changed even if the disk
    // doesn't?
    self.datasets.ensure_debug_dataset(zpool.clone());
    self.datasets.ensure_zone_root_dataset(zpool);

    result
}
Collaborator:

I think we do care, especially for the purposes of keeping blueprint diffs accurate.

Would it be unreasonable to return:

struct DiskEnsure {
  disk: Ensure,
  dataset: Ensure,
}

? To basically propagate information about "what disks changed" separately from "what datasets changed"?

(... this also makes me think that Ensure should be typed, like our UUIDs, but that's a whole 'nother can of worms)
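
A sketch of the "typed Ensure" aside, modeled loosely on how typed IDs use marker types; everything here is invented for illustration:

```rust
// Sketch only: a marker type parameter makes it a type error to hand a
// dataset result to code expecting a disk result.
use std::marker::PhantomData;

struct DiskKind;
struct DatasetKind;

/// An `Ensure` result tagged with the kind of thing that was ensured.
enum Ensure<T> {
    Added(PhantomData<T>),
    NotNeeded(PhantomData<T>),
}

impl<T> Ensure<T> {
    fn added() -> Self {
        Ensure::Added(PhantomData)
    }
    fn not_needed() -> Self {
        Ensure::NotNeeded(PhantomData)
    }
}

fn record_disk_change(_: Ensure<DiskKind>) {}
fn record_dataset_change(_: Ensure<DatasetKind>) {}

fn main() {
    record_disk_change(Ensure::<DiskKind>::added());
    record_dataset_change(Ensure::<DatasetKind>::not_needed());
    // record_disk_change(Ensure::<DatasetKind>::added()); // would not compile
}
```
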

Contributor:

I'd prefer to get rid of Ensure altogether and instead return the EditCounts recently added in PR 2/5 when build or finalize is called.

Collaborator:

Totally agree that's a better name for it, but I still think there's the question of scope (e.g., did you edit a disk, dataset, zone, etc?)

Contributor:

> Totally agree that's a better name for it, but I still think there's the question of scope (e.g., did you edit a disk, dataset, zone, etc?)

Agreed. I think returning multiple EditCounts would make sense.

Contributor Author:

> I think we do care, especially for the purposes of keeping blueprint diffs accurate.

Can you say more about this? I think the return values of these functions are only really used in logging and some tests (which assert on the specific counts); diffs already do their own independent analysis of changes.

Which is not to say I'm opposed! Returning a combined edit counts thingamajig seems good, but it will need to be from a finalize(self) -> AllEditCounts kind of method, I think. Will give that a shot.
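
A sketch of what a finalize(self) -> AllEditCounts shape could look like; the type and field names below are assumptions, not the eventual implementation:

```rust
// Sketch of a consuming finalize() that reports everything the editor
// changed, scoped per kind of object; names are invented stand-ins.
#[derive(Clone, Copy, Default, Debug, PartialEq, Eq)]
struct EditCounts {
    added: usize,
    updated: usize,
    expunged: usize,
}

#[derive(Default, Debug)]
struct StorageEditCounts {
    disks: EditCounts,
    datasets: EditCounts,
}

#[derive(Default)]
struct StorageEditor {
    disk_counts: EditCounts,
    dataset_counts: EditCounts,
}

impl StorageEditor {
    fn ensure_disk(&mut self /* , disk: ... */) {
        // Adding a disk implies adding its Debug and Zone Root datasets,
        // so both sets of counts can move in a single call.
        self.disk_counts.added += 1;
        self.dataset_counts.added += 2;
    }

    /// Consume the editor and report all edits at once, instead of
    /// returning an `Ensure` from each call.
    fn finalize(self) -> StorageEditCounts {
        StorageEditCounts {
            disks: self.disk_counts,
            datasets: self.dataset_counts,
        }
    }
}

fn main() {
    let mut editor = StorageEditor::default();
    editor.ensure_disk();
    let counts = editor.finalize();
    assert_eq!(counts.disks.added, 1);
    assert_eq!(counts.datasets.added, 2);
}
```
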

Collaborator:

Yeah, I suppose not for the diffs, but the point of passing back the ensure structs in the first place was to keep track of "what changed", as you say, for tests, comments in the blueprint, etc.

(I'll take a look at the AllEditCounts stuff, but it sounds like we're on the same page here)

self.config
    .disks
    .insert(PhysicalDiskUuid::from_untyped_uuid(disk.id), disk)
pub fn ensure_disk(&mut self, disk: BlueprintPhysicalDiskConfig) -> Ensure {
Contributor:

Same as in Sean and my comments in #7104. Can we get rid of the Ensure type and track the counts in SledDisksEditor and SledDatasetsEditor themselves?
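
A minimal sketch of tracking counts inside the editor itself, with invented stand-in types; ensure_disk distinguishes an insert from an update by inspecting what was already in the map, so no per-call Ensure value is needed:

```rust
// Sketch only: the editor accumulates its own EditCounts as it goes.
use std::collections::BTreeMap;

#[derive(Clone, Copy, Default, Debug, PartialEq, Eq)]
struct EditCounts {
    added: usize,
    updated: usize,
}

#[derive(Clone, PartialEq, Eq, Debug)]
struct DiskConfig {
    id: u32,
    pool: u32,
}

#[derive(Default)]
struct SledDisksEditor {
    disks: BTreeMap<u32, DiskConfig>,
    counts: EditCounts,
}

impl SledDisksEditor {
    fn ensure_disk(&mut self, disk: DiskConfig) {
        match self.disks.insert(disk.id, disk.clone()) {
            None => self.counts.added += 1,
            // Only count an update if the config actually changed.
            Some(prev) if prev != disk => self.counts.updated += 1,
            Some(_) => (),
        }
    }
}

fn main() {
    let mut editor = SledDisksEditor::default();
    editor.ensure_disk(DiskConfig { id: 1, pool: 10 });
    editor.ensure_disk(DiskConfig { id: 1, pool: 10 }); // no-op
    editor.ensure_disk(DiskConfig { id: 1, pool: 11 }); // update
    assert_eq!(editor.counts, EditCounts { added: 1, updated: 1 });
}
```
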

    self.disks.current_sled_disks(sled_id)
}

pub fn into_builders(
Contributor:

Rather than returning builder types and then having the BlueprintBuilder call build on both of them, what if we avoid creating the wrapper types and instead make this a build method that takes the in-service sled IDs and returns the results as a pair?

That should make the caller code and this code simpler, and I think it meets the goal of not allowing separate builders to be exposed for datasets and disks. A sketch of that shape follows below.
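
A sketch of the single-build-method shape, using simplified stand-in types and signatures; collecting the sled IDs into a set up front is one way to sidestep the "iterator must be cloneable" awkwardness mentioned in the reply below:

```rust
// Sketch only: consume the editor and emit both maps at once, so callers
// never see separate disk and dataset builders.
use std::collections::{BTreeMap, BTreeSet};

type SledUuid = u64;

#[derive(Debug)]
struct DisksConfig;
#[derive(Debug)]
struct DatasetsConfig;

struct BlueprintStorageEditor {
    disks: BTreeMap<SledUuid, DisksConfig>,
    datasets: BTreeMap<SledUuid, DatasetsConfig>,
}

impl BlueprintStorageEditor {
    fn build(
        self,
        in_service_sleds: impl Iterator<Item = SledUuid>,
    ) -> (BTreeMap<SledUuid, DisksConfig>, BTreeMap<SledUuid, DatasetsConfig>) {
        // Collecting once avoids requiring the iterator to be Clone.
        let keep: BTreeSet<SledUuid> = in_service_sleds.collect();
        let disks = self
            .disks
            .into_iter()
            .filter(|(id, _)| keep.contains(id))
            .collect();
        let datasets = self
            .datasets
            .into_iter()
            .filter(|(id, _)| keep.contains(id))
            .collect();
        (disks, datasets)
    }
}

fn main() {
    let editor = BlueprintStorageEditor {
        disks: BTreeMap::from([(1, DisksConfig)]),
        datasets: BTreeMap::from([(1, DatasetsConfig), (2, DatasetsConfig)]),
    };
    let (disks, datasets) = editor.build([1u64].into_iter());
    assert_eq!(disks.len(), 1);
    assert_eq!(datasets.len(), 1);
}
```
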

Contributor Author:

Yeah, I actually had it that way originally and it was sort of awkward to require the sled IDs iterator to be cloneable. But I can change it back.

Contributor Author:

Changed in 83a49b7

@jgallagher force-pushed the john/reconfigurator-storage-2 branch from ab94020 to 552d8bc on November 20, 2024 at 18:52
Base automatically changed from john/reconfigurator-storage-2 to main on November 20, 2024 at 21:11
@jgallagher (Contributor Author):

Thanks for the quick reviews! Merging with the EditCounts etc. changes in #7104 and then reworking this PR to be in that style was nontrivial - maybe worth a second look? Particularly 2eef785 and 8573d08

impl Drop for SledDisksEditor<'_> {
    fn drop(&mut self) {
-       if self.changed {
+       if self.counts != EditCounts::default() {
Collaborator:

Extremely minor nitpick, but maybe this is worth an EditCounts::has_changes() -> bool method?

Contributor Author:

Yeah, seems reasonable. I might go with has_nonzero_counts() since this thing itself is just counts, and only the bigger context makes "changes" meaningful.
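
A minimal sketch of that helper; the real EditCounts has more fields, but the shape of the check is the same (assuming it derives PartialEq and Default):

```rust
// Sketch only: fields are invented stand-ins for the real EditCounts.
#[derive(Clone, Copy, Default, PartialEq, Eq, Debug)]
struct EditCounts {
    added: usize,
    updated: usize,
    expunged: usize,
}

impl EditCounts {
    /// True if any count is nonzero, i.e. the editor made some change.
    fn has_nonzero_counts(&self) -> bool {
        *self != Self::default()
    }
}

fn main() {
    assert!(!EditCounts::default().has_nonzero_counts());
    assert!(EditCounts { added: 1, ..Default::default() }.has_nonzero_counts());
}
```
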

@jgallagher merged commit ee0db08 into main on Nov 21, 2024
17 checks passed
@jgallagher deleted the john/reconfigurator-storage-3 branch on November 21, 2024 at 18:14
andrewjstone pushed a commit that referenced this pull request Nov 22, 2024
…ol, it must be from a disk present in the blueprint (#7106)

Builds and is staged on top of #7105.

The intended change here is in the first commit
(8274174): In
`BlueprintBuilder::sled_select_zpool()`, instead of only looking at the
`PlanningInput`, we also look at the disks present in the blueprint, and
only select a zpool that the planning input says is in service and that
we have in the blueprint.

This had a surprisingly-large blast radius in terms of tests - we had
_many_ tests which were adding zones (which implicitly selects a zpool)
from a `BlueprintBuilder` where there were no disks configured at all,
causing them to emit invalid blueprints. These should all be fixed as of
this PR, but I'm a little worried about test fragility in general,
particularly with an eye toward larger changes like #7078. Nothing to do
about that at the moment, but something to keep an eye on.

Fixes #7079.
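
An illustrative-only sketch of the selection rule described above, over invented stand-in types: a pool is a candidate only if the planning input says it is in service and the blueprint has a disk backing it:

```rust
// Sketch of the zpool-selection rule in #7106; types are stand-ins, not
// the real BlueprintBuilder::sled_select_zpool() signature.
use std::collections::BTreeSet;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ZpoolName(u32);

fn select_zpool(
    in_service_from_planning_input: &BTreeSet<ZpoolName>,
    pools_backed_by_blueprint_disks: &BTreeSet<ZpoolName>,
) -> Option<ZpoolName> {
    in_service_from_planning_input
        .iter()
        .copied()
        // Skip pools the blueprint has no disk for; otherwise a zone could
        // be placed on storage the blueprint never provisions.
        .find(|pool| pools_backed_by_blueprint_disks.contains(pool))
}

fn main() {
    let planning: BTreeSet<_> = [ZpoolName(1), ZpoolName(2)].into();
    let blueprint: BTreeSet<_> = [ZpoolName(2)].into();
    assert_eq!(select_zpool(&planning, &blueprint), Some(ZpoolName(2)));
}
```
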