Conversation

@smklein (Collaborator) commented Dec 14, 2021

This PR re-works the disk creation/deletion pathways to allocate Crucible Downstairs regions via sagas.

Nexus

  • Modifies the disk creation API to actually allocate regions and ensure they exist.
  • Similarly, modifies the disk deletion API to actually remove those regions.
  • Converts "disk creation" and "disk deletion" to use sagas.

Datastore

  • Exposes APIs to idempotently allocate regions to back a disk.
  • Moves "check if disk attached" error into the project_delete_disk API, to avoid the documented race condition.
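The race condition in that last bullet comes from separating the "is the disk attached?" check from the delete itself. A minimal sketch of the fix, with a mutex standing in for the database's transactional conditional delete (the `Datastore`, `DiskState`, and error strings here are illustrative stand-ins, not Omicron's actual types):

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq, Debug)]
enum DiskState { Detached, Attached }

struct Datastore { disk: Mutex<Option<DiskState>> }

impl Datastore {
    // Check-and-delete under one lock, standing in for a conditional
    // DELETE ... WHERE state = 'detached' in the real database. Doing
    // both steps atomically closes the window where a disk could be
    // attached between the check and the delete.
    fn project_delete_disk(&self) -> Result<(), &'static str> {
        let mut disk = self.disk.lock().unwrap();
        match *disk {
            Some(DiskState::Detached) => { *disk = None; Ok(()) }
            Some(DiskState::Attached) => Err("disk is attached"),
            None => Err("no such disk"),
        }
    }
}
```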

Sled Agent

  • Creates a simulated Crucible Agent service, with hooks into the simulated Sled Agent.

Tests

  • Splits the disk integration tests into smaller tests
  • Adds datastore-specific tests for region allocation
  • Adds disk integration tests for region allocation, interaction with crucible agent, and undoing saga actions.

type TxnError = TransactionError<RegionAllocateError>;
let params: params::DiskCreate = params.clone();
self.pool()
    .transaction(move |conn| {
@smklein (Collaborator, Author) replied:

That's right. Looking up regions + datasets + other auxiliary data is fairly involved, and rather than optimizing a CTE for it (especially as the allocation algorithm might change) I figured I'd start with something "easy-to-understand, but less optimized".

Is that okay?
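To make that tradeoff concrete, here is a hedged sketch of the transaction-based approach: inside one transaction, read candidate datasets, pick enough of them, and record the regions. An in-memory `Vec` stands in for the database, and the names (`Dataset`, `Region`, `REGION_REDUNDANCY`, `size_used`) are illustrative, not the real schema:

```rust
#[derive(Clone, Debug)]
struct Dataset { id: u32, size_used: u64 }

#[derive(Debug, PartialEq)]
struct Region { dataset_id: u32, disk_id: u32 }

const REGION_REDUNDANCY: usize = 3;

fn allocate_regions(datasets: &mut Vec<Dataset>, disk_id: u32, size: u64)
    -> Result<Vec<Region>, String>
{
    // Step 1: prefer the datasets with the least space used.
    datasets.sort_by_key(|d| d.size_used);
    if datasets.len() < REGION_REDUNDANCY {
        // Returning Err would roll the whole transaction back.
        return Err(format!("only {} datasets available", datasets.len()));
    }
    // Step 2: record the regions and bump each chosen dataset's usage.
    let mut regions = Vec::new();
    for d in datasets.iter_mut().take(REGION_REDUNDANCY) {
        d.size_used += size;
        regions.push(Region { dataset_id: d.id, disk_id });
    }
    Ok(regions)
}
```

The point of the interactive transaction is that each step stays readable even if the selection policy later changes; a CTE could collapse this into one statement, at the cost of being harder to evolve.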

Region::new(
    dataset.id(),
    disk_id,
    params.block_size().try_into().unwrap(),
@smklein (Collaborator, Author) replied:

Sure, I'll update their types to ByteCount. I'm admittedly still inclined to leave "extent_count" as something other than a ByteCount, since it is a count, not a size.
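A small sketch of that type distinction, with `ByteCount` and `RegionParams` as illustrative stand-ins for the real types: sizes get the newtype, while the count stays a plain integer.

```rust
// Newtype for sizes measured in bytes; prevents mixing them up with
// plain counts at compile time.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ByteCount(u64);

impl ByteCount {
    fn to_bytes(self) -> u64 { self.0 }
}

struct RegionParams {
    block_size: ByteCount,   // a size, in bytes
    blocks_per_extent: u64,  // dimensionless multiplier
    extent_count: u64,       // a count, not a size
}

impl RegionParams {
    // The total size is the product of the three fields, and only the
    // result is byte-typed.
    fn total_size(&self) -> ByteCount {
        ByteCount(self.block_size.to_bytes()
            * self.blocks_per_extent
            * self.extent_count)
    }
}
```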

fn saga_disk_delete() -> SagaTemplate<SagaDiskDelete> {
    let mut template_builder = SagaTemplateBuilder::new();

    template_builder.append(
@smklein (Collaborator, Author) replied:

I thought it would be critical to modify the disk record before anything else - if we don't update the disk record first, couldn't other concurrent operations poke and prod at the disk while we're tearing down the backing storage?

In the "disk creation" saga, we include a final step to "finalize" the disk, which basically exposes it for access. I figured the most important invariant was "if a disk has been created, it should be backed by functioning regions at all times" - so taking it out of rotation felt like the most important step, before removing the regions.
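The ordering argument can be sketched as follows; the types are illustrative stand-ins for the real saga nodes, but the invariant is the one above: fence off the disk record first, then remove the regions.

```rust
#[derive(Debug, PartialEq)]
enum DiskState { Detached, Destroyed }

struct Disk { state: DiskState, region_count: usize }

fn delete_disk(disk: &mut Disk) {
    // Step 1: update the disk record first, taking it out of service.
    disk.state = DiskState::Destroyed;
    // Step 2: only then tear down the backing regions; any concurrent
    // attach has already been fenced off by step 1.
    disk.region_count = 0;
}

fn attach(disk: &Disk) -> Result<(), &'static str> {
    // Concurrent operations observe the updated record, so they can
    // never touch a disk whose backing storage is mid-teardown.
    match disk.state {
        DiskState::Detached => Ok(()),
        DiskState::Destroyed => Err("disk destroyed"),
    }
}
```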

@smklein (Collaborator, Author) commented Jan 20, 2022

Thanks for the reviews, y'all. I appreciate the help, and know this PR is a lot to get through.

With this most recent change, I've gotten rid of FULL SCANs by storing some usage info in the Dataset table.

> Nice! This is a big piece of work and looks good.
>
> How do you feel about the saga? My read is that the saga action is working just as we hoped. It's fairly straightforward to reason about the individual actions. And looking at what's involved, it feels untenable to try to write this code by hand (like, without sagas or something like it) and have it correctly unwind state on failure and work correctly after a crash, etc.

I think this is largely correct. Breaking the work into smaller units definitely makes reasoning about it more tractable, and it's nice to have the automated support for calling the "unwind" functions. However, I keep on struggling to be sure that the operations I'm writing are idempotent.

I'd really like to write more tests for these conditions (repeating each action, undoing from each node, repeating the undo actions) because they feel fairly easy to miss, and yet pretty dangerous if ultimately wrong.
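A minimal sketch of what that idempotency concern demands of each action, using an in-memory `Agent` as an illustrative stand-in for the Crucible agent: both the forward action and its undo must tolerate being run more than once, because a crash can cause either to be replayed.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Region { id: u32 }

struct Agent { regions: HashMap<u32, Region> }

impl Agent {
    // Idempotent create: a repeated request for the same id returns the
    // existing region instead of erroring or duplicating it.
    fn ensure_region(&mut self, id: u32) -> Region {
        self.regions.entry(id).or_insert(Region { id }).clone()
    }
    // Idempotent undo: deleting an already-deleted region is a no-op.
    fn delete_region(&mut self, id: u32) {
        self.regions.remove(&id);
    }
}
```

Tests for these conditions amount to calling each action twice, and calling each undo after zero, one, and two runs of its action, checking that the end state is identical.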

@leftwo (Contributor) left a comment:

I believe all my concerns have been addressed. You have my approval, for what it's worth :)

Base automatically changed from explain to main January 21, 2022 15:57
@davepacheco (Collaborator) left a comment:

Thanks for making a bunch of those changes!

timeseries_client,
};

/* TODO-cleanup all the extra Arcs here seems wrong */
Collaborator:

What does "saga_type" => "recovery" mean? Is it that this saga has been recovered as opposed to having been created by an API call handled by this process?

@smklein (Collaborator, Author) replied:

When creating the other logger in execute_saga, I created the logger with the key template_name.

In this recovery setup, however, we create a SagaContext object before we know what templates we'll be processing.

So basically, yeah: I wanted some way to distinguish the "saga context for recovery" vs "the normal sagas".

This is totally arbitrary, though; if we'd prefer different keys, this could change.
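The two logger setups can be sketched as plain key/value pairs (illustrative only, not the real slog API): the recovery context is built before any template is known, so it can only carry a generic tag, while the per-saga logger can name its template.

```rust
// Logger created at recovery time, before any template is known: the
// only thing it can say is that these sagas came from recovery.
fn recovery_logger_kv() -> (&'static str, &'static str) {
    ("saga_type", "recovery")
}

// Logger created per-saga in execute_saga, where the template name is
// available and can be attached directly.
fn execution_logger_kv(template: &'static str) -> (&'static str, &'static str) {
    ("template_name", template)
}
```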

Collaborator:

Sounds reasonable. It'll be good to eventually get the template name in those too but we can do that later!

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! HTTP entrypoint functions for simulating the storage agent API.
Collaborator:

Alright. How do we make sure this stays in sync with the real Crucible? How would someone modifying Crucible even know that they have to update this too?

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Simulated sled agent storage implementation
Collaborator:

That makes sense. It's just a little confusing -- in general, "disks" and "storage" are sort of synonymous. It sounds like the real difference between one of these is that one is virtualized -- one is an RFD 4 Disk and the other is a general-purpose lower-level storage subsystem.

@smklein smklein merged commit a75149c into main Jan 24, 2022
@smklein smklein deleted the regions branch January 24, 2022 17:40
smklein added a commit that referenced this pull request Jan 25, 2022
#633)

#511 made disk allocation more "real" - disks are allocated from a group of datasets.

Even for the Simulated Sled Agent, Crucible Regions may be allocated atop a Crucible Dataset (though the data plane won't exist).

However, this wasn't the default when running the "simulated sled agent" binary. This PR adds a default for the simulated sled agent: "pretend you have 10 zpools (representing U.2 storage), each with 1 TB".
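That default can be sketched as follows; the constants and the `Zpool` type are illustrative, not the binary's real configuration structures.

```rust
// One terabyte, as a power of two.
const TB: u64 = 1 << 40;

#[derive(Debug)]
struct Zpool { size: u64 }

// Default for the simulated sled agent binary: ten 1 TB zpools,
// standing in for U.2 storage.
fn default_simulated_zpools() -> Vec<Zpool> {
    (0..10).map(|_| Zpool { size: TB }).collect()
}
```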