Disk creation/deletion allocates Crucible regions via sagas #511
Conversation
type TxnError = TransactionError<RegionAllocateError>;
let params: params::DiskCreate = params.clone();
self.pool()
    .transaction(move |conn| {
That's right. Looking up regions + datasets + the other auxiliary data felt fairly complex, and rather than optimizing a CTE for it (especially as the allocation algorithm might change) I figured I'd start with something "easy-to-understand, but less optimized".
Is that okay?
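To make the trade-off concrete, here is a minimal, hypothetical sketch of that "easy-to-understand" strategy: inside one transaction, load the candidate datasets, pick the least-used ones, and insert a region on each. The `Dataset` struct, `REGION_REDUNDANCY` constant, and function names below are illustrative stand-ins, not Omicron's actual types.

```rust
// Hypothetical sketch: greedy, least-loaded region placement, the kind
// of logic that is easy to express in a transaction but awkward in a CTE.

#[derive(Debug, Clone)]
struct Dataset {
    id: u32,
    allocated_regions: usize,
}

const REGION_REDUNDANCY: usize = 3;

/// Pick the datasets with the fewest existing regions.
fn allocate_regions(mut datasets: Vec<Dataset>) -> Result<Vec<u32>, String> {
    if datasets.len() < REGION_REDUNDANCY {
        return Err("not enough datasets".to_string());
    }
    // Least-loaded first; the allocation policy may change, which is why
    // a plain transaction is easier to evolve than a hand-tuned CTE.
    datasets.sort_by_key(|d| d.allocated_regions);
    Ok(datasets[..REGION_REDUNDANCY].iter().map(|d| d.id).collect())
}

fn main() {
    let datasets = vec![
        Dataset { id: 1, allocated_regions: 5 },
        Dataset { id: 2, allocated_regions: 0 },
        Dataset { id: 3, allocated_regions: 2 },
        Dataset { id: 4, allocated_regions: 1 },
    ];
    // Picks the three least-loaded datasets: ids 2, 4, 3.
    println!("{:?}", allocate_regions(datasets).unwrap());
}
```

In the real code this selection would run inside `self.pool().transaction(...)` so the reads and the region inserts are atomic.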
nexus/src/db/datastore.rs (outdated)
Region::new(
    dataset.id(),
    disk_id,
    params.block_size().try_into().unwrap(),
Sure, I'll update their type to ByteCount. Admittedly, I'm still inclined to leave "extent_count" as something other than a ByteCount, since it is a count, not a size.
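A small sketch of the distinction being argued here, assuming a simple `ByteCount` newtype (the real Omicron type has more validation): sizes get the newtype, while `extent_count` stays a plain integer, and the region's total size is derived from the three fields.

```rust
// Illustrative only: a minimal ByteCount newtype vs. a plain count.

#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteCount(u64);

impl ByteCount {
    fn to_bytes(self) -> u64 { self.0 }
}

#[derive(Debug)]
struct Region {
    block_size: ByteCount,  // a size, in bytes
    blocks_per_extent: u64, // blocks per extent: a count
    extent_count: u64,      // also a count, not a size
}

impl Region {
    /// Total region size, derived rather than stored.
    fn size(&self) -> ByteCount {
        ByteCount(self.block_size.to_bytes() * self.blocks_per_extent * self.extent_count)
    }
}

fn main() {
    let r = Region {
        block_size: ByteCount(512),
        blocks_per_extent: 2048,
        extent_count: 16,
    };
    // 512 * 2048 * 16 = 16777216 bytes
    println!("{} bytes", r.size().to_bytes());
}
```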
fn saga_disk_delete() -> SagaTemplate<SagaDiskDelete> {
    let mut template_builder = SagaTemplateBuilder::new();

    template_builder.append(
I thought it would be critical to modify the disk record before anything else - if we don't update the disk record first, couldn't other concurrent operations poke and prod at the disk while we're tearing down the backing storage?
In the "disk creation" saga, we include a final step to "finalize" the disk, which basically exposes it for access. I figured the most important invariant was "if a disk has been created, it should be backed by functioning regions at all times" - so taking the disk out of the rotation felt like the most important step, before taking out the regions.
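The ordering invariant described here can be sketched as follows. This is a hedged illustration, not the real saga templates: step names are hypothetical, but the shape matches the argument above - visibility changes bracket the region work on both paths.

```rust
// Sketch: the delete saga hides the disk *before* destroying regions;
// the create saga exposes the disk only as its *final* step.

#[derive(Debug, PartialEq)]
enum Step {
    MarkDiskDeleted, // take the disk "out of rotation" first
    DeleteRegions,   // only then tear down backing storage
    CreateRegions,
    FinalizeDisk,    // expose the disk for access, last
}

fn disk_delete_saga() -> Vec<Step> {
    vec![Step::MarkDiskDeleted, Step::DeleteRegions]
}

fn disk_create_saga() -> Vec<Step> {
    vec![Step::CreateRegions, Step::FinalizeDisk]
}

fn main() {
    // Invariant: a visible disk is always backed by functioning regions.
    assert_eq!(disk_delete_saga()[0], Step::MarkDiskDeleted);
    assert_eq!(disk_create_saga().last(), Some(&Step::FinalizeDisk));
    println!("ordering invariant holds");
}
```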
Thanks for the reviews, y'all. I appreciate the help, and know this PR is a lot to get through. With this most recent change, I've gotten rid of
I think this is largely correct. Breaking the work into smaller units definitely makes reasoning about it more tractable, and it's nice to have the automated support for calling the "unwind" functions. However, I keep struggling to be sure that the operations I'm writing are idempotent. I'd really like to write more tests for these conditions (repeating each action, undoing from each node, repeating the undo actions), because they feel fairly easy to miss, and yet pretty dangerous if ultimately wrong.
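The kind of idempotency test described here can be sketched in miniature: run each action twice (and its undo twice, simulating replay after crash recovery) and require the end state to match a single application. Everything below is a stand-in, not the real saga machinery.

```rust
// Sketch of an idempotency check for a saga action and its undo.
use std::collections::HashSet;

#[derive(Default, Debug, PartialEq, Clone)]
struct State {
    regions: HashSet<u32>,
}

/// An idempotent "ensure region exists" action: inserting twice is a no-op.
fn ensure_region(state: &mut State, id: u32) {
    state.regions.insert(id);
}

/// The matching idempotent undo: deleting an absent region is a no-op.
fn undo_region(state: &mut State, id: u32) {
    state.regions.remove(&id);
}

fn main() {
    let mut once = State::default();
    ensure_region(&mut once, 7);

    let mut twice = State::default();
    ensure_region(&mut twice, 7);
    ensure_region(&mut twice, 7); // action replayed after recovery

    assert_eq!(once, twice, "action must be idempotent");

    undo_region(&mut twice, 7);
    undo_region(&mut twice, 7); // undo replayed as well
    assert_eq!(twice, State::default(), "undo must be idempotent");
    println!("idempotency checks passed");
}
```

The same pattern generalizes to "undo from each node": run the saga forward to node N, unwind, and assert the state returned to the baseline.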
leftwo left a comment
I believe all my concerns have been addressed. You have my approval, for what it's worth :)
davepacheco left a comment
Thanks for making a bunch of those changes!
timeseries_client,
};

/* TODO-cleanup all the extra Arcs here seems wrong */
What does "saga_type" => "recovery" mean? Is it that this saga has been recovered as opposed to having been created by an API call handled by this process?
When creating the other logger in execute_saga, I created it with the key template_name.
In this recovery setup, however, we create a SagaContext object before we know which templates we'll be processing.
So basically, yeah: I wanted some way to distinguish the "saga context for recovery" from "the normal sagas".
This is totally arbitrary, though; if we'd prefer different keys, this could change.
Sounds reasonable. It'll be good to eventually get the template name in those too but we can do that later!
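The key convention being discussed can be sketched with plain key/value pairs standing in for the real structured logger (the keys match the thread; the function names are hypothetical):

```rust
// Sketch: sagas launched via execute_saga log a `template_name` key,
// while the recovery path, which doesn't yet know its templates,
// logs `saga_type = "recovery"` instead.

fn execute_saga_log_keys(template_name: &str) -> Vec<(String, String)> {
    vec![("template_name".to_string(), template_name.to_string())]
}

fn recovery_log_keys() -> Vec<(String, String)> {
    // At recovery time we don't yet know which templates we'll replay.
    vec![("saga_type".to_string(), "recovery".to_string())]
}

fn main() {
    println!("{:?}", execute_saga_log_keys("disk-create"));
    println!("{:?}", recovery_log_keys());
}
```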
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! HTTP entrypoint functions for simulating the storage agent API.
Alright. How do we make sure this stays in sync with the real Crucible? How would someone modifying Crucible even know that they have to update this too?
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Simulated sled agent storage implementation
That makes sense. It's just a little confusing -- in general, "disks" and "storage" are sort of synonymous. It sounds like the real difference is that one is virtualized: one is an RFD 4 Disk and the other is a general-purpose, lower-level storage subsystem.
#633: #511 made disk allocation more "real" - disks are allocated from a group of datasets. Even for the Simulated Sled Agent, Crucible Regions may be allocated atop a Crucible Dataset (though the data plane won't exist). However, this wasn't the default when running the "simulated sled agent" binary. This PR adds a default for the simulated sled agent: "pretend you have 10 zpools (representing U.2 storage), each with 1 TB".
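A minimal sketch of that default, under the assumption that a simulated zpool is just an id plus a size (the struct and constant names below are hypothetical, not the real simulated-sled-agent config):

```rust
// Sketch: the simulated sled agent's default of 10 zpools
// (as if U.2 devices), 1 TiB each.

#[derive(Debug)]
struct SimulatedZpool {
    id: usize,
    size_bytes: u64,
}

const DEFAULT_ZPOOL_COUNT: usize = 10;
const DEFAULT_ZPOOL_SIZE: u64 = 1 << 40; // 1 TiB per pool

fn default_zpools() -> Vec<SimulatedZpool> {
    (0..DEFAULT_ZPOOL_COUNT)
        .map(|id| SimulatedZpool { id, size_bytes: DEFAULT_ZPOOL_SIZE })
        .collect()
}

fn main() {
    let pools = default_zpools();
    let total: u64 = pools.iter().map(|p| p.size_bytes).sum();
    println!("{} pools, {} bytes total", pools.len(), total);
}
```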
This PR re-works the disk creation/deletion pathways to allocate Crucible Downstairs regions via sagas.
Nexus
Datastore
project_delete_disk API, to avoid the documented race condition.
Sled Agent
Tests