Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teach Nexus about snapshots #1475

Merged
merged 28 commits into from
Aug 22, 2022
Merged

Teach Nexus about snapshots #1475

merged 28 commits into from
Aug 22, 2022

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Jul 21, 2022

Create a "snapshot_create" saga, and fill out snapshot related code:

  • allow snapshots to be taken of disks attached to instances
  • stub out the path for taking snapshots of unattached disks
  • allow disks to be created from snapshots
  • implement related HTTP endpoint functions

Note that Nexus will allow you to create multiple disks out of a single
snapshot, and this currently will send multiple activations to a single
downtairs set, causing a race where only one Upstairs can win. This will
be addressed by future Crucible changes.

Create a "snapshot_create" saga, and fill out snapshot related code:

- allow snapshots to be taken of disks attached to instances
- stub out the path for taking snapshots of unattached disks
- allow disks to be created from snapshots
- implement related HTTP endpoint functions

Note that Nexus will allow you to create multiple disks out of a single
snapshot, and this currently will send multiple activations to a single
downtairs set, causing a race where only one Upstairs can win. This will
be addressed by future Crucible changes.
@jmpesp jmpesp requested review from jclulow and smklein July 21, 2022 19:56
Copy link
Contributor Author

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments:

Comment on lines +429 to +433
// For a disk not attached to an instance, implementation requires
// constructing a volume and performing a snapshot through some other
// means. Currently unimplemented.
todo!();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jclulow this is where I imagine something like your new service plugging in - it has to construct the volume and therefore requires the volume construction request.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we propagate an error upwards, rather than crashing here? This is an endpoint exposed to clients; if for some horrible reason it doesn't make it into v1, we shouldn't be crashing when this happens

sled-agent/src/sled_agent.rs Outdated Show resolved Hide resolved
nexus/src/app/disk.rs Show resolved Hide resolved
Comment on lines +505 to +510
// This also requires solving how to clean up the associated resources
// (on-disk snapshots, running read-only downstairs) because disks
// *could* still be using them (if the snapshot has not yet been turned
// into a regular crucible volume). It will involve some sort of
// reference counting for volumes, and probably means this needs to
// instead be a saga.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a comment in Datastore::project_delete_snapshot that's throwing me off a bit - it says:

A snapshot can be deleted in any state. It's never attached to an instance, and any disk launched from it will copy and modify the volume construction request it's based on

But this comment implies doing so is unsafe.

To be clear, the current implementation is leaking, by not reference counting / deleting volumes, correct?

Just verifying my understanding that this is a question of "when should we delete things that we can delete" (which sucks, but is okay to punt), rather than "the current implementation risks a use-after-free" (which we probably should not merge).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry yeah, that is a little confusing. "A snapshot can be deleted in any state." refers to the snapshot states creating, ready, faulted, destroyed. Simply deleting it won't affect any running instance.

The current implementation leaks on-disk snapshots and associated read-only running downstairs, yes. If I create a snapshot X out of disk A, create disk B from snapshot X, there will be three volumes in the database:

thing volume record
disk A volume 1
snapshot X volume 2
disk B volume 3

Volume 2 will be a copy of volume 1 (with modifications). Volume 3 will include a copy of volume 2 as a read-only parent.

With this code, a delete of snapshot X will delete the snapshot record and the associated volume record (volume 2). It will not delete the on-disk snapshot or associated running read-only downstairs, which are (if it is attached to an instance) being used by volume 3.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - the TL;DR is that the resource being leaked is "volumes associated with snapshots". If we don't have one already, could we file an issue for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, opened #1632

nexus/src/app/sagas/disk_create.rs Outdated Show resolved Hide resolved
nexus/src/app/sagas/snapshot_create.rs Outdated Show resolved Hide resolved
nexus/src/db/datastore/sled.rs Outdated Show resolved Hide resolved
nexus/src/db/datastore/snapshot.rs Outdated Show resolved Hide resolved
nexus/src/app/sagas/snapshot_create.rs Show resolved Hide resolved
nexus/src/app/sagas/snapshot_create.rs Show resolved Hide resolved
nexus/src/app/sagas/snapshot_create.rs Show resolved Hide resolved
nexus/src/app/sagas/snapshot_create.rs Outdated Show resolved Hide resolved
@jmpesp
Copy link
Contributor Author

jmpesp commented Aug 16, 2022

Pretty sure this PR addresses all the points in #735

@jmpesp
Copy link
Contributor Author

jmpesp commented Aug 18, 2022

This should be good for a re-review now.

@smklein smklein self-assigned this Aug 22, 2022
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the hard work on this. I think there are a handful of issues we want to follow-up on, but we can certainly do those iteratively, rather than continuing to block this PR (which I imagine is already a pain to rebase at the 3500+ LoC mark).

Thanks again for all the tests!

@jmpesp jmpesp merged commit fff2462 into oxidecomputer:main Aug 22, 2022
@jmpesp jmpesp deleted the snapshots branch August 23, 2022 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants