Teach Nexus about snapshots #1475
Conversation
Create a "snapshot_create" saga, and fill out snapshot related code:
- allow snapshots to be taken of disks attached to instances
- stub out the path for taking snapshots of unattached disks
- allow disks to be created from snapshots
- implement related HTTP endpoint functions

Note that Nexus will allow you to create multiple disks out of a single snapshot, and this currently will send multiple activations to a single downstairs set, causing a race where only one Upstairs can win. This will be addressed by future Crucible changes.
Some comments:
// For a disk not attached to an instance, implementation requires
// constructing a volume and performing a snapshot through some other
// means. Currently unimplemented.
todo!();
}
@jclulow this is where I imagine something like your new service plugging in - it has to construct the volume and therefore requires the volume construction request.
Nit: can we propagate an error upwards, rather than crashing here? This is an endpoint exposed to clients; if for some horrible reason it doesn't make it into v1, we shouldn't be crashing when this happens
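A minimal sketch of the suggested change: return an error from the unimplemented branch instead of panicking with `todo!()`. The `Error` type here is a stand-in for omicron's actual error type, and `snapshot_unattached_disk` is a hypothetical name; only the shape of the fix (propagating `Err` upward so the endpoint fails gracefully) is the point.

```rust
// Stand-in for omicron's internal error type; the real one lives in
// omicron_common and converts into a Dropshot HTTP 500 response.
#[derive(Debug)]
struct Error(String);

impl Error {
    fn internal_error(msg: &str) -> Self {
        Error(msg.to_string())
    }
}

// Hypothetical function name. Instead of `todo!()`, the unattached-disk
// path returns an error the endpoint layer can surface to the client.
fn snapshot_unattached_disk() -> Result<(), Error> {
    // For a disk not attached to an instance, implementation requires
    // constructing a volume and performing a snapshot through some other
    // means. Currently unimplemented, so propagate an error upwards.
    Err(Error::internal_error(
        "snapshots of unattached disks are not yet supported",
    ))
}

fn main() {
    // The caller sees an Err rather than a process crash.
    assert!(snapshot_unattached_disk().is_err());
    println!("ok");
}
```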
// This also requires solving how to clean up the associated resources
// (on-disk snapshots, running read-only downstairs) because disks
// *could* still be using them (if the snapshot has not yet been turned
// into a regular crucible volume). It will involve some sort of
// reference counting for volumes, and probably means this needs to
// instead be a saga.
There's a comment in Datastore::project_delete_snapshot
that's throwing me off a bit - it says:
A snapshot can be deleted in any state. It's never attached to an instance, and any disk launched from it will copy and modify the volume construction request it's based on
But this comment implies doing so is unsafe.
To be clear, the current implementation is leaking, by not reference counting / deleting volumes, correct?
Just verifying my understanding that this is a question of "when should we delete things that we can delete" (which sucks, but is okay to punt), rather than "the current implementation risks a use-after-free" (which we probably should not merge).
Sorry yeah, that is a little confusing. "A snapshot can be deleted in any state." refers to the snapshot states creating, ready, faulted, destroyed. Simply deleting it won't affect any running instance.
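To make the clarification concrete, here is a small sketch of the point, assuming the four states named above (the real state type lives in omicron's db model; these names only mirror this comment):

```rust
// Hypothetical mirror of the snapshot states named in the discussion.
#[derive(Debug)]
enum SnapshotState {
    Creating,
    Ready,
    Faulted,
    Destroyed,
}

// "A snapshot can be deleted in any state": since a snapshot is never
// attached to an instance, deletion of the record is valid regardless
// of which state the snapshot is in.
fn can_delete(_state: &SnapshotState) -> bool {
    true
}

fn main() {
    let states = [
        SnapshotState::Creating,
        SnapshotState::Ready,
        SnapshotState::Faulted,
        SnapshotState::Destroyed,
    ];
    for state in &states {
        assert!(can_delete(state));
    }
    println!("deletable in all states");
}
```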
The current implementation leaks on-disk snapshots and associated read-only running downstairs, yes. If I create snapshot X from disk A, then create disk B from snapshot X, there will be three volumes in the database:
thing      | volume record
-----------|--------------
disk A     | volume 1
snapshot X | volume 2
disk B     | volume 3
Volume 2 will be a copy of volume 1 (with modifications). Volume 3 will include a copy of volume 2 as a read-only parent.
With this code, a delete of snapshot X will delete the snapshot record and the associated volume record (volume 2). It will not delete the on-disk snapshot or associated running read-only downstairs, which are (if it is attached to an instance) being used by volume 3.
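The reference counting the earlier review comment calls for could look roughly like this. This is a hypothetical sketch, not omicron's actual schema or code: volume 2's backing resources (on-disk snapshot, read-only downstairs) may only be freed once every volume using it as a read-only parent is gone.

```rust
use std::collections::HashMap;

// Hypothetical reference-count table: parent volume -> number of volumes
// currently using it as a read-only parent. In the real system this would
// presumably live in the database and be updated inside a saga.
struct RefCounts {
    counts: HashMap<&'static str, u32>,
}

impl RefCounts {
    fn new() -> Self {
        RefCounts { counts: HashMap::new() }
    }

    // A new volume starts using `parent` as a read-only parent.
    fn add_reference(&mut self, parent: &'static str) {
        *self.counts.entry(parent).or_insert(0) += 1;
    }

    // A referencing volume is deleted; returns true only when the parent's
    // backing resources (on-disk snapshot, running read-only downstairs)
    // are no longer used by anyone and can safely be cleaned up.
    fn drop_reference(&mut self, parent: &'static str) -> bool {
        let count = self.counts.get_mut(parent).expect("unknown parent");
        *count -= 1;
        *count == 0
    }
}

fn main() {
    let mut refs = RefCounts::new();

    // Disk B (volume 3) includes snapshot X's volume 2 as a read-only parent.
    refs.add_reference("volume 2");

    // Deleting snapshot X's record alone must not free volume 2's backing
    // resources while disk B still references them; only once disk B is
    // deleted does the count reach zero and cleanup become safe.
    assert!(refs.drop_reference("volume 2"));
    println!("safe to clean up volume 2");
}
```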
Got it - the TL;DR is that the resource being leaked is "volumes associated with snapshots". If we don't have one already, could we file an issue for this?
Definitely, opened #1632
update sled agent openapi
make it "ensure" instead
Update Snapshot Create saga for new steno changes
Pretty sure this PR addresses all the points in #735
This should be good for a re-review now.
Thanks for all the hard work on this. I think there are a handful of issues we want to follow-up on, but we can certainly do those iteratively, rather than continuing to block this PR (which I imagine is already a pain to rebase at the 3500+ LoC mark).
Thanks again for all the tests!