Skip to content

Conversation

@just-be-dev
Copy link
Contributor

@just-be-dev just-be-dev commented Oct 25, 2022

Fixes #812

Wires up the instance disk create flow by directly appending the disk create saga as a sub saga. I also corrected the behavior of the attach action by ensuring it doesn't spawn a saga in its action but rather just updates the relevant data in the datastore. There's a TODO to make that functionality more robust, but I covered everything that's handled currently by instance attach as defined in nexus... which admittedly isn't much.

While I was in here I also removed some of the unnecessary behavior around looping through the max number of disks and passing the index down as apart of the saga params. Now the disk create or attach params themselves are passed down and we only loop as many times as is needed. This was an artifact of having static sagas previously.

There's still more cleanup to do for NICs/external IPs which I haven't tackled and of course more robust tests are required. I'm going to leave the latter up to @smklein as he's been working on that for disk create already.

@just-be-dev just-be-dev marked this pull request as ready for review October 25, 2022 20:24
@just-be-dev just-be-dev requested a review from smklein October 25, 2022 20:24
Comment on lines -182 to +177
subsaga_builder: steno::DagBuilder,
subsaga_dag: steno::Dag,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On @andrewjstone's recommendation I updated this function to take a dag instead of a builder. It's easy enough to call the builder at the append callsite and this lets us reuse the disk create saga's make_saga_dag function.

@just-be-dev just-be-dev marked this pull request as draft October 27, 2022 03:15
@just-be-dev just-be-dev marked this pull request as ready for review October 27, 2022 17:46
Comment on lines +749 to +754
let (.., authz_disk, _db_disk) = LookupPath::new(&opctx, &datastore)
.project_id(project_id)
.disk_name(&disk_name)
.fetch()
.await
.map_err(ActionError::action_failed)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really shouldn't be looking up disks by name here, but that's all we have. Technically I could do a special lookup for disks that were previously created and grab their IDs, but for disks that are only being attached all we have is what the API provides.

This is a bit of an aside to this issue, but I believe that all names should be resolved inside of http_entrypoints. Once calls flow further down in the system the should be operating solely on IDs. Whenever we begin to implement RFD-322 we should probably consider that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 is that not worth doing here? If not, I wonder if we should keep the comment that was here before (the TODO-security TODO-correctness one that references omicron#1536).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can re-add the comment. I still generally believe that we shouldn't just handle this in a one-off way, but more as a holistic solution. I know that's a non-trivial amount of work, but I still believe it to be important.

@just-be-dev just-be-dev requested a review from smklein October 27, 2022 17:52
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zephraph This looks great! It's really nice to see you creating just enough saga nodes necessary to create and attach disks, and not the MAX. I only had one nit and one question.

This is probably safe to merge, although, I'd like a few others to review. Presumably you tested this all locally and you can now create disks during instance creation :)

}
if attached {
datastore
.instance_attach_disk(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, as I'm not super familiar with this code: Why change the call from using osagactx.nexus().instance_XXX to datastore.instance_XXX?

It seems you basically duplicated the methods from nexus/src/app/instance.rs

Copy link
Contributor Author

@just-be-dev just-be-dev Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a misunderstanding on my part. I'd mistakenly thought that functions off of osagactx.nexus() spawned sagas when that's clearly just an implementation detail. I'll change it back to what it was.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually a correction to that: nexus().instance_attach_disk isn't the best form. It takes names and does lookups which are resolved before calling the datastore method. You're right that we should reuse instance_attach_disk in /nexus/src/app/instance but ideally we'd have a different form that takes pre-resolved arguments like the datastore method.

Ideally we want to resolve as many names as possible at the very start of a saga (or better, before the saga is invoked).

Do folks think I should change it back or is this form acceptable until we're able to follow up with a better interface for attaching? cc @smklein, @davepacheco.

This sort of plays into my previous comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I'd personally err on the side of "doing less name lookups" - the boundary between nexus and datastore is really thin from the point-of-view of sagas. The datastore itself should provide necessary encapsulation for safe DB access, but forcing sagas to go through nexus isn't adding much IMO.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things:

  • Yes, I think at some point we discussed having the Nexus-level functions accept LookupPath objects rather than "names" so that we could use them from places that had ids or names.
  • Yes, I think sagas should use the datastore freely when that's the right level of abstraction.
  • At the same time, it might also be appropriate to use the Nexus-level functions when there's real application-level logic that has to happen. For example, I think it'd be better for sagas that need to update instance state to use the app-level instance_set_runtime(). In this specific case, it seems like the app-level wrapper doesn't do anything but the name resolution and then call the datastore function. So it seems fine.

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at the CI failures? I'm not sure if they're flakes, or related to this PR

}
if attached {
datastore
.instance_attach_disk(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I'd personally err on the side of "doing less name lookups" - the boundary between nexus and datastore is really thin from the point-of-view of sagas. The datastore itself should provide necessary encapsulation for safe DB access, but forcing sagas to go through nexus isn't adding much IMO.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Such a huge improvement, not just to what's here but the API and (I expect) the console, too.

I know @smklein is still working on examples for doing unit tests of sagas. But I think we could still have end-to-end tests that exercise a few key cases here:

  • an instance create that creates a disk, attaches another disk, and succeeds
  • an instance create that creates one disk, attaches another disk, and fails...we should at least verify after this that the created disk was destroyed and the other disk is detached. It'd be nice if we can make sure it fails in a way that did create the first disk and/or attach the other disk so that we're exercising the undo actions.

Even if we also had unit tests already I think we'd still want these basic end-to-end tests.

Comment on lines +749 to +754
let (.., authz_disk, _db_disk) = LookupPath::new(&opctx, &datastore)
.project_id(project_id)
.disk_name(&disk_name)
.fetch()
.await
.map_err(ActionError::action_failed)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 is that not worth doing here? If not, I wonder if we should keep the comment that was here before (the TODO-security TODO-correctness one that references omicron#1536).

}
if attached {
datastore
.instance_attach_disk(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things:

  • Yes, I think at some point we discussed having the Nexus-level functions accept LookupPath objects rather than "names" so that we could use them from places that had ids or names.
  • Yes, I think sagas should use the datastore freely when that's the right level of abstraction.
  • At the same time, it might also be appropriate to use the Nexus-level functions when there's real application-level logic that has to happen. For example, I think it'd be better for sagas that need to update instance state to use the app-level instance_set_runtime(). In this specific case, it seems like the app-level wrapper doesn't do anything but the name resolution and then call the datastore function. So it seems fine.

@just-be-dev
Copy link
Contributor Author

Have you looked at the CI failures? I'm not sure if they're flakes, or related to this PR

I've looked at them, yeah. A snapshot integration test is timing out. I'm trying to dig into what's going on there.

@just-be-dev
Copy link
Contributor Author

an instance create that creates one disk, attaches another disk, and fails...we should at least verify after this that the created disk was destroyed and the other disk is detached. It'd be nice if we can make sure it fails in a way that did create the first disk and/or attach the other disk so that we're exercising the undo actions.

@davepacheco I think the last test I added tests this case sufficiently. It pre-creates two disks, one of which is faulted. It creates the boot disk, attaches the first good disk, and then tries to attach the faulted disk which results in the saga failing. The test tries to ensure that the same number of disks exist in the expected state post cleanup.

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - only a couple minor suggestions for the tests!

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have a comment on the tests saying what they're testing. Either way, this looks good!

Comment on lines +1959 to +1964
let builder =
RequestBuilder::new(client, http::Method::POST, &url_instances)
.body(Some(&instance_params))
.expect_status(Some(http::StatusCode::CREATED));

let response = NexusRequest::new(builder)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can replace most of this with let response = NexusRequest::objects_post(client, &url_instances, &instance_params)) (it's equivalent -- not better or worse aside from less code)

@just-be-dev just-be-dev enabled auto-merge (squash) November 2, 2022 05:07
@just-be-dev just-be-dev merged commit 2b3a643 into main Nov 2, 2022
@just-be-dev just-be-dev deleted the instance-disk-create branch November 2, 2022 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement disk creation during instance creation

6 participants