Conversation

smklein commented Nov 30, 2021

This PR adds two endpoints:

  1. In the Sled Agent, a POST endpoint has been added with the /update path. This can be used to instruct Sled Agent to "please download and apply an artifact".
  2. In Nexus, a GET endpoint has been added with the /artifacts/{path} path. This can be used (by the Sled Agent) to grab an artifact that has been cached in Nexus. A rough sketch of both endpoints follows below.
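
To make the shape of these endpoints concrete, here is a rough, hypothetical sketch of what they could look like as dropshot handlers. The context types, request/path types, and handler names are assumptions for illustration, not the PR's actual code, and the real Nexus endpoint would stream file bytes rather than returning a JSON byte array.

```rust
use std::sync::Arc;

use dropshot::{
    endpoint, HttpError, HttpResponseOk, HttpResponseUpdatedNoContent, Path,
    RequestContext, TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;

/// Assumed shape of the body POSTed to the Sled Agent (the real struct also
/// carries an artifact kind).
#[derive(Clone, Debug, Deserialize, JsonSchema)]
struct UpdateArtifact {
    name: String,
    version: i64,
}

/// Assumed path parameter for the Nexus artifact endpoint.
#[derive(Deserialize, JsonSchema)]
struct ArtifactPath {
    path: String,
}

/// Stand-in server contexts; the real ones live in sled-agent and nexus.
struct SledAgentContext;
struct NexusContext;

/// Sled Agent: "please download and apply this artifact".
#[endpoint {
    method = POST,
    path = "/update",
}]
async fn update_artifact(
    _rqctx: Arc<RequestContext<SledAgentContext>>,
    artifact: TypedBody<UpdateArtifact>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    let _artifact = artifact.into_inner();
    // Real handler: fetch the artifact from Nexus, then apply it locally.
    Ok(HttpResponseUpdatedNoContent())
}

/// Nexus (internal API): serve a cached artifact back to a Sled Agent.
#[endpoint {
    method = GET,
    path = "/artifacts/{path}",
}]
async fn download_artifact(
    _rqctx: Arc<RequestContext<NexusContext>>,
    path: Path<ArtifactPath>,
) -> Result<HttpResponseOk<Vec<u8>>, HttpError> {
    let _path = path.into_inner().path;
    // Real handler: lazily fill the local cache, then return the file's bytes
    // (streamed in practice; a JSON byte array is used here only to keep the
    // sketch simple).
    Ok(HttpResponseOk(Vec::new()))
}
```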

By themselves, these endpoints don't do much, but they should hook into changes being worked on by @iliana.

  • If we notice new artifacts advertised by TUF, we can instruct Sled Agents to get 'em by POST-ing to their /update endpoints.
  • When the Sled Agents try to download those artifacts (by GET-ing back to Nexus), we can lazily fetch whatever update artifacts we need to a cache in Nexus.

let path = vec![path.into_inner().path];

let handler = async {
    for component in &path {
Contributor

If we can find some common logic, maybe this can be shared with find_file in console_api.

Collaborator Author

We're doing something a little quirky here - basically, lazily fetching files that might not (yet) exist locally. Dunno if that's easy to dedup or not, but I'd be interested in finding commonalities between the two paths.

Contributor

Yeah, I see. We bail if we get to a path segment that doesn't exist, while you can keep going through and building up the path regardless. Hard to see how to share that logic in a way that doesn't make both use cases harder to understand.
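
For what it's worth, here's a loose sketch of the two path-walking styles being compared in this thread: a `find_file`-style walker that bails as soon as a segment is missing, versus one that builds the full path regardless and lazily fetches the artifact when it isn't cached yet. `fetch_artifact` is a hypothetical stand-in, not code from this PR or from console_api.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Console-style serving (like `find_file`): bail as soon as a segment is missing.
fn find_file(root: &Path, segments: &[&str]) -> Option<PathBuf> {
    let mut current = root.to_path_buf();
    for segment in segments {
        current.push(segment);
        if !current.exists() {
            return None; // a missing segment means "not found"
        }
    }
    Some(current)
}

/// Artifact-style serving: build the full path regardless, and if the file
/// isn't cached locally yet, fetch it before answering.
fn find_or_fetch(root: &Path, segments: &[&str]) -> io::Result<PathBuf> {
    let mut current = root.to_path_buf();
    for segment in segments {
        current.push(segment);
    }
    if !current.exists() {
        fetch_artifact(&current)?; // lazily populate the cache
    }
    Ok(current)
}

/// Hypothetical stand-in for "download this artifact from the TUF repo".
fn fetch_artifact(_path: &Path) -> io::Result<()> {
    Ok(())
}
```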

smklein changed the title from "[WIP] Serving files from Nexus -> Sled Agent for the update system" to "Serving files from Nexus -> Sled Agent for the update system" on Dec 1, 2021
smklein requested a review from iliana on December 1, 2021 19:46
smklein marked this pull request as ready for review on December 1, 2021 19:46
Comment on lines +28 to +42
// TODO: De-duplicate this struct with the one in iliana's PR?
//
// This should likely be a wrapper around that type.
#[derive(Clone, Debug, Deserialize, JsonSchema)]
pub struct UpdateArtifact {
    pub name: String,
    pub version: i64,
    pub kind: UpdateArtifactKind,
}

// TODO: De-dup me too.
#[derive(Clone, Debug, Deserialize, JsonSchema)]
pub enum UpdateArtifactKind {
    Zone,
}
Contributor

this means moving these out into common?

Collaborator Author

ehhh, if they're present in an endpoint, I can use them from the autogenerated crate.

(if not... yeah, common seems like the right spot).

Comment on lines +75 to +77
// We download the file to a location named "<artifact-name>-<version>".
// We then rename it to "<artifact-name>" after it has successfully
// downloaded, to signify that it is ready for usage.
Contributor

This strikes me as odd -- should future parts of sled-agent be aware of what artifact version it's trying to apply? I feel like there's a possible race here where Nexus tells the sled agent to apply version N+1 of an artifact in the middle of it applying version N.

Collaborator Author

At the end of the day, the sled agent needs to set "something" as active, right? Whether that's "flash the firmware", "write the OS partition", or "update a file".

You have a good point - we should probably add concurrency control around the updates to make them safe amid concurrent requests - but that's kinda what this renaming silliness is all about:

  1. Download to a "unique" spot. I dunno if the version aspect really matters here, I just wanna make sure we stream all necessary data to the sled before using it.
  2. "Apply" the update atomically (sketched below).
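
A minimal sketch of the download-then-rename pattern described above, assuming a tokio/reqwest-based agent; the helper name and signature are illustrative rather than the PR's actual code.

```rust
use std::path::Path;
use tokio::io::AsyncWriteExt;

/// Illustrative helper: download an artifact and atomically "apply" it.
async fn download_and_apply(
    client: &reqwest::Client,
    url: &str,
    dir: &Path,
    name: &str,
    version: i64,
) -> anyhow::Result<()> {
    // 1. Stream to "<artifact-name>-<version>" so a partially written file
    //    can never be mistaken for a usable artifact.
    let tmp_path = dir.join(format!("{}-{}", name, version));
    let mut tmp = tokio::fs::File::create(&tmp_path).await?;
    let bytes = client.get(url).send().await?.bytes().await?;
    tmp.write_all(&bytes).await?;
    tmp.sync_all().await?;
    drop(tmp);

    // 2. Rename to "<artifact-name>": rename(2) within a filesystem is atomic,
    //    so readers only ever see the old artifact or the complete new one.
    tokio::fs::rename(&tmp_path, dir.join(name)).await?;
    Ok(())
}
```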

davepacheco left a comment

Sorry for the late review here -- I'm not sure what the status is, if we're still doing this, or if this was just for the demo or what. Feel free to ignore if that makes sense.

I'm a little worried about adding both local storage to Nexus and an API to download files from Nexus. A lot of our scalability and HA goals are contingent on Nexus having no local persistent state outside of CockroachDB. And the file server feels like a pretty big surface area for possible attack.

Dataset,
Disk,
DiskAttachment,
DownloadArtifact,
Collaborator

It looks like we're adding / have added a bunch of stuff here that's only appropriate to the internal API. I'm guessing that's because of the way we try to construct external::Errors from diesel errors using the helper function that takes a resource type. Maybe we should have a different helper that uses a different enum for internal stuff? Doesn't have to be done here.

Collaborator Author

Agreed. Filed #532 to track.

"Directory download not supported".to_string(),
));
}
let body = nexus.download_artifact(&entry).await?;
Collaborator

Isn't there a TOCTOU race here, if the path has changed since we checked it?

I think the traditional approach is to use open(2) and then fstat(2). I haven't dug into it but it looks like you can do this in Rust with File::open and then File::metadata.
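
For reference, a small sketch of the open-then-fstat approach suggested here, using `File::open` followed by `File::metadata` so the check runs against the already-open handle instead of a second path lookup.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Open a file and verify it is a regular file, without a separate stat()
/// on the path that could race with a rename or symlink swap.
fn open_regular_file(path: &Path) -> io::Result<File> {
    let file = File::open(path)?;
    // Equivalent to fstat(2): queries the open descriptor, not the path.
    let metadata = file.metadata()?;
    if !metadata.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "not a regular file",
        ));
    }
    Ok(file)
}
```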

Collaborator Author

So I'm looking at the file server - I think that whole example may need a re-work, in addition to this "final entry" check. We also check that the intermediate paths are not symlinks, and I believe this suffers from the same TOCTTOU issue - they could be replaced by symlinks between our check and the eventual open, since the open traverses the whole path again.

In C, I would normally use openat to resolve this issue, but I don't see this function exposed from the Rust standard library (though I suppose I could use libc directly, or one of the wrapper crates).

smklein commented Dec 20, 2021

> Sorry for the late review here -- I'm not sure what the status is, if we're still doing this, or if this was just for the demo or what. Feel free to ignore if that makes sense.

So this PR is currently in a decidedly "demo-ish" state, but this had been the plan for the long-term too.

> I'm a little worried about adding both local storage to Nexus and an API to download files from Nexus. A lot of our scalability and HA goals are contingent on Nexus having no local persistent state outside of CockroachDB. And the file server feels like a pretty big surface area for possible attack.

We are overdue for an RFD here (maybe I can work with @iliana to get one posted?), but Josh, iliana, and I chatted about this (context: 2:00 minutes into https://drive.google.com/file/d/1jzCgku-ZUjpibU66COLkjQk4BNH0tOuR/view).

Effectively:

  • The only Nexus-local storage would be for immutable artifacts
  • Nexus' storage is acting like a cache
  • The implementation should continue operating correctly if we lose / destroy these files arbitrarily
  • We can (and should) re-verify the artifacts on the Sled before applying them

The goal was, for example, that if we're downloading and applying a 1 GB artifact:

  • In the normal case, we should only download the file over the external network once
  • After this, sleds can just request the file from Nexus, which should act as a simple cache

davepacheco commented

> So this PR is currently in a decidedly "demo-ish" state, but this had been the plan for the long-term too. […]

Thanks for that context. Yes, an RFD would be really helpful. Those constraints do help, but I think there remain important details to be worked out around how we manage that cache (e.g., to avoid exhausting local storage). And in general, if an operation makes two requests to Nexus, we should assume that the second request may hit a different Nexus than the first one. It's not clear to me how this approach will work in that case.

None of this addresses the security surface area of having a file server in Nexus. Maybe that's not something we're worried about? On the other hand, we've already seen that that's tricky to get right.

smklein commented Dec 21, 2021

I'm migrating the discussion to RFD 183, which I just committed to: https://github.com/oxidecomputer/rfd/pull/272/commits/c6c7af8984fdded1381b42b6c0faa4ce7f0d48cb
