Conversation

smklein commented Nov 30, 2021

This PR adds two endpoints:

  1. In the Sled Agent, a POST endpoint has been added with the /update path. This can be used to instruct Sled Agent to "please download and apply an artifact".
  2. In Nexus, a GET endpoint has been added with the /artifacts/{path} path. This can be used (by the Sled Agent) to grab an artifact that has been cached in Nexus. A rough sketch of both endpoints follows below.
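
To make the shape of these endpoints concrete, here is a rough, hypothetical sketch of what they could look like as dropshot handlers. The context types, request/path types, and handler names are assumptions for illustration, not the PR's actual code, and the real Nexus endpoint would stream file bytes rather than returning a JSON byte array.

```rust
use std::sync::Arc;

use dropshot::{
    endpoint, HttpError, HttpResponseOk, HttpResponseUpdatedNoContent, Path,
    RequestContext, TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;

/// Assumed shape of the body POSTed to the Sled Agent (the real struct also
/// carries an artifact kind).
#[derive(Clone, Debug, Deserialize, JsonSchema)]
struct UpdateArtifact {
    name: String,
    version: i64,
}

/// Assumed path parameter for the Nexus artifact endpoint.
#[derive(Deserialize, JsonSchema)]
struct ArtifactPath {
    path: String,
}

/// Stand-in server contexts; the real ones live in sled-agent and nexus.
struct SledAgentContext;
struct NexusContext;

/// Sled Agent: "please download and apply this artifact".
#[endpoint {
    method = POST,
    path = "/update",
}]
async fn update_artifact(
    _rqctx: Arc<RequestContext<SledAgentContext>>,
    artifact: TypedBody<UpdateArtifact>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    let _artifact = artifact.into_inner();
    // Real handler: fetch the artifact from Nexus, then apply it locally.
    Ok(HttpResponseUpdatedNoContent())
}

/// Nexus (internal API): serve a cached artifact back to a Sled Agent.
#[endpoint {
    method = GET,
    path = "/artifacts/{path}",
}]
async fn download_artifact(
    _rqctx: Arc<RequestContext<NexusContext>>,
    path: Path<ArtifactPath>,
) -> Result<HttpResponseOk<Vec<u8>>, HttpError> {
    let _path = path.into_inner().path;
    // Real handler: lazily fill the local cache, then return the file's bytes
    // (streamed in practice; a JSON byte array is used here only to keep the
    // sketch simple).
    Ok(HttpResponseOk(Vec::new()))
}
```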

By themselves, these endpoints don't do much, but they should hook into changes being worked on by @iliana.

  • If we notice new artifacts advertised by TUF, we can instruct Sled Agents to get 'em by POST-ing to their /update endpoints.
  • When the Sled Agents try to download those artifacts (by GET-ing back to Nexus), we can lazily fetch whatever update artifacts we need to a cache in Nexus.

let path = vec![path.into_inner().path];

let handler = async {
    for component in &path {
Contributor

If we can find some common logic, maybe this can be shared with find_file in console_api.

Collaborator Author

We're doing something a little quirky here - basically, lazily fetching files that might not (yet) exist locally. Dunno if that's easy to dedup or not, but I'd be interested in finding commonalities between the two paths.

Contributor

Yeah, I see. We bail if we get to a path segment that doesn't exist, while you can keep going through and building up the path regardless. Hard to see how to share that logic in a way that doesn't make both use cases harder to understand.
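
For what it's worth, here's a loose sketch of the two path-walking styles being compared in this thread: a `find_file`-style walker that bails as soon as a segment is missing, versus one that builds the full path regardless and lazily fetches the artifact when it isn't cached yet. `fetch_artifact` is a hypothetical stand-in, not code from this PR or from console_api.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Console-style serving (like `find_file`): bail as soon as a segment is missing.
fn find_file(root: &Path, segments: &[&str]) -> Option<PathBuf> {
    let mut current = root.to_path_buf();
    for segment in segments {
        current.push(segment);
        if !current.exists() {
            return None; // a missing segment means "not found"
        }
    }
    Some(current)
}

/// Artifact-style serving: build the full path regardless, and if the file
/// isn't cached locally yet, fetch it before answering.
fn find_or_fetch(root: &Path, segments: &[&str]) -> io::Result<PathBuf> {
    let mut current = root.to_path_buf();
    for segment in segments {
        current.push(segment);
    }
    if !current.exists() {
        fetch_artifact(&current)?; // lazily populate the cache
    }
    Ok(current)
}

/// Hypothetical stand-in for "download this artifact from the TUF repo".
fn fetch_artifact(_path: &Path) -> io::Result<()> {
    Ok(())
}
```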

smklein changed the title from "[WIP] Serving files from Nexus -> Sled Agent for the update system" to "Serving files from Nexus -> Sled Agent for the update system" on Dec 1, 2021
smklein requested a review from iliana on December 1, 2021 19:46
smklein marked this pull request as ready for review on December 1, 2021 19:46
Comment on lines +28 to +42
// TODO: De-duplicate this struct with the one in iliana's PR?
//
// This should likely be a wrapper around that type.
#[derive(Clone, Debug, Deserialize, JsonSchema)]
pub struct UpdateArtifact {
    pub name: String,
    pub version: i64,
    pub kind: UpdateArtifactKind,
}

// TODO: De-dup me too.
#[derive(Clone, Debug, Deserialize, JsonSchema)]
pub enum UpdateArtifactKind {
    Zone,
}
Contributor

this means moving these out into common?

Collaborator Author

ehhh, if they're present in an endpoint, I can use them from the autogenerated crate.

(if not... yeah, common seems like the right spot).

Comment on lines +75 to +77
// We download the file to a location named "<artifact-name>-<version>".
// We then rename it to "<artifact-name>" after it has successfully
// downloaded, to signify that it is ready for usage.
Contributor

This strikes me as odd -- should future parts of sled-agent be aware of what artifact version it's trying to apply? I feel like there's a possible race here where Nexus tells the sled agent to apply version N+1 of an artifact in the middle of it applying version N.

Collaborator Author

At the end of the day, the sled agent needs to set "something" as active, right? Whether that's "flash the firmware", "write the OS partition", or "update a file".

You have a good point - we should probably add concurrency control around the updates to make them safe amid concurrent requests - but that's kinda what this renaming silliness is all about:

  1. Download to a "unique" spot. I dunno if the version aspect really matters here, I just wanna make sure we stream all necessary data to the sled before using it.
  2. "Apply" the update atomically (sketched below).
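
A minimal sketch of the download-then-rename pattern described above, assuming a tokio/reqwest-based agent; the helper name and signature are illustrative rather than the PR's actual code.

```rust
use std::path::Path;
use tokio::io::AsyncWriteExt;

/// Illustrative helper: download an artifact and atomically "apply" it.
async fn download_and_apply(
    client: &reqwest::Client,
    url: &str,
    dir: &Path,
    name: &str,
    version: i64,
) -> anyhow::Result<()> {
    // 1. Stream to "<artifact-name>-<version>" so a partially written file
    //    can never be mistaken for a usable artifact.
    let tmp_path = dir.join(format!("{}-{}", name, version));
    let mut tmp = tokio::fs::File::create(&tmp_path).await?;
    let bytes = client.get(url).send().await?.bytes().await?;
    tmp.write_all(&bytes).await?;
    tmp.sync_all().await?;
    drop(tmp);

    // 2. Rename to "<artifact-name>": rename(2) within a filesystem is atomic,
    //    so readers only ever see the old artifact or the complete new one.
    tokio::fs::rename(&tmp_path, dir.join(name)).await?;
    Ok(())
}
```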

davepacheco left a comment

Sorry for the late review here -- I'm not sure what the status is, if we're still doing this, or if this was just for the demo or what. Feel free to ignore if that makes sense.

I'm a little worried about adding both local storage to Nexus and an API to download files from Nexus. A lot of our scalability and HA goals are contingent on Nexus having no local persistent state outside of CockroachDB. And the file server feels like a pretty big surface area for possible attack.

Dataset,
Disk,
DiskAttachment,
DownloadArtifact,
Collaborator

It looks like we're adding / have added a bunch of stuff here that's only appropriate to the internal API. I'm guessing that's because of the way we try to construct external::Errors from diesel errors using the helper function that takes a resource type. Maybe we should have a different helper that uses a different enum for internal stuff? Doesn't have to be done here.

Collaborator Author

Agreed. Filed #532 to track.

"Directory download not supported".to_string(),
));
}
let body = nexus.download_artifact(&entry).await?;
Collaborator

Isn't there a TOCTOU race here, if the path has changed since we checked it?

I think the traditional approach is to use open(2) and then fstat(2). I haven't dug into it but it looks like you can do this in Rust with File::open and then File::metadata.
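
For reference, a small sketch of the open-then-fstat approach suggested here, using `File::open` followed by `File::metadata` so the check runs against the already-open handle instead of a second path lookup.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Open a file and verify it is a regular file, without a separate stat()
/// on the path that could race with a rename or symlink swap.
fn open_regular_file(path: &Path) -> io::Result<File> {
    let file = File::open(path)?;
    // Equivalent to fstat(2): queries the open descriptor, not the path.
    let metadata = file.metadata()?;
    if !metadata.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "not a regular file",
        ));
    }
    Ok(file)
}
```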

Collaborator Author

So I'm looking at the file server - I think that whole example may need a re-work, in addition to this "final entry" check. We also check that the intermediate paths are not symlinks, and I believe this suffers from the same TOCTTOU issue - they could be replaced by symlinks between our check and the eventual open, since the open traverses the whole path again.

In C, I would normally use openat to resolve this issue, but I don't see this function exposed from the Rust standard library (though I suppose I could use libc directly, or one of the wrapper crates).

smklein commented Dec 20, 2021

> Sorry for the late review here -- I'm not sure what the status is, if we're still doing this, or if this was just for the demo or what. Feel free to ignore if that makes sense.

So this PR is currently in a decidedly "demo-ish" state, but this had been the plan for the long-term too.

> I'm a little worried about adding both local storage to Nexus and an API to download files from Nexus. A lot of our scalability and HA goals are contingent on Nexus having no local persistent state outside of CockroachDB. And the file server feels like a pretty big surface area for possible attack.

We are overdue for an RFD here (maybe I can work with @iliana to get one posted?), but Josh, iliana, and I chatted about this (context: 2:00 minutes into https://drive.google.com/file/d/1jzCgku-ZUjpibU66COLkjQk4BNH0tOuR/view).

Effectively:

  • The only Nexus-local storage would be for immutable artifacts
  • Nexus' storage is acting like a cache
  • The implementation should continue operating correctly if we lose / destroy these files arbitrarily
  • We can (and should) re-verify the artifacts on the Sled before applying them

The goal was, for example, that if we're downloading and applying a 1 GB artifact:

  • In the normal case, we should only download the file over the external network once
  • After this, sleds can just request the file from Nexus, which should act as a simple cache

davepacheco commented

> So this PR is currently in a decidedly "demo-ish" state, but this had been the plan for the long-term too. […]

Thanks for that context. Yes, an RFD would be really helpful. Those constraints do help, but I think there remain important details to be worked out around how we manage that cache (e.g., to avoid exhausting local storage). And in general, if an operation makes two requests to Nexus, we should assume that the second request may hit a different Nexus than the first one. It's not clear to me how this approach will work in that case.

None of this addresses the security surface area of having a file server in Nexus. Maybe that's not something we're worried about? On the other hand, we've already seen that that's tricky to get right.

smklein commented Dec 21, 2021

I'm migrating the discussion to RFD 183, which I just committed to: https://github.com/oxidecomputer/rfd/pull/272/commits/c6c7af8984fdded1381b42b6c0faa4ce7f0d48cb
