-
Notifications
You must be signed in to change notification settings - Fork 62
Serving files from Nexus -> Sled Agent for the update system #457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let path = vec![path.into_inner().path]; | ||
|
|
||
| let handler = async { | ||
| for component in &path { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we can find a common logic maybe this can be shared with find_file in console_api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're doing something a little quirky here - basically, lazily fetching files that might not (yet) exist locally. Dunno if that' easy to dedup or not, but I'd be interested in finding commonalities between the two paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see. We bail if we get to a path segment that doesn't exist, while you can keep going through and building up the path regardless. Hard to see how to share that logic in a way that doesn't make both use cases harder to understand.
| // TODO: De-duplicate this struct with the one in iliana's PR? | ||
| // | ||
| // This should likely be a wrapper around that type. | ||
| #[derive(Clone, Debug, Deserialize, JsonSchema)] | ||
| pub struct UpdateArtifact { | ||
| pub name: String, | ||
| pub version: i64, | ||
| pub kind: UpdateArtifactKind, | ||
| } | ||
|
|
||
| // TODO: De-dup me too. | ||
| #[derive(Clone, Debug, Deserialize, JsonSchema)] | ||
| pub enum UpdateArtifactKind { | ||
| Zone, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means moving these out into common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ehhh it they're present in an endpoint, I can use them from the autogenerated crate.
(if not.. yeah, common seems like the right spot).
| // We download the file to a location named "<artifact-name>-<version>". | ||
| // We then rename it to "<artifact-name>" after it has successfully | ||
| // downloaded, to signify that it is ready for usage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this strikes me as odd -- should future parts of sled-agent be aware of what artifact version it's trying to apply? i feel like there's a possible race here where nexus tells the sled agent to apply version N+1 of an artifact in the middle of it applying version N
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the end of the day, the sled agent needs to set "something" as active, right? Whether that's "flash the firmware", "write the OS partition", or "update a file".
You have a good point - we should probably add concurrency control around the updates to make them safe amid concurrent requests - but that's kinda what this renaming silliness is all about:
- Download to a "unique" spot. I dunno if the version aspect really matters here, I just wanna make sure stream all necessary data to the sled before using it.
- "Apply" the update atomically
davepacheco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review here -- I'm not sure what the status is, if we're still doing this, or if this was just for the demo or what. Feel free to ignore if that makes sense.
I'm a little worried about adding both local storage to Nexus and an API to download files from Nexus. A lot of our scalability and HA goals are contingent on Nexus having no local persistent state outside of CockroachDB. And the file server feels like a pretty big surface area for possible attack.
| Dataset, | ||
| Disk, | ||
| DiskAttachment, | ||
| DownloadArtifact, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're adding / have added a bunch of stuff here that's only appropriate to the internal API. I'm guessing that's because of the way we try to construct external::Errors from diesel errors using the helper function that takes a resource type. Maybe we should have a different helper that uses a different enum for internal stuff? Doesn't have to be done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Filed #532 to track.
| "Directory download not supported".to_string(), | ||
| )); | ||
| } | ||
| let body = nexus.download_artifact(&entry).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a TOCTOU race here, if the path has changed since we checked it?
I think the traditional approach is to use open(2) and then fstat(2). I haven't dug into it but it looks like you can do this in Rust with File::open and then File::metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... this may need an update: https://github.com/oxidecomputer/dropshot/blob/main/dropshot/examples/file_server.rs#L132
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm looking at the file server - I think that whole example may need a re-work, in addition to this "final entry" check. We also check that the intermediate paths are not symlinks, and I believe this suffers from the same TOCTTOU issue - they could be replaced by symlinks, since we're re-checking the whole path again.
In C, I would normally use openat to resolve this issue, but I don't see this function exposed from the Rust standard library (though I suppose I could use libc directly, or one of the wrapper crates).
So this PR is currently in a decidedly "demo-ish" state, but this had been the plan for the long-term too.
We are overdue for an RFD here (maybe I can work with @iliana to get one posted?) but Josh / iliana / myself chatted about this (context: 2:00 minutes into https://drive.google.com/file/d/1jzCgku-ZUjpibU66COLkjQk4BNH0tOuR/view ). Effectively:
The goal was, for example, if we're applying a download of a 1GB file:
|
Thanks for that context. Yes, an RFD would be really helpful. Those constraints do help, but I think there remain important details to be worked out around how we manage that cache (e.g., to avoid exhausting local storage). And in general, if an operation makes two requests to Nexus, we should assume that the second request may hit a different Nexus than the first one. It's not clear to me how this approach will work in that case. None of this addresses the security surface area of having a file server in Nexus. Maybe that's not something we're worried about? On the other hand, we've already seen that that's tricky to get right. |
|
I'm migrating the discussion to RFD 183, which I just committed to: https://github.com/oxidecomputer/rfd/pull/272/commits/c6c7af8984fdded1381b42b6c0faa4ce7f0d48cb |
This PR adds two endpoints:
POSTendpoint has been added with the/updatepath. This can be used to instruct Sled Agent to "please download and apply an artifact".GETendpoint has been added with the/artifacts/{path}path. This can be used (by the Sled Agent) to grab an artifact that has been cached in Nexus.By themselves, these endpoints don't do much, but they should hook into changes worked on by @iliana .
POST-ing to their/updateendpoints.GET-ing back to Nexus), we can lazily fetch whatever update artifacts we need to a cache in Nexus.