-
Notifications
You must be signed in to change notification settings - Fork 58
[1/n] create a nexus-lockstep service #8983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I focused mostly on the OmicronZoneType
(and related) changes and skimmed the rest - looks good to me. I'm always a bit nervous changing these bits - probably worth testing a before -> mupdate -> after to make sure we didn't miss anything w.r.t. serialized ledgers? (Also fine to defer that testing to one of the follow PRs that build on this, if we want to stack a few and land them all together.)
bind_address = "10.1.2.3:4568" | ||
default_request_body_max_bytes = 1024 | ||
[deployment.dropshot_debug] | ||
bind_address = "10.1.2.3:4569" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows a full socketaddr, but the DNS helper function only allows supplying a debug port. Is the expectation that the IP for this must match dropshot_internal
's? (Do we check that at runtime? Still reading the PR...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This object is a dropshot::ConfigDropshot
, so yeah unfortunately that’s the way it is. I think I check that they have the same IP at runtime but I should double-check all the calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, Nexus now should fail to start if the IPs don't match.
use sled_hardware_types::{Baseboard, SledCpuFamily}; | ||
|
||
/// Identity and basic status information about this sled agent | ||
#[derive(Deserialize, Serialize, JsonSchema)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooof on this whole file. Thanks for going through this tedium.
assert_eq!(new_srv.target, new_zone_host.fqdn()); | ||
assert_eq!(changed.len(), 2); | ||
assert_eq!(changed[0].0, ServiceName::NexusDebug.dns_name()); | ||
assert_eq!(changed[1].0, ServiceName::Nexus.dns_name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the order of changed
guaranteed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the underlying map is a BTreeMap. (As I wrote this my sense for flakes was tingling as well so I checked!)
/// Dropshot configuration for lockstep API server. | ||
#[schemars(skip)] // TODO we're protected against dropshot changes | ||
pub dropshot_lockstep: ConfigDropshot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid this might be a flag day between Nexus and Sled Agent (because it writes out this config file). But I guess it's okay because Sled Agent is always updated before Nexus, so it will just start supplying this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it would be. I think if we're not planning on running Nexus-driven update on dogfood before this lands, we're fine, but otherwise I could make this an Option
and bubble that all the way through? Would be obnoxious but theoretically fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's basically fine as-is because we always update sled agent before Nexus and the change is backwards-compatible.
internal_address: SocketAddrV6, | ||
/// The port at which the internal lockstep server is reachable. This | ||
/// shares the same IP address with `internal_address`. | ||
#[serde(default = "default_nexus_lockstep_port")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've usually avoided this sort of thing. I guess to avoid it you'd need to define a separate struct here with a From
impl, which may not be worth it. But can we at least figure out when it would be safe to remove this (if ever?), document that, and file an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We similarly used #[serde(default)]
to add the image_source
field to OmicronZoneConfig
as a non-breaking change.
It would be nice to clean both up but I'm unsure when it's safe. Presumably after an upgrade to R17 and setting the target version, all sled-agent ledgers will gain the field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not positive. Would you mind filing an issue and raising it on the update call today? (Hopefully a brief discussion.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this file, and is it a problem that it's changed? I'm guessing it's okay because there's a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t really have any idea what this file is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the JSON schema for the on-disk, ledgered zones config from before we did the work to combine zones+datasets+disks into OmicronSledConfig
. I think this change is fine:
a) it has a default so we can still deserialize old versions missing this field
b) I think we've updated all systems past the point where we might have any of these left behind anyway?
The only use of this is in a test for the config-reconciler's "convert from legacy ledgers" stuff.
First step of #8902. It's enough work to get Nexus to stand up another HTTP service that this is worth its own PR ahead of moving APIs out of nexus-internal and into nexus-lockstep.
First step of #8902. It's enough work to get Nexus to stand up another HTTP service that this is worth its own PR ahead of moving APIs out of nexus-internal and into nexus-lockstep.