Skip to content

Conversation

iliana
Copy link
Contributor

@iliana iliana commented Sep 3, 2025

First step of #8902. It's enough work to get Nexus to stand up another HTTP service that this is worth its own PR ahead of moving APIs out of nexus-internal and into nexus-lockstep.

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I focused mostly on the OmicronZoneType (and related) changes and skimmed the rest - looks good to me. I'm always a bit nervous changing these bits - probably worth testing a before -> mupdate -> after to make sure we didn't miss anything w.r.t. serialized ledgers? (Also fine to defer that testing to one of the follow PRs that build on this, if we want to stack a few and land them all together.)

bind_address = "10.1.2.3:4568"
default_request_body_max_bytes = 1024
[deployment.dropshot_debug]
bind_address = "10.1.2.3:4569"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows a full socketaddr, but the DNS helper function only allows supplying a debug port. Is the expectation that the IP for this must match dropshot_internal's? (Do we check that at runtime? Still reading the PR...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This object is a dropshot::ConfigDropshot, so yeah unfortunately that’s the way it is. I think I check that they have the same IP at runtime but I should double-check all the calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Nexus now should fail to start if the IPs don't match.

use sled_hardware_types::{Baseboard, SledCpuFamily};

/// Identity and basic status information about this sled agent
#[derive(Deserialize, Serialize, JsonSchema)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooof on this whole file. Thanks for going through this tedium.

assert_eq!(new_srv.target, new_zone_host.fqdn());
assert_eq!(changed.len(), 2);
assert_eq!(changed[0].0, ServiceName::NexusDebug.dns_name());
assert_eq!(changed[1].0, ServiceName::Nexus.dns_name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order of changed guaranteed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the underlying map is a BTreeMap. (As I wrote this my sense for flakes was tingling as well so I checked!)

@iliana iliana changed the title [1/n] create a nexus-debug service [1/n] create a nexus-lockstep service Sep 16, 2025
Comment on lines +177 to +179
/// Dropshot configuration for lockstep API server.
#[schemars(skip)] // TODO we're protected against dropshot changes
pub dropshot_lockstep: ConfigDropshot,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this might be a flag day between Nexus and Sled Agent (because it writes out this config file). But I guess it's okay because Sled Agent is always updated before Nexus, so it will just start supplying this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would be. I think if we're not planning on running Nexus-driven update on dogfood before this lands, we're fine, but otherwise I could make this an Option and bubble that all the way through? Would be obnoxious but theoretically fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's basically fine as-is because we always update sled agent before Nexus and the change is backwards-compatible.

internal_address: SocketAddrV6,
/// The port at which the internal lockstep server is reachable. This
/// shares the same IP address with `internal_address`.
#[serde(default = "default_nexus_lockstep_port")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've usually avoided this sort of thing. I guess to avoid it you'd need to define a separate struct here with a From impl, which may not be worth it. But can we at least figure out when it would be safe to remove this (if ever?), document that, and file an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We similarly used #[serde(default)] to add the image_source field to OmicronZoneConfig as a non-breaking change.

It would be nice to clean both up but I'm unsure when it's safe. Presumably after an upgrade to R17 and setting the target version, all sled-agent ledgers will gain the field?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not positive. Would you mind filing an issue and raising it on the update call today? (Hopefully a brief discussion.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this file, and is it a problem that it's changed? I'm guessing it's okay because there's a default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t really have any idea what this file is.

Copy link
Contributor

@jgallagher jgallagher Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the JSON schema for the on-disk, ledgered zones config from before we did the work to combine zones+datasets+disks into OmicronSledConfig. I think this change is fine:

a) it has a default so we can still deserialize old versions missing this field
b) I think we've updated all systems past the point where we might have any of these left behind anyway?

The only use of this is in a test for the config-reconciler's "convert from legacy ledgers" stuff.

@iliana iliana merged commit e76962a into main Sep 23, 2025
19 checks passed
@iliana iliana deleted the iliana/nexus-internal-debug branch September 23, 2025 22:56
iliana added a commit that referenced this pull request Sep 24, 2025
leftwo pushed a commit that referenced this pull request Sep 26, 2025
First step of #8902. It's enough work to get Nexus to stand up another
HTTP service that this is worth its own PR ahead of moving APIs out of
nexus-internal and into nexus-lockstep.
leftwo pushed a commit that referenced this pull request Sep 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants