Skip to content

[RSS] Move switch_port table population to a Nexus bg task#10370

Merged
jgallagher merged 4 commits into
mainfrom
john/populate-switch-ports-rpw
May 14, 2026
Merged

[RSS] Move switch_port table population to a Nexus bg task#10370
jgallagher merged 4 commits into
mainfrom
john/populate-switch-ports-rpw

Conversation

@jgallagher
Copy link
Copy Markdown
Contributor

Please start reviewing with the module-level comments in nexus/src/app/background/tasks/populate_switch_ports.rs; they attempt to explain the history and motivation of this change. It's probably the most important thing to review - if any of that reasoning is wrong, the change I'm making in this PR probably is too!

This is (preemptively) removing one of the internal-to-sled-agent uses of the early_networking module, in the hopes that we can eventually remove it entirely. Also closes #3069.

@jgallagher jgallagher added the networking Related to the networking. label May 4, 2026
@internet-diglett
Copy link
Copy Markdown
Contributor

Taking a look at this now

Copy link
Copy Markdown
Contributor

@internet-diglett internet-diglett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread nexus/src/app/background/tasks/populate_switch_ports.rs Outdated
/// fresh populator and drives it directly so we can observe the
/// only-switch0-then-both transition.
#[nexus_test(server = crate::Server, extra_sled_agents = 1)]
async fn test_switch1_comes_up_late(cptestctx: &ControlPlaneTestContext) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice test!

Comment thread nexus/src/app/rack.rs
@@ -297,69 +296,6 @@ impl super::Nexus {
Some(IpNet::from(rack_network_config.rack_subnet).into());
self.datastore().update_rack_subnet(opctx, &rack).await?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we might be also closing #3602 with this change? 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's mostly true, yeah. We did get rid of ExternalPortDiscovery::Static. I didn't want to add a delay to every test to wait for us to discover the ports, though, so we still inject qsfp0 for both switches on test startup:

omicron/nexus/src/lib.rs

Lines 398 to 418 in 5c70790

// In the real rack init handoff, the `rack_network_config` will contain
// at least one configured port, and handoff will only complete
// successfully if the `populate_switch_ports` background task has
// completed successfully to populate the corresponding qsfp* values in
// the `switch_port` table. We could block here until that task
// activates successfully, contacting whatever `dpd` instance has been
// stood up for this test, but in the interest of streamlining this
// handoff (which happens for _every_ Nexus test, including those
// completely uninterested in switch port interaction), we'll insert a
// single qsfp0 for each switch to the db directly.
for which_switch in SwitchSlot::iter() {
datastore
.switch_port_create(
&opctx,
config.deployment.rack_id,
which_switch,
nexus_db_model::Name("qsfp0".parse().unwrap()),
)
.await
.expect("inserted qsfp0");
}

Do you think it's fine to close the issue despite that? Or should we change the test startup path here to wait until the bg task has talked to dendrite the way real RSS does (or at least see how badly that slows down the tests)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay to close it since we've taken it out of the production code path

@jgallagher
Copy link
Copy Markdown
Contributor Author

Made one minor change to not stop on first error in 77605f3

@jgallagher jgallagher merged commit 99b2319 into main May 14, 2026
18 checks passed
@jgallagher jgallagher deleted the john/populate-switch-ports-rpw branch May 14, 2026 17:49
sunshowers added a commit that referenced this pull request May 15, 2026
Working on extending the ls-apis analysis to treat RSS separately from
Sled Agent
(#10318 (comment)),
we realized that certain assumptions that hold for RSS (namely that the
rack has a consistent version) are not upheld for early networking.

Split early networking into its own crate, and make the main
omicron-sled-agent crate depend on it. (Due to #10370, RSS no longer
depends on the new early networking crate.) This means that after the
work to treat RSS separately, any clients depended on by the new
sled-agent-early-networking crate are attributed only to Sled Agent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

networking Related to the networking.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discover switch ports via Dendrite instead of hard coding

2 participants