@leftwo leftwo commented Sep 23, 2025

Crucible changes are:
update to latest `vergen` (#1770)
Update rand dependencies, and fallout from that. (#1764)
[crucible-downstairs] migrate to API traits (#1768)
[crucible-agent] migrate to API trait (#1766)
[crucible-pantry] migrate to API trait (#1767)
Add back job delays in the downstairs with the --lossy flag (#1761)

Propolis changes are:
Crucible update plus a few other dependency changes. (#948)
[2/n] [propolis-server] switch to API trait (#946)
[1/n] add a temporary indent to propolis server APIs (#945)
Handle Intel CPUID leaves 4 and 18h, specialize CPUID for VM shape (#941)
Increase viona receive queue length to 2048 (#935)
Expand viona header pad to account for options (#937)
fix linux p9fs multi message reads (#932)
add a D script to report VMs' CPUID queries (#934)
Update GH actions
Re-enable viona packet data loaning

Alan Hanson added 2 commits September 23, 2025 00:30
Crucible changes are:
update to latest `vergen` (#1770)
Update rand dependencies, and fallout from that. (#1764)
[crucible-downstairs] migrate to API traits (#1768)
[crucible-agent] migrate to API trait (#1766)
[crucible-pantry] migrate to API trait (#1767)
Add back job delays in the downstairs with the --lossy flag (#1761)

Propolis changes are:
Crucible update plus a few other dependency changes. (#948)
[2/n] [propolis-server] switch to API trait (#946)
[1/n] add a temporary indent to propolis server APIs (#945)
Handle Intel CPUID leaves 4 and 18h, specialize CPUID for VM shape (#941)
Increase viona receive queue length to 2048 (#935)
Expand viona header pad to account for options (#937)
fix linux p9fs multi message reads (#932)
add a D script to report VMs' CPUID queries (#934)
Update GH actions
Re-enable viona packet data loaning
@morlandi7 added this to the 17 milestone Sep 25, 2025
smklein and others added 16 commits September 26, 2025 00:29
- Actually update nexus generation within the top-level blueprint and
Nexus zones
- Deploy new and old nexus zones concurrently

# Blueprint Planner
- Automatically determine nexus generation when provisioning new Nexus
zones, based on existing deployed zones
- Update the logic for provisioning nexus zones, to deploy old and new
nexus images side-by-side
- Update the logic for expunging nexus zones, to only do so when running
from a "newer" nexus
- Add a planning stage to bump the top-level "nexus generation", if
appropriate, which would trigger the old Nexuses to quiesce.
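The planner changes above can be sketched roughly as follows. This is a hypothetical illustration, not the Omicron implementation: the `Image`, `NexusZone`, and `should_bump_generation` names are invented, and the real planner considers far more state.

```rust
// Hypothetical sketch: the top-level nexus generation is bumped only once
// every new-image Nexus zone has been provisioned at the next generation
// and is running, which is what triggers old Nexuses to quiesce.
#[derive(Clone, Copy, PartialEq)]
enum Image { Old, New }

struct NexusZone {
    generation: u64,
    image: Image,
    running: bool,
}

fn should_bump_generation(zones: &[NexusZone], current_gen: u64) -> bool {
    let new_zones: Vec<&NexusZone> =
        zones.iter().filter(|z| z.image == Image::New).collect();
    // At least one new-image zone must exist, and all of them must be
    // running at current_gen + 1 before handoff begins.
    !new_zones.is_empty()
        && new_zones.iter().all(|z| z.running && z.generation == current_gen + 1)
}

fn main() {
    let zones = [
        NexusZone { generation: 1, image: Image::Old, running: true },
        NexusZone { generation: 2, image: Image::New, running: true },
        NexusZone { generation: 2, image: Image::New, running: false },
    ];
    // One new-image zone isn't running yet, so no bump.
    println!("bump: {}", should_bump_generation(&zones, 1));
}
```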

Fixes #8843,
#8854
…es being known (#8921)

Expand the set of gates for adds/updates to include the fact that zone
image sources should be known. Add tests for this:

* `cmds-mupdate-update-flow` contains the bulk of testing for this
scenario.
* I had to make tweaks to some tests, particularly to
`cmds-target-release.txt`, in order to start running the test in earnest
from the Artifact state rather than the InstallDataset state.
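The gate being added might look something like this minimal sketch; the `ImageSource`, `Zone`, and `image_sources_known` names are illustrative, not the actual Omicron types.

```rust
// Hypothetical sketch: adds/updates proceed only when every zone's image
// source is a known artifact, not the install dataset.
enum ImageSource {
    InstallDataset,
    Artifact { hash: String },
}

struct Zone {
    source: ImageSource,
}

fn image_sources_known(zones: &[Zone]) -> bool {
    zones.iter().all(|z| matches!(z.source, ImageSource::Artifact { .. }))
}

fn main() {
    let zones = vec![
        Zone { source: ImageSource::Artifact { hash: "abc123".into() } },
        Zone { source: ImageSource::InstallDataset },
    ];
    // One zone is still on the install dataset, so the gate holds.
    println!("proceed: {}", image_sources_known(&zones));
}
```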
Planning reports are contained in `Blueprint`s, which have an ID. Prior
to this PR we duplicated the containing blueprint's ID. This bit
@davepacheco and me in a couple different (admittedly unusual) testing
contexts where we were duplicating blueprints and making changes, not
realizing we produced a new blueprint with a different ID but carrying a
report that still pointed to the original blueprint's ID.

The only thing we lose here is that the display output of the planning
report can no longer say what blueprint it's for, but I think that's
fine - all the places where we want to display a report, we already know
the blueprint ID.
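The shape of the fix can be sketched as below, with invented field and function names; the point is only that the ID lives solely on the containing blueprint.

```rust
// Hypothetical sketch: the planning report no longer stores its own copy
// of the blueprint ID, so a duplicated blueprint can't carry a report
// pointing at the original blueprint's ID.
struct PlanningReport {
    // blueprint_id removed: it duplicated Blueprint::id and drifted when
    // blueprints were copied under a new ID.
    summary: String,
}

struct Blueprint {
    id: u64,
    report: PlanningReport,
}

fn display_report(bp: &Blueprint) -> String {
    // Callers that display a report already know the containing
    // blueprint's ID, so nothing is lost in the output.
    format!("planning report for blueprint {}: {}", bp.id, bp.report.summary)
}

fn main() {
    let bp = Blueprint { id: 7, report: PlanningReport { summary: "no-op".into() } };
    println!("{}", display_report(&bp));
}
```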
First step of #8902. It's enough work to get Nexus to stand up another
HTTP service that this is worth its own PR ahead of moving APIs out of
nexus-internal and into nexus-lockstep.
Finishes the `target-release` `reconfigurator-test`, showing the
simulate update walking through the process of starting new Nexus zones,
waiting for handoff, then expunging the old Nexus zones.

Has two tweaks:

* Fixes a planning report off-by-one bug where we'd claim a zone was
both out of date and expunged (or updated) within the same plan.
* Adds a `set active-nexus-gen N` command to `reconfigurator-cli` to
control Nexus handoff instead of assuming it completes instantly.
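Both tweaks can be summarized with this rough sketch, using invented names. The `set active-nexus-gen N` command mentioned above controls the value passed as `active_gen` in tests; the real planner logic is more involved.

```rust
// Hypothetical sketch of the two tweaks described above.
#[derive(PartialEq, Clone, Copy)]
enum PlanAction { None, Updated, Expunged }

fn may_expunge_old_nexus(active_gen: u64, new_gen: u64) -> bool {
    // Old Nexus zones are expunged only after handoff has actually
    // reached the new generation, instead of assuming it is instant.
    active_gen >= new_gen
}

fn report_out_of_date(needs_update: bool, action: PlanAction) -> bool {
    // Off-by-one fix: a zone updated or expunged within this same plan is
    // no longer also reported as out of date.
    needs_update && action == PlanAction::None
}

fn main() {
    // Handoff hasn't reached generation 2 yet, so old zones stay.
    println!("expunge: {}", may_expunge_old_nexus(1, 2));
    println!("out of date: {}", report_out_of_date(true, PlanAction::Updated));
}
```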

Closes #8478

---------

Co-authored-by: Sean Klein <sean@oxidecomputer.com>
Fixes #9047

---------

Co-authored-by: Alex Plotnick <alex@oxidecomputer.com>
This PR completes the first version of the sans-io trust quorum protocol
implementation.

LRTQ upgrade can now be started via
`Node::coordinate_upgrade_from_lrtq`.
This triggers the coordinating node to start collecting the LRTQ key
shares so that they can be used to construct the LRTQ rack secret via
the bootstore code. After this occurs, a Prepare message is sent out
with this old rack secret encrypted in a manner identical to a normal
reconfiguration. The prepare and commit paths remain the same.

The cluster proptest was updated to sometimes start out with an
existing LRTQ configuration and then upgrade from there. Like normal
reconfigurations, it allows aborting and pre-empting the LRTQ upgrade
with a new attempt at a higher epoch. In production this is how we
"retry" if the coordinating node crashes prior to commit, or, more
accurately, if Nexus can't talk to the coordinating node for some period
of time and just moves on. After the LRTQ upgrade commits, normal
reconfigurations are run.

We also remove unnecessary config related messages in this commit.
Since a `Configuration` does not contain sensitive information it can be
retrieved when Nexus polls the coordinator before it commits. Then Nexus
can save this info and send it in `PrepareAndCommit` messages rather
than having the receiving node try to find a live peer with the config
prior to collecting shares. This is a nice optimization that reduces
protocol complexity a bit. This removal allowed removing the TODO in the
message `match` statement in `Node::handle` and completing the protocol.
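The share-collection phase of the coordinator can be pictured with this very loose sketch; the `UpgradeState` type and `on_lrtq_share` function are invented for illustration and do not mirror the actual trust quorum types.

```rust
// Hypothetical sketch: the coordinator collects LRTQ key shares until it
// has enough to rebuild the old rack secret, then moves to the ordinary
// Prepare path, where that secret travels encrypted like any
// reconfiguration.
#[derive(Debug, PartialEq)]
enum UpgradeState {
    CollectingLrtqShares { collected: usize, threshold: usize },
    Preparing,
}

fn on_lrtq_share(state: UpgradeState) -> UpgradeState {
    match state {
        UpgradeState::CollectingLrtqShares { collected, threshold } => {
            let collected = collected + 1;
            if collected >= threshold {
                // Enough shares to reconstruct the LRTQ rack secret.
                UpgradeState::Preparing
            } else {
                UpgradeState::CollectingLrtqShares { collected, threshold }
            }
        }
        done => done,
    }
}

fn main() {
    let mut state = UpgradeState::CollectingLrtqShares { collected: 0, threshold: 2 };
    for _ in 0..2 {
        state = on_lrtq_share(state);
    }
    println!("{:?}", state);
}
```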
…tion is 1 (#9066)

For customers that are going to continue relying on MUPdate, the planner
should act the same way as it did before self-service update existed. We
ascertain this by looking at whether a target release has ever been set.

Most of the tests no longer require the
`add_zones_with_mupdate_override` config, so add a new
reconfigurator-cli script which specifically tests that config.
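The behavioral split could be expressed as a mode selection like the sketch below; `PlannerMode` and `planner_mode` are invented names and the real condition is tracked differently.

```rust
// Hypothetical sketch: if a target release has never been set, the
// planner behaves exactly as it did before self-service update existed,
// preserving MUPdate-only workflows.
#[derive(Debug, PartialEq)]
enum PlannerMode { MupdateOnly, SelfServiceUpdate }

fn planner_mode(target_release_ever_set: bool) -> PlannerMode {
    if target_release_ever_set {
        PlannerMode::SelfServiceUpdate
    } else {
        // Legacy behavior for customers relying on MUPdate.
        PlannerMode::MupdateOnly
    }
}

fn main() {
    println!("{:?}", planner_mode(false));
}
```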
Fixes #8912

Should be merged after the rest of Nexus quiesce/handoff is complete.
(This also includes #9077 to avoid failing CI.)

---------

Co-authored-by: iliana etaoin <iliana@oxide.computer>
…t even if no target release is set (#9082)

Currently:

* if a target release is set, we go ahead and clear the
remove-mupdate-override instruction from blueprints, regardless of
whether artifacts match
* if no target release is set, we don't do that

This behavior is inconsistent. We shouldn't gate the mupdate override
part of the state machine on a target release not being set.
See #9071 for context; this is the short/medium-term fix proposed in
that issue.
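The fix amounts to dropping one input from the decision, as in this sketch with illustrative names:

```rust
// Hypothetical sketch: clearing the remove-mupdate-override instruction
// now depends only on whether artifacts match, not on whether a target
// release is set.
fn should_clear_mupdate_override(artifacts_match: bool) -> bool {
    // Before the fix, this decision was also gated on a target release
    // being set, which was the inconsistency described above.
    artifacts_match
}

fn main() {
    println!("clear: {}", should_clear_mupdate_override(true));
}
```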

leftwo commented Sep 26, 2025

Oh, what the heck... this was supposed to be a merge with main, and now it's pandemonium...


leftwo commented Sep 26, 2025

Okay, there we go, now the diffs are just what I wanted.

@leftwo merged commit d71f5f9 into main Sep 26, 2025
18 checks passed
@leftwo deleted the alan/crucible-update-propolis branch September 26, 2025 20:40