
Prune potential sled reservations that cannot work #10353

Merged
jmpesp merged 4 commits into oxidecomputer:main from jmpesp:sled_reservation_stuck on May 1, 2026

Conversation

@jmpesp jmpesp (Contributor) commented May 1, 2026

Allocating local storage for disks currently occurs during sled reservation: the decision to place a VMM on a physical sled has to take into account the storage available for disks backed by local storage.

This is done by taking a snapshot of information about a sled's zpools, figuring out an assignment of the requested local storage to those zpools (subject to constraints such as available space), and attempting the sled reservation insert query, which checks that all constraints are met before inserting both the VMM record and the local storage allocations.

For sled reservations where local storage allocations are required, an iterator is created for each sled target; Nexus uses it to try every possible mapping of the requested local storage onto the zpools available on that sled target, as sketched below. Once all combinations have been exhausted, Nexus tries the next potential sled target, or returns that the sled reservation cannot succeed.
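
As a rough illustration of the shape of that search (a sketch only: `LocalStorageRequest`, `ZpoolSnapshot`, and `candidate_mappings` are hypothetical names, not the actual omicron types):

```rust
// Sketch only: these types and names are hypothetical stand-ins for
// the actual omicron code.
#[derive(Clone)]
struct LocalStorageRequest {
    size: i64,
}

#[derive(Clone)]
struct ZpoolSnapshot {
    free_bytes: i64,
}

/// Enumerate every assignment of requests to zpools (as indices into
/// `zpools`) in which each zpool has room for everything assigned to it.
fn candidate_mappings(
    requests: &[LocalStorageRequest],
    zpools: &[ZpoolSnapshot],
) -> Vec<Vec<usize>> {
    fn recurse(
        requests: &[LocalStorageRequest],
        zpools: &[ZpoolSnapshot],
        used: &mut [i64],
        current: &mut Vec<usize>,
        results: &mut Vec<Vec<usize>>,
    ) {
        if current.len() == requests.len() {
            results.push(current.clone());
            return;
        }
        let request = &requests[current.len()];
        for i in 0..zpools.len() {
            // Only consider zpools that still have room according to
            // the snapshot this search was started with.
            if used[i] + request.size <= zpools[i].free_bytes {
                used[i] += request.size;
                current.push(i);
                recurse(requests, zpools, used, current, results);
                current.pop();
                used[i] -= request.size;
            }
        }
    }

    let mut results = Vec::new();
    recurse(
        requests,
        zpools,
        &mut vec![0; zpools.len()],
        &mut Vec::new(),
        &mut results,
    );
    results
}
```

With R requests and Z zpools the search space is on the order of Z^R mappings, which is why exhausting it against stale data can take long enough to look like a hang.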

In the uncontended case, the first attempted sled reservation insert succeeds. In the contended case, however, another sled reservation may have succeeded in the meantime, allocating local storage and invalidating the snapshot the iterator is using to produce potential allocation mappings. Prior to this commit, if no possible allocation mapping could work (for example, if no zpool had any space left), Nexus would try them all anyway: the iterator was working with out-of-date information. The result was `instance-start` sagas that looked stuck but were really just taking a very long time to explore every permutation.

This commit adds a pruning step: on each iteration, Nexus requests a fresh snapshot of the information about a sled's zpools and uses it to prune the search space of remaining combinations. A test (`local_storage_allocation_full_rack_concurrent`) was also added that successfully reproduced the "stuck" scenario.
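
The pruning step amounts to re-checking the remaining candidates against fresh data. A minimal sketch, reusing the hypothetical types from the sketch above (again, not the actual omicron implementation):

```rust
/// Sketch only: drop any remaining mapping that a fresh zpool snapshot
/// says can no longer fit.
fn prune_mappings(
    requests: &[LocalStorageRequest],
    mappings: Vec<Vec<usize>>,
    fresh_zpools: &[ZpoolSnapshot],
) -> Vec<Vec<usize>> {
    mappings
        .into_iter()
        .filter(|mapping| {
            // Recompute what each zpool would receive under this mapping
            // and reject it as soon as any zpool would be over capacity.
            let mut used = vec![0i64; fresh_zpools.len()];
            mapping.iter().zip(requests).all(|(&zpool, request)| {
                used[zpool] += request.size;
                used[zpool] <= fresh_zpools[zpool].free_bytes
            })
        })
        .collect()
}
```

When contention has consumed all the space, the filter empties the candidate list immediately and the search moves on to the next sled target instead of grinding through doomed permutations.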

@jmpesp jmpesp requested a review from smklein May 1, 2026 00:49
@hawkw hawkw (Member) left a comment

some nitpicks, but otherwise, nice fix!

Comment thread on nexus/db-queries/src/db/datastore/sled.rs (outdated), lines +1292 to +1300:
let allocations = match complete_allocation_lists.next() {
    Some(allocations) => allocations,

    None => {
        // All done searching, nothing worked. Try another
        // sled!
        break;
    }
};
@hawkw (Member):

Suggested change:

-    let allocations = match complete_allocation_lists.next() {
-        Some(allocations) => allocations,
-        None => {
-            // All done searching, nothing worked. Try another
-            // sled!
-            break;
-        }
-    };
+    let Some(allocations) = complete_allocation_lists.next() else {
+        // All done searching, nothing worked. Try another
+        // sled!
+        break;
+    };

Comment thread

    logctx.cleanup_successful();
}

/// Ensure that a full rack can have one VMM take all the U2s on each sled,
@hawkw (Member):

Suggested change:

-/// Ensure that a full rack can have one VMM take all the U2s on each sled,
+/// Ensure that a full rack can have one VMM take all the U.2s on each sled,


Comment thread

impl ZpoolGetForSledReservationResult {
    /// Does this Zpool have room for additional bytes to be allocated to it?
    pub fn has_room_for_allocation(&self, additional_size: i64) -> bool {
@hawkw (Member):

factoring this out is nice, good call!
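
For context, the excerpt above cuts off before the function body. A minimal sketch of what such a check could look like, with `size_used` and `total_size` as assumed field names (the actual fields of `ZpoolGetForSledReservationResult` are not shown in this excerpt):

```rust
// Hedged sketch only: `size_used` and `total_size` are assumptions,
// not the actual ZpoolGetForSledReservationResult fields.
pub struct ZpoolGetForSledReservationResult {
    pub size_used: i64,
    pub total_size: i64,
}

impl ZpoolGetForSledReservationResult {
    /// Does this Zpool have room for additional bytes to be allocated to it?
    pub fn has_room_for_allocation(&self, additional_size: i64) -> bool {
        self.size_used + additional_size <= self.total_size
    }
}
```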

Comment thread

// An incomplete allocation list has a set of local storage
// allocations that were matched to zpools with available space:
//
// | A -> Z | A -> Z | A -> Z |
@hawkw (Member):

this could maybe explain what A -> Z means... using context clues I have inferred that it's "allocation to zpool" and not "alphabetical order", which would be a little nonsensical

@jmpesp (Contributor, Author):

changed in 90f44ae, lmk

Four more comment threads on nexus/db-queries/src/db/datastore/sled.rs (outdated).
@askfongjojo askfongjojo added this to the 19 milestone May 1, 2026
@smklein smklein (Collaborator) left a comment

Thanks for the fixes!

@jmpesp jmpesp enabled auto-merge (squash) May 1, 2026 16:36
@jmpesp jmpesp merged commit 2647983 into oxidecomputer:main May 1, 2026
16 checks passed
iliana pushed a commit that referenced this pull request May 5, 2026