Prune potential sled reservations that cannot work #10353
Merged
jmpesp merged 4 commits into oxidecomputer:main on May 1, 2026
Conversation
Allocating local storage for disks currently occurs during sled reservation: the decision of placing a VMM on a physical sled has to take into account the available storage for disks backed by local storage. This is done by taking a snapshot of information related to those zpools, figuring out an assignment of requested local storage to those zpools (according to some constraints, like available space), and attempting the sled reservation insert query, which will check that all constraints are met before inserting both the VMM record and the local storage allocations.

For sled reservations where local storage allocations are required, an iterator is created for each sled target that results in Nexus trying all possible mappings of requested local storage to the zpools available for that sled target. Once all possible combinations have been tried, Nexus will try the next potential sled target, or return that the sled reservation cannot succeed.

In the uncontended case, the first attempted sled reservation insert will succeed. In the contended case, however, another sled reservation could have succeeded in the meantime, allocating local storage and invalidating the snapshot that the iterator is using to produce all potential allocation mappings. Prior to this commit, if no possible allocation mapping could work (for example, if no zpool had any space left), Nexus would end up trying them all anyway: the iterator was working with out-of-date information. This resulted in `instance-start` sagas that looked stuck, but in reality were just taking a long time exploring all possible permutations.

This commit adds a pruning step: on each iteration, Nexus requests a new snapshot of information related to a sled's zpools and uses it to prune the search space of possible combinations. A test was also added (`local_storage_allocation_full_rack_concurrent`) that successfully reproduced the "stuck" scenario.
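For illustration, here is a minimal sketch of the pruning idea; the types and function below (`ZpoolSnapshot`, `Mapping`, `prune_candidates`) are hypothetical stand-ins, not the actual Nexus code:

```rust
// Hypothetical sketch of the pruning step: re-fetch the zpool free-space
// snapshot, then drop candidate allocation mappings that can no longer fit.
use std::collections::HashMap;

/// Point-in-time view of a zpool's free space (illustrative).
struct ZpoolSnapshot {
    free_bytes: i64,
}

/// One requested local-storage allocation mapped to a zpool (illustrative).
struct Mapping {
    zpool_id: u32,
    requested_bytes: i64,
}

/// Drop any candidate mapping that cannot possibly fit given a fresh
/// snapshot of the sled's zpools, so the search loop does not keep retrying
/// combinations that are already known to fail.
fn prune_candidates(
    candidates: Vec<Vec<Mapping>>,
    fresh_snapshot: &HashMap<u32, ZpoolSnapshot>,
) -> Vec<Vec<Mapping>> {
    candidates
        .into_iter()
        .filter(|mappings| {
            // Sum the bytes each candidate would place on each zpool...
            let mut needed: HashMap<u32, i64> = HashMap::new();
            for m in mappings {
                *needed.entry(m.zpool_id).or_default() += m.requested_bytes;
            }
            // ...and keep the candidate only if every zpool still has room.
            needed.iter().all(|(zpool_id, bytes)| {
                fresh_snapshot
                    .get(zpool_id)
                    .map(|z| z.free_bytes >= *bytes)
                    .unwrap_or(false)
            })
        })
        .collect()
}
```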
hawkw reviewed on May 1, 2026

hawkw (Member) left a comment:
some nitpicks, but otherwise, nice fix!
Comment on lines +1292 to +1300:
```rust
let allocations = match complete_allocation_lists.next() {
    Some(allocations) => allocations,

    None => {
        // All done searching, nothing worked. Try another
        // sled!
        break;
    }
};
```
Suggested change:

```diff
-let allocations = match complete_allocation_lists.next() {
-    Some(allocations) => allocations,
-    None => {
-        // All done searching, nothing worked. Try another
-        // sled!
-        break;
-    }
-};
+let Some(allocations) = complete_allocation_lists.next() else {
+    // All done searching, nothing worked. Try another
+    // sled!
+    break;
+};
```
```rust
    logctx.cleanup_successful();
}

/// Ensure that a full rack can have one VMM take all the U2s on each sled,
```
Suggested change:

```diff
-/// Ensure that a full rack can have one VMM take all the U2s on each sled,
+/// Ensure that a full rack can have one VMM take all the U.2s on each sled,
```
```rust
impl ZpoolGetForSledReservationResult {
    /// Does this Zpool have room for additional bytes to be allocated to it?
    pub fn has_room_for_allocation(&self, additional_size: i64) -> bool {
```
factoring this out is nice, good call!
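The diff quoted above only shows the signature of `has_room_for_allocation`. As a minimal sketch of what such a helper might look like, here is one possible shape; the fields (`total_size`, `size_used`, `reserved`) are assumptions for the example, not the actual `ZpoolGetForSledReservationResult` fields:

```rust
// Illustrative sketch only: the real ZpoolGetForSledReservationResult has
// other fields; total_size, size_used, and reserved are assumed here.
pub struct ZpoolGetForSledReservationResult {
    /// Total capacity of the zpool, in bytes (assumed field).
    pub total_size: i64,
    /// Bytes already allocated out of this zpool (assumed field).
    pub size_used: i64,
    /// Bytes held back from allocation (assumed field).
    pub reserved: i64,
}

impl ZpoolGetForSledReservationResult {
    /// Does this Zpool have room for additional bytes to be allocated to it?
    pub fn has_room_for_allocation(&self, additional_size: i64) -> bool {
        // Room exists if current usage plus the reservation buffer plus the
        // requested bytes still fits within the pool's total size.
        self.size_used + self.reserved + additional_size <= self.total_size
    }
}
```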
```rust
// An incomplete allocation list has a set of local storage
// allocations that were matched to zpools with available space:
//
// | A -> Z | A -> Z | A -> Z |
```
this could maybe explain what A -> Z means... using context clues I have inferred that it's "allocation to zpool" and not "alphabetical order", which would be a little nonsensical
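A tiny sketch of the "allocation to zpool" reading, with all type names made up purely for illustration:

```rust
/// One requested local storage allocation (the "A"); hypothetical type.
struct AllocationRequest {
    bytes: i64,
}

/// The zpool it was matched to (the "Z"); hypothetical type.
struct ZpoolId(u32);

/// An incomplete allocation list, i.e. "| A -> Z | A -> Z | A -> Z |":
/// a sequence of (requested allocation, chosen zpool) pairings.
type IncompleteAllocationList = Vec<(AllocationRequest, ZpoolId)>;
```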
smklein reviewed on May 1, 2026

hawkw approved these changes on May 1, 2026

iliana pushed a commit that referenced this pull request on May 5, 2026