
Prune potential sled reservations that cannot work #10353

Merged
jmpesp merged 4 commits into oxidecomputer:main from jmpesp:sled_reservation_stuck on May 1, 2026

Conversation

@jmpesp jmpesp (Contributor) commented May 1, 2026

Allocating local storage for disks currently occurs during sled reservation: the decision to place a VMM on a physical sled has to take into account the storage available for disks backed by local storage.

This is done by taking a snapshot of information about a sled's zpools, figuring out an assignment of the requested local storage to those zpools (subject to constraints such as available space), and attempting the sled reservation insert query, which checks that all constraints are met before inserting both the VMM record and the local storage allocations.

For sled reservations where local storage allocations are required, an iterator is created for each sled target; Nexus uses it to try every possible mapping of the requested local storage onto the zpools available on that sled target, as sketched below. Once all combinations have been exhausted, Nexus tries the next potential sled target, or returns that the sled reservation cannot succeed.
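
As a rough illustration of the shape of that search (a sketch only: `LocalStorageRequest`, `ZpoolSnapshot`, and `candidate_mappings` are hypothetical names, not the actual omicron types):

```rust
// Sketch only: these types and names are hypothetical stand-ins for
// the actual omicron code.
#[derive(Clone)]
struct LocalStorageRequest {
    size: i64,
}

#[derive(Clone)]
struct ZpoolSnapshot {
    free_bytes: i64,
}

/// Enumerate every assignment of requests to zpools (as indices into
/// `zpools`) in which each zpool has room for everything assigned to it.
fn candidate_mappings(
    requests: &[LocalStorageRequest],
    zpools: &[ZpoolSnapshot],
) -> Vec<Vec<usize>> {
    fn recurse(
        requests: &[LocalStorageRequest],
        zpools: &[ZpoolSnapshot],
        used: &mut [i64],
        current: &mut Vec<usize>,
        results: &mut Vec<Vec<usize>>,
    ) {
        if current.len() == requests.len() {
            results.push(current.clone());
            return;
        }
        let request = &requests[current.len()];
        for i in 0..zpools.len() {
            // Only consider zpools that still have room according to
            // the snapshot this search was started with.
            if used[i] + request.size <= zpools[i].free_bytes {
                used[i] += request.size;
                current.push(i);
                recurse(requests, zpools, used, current, results);
                current.pop();
                used[i] -= request.size;
            }
        }
    }

    let mut results = Vec::new();
    recurse(
        requests,
        zpools,
        &mut vec![0; zpools.len()],
        &mut Vec::new(),
        &mut results,
    );
    results
}
```

With R requests and Z zpools the search space is on the order of Z^R mappings, which is why exhausting it against stale data can take long enough to look like a hang.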

In the uncontended case, the first attempted sled reservation insert succeeds. In the contended case, however, another sled reservation may have succeeded in the meantime, allocating local storage and invalidating the snapshot the iterator is using to produce potential allocation mappings. Prior to this commit, if no possible allocation mapping could work (for example, if no zpool had any space left), Nexus would try them all anyway: the iterator was working with out-of-date information. The result was `instance-start` sagas that looked stuck but were really just taking a very long time to explore every permutation.

This commit adds a pruning step: on each iteration, Nexus requests a fresh snapshot of the information about a sled's zpools and uses it to prune the search space of remaining combinations. A test (`local_storage_allocation_full_rack_concurrent`) was also added that successfully reproduced the "stuck" scenario.
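
The pruning step amounts to re-checking the remaining candidates against fresh data. A minimal sketch, reusing the hypothetical types from the sketch above (again, not the actual omicron implementation):

```rust
/// Sketch only: drop any remaining mapping that a fresh zpool snapshot
/// says can no longer fit.
fn prune_mappings(
    requests: &[LocalStorageRequest],
    mappings: Vec<Vec<usize>>,
    fresh_zpools: &[ZpoolSnapshot],
) -> Vec<Vec<usize>> {
    mappings
        .into_iter()
        .filter(|mapping| {
            // Recompute what each zpool would receive under this mapping
            // and reject it as soon as any zpool would be over capacity.
            let mut used = vec![0i64; fresh_zpools.len()];
            mapping.iter().zip(requests).all(|(&zpool, request)| {
                used[zpool] += request.size;
                used[zpool] <= fresh_zpools[zpool].free_bytes
            })
        })
        .collect()
}
```

When contention has consumed all the space, the filter empties the candidate list immediately and the search moves on to the next sled target instead of grinding through doomed permutations.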

@jmpesp jmpesp requested a review from smklein May 1, 2026 00:49
@hawkw hawkw (Member) left a comment

some nitpicks, but otherwise, nice fix!

Comment thread on nexus/db-queries/src/db/datastore/sled.rs (outdated), lines +1292 to +1300:
let allocations = match complete_allocation_lists.next() {
    Some(allocations) => allocations,

    None => {
        // All done searching, nothing worked. Try another
        // sled!
        break;
    }
};
@hawkw (Member):

Suggested change:

-    let allocations = match complete_allocation_lists.next() {
-        Some(allocations) => allocations,
-        None => {
-            // All done searching, nothing worked. Try another
-            // sled!
-            break;
-        }
-    };
+    let Some(allocations) = complete_allocation_lists.next() else {
+        // All done searching, nothing worked. Try another
+        // sled!
+        break;
+    };

Comment thread

    logctx.cleanup_successful();
}

/// Ensure that a full rack can have one VMM take all the U2s on each sled,
@hawkw (Member):

Suggested change:

-/// Ensure that a full rack can have one VMM take all the U2s on each sled,
+/// Ensure that a full rack can have one VMM take all the U.2s on each sled,


Comment thread

impl ZpoolGetForSledReservationResult {
    /// Does this Zpool have room for additional bytes to be allocated to it?
    pub fn has_room_for_allocation(&self, additional_size: i64) -> bool {
@hawkw (Member):

factoring this out is nice, good call!
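
For context, the excerpt above cuts off before the function body. A minimal sketch of what such a check could look like, with `size_used` and `total_size` as assumed field names (the actual fields of `ZpoolGetForSledReservationResult` are not shown in this excerpt):

```rust
// Hedged sketch only: `size_used` and `total_size` are assumptions,
// not the actual ZpoolGetForSledReservationResult fields.
pub struct ZpoolGetForSledReservationResult {
    pub size_used: i64,
    pub total_size: i64,
}

impl ZpoolGetForSledReservationResult {
    /// Does this Zpool have room for additional bytes to be allocated to it?
    pub fn has_room_for_allocation(&self, additional_size: i64) -> bool {
        self.size_used + additional_size <= self.total_size
    }
}
```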

Comment thread

// An incomplete allocation list has a set of local storage
// allocations that were matched to zpools with available space:
//
// | A -> Z | A -> Z | A -> Z |
@hawkw (Member):

this could maybe explain what A -> Z means... using context clues I have inferred that it's "allocation to zpool" and not "alphabetical order", which would be a little nonsensical

@jmpesp (Contributor, Author):

changed in 90f44ae, lmk

Four more comment threads on nexus/db-queries/src/db/datastore/sled.rs (outdated).
@askfongjojo askfongjojo added this to the 19 milestone May 1, 2026
@smklein smklein (Collaborator) left a comment

Thanks for the fixes!

@jmpesp jmpesp enabled auto-merge (squash) May 1, 2026 16:36
@jmpesp jmpesp merged commit 2647983 into oxidecomputer:main May 1, 2026
16 checks passed
iliana pushed a commit that referenced this pull request May 5, 2026