Skip to content

Any mutation should be conditional!#9668

Merged
jmpesp merged 6 commits into
oxidecomputer:mainfrom
jmpesp:fix_sled_reservatation_cte
Jan 17, 2026
Merged

Any mutation should be conditional!#9668
jmpesp merged 6 commits into
oxidecomputer:mainfrom
jmpesp:fix_sled_reservatation_cte

Conversation

@jmpesp
Copy link
Copy Markdown
Contributor

@jmpesp jmpesp commented Jan 16, 2026

Any INSERT or UPDATE done by the sled reservation CTE should be conditional on the computed INSERT_VALID result. Otherwise inserting the sled_resources_vmm record (aka successfully allocating a VMM) could fail but the local storage allocation would be successful anyway.

This came up when testing starting many instances, all with local storage disks, in parallel: set the number of instances too high, and even if all the local storage allocations would fit, there aren't enough CPUs to satisfy all instance resource requests. Deleting all the instances and disks showed that there were some "orphaned" local storage allocations, which were not cleaned up and could never be. This case was turned into the local_storage_allocation_fail_due_to_vmm_resources test.

Any INSERT or UPDATE done by the sled reservation CTE should be
conditional on the computed `INSERT_VALID` result. Otherwise inserting
the `sled_resources_vmm` record (aka successfully allocating a VMM)
could fail but the local storage allocation would be successful anyway.

This came up when testing starting _many_ instances, all with local
storage disks, in parallel: set the number of instances too high, and
even if all the local storage allocations would fit, there aren't enough
CPUs to satisfy all instance resource requests. Deleting all the
instances and disks showed that there were some "orphaned" local storage
allocations, which were not cleaned up and could never be. This case was
turned into the `local_storage_allocation_fail_due_to_vmm_resources`
test.
@jmpesp jmpesp requested a review from smklein January 16, 2026 16:29
@jmpesp jmpesp added this to the 18 milestone Jan 16, 2026
Comment thread nexus/db-queries/src/db/datastore/sled.rs Outdated
let datastore = db.datastore().clone();
let opctx =
OpContext::for_tests(logctx.log.clone(), datastore.clone());
let instance = Instance::new_with_id(config.instances[0].id);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let instance = Instance::new_with_id(config.instances[0].id);
let instance_id = config.instances[0].id;

I don't think we need to construct the whole instance here; we just want the ID.

Same below within jh2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, done in e75afc5

Comment on lines +6276 to +6282
(Err(_), Err(_)) => {
panic!("both didn't work!");
}

(Ok(_), Ok(_)) => {
panic!("both worked!");
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails, it's gonna be confusing to interpret this.

Maybe:

  • "Allocated zero instance reservations with reduced capacity - expected at least one to work"
  • "Allocated two instance reservations with reduced capacity - only one should have succeeded"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed in cb50102, but kept the messages short

};

for rendezvous_dataset in &rendezvous_datasets {
let expected_size: i64 = allocation_records
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm - this only considers local storage. Would this accounting be wrong in a test where we allocate both local storage and distributed disks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it's ok this is only checking the rendezvous_local_storage_dataset table's size_used. Allocating distribute disks affects crucible_dataset table's size_used.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the function should be renamed to validate_computed_local_storage_size_used or something? Not a huge deal.

@jmpesp jmpesp enabled auto-merge (squash) January 16, 2026 18:55
@jmpesp jmpesp merged commit c6e516c into oxidecomputer:main Jan 17, 2026
16 checks passed
@jmpesp jmpesp deleted the fix_sled_reservatation_cte branch January 19, 2026 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants