Any mutation should be conditional! #9668
Any INSERT or UPDATE done by the sled reservation CTE should be conditional on the computed `INSERT_VALID` result. Otherwise inserting the `sled_resources_vmm` record (aka successfully allocating a VMM) could fail but the local storage allocation would be successful anyway. This came up when testing starting _many_ instances, all with local storage disks, in parallel: set the number of instances too high, and even if all the local storage allocations would fit, there aren't enough CPUs to satisfy all instance resource requests. Deleting all the instances and disks showed that there were some "orphaned" local storage allocations, which were not cleaned up and could never be. This case was turned into the `local_storage_allocation_fail_due_to_vmm_resources` test.
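The all-or-nothing rule described above can be sketched in plain Rust. This is a minimal in-memory model, not the actual CTE: `SledState`, `Request`, `ReservationError`, and `reserve` are hypothetical names invented for illustration, and the real query computes `INSERT_VALID` in SQL. The point is only that validity is computed once, across every resource, before any record is written.

```rust
// Hypothetical in-memory stand-in for the tables the reservation touches.
// None of these types come from the PR itself.
#[derive(Debug, Default, PartialEq)]
struct SledState {
    cpus_free: u32,
    local_storage_free: u64,
    // Stand-in for `sled_resources_vmm` rows (CPU count per VMM).
    vmm_records: Vec<u32>,
    // Stand-in for local storage allocation rows (bytes per allocation).
    local_storage_allocations: Vec<u64>,
}

// Hypothetical resource request for one instance.
struct Request {
    cpus: u32,
    local_storage: u64,
}

#[derive(Debug, PartialEq)]
enum ReservationError {
    InsufficientCpus,
    InsufficientLocalStorage,
}

// Validate every resource first; mutate only if the whole request is valid.
// An early return leaves `state` completely untouched, so a failed VMM
// reservation cannot leave an orphaned local storage allocation behind.
fn reserve(state: &mut SledState, req: &Request) -> Result<(), ReservationError> {
    if req.cpus > state.cpus_free {
        return Err(ReservationError::InsufficientCpus);
    }
    if req.local_storage > state.local_storage_free {
        return Err(ReservationError::InsufficientLocalStorage);
    }

    // Every mutation happens below the validity checks.
    state.cpus_free -= req.cpus;
    state.local_storage_free -= req.local_storage;
    state.vmm_records.push(req.cpus);
    state.local_storage_allocations.push(req.local_storage);
    Ok(())
}
```

In this model, a request that fits in local storage but exceeds the free CPUs returns `InsufficientCpus` and records no storage allocation, which is the behavior the new test is described as checking.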
    let datastore = db.datastore().clone();
    let opctx =
        OpContext::for_tests(logctx.log.clone(), datastore.clone());
    let instance = Instance::new_with_id(config.instances[0].id);
    - let instance = Instance::new_with_id(config.instances[0].id);
    + let instance_id = config.instances[0].id;
I don't think we need to construct the whole instance here; we just want the ID.
Same below, within `jh2`.
    (Err(_), Err(_)) => {
        panic!("both didn't work!");
    }

    (Ok(_), Ok(_)) => {
        panic!("both worked!");
    }
If this fails, the panic message is going to be confusing to interpret.
Maybe:
- "Allocated zero instance reservations with reduced capacity - expected at least one to work"
- "Allocated two instance reservations with reduced capacity - only one should have succeeded"
changed in cb50102, but kept the messages short
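For illustration, the outcome check discussed in this thread could look roughly like the sketch below. The helper name `check_exactly_one_succeeded` is hypothetical, and the messages are adapted from the reviewer's suggestions; the wording actually committed in cb50102 is not shown here.

```rust
// Hypothetical helper: given the results of two racing reservation attempts
// made against reduced capacity, report whether exactly one succeeded.
// On failure, the message says which of the two unexpected outcomes occurred.
fn check_exactly_one_succeeded<T, E>(
    first: &Result<T, E>,
    second: &Result<T, E>,
) -> Result<(), &'static str> {
    match (first, second) {
        // Expected: one attempt wins and the other is rejected.
        (Ok(_), Err(_)) | (Err(_), Ok(_)) => Ok(()),
        (Err(_), Err(_)) => Err(
            "allocated zero instance reservations with reduced capacity; \
             expected one to succeed",
        ),
        (Ok(_), Ok(_)) => Err(
            "allocated two instance reservations with reduced capacity; \
             only one should have succeeded",
        ),
    }
}
```

A test could then call `check_exactly_one_succeeded(&r1, &r2).unwrap()`, so a failure names the unexpected outcome directly.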
    };

    for rendezvous_dataset in &rendezvous_datasets {
        let expected_size: i64 = allocation_records
To confirm - this only considers local storage. Would this accounting be wrong in a test where we allocate both local storage and distributed disks?
In this case it's OK that this only checks the `rendezvous_local_storage_dataset` table's `size_used`. Allocating distributed disks affects the `crucible_dataset` table's `size_used` instead.
Maybe the function should be renamed to `validate_computed_local_storage_size_used` or something? Not a huge deal.
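As a rough sketch of the accounting under discussion: the expected `size_used` for one local storage dataset is the sum of the allocation records that point at it. The hunk quoted above is cut off after `allocation_records`, so the struct, its fields, and the function below are assumptions made for illustration, not the test's actual code.

```rust
// Hypothetical stand-in for a local storage allocation row.
struct AllocationRecord {
    dataset_id: u32,
    size: i64,
}

// Sum the sizes of all allocations that live on the given dataset.
// Only local storage allocations are considered here; distributed disks are
// accounted for in a different table, as noted in the thread above.
fn expected_local_storage_size_used(
    dataset_id: u32,
    allocation_records: &[AllocationRecord],
) -> i64 {
    allocation_records
        .iter()
        .filter(|record| record.dataset_id == dataset_id)
        .map(|record| record.size)
        .sum()
}
```

A validation pass would compare this computed value against the dataset's stored `size_used` and fail if the two differ.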
Co-authored-by: Sean Klein <seanmarionklein@gmail.com>