@jmpesp
Contributor

@jmpesp jmpesp commented Apr 3, 2025

Recent customer issues have highlighted problems with storage accounting, namely that while there are quotas and reservations for individual Crucible regions, there's nothing set for the whole Crucible dataset. Crucible could end up using the whole disk, or some large fraction of it, such that other users of the same U2 could be starved out.

This commit adds a buffer to each zpool that the Crucible region allocation query will not allocate into. This overhead will be set to 250G initially (see #7875 for reasoning) but could also be modified with omdb.

Part of this commit's changes includes using a CTE with regions_hard_delete, which is much more efficient than the previous for loop but has the effect of overwriting size_used for all datasets. That undoes any case where this column value was manually set to prevent allocation for particular datasets / pools. Because of this, this commit also adds a no_provision flag for a Crucible dataset: if it is set, then the region allocation query will not allocate into that dataset. This flag can be toggled with omdb.
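
As a rough illustration of the policy described above (the real check lives in the region allocation SQL query, not in Rust), the sketch below treats a dataset as eligible only when no_provision is unset and the new region still fits within the pool size minus the buffer. The `DatasetView` struct, the `can_allocate` function, and the inclusive `<=` comparison are assumptions made for illustration, not code from this PR.

```rust
/// Hypothetical, simplified view of a Crucible dataset joined with its zpool.
/// All sizes are in bytes.
#[derive(Debug, Clone, Copy)]
struct DatasetView {
    size_used: u64,
    no_provision: bool,
    pool_total_size: u64,
    control_plane_storage_buffer: u64,
}

/// Returns true if a region of `requested` bytes may be placed on this dataset.
fn can_allocate(dataset: &DatasetView, requested: u64) -> bool {
    // A dataset flagged no_provision is never a candidate.
    if dataset.no_provision {
        return false;
    }
    // Space the allocator may use: everything outside the buffer.
    let usable = dataset
        .pool_total_size
        .saturating_sub(dataset.control_plane_storage_buffer);
    // Reject on arithmetic overflow rather than wrapping.
    match dataset.size_used.checked_add(requested) {
        Some(total) => total <= usable,
        None => false,
    }
}
```

Because this is a policy check on Nexus's own accounting, it says nothing about what other consumers of the same pool actually write to disk.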

Part of the upgrade to R14 will include a support procedure for the case where adding the 250G control plane storage buffer causes a Crucible dataset to be "overprovisioned", which requires manually requesting region replacements to reduce the size allocated for that dataset. This commit adds an omdb command to show all overprovisioned Crucible datasets, and changes the region listing command so it can list regions for a particular dataset.

Fixes #3480

@leftwo
Contributor

leftwo commented Apr 3, 2025

Recent customer issues have highlighted problems with storage accounting, namely that while there are quotas and reservations for individual Crucible regions, there's nothing set for the whole Crucible dataset. Crucible could end up using the whole disk, or some large fraction of it, such that other users of the same U2 could be starved out.

This is using the existing zpool size numbers, and then taking 250G off of that, right?
And this limitation is a policy one up in Nexus itself, not a quota on the Crucible dataset?

@leftwo
Contributor

leftwo commented Apr 3, 2025

... but has the effect of overwriting size_used for all datasets, which will undo any time this column value was manually set to prevent allocation for particular datasets / pools. Because of this, this commit also adds a no_provision flag for a Crucible dataset: if it is set, then the region allocation query will not allocate into that dataset. This flag can be toggled with omdb.

If we had manually set the size_used field as a mechanism to prevent allocations, we would have to record that before applying these changes, then come back and toggle the no_provision flag, right?

@jmpesp
Contributor Author

jmpesp commented Apr 3, 2025

Recent customer issues have highlighted problems with storage accounting, namely that while there are quotas and reservations for individual Crucible regions, there's nothing set for the whole Crucible dataset. Crucible could end up using the whole disk, or some large fraction of it, such that other users of the same U2 could be starved out.

This is using the existing zpool size numbers, and then taking 250G off of that, right? And this limitation is a policy one up in Nexus itself, not a quota on the Crucible dataset?

Yes, and yes :)

@jmpesp
Contributor Author

jmpesp commented Apr 3, 2025

... but has the effect of overwriting size_used for all datasets, which will undo any time this column value was manually set to prevent allocation for particular datasets / pools. Because of this, this commit also adds a no_provision flag for a Crucible dataset: if it is set, then the region allocation query will not allocate into that dataset. This flag can be toggled with omdb.

If we had manually set the size_used field as a mechanism to prevent allocations, we would have to record that before applying these changes, then come back and toggle the no_provision flag, right?

Correct
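
To make the exchange above concrete, here is a minimal sketch of why a manually inflated size_used does not survive a bulk recomputation. It assumes size_used is rebuilt as the sum of the sizes of the regions still present on each dataset; that assumption, the `Region` struct, and the `recompute_size_used` function are illustrative only and are not taken from the actual CTE.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a region row: which dataset it lives on and
/// how many bytes it reserves there.
struct Region {
    dataset_id: u32,
    reserved_bytes: u64,
}

/// Rebuild size_used for every listed dataset from the regions alone.
/// Any value previously stored for a dataset is ignored, which is why a
/// hand-set size_used would be lost.
fn recompute_size_used(dataset_ids: &[u32], regions: &[Region]) -> HashMap<u32, u64> {
    // Start every dataset at zero so datasets with no regions are included.
    let mut used: HashMap<u32, u64> = dataset_ids.iter().map(|id| (*id, 0)).collect();
    for region in regions {
        if let Some(total) = used.get_mut(&region.dataset_id) {
            *total += region.reserved_bytes;
        }
    }
    used
}
```

Under that model, the only durable way to keep a dataset out of allocation is a separate marker such as the no_provision flag.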

@leftwo
Contributor

leftwo commented Apr 3, 2025

This will prevent Crucible from using up space (which is a good thing), and I know there's urgency for R14 too.

This won't prevent some other service from using up all the space in the pool, even if we keep Crucible in check.
So whatever eventual solution we come up with to prevent full disks will need to include preventing the other services from taking more than their fair share of the pool, is that correct?

@smklein
Collaborator

smklein commented Apr 3, 2025

This won't prevent some other service from using up all the space in the pool, even if we keep Crucible in check. So whatever eventual solution we come up with to prevent full disks will need to include preventing the other services from taking more than their fair share of the pool, is that correct?

This is true, but we planned to punt on quotas and reservations because assigning them can fail for deployed racks, and we need a fall-back plan. The "failure mode" introduced in this PR is "we don't overprovision further", which is a good thing, and won't need support staff to remediate.

@smklein
Collaborator

smklein commented Apr 3, 2025

But anyway, yes, definitely do want quotas / reservations to help limit this issue from crossing abstraction boundaries

@leftwo
Contributor

leftwo commented Apr 3, 2025

But anyway, yes, definitely do want quotas / reservations to help limit this issue from crossing abstraction boundaries

Just wanted to be sure we don't forget to do the additional work, and be sure the casual reader did not think that all the problems are now solved. Things will be better with this, and will continue getting better :)

Collaborator

@smklein smklein left a comment

Looks good, nice job getting the non-provisionable + buffer merged together. And thanks for the tests!

@morlandi7 morlandi7 added this to the 14 milestone Apr 3, 2025
Ok(())
}

async fn cmd_crucible_dataset_show_overprovisioned(
Collaborator

@smklein smklein Apr 4, 2025

This is an omdb command, so my bar for testing there is lower than it would be otherwise, but have you tested this API? (Even manually)

(I'm bringing this scrutiny because this command seems really useful, actually, and frankly like something we might want to pull into Nexus in the future)

Contributor

I tested it by filling up all the space I could, then increasing the storage_buffer:

EVT22200005 # omdb-7912 db crucible-dataset show-overprovisioned 2> /dev/null                                                                                                                                                                               
ID                                   |SIZE_USED     |NO_PROVISION |POOL_ID                              |CONTROL_PLANE_STORAGE_BUFFER |POOL_TOTAL_SIZE 
-------------------------------------+--------------+-------------+-------------------------------------+-----------------------------+----------------
12e4105b-3dde-40ff-9f12-5ce5d57f8b4f |2925946470400 |false        |02d72adc-f403-4eef-bede-3a2a860a22a3 |375809638400                 |3195455668224   
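
The row above can be checked by hand. The buffer of 375809638400 bytes is 350 GiB, so the space outside the buffer is 3195455668224 - 375809638400 = 2819646029824 bytes, and SIZE_USED exceeds that by 106300440576 bytes (99 GiB). The sketch below encodes that arithmetic. It assumes "overprovisioned" means SIZE_USED is strictly greater than POOL_TOTAL_SIZE minus CONTROL_PLANE_STORAGE_BUFFER, which is inferred from this output rather than taken from the omdb source, and `overprovisioned_by` is a hypothetical helper name.

```rust
/// Bytes by which `size_used` exceeds the space outside the buffer, or
/// None if the dataset is not overprovisioned under the assumed definition.
fn overprovisioned_by(size_used: u64, pool_total_size: u64, buffer: u64) -> Option<u64> {
    // Space the allocator is allowed to use on this pool.
    let usable = pool_total_size.saturating_sub(buffer);
    if size_used > usable {
        Some(size_used - usable)
    } else {
        None
    }
}
```

Called with the values from the row, `overprovisioned_by(2925946470400, 3195455668224, 375809638400)` yields `Some(106300440576)`.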

Contributor

@leftwo leftwo left a comment

Still looking, but I wanted these comments posted now.

@leftwo
Contributor

leftwo commented Apr 4, 2025

Could we have an omdb command somewhere in here that does the TOTAL_SIZE - CONTROL_PLANE_STORAGE_BUFFER math for me, so I can see how much each dataset has left?

Contributor

@leftwo leftwo left a comment

Just a few questions and comment comments

@jmpesp
Contributor Author

jmpesp commented Apr 7, 2025

Could we have an omdb command somewhere in here that does the TOTAL_SIZE - CONTROL_PLANE_STORAGE_BUFFER math for me, so I can see how much each dataset has left?

Done in 5de0a56

Contributor

@leftwo leftwo left a comment

Here is the omdb with usage output:

EVT22200005 # omdb-7912 db crucible-dataset list 2> /dev/null
ID                                   |TIME_DELETED |POOL_ID                              |ADDRESS                        |SIZE_USED    |NO_PROVISION |CONTROL_PLANE_STORAGE_BUFFER |POOL_TOTAL_SIZE |SIZE_LEFT     
-------------------------------------+-------------+-------------------------------------+-------------------------------+-------------+-------------+-----------------------------+----------------+--------------
12e4105b-3dde-40ff-9f12-5ce5d57f8b4f |             |02d72adc-f403-4eef-bede-3a2a860a22a3 |[fd00:1122:3344:101::12]:32345 |132875550720 |false        |375809638400                 |3195455668224   |2686770479104 
1e9c36ea-9b9e-4762-b729-01124dc3d56c |             |83cb6813-d89a-4996-bae8-43609c63b1dd |[fd00:1122:3344:101::13]:32345 |71135395840  |false        |268435456000                 |3195455668224   |2855884816384 
238884a7-d934-4325-8d38-d6f5be067b3e |             |fd1796f2-f671-4d60-99dc-e1194024bebb |[fd00:1122:3344:101::18]:32345 |230854492160 |false        |268435456000                 |3195455668224   |2696165720064 
5b1defd7-249c-452b-9d4c-f3464195c659 |             |571bc087-1c0a-4378-ab37-7a42633b05ae |[fd00:1122:3344:101::16]:32345 |175825223680 |false        |268435456000                 |3195455668224   |2751194988544 
8cf5a521-a139-4596-bd3f-079c544381f1 |             |0fd4c7d7-2547-4769-a55f-54c1284f24e4 |[fd00:1122:3344:101::15]:32345 |71135395840  |false        |268435456000                 |3195455668224   |2855884816384 
9d233755-b283-4b00-9f6c-1ee6492ad4d7 |             |0c4db663-9601-45f3-b6b3-bde209e8f7d7 |[fd00:1122:3344:101::17]:32345 |147639500800 |false        |268435456000                 |3195455668224   |2779380711424 
daed3ed0-74ae-4acc-94b5-b8e4d26a97f9 |             |fb3ea440-ff27-43c4-bf37-a3f1d60ae265 |[fd00:1122:3344:101::14]:32345 |49660559360  |false        |268435456000                 |3195455668224   |2877359652864 
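
The SIZE_LEFT values in this listing are consistent with POOL_TOTAL_SIZE - CONTROL_PLANE_STORAGE_BUFFER - SIZE_USED, and the 268435456000-byte buffer on most rows is exactly 250 GiB. A small sketch of that calculation follows. `size_left` is a hypothetical helper, and returning `None` for an overprovisioned dataset is an assumption, since the output above does not show how omdb renders that case.

```rust
/// One gibibyte in bytes.
const GIB: u64 = 1 << 30;

/// Space still available for region allocation on a dataset, in bytes.
/// Returns None when the buffer plus size_used exceeds the pool size.
fn size_left(pool_total_size: u64, buffer: u64, size_used: u64) -> Option<u64> {
    pool_total_size.checked_sub(buffer)?.checked_sub(size_used)
}
```

For the second row, `size_left(3195455668224, 250 * GIB, 71135395840)` gives `Some(2855884816384)`, matching the SIZE_LEFT column.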

@jmpesp jmpesp enabled auto-merge (squash) April 8, 2025 01:05
@jmpesp jmpesp merged commit c496683 into oxidecomputer:main Apr 9, 2025
16 checks passed
@jmpesp jmpesp deleted the do_not_provision_flag_for_dataset branch April 9, 2025 18:38

Development

Successfully merging this pull request may close these issues.

Want ability to take a crucible dataset out of provisioning pool
