
region allocation can consistently fail when there is one non-functional crucible #3416

Closed
askfongjojo opened this issue Jun 24, 2023 · 1 comment · Fixed by #3650

askfongjojo commented Jun 24, 2023

Splitting this issue out from #2483, which is about taking a non-functional sled out of the provisioning pool, since there are other failure modes that can result in a non-functional crucible, as seen in oxidecomputer/crucible#818.

The current logic picks the 3 least-used datasets. The bad dataset is going to be returned more often (or, in the issue mentioned above, always, even though there are 50+ other datasets that have 0 bytes used). We'll need a way to at least randomize the query results a bit more before the control plane can perform crucible-level health checks.

I am marking this for FCS because we have run into this issue several times already since rack2 became operational.

askfongjojo added this to the FCS milestone Jun 24, 2023
morlandi7 added the "known issue (To include in customer documentation and training)" label Jun 27, 2023
morlandi7 commented:

Known issue: even with randomization, you may hit the bad dataset and should retry.

faithanalog self-assigned this Jun 29, 2023
faithanalog added a commit that referenced this issue Jul 16, 2023
Currently Nexus allocates Crucible regions on the least-used datasets.
This leads to repeated failures (see #3416). This change introduces
the concept of region allocation strategies at the database layer.
Currently only one strategy is implemented: the random strategy. However,
this could be expanded.

The random strategy picks 3 distinct datasets from zpools with enough
space to hold a copy of the region being allocated. Datasets are
shuffled using the md5 of a seed value appended to the dataset UUID. The
seed can be specified to get a deterministic allocation, mainly for
test purposes, but in production it is simply the current time in
nanoseconds. Because the md5 function has a uniformly random output
distribution, sorting on it provides a random shuffling of the
datasets, while allowing more control than simply using `RANDOM()`.
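
A minimal sketch of the idea in SQL follows; the table and column names
and the space predicate are hypothetical, not the actual Nexus schema,
and `$seed`/`$size` stand in for bound parameters.

```sql
-- Hypothetical schema: one row per crucible dataset, joined to its
-- zpool for capacity accounting.
SELECT dataset.id
FROM dataset
JOIN zpool ON zpool.id = dataset.pool_id
-- Only consider datasets whose zpool can hold another copy of the region.
WHERE dataset.size_used + $size <= zpool.total_size
-- Hashing the UUID together with the seed yields a stable, seedable
-- shuffle: the same seed always produces the same ordering.
ORDER BY md5(CAST(dataset.id AS TEXT) || $seed)
LIMIT 3;
```

Sorting on the hash rather than `RANDOM()` is what makes deterministic
test runs possible: replaying a seed replays the shuffle.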

Due to SQL limitations, to the best of my knowledge, we cannot make the
selection both random and distinct on the zpool or sled IDs; therefore,
an allocation may select multiple datasets on the same sled or zpool.
However, we can detect that this has occurred and fail the query. I have
included some code which will retry a few times if two datasets on the
same zpool are selected, as a future shuffling is likely to result in a
good selection. Note that in production we currently never have two
Crucible datasets on the same zpool, but it was raised as a future
possibility.
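
One way to express the detect-and-fail step, again as a hypothetical
sketch building on the query above rather than the real Nexus code:

```sql
-- Shuffle candidates as before, keeping the zpool ID for the check.
WITH shuffled AS (
  SELECT dataset.id, dataset.pool_id
  FROM dataset
  JOIN zpool ON zpool.id = dataset.pool_id
  WHERE dataset.size_used + $size <= zpool.total_size
  ORDER BY md5(CAST(dataset.id AS TEXT) || $seed)
  LIMIT 3
)
-- Return the selection only if it spans three distinct zpools; an
-- empty result tells the caller to retry with a fresh seed.
SELECT * FROM shuffled
WHERE (SELECT COUNT(DISTINCT pool_id) FROM shuffled) = 3;
```

An empty result set is the failure signal, so the retry loop never
silently places two copies of a region on one zpool.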

I have not done anything to ensure we are spread across 3 sleds. We
could attempt to implement that using the same retry approach, but it
feels a bit hacky to me as-is, and more so if we were to rely on it
for that. Regardless of how we solve the problem of distributing across
3 distinct sleds, we should plumb the allocation strategy through more
parts of Nexus when moving to a 3-sled policy, so that we can relax it
to a 1-sled requirement for development/testing.

Testing whether the allocation distribution is truly uniform is
difficult to do in a reproducible manner in CI. I made some attempts at
statistical analysis, but to get a fully deterministic region
allocation we would need to allocate all the dataset UUIDs
deterministically, which would require pulling in a direct dependency on
the chacha crate and hooking that up. Doing analysis on anything
other than perfectly deterministic data will eventually produce false
failures given enough CI runs; that's just the nature of measuring
whether the data is random. Additionally, a simple chi-squared analysis
isn't quite appropriate here, since the 3 dataset selections for a single
region impact each other due to the requirement for distinct zpools.

In any case, I ran 3 sets of 3000 region allocations, each resulting in
9000 dataset selections across 27 datasets. I got these distributions,
counting how many times each dataset was selected.

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.
faithanalog linked a pull request Jul 16, 2023 that will close this issue
faithanalog added a commit that referenced this issue Jul 17, 2023
Currently Nexus allocates Crucible regions on the least-used datasets.
This leads to repeated failures (see #3416). This change introduces
the concept of region allocation strategies at the database layer. It
replaces the previous approach of allocating on the least-used datasets
with a "random" strategy that selects randomly from datasets with enough
capacity for the requested region. We can expand this to support
multiple configurable allocation strategies.

The random strategy picks 3 distinct datasets from zpools with enough
space to hold a copy of the region being allocated. Datasets are
shuffled using the md5 of a number appended to the dataset UUID. This
number can be specified as part of the allocation strategy to get a
deterministic allocation, mainly for test purposes. When unspecified,
as in production, it is simply the current time in nanoseconds. Because
the md5 function has a uniformly random output distribution, sorting
on it provides a random shuffling of the datasets, while allowing more
control than simply using `RANDOM()`.

At present, allocation selects 3 distinct datasets from zpools that
have enough space for the region. Since there is currently only one
crucible dataset per zpool, this selects 3 distinct zpools. If a future
change to the rack adds additional crucible datasets to zpools, the
code may select multiple datasets on the same zpool; however, it will
detect this and produce an error instead of performing the allocation.
In a future change we will improve the allocation strategy to pick from
3 distinct sleds, eliminating this problem in the process, but that is
not part of this commit.

We will plumb the allocation strategy through more parts of 
Nexus when moving to a 3-sled policy so that we can relax it to
a 1-sled requirement for development/testing.

Testing whether the allocation distribution is truly uniform is
difficult to do in a reproducible manner in CI. I made some attempts at
statistical analysis, but to get a fully deterministic region
allocation we would need to allocate all the dataset UUIDs
deterministically, which would require pulling in a direct dependency on
the chacha crate and hooking that up. Doing analysis on anything
other than perfectly deterministic data will eventually produce false
failures given enough CI runs; that's just the nature of measuring
whether the data is random. Additionally, a simple chi-squared analysis
isn't quite appropriate here: the 3 dataset selections for a single
region are dependent on each other, because each dataset can only be
chosen once.
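
For a rough sense of scale only (a back-of-the-envelope check, not the
careful analysis described above), the expected count per dataset under
a uniform distribution, and the chi-squared statistic one would compute
over the observed counts, are:

```latex
% Sanity check only: the 3 picks per region are not independent, so
% this is not a proper hypothesis test.
E = \frac{9000}{27} \approx 333.3,
\qquad
\chi^2 = \sum_{i=1}^{27} \frac{(O_i - E)^2}{E} \quad (\mathrm{df} = 26)
```

In the runs below every observed count falls between 297 and 373, i.e.
within about 12% of E, which is consistent with a uniform shuffle.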

I ran 3 sets of 3000 region allocations, each resulting in 9000
dataset selections across 27 datasets. I got these distributions,
counting how many times each dataset was selected.

```
[351, 318, 341, 366, 337, 322, 329, 328, 327, 373, 335, 322, 330, 335, 333, 324, 349, 338, 346, 314, 337, 327, 328, 330, 322, 319, 319]
[329, 350, 329, 329, 334, 299, 355, 319, 339, 335, 308, 310, 364, 330, 366, 341, 334, 316, 331, 329, 298, 337, 339, 344, 368, 322, 345]
[352, 314, 316, 332, 355, 332, 320, 332, 337, 329, 312, 339, 366, 339, 333, 352, 329, 343, 327, 297, 329, 340, 373, 320, 304, 334, 344]
```

This seems convincingly uniform to me.