Epic: improved eviction #5331
Moved to in-progress, as I am opening PRs this week and testing in staging. |
Did not yet get to start, but hopeful for this week. |
This is aimed at replacing the current mtime-only based thrashing alerting later. Cc: #5331
Discussed this just now, trying to summarize. My open question had been the "advertised resident set size", and what it would mean to not do time-threshold based eviction. The latter is easier, and becomes obvious once you look at our […]. I pointed out that if we no longer did threshold based eviction, we might not settle on anything resembling a "resident set size", as it would be guided by current […]. Post-call comparison of synthetic size sums on a single region, where there are mostly 12k tenants per pageserver, to […]. Related to the very fast redownloads, there are two guesses as to the reason, which I investigated after the call:
These guesses are sort of related, however […]. One action point, soon to be recorded in the task list: make the logging better for redownloads, as I only added it as a histogram and these searches are really expensive.
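A minimal sketch of that action point, assuming the `prometheus`, `tracing`, and `tracing-subscriber` crates; the metric name and helper here are hypothetical, not the pageserver's actual metric:

```rust
use prometheus::{Histogram, HistogramOpts};

// Hypothetical helper: record a redownload both as a histogram sample and
// as a structured log line, so individual events are cheap to find
// without expensive histogram-based searches.
fn record_redownload(hist: &Histogram, layer_name: &str, secs_since_eviction: f64) {
    hist.observe(secs_since_eviction);
    tracing::info!(
        layer = %layer_name,
        secs_since_eviction,
        "layer redownloaded after eviction"
    );
}

fn main() {
    tracing_subscriber::fmt::init();
    let hist = Histogram::with_opts(HistogramOpts::new(
        "redownload_after_eviction_seconds",
        "Time between eviction and redownload of the same layer",
    ))
    .unwrap();
    record_redownload(&hist, "some-layer-file-name", 12.5);
}
```
|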
Only got to testing #5304 today due to unrelated staging problems. I need to go over the actual results on […]. Assuming the results are sane, the next steps are:
Post-discussion afterthought: if we do disk usage based eviction before all imitations are complete, should the eviction be LSN-based or random...? |
After further discussion with @jcsp and some review of testing results, we refined the next steps:
|
Next:
|
Next step:
|
Calculate the `relative_last_activity` using the total number of evicted and resident layers, similar to what we originally planned. Cc: #5331
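A minimal sketch of that calculation, under the assumption (not confirmed by this comment) that a layer's recency rank within its tenant is normalized by the tenant's combined evicted-plus-resident layer count, so tenants with many layers do not dominate the global ordering:

```rust
/// Hypothetical sketch: rank 0 is the most recently used layer; the result
/// is in (0.0, 1.0], and a higher value means the layer looks more recently
/// active relative to the tenant's total (evicted + resident) layer count.
fn relative_last_activity(total_layers: usize, rank: usize) -> f32 {
    debug_assert!(rank < total_layers);
    (total_layers - rank) as f32 / total_layers as f32
}

fn main() {
    // a tenant with 100 layers, resident and evicted combined
    assert_eq!(relative_last_activity(100, 0), 1.0); // most recent
    assert_eq!(relative_last_activity(100, 99), 0.01); // least recent
}
```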
PR list has these open:
Next steps:
|
Refactor out layer accesses so that we can have easy access to resident layers, which are needed in a number of cases, instead of layers for eviction. Simplifies the heatmap building by only using Layers, not RemoteTimelineClient. Cc: #5331
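A toy sketch of the access pattern this refactoring enables; all types here are hypothetical stand-ins, not the pageserver's real `Layer` or `RemoteTimelineClient` types:

```rust
// Hypothetical stand-ins to illustrate exposing resident layers directly,
// instead of filtering all layers at every call site.
enum LayerState {
    Resident { file_size: u64 },
    Evicted,
}

struct LayerMap {
    layers: Vec<(String, LayerState)>,
}

impl LayerMap {
    /// Iterate only the layers currently on local disk, e.g. for building
    /// the heatmap or picking disk usage based eviction candidates.
    fn resident_layers(&self) -> impl Iterator<Item = (&str, u64)> + '_ {
        self.layers.iter().filter_map(|(name, state)| match state {
            LayerState::Resident { file_size } => Some((name.as_str(), *file_size)),
            LayerState::Evicted => None,
        })
    }
}

fn main() {
    let map = LayerMap {
        layers: vec![
            ("layer-a".to_string(), LayerState::Resident { file_size: 1024 }),
            ("layer-b".to_string(), LayerState::Evicted),
        ],
    };
    for (name, size) in map.resident_layers() {
        println!("{name}: {size} bytes resident");
    }
}
```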
Mostly reusing the existing histogram and, perhaps controversially, sharing it; in practice we don't configure this per-tenant. Cc: #5331
This week: testing out the imitation-only policy on staging and deciding if we need to complicate eviction candidate discovery (#6224). With imitation only, we will finally run with a high number of layers all of the time, and disk usage based eviction will run often. Alternatives to #6224:
Before testing this week:
|
Extra notes:
|
New dashboard for the metrics added in #6131: https://neonprod.grafana.net/d/adecaputaszcwd/disk-usage-based-eviction?orgId=1 -- so far there have not been any disk usage based evictions on staging. |
Work this week:
|
Last week:
This week:
|
Note to self: this is about hangs in disk usage based eviction while collecting layers. |
Latest troubles in staging have provided good ground for disk usage based eviction runs:
Participated in 9 downloads.
Participated in 1 download.
Participated in 1 download.
Participated in 2 downloads.
Participated in 1 download. |
Aiming for the design where `heavier_once_cell::OnceCell` is initialized by a future factory led to awkwardness with how `LayerInner::get_or_maybe_download` looks right now with the `loop`. The loop helps with two situations:
- an eviction has been scheduled but has not yet happened, and a read access should cancel the eviction
- a previous `LayerInner::get_or_maybe_download` that canceled a pending eviction was itself canceled, leaving the `heavier_once_cell::OnceCell` uninitialized but needing repair by the next `LayerInner::get_or_maybe_download`

By instead supporting detached initialization in `heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`, we can fix what the monolithic #7030 does:
- the spawned-off download task initializes the `heavier_once_cell::OnceCell` regardless of the download starter being canceled
- a canceled `LayerInner::get_or_maybe_download` no longer stops eviction but can win it if not canceled

Split off from #7030. Cc: #5331
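A minimal runnable sketch of the detached-initialization idea, using `tokio::sync::OnceCell` as a stand-in for `heavier_once_cell`; the `Resident` placeholder and function shape are assumptions for illustration, not the pageserver's actual API:

```rust
use std::sync::Arc;
use tokio::sync::{oneshot, OnceCell};

#[derive(Clone, Debug)]
struct Resident; // stand-in for a downloaded, resident layer

// Initialization runs in a spawned task, detached from the caller: even if
// the caller's future is dropped (canceled), the cell still ends up
// initialized, so no later call has to "repair" a half-done state.
async fn get_or_maybe_download(cell: Arc<OnceCell<Resident>>) -> Resident {
    if let Some(v) = cell.get() {
        return v.clone();
    }
    let (tx, rx) = oneshot::channel();
    let cell = Arc::clone(&cell);
    tokio::spawn(async move {
        let v = cell
            .get_or_init(|| async {
                // pretend to download the layer file here
                Resident
            })
            .await
            .clone();
        let _ = tx.send(v);
    });
    // dropping this future before `rx` resolves does not abort the spawn
    rx.await.expect("download task does not panic in this sketch")
}

#[tokio::main]
async fn main() {
    let cell = Arc::new(OnceCell::new());
    let layer = get_or_maybe_download(Arc::clone(&cell)).await;
    println!("{layer:?}, initialized: {}", cell.initialized());
}
```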
The second part of work towards fixing `Layer::keep_resident` so that it does not need to repair the internal state. #7135 added a nicer API for initialization. This PR uses it to remove a few indentation levels and the loop construction. The next PR #7175 will use the refactorings done in this PR, and always initialize the internal state after a download. Cc: #5331
Before this PR, `LayerInner::get_or_maybe_download` could be canceled in such a way that we had downloaded the layer file to the filesystem, but because of the cancellation, the internal `LayerInner::inner` was never set and the state never initialized. With the detached init support introduced in #7135 and put in place in #7152, we can now initialize the internal state after successfully downloading in the spawned task.

The next PR will fix the remaining problems that this PR leaves:
- `Layer::keep_resident` is still used, because `Layer::get_or_maybe_download` always cancels an eviction, even when canceled

Split off from #7030. Stacked on top of #7152. Cc: #5331.
## Problem

The current implementation of struct `Layer` supports canceled read requests, but those will leave the internal state such that a following `Layer::keep_resident` call will need to repair the state. In pathological cases seen during generation number resets in staging, or with too many in-progress on-demand downloads, this repair activity will need to wait for the download to complete, which stalls disk usage-based eviction. Similar stalls have been observed in staging near disk-full situations, where downloads failed because the disk was full.

Fixes #6028, or the "layer is present on filesystem but not evictable" problems, by:
1. not canceling pending evictions by a canceled `LayerInner::get_or_maybe_download`
2. completing post-download initialization of `LayerInner::inner` from the download task

Not canceling evictions in case (1) and always initializing in case (2) lead to the plain `LayerInner::inner` always having up-to-date information, which means the old `Layer::keep_resident` never has to wait for downloads to complete. Finally, `Layer::keep_resident` is replaced with `Layer::is_likely_resident`. These fix #7145.

## Summary of changes

- add a new test showing that a canceled `get_or_maybe_download` should not cancel the eviction
- switch to using a `watch` internally rather than a `broadcast` to avoid hanging eviction while a download is ongoing
- doc changes for the new semantics, and cleanup
- fix `Layer::keep_resident` to use just `self.0.inner.get()` as the truth, as `Layer::is_likely_resident`
- remove the `LayerInner::wanted_evicted` boolean, as it is no longer needed

Builds upon: #7185. Cc: #5331.
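A small runnable sketch of why a `watch` channel suits this better than a `broadcast`; the status type is an assumed simplification, not the actual `LayerInner` state machine. A `watch` receiver can always read the latest value, so a late observer never waits on a message sent before it subscribed:

```rust
use tokio::sync::watch;

// Assumed, simplified download status for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum DownloadStatus {
    Downloading,
    Resident,
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(DownloadStatus::Downloading);

    // the download task publishes its final state exactly once
    tokio::spawn(async move {
        // ... download the layer file here ...
        let _ = tx.send(DownloadStatus::Resident);
    });

    // a `watch` receiver observes only the *current* value; unlike a
    // `broadcast` receiver it cannot lag behind a backlog of messages,
    // so an observer such as eviction cannot hang on a stale queue
    while *rx.borrow() != DownloadStatus::Resident {
        rx.changed().await.expect("sender dropped before completing");
    }
    println!("observed: {:?}", *rx.borrow());
}
```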
Before and after #7030:
The set of PRs culminating with #7030 also removed the "10min hang" previously observed. Later, more evidence emerged that it was caused by waiting for a download. For other fixed cases, see: #6028 (comment)
Log analysis is still too time-consuming to spot any patterns. #7030 preliminaries also included fixes for updating this metric. The best guess so far is that we get unlucky with:
However, in the short time between (1) and (2), the PITR could have advanced just enough to warrant a new synthetic size calculation, for example. The utilization endpoint work has just not been started. |
This change will improve the pageserver's ability to accommodate larger tenants, especially those with append-heavy workloads.
Tasks