Epic: improved eviction #5331

jcsp · 2023-09-18T12:38:31Z

This change will improve the pageserver's ability to accomodate larger tenants, especially those with append-heavy work.

Tasks

Give feedback

Compaction and eviction are mutually exclusive #4745

11 of 11

c/storage/pageserver t/bug
feat(layer): pageserver_layer_redownloaded_after histogram #6132
disk usage based eviction: consider relative LRU order #5304

c/storage/pageserver
https://github.com/neondatabase/aws/pull/930
disk_usage_eviction_task: cleanup summaries #6490
fix: secondary tenant relative order eviction #6491
refactor(LayerManager): resident layers query #6634
feat: imitiation_only eviction_task policy #6598
https://github.com/neondatabase/aws/pull/976
https://github.com/neondatabase/aws/pull/1071
dube: timeout individual layer evictions, log progress and record metrics #6131
utilization api: pageserver fitness for the next tenant #6835

2 of 2

c/storage/pageserver
layer: evict downloaded-but-not-initialized #6028

c/storage/pageserver t/bug
fix(layer): remove the need to repair internal state #7030
Time-based eviction is not needed when operating with a disk usage target (e.g. target usage 70%, critical usage 85%). We only need space-driven eviction, but perhaps with two modes, one where it trims ruthlessly and another where it aims to avoid trimming things which are more recent than latest compaction.
For thrashing alterting, we will need to define a new measurement based on the time between evicting a layer and re-downloading it. A threshold approximately 20-60 minutes would be used, based on (disk_size / disk_bandwidth), where a totally streaming workload would be expected to fit in cache over that timescale (John's heuristic for thrashing
fast redownloads could log the access history or at least highlight "fastness" on the log line for easier finding
consider again not requiring read access to Timeline::layers for evictable layers query
Options

koivunej · 2023-11-27T12:34:04Z

Moved to inprogress as opening PRs this week and testing in staging.

koivunej · 2023-12-04T12:28:33Z

Did not yet get to start, but hopeful of this week.

this is aimed at replacing the current mtime only based trashing alerting later. Cc: #5331

koivunej · 2024-01-16T11:38:38Z

Discussed this just now, trying to summarize.

My open question had been the "advertised resident set size", and what did it mean to not do time threshold based eviction. Latter is easier, and becomes obvious once you look at our redownloaded_at histogram -- we simply would not do it, and we wouldn't have any of those very fast redownloads.

I pointed out that if we would no longer do threshold based eviction, we might not settle on anything resembling "resident set size" as it would be guided by current EvictionOrder and frequency of executing the more tame version or the more critical version. For a better estimate @jcsp had been thinking MAX(timelines.logical_size) with a fudge factor or just the plain synthetic size. This would be used to advertise if we can accept more tenants.

Post-call comparison of synthetic size sums on a single region where there are mostly 12k tenants per pageserver to sum(max(tenant.timelines.logical_size)) with the special pageservers removed gives a range of 1.32 to 3.08, meaning the suggested fudge factor of 2 might work. Removing threshold based eviction does not matter for this rational.

Related to the very fast redownloads, there are two guesses as for the reason which I investigated after the call:

synthetic size calculation where the point in which we calculate the latest logical size moves forward could get unlucky (in relation to imitation not "catching" this)
- one evidence, last_record_lsn increase -- no check_availability operations
- one evidence, last_record_lsn increase -- no check_availability operations
- one evidence, same tenant as above
availability check ends up redownloading those, because we don't directly imitate basebackup -- but we don't imitate basebackup because it is thought to be covered
- couldn't find any evidence with the really small sample size 130 on-demand downloads

These guesses are sort of related however, a check_availability might produce WAL (at least it did at one point), so it might cause the synthetic size logical size point to move forward.

One action point, soon recorded in the task list: need to make the logging better however for redownloads, as I only added it as a histogram and these searches are really expensive.

koivunej · 2024-01-22T11:27:16Z

Got to testing #5304 today due to unrelated staging problems. I need to go over the actual results on ps-7.us-east-2.aws.neon.build.

Assuming the results are sane, the next steps are:

cleanup the summary messages (semi-revert temp: human readable summaries for relative access time compared to absolute #6384, keep select_victims refactoring)
introduce a per timeline evictiontask mode which does not evict but only imitiates
perhaps introduce a second mode (or don't) for disk usage based eviction
- staging: we restart quite often, so pageserver inmemory state is reset often
- production: we restart much more rarely so perhaps there is no real need

Post-discussion afterthought: if we do disk usage based eviction before all imitations are complete, should the eviction be Lsn based or random...?

koivunej · 2024-01-23T11:01:27Z

After further discussion with @jcsp and some review of testing results refined the next steps:

testing on staging without per timeline eviction task to make sure huge layer counts are not noticeable for disk usage based eviction
enable on one production region which has high disk usage right now (50%)

jcsp · 2024-01-29T11:28:10Z

Implement the imitate-only task so that we can disable time based eviction.
Engage CP team to agree new API for exposing a size heuristic to unblock moving to disk-only (no time based) eviction
Enable relative eviction in prod configs

koivunej · 2024-02-05T11:09:58Z

#6491 and #6598 are ready-ish to go but I forgot the config updates from last week.

Discussion about pageserver owning the "is good for next tenant" is barely started.

jcsp · 2024-02-05T11:22:46Z

Next step:

Define the interface for CP for utilization
Avoid taking tenant locks when collecting layers to evict.

Calculate the `relative_last_activity` using the total evicted and resident layers similar to what we originally planned. Cc: #5331

koivunej · 2024-02-12T10:51:18Z

PR list has these open:

enable rel eviction in prod -- we should merge it
imitation only eviction task policy -- reuses the metrics, but we shouldn't have any different configured per tenant
rwlock contention needs refreshing of review

Next steps:

write up an issue on the new endpoint (next bullet)
impl the endpoint for querying how good the PS thinks it is for the next tenant

Refactor out layer accesses so that we can have easy access to resident layers, which are needed for number of cases instead of layers for eviction. Simplifies the heatmap building by only using Layers, not RemoteTimelineClient. Cc: #5331

mostly reusing the existing and perhaps controversially sharing the histogram. in practice we don't configure this per-tenant. Cc: #5331

PR adds a simple at most 1Hz refreshed informational API for querying pageserver utilization. In this first phase, no actual background calculation is performed. Instead, the worst possible score is always returned. The returned bytes information is however correct. Cc: #6835 Cc: #5331

koivunej · 2024-02-26T10:42:26Z

This week testing out the imitation only policy on staging and deciding if we need to complicate eviction candidate discovery #6224. With imitation only, we will finally run with a high amount of layers all of the time, and disk usage based eviction will run often.

Alternatives to #6224:

evict earlier non-hit layers after creating image layers

Before testing this week:

task list has a logging improvement
metric improvement for understanding how bad is the current layer discovery
could also do the low hanging fruit optimizations there

jcsp · 2024-02-26T11:12:35Z

Extra notes:

The try_lock change was reverted for lack of evidence that it was the underlying cause
So ~10 minute hang is still probably in there: expect to see reproduction in staging testing

koivunej · 2024-03-01T11:33:48Z

New dashy for the metrics added in #6131: https://neonprod.grafana.net/d/adecaputaszcwd/disk-usage-based-eviction?orgId=1 -- so far there has not been any disk usage based evictions on staging.

koivunej · 2024-03-04T10:36:17Z

Work this week:

staging shows an performance issue with struct Layer or disk usage based eviction collection
further testing in staging together with OnlyImitiate policy
we will likely roll out continious disk usage based eviction to a single pageserver in prod in a region which has great tenant imbalance

koivunej · 2024-03-11T10:53:31Z

Last week:

the performance issue was identified on staging, and fix(layer): remove the need to repair internal state #7030 was created
troubles creating even 10GB pgbench databases due to primary key query repeatedly being interrupted because of a SIGHUP (https://github.com/neondatabase/cloud/issues/11023)

This week:

split the fix(layer): remove the need to repair internal state #7030, get reviews through-out the week
migrate more tenants on to pageserver-1.eu-west-1

jcsp · 2024-03-11T11:28:11Z

split the #7030, get reviews through-out the week

Note to self: this about hangs in disk usage based eviction while collecting layers.

koivunej · 2024-03-14T13:01:48Z

Latest troubles in staging have provided good ground for disk usage based eviction runs (pageserver-[01].eu-west-1), listing the examined outliers after #6131:

2024-03-14T12:04:55.501895Z  INFO disk_usage_eviction_task:iteration{iteration_no=1093}: collection took longer than threshold tenant_id=9d984098974b482e25f8b85560f9bba3 shard_id=0000 elapsed_ms=15367

Participated in 9 downloads.

2024-03-14T12:15:44.980494Z  INFO disk_usage_eviction_task:iteration{iteration_no=1155}: collection took longer than threshold tenant_id=a992f0c69c3d69b7338586750ba3f9c1 shard_id=0000 elapsed_ms=12523

Participated in 1 download.

2024-03-14T12:18:45.162630Z  INFO disk_usage_eviction_task:iteration{iteration_no=1168}: collection took longer than threshold tenant_id=7affec0a9fdf9da5b3638894a84cb9cc shard_id=0000 elapsed_ms=13364

Participated in 1 download.

2024-03-14T12:18:59.848429Z  INFO disk_usage_eviction_task:iteration{iteration_no=1168}: collection took longer than threshold tenant_id=a776112dba9d2adbb7a7746b6533125d shard_id=0000 elapsed_ms=10176

Participated in 2 downloads.

2024-03-14T12:19:27.135951Z  INFO disk_usage_eviction_task:iteration{iteration_no=1168}: collection took longer than threshold tenant_id=f231e5ac37f956babb1cc98dcfb088ce shard_id=0000 elapsed_ms=17911

Participated in 1 download.

Split off from #7030: - each early exit is counted as canceled init, even though it most likely was just `LayerInner::keep_resident` doing the no-download repair check - `downloaded_after` could had been accounted for multiple times, and also when repairing to match on-disk state Cc: #5331

Aiming for the design where `heavier_once_cell::OnceCell` is initialized by a future factory lead to awkwardness with how `LayerInner::get_or_maybe_download` looks right now with the `loop`. The loop helps with two situations: - an eviction has been scheduled but has not yet happened, and a read access should cancel the eviction - a previous `LayerInner::get_or_maybe_download` that canceled a pending eviction was canceled leaving the `heavier_once_cell::OnceCell` uninitialized but needing repair by the next `LayerInner::get_or_maybe_download` By instead supporting detached initialization in `heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`, we can fix what the monolithic #7030 does: - spawned off download task initializes the `heavier_once_cell::OnceCell` regardless of the download starter being canceled - a canceled `LayerInner::get_or_maybe_download` no longer stops eviction but can win it if not canceled Split off from #7030. Cc: #5331

The second part of work towards fixing `Layer::keep_resident` so that it does not need to repair the internal state. #7135 added a nicer API for initialization. This PR uses it to remove a few indentation levels and the loop construction. The next PR #7175 will use the refactorings done in this PR, and always initialize the internal state after a download. Cc: #5331

Before this PR, cancellation for `LayerInner::get_or_maybe_download` could occur so that we have downloaded the layer file in the filesystem, but because of the cancellation chance, we have not set the internal `LayerInner::inner` or initialized the state. With the detached init support introduced in #7135 and in place in #7152, we can now initialize the internal state after successfully downloading in the spawned task. The next PR will fix the remaining problems that this PR leaves: - `Layer::keep_resident` is still used because - `Layer::get_or_maybe_download` always cancels an eviction, even when canceled Split off from #7030. Stacked on top of #7152. Cc: #5331.

Small fix to remove confusing `mut` bindings. Builds upon #7175, split off from #7030. Cc: #5331.

## Problem The current implementation of struct Layer supports canceled read requests, but those will leave the internal state such that a following `Layer::keep_resident` call will need to repair the state. In pathological cases seen during generation numbers resetting in staging or with too many in-progress on-demand downloads, this repair activity will need to wait for the download to complete, which stalls disk usage-based eviction. Similar stalls have been observed in staging near disk-full situations, where downloads failed because the disk was full. Fixes #6028 or the "layer is present on filesystem but not evictable" problems by: 1. not canceling pending evictions by a canceled `LayerInner::get_or_maybe_download` 2. completing post-download initialization of the `LayerInner::inner` from the download task Not canceling evictions above case (1) and always initializing (2) lead to plain `LayerInner::inner` always having the up-to-date information, which leads to the old `Layer::keep_resident` never having to wait for downloads to complete. Finally, the `Layer::keep_resident` is replaced with `Layer::is_likely_resident`. These fix #7145. ## Summary of changes - add a new test showing that a canceled get_or_maybe_download should not cancel the eviction - switch to using a `watch` internally rather than a `broadcast` to avoid hanging eviction while a download is ongoing - doc changes for new semantics and cleanup - fix `Layer::keep_resident` to use just `self.0.inner.get()` as truth as `Layer::is_likely_resident` - remove `LayerInner::wanted_evicted` boolean as no longer needed Builds upon: #7185. Cc: #5331.

koivunej · 2024-04-03T11:49:30Z

Before and after #7030:

2024-03-21T02:41:07.618648Z  INFO disk_usage_eviction_task:iteration{iteration_no=1362}: collection completed elapsed_ms=4969 total_layers=83690
2024-03-21T03:53:43.072165Z  INFO disk_usage_eviction_task:iteration{iteration_no=400}: collection completed elapsed_ms=135 total_layers=83695

The set of PRs culminating with #7030 also removed the "10min hang" previously observed. Later more evidence came that it was caused by waiting for a download. For other fixed cases, see: #6028 (comment)

pageserver_layer_downloaded_after metric is still not being used to alert because many cases in staging cause redownloads very soon after evicting. In production, the old mtime-based trashing alert has been downgraded as a warning. It is not known why we get into this situation.

Log analysis is still too time-consuming to spot any patterns. #7030 preliminaries also included fixes for updating this metric. The best guess so far is that we get unlucky with:

evict
initiate layer accesses right after

However, in the short time between (1) to (2), the PITR could have advanced just enough to warrant new synthetic size calculation, for example.

The utilization endpoint work has just not been started.

jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver labels Sep 18, 2023

koivunej mentioned this issue Oct 26, 2023

reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer #4938

Merged

shanyp assigned koivunej Nov 6, 2023

koivunej mentioned this issue Dec 13, 2023

feat(layer): pageserver_layer_redownloaded_after histogram #6132

Merged

koivunej added a commit that referenced this issue Dec 14, 2023

feat(layer): pageserver_layer_redownloaded_after histogram (#6132)

f010479

this is aimed at replacing the current mtime only based trashing alerting later. Cc: #5331

This was referenced Feb 2, 2024

feat: imitiation_only eviction_task policy #6598

Merged

fix: secondary tenant relative order eviction #6491

Merged

koivunej mentioned this issue Feb 5, 2024

refactor(LayerManager): resident layers query #6634

Merged

koivunej added a commit that referenced this issue Feb 8, 2024

fix: secondary tenant relative order eviction (#6491)

c099933

Calculate the `relative_last_activity` using the total evicted and resident layers similar to what we originally planned. Cc: #5331

koivunej mentioned this issue Feb 20, 2024

feat: bare-bones /v1/utilization #6831

Merged

koivunej added a commit that referenced this issue Feb 21, 2024

feat: imitiation_only eviction_task policy (#6598)

7257ffb

mostly reusing the existing and perhaps controversially sharing the histogram. in practice we don't configure this per-tenant. Cc: #5331

koivunej mentioned this issue Mar 4, 2024

layer: evict downloaded-but-not-initialized #6028

Closed

This was referenced Mar 15, 2024

heavier_once_cell: add detached init support #7135

Merged

fix(layer): metric miscalculations #7137

Merged

This was referenced Mar 15, 2024

Layer::keep_resident will wait for layer download blocking disk usage based eviction #7145

Closed

refactor(layer): use detached init #7152

Merged

fix(Layer): always init after downloading in the spawned task #7175

Merged

koivunej mentioned this issue Mar 20, 2024

fix(heavier_once_cell): take_and_deinit should take ownership #7185

Merged

koivunej added a commit that referenced this issue Mar 20, 2024

fix(heavier_once_cell): take_and_deinit should take ownership (#7185)

a95c41f

Small fix to remove confusing `mut` bindings. Builds upon #7175, split off from #7030. Cc: #5331.

koivunej mentioned this issue Mar 21, 2024

fix(layer): remove the need to repair internal state #7030

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epic: improved eviction #5331

Epic: improved eviction #5331

jcsp commented Sep 18, 2023 •

edited by koivunej

Loading

Tasks

koivunej commented Nov 27, 2023

koivunej commented Dec 4, 2023

koivunej commented Jan 16, 2024 •

edited

Loading

koivunej commented Jan 22, 2024 •

edited

Loading

koivunej commented Jan 23, 2024

jcsp commented Jan 29, 2024

koivunej commented Feb 5, 2024

jcsp commented Feb 5, 2024 •

edited

Loading

koivunej commented Feb 12, 2024 •

edited

Loading

koivunej commented Feb 26, 2024 •

edited

Loading

jcsp commented Feb 26, 2024

koivunej commented Mar 1, 2024 •

edited

Loading

koivunej commented Mar 4, 2024 •

edited

Loading

koivunej commented Mar 11, 2024 •

edited by jcsp

Loading

jcsp commented Mar 11, 2024

koivunej commented Mar 14, 2024

koivunej commented Apr 3, 2024

Epic: improved eviction #5331

Epic: improved eviction #5331

Comments

jcsp commented Sep 18, 2023 • edited by koivunej Loading

Tasks

koivunej commented Nov 27, 2023

koivunej commented Dec 4, 2023

koivunej commented Jan 16, 2024 • edited Loading

koivunej commented Jan 22, 2024 • edited Loading

koivunej commented Jan 23, 2024

jcsp commented Jan 29, 2024

koivunej commented Feb 5, 2024

jcsp commented Feb 5, 2024 • edited Loading

koivunej commented Feb 12, 2024 • edited Loading

koivunej commented Feb 26, 2024 • edited Loading

jcsp commented Feb 26, 2024

koivunej commented Mar 1, 2024 • edited Loading

koivunej commented Mar 4, 2024 • edited Loading

koivunej commented Mar 11, 2024 • edited by jcsp Loading

jcsp commented Mar 11, 2024

koivunej commented Mar 14, 2024

koivunej commented Apr 3, 2024

jcsp commented Sep 18, 2023 •

edited by koivunej

Loading

koivunej commented Jan 16, 2024 •

edited

Loading

koivunej commented Jan 22, 2024 •

edited

Loading

jcsp commented Feb 5, 2024 •

edited

Loading

koivunej commented Feb 12, 2024 •

edited

Loading

koivunej commented Feb 26, 2024 •

edited

Loading

koivunej commented Mar 1, 2024 •

edited

Loading

koivunej commented Mar 4, 2024 •

edited

Loading

koivunej commented Mar 11, 2024 •

edited by jcsp

Loading