disk usage based eviction: consider relative LRU order #5304

Closed · Tracked by #5331
koivunej opened this issue Sep 14, 2023 · 7 comments
Labels: c/storage/pageserver (Component: storage: pageserver)

Currently our disk usage based eviction evicts layers in the absolute order of recent accesses. This means that, for example, a single new timeline will see all other timelines' layers get evicted before any of its own layers is evicted.

What if, instead of using the absolute timestamps (see the relevant parts below), we came up with a 0..1 f32 based on how relatively recently the layer has been accessed?

"Relatively recently": for layer x, its relative_recency is 1.0 - (x.last_activity_ts.as_secs_f32() / oldest_layer_access.as_secs_f32()) (treating both as elapsed time since the last access), so that 1.0 corresponds to the most recently accessed layer and 0.0 to the oldest access.

This relative measure would put all timelines on an equal footing, and would let us handle a fast-growing timeline gracefully: at worst it gets somewhat slower performance because of thrashed layers, but the overall health of the node is preserved.
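A minimal sketch of that computation (assuming ages are tracked as `std::time::Duration` since the last access; the `Layer` struct and field names here are hypothetical, not the pageserver's actual types):

```rust
use std::time::Duration;

/// Hypothetical layer handle: `age` is the elapsed time since this
/// layer was last accessed (the issue's `last_activity_ts`).
struct Layer {
    age: Duration,
}

/// Maps a layer's age onto 0.0..=1.0 within one tenant/timeline:
/// 1.0 for the most recently accessed layer, 0.0 for the oldest access.
fn relative_recency(layer: &Layer, oldest_age: Duration) -> f32 {
    if oldest_age.is_zero() {
        // Degenerate case: every layer was accessed "just now".
        return 1.0;
    }
    1.0 - (layer.age.as_secs_f32() / oldest_age.as_secs_f32())
}

fn main() {
    let layers = [
        Layer { age: Duration::from_secs(10) },
        Layer { age: Duration::from_secs(3600) },
        Layer { age: Duration::from_secs(7200) },
    ];
    let oldest = layers.iter().map(|l| l.age).max().unwrap();
    for l in &layers {
        // Prints ~0.999, 0.500 and 0.000 respectively.
        println!("{:?} -> {:.3}", l.age, relative_recency(l, oldest));
    }
}
```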

Relevant parts:

@koivunej koivunej added the c/storage/pageserver Component: storage: pageserver label Sep 14, 2023

koivunej commented Sep 14, 2023

Perhaps no floats are needed; the relative_recency could just be an index into the per-tenant list.

EDIT: except that it will not be invariant to tenants having different numbers of layers, so we will still need to normalize it to an f32.
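A sketch of that normalization (a hypothetical helper, not existing pageserver code): with each tenant's layers sorted from oldest access to most recent, dividing the index by the tenant's layer count puts differently sized tenants back on the same 0.0..=1.0 scale:

```rust
/// Normalizes a position in one tenant's access-ordered layer list
/// (index 0 = oldest access, index len-1 = most recent) to 0.0..=1.0.
/// A raw index is not comparable across tenants: index 100 means
/// "newest" in a 101-layer tenant but "fairly old" in a 10_000-layer one.
fn relative_recency_by_index(index: usize, layer_count: usize) -> f32 {
    assert!(index < layer_count);
    if layer_count == 1 {
        return 1.0;
    }
    index as f32 / (layer_count - 1) as f32
}
```

With this, index 100 out of 101 layers maps to 1.0, while index 100 out of 10_000 maps to about 0.01, which is exactly the invariance the EDIT is after.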


koivunej commented Oct 5, 2023

Discussed this with Heikki. He suggested reviewing what kinds of algorithms exist in the literature.

koivunej added a commit that referenced this issue Dec 20, 2023
Adds a new disk usage based eviction option, EvictionOrder, which
selects whether to use the current `AbsoluteAccessed` order or the new,
proposed but not yet tested, `RelativeAccessed` order. Additionally, a
fudge factor was noticed while implementing this, which might help spare
smaller tenants at the expense of targeting larger tenants.

Cc: #5304

Co-authored-by: Arpad Müller <arpad@neon.tech>
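A rough shape of such an option, as a sketch only: the variant names come from the commit message, while the field representing the fudge factor is an assumption (presumably what the `relative_spare` / `relative_equal` settings mentioned later in this thread toggle):

```rust
/// Selects how disk usage based eviction orders its candidates.
#[derive(Debug, Clone, Copy)]
pub enum EvictionOrder {
    /// Existing behavior: one global LRU over absolute access timestamps.
    AbsoluteAccessed,
    /// Proposed behavior: order by per-tenant relative recency in 0.0..=1.0.
    RelativeAccessed {
        /// The fudge factor: when set, bias the ordering so that tenants
        /// with more layers lose layers first, sparing smaller tenants.
        highest_layer_count_loses_first: bool,
    },
}
```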
@koivunej koivunej self-assigned this Jan 15, 2024
koivunej commented:

Next steps:

  • modify the straightforward implementation to run both the old and the new selection algorithms
  • gather summary statistics when running with the relative order
  • test it in staging (all servers)
    • ballast file is easy
    • need to figure out a way to simulate a "fast growing tenant"
    • how to restore state between tests?
  • log the results here


koivunej commented Jan 15, 2024

Discussion from the planning meeting:

  • pass criteria are needed, also as a TODO in the tests
    • an innocent-bystander tenant loses some proportion of its layers; the fast-growing one loses more
  • pytest idea: reset access times via http
  • "resetting staging" probably not an issue, assume imitation takes care of it

koivunej added a commit that referenced this issue Jan 18, 2024
I just failed to see this earlier on #6136. Layer counts are used as an
abstraction, and each of the two tenants loses proportionally about the
same number of layers. Sadly there is no difference between
`relative_spare` and `relative_equal`, as both end up evicting exactly
the same number of layers, but I'll try to add another test for those
later.

Cc: #5304
koivunej added a commit that referenced this issue Jan 18, 2024
…bsolute (#6384)

When testing the new eviction order there is a problem: all of the
(currently rare) disk usage based eviction runs are unique; this PR adds
a human-readable summary of what the absolute order would have done and
what the relative order does. The assumption is that this logging will
make the few eviction runs in staging more useful.

Cc: #5304 for allowing testing in staging
koivunej commented:

This has been tested a bit on staging. For the most part, the results via the bayadin/1tb benchmark and ballast files are encouraging:

  • fast-growing tenants with 2TB remote size, which are constantly being downloaded from s3, are evicted the most
  • other tenants did suffer as well, but not as much

Interestingly, in some cases absolute ordering did a better job, by evicting only from a single fast-growing tenant and not from any idle ones. I suspect this is because of imitation and the per-timeline eviction task: in our staging there is very little activity, and most tenants on ps-7.us-east-2 are at their imitated resident sizes. Imitation runs quite often compared to the time it takes to download the layers needed for these large tenants, so the absolute accessed order correctly saw the used-once layers of a large tenant here:

2024-01-22T09:38:59.102599Z  INFO disk_usage_eviction_task:iteration{iteration_no=4646}: absolute accessed selection summary: selected 384 layers of 48.1GiB up to (Some(SystemTime { tv_sec: 1705914774, tv_nsec: 815487540 }), Some(0.16)):
- 384 layers: <bench1> (48.1GiB)
2024-01-22T09:38:59.104929Z  INFO disk_usage_eviction_task:iteration{iteration_no=4646}: relative accessed selection summary: selected 532 layers of 48.2GiB up to (Some(SystemTime { tv_sec: 1705915051, tv_nsec: 836883713 }), Some(0.08)):
- 192 layers: <bench1> (24.0GiB)
- 165 layers: <bench2> (20.7GiB)
- 8 layers: <unrelated1> (56.1MiB)
- 7 layers: <unrelated2> (484.9MiB), <unrelated3> (258.1MiB)
- 6 layers: <unrelated4> (1.4MiB)
- 5 layers: 5 tenants 429.2MiB in total 25 layers
- 4 layers: <unrelated5> (114.7MiB), <unrelated6> (1.3MiB)
- 3 layers: 7 tenants 579.7MiB in total 21 layers
- 2 layers: 22 tenants 1.3GiB in total 44 layers
- 1 layers: 49 tenants 330.7MiB in total 49 layers

Full logs can be found via this search.

koivunej added a commit that referenced this issue Jan 25, 2024
Refactor out test_disk_usage_eviction tenant creation and add a custom
case with 4 tenants, 3 made with pgbench scale=1 and 1 made with pgbench
scale=4.

Because the tenants are created in order of scales [1, 1, 1, 4], this is
simple enough to demonstrate the problem with using absolute access
times: on a disk usage based eviction run we will disproportionately
target the *first* scale=1 tenant(s), while the later larger tenant does
not lose anything.

This test is not enough to show the difference between `relative_equal`
and `relative_spare` (the fudge factor); a much larger scale would be
needed for "the large tenant", but that would make debug-mode tests
slower.

Cc: #5304

koivunej commented Feb 5, 2024

The biggest problem was the 10-minute "layer collection", which happened together with Layer deletions hanging. So far there is only #6634 as a partial solution. I'm using quotes around "layer collection" because the time may have gone to both the layer collection and the absolute-order reporting; in this case, though, the reporting was easy since there was just one tenant (the log message above).

Layer deletions hanging can be explained by the spawn_blocking pool having a long queue, but 10 minutes seems very long even for that. The whole node was CPU-exhausted, hovering at around 90% CPU utilization, with user+system totaling 50% and the rest irq and iowait. I think that points to "200 downloads are too much".

koivunej commented:

Consideration complete, didn't find anything else needing to be updated here.
