disk usage based eviction: consider relative LRU order #5304

Closed · Tracked by #5331
koivunej opened this issue Sep 14, 2023 · 7 comments
Labels: c/storage/pageserver (Component: storage: pageserver)

Currently our disk usage based eviction evicts layers in the absolute order of recent accesses. This means that, for example, a single new timeline will see all other timelines' layers get evicted before any of its own layers is evicted.

What if, instead of using the absolute timestamps (see the relevant parts below), we came up with a 0..1 f32 based on how relatively recently the layer has been accessed?

"Relatively recently": for layer x, its relative_recency is 1.0 - (x.last_activity_ts.as_secs_f32() / oldest_layer_access.as_secs_f32()) (treating both as elapsed time since the last access), so that 1.0 corresponds to the most recently accessed layer and 0.0 to the oldest access.

This relative measure would put all timelines on an equal footing, and would let us handle a fast-growing timeline gracefully: at worst it gets somewhat slower performance because of thrashed layers, but the overall health of the node is preserved.
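A minimal sketch of that computation (assuming ages are tracked as `std::time::Duration` since the last access; the `Layer` struct and field names here are hypothetical, not the pageserver's actual types):

```rust
use std::time::Duration;

/// Hypothetical layer handle: `age` is the elapsed time since this
/// layer was last accessed (the issue's `last_activity_ts`).
struct Layer {
    age: Duration,
}

/// Maps a layer's age onto 0.0..=1.0 within one tenant/timeline:
/// 1.0 for the most recently accessed layer, 0.0 for the oldest access.
fn relative_recency(layer: &Layer, oldest_age: Duration) -> f32 {
    if oldest_age.is_zero() {
        // Degenerate case: every layer was accessed "just now".
        return 1.0;
    }
    1.0 - (layer.age.as_secs_f32() / oldest_age.as_secs_f32())
}

fn main() {
    let layers = [
        Layer { age: Duration::from_secs(10) },
        Layer { age: Duration::from_secs(3600) },
        Layer { age: Duration::from_secs(7200) },
    ];
    let oldest = layers.iter().map(|l| l.age).max().unwrap();
    for l in &layers {
        // Prints ~0.999, 0.500 and 0.000 respectively.
        println!("{:?} -> {:.3}", l.age, relative_recency(l, oldest));
    }
}
```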

Relevant parts:

@koivunej koivunej added the c/storage/pageserver Component: storage: pageserver label Sep 14, 2023

koivunej commented Sep 14, 2023

Perhaps no floats are needed; the relative_recency could just be an index into the per-tenant list.

EDIT: except that it will not be invariant to tenants having different numbers of layers, so we will still need to normalize it to an f32.
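A sketch of that normalization (a hypothetical helper, not existing pageserver code): with each tenant's layers sorted from oldest access to most recent, dividing the index by the tenant's layer count puts differently sized tenants back on the same 0.0..=1.0 scale:

```rust
/// Normalizes a position in one tenant's access-ordered layer list
/// (index 0 = oldest access, index len-1 = most recent) to 0.0..=1.0.
/// A raw index is not comparable across tenants: index 100 means
/// "newest" in a 101-layer tenant but "fairly old" in a 10_000-layer one.
fn relative_recency_by_index(index: usize, layer_count: usize) -> f32 {
    assert!(index < layer_count);
    if layer_count == 1 {
        return 1.0;
    }
    index as f32 / (layer_count - 1) as f32
}
```

With this, index 100 out of 101 layers maps to 1.0, while index 100 out of 10_000 maps to about 0.01, which is exactly the invariance the EDIT is after.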


koivunej commented Oct 5, 2023

Discussed this with Heikki. He suggested reviewing what kinds of algorithms exist in the literature.

koivunej added a commit that referenced this issue Dec 20, 2023
Adds a new disk usage based eviction option, EvictionOrder, which
selects whether to use the current `AbsoluteAccessed` order or the new,
proposed but not yet tested, `RelativeAccessed` order. Additionally, a
fudge factor was noticed while implementing this, which might help spare
smaller tenants at the expense of targeting larger tenants.

Cc: #5304

Co-authored-by: Arpad Müller <arpad@neon.tech>
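A rough shape of such an option, as a sketch only: the variant names come from the commit message, while the field representing the fudge factor is an assumption (presumably what the `relative_spare` / `relative_equal` settings mentioned later in this thread toggle):

```rust
/// Selects how disk usage based eviction orders its candidates.
#[derive(Debug, Clone, Copy)]
pub enum EvictionOrder {
    /// Existing behavior: one global LRU over absolute access timestamps.
    AbsoluteAccessed,
    /// Proposed behavior: order by per-tenant relative recency in 0.0..=1.0.
    RelativeAccessed {
        /// The fudge factor: when set, bias the ordering so that tenants
        /// with more layers lose layers first, sparing smaller tenants.
        highest_layer_count_loses_first: bool,
    },
}
```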
@koivunej koivunej self-assigned this Jan 15, 2024
koivunej commented:

Next steps:

  • modify the straightforward implementation to run both the old and the new selection algorithms
  • gather summary statistics when running with the relative order
  • test it in staging (all servers)
    • ballast file is easy
    • need to figure out a way to simulate a "fast growing tenant"
    • how to restore state between tests?
  • log the results here


koivunej commented Jan 15, 2024

Discussion from the planning meeting:

  • pass criteria are needed, also as a TODO in the tests
    • an innocent-bystander tenant loses some proportion of its layers; the fast-growing one loses more
  • pytest idea: reset access times via http
  • "resetting staging" probably not an issue, assume imitation takes care of it

koivunej added a commit that referenced this issue Jan 18, 2024
I just failed to see this earlier on #6136. Layer counts are used as an
abstraction, and each of the two tenants loses proportionally about the
same number of layers. Sadly there is no difference between
`relative_spare` and `relative_equal`, as both end up evicting exactly
the same number of layers, but I'll try to add another test for those
later.

Cc: #5304
koivunej added a commit that referenced this issue Jan 18, 2024
…bsolute (#6384)

When testing the new eviction order there is a problem: all of the
(currently rare) disk usage based eviction runs are unique; this PR adds
a human-readable summary of what the absolute order would have done and
what the relative order does. The assumption is that this logging will
make the few eviction runs in staging more useful.

Cc: #5304 for allowing testing in staging
koivunej commented:

This has been tested a bit on staging. For the most part, the results via the bayadin/1tb benchmark and ballast files are encouraging:

  • fast-growing tenants with 2TB remote size, which are constantly being downloaded from s3, are evicted the most
  • other tenants did suffer as well, but not as much

Interestingly, in some cases absolute ordering did a better job, by evicting only from a single fast-growing tenant and not from any idle ones. I suspect this is because of imitation and the per-timeline eviction task: in our staging there is very little activity, and most tenants on ps-7.us-east-2 are at their imitated resident sizes. Imitation runs quite often compared to the time it takes to download the layers needed for these large tenants, so the absolute accessed order correctly saw the used-once layers of a large tenant here:

2024-01-22T09:38:59.102599Z  INFO disk_usage_eviction_task:iteration{iteration_no=4646}: absolute accessed selection summary: selected 384 layers of 48.1GiB up to (Some(SystemTime { tv_sec: 1705914774, tv_nsec: 815487540 }), Some(0.16)):
- 384 layers: <bench1> (48.1GiB)
2024-01-22T09:38:59.104929Z  INFO disk_usage_eviction_task:iteration{iteration_no=4646}: relative accessed selection summary: selected 532 layers of 48.2GiB up to (Some(SystemTime { tv_sec: 1705915051, tv_nsec: 836883713 }), Some(0.08)):
- 192 layers: <bench1> (24.0GiB)
- 165 layers: <bench2> (20.7GiB)
- 8 layers: <unrelated1> (56.1MiB)
- 7 layers: <unrelated2> (484.9MiB), <unrelated3> (258.1MiB)
- 6 layers: <unrelated4> (1.4MiB)
- 5 layers: 5 tenants 429.2MiB in total 25 layers
- 4 layers: <unrelated5> (114.7MiB), <unrelated6> (1.3MiB)
- 3 layers: 7 tenants 579.7MiB in total 21 layers
- 2 layers: 22 tenants 1.3GiB in total 44 layers
- 1 layers: 49 tenants 330.7MiB in total 49 layers

Full logs can be found via this search.

koivunej added a commit that referenced this issue Jan 25, 2024
Refactor out test_disk_usage_eviction tenant creation and add a custom
case with 4 tenants, 3 made with pgbench scale=1 and 1 made with pgbench
scale=4.

Because the tenants are created in order of scales [1, 1, 1, 4], this is
simple enough to demonstrate the problem with using absolute access
times: on a disk usage based eviction run we will disproportionately
target the *first* scale=1 tenant(s), while the later larger tenant does
not lose anything.

This test is not enough to show the difference between `relative_equal`
and `relative_spare` (the fudge factor); a much larger scale would be
needed for "the large tenant", but that would make debug-mode tests
slower.

Cc: #5304

koivunej commented Feb 5, 2024

The biggest problem was the 10-minute "layer collection", which happened together with Layer deletions hanging. So far there is only #6634 as a partial solution. I'm using quotes around "layer collection" because the time may have gone to both the layer collection and the absolute-order reporting; in this case, though, the reporting was easy since there was just one tenant (the log message above).

Layer deletions hanging can be explained by the spawn_blocking pool having a long queue, but 10 minutes seems very long even for that. The whole node was CPU-exhausted, hovering at around 90% CPU utilization, with user+system totaling 50% and the rest irq and iowait. I think that points to "200 downloads are too much".

koivunej commented:

Consideration complete, didn't find anything else needing to be updated here.
