pageserver: optimize disk usage eviction for large total number of layers #6224

Open
jcsp opened this issue Dec 21, 2023 · 2 comments
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver

Comments

@jcsp
Collaborator

jcsp commented Dec 21, 2023

The first phase of disk usage eviction is to enumerate all layers across all tenants, so that the layers can then be globally ordered by LRU. This generates an O(n_layers) data structure, which will hold millions of layers. This has a memory cost, and also generates a lot of spurious atomic operations from cloning Arcs into EvictionCandidate, etc.

Basic approach: avoid using O(N_layers) memory

We can make this much more scalable by accepting an inexact ordering:

  1. First calculate how much space we want to free
  2. Then consider the first 10% of tenants: order their layers, and then delete layers until we have reclaimed 10% of the target
  3. ...and so on for the next 10% of tenants, etc.

That way we avoid holding the entire list of layers in memory at once.
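
The batched steps above can be sketched roughly as follows. This is a minimal illustration and not the pageserver's code: `LayerInfo`, `Tenant`, and `evict_in_batches` are hypothetical names, "eviction" is modelled as dropping an entry from a `Vec`, and the cumulative per-batch goal (so that a shortfall in one batch carries over to the next) is an assumption added on top of the description.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a resident layer (not the pageserver's real type).
struct LayerInfo {
    size_bytes: u64,
    last_access_secs: u64, // smaller means less recently used
}

/// Hypothetical stand-in for a tenant and its resident layers.
struct Tenant {
    layers: Vec<LayerInfo>,
}

/// Evicts least-recently-used layers batch by batch and returns the bytes freed.
fn evict_in_batches(tenants: &mut [Tenant], target_bytes: u64, n_batches: usize) -> u64 {
    if tenants.is_empty() || n_batches == 0 {
        return 0;
    }
    let batch_len = (tenants.len() + n_batches - 1) / n_batches;
    let n_chunks = (tenants.len() + batch_len - 1) / batch_len;
    let mut freed = 0u64;

    for (i, batch) in tenants.chunks_mut(batch_len).enumerate() {
        // Cumulative goal after this batch, so a shortfall in one batch
        // carries over to the next instead of being lost.
        let goal = target_bytes * (i as u64 + 1) / n_chunks as u64;

        // Only this batch's layers are ever held in the candidate list.
        let mut candidates: Vec<(u64, usize, usize)> = Vec::new();
        for (t, tenant) in batch.iter().enumerate() {
            for (l, layer) in tenant.layers.iter().enumerate() {
                candidates.push((layer.last_access_secs, t, l));
            }
        }
        candidates.sort_unstable(); // oldest access first

        let mut victims: HashSet<(usize, usize)> = HashSet::new();
        for (_, t, l) in candidates {
            if freed >= goal {
                break;
            }
            freed += batch[t].layers[l].size_bytes;
            victims.insert((t, l));
        }

        // "Evict" by dropping the chosen layers from each tenant's list.
        for (t, tenant) in batch.iter_mut().enumerate() {
            let mut l = 0;
            tenant.layers.retain(|_| {
                let keep = !victims.contains(&(t, l));
                l += 1;
                keep
            });
        }
    }
    freed
}
```

Peak memory is bounded by the largest batch rather than by the total layer count. The price is the inexact ordering: a layer in one batch can be evicted even though an older layer survives in another batch.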

Sampling approach: operate in constant size memory

A more sophisticated approach would be to use statistical sampling of the layer age distribution:

  1. Make a histogram of layer ages
  2. Populate it by sampling a modest number of layers from a modest number of tenants, e.g. 100 layers from each of 100 tenants.
  3. To free 10% of the used space, take the age that separates the oldest 10% of the sampled layers from the rest: that is our age threshold for deleting layers
  4. Iterate through tenants & layers, evicting anything older than the age threshold.
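
A rough sketch of these steps, under several assumptions: `Layer`, `sample_ages`, `age_threshold`, and `evict_older_than` are hypothetical names, the bounded sample is sorted instead of being bucketed into a histogram (simpler for a small sample, same idea), and evenly spaced picks stand in for random sampling because the standard library has no RNG. It also treats layer count as a proxy for space, which only holds if layers are of similar size.

```rust
/// Hypothetical stand-in for a resident layer (not the pageserver's real type).
struct Layer {
    size_bytes: u64,
    age_secs: u64, // time since last access
}

/// Step that keeps at most `max_items` evenly spaced items out of `len`.
fn stride(len: usize, max_items: usize) -> usize {
    let max_items = max_items.max(1);
    ((len + max_items - 1) / max_items).max(1)
}

/// Collects a bounded sample of layer ages: at most `max_layers` layers
/// from each of at most `max_tenants` tenants.
fn sample_ages(tenants: &[Vec<Layer>], max_tenants: usize, max_layers: usize) -> Vec<u64> {
    let mut ages = Vec::new();
    let tenant_step = stride(tenants.len(), max_tenants);
    for layers in tenants.iter().step_by(tenant_step).take(max_tenants) {
        let layer_step = stride(layers.len(), max_layers);
        ages.extend(layers.iter().step_by(layer_step).take(max_layers).map(|l| l.age_secs));
    }
    ages
}

/// Age of the k-th oldest sample, where k covers `percent` of the samples.
fn age_threshold(sampled_ages: &[u64], percent: usize) -> Option<u64> {
    if sampled_ages.is_empty() || percent == 0 {
        return None;
    }
    let mut ages = sampled_ages.to_vec();
    ages.sort_unstable_by(|a, b| b.cmp(a)); // oldest first
    let k = (ages.len() * percent.min(100) + 99) / 100;
    Some(ages[k - 1])
}

/// Visits one tenant at a time, evicting layers at least as old as the threshold.
fn evict_older_than(tenants: &mut [Vec<Layer>], threshold_secs: u64) -> u64 {
    let mut freed = 0;
    for layers in tenants.iter_mut() {
        layers.retain(|l| {
            let evict = l.age_secs >= threshold_secs;
            if evict {
                freed += l.size_bytes;
            }
            !evict
        });
    }
    freed
}
```

Memory use depends only on the sample size (`max_tenants * max_layers`), not on the total number of layers. The amount actually freed is only approximately the target, since the threshold comes from a sample.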

Unfair but fast approach

Avoid iterating through all tenants at all, by accepting that some tenants will "take one for the team" so that we don't have to touch all of them.

For example, to touch only half the tenants:

  1. Start with the sampling approach
  2. If our target is 10%, then adjust it up to 20%
  3. Pick a random 50% subset of the tenants, and apply eviction with the 20% threshold

This will work fine if eviction is somewhat common, since each iteration will pick a different random subset of tenants.
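
The arithmetic behind the adjusted target, plus one way to pick the subset, could look like the sketch below. Both function names are hypothetical. The scaling assumes that space and layer ages are distributed similarly across tenants, and the hash-based selection (a freshly seeded `RandomState` per eviction iteration) is just one std-only way to get a different subset each time; it selects roughly, not exactly, the requested share.

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

/// Percentage to evict within the chosen tenants so that the overall target
/// is still met. Returns None if the subset is too small to carry the target.
fn adjusted_percent(target_percent: u32, tenant_percent: u32) -> Option<u32> {
    if tenant_percent == 0 || tenant_percent > 100 {
        return None;
    }
    // e.g. a 10% target spread over 50% of tenants becomes 20%.
    let adjusted = (target_percent * 100 + tenant_percent - 1) / tenant_percent;
    if adjusted <= 100 {
        Some(adjusted)
    } else {
        None
    }
}

/// Picks roughly `tenant_percent` of the tenants. A fresh `RandomState` per
/// eviction iteration yields a different subset each time.
fn pick_tenants<'a>(
    ids: &'a [String],
    tenant_percent: u32,
    state: &RandomState,
) -> Vec<&'a String> {
    ids.iter()
        .filter(|id| state.hash_one(id.as_str()) % 100 < u64::from(tenant_percent))
        .collect()
}
```

The smaller the subset, the higher the adjusted threshold: touching a quarter of the tenants for a 10% target would mean evicting 40% of the layers of the tenants that are chosen.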

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Dec 21, 2023
@koivunej
Member

Only just found this issue. The options presented would work reasonably well with #5304, which we are now in the process of rolling out.

@koivunej
Member

Re: a TODO I found in the code:

// TODO: avoid listing every layer in every tenant: this loop can block the executor,
// and the resulting data structure can be huge.
// (https://github.com/neondatabase/neon/issues/6224)

For each attached timeline we take the LayerMap rwlock, which is a tokio lock. Acquiring it goes through tokio's coop facilities, so the loop does yield every now and then. This is not true for secondaries. I'll add a yield per secondary tenant.
