pageserver: optimize disk usage eviction for large total number of layers #6224

jcsp · 2023-12-21T16:33:20Z

The first phase of disk usage eviction is to enumerate all layers across all tenants, so that the layers can then be globally ordered by LRU. This generates an O(n_layers) data structure, which can contain millions of entries. This has a memory cost, and it also generates a lot of spurious atomic operations from cloning Arcs into EvictionCandidate, etc.

Basic approach: avoid using O(N_layers) memory

We can make this much more scalable by accepting an inexact ordering:

First calculate how much space we want to free
Then consider the first 10% of tenants: order their layers, and then delete layers until we have reclaimed 10% of the target
...and so on for the next 10% of tenants, etc.

That way we avoid holding the entire list of layers in memory at once.
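A minimal sketch of the batched idea, to make the trade-off concrete. `Layer`, `Tenant` and `evict_in_batches` are hypothetical stand-ins, not the pageserver's real types, and "evicting" here just removes the entry from a `Vec`. Any shortfall in one batch is carried over to the remaining batches, which is an assumption on top of the description above.

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for a resident layer (not the real pageserver type).
#[derive(Debug, Clone)]
pub struct Layer {
    pub last_access: SystemTime,
    pub size_bytes: u64,
}

/// Hypothetical stand-in for a tenant and its resident layers.
#[derive(Debug, Clone, Default)]
pub struct Tenant {
    pub layers: Vec<Layer>,
}

/// Walks tenants in roughly `n_batches` groups and evicts the least recently
/// used layers of each group until that group's share of `target_bytes` is
/// reclaimed. Returns the total number of bytes evicted.
pub fn evict_in_batches(tenants: &mut [Tenant], target_bytes: u64, n_batches: usize) -> u64 {
    if tenants.is_empty() || n_batches == 0 {
        return 0;
    }
    let batch_len = (tenants.len() + n_batches - 1) / n_batches;
    let mut batches_left = ((tenants.len() + batch_len - 1) / batch_len) as u64;
    let mut remaining = target_bytes;
    let mut freed_total = 0u64;

    for batch in tenants.chunks_mut(batch_len) {
        // Split what is still owed evenly across the batches not yet visited.
        let goal = (remaining + batches_left - 1) / batches_left;
        batches_left -= 1;

        // Only this batch's candidates are held in memory.
        let mut candidates: Vec<(SystemTime, usize, usize, u64)> = Vec::new();
        for (t, tenant) in batch.iter().enumerate() {
            for (l, layer) in tenant.layers.iter().enumerate() {
                candidates.push((layer.last_access, t, l, layer.size_bytes));
            }
        }
        candidates.sort_by_key(|c| c.0); // oldest access first

        let mut freed = 0u64;
        let mut evict: Vec<(usize, usize)> = Vec::new();
        for (_, t, l, size) in candidates {
            if freed >= goal {
                break;
            }
            freed += size;
            evict.push((t, l));
        }

        // Remove in descending index order so earlier removals do not shift
        // the indices of later ones within the same tenant.
        evict.sort_unstable_by(|a, b| b.cmp(a));
        for (t, l) in evict {
            batch[t].layers.remove(l);
        }

        remaining = remaining.saturating_sub(freed);
        freed_total += freed;
    }
    freed_total
}
```

Peak memory is bounded by the largest batch rather than by the total layer count. The cost is that a batch may evict layers that are newer than layers surviving in another batch.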

Sampling approach: operate in constant size memory

A more sophisticated approach would be to use statistical sampling of the layer age distribution:

Make a histogram of layer ages
Sample a modest number of layers from a modest number of tenants, e.g. 100 layers from 100 tenants each.
To free 10% of the used space, take a 10th percentile sample from the histogram: that is our age threshold for deleting layers
Iterate through tenants & layers, evicting anything older than the age threshold.
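A rough sketch of the sampling steps above, under several assumptions: `LayerInfo` and all function names are hypothetical, evenly spaced strides stand in for random sampling, and a sorted sample weighted by layer size stands in for a bucketed histogram.

```rust
use std::time::Duration;

/// Hypothetical layer summary: time since last access and on-disk size.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LayerInfo {
    pub age: Duration,
    pub size_bytes: u64,
}

/// Takes up to `per_tenant` layers from each of up to `max_tenants` tenants.
/// Evenly spaced strides are a deterministic stand-in for random sampling.
pub fn sample_layers(
    tenants: &[Vec<LayerInfo>],
    max_tenants: usize,
    per_tenant: usize,
) -> Vec<LayerInfo> {
    let mut sample = Vec::new();
    if max_tenants == 0 || per_tenant == 0 {
        return sample;
    }
    let tenant_stride = ((tenants.len() + max_tenants - 1) / max_tenants).max(1);
    for layers in tenants.iter().step_by(tenant_stride) {
        let layer_stride = ((layers.len() + per_tenant - 1) / per_tenant).max(1);
        sample.extend(layers.iter().step_by(layer_stride).copied());
    }
    sample
}

/// Returns an age such that sampled layers at least that old make up roughly
/// `fraction` of the sampled bytes, or `None` if no threshold can be derived.
pub fn age_threshold(sample: &[LayerInfo], fraction: f64) -> Option<Duration> {
    let total: u64 = sample.iter().map(|l| l.size_bytes).sum();
    if total == 0 || fraction <= 0.0 {
        return None;
    }
    let goal = (total as f64 * fraction).ceil() as u64;
    let mut sorted = sample.to_vec();
    sorted.sort_by(|a, b| b.age.cmp(&a.age)); // oldest first
    let mut acc = 0u64;
    for layer in &sorted {
        acc += layer.size_bytes;
        if acc >= goal {
            return Some(layer.age);
        }
    }
    sorted.last().map(|l| l.age)
}

/// Drops every layer whose age is at or above `threshold`; returns bytes freed.
pub fn evict_older_than(tenants: &mut [Vec<LayerInfo>], threshold: Duration) -> u64 {
    let mut freed = 0;
    for layers in tenants.iter_mut() {
        layers.retain(|l| {
            let evict = l.age >= threshold;
            if evict {
                freed += l.size_bytes;
            }
            !evict
        });
    }
    freed
}
```

The final pass still visits every layer, but it needs no global ordering and no per-layer allocation. The only state held is the sample and a single threshold.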

Unfair but fast approach

Avoid iterating through all tenants at all, by accepting that some tenants will "take one for the team" so that we don't have to touch all of them.

For example, to touch only half the tenants:

Start with the sampling approach
If our target is 10%, then adjust it up to 20%
Pick a random 50% subset of the tenants, and apply eviction with the 20% threshold

This will work fine if eviction is somewhat common, since each iteration picks a different random subset of tenants.
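A small sketch of the subset selection and target adjustment. Everything here is hypothetical: `XorShift64` is a tiny stand-in for a real RNG crate, and `scaled_fraction` and `pick_subset` are illustrative names only.

```rust
/// Scales the eviction fraction so that touching only `subset_fraction` of
/// the tenants still frees roughly the original target overall
/// (e.g. 10% over half the tenants becomes 20%). Capped at 100%.
pub fn scaled_fraction(target_fraction: f64, subset_fraction: f64) -> f64 {
    (target_fraction / subset_fraction).min(1.0)
}

/// Minimal xorshift64 generator; a stand-in for a proper RNG, for
/// illustration only.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must be non-zero.
        Self(seed.max(1))
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Picks `ceil(n_tenants * subset_fraction)` distinct tenant indices using a
/// partial Fisher-Yates shuffle.
pub fn pick_subset(n_tenants: usize, subset_fraction: f64, rng: &mut XorShift64) -> Vec<usize> {
    let k = ((n_tenants as f64 * subset_fraction).ceil() as usize).min(n_tenants);
    let mut idx: Vec<usize> = (0..n_tenants).collect();
    for i in 0..k {
        // Swap a uniformly chosen not-yet-picked index into position i.
        let j = i + (rng.next_u64() % (n_tenants - i) as u64) as usize;
        idx.swap(i, j);
    }
    idx.truncate(k);
    idx
}
```

Seeding the generator differently on each eviction run is what spreads the cost across tenants over time. With a fixed seed, the same tenants would be penalized on every run.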

koivunej · 2024-01-23T10:46:27Z

Only just found this issue. The options presented should work reasonably well with #5304, which we are now in the process of rolling out.

koivunej · 2024-01-26T12:32:27Z

Re: the TODO I found in the code:

neon/pageserver/src/disk_usage_eviction_task.rs

Lines 758 to 760 in 12e9b2a

    // TODO: avoid listing every layer in every tenant: this loop can block the executor,
    // and the resulting data structure can be huge.
    // (https://github.com/neondatabase/neon/issues/6224)

We do take the LayerMap rwlock, which is a tokio lock, for each attached timeline. Acquiring it makes progress through tokio's coop facilities, so the loop yields every now and then. This is not true for secondaries. I'll add a yield per secondary tenant.

re: #6224 (comment)

jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Dec 21, 2023

koivunej added a commit that referenced this issue Jan 26, 2024

fix: yield per secondary tenant

3e025e9

re: #6224 (comment)

koivunej added a commit that referenced this issue Feb 2, 2024

fix: yield per secondary tenant

3430b3f

re: #6224 (comment)

koivunej mentioned this issue Feb 26, 2024

Epic: improved eviction #5331

Open
