Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: limit total ephemeral layer bytes #7218

Merged
merged 8 commits into from Mar 26, 2024
Merged

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Mar 23, 2024

Problem

Follows: #7182

  • Sufficient concurrent writes could OOM a pageserver from the size of indices on all the InMemoryLayer instances.
  • Enforcement of checkpoint_period only happened if there were some writes.

Closes: #6916

Summary of changes

  • Add ephemeral_bytes_per_memory_kb config property. This controls the ratio of ephemeral layer capacity to memory capacity. The weird unit is to enable making the ratio less than 1:1 (set this property to 1024 to use 1MB of ephemeral layers for every 1MB of RAM, set it smaller to get a fraction).
  • Implement background layer rolling checks in Timeline::compaction_iteration -- this ensures we apply layer rolling policy in the absence of writes.
  • During background checks, if the total ephemeral layer size has exceeded the limit, then roll layers whose size is greater than the mean size of all ephemeral layers.
  • Remove the tick() path from walreceiver: it isn't needed any more now that we do equivalent checks from compaction_iteration.
  • Add tests for the above.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Mar 23, 2024
@jcsp jcsp changed the title Jcsp/issue 6916 pt2 pageserver: limit total ephemeral layer bytes Mar 23, 2024
Copy link

github-actions bot commented Mar 23, 2024

2730 tests run: 2590 passed, 0 failed, 140 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_deletion_queue_recovery[no-validate-lose]: debug

Code coverage* (full report)

  • functions: 28.2% (6299 of 22348 functions)
  • lines: 47.0% (44254 of 94189 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
23ca0b3 at 2024-03-26T15:56:44.437Z :recycle:

@jcsp jcsp marked this pull request as ready for review March 26, 2024 09:27
@jcsp jcsp requested a review from a team as a code owner March 26, 2024 09:27
@jcsp jcsp requested review from arpad-m and VladLazar March 26, 2024 09:27
pageserver/src/config.rs Outdated Show resolved Hide resolved
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
@jcsp jcsp merged commit 47d2b3a into main Mar 26, 2024
46 of 50 checks passed
@jcsp jcsp deleted the jcsp/issue-6916-pt2 branch March 26, 2024 15:45
jcsp added a commit that referenced this pull request May 6, 2024
…7594)

## Problem

In testing of the earlier fix for OOMs under heavy write load
(#7218), we saw that the limit
on ephemeral layer size wasn't being reliably enforced. That was
diagnosed as being due to overwhelmed compaction loops: most tenants
were waiting on the semaphore for background tasks, and thereby not
running the function that proactively rolls layers frequently enough.

Related: #6939 

## Summary of changes

- Create a new per-tenant background loop for "ingest housekeeping",
which invokes maybe_freeze_ephemeral_layer() without taking the
background task semaphore.
- Downgrade to DEBUG a log line in maybe_freeze_ephemeral_layer that had
been INFO, but turns out to be pretty common in the field.

There's some discussion on the issue
(#6939 (comment))
about alternatives for calling this maybe_freeze_epemeral_layer
periodically without it getting stuck behind compaction. A whole task
just for this feels like kind of a big hammer, but we may in future find
that there are other pieces of lightweight housekeeping that we want to
do here too.

Why is it okay to call maybe_freeze_ephemeral_layer outside of the
background tasks semaphore?
- this is the same work we would do anyway if we receive writes from the
safekeeper, just done a bit sooner.
- The period of the new task is generously jittered (+/- 5%), so when
the ephemeral layer size tips over the threshold, we shouldn't see an
excessively aggressive thundering herd of layer freezes (and only layers
larger than the mean layer size will be frozen)
- All that said, this is an imperfect approach that relies on having a
generous amount of RAM to dip into when we need to freeze somewhat
urgently. It would be nice in future to also block compaction/GC when we
recognize resource stress and need to do other work (like layer
freezing) to reduce memory footprint.
conradludgate pushed a commit that referenced this pull request May 8, 2024
…7594)

## Problem

In testing of the earlier fix for OOMs under heavy write load
(#7218), we saw that the limit
on ephemeral layer size wasn't being reliably enforced. That was
diagnosed as being due to overwhelmed compaction loops: most tenants
were waiting on the semaphore for background tasks, and thereby not
running the function that proactively rolls layers frequently enough.

Related: #6939 

## Summary of changes

- Create a new per-tenant background loop for "ingest housekeeping",
which invokes maybe_freeze_ephemeral_layer() without taking the
background task semaphore.
- Downgrade to DEBUG a log line in maybe_freeze_ephemeral_layer that had
been INFO, but turns out to be pretty common in the field.

There's some discussion on the issue
(#6939 (comment))
about alternatives for calling this maybe_freeze_epemeral_layer
periodically without it getting stuck behind compaction. A whole task
just for this feels like kind of a big hammer, but we may in future find
that there are other pieces of lightweight housekeeping that we want to
do here too.

Why is it okay to call maybe_freeze_ephemeral_layer outside of the
background tasks semaphore?
- this is the same work we would do anyway if we receive writes from the
safekeeper, just done a bit sooner.
- The period of the new task is generously jittered (+/- 5%), so when
the ephemeral layer size tips over the threshold, we shouldn't see an
excessively aggressive thundering herd of layer freezes (and only layers
larger than the mean layer size will be frozen)
- All that said, this is an imperfect approach that relies on having a
generous amount of RAM to dip into when we need to freeze somewhat
urgently. It would be nice in future to also block compaction/GC when we
recognize resource stress and need to do other work (like layer
freezing) to reduce memory footprint.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pageserver: ephemeral layers' indices can use unbounded memory
3 participants