simplify page-caching of EphemeralFile #4994

Merged
merged 17 commits into from
Aug 18, 2023

Conversation

problame
@problame problame commented Aug 15, 2023

(This PR is the successor of #4984 )

Summary

The current way in which EphemeralFile uses PageCache complicates the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache EphemeralFile contents, by exploiting the append-only nature of EphemeralFile.

The result is that PageCache only holds ImmutableFilePage and MaterializedPage.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from PageCache, also eliminating an entire failure mode.

Further, many great open-source libraries exist to solve the problem of a read-only cache,
much better than our page_cache.rs (e.g., better replacement policy, less global locking).
With this PR, we can now explore using them.

Problem & Analysis

Before this PR, PageCache had three types of pages:

  • ImmutableFilePage: caches Delta / Image layer file contents
  • MaterializedPage: caches results of Timeline::get (page materialization)
  • EphemeralPage: caches EphemeralFile contents

EphemeralPage is quite different from ImmutableFilePage and MaterializedPage:

  • Immutable and materialized pages are for the acceleration of (future) reads of the same data using PAGE_CACHE_SIZE * PAGE_SIZE bytes of DRAM.
  • Ephemeral pages are a write-back cache of EphemeralFile contents, i.e., if there is pressure in the page cache, we spill EphemeralFile contents to disk.

EphemeralFile is only used by InMemoryLayer, for the following purposes:

  • write: when filling up the InMemoryLayer, via impl BlobWriter for EphemeralFile
  • read: when doing page reconstruction for a page@lsn that isn't written to disk
  • read: when writing L0 layer files, we re-read the InMemoryLayer and put the contents into the L0 delta writer (create_delta_layer). This happens every 10min or when InMemoryLayer reaches 256MB in size.

The access patterns of the InMemoryLayer use case are as follows:

  • write: via BlobWriter, strictly append-only
  • read for page reconstruction: via BlobReader, random
  • read for create_delta_layer: via BlobReader, dependent on data, but generally random. Why?
    • in classical LSM terms, this function is what writes the memory-resident C0 tree into the disk-resident C1 tree
    • in our system, though, the values of InMemoryLayer are stored in an EphemeralFile, and hence they are not guaranteed to be memory-resident
    • the function reads Values in Key, LSN order, which is != insert order

What do these EphemeralFile-level access patterns mean for the page cache?

  • write:
    • the common case is that Value is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller than PAGE_SIZE
    • So, the EphemeralPage pages act as a buffer for these < PAGE_SIZE sized writes.
    • If there's no page cache eviction between subsequent InMemoryLayer::put_value calls, the EphemeralPage is still resident, so the page cache avoids doing a write system call.
      • In practice, a busy page server will have page cache evictions because we only configure 64MB of page cache size.
  • reads for page reconstruction: read acceleration, just as for the other page types.
  • reads for create_delta_layer:
    • The Value reads happen through a BlockCursor, which optimizes the case of repeated reads from the same page.
    • So, the best case is that subsequent values are located on the same page; hence the BlockCursor's buffer is maximally effective.
    • The worst case is that each Value is on a different page; hence the BlockCursor's 1-page-sized buffer is ineffective.
    • The best case translates into 256MB/PAGE_SIZE page cache accesses, one per page.
    • the worst case translates into #Values page cache accesses
    • again, the page cache accesses must be assumed to be random because the Values aren't accessed in insertion order but Key, LSN order.
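The best and worst cases above can be sketched with a small simulation. This is only an illustration, not the pageserver's BlockCursor: `page_cache_accesses` is a hypothetical helper, `PAGE_SIZE` is assumed to be 8192 bytes, and values are assumed not to span pages. Under that assumed page size, the 256MB best case would correspond to 32768 page cache accesses.

```rust
/// Assumed page size in bytes (an illustration value, not taken from the PR text).
const PAGE_SIZE: u64 = 8192;

/// Counts how often a reader with a one-page buffer has to go to the page
/// cache when it visits values at the given file offsets, in the given order.
/// Hypothetical helper; it ignores values that span a page boundary.
fn page_cache_accesses(offsets_in_read_order: &[u64]) -> usize {
    let mut buffered_page: Option<u64> = None;
    let mut accesses = 0;
    for &off in offsets_in_read_order {
        let page = off / PAGE_SIZE;
        if buffered_page != Some(page) {
            // Buffer miss: the page has to be fetched from the page cache.
            accesses += 1;
            buffered_page = Some(page);
        }
    }
    accesses
}

fn main() {
    // 4 pages with 4 values each, every value 2048 bytes (made-up numbers).
    let value_size = 2048u64;
    let insertion_order: Vec<u64> = (0..16).map(|i| i * value_size).collect();

    // Best case: read order equals insertion order, so one access per page.
    assert_eq!(page_cache_accesses(&insertion_order), 4);

    // Worst case: consecutive reads always land on a different page.
    // Visit value 0 of every page, then value 1 of every page, and so on.
    let mut strided = Vec::new();
    for slot in 0..4u64 {
        for page in 0..4u64 {
            strided.push(page * PAGE_SIZE + slot * value_size);
        }
    }
    assert_eq!(page_cache_accesses(&strided), 16);
}
```

The strided order stands in for a Key, LSN order that happens to interleave pages; real access orders would fall somewhere between the two extremes.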

Summary of changes

Preliminaries for this PR were:

Based on the observations outlined above, this PR makes the following changes:

  • Rip out EphemeralPage from page_cache.rs
  • Move the block_io::FileId to page_cache::FileId
  • Add a PAGE_SIZEd buffer to the EphemeralFile struct.
    It's called mutable_tail.
  • Change write_blob to use mutable_tail for the write buffering instead of a page cache page.
    • if mutable_tail is full, it writes it out to disk, zeroes it out, and re-uses it.
      • There is explicitly no double-buffering, so that memory allocation per EphemeralFile instance is fixed.
  • Change read_blob to return different BlockLease variants depending on blknum
    • for the blknum that corresponds to the mutable_tail, return a ref to it
      • Rust borrowing rules prevent write_blob calls while refs are outstanding.
    • for all non-tail blocks, return a page-cached ImmutableFilePage
      • It is safe to page-cache these as ImmutableFilePage because EphemeralFile is append-only.
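A minimal sketch of the mutable_tail idea described in the list above, under stated assumptions: `EphemeralFileSketch`, its `write` and `read_blk` methods, and the two `BlockLease` variant names are hypothetical, a `Vec<u8>` stands in for the on-disk file, a `HashMap` stands in for the read-only page cache, and `PAGE_SIZE` is assumed to be 8192. It is not the code from this PR.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::ops::Deref;
use std::rc::Rc;

const PAGE_SIZE: usize = 8192; // assumed value, for illustration only

/// What a block read hands out: a borrow of the tail, or a cached immutable page.
enum BlockLease<'a> {
    MutableTail(&'a [u8; PAGE_SIZE]),
    Cached(Rc<[u8; PAGE_SIZE]>),
}

impl<'a> Deref for BlockLease<'a> {
    type Target = [u8; PAGE_SIZE];
    fn deref(&self) -> &Self::Target {
        match self {
            BlockLease::MutableTail(b) => *b,
            BlockLease::Cached(rc) => &**rc,
        }
    }
}

struct EphemeralFileSketch {
    disk: Vec<u8>,                                       // stand-in for the file
    page_cache: RefCell<HashMap<u32, Rc<[u8; PAGE_SIZE]>>>, // stand-in for PageCache
    mutable_tail: [u8; PAGE_SIZE],                       // the only write buffer
    len: u64,
}

impl EphemeralFileSketch {
    fn new() -> Self {
        Self {
            disk: Vec::new(),
            page_cache: RefCell::new(HashMap::new()),
            mutable_tail: [0u8; PAGE_SIZE],
            len: 0,
        }
    }

    /// Appends `src` and returns the offset at which it starts.
    fn write(&mut self, mut src: &[u8]) -> u64 {
        let start = self.len;
        while !src.is_empty() {
            let off = (self.len % PAGE_SIZE as u64) as usize;
            let n = src.len().min(PAGE_SIZE - off);
            self.mutable_tail[off..off + n].copy_from_slice(&src[..n]);
            self.len += n as u64;
            src = &src[n..];
            if off + n == PAGE_SIZE {
                // Tail is full: write it out, pre-warm the cache, zero and re-use it.
                let blknum = (self.len / PAGE_SIZE as u64 - 1) as u32;
                self.disk.extend_from_slice(&self.mutable_tail);
                self.page_cache
                    .borrow_mut()
                    .insert(blknum, Rc::new(self.mutable_tail));
                self.mutable_tail.fill(0);
            }
        }
        start
    }

    /// Returns the tail by reference, or a cached copy for already-flushed blocks.
    fn read_blk(&self, blknum: u32) -> Option<BlockLease<'_>> {
        let tail_blknum = (self.len / PAGE_SIZE as u64) as u32;
        if blknum == tail_blknum {
            return Some(BlockLease::MutableTail(&self.mutable_tail));
        }
        if blknum > tail_blknum {
            return None;
        }
        let mut cache = self.page_cache.borrow_mut();
        let page = cache.entry(blknum).or_insert_with(|| {
            // Cache miss: re-read the immutable block from the "file".
            let start = blknum as usize * PAGE_SIZE;
            let mut buf = [0u8; PAGE_SIZE];
            buf.copy_from_slice(&self.disk[start..start + PAGE_SIZE]);
            Rc::new(buf)
        });
        Some(BlockLease::Cached(Rc::clone(page)))
    }
}

fn main() {
    let mut f = EphemeralFileSketch::new();
    f.write(&[1u8; 100]);
    f.write(&[2u8; PAGE_SIZE]); // crosses into block 1, which flushes block 0
    let tail = f.read_blk(1).unwrap();
    assert!(matches!(tail, BlockLease::MutableTail(_)));
    assert_eq!(tail[99], 2);
    // Calling `f.write(..)` while `tail` is still in use would be rejected by the
    // borrow checker, because `tail` holds a shared borrow of `f`.
}
```

Because the sketch keeps exactly one page-sized buffer per instance and re-uses it after each flush, memory use per file stays fixed, which mirrors the "no double-buffering" point above.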

Performance

How do the changes above affect performance?
My claim is: not significantly.

  • write path:
    • before this PR, the EphemeralFile::write_blob didn't issue its own write system calls.
      • If there were enough free pages, it didn't issue any write system calls.
      • If it had to evict other EphemeralPages to get a page for its writes (get_buf_for_write), the page cache code would implicitly issue the writeback of victim pages as needed.
    • With this PR, EphemeralFile::write_blob always issues all of its own write system calls.
    • The perf impact of always doing the writes is the CPU overhead and syscall latency.
      • Before this PR, we might have never issued them if there were enough free pages.
      • We don't issue fsync and can expect the writes to only hit the kernel page cache.
      • There is also an advantage in issuing the writes directly: the perf impact is paid by the tenant that caused the writes, instead of whatever tenant evicts the EphemeralPage.
  • reads for page reconstruction: no impact.
    • The write_blob function pre-warms the page cache when it writes the mutable_tail to disk.
    • So, the behavior is the same as with the EphemeralPages before this PR.
  • reads for create_delta_layer: no impact.
    • Same argument as for page reconstruction.
    • Note for the future:
      • going through the page cache likely causes read amplification here. Why?
        • Due to the Key,Lsn-ordered access pattern, we don't read all the values in the page before moving to the next page. In the worst case, we might read the same page multiple times to read different Values from it.
      • So, it might be better to bypass the page cache here.
      • Idea drafts:
        • bypass PS page cache + prefetch pipeline + iovec-based IO
        • bypass PS page cache + use copy_file_range to copy from ephemeral file into the L0 delta file, without going through user space

@problame problame force-pushed the problame/simplify-page-cache-ephemeral-files branch from fdfd4d9 to 2a5cdee Compare August 15, 2023 12:42
@problame problame changed the title only page-cache the immutable part of EphemeralFile simplify page caching of EphemeralFile Aug 15, 2023
@github-actions
github-actions bot commented Aug 15, 2023

1608 tests run: 1532 passed, 0 failed, 76 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_get_tenant_size_with_multiple_branches: debug
The comment gets automatically updated with the latest test results
8e891e3 at 2023-08-18T15:52:55.194Z :recycle:

@problame problame marked this pull request as ready for review August 15, 2023 13:50
@problame problame requested review from a team as code owners August 15, 2023 13:50
@problame problame requested review from knizhnik and removed request for a team and knizhnik August 15, 2023 13:50

@koivunej koivunej left a comment

Few comments, didn't fully understand.

It was necessary before I rebased this patch on top of

commit baf3959
Author: Arpad Müller <arpad-m@users.noreply.github.com>
Date:   Mon Aug 14 18:48:09 2023 +0200

    Turn BlockLease associated type into an enum (#4982)

@hlinnaka hlinnaka left a comment

The comment at top of page_cache.rs needs updating, now that you cannot (or shouldn't) modify a page

@problame

Thanks for the reviews, will push more fixes shortly.


In the meantime, I did a small extraction:

@problame

I'm working on more extractions.

problame added a commit that referenced this pull request Aug 16, 2023
Before this patch, we had the `off` and `blknum` as function-wide
mutable state. Now it's contained in the `Writer` struct.

The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.

This is prep for #4994
problame added a commit that referenced this pull request Aug 16, 2023
problame added a commit that referenced this pull request Aug 16, 2023
This makes it more explicit that these are different u64-sized namespaces.
Re-using one in place of the other would be catastrophic.

Prep for #4994
which will eliminate the ephemeral_file::FileId and move the
block_io::FileId into page_cache.
It makes sense to have this preliminary commit though,
to minimize the number of new concepts in #4994 and other
preliminaries that depend on that work.
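One way to keep two u64-sized namespaces from being mixed up is a newtype per namespace. The sketch below is an assumption about the general technique only: `EphemeralFileId`, `BlockIoFileId` and `cache_key` are hypothetical names, not the definitions used in the pageserver.

```rust
/// Hypothetical newtype for one u64 namespace (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct EphemeralFileId(u64);

/// Hypothetical newtype for a second, unrelated u64 namespace.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct BlockIoFileId(u64);

/// Builds a cache key; only accepts ids from the `BlockIoFileId` namespace.
fn cache_key(file: BlockIoFileId, blknum: u32) -> (u64, u32) {
    (file.0, blknum)
}

fn main() {
    let eph = EphemeralFileId(7);
    let blk = BlockIoFileId(7);
    assert_eq!(cache_key(blk, 3), (7, 3));
    // cache_key(eph, 3); // would not compile: expected `BlockIoFileId`
    let _ = eph;
}
```

Both ids wrap the same number here, yet the compiler refuses to accept one in place of the other, which is the property the commit message is after.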

@arpad-m arpad-m left a comment

See my comments, otherwise okay.


@koivunej koivunej left a comment

With the 'static lifetime instead of 'a this should be good.


@arpad-m arpad-m left a comment

Final approval

@problame problame changed the title simplify page caching of EphemeralFile simplify page-caching of EphemeralFile Aug 18, 2023
@problame problame enabled auto-merge (squash) August 18, 2023 15:34
@problame problame merged commit 7a63685 into main Aug 18, 2023
28 checks passed
@problame problame deleted the problame/simplify-page-cache-ephemeral-files branch August 18, 2023 17:31
arpad-m pushed a commit that referenced this pull request Aug 24, 2023
## Problem
close #5034

## Summary of changes
Based on the
[comment](#4994 (comment)).
Just rename the `EphemeralFile::size` to `EphemeralFile::len`.
problame added a commit that referenced this pull request Aug 28, 2023
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` would have failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0, whereas `drop_buffers_for_immutable` can kick the page out right away.

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
problame added a commit that referenced this pull request Aug 28, 2023
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` would have failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
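The trade-off described above can be illustrated with a toy clock replacement over read-only slots. This is a sketch under assumptions, not the `page_cache.rs` implementation: `ClockCache`, `Slot`, `touch`, `contains` and the usage-count cap of 5 are all hypothetical, and only the name `find_victim` is borrowed from the commit message.

```rust
/// One cache slot; `key` is None while the slot is empty.
struct Slot {
    key: Option<u64>,
    usage_count: u8,
}

/// Toy read-only cache with clock replacement (hypothetical type).
struct ClockCache {
    slots: Vec<Slot>,
    hand: usize,
}

impl ClockCache {
    fn new(n: usize) -> Self {
        assert!(n > 0, "need at least one slot");
        let slots = (0..n).map(|_| Slot { key: None, usage_count: 0 }).collect();
        Self { slots, hand: 0 }
    }

    /// Sweeps the hand, decrementing usage counts, until a slot reaches 0.
    /// Pages are read-only, so the victim is dropped without write-back.
    fn find_victim(&mut self) -> usize {
        loop {
            let idx = self.hand;
            self.hand = (self.hand + 1) % self.slots.len();
            let slot = &mut self.slots[idx];
            if slot.usage_count == 0 {
                slot.key = None;
                return idx;
            }
            slot.usage_count -= 1;
        }
    }

    /// Caches `key` in a victim slot and returns the slot index.
    fn insert(&mut self, key: u64) -> usize {
        let idx = self.find_victim();
        self.slots[idx] = Slot { key: Some(key), usage_count: 1 };
        idx
    }

    /// Records an access; returns false if the key is not cached.
    fn touch(&mut self, key: u64) -> bool {
        for slot in &mut self.slots {
            if slot.key == Some(key) {
                slot.usage_count = (slot.usage_count + 1).min(5); // arbitrary cap
                return true;
            }
        }
        false
    }

    fn contains(&self, key: u64) -> bool {
        self.slots.iter().any(|s| s.key == Some(key))
    }
}

fn main() {
    let mut cache = ClockCache::new(2);
    cache.insert(10);
    cache.insert(20); // imagine key 20 belongs to a file that was just dropped
    cache.touch(10);
    // Nobody touches key 20 again, so the sweep reaches it first.
    assert_eq!(cache.insert(30), 1);
    assert!(cache.contains(10) && !cache.contains(20));
}
```

A stale page costs a few extra hand iterations before it is reclaimed, whereas proactive eviction would require scanning every slot on each file drop.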
problame pushed a commit that referenced this pull request Aug 30, 2023
## Problem

`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.

## Summary of changes

Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:

We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This makes all the places that acquire a lock require
async though, which we then also do. This includes also asyncification
of the `read_blk` function.

Builds upon #4994, #5015, #5056, and #5129.

Part of #4743.
arpad-m pushed a commit that referenced this pull request Sep 2, 2023
(part of #4743)
(preliminary to #5180)
 
This PR adds a special-purpose API to `VirtualFile` for write-once
files.
It adopts it for `save_metadata` and `persist_tenant_conf`.

This is helpful for the asyncification efforts (#4743) and specifically
asyncification of `VirtualFile` because above two functions were the
only ones that needed the VirtualFile to be an `std::io::Write`.
(There was also `manifest.rs` that needed the `std::io::Write`, but, it
isn't used right now, and likely won't be used because we're taking a
different route for crash consistency, see #5172. So, let's remove it.
It'll be in Git history if we need to re-introduce it when picking up
the compaction work again; that's why it was introduced in the first
place).

We can't remove the `impl std::io::Write for VirtualFile` just yet
because of the `BufWriter` in

```rust
struct DeltaLayerWriterInner {
...
    blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
}
```

But, @arpad-m and I have a plan to get rid of that by extracting the
append-only-ness-on-top-of-VirtualFile that #4994 added to
`EphemeralFile` into an abstraction that can be re-used in the
`DeltaLayerWriterInner` and `ImageLayerWriterInner`.
That'll be another PR.


### Performance Impact

This PR adds more fsyncs compared to before because we fsync the parent
directory every time.

1. For `save_metadata`, the additional fsyncs are unnecessary because we
know that `metadata` fits into a kernel page, and hence the write won't
be torn on the way into the kernel. However, the `metadata` file in
general is going to lose significance very soon (=> see #5172), and the
NVMes in prod can definitely handle the additional fsync. So, let's not
worry about it.
2. For `persist_tenant_conf`, which we don't check to fit into a single
kernel page, this PR makes it actually crash-consistent. Before, we
could crash while writing out the tenant conf, leaving a prefix of the
tenant conf on disk.
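
A common way to make a write-once file crash-consistent is to write a temporary file, fsync it, rename it into place, and fsync the parent directory. The sketch below shows that general pattern with the standard library only. It is an assumption that the PR follows the same steps: `write_once_durably` is a hypothetical helper, not the `VirtualFile` API, and syncing a directory handle this way assumes a Unix-like system.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: atomically replaces `final_path` with `content`.
/// Expects a path with a non-empty parent directory (e.g. an absolute path).
fn write_once_durably(final_path: &Path, content: &[u8]) -> io::Result<()> {
    let parent = final_path
        .parent()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no parent"))?;
    // Temp name chosen for the sketch; it replaces any existing extension.
    let tmp_path = final_path.with_extension("tmp");
    {
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(content)?;
        // Make the file contents durable before the rename exposes them.
        tmp.sync_all()?;
    }
    // The rename is atomic on POSIX file systems: readers see old or new, never a prefix.
    fs::rename(&tmp_path, final_path)?;
    // Make the directory entry durable as well (works on Unix-like systems).
    File::open(parent)?.sync_all()?;
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("write_once_sketch_{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let path = dir.join("example.conf");
    write_once_durably(&path, b"hello")?;
    assert_eq!(fs::read(&path)?, b"hello");
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

The directory fsync is the extra cost mentioned under Performance Impact: it is paid on every call, regardless of how small the file is.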
problame added a commit that referenced this pull request Sep 20, 2023
…_ephemeral (#5338)

We removed the user of this in #4994 .

But the metrics field was `pub`, so, didn't cause an unused-warning in
#4994.

This is preliminary for: #5339
problame added a commit that referenced this pull request Oct 5, 2023
…ed_page

Motivation
==========

It's the only user, and the name of `_for_write` is wrong as of

    commit 7a63685
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Aug 18 19:31:03 2023 +0200

        simplify page-caching of EphemeralFile (#4994)

Notes
=====

This also allows us to get rid of the WriteBufResult type.

Also rename `search_mapping_for_write` to `search_mapping_exact`.
It makes more sense that way because there is no `_for_write`-locking
anymore.
problame added a commit that referenced this pull request Oct 6, 2023
…d_page` (#5480)

Motivation
==========

It's the only user, and the name of `_for_write` is wrong as of

    commit 7a63685
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Aug 18 19:31:03 2023 +0200

        simplify page-caching of EphemeralFile (#4994)

Notes
=====

This also allows us to get rid of the WriteBufResult type.

Also rename `search_mapping_for_write` to `search_mapping_exact`. It
makes more sense that way because there is no `_for_write`-locking anymore.

Refs
====

part of #4743
specifically #5479

this is prep work for #5482