-
Notifications
You must be signed in to change notification settings - Fork 440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simplify page-caching of EphemeralFile #4994
Conversation
fdfd4d9
to
2a5cdee
Compare
1608 tests run: 1532 passed, 0 failed, 76 skipped (full report)The comment gets automatically updated with the latest test results
8e891e3 at 2023-08-18T15:52:55.194Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, didn't fully understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment at top of page_cache.rs
needs updating, now that you cannot (or shouldn't) modify a page
Thanks for the reviews, will push more fixes shortly. In the meantime, I did a small extraction: |
I'm working on more extractions. |
Before this patch, we had the `off` and `blknum` as function-wide mutable state. Now it's contained in the `Writer` struct. The use of `push_bytes` instead of index-based filling of the buffer also makes it easier to reason about what's going on. This is prep for #4994
This makes it more explicit that these are different u64-sized namespaces. Re-using one in place of the other would be catastrophic. Prep for #4994 which will eliminate the ephemeral_file::FileId and move the blob_io::FileId into page_cache. It makes sense to have this preliminary commit though, to minimize amount of new concept in #4994 and other preliminaries that depend on that work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments, otherwise okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the 'static
lifetime instead of 'a
this should be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final approval
## Problem close #5034 ## Summary of changes Based on the [comment](#4994 (comment)). Just rename the `EphmeralFile::size` to `EphemeralFile::len`.
Before this patch, when dropping an EphemeralFile, we'd scan the entire `slots` to proactively evict its pages (`drop_buffers_for_immutable`). This was _necessary_ before #4994 because the page cache was a write-back cache: we'd be deleting the EphemeralFile from disk after, so, if we hadn't evicted its pages before that, write-back in `find_victim` wouldhave failed. But, since #4994, the page cache is a read-only cache, so, it's safe to keep read-only data cached. It's never going to get accessed again and eventually, `find_victim` will evict it. The only remaining advantage of `drop_buffers_for_immutable` over relying on `find_victim` is that `find_victim` has to do the clock page replacement iterations until the count reaches 0, whereas `drop_buffers_for_immutable` can kick the page out right away. However, weigh that against the cost of `drop_buffers_for_immutable`, which currently scans the entire `slots` array to find the EphemeralFile's pages. Alternatives have been proposed in #5122 and #5128, but, they come with their own overheads & trade-offs. So, let's just stop doing `drop_buffers_for_immutable` and observe the performance impact in benchmarks.
Before this patch, when dropping an EphemeralFile, we'd scan the entire `slots` to proactively evict its pages (`drop_buffers_for_immutable`). This was _necessary_ before #4994 because the page cache was a write-back cache: we'd be deleting the EphemeralFile from disk after, so, if we hadn't evicted its pages before that, write-back in `find_victim` wouldhave failed. But, since #4994, the page cache is a read-only cache, so, it's safe to keep read-only data cached. It's never going to get accessed again and eventually, `find_victim` will evict it. The only remaining advantage of `drop_buffers_for_immutable` over relying on `find_victim` is that `find_victim` has to do the clock page replacement iterations until the count reaches 0, whereas `drop_buffers_for_immutable` can kick the page out right away. However, weigh that against the cost of `drop_buffers_for_immutable`, which currently scans the entire `slots` array to find the EphemeralFile's pages. Alternatives have been proposed in #5122 and #5128, but, they come with their own overheads & trade-offs. Also, the real reason why we're looking into this piece of code is that we want to make the slots rwlock async in #5023. Since `drop_buffers_for_immutable` is called from drop, and there is no async drop, it would be nice to not have to deal with this. So, let's just stop doing `drop_buffers_for_immutable` and observe the performance impact in benchmarks.
## Problem `read_blk` does I/O and thus we would like to make it async. We can't make the function async as long as the `PageReadGuard` returned by `read_blk` isn't `Send`. The page cache is called by `read_blk`, and thus it can't be async without `read_blk` being async. Thus, we have a circular dependency. ## Summary of changes Due to the circular dependency, we convert both the page cache and `read_blk` to async at the same time: We make the page cache use `tokio::sync` synchronization primitives as those are `Send`. This makes all the places that acquire a lock require async though, which we then also do. This includes also asyncification of the `read_blk` function. Builds upon #4994, #5015, #5056, and #5129. Part of #4743.
(part of #4743) (preliminary to #5180) This PR adds a special-purpose API to `VirtualFile` for write-once files. It adopts it for `save_metadata` and `persist_tenant_conf`. This is helpful for the asyncification efforts (#4743) and specifically asyncification of `VirtualFile` because above two functions were the only ones that needed the VirtualFile to be an `std::io::Write`. (There was also `manifest.rs` that needed the `std::io::Write`, but, it isn't used right now, and likely won't be used because we're taking a different route for crash consistency, see #5172. So, let's remove it. It'll be in Git history if we need to re-introduce it when picking up the compaction work again; that's why it was introduced in the first place). We can't remove the `impl std::io::Write for VirtualFile` just yet because of the `BufWriter` in ```rust struct DeltaLayerWriterInner { ... blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>, } ``` But, @arpad-m and I have a plan to get rid of that by extracting the append-only-ness-on-top-of-VirtualFile that #4994 added to `EphemeralFile` into an abstraction that can be re-used in the `DeltaLayerWriterInner` and `ImageLayerWriterInner`. That'll be another PR. ### Performance Impact This PR adds more fsyncs compared to before because we fsync the parent directory every time. 1. For `save_metadata`, the additional fsyncs are unnecessary because we know that `metadata` fits into a kernel page, and hence the write won't be torn on the way into the kernel. However, the `metadata` file in general is going to lose signficance very soon (=> see #5172), and the NVMes in prod can definitely handle the additional fsync. So, let's not worry about it. 2. For `persist_tenant_conf`, which we don't check to fit into a single kernel page, this PR makes it actually crash-consistent. Before, we could crash while writing out the tenant conf, leaving a prefix of the tenant conf on disk.
…ed_page Motivation ========== It's the only user, and the name of `_for_write` is wrong as of commit 7a63685 Author: Christian Schwarz <christian@neon.tech> Date: Fri Aug 18 19:31:03 2023 +0200 simplify page-caching of EphemeralFile (#4994) Notes ===== This also allows us to get rid of the WriteBufResult type. Also rename `search_mapping_for_write` to `search_mapping_exact`. It makes more sense that way because there is `_for_write`-locking anymore.
…d_page` (#5480) Motivation ========== It's the only user, and the name of `_for_write` is wrong as of commit 7a63685 Author: Christian Schwarz <christian@neon.tech> Date: Fri Aug 18 19:31:03 2023 +0200 simplify page-caching of EphemeralFile (#4994) Notes ===== This also allows us to get rid of the WriteBufResult type. Also rename `search_mapping_for_write` to `search_mapping_exact`. It makes more sense that way because there is `_for_write`-locking anymore. Refs ==== part of #4743 specifically #5479 this is prep work for #5482
(This PR is the successor of #4984 )
Summary
The current way in which
EphemeralFile
usesPageCache
complicates the Pageserver code base to a degree that isn't worth it.This PR refactors how we cache
EphemeralFile
contents, by exploiting the append-only nature ofEphemeralFile
.The result is that
PageCache
only holdsImmutableFilePage
andMaterializedPage
.These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from
PageCache
, also eliminating an entire failure mode.Futher, many great open-source libraries exist to solve the problem of a read-only cache,
much better than our
page_cache.rs
(e.g., better replacement policy, less global locking).With this PR, we can now explore using them.
Problem & Analysis
Before this PR,
PageCache
had three types of pages:ImmutableFilePage
: caches Delta / Image layer file contentsMaterializedPage
: caches results of Timeline::get (page materialization)EphemeralPage
: cachesEphemeralFile
contentsEphemeralPage
is quite different fromImmutableFilePage
andMaterializedPage
:PAGE_CACHE_SIZE * PAGE_SIZE
bytes of DRAM.EphemeralFile
contents, i.e., if there is pressure in the page cache, we spillEphemeralFile
contents to disk.EphemeralFile
is only used byInMemoryLayer
, for the following purposes:InMemoryLayer
, viaimpl BlobWriter for EphemeralFile
InMemoryLayer
and put the contents into the L0 delta writer (create_delta_layer
). This happens every 10min or when InMemoryLayer reaches 256MB in size.The access patterns of the
InMemoryLayer
use case are as follows:BlobWriter
, strictly append-onlyBlobReader
, randomcreate_delta_layer
: viaBlobReader
, dependent on data, but generally random. Why?C0
tree into the disk-residentC1
treeValue
s inKey, LSN
order, which is!=
insert orderWhat do these
EphemeralFile
-level access patterns mean for the page cache?Value
is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller thanPAGE_SIZE
EphemeralPage
pages act as a buffer for these< PAGE_CACHE
sized writes.InMemoryLayer::put_value
calls, theEphemeralPage
is still resident, so the page cache avoids doing awrite
system call.create_delta_layer
:Value
reads happen through aBlockCursor
, which optimizes the case of repeated reads from the same page.BlockCursor
s buffer is maximally effective.Value
is on a different page; hence theBlockCursor
's 1-page-sized buffer is ineffective.256MB/PAGE_SIZE
page cache accesses, one per page.#Values
page cache accessesValue
s aren't accessed in insertion order butKey, LSN
order.Summary of changes
Preliminaries for this PR were:
Based on the observations outlined above, this PR makes the following changes:
EphemeralPage
frompage_cache.rs
block_io::FileId
topage_cache::FileId
PAGE_SIZE
d buffer to theEphemeralPage
struct.It's called
mutable_tail
.write_blob
to usemutable_tail
for the write buffering instead of a page cache page.mutable_tail
is full, it writes it out to disk, zeroes it out, and re-uses it.EphemeralFile
instance is fixed.read_blob
to return differentBlockLease
variants depending onblknum
blknum
that corresponds to themutable_tail
, return a ref to itwrite_blob
calls while refs are outstanding.ImmutablePage
Performance
How doe the changes above affect performance?
M claim is: not significantly.
EphemeralFile::write_blob
didn't issue its ownwrite
system calls.write
system calls.EphemeralPage
s to get pages a page for its writes (get_buf_for_write
), the page cache code would implicitly issue the writeback of victim pages as needed.EphemeralFile::write_blob
always issues all of its ownwrite
system calls.Timeline::get
toasync fn
#4743fsync
and can expect the writes to only hit the kernel page cache.EphemeralPage
.write_blob
function pre-warms the page cache when it writes themutable_tail
to disk.create_delta_layer
: no impact.Key,Lsn
-ordered access pattern, we don't read all the values in the page before moving to the next page. In the worst case, we might read the same page multiple times to read differentValues
from it.copy_file_range
to copy from ephemeral file into the L0 delta file, without going through user space