page cache: newtype the blob_io and ephemeral_file file ids #5005

problame · 2023-08-16T13:44:13Z

This makes it more explicit that these are different u64-sized namespaces.
Re-using one in place of the other would be catastrophic.

Prep for #4994
which will eliminate the ephemeral_file::FileId and move the
blob_io::FileId into page_cache.
It makes sense to have this preliminary commit though,
to minimize amount of new concept in #4994 and other
preliminaries that depend on that work.

This makes it more explicit that these are different u64-sized namespaces. Re-using one in place of the other would be catastrophic. Prep for #4994 which will eliminate the ephemeral_file::FileId and move the blob_io::FileId into page_cache. It makes sense to have this preliminary commit though, to minimize amount of new concept in #4994 and other preliminaries that depend on that work.

github-actions · 2023-08-16T14:15:21Z

1588 tests run: 1512 passed, 0 failed, 76 skipped (full report)

_{The comment gets automatically updated with the latest test results
163fbbf at 2023-08-16T14:15:20.841Z :recycle:}

(This PR is the successor of #4984 ) ## Summary The current way in which `EphemeralFile` uses `PageCache` complicates the Pageserver code base to a degree that isn't worth it. This PR refactors how we cache `EphemeralFile` contents, by exploiting the append-only nature of `EphemeralFile`. The result is that `PageCache` only holds `ImmutableFilePage` and `MaterializedPage`. These types of pages are read-only and evictable without write-back. This allows us to remove the writeback code from `PageCache`, also eliminating an entire failure mode. Futher, many great open-source libraries exist to solve the problem of a read-only cache, much better than our `page_cache.rs` (e.g., better replacement policy, less global locking). With this PR, we can now explore using them. ## Problem & Analysis Before this PR, `PageCache` had three types of pages: * `ImmutableFilePage`: caches Delta / Image layer file contents * `MaterializedPage`: caches results of Timeline::get (page materialization) * `EphemeralPage`: caches `EphemeralFile` contents `EphemeralPage` is quite different from `ImmutableFilePage` and `MaterializedPage`: * Immutable and materialized pages are for the acceleration of (future) reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of DRAM. * Ephemeral pages are a write-back cache of `EphemeralFile` contents, i.e., if there is pressure in the page cache, we spill `EphemeralFile` contents to disk. `EphemeralFile` is only used by `InMemoryLayer`, for the following purposes: * **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter for EphemeralFile` * **read**: when doing **page reconstruction** for a page@lsn that isn't written to disk * **read**: when writing L0 layer files, we re-read the `InMemoryLayer` and put the contents into the L0 delta writer (**`create_delta_layer`**). This happens every 10min or when InMemoryLayer reaches 256MB in size. The access patterns of the `InMemoryLayer` use case are as follows: * **write**: via `BlobWriter`, strictly append-only * **read for page reconstruction**: via `BlobReader`, random * **read for `create_delta_layer`**: via `BlobReader`, dependent on data, but generally random. Why? * in classical LSM terms, this function is what writes the memory-resident `C0` tree into the disk-resident `C1` tree * in our system, though, the values of InMemoryLayer are stored in an EphemeralFile, and hence they are not guaranteed to be memory-resident * the function reads `Value`s in `Key, LSN` order, which is `!=` insert order What do these `EphemeralFile`-level access patterns mean for the page cache? * **write**: * the common case is that `Value` is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller than `PAGE_SIZE` * So, the `EphemeralPage` pages act as a buffer for these `< PAGE_CACHE` sized writes. * If there's no page cache eviction between subsequent `InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident, so the page cache avoids doing a `write` system call. * In practice, a busy page server will have page cache evictions because we only configure 64MB of page cache size. * **reads for page reconstruction**: read acceleration, just as for the other page types. * **reads for `create_delta_layer`**: * The `Value` reads happen through a `BlockCursor`, which optimizes the case of repeated reads from the same page. * So, the best case is that subsequent values are located on the same page; hence `BlockCursor`s buffer is maximally effective. * The worst case is that each `Value` is on a different page; hence the `BlockCursor`'s 1-page-sized buffer is ineffective. * The best case translates into `256MB/PAGE_SIZE` page cache accesses, one per page. * the worst case translates into `#Values` page cache accesses * again, the page cache accesses must be assumed to be random because the `Value`s aren't accessed in insertion order but `Key, LSN` order. ## Summary of changes Preliminaries for this PR were: - #5003 - #5004 - #5005 - uncommitted microbenchmark in #5011 Based on the observations outlined above, this PR makes the following changes: * Rip out `EphemeralPage` from `page_cache.rs` * Move the `block_io::FileId` to `page_cache::FileId` * Add a `PAGE_SIZE`d buffer to the `EphemeralPage` struct. It's called `mutable_tail`. * Change `write_blob` to use `mutable_tail` for the write buffering instead of a page cache page. * if `mutable_tail` is full, it writes it out to disk, zeroes it out, and re-uses it. * There is explicitly no double-buffering, so that memory allocation per `EphemeralFile` instance is fixed. * Change `read_blob` to return different `BlockLease` variants depending on `blknum` * for the `blknum` that corresponds to the `mutable_tail`, return a ref to it * Rust borrowing rules prevent `write_blob` calls while refs are outstanding. * for all non-tail blocks, return a page-cached `ImmutablePage` * It is safe to page-cache these as ImmutablePage because EphemeralFile is append-only. ## Performance How doe the changes above affect performance? M claim is: not significantly. * **write path**: * before this PR, the `EphemeralFile::write_blob` didn't issue its own `write` system calls. * If there were enough free pages, it didn't issue *any* `write` system calls. * If it had to evict other `EphemeralPage`s to get pages a page for its writes (`get_buf_for_write`), the page cache code would implicitly issue the writeback of victim pages as needed. * With this PR, `EphemeralFile::write_blob` *always* issues *all* of its *own* `write` system calls. * Also, the writes are explicit instead of implicit through page cache write back, which will help #4743 * The perf impact of always doing the writes is the CPU overhead and syscall latency. * Before this PR, we might have never issued them if there were enough free pages. * We don't issue `fsync` and can expect the writes to only hit the kernel page cache. * There is also an advantage in issuing the writes directly: the perf impact is paid by the tenant that caused the writes, instead of whatever tenant evicts the `EphemeralPage`. * **reads for page reconstruction**: no impact. * The `write_blob` function pre-warms the page cache when it writes the `mutable_tail` to disk. * So, the behavior is the same as with the EphemeralPages before this PR. * **reads for `create_delta_layer`**: no impact. * Same argument as for page reconstruction. * Note for the future: * going through the page cache likely causes read amplification here. Why? * Due to the `Key,Lsn`-ordered access pattern, we don't read all the values in the page before moving to the next page. In the worst case, we might read the same page multiple times to read different `Values` from it. * So, it might be better to bypass the page cache here. * Idea drafts: * bypass PS page cache + prefetch pipeline + iovec-based IO * bypass PS page cache + use `copy_file_range` to copy from ephemeral file into the L0 delta file, without going through user space

problame requested review from a team as code owners August 16, 2023 13:44

problame requested review from save-buffer, skyzh, hlinnaka and arpad-m and removed request for a team, save-buffer and skyzh August 16, 2023 13:44

problame mentioned this pull request Aug 16, 2023

simplify page-caching of EphemeralFile #4994

Merged

arpad-m approved these changes Aug 16, 2023

View reviewed changes

hlinnaka approved these changes Aug 16, 2023

View reviewed changes

problame merged commit 994411f into main Aug 16, 2023
28 checks passed

problame deleted the problame/page-cache-newtyped-file-ids branch August 16, 2023 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

page cache: newtype the blob_io and ephemeral_file file ids #5005

page cache: newtype the blob_io and ephemeral_file file ids #5005

problame commented Aug 16, 2023

github-actions bot commented Aug 16, 2023

page cache: newtype the blob_io and ephemeral_file file ids #5005

page cache: newtype the blob_io and ephemeral_file file ids #5005

Conversation

problame commented Aug 16, 2023

github-actions bot commented Aug 16, 2023

1588 tests run: 1512 passed, 0 failed, 76 skipped (full report)