page_cache: ensure forward progress on miss #5482

Merged
merged 4 commits into main on Oct 6, 2023

Conversation

problame
Contributor

@problame problame commented Oct 5, 2023

Problem

Prior to this PR, when we had a cache miss, we'd get back a write guard, fill it, then drop it, and retry the read from the cache.

If there's severe contention for the cache, it could happen that the just-filled data gets evicted before our retry, resulting in lost work and no forward progress.

Solution

This PR leverages the now-available `tokio::sync::RwLockWriteGuard::downgrade()` functionality to turn the write guard on the freshly filled slot into a read guard.
We never drop the guard along the way, so forward progress is ensured.
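
Below is a minimal sketch of the pattern, using a toy slot type and hypothetical names (`Slot`, `read_or_fill`, `fill_from_disk`) rather than the actual `page_cache` API:

```rust
use tokio::sync::{RwLock, RwLockReadGuard};

struct Slot {
    // `None` = empty slot, `Some(buf)` = filled page data.
    data: RwLock<Option<Vec<u8>>>,
}

impl Slot {
    /// On a hit, return a read guard. On a miss, fill the slot under the
    /// write lock and then *downgrade* the write guard to a read guard
    /// instead of dropping it and retrying the lookup. Because the guard
    /// is never released, the just-filled data cannot be evicted before
    /// the caller reads it, which guarantees forward progress.
    async fn read_or_fill(
        &self,
        fill_from_disk: impl FnOnce() -> Vec<u8>,
    ) -> RwLockReadGuard<'_, Option<Vec<u8>>> {
        {
            let read = self.data.read().await;
            if read.is_some() {
                return read; // hit
            }
        }
        let mut write = self.data.write().await;
        if write.is_none() {
            *write = Some(fill_from_disk()); // miss: fill the slot
        }
        // Key step: turn the write guard into a read guard without ever
        // dropping it.
        write.downgrade()
    }
}
```

`RwLockWriteGuard::downgrade()` performs the transition atomically, so no other writer can take the lock between the fill and the downgrade.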

Refs

Stacked atop #5480

part of #4743
specifically part of #5479

…ed_page

Motivation
==========

It's the only user, and the name of `_for_write` is wrong as of

    commit 7a63685
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Aug 18 19:31:03 2023 +0200

        simplify page-caching of EphemeralFile (#4994)

Notes
=====

This also allows us to get rid of the WriteBufResult type.

Also rename `search_mapping_for_write` to `search_mapping_exact`.
It makes more sense that way because there is no `_for_write`-locking
anymore.
@problame problame force-pushed the problame/page-cache-forward-progress/2 branch from 652b1df to dc96a76 on October 5, 2023 at 14:52
@problame problame linked an issue Oct 5, 2023 that may be closed by this pull request
1 task
@problame problame marked this pull request as ready for review October 5, 2023 15:28
@problame problame requested a review from a team as a code owner October 5, 2023 15:28
@problame problame requested review from hlinnaka and arpad-m and removed request for a team October 5, 2023 15:28
@github-actions

github-actions bot commented Oct 5, 2023

2250 tests run: 2134 passed, 0 failed, 116 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage (full report)

  • functions: 52.3% (8122 of 15521 functions)
  • lines: 81.1% (47482 of 58577 lines)

The comment gets automatically updated with the latest test results: a390c21 at 2023-10-06T12:33:28.774Z.

Member

@arpad-m arpad-m left a comment


Very nice, lets us get rid of those loops 👍

pageserver/src/page_cache.rs (review thread resolved)
pageserver/src/page_cache.rs (outdated review thread, resolved)
Base automatically changed from problame/page-cache-forward-progress/1 to main October 6, 2023 11:38
problame added a commit that referenced this pull request Oct 6, 2023
…d_page` (#5480)

Motivation
==========

It's the only user, and the name of `_for_write` is wrong as of

    commit 7a63685
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Aug 18 19:31:03 2023 +0200

        simplify page-caching of EphemeralFile (#4994)

Notes
=====

This also allows us to get rid of the WriteBufResult type.

Also rename `search_mapping_for_write` to `search_mapping_exact`. It
makes more sense that way because there is no `_for_write`-locking anymore.

Refs
====

part of #4743
specifically #5479

this is prep work for #5482
@problame problame enabled auto-merge (squash) October 6, 2023 11:50
@problame problame merged commit bfba5e3 into main Oct 6, 2023
39 checks passed
@problame problame deleted the problame/page-cache-forward-progress/2 branch October 6, 2023 12:41
problame added a commit that referenced this pull request Jan 11, 2024

This reverts commit ab1f37e, and thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (which is itself being reverted in this PR)
was that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to regain control more
often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).
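
An illustrative, standalone sketch of that mechanism (toy code, not pageserver
code): even a completely uncontended `tokio::sync` primitive will periodically
suspend its task once tokio's per-task cooperative budget is used up.

```rust
use tokio::sync::Semaphore;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // One permit, one task: there is never any real contention, so the
    // acquire "should" always complete immediately.
    let sem = Semaphore::new(1);
    for _ in 0..10_000u32 {
        // Nevertheless, once the task's cooperative budget is exhausted,
        // this acquire returns Poll::Pending so the scheduler regains
        // control before re-polling the task.
        let _permit = sem.acquire().await.unwrap();
    }
}
```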

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.
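
The dynamic can be reproduced with a toy model (hypothetical code, not the
actual `VirtualFile` fd cache): many tasks share an LRU-style cache with far
fewer slots than tasks and yield after every operation, so the hit rate
collapses to near zero.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use tokio::sync::Mutex;

/// Toy fd cache: a bounded LRU queue of "open" file ids.
struct FdCache {
    slots: Mutex<VecDeque<u64>>,
    capacity: usize,
}

impl FdCache {
    /// Returns true on a hit; on a miss, "opens" the file, possibly
    /// evicting another task's fd to make room.
    async fn get_or_open(&self, file_id: u64) -> bool {
        let mut slots = self.slots.lock().await;
        if let Some(pos) = slots.iter().position(|&id| id == file_id) {
            let id = slots.remove(pos).unwrap();
            slots.push_back(id); // move to the MRU position
            return true;
        }
        if slots.len() == self.capacity {
            slots.pop_front(); // evict the least recently used fd
        }
        slots.push_back(file_id);
        false
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // 100 tasks but only 10 slots: by the time the scheduler cycles back
    // to a task, its fd has been evicted again, so nearly every operation
    // pays the eviction + reopen cost.
    let cache = Arc::new(FdCache {
        slots: Mutex::new(VecDeque::new()),
        capacity: 10,
    });
    let mut handles = Vec::new();
    for task_id in 0..100u64 {
        let cache = Arc::clone(&cache);
        handles.push(tokio::spawn(async move {
            let mut hits = 0u32;
            for _ in 0..100 {
                if cache.get_or_open(task_id).await {
                    hits += 1;
                }
                // Stand-in for the await point after each VirtualFile op.
                tokio::task::yield_now().await;
            }
            hits
        }));
    }
    let mut total_hits = 0u32;
    for h in handles {
        total_hits += h.await.unwrap();
    }
    println!("hits: {total_hits} out of 10000 operations");
}
```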

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache-thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a goodput valley like the one seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
Development

Successfully merging this pull request may close these issues.

revert "revert recent VirtualFile asyncification changes #5291"