
Revert "revert recent VirtualFile asyncification changes (#5291)" #6309

Merged
1 commit merged into main from problame/revert-revert-virtualfile-asyncification/pr on Jan 11, 2024

Conversation

problame (Contributor) commented on Jan 9, 2024

This reverts commit ab1f37e and thereby fixes #5479.

Updated Analysis

The problem with the original patch was that it, for the first time, exposed the VirtualFile code to tokio task concurrency instead of just thread-based concurrency. That caused the VirtualFile file descriptor cache to start thrashing, effectively grinding the system to a halt.

Details

At the time of the original patch, we had a lot of runnable tasks in the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is that our production systems fell into a valley of zero goodput, high CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent initial logical size calculations.
Initial size calculation was initiated for all timelines within the first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the VirtualFile code never returned Poll::Pending.
That meant that once we entered it, the calling tokio task would not yield to the tokio executor until we were done performing the VirtualFile operation, i.e., doing a blocking IO system call.
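For illustration, here is a minimal sketch of that pre-patch shape (hypothetical function and names, not the actual VirtualFile API): the function is async, but its body contains no await point, so once polled it runs the blocking system call to completion and never hands control back to the executor.

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

// Hypothetical sketch, not the actual VirtualFile code: an `async fn` whose body
// has no await point. Polling it runs the blocking pread(2) to completion, so it
// never returns Poll::Pending to the tokio executor.
async fn read_at_blocking(file: &File, buf: &mut [u8], offset: u64) -> std::io::Result<usize> {
    // Synchronous positional read: the executor thread is occupied for the
    // entire duration of the system call.
    file.read_at(buf, offset)
}
```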

The original patch switched the VirtualFile file descriptor cache's synchronization primitives to those from tokio::sync.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we had tokio executor threads.
So, these primitives never actually needed to return Poll::Pending.
However, the tokio scheduler makes tokio::sync primitives return Pending artificially from time to time (cooperative budgeting), as a mechanism for the scheduler to get back into control more often (example).
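To make that concrete, here is a self-contained sketch of the cooperative-budgeting behavior (plain tokio, not pageserver code): even an uncontended tokio::sync::Semaphore acquire is an await point, and once the task's budget is used up, the acquire returns Pending so other tasks get to run.

```rust
use tokio::sync::Semaphore;

// Minimal sketch of tokio's cooperative budgeting, not pageserver code: the
// permit below is always available, yet the acquire periodically returns
// Poll::Pending once the task's cooperative budget is exhausted, letting the
// scheduler run other tasks before this loop continues.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let sem = Semaphore::new(1);
    for i in 0..1_000u32 {
        // Await point: consumes cooperative budget even though it never blocks.
        let permit = sem.acquire().await.unwrap();
        drop(permit);
        if i % 100 == 0 {
            println!("iteration {i}");
        }
    }
}
```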

So, the new reality was that VirtualFile calls could now yield to the tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing, most notably the VirtualFile file descriptor cache: each time a task would be rescheduled, it would want to do its next VirtualFile operation. For that, it would first need to evict another (task's) VirtualFile fd from the cache to make room for its own fd. It would then do one VirtualFile operation before hitting an await point and yielding to the executor again. The executor would run the other 19999 tasks for fairness before circling back to the first task, which would find its fd evicted.
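The following toy model reproduces that churn (hypothetical types and numbers, not the actual fd cache): many tasks share a cache with far fewer slots than tasks; each task re-inserts its entry before every operation, evicting someone else's, then hits an await point; by the time it runs again, its own entry has been evicted, so essentially every operation is a miss.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use tokio::sync::Mutex;

// Toy model of the thrashing pattern, not the actual VirtualFile fd cache.
const SLOTS: usize = 4; // cache slots, far fewer than tasks
const TASKS: usize = 64; // concurrently runnable tasks

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let cache: Arc<Mutex<VecDeque<usize>>> = Arc::new(Mutex::new(VecDeque::new()));
    let mut handles = Vec::new();
    for task_id in 0..TASKS {
        let cache = Arc::clone(&cache);
        handles.push(tokio::spawn(async move {
            let mut misses = 0usize;
            for _ in 0..100 {
                let mut c = cache.lock().await; // await point: other tasks may run here
                if !c.contains(&task_id) {
                    misses += 1; // our "fd" was evicted; pay the reopen cost
                    if c.len() == SLOTS {
                        c.pop_front(); // evict another task's entry to make room
                    }
                    c.push_back(task_id);
                }
                drop(c);
                tokio::task::yield_now().await; // stand-in for the next await point
            }
            misses
        }));
    }
    let mut total_misses = 0;
    for h in handles {
        total_misses += h.await.unwrap();
    }
    // With TASKS >> SLOTS and fair scheduling, nearly every operation is a miss.
    println!("misses: {total_misses} / {}", TASKS * 100);
}
```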

The other cache that could theoretically be impacted in a similar way is the pageserver's PageCache.
However, in experiments it appears much less relevant for initial logical size calculation, likely because of that workload's random-access nature.

Fixes

We fixed the above problems by limiting the concurrency of initial logical size calculations; a rough sketch of that approach follows below.
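As a sketch of that approach (hypothetical names and limit value, not the actual pageserver code), a global tokio::sync::Semaphore caps how many initial logical size calculations run concurrently, so the number of runnable tasks stays small even with 20k timelines:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical concurrency limit; the real value and wiring live in the pageserver.
const MAX_CONCURRENT_SIZE_CALCULATIONS: usize = 8;

async fn calculate_initial_logical_size(timeline_id: usize, limiter: Arc<Semaphore>) -> u64 {
    // Wait (as a cheaply suspended task, not a runnable one) until a permit is free.
    let _permit = limiter.acquire_owned().await.expect("semaphore is never closed");
    // ... the actual size calculation would run here while holding the permit ...
    timeline_id as u64 // placeholder result
}

#[tokio::main]
async fn main() {
    let limiter = Arc::new(Semaphore::new(MAX_CONCURRENT_SIZE_CALCULATIONS));
    let mut handles = Vec::new();
    for timeline_id in 0..20_000 {
        handles.push(tokio::spawn(calculate_initial_logical_size(
            timeline_id,
            Arc::clone(&limiter),
        )));
    }
    for h in handles {
        h.await.unwrap();
    }
}
```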

I manually verified that this PR doesn't negatively affect startup performance as follows:
create a pageserver in production configuration, with 20k tenants/timelines and 9 tiny L0 layer files each; start it, and observe:

INFO Startup complete (368.009s since start) elapsed_ms=368009

I further verified in that same setup that, when running pagebench's getpage benchmark at an as-fast-as-possible request rate against 5k of the 20k tenants, the achieved throughput is identical. The VirtualFile cache isn't thrashing in that case.

Future Work

We are still exposed to the cache thrashing risk from outside factors, e.g., request concurrency is unbounded, and initial size calculation skips the concurrency limiter when we establish a walreceiver connection.

Once we start thrashing, we will degrade non-gracefully, i.e., fall into a goodput valley like the one seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:

  1. we have dashboards & metrics to monitor & alert on cache thrashing
  2. we can react by scaling the bottleneck resources (cache size) or by manually shedding load through tenant relocation

Potential systematic solutions are left as future work.

Related Issues

This PR unblocks the introduction of tokio-epoll-uring for asynchronous disk IO (Epic).

github-actions bot commented on Jan 9, 2024

2238 tests run: 2153 passed, 0 failed, 85 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_statvfs_pressure_usage: debug

Code coverage (full report)

  • functions: 54.7% (10108 of 18486 functions)
  • lines: 81.6% (58037 of 71140 lines)

The comment gets automatically updated with the latest test results
16cb2cd at 2024-01-09T19:45:43.299Z :recycle:

@problame changed the base branch from problame/revert-revert-virtualfile-asyncification/2024-01-09--01 to main on January 9, 2024 19:09
@problame force-pushed the problame/revert-revert-virtualfile-asyncification/pr branch from 0ecfce6 to 16cb2cd on January 9, 2024 19:09
@problame marked this pull request as ready for review on January 10, 2024 18:41
@problame requested a review from a team as a code owner on January 10, 2024 18:41
@problame requested review from VladLazar, jcsp, koivunej and arpad-m and removed the request for a team on January 10, 2024 18:41
@problame merged commit fc66ba4 into main on Jan 11, 2024
46 checks passed
@problame deleted the problame/revert-revert-virtualfile-asyncification/pr branch on January 11, 2024 10:29