Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrency-limit initial logical size calculation [v2] #6000

Merged
merged 6 commits into from
Dec 4, 2023

Conversation

problame
Copy link
Contributor

@problame problame commented Nov 30, 2023

Problem

Before this PR, there was no concurrency limit on initial logical size computations.

While logical size computations are lazy in theory, in practice (production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the .await points on the logical size calculation path never return Pending, hence we have a natural concurrency limit of the number of executor threads.
But, as soon as we return Pending somewhere in the logical size calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor cache to a tokio::sync::RwLock, which makes us return Pending, and without measures like this patch, after PS restart, VirtualFile descriptor cache thrashes heavily for 2 hours until all the logical size calculations have been computed and the degree of concurrency / concurrent VirtualFile operations is down to regular levels.
See the Experiment section below for details.

Background

Before this PR, initial logical size calculation was spawned lazily on first call to Timeline::get_current_logical_size().

In practice (prod), the lazy calculation is triggered by WalReceiverConnectionHandler if the timeline is active according to storage broker, or by the first iteration of consumption metrics worker after restart (MetricsCollection).

The spawns by walreceiver are high-priority because logical size is needed by Safekeepers (via walreceiver PageserverFeedback) to enforce the project logical size limit.
The spawns by metrics collection are not on the user-critical path and hence low-priority. 1

The ratio of walreceiver-initiated spawns vs consumption-metrics-initiated spawns can be reconstructed from logs (spawning logical size computation from context of task kind {:?}").
PR #5995 and #6018 adds metrics for this.

First investigation of the ratio lead to the discovery that walreceiver spawns 75% of init logical size computations.
That's because of two bugs:

The safekeeper bug is likely primarily responsible but we don't have the data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that neither of these bugs are fixed yet.

Changes In This PR

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started early, as part of timeline activation, and run a retry loop with long back-off until success. This removes the lazy computation; it was needless complexity because in practice, we compute all logical sizes anyways, because consumption metrics collects it.

Second, within the initial logical size calculation task, each attempt queues behind the background loop concurrency limiter semaphore. This fixes the performance issue that we pointed out in the "Problem" section earlier.

Third, there is a twist to queuing behind the background loop concurrency limiter semaphore. Logical size is needed by Safekeepers (via walreceiver PageserverFeedback) to enforce the project logical size limit. However, we currently do open walreceiver connections even before we have an exact logical size. That's bad, and I'll build on top of this PR to fix that (#5963). But, for the purposes of this PR, we don't want to introduce a regression, i.e., we don't want to provide an exact value later than before this PR. The solution is to introduce a priority-boosting mechanism (GetLogicalSizePriority), allowing callers of Timeline::get_current_logical_size to specify how urgently they need an exact value. The effect of specifying high urgency is that the initial logical size calculation task for the timeline will skip the concurrency limiting semaphore. This should yield effectively the same behavior as we had before this PR with lazy spawning.

Last, the priority-boosting mechanism obsoletes the init_order's grace period for initial logical size calculations. It's a separate commit to reduce the churn during review. We can drop that commit if people think it's too much churn, and commit it later once we know this PR here worked as intended.

Experiment With #5479

I validated this PR combined with #5479 to assess whether we're making forward progress towards asyncification.

The setup is an i3en.3xlarge instance with 20k tenants, each with one timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by consumption metrics MetricsCollection task kind.
The consumption metrics worker starts requesting logical sizes at low priority immediately after restart. This is achieved by deleting the consumption metrics cache file on disk before starting PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file] Consumption metrics worker persists its interval across restarts to achieve persistent reporting intervals across PS restarts; delete the state file on disk to get predictable (and I believe worst-case in terms of concurrency during PS restart) behavior.

Before this patch, all of these timelines would all do their initial logical size calculation in parallel, leading to extreme thrashing in page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced significantly (from 80k open-system-calls/second to ~500 open-system-calls/second during loading).

Critique

The obvious critique with above experiment is that there's no skipping of the semaphore, i.e., the priority-boosting aspect of this PR is not exercised.

If even just 1% of our 20k tenants in the setup were active in SK/storage_broker, then 200 logical size calculations would skip the limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active state that were mentioned in the Background section, we could have 75% of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile

This PR is a step towards asynchronous VirtualFile, aka, #5479 or even #4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority logical size computations.
If there are many high-priority logical size calculations requested, we'll fall over like we did if #5479 is applied without this PR.
And currently, at very least due to the bugs mentioned in the Background section, we run thousands of high-priority logical size calculations on PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under normal operation.

But in high-traffic situations, overload problems will still be more likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:

  • global concurrency limiting
  • per-tenant rate limiting => Epic: per-tenant read path throttling #5899
  • load shedding
  • scaling bottleneck resources (fd cache size (neondatabase/cloud#8351), page cache size(neondatabase/cloud#8351), spread load across more PSes, etc)

Conclusion

Even with the remarks from in the previous section, we should merge this PR because:

  1. it's an improvement over the status quo (esp. if the aforementioned bugs wrt timeline active / inactive are fixed)
  2. it prepares the way for WIP: fix: logical size limit is broken during PS restart #6010
  3. it gets us close to shipping revert "revert recent VirtualFile asyncification changes #5291" #5479 and Epic: scalable async disk IO (tokio-epoll-uring) #4744

Footnotes

  1. We can't delay metrics collection indefintely because there are TBD internal SLOs tied to metrics collection happening in a timeline manner (https://github.com/neondatabase/cloud/issues/7408). But let's ignore that in this issue.

@problame problame changed the title WIP: rework initial logical size calculation WIP: concurrency-limit initial logical size calculation [v2] Nov 30, 2023
@problame problame self-assigned this Nov 30, 2023
Copy link

github-actions bot commented Nov 30, 2023

2436 tests run: 2337 passed, 0 failed, 99 skipped (full report)


Flaky tests (4)

Postgres 16

  • test_branching_with_pgbench[flat-1-10]: debug

Postgres 15

  • test_statvfs_pressure_min_avail_bytes: debug
  • test_timetravel: debug

Postgres 14

Code coverage (full report)

  • functions: 54.5% (9281 of 17022 functions)
  • lines: 82.0% (53944 of 65781 lines)

The comment gets automatically updated with the latest test results
5b0e7a7 at 2023-12-04T17:07:34.152Z :recycle:

problame added a commit that referenced this pull request Dec 1, 2023
fixes #5963

On top of #6000

Will ship this in a release after #600
@problame problame force-pushed the problame/init-logical-size-cleanups branch from 8d427ea to 9275824 Compare December 1, 2023 12:00
@problame problame force-pushed the problame/init-logical-size-rework branch from ef1848f to b354bff Compare December 1, 2023 12:59
Base automatically changed from problame/init-logical-size-cleanups to main December 1, 2023 13:16
@problame problame force-pushed the problame/init-logical-size-rework branch 2 times, most recently from a717bef to 5b01887 Compare December 1, 2023 18:14
@problame problame changed the title WIP: concurrency-limit initial logical size calculation [v2] concurrency-limit initial logical size calculation [v2] Dec 1, 2023
@problame problame force-pushed the problame/init-logical-size-rework branch from 5b01887 to f886a4c Compare December 1, 2023 22:16
@problame problame marked this pull request as ready for review December 4, 2023 10:42
@problame problame requested a review from a team as a code owner December 4, 2023 10:42
@problame problame requested review from hlinnaka and removed request for a team December 4, 2023 10:42
@problame
Copy link
Contributor Author

problame commented Dec 4, 2023

@koivunej I polished up the PR description, re-ran the experiment to make sure this is a step forward with regard to #5479 , and added two sections explaining what exactly that means.

The remaining comment is to make sure that logical size computation impl is cancel safe: #6000 (comment)

I think it is, and added a commit that adds cancel-safety comments to the methods that I audited. I'm treating the Layer::get_reconstruct_data as a black box.

Would appreciate you to comment there, and resolve your other conversation.

Then let's get this shipped so we maximize soak time.

@problame problame changed the title concurrency-limit initial logical size calculation [v2] concurrency-limit low-priority initial logical size calculation [v2] Dec 4, 2023
problame added a commit that referenced this pull request Dec 4, 2023
# Problem

I need walredo to be cancellation-safe for
#6000 (comment)

# Solution

We are only `async fn` because of
`wait_for(stderr_logger_task_done).await`, added in #5560 .

The `stderr_logger_cancel` and `stderr_logger_task_done` were there out
of precaution that the stderr logger task might for some reason not stop
when the walredo process terminates.
That hasn't been a problem in practice.
So, simplify things:
- remove `stderr_logger_cancel` and the
`wait_for(...stderr_logger_task_done...)`
- use `tokio::process::ChildStderr` in the stderr logger task
- add metrics to track number of running stderr logger tasks so in case
I'm wrong here, we can use these metrics to identify the issue (not
planning to put them into a dashboard or anything)
@problame problame enabled auto-merge (squash) December 4, 2023 16:10
@problame problame merged commit c7f1143 into main Dec 4, 2023
41 checks passed
@problame problame deleted the problame/init-logical-size-rework branch December 4, 2023 17:22
@shanyp shanyp changed the title concurrency-limit low-priority initial logical size calculation [v2] concurrency-limit initial logical size calculation [v2] Dec 5, 2023
problame added a commit that referenced this pull request Jan 11, 2024
)

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; Start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We will still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants