
Epic: per-tenant read path throttling #5899

Open · 17 of 18 tasks
problame opened this issue Nov 22, 2023 · 10 comments


problame commented Nov 22, 2023

Motivation

See #5648

tl;dr: we currently serve reads (and write path => #7564) as fast as possible.
This sets the wrong incentives and poses operational and economic problems:

  • noisy-neighbor issues: Pageservers are multi-tenant, but currently a single tenant can max out the resources of a PS on both the read and the write path.
  • reads specifically: computes that should be increasing their DRAM can instead issue more GetPage requests (we currently don't charge for IOPS)

DoD

Pageserver artificially caps the per-tenant throughput on the read path.
I.e., to all upstream Neon components, this cap will appear to be the maximum read performance that you can get per tenant per pageserver.

The limits will be chosen such that a TBD (small single-digit) number of tenants can run at the limit. Discovery of the limit values is done through gradual rollout, conservative experimentation, and informed by benchmarks.

The upstream (compute) responds to the limit-induced backpressure efficiently, gracefully, and without risk of starvation.

There is enough observability to clearly disambiguate slowness induced by throttling from slowness caused by an otherwise slow pageserver. This disambiguation must be available at per-tenant (better: per-timeline) granularity.

The throttle is on by default and cannot be permanently overridden on a per-tenant basis.
I.e., the implementation need not be suited for productization as "performance tier" or "QoS" feature.
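
For intuition only, here is a minimal sketch of the kind of per-tenant throttle this DoD describes: a token bucket with a steady refill rate and a burst allowance that a GetPage handler awaits before serving a request. All names (`TenantThrottle`, `steady_rps`, `burst`) and numbers are made up for illustration; this is not the pageserver's actual implementation.

```rust
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::Instant;

/// Hypothetical per-tenant token bucket: refills at `steady_rps` tokens per
/// second, accumulating at most `burst` tokens.
struct TenantThrottle {
    steady_rps: f64,
    burst: f64,
    state: Mutex<BucketState>,
}

struct BucketState {
    tokens: f64,
    last_refill: Instant,
}

impl TenantThrottle {
    fn new(steady_rps: f64, burst: f64) -> Self {
        TenantThrottle {
            steady_rps,
            burst,
            state: Mutex::new(BucketState {
                tokens: burst,
                last_refill: Instant::now(),
            }),
        }
    }

    /// Wait until one token is available, then consume it. Returns the time
    /// spent waiting, so callers can report it separately (see the
    /// observability requirement above).
    async fn acquire(&self) -> Duration {
        let started = Instant::now();
        loop {
            let sleep_for = {
                let mut st = self.state.lock().await;
                let now = Instant::now();
                let elapsed = now.duration_since(st.last_refill).as_secs_f64();
                st.tokens = (st.tokens + elapsed * self.steady_rps).min(self.burst);
                st.last_refill = now;
                if st.tokens >= 1.0 {
                    st.tokens -= 1.0;
                    return started.elapsed();
                }
                // Sleep (outside the lock) until roughly one token has accrued.
                Duration::from_secs_f64((1.0 - st.tokens) / self.steady_rps)
            };
            tokio::time::sleep(sleep_for).await;
        }
    }
}

#[tokio::main]
async fn main() {
    // e.g. 10_000 GetPage/s steady with a 20_000-request burst (made-up numbers)
    let throttle = TenantThrottle::new(10_000.0, 20_000.0);
    let throttled_for = throttle.acquire().await;
    // ... serve the GetPage request; report latency minus `throttled_for` ...
    println!("spent {throttled_for:?} in throttle");
}
```

A handler would call `throttle.acquire().await` before doing the read and account the returned wait separately, which lines up with the "disambiguate throttling from slowness" requirement above.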

Interactions

Sharding: with sharding, the above limits will be per shard instead of per tenant.
We may need to (re-)introduce per-tenant limits within a single pageserver process to incentivize placement of shards across different nodes for increased performance & load spreading.
However, that's subject to future work.

High-Level Plan

High Level

Impl

@problame problame added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic c/storage Component: storage labels Nov 22, 2023
@problame problame assigned problame and unassigned problame Nov 22, 2023
@problame problame changed the title Epic: per-tenant rate limiting Epic: per-tenant throttling Nov 30, 2023
problame added a commit that referenced this issue Dec 4, 2023
…6000)

Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479 we want to switch the VirtualFile descriptor cache to a
`tokio::sync::RwLock`, which makes us return `Pending`. Without measures like
this patch, the VirtualFile descriptor cache thrashes heavily for ~2 hours
after PS restart, until all the logical size calculations have completed and
the degree of concurrent VirtualFile operations is back down to regular levels.
See the *Experiment* section below for details.


Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of the consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefinitely because there are TBD internal SLOs tied to metrics
collection happening in a timely manner
(neondatabase/cloud#7408). But let's ignore
that here.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}`).
PRs #5995 and #6018 add metrics for this.

A first investigation of the ratio led to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: #5993
- In interaction between Pageservers and Safekeepers:
#5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs is fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice we compute all logical sizes
anyway, since consumption metrics collects them.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(#5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.
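
As a sketch of that priority-boosting idea (simplified and hypothetical, not the actual pageserver code; only the name `GetLogicalSizePriority` is taken from this PR), the calculation task either queues on the background-tasks semaphore or, once an urgent caller has signalled interest, proceeds without a permit:

```rust
#![allow(dead_code)]
use std::sync::Arc;
use tokio::sync::{Notify, Semaphore};

/// Mirrors the PR's `GetLogicalSizePriority`; everything else is illustrative.
#[derive(Clone, Copy)]
enum GetLogicalSizePriority {
    Background,
    Urgent,
}

/// Caller side, roughly `Timeline::get_current_logical_size(priority, ...)`:
/// urgent callers (e.g. walreceiver) signal the calculation task to skip the queue.
fn get_current_logical_size(priority: GetLogicalSizePriority, urgency: &Notify) {
    if let GetLogicalSizePriority::Urgent = priority {
        urgency.notify_one();
    }
    // ... return the current (approximate or, if already computed, exact) size ...
}

/// One attempt of the initial logical size calculation task.
async fn initial_logical_size_attempt(background_limiter: Arc<Semaphore>, urgency: Arc<Notify>) {
    // Queue behind other background work, unless an urgent caller signals us,
    // in which case we proceed without a permit (the "priority boost").
    let _maybe_permit = tokio::select! {
        permit = background_limiter.acquire_owned() => Some(permit.expect("semaphore never closed")),
        _ = urgency.notified() => None,
    };
    // ... compute the logical size; on failure, the surrounding task retries
    // with a long back-off, as described above ...
}
```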

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file]: The consumption metrics worker persists
its reporting interval across PS restarts (to keep reporting intervals
consistent); delete the state file on disk to get predictable
(and, I believe, worst-case in terms of concurrency during PS restart)
behavior.

Before this patch, all of these timelines would do their initial
logical size calculation in parallel, leading to extreme thrashing in
the page cache and the virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique of the above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the number of high-priority
logical size computations.
If many high-priority logical size calculations are requested,
we'll fall over just as we would if #5479 were applied without this PR.
And currently, at the very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile descriptor cache thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size (neondatabase/cloud#8351), spreading load across more PSes,
etc.)

Conclusion
----------

Even with the remarks in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
#6010
3. it gets us close to shipping #5479 and #4744
problame added a commit that referenced this issue Dec 19, 2023
problame added a commit that referenced this issue Jan 11, 2024
)

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the original revert (which this PR now reverts) was
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.
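
To make that thrashing mechanic concrete, here is a self-contained toy model (not the real `VirtualFile` code): far more tasks than cache slots, each task yielding after every operation, so nearly every operation evicts another task's entry, which in the real system translates into an `open` syscall per operation:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

/// Toy stand-in for a fixed-size fd cache: maps a tenant to a cached "fd",
/// evicting an arbitrary victim when full.
struct SlotCache {
    capacity: usize,
    slots: HashMap<u32, u64>,
    evictions: u64,
}

impl SlotCache {
    fn get_or_open(&mut self, tenant: u32) -> u64 {
        if let Some(fd) = self.slots.get(&tenant) {
            return *fd; // hit: our fd survived since our last await point
        }
        if self.slots.len() >= self.capacity {
            // miss under pressure: evict some other task's fd, which costs
            // that task a re-open (an `open` syscall) later
            let victim = *self.slots.keys().next().unwrap();
            self.slots.remove(&victim);
            self.evictions += 1;
        }
        let fd = u64::from(tenant); // pretend we (re-)opened the file
        self.slots.insert(tenant, fd);
        fd
    }
}

#[tokio::main]
async fn main() {
    let cache = Arc::new(Mutex::new(SlotCache {
        capacity: 100,
        slots: HashMap::new(),
        evictions: 0,
    }));
    let tasks: Vec<_> = (0..10_000u32)
        .map(|tenant| {
            let cache = Arc::clone(&cache);
            tokio::spawn(async move {
                for _ in 0..10 {
                    // one "VirtualFile operation": look up our fd, possibly
                    // evicting another task's entry ...
                    let _fd = cache.lock().await.get_or_open(tenant);
                    // ... then hit an await point, letting the other 9_999
                    // tasks run (and evict us) before we are rescheduled
                    tokio::task::yield_now().await;
                }
            })
        })
        .collect();
    for t in tasks {
        t.await.unwrap();
    }
    let cache = cache.lock().await;
    println!("{} evictions out of {} operations", cache.evictions, 10_000 * 10);
}
```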

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation it seems much less
relevant in experiments, likely because of the random-access nature of
that workload.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fixing bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines and 9 tiny L0 layer files each; start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache-thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a goodput valley like the one seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
@problame problame self-assigned this Jan 22, 2024

jcsp commented Feb 5, 2024

Status:

  • bumped last week by walredo latency
  • will start after the io_uring work is wrapped up.

problame added a commit that referenced this issue Feb 6, 2024
Will need this to validate per-tenant throttling in
#5899

jcsp commented Feb 12, 2024

Initial draft PR is up for review -- could land this week.

Testing:

  • In staging we could enable throttling for everyone, but benchmarks would be unhappy; so we can set the limit to something very close to what the benchmarks currently do at peak.
  • Need to monitor the impact on CPU usage from the overhead of throttling, while realizing that throttling will also decrease CPU usage due to lower ops/s.


jcsp commented Feb 19, 2024

Status:

  • Initial code landed on Friday
  • Next: enable for all tenants in staging, watch for ~2 days and overlap with Peter's bench.
  • If staging is OK, identify troublesome tenants in prod (the small ones still running as k8s pods) + work with the compute/autoscaling team to remediate them. Target free-tier users who are generating outsized I/O.

problame added a commit that referenced this issue Feb 22, 2024
…age (#6869)

The `refill_interval` switched from a `usize` of milliseconds to a `Duration`
during a review follow-up, hence this slipped through manual testing.

Part of #5899

jcsp commented Feb 26, 2024

This week:

  • Figure out our plan for making our latency metrics exclude the throttling time, plus other metrics to measure the impact/activity of throttling (currently we just have logs).
  • Go with 20k in prod, with a 40k burst limit: recognizing that this is higher than we ultimately want it to be, but we don't want to clamp things like full postgres restarts to a lower 5k limit.
  • Write a Notion page explaining how to adjust the throttle for a tenant in the field.

problame added a commit that referenced this issue Feb 29, 2024
…questContextAdaptor` uses it (#6961)

Extracted from #6953

Part of #5899
problame added a commit that referenced this issue Mar 1, 2024
…in callers (#6960)

Extracted from #6953

Part of #5899

Core Change
-----------

In #6953, we need the ability to scan the log _after_ a specific line
and ignore anything before that line.

This PR changes `log_contains` to return a tuple of `(matching line,
cursor)`.
Hand that cursor to a subsequent `log_contains` call to search the log
for the next occurrence of the pattern.

Other Changes
-------------

- Inspect all the call sites of `log_contains` to handle the new tuple
return type.
- The above inspection revealed that many callers weren't using `assert
log_contains(...) is not None` but some weaker check that breaks if
`log_contains` ever returns a non-None but falsy value. Fix that.
- The above changes revealed that `test_remote_storage_upload_queue_retries`
was using `wait_until` incorrectly; after fixing the usage, I had to
raise the `wait_until` timeout. So, maybe this will fix its flakiness.

problame commented Mar 4, 2024

problame added a commit that referenced this issue Mar 5, 2024
… metrics + regression test (#6953)

part of #5899

Problem
-------

Before this PR, the time spent waiting on the throttle was charged
towards the higher-level page_service metrics, i.e.,
`pageserver_smgr_query_seconds`.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.

Changes
-------


- don't charge time spent in throttle towards the page_service metrics
- record time spent in throttle in `RequestContext` and subtract it from the elapsed time (see the sketch after this list)
  - this works because the page_service path doesn't create child contexts, so all the throttle time is recorded in the parent
  - it's quite brittle and will break if we ever decide to spawn child tasks that need child `RequestContext`s, which would have separate instances of the `micros_spent_throttled` counter
  - however, let's punt that to a more general refactoring of `RequestContext`
- add a test case that ensures that
  - throttling happens for getpage requests; this aspect of the test passed before this PR
  - throttling delays aren't charged towards the page_service metrics; this aspect of the test only passes with this PR
- drive-by: make the throttle log message `info!`; it's an expected condition
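
The accounting described above, reduced to a hypothetical sketch: only the `micros_spent_throttled` counter name is taken from this PR; `RequestContext` here is a bare stand-in for the real type, and the metric callback is made up.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Illustrative stand-in for the pageserver's `RequestContext`; only the
/// counter named in this PR is shown, everything else is omitted.
struct RequestContext {
    micros_spent_throttled: AtomicU64,
}

impl RequestContext {
    /// Called by the throttle each time this request waits.
    fn record_throttled(&self, waited: Duration) {
        self.micros_spent_throttled
            .fetch_add(waited.as_micros() as u64, Ordering::Relaxed);
    }
}

/// Report smgr query latency with the throttle wait subtracted out, so a
/// throttled tenant doesn't drag down `pageserver_smgr_query_seconds`.
/// This only works because the page_service path records all throttle waits
/// into this same (parent) context; there are no child RequestContexts.
fn observe_smgr_query_seconds(ctx: &RequestContext, started_at: Instant, observe: impl Fn(f64)) {
    let elapsed = started_at.elapsed();
    let throttled = Duration::from_micros(ctx.micros_spent_throttled.load(Ordering::Relaxed));
    let charged = elapsed.saturating_sub(throttled);
    observe(charged.as_secs_f64());
}

fn main() {
    let ctx = RequestContext { micros_spent_throttled: AtomicU64::new(0) };
    let started_at = Instant::now();
    ctx.record_throttled(Duration::from_millis(5)); // pretend the throttle delayed us
    observe_smgr_query_seconds(&ctx, started_at, |secs| println!("charged {secs:.6}s"));
}
```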

Performance
-----------

I took the same measurements as in #6706 , no meaningful change in CPU
overhead.

Future Work
-----------

This PR enables us to experiment with the throttle for select tenants
without affecting the SLI metrics / triggering SLO alerts.

Before declaring this feature done, we need more work to happen,
specifically:

- decide whether we want to retain the flexibility of throttling any `Timeline::get` call, filtered by `TaskKind`
  - versus: separate throttles for each page_service endpoint, potentially with separate config options
  - the trouble here is that this decision implies changes to the `TenantConfig`, so, if we start using the current config style now and then decide to switch to a different config, it'll be a breaking change

Nice-to-haves but probably not worth the time right now:

- Equivalent tests to ensure the throttle applies to all other
page_service handlers.

jcsp commented Mar 11, 2024

Status:

  • We alert if a tenant is throttled on >2% of requests
    • Actions on the alerts:
      • if a client is a legacy pod that should be migrated to an autoscaled VM, migrate it
      • if a client is legitimately high throughput and large: shard it.
  • Reconciling with vectored get changes to ensure we aren't double throttling.

@problame

Apart from

Reconciling with vectored get changes to ensure we aren't double throttling.

nothing happened last week.

This week:

  • investigate Sentry panic from prod
  • notion page explaining how to adjust the throttle for a tenant in the field.
  • examine amount of throttling in prod => take action (see previous comment)


problame commented Mar 25, 2024

Status update:

This week:

  • examination of throttling in prod

@problame

Status update:

  • no progress this week

@problame problame changed the title Epic: per-tenant throttling Epic: per-tenant read path throttling Apr 30, 2024
@problame

I split off the write throttling aspect of this epic into a separate draft epic: #7564

(We do not expect to work on write throttling this quarter)
