logical size limit is broken during PS restart #5963

Open
9 of 13 tasks
problame opened this issue Nov 28, 2023 · 3 comments
Labels
c/storage/pageserver (Component: storage: pageserver), c/storage/safekeeper (Component: storage: safekeeper), c/storage (Component: storage), t/bug (Issue Type: Bug)

Comments


problame commented Nov 28, 2023

Problem

Logical size is part of PageserverFeedback, which is sent from PS to SK so that SK can enforce the project's logical size limit:

// Regular standby_status_update fields are put into this message.
let (timeline_logical_size, _) = timeline
    .get_current_logical_size(&ctx)
    .context("Status update creation failed to get current logical size")?;
let status_update = PageserverFeedback {
    current_timeline_size: timeline_logical_size,
    last_received_lsn,
    // ...

Logical size is calculated lazily.
Until that lazy calculation has finished, the returned value is only the logical size delta accumulated since PS startup; if it's negative, we currently round it to 0.
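
For illustration, a minimal self-contained sketch of that behavior, using hypothetical names rather than the actual pageserver types:

```rust
/// Hypothetical stand-in for the lazily computed logical size described above.
enum LogicalSize {
    /// Base size has been computed; the value is `base + delta`.
    Exact(u64),
    /// Base size not yet computed; we only know the delta since PS startup.
    Approximate(i64),
}

impl LogicalSize {
    fn current(&self) -> u64 {
        match self {
            LogicalSize::Exact(size) => *size,
            // Before the base is known we can only report the delta,
            // clamped to 0 if it is negative (e.g. after deletes/truncations).
            LogicalSize::Approximate(delta) => (*delta).max(0) as u64,
        }
    }
}

fn main() {
    // Right after restart: 10 MiB were deleted since startup, base unknown.
    let approx = LogicalSize::Approximate(-10 * 1024 * 1024);
    // Reported size is 0, even though the real logical size may be huge.
    assert_eq!(approx.current(), 0);

    // Once the base calculation finishes, the exact value is reported.
    let exact = LogicalSize::Exact(500 * 1024 * 1024);
    assert_eq!(exact.current(), 500 * 1024 * 1024);
}
```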

The (quite common) worst case: whenever we restart the PS, there's a window in which we report a logical size that is way below the actual logical size, likely near 0. This allows a project to go over its logical size limit. Once we're done calculating, we report the correct value, but at that point the user may already be over the size limit, which means they're using more logical size than they're allowed (and paying for?).

Fixing This

We should not start walreceiver connections to SKs until we have an accurate logical size.

The challenge is that the logical size needs to be available quickly because walreceiver connection establishment is on the user-visible path, i.e., it's a latency-bound task.

Design Idea 1

  • Persistently cache the incremental logical size on disk and re-use it during startup.

  • Implement probabilistic invalidation of the cache

  • Implement probabilistic re-calculation of the base logical size because we don't fully trust the incremental logical size calculation (do we really not?)

  • Eventually: trust-but-verify the incremental logical size calculation, i.e., trust it but have something (a local probabilistic checker, control plane, whatever) trigger checks that would log errors & correct it.

(As a follow-up, also think about how this change impacts synthetic logical size calculations)
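
A rough sketch of what Design Idea 1 could look like. All names and the on-disk format are hypothetical (the real implementation would live in the pageserver's own metadata handling); it only illustrates the cache-plus-probabilistic-invalidation idea:

```rust
use std::fs;
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Crude std-only stand-in for a real RNG; good enough for a sketch.
fn coin_flip(probability: f64) -> bool {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos();
    (nanos as f64 / 1e9) < probability
}

/// Returns the cached logical size, or `None` if there is no cache or the
/// probabilistic invalidation decided to force a fresh base calculation.
fn load_cached_logical_size(path: &Path, invalidation_probability: f64) -> Option<u64> {
    if coin_flip(invalidation_probability) {
        return None;
    }
    fs::read_to_string(path).ok()?.trim().parse().ok()
}

/// Persist the incrementally maintained logical size so the next startup can
/// serve an accurate value immediately.
fn store_cached_logical_size(path: &Path, size: u64) -> std::io::Result<()> {
    fs::write(path, size.to_string())
}

fn main() -> std::io::Result<()> {
    let path = Path::new("logical_size.cache");
    store_cached_logical_size(path, 42 * 1024 * 1024)?;
    // With a 1% invalidation probability this almost always reuses the cache;
    // occasionally it forces a re-calculation of the base logical size.
    match load_cached_logical_size(path, 0.01) {
        Some(size) => println!("reusing cached logical size: {size} bytes"),
        None => println!("cache invalidated, recomputing base logical size"),
    }
    Ok(())
}
```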

Design Idea 2

  • No persistent caching
  • Before establishing walreceiver connection, calculate the logical size.
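
A minimal sketch of Design Idea 2, assuming tokio and using placeholder types (`Timeline`, `calculate_logical_size_exact`, `start_walreceiver_connection` are stand-ins, not the real pageserver API): the walreceiver connection is only opened once an exact logical size exists, so the first `PageserverFeedback` already carries an accurate value.

```rust
use std::time::Duration;

struct Timeline;

impl Timeline {
    /// Stand-in for the (potentially slow) base logical size calculation.
    async fn calculate_logical_size_exact(&self) -> u64 {
        tokio::time::sleep(Duration::from_millis(50)).await;
        123 * 1024 * 1024
    }

    async fn start_walreceiver_connection(&self, initial_logical_size: u64) {
        println!("connecting to SK; first feedback reports {initial_logical_size} bytes");
    }
}

#[tokio::main]
async fn main() {
    let timeline = Timeline;
    // Delay the connection until the exact size is known. Since this is on
    // the user-visible path, the calculation has to be fast.
    let exact = timeline.calculate_logical_size_exact().await;
    timeline.start_walreceiver_connection(exact).await;
}
```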
problame added the t/bug, c/storage/safekeeper, c/storage/pageserver, and c/storage labels Nov 28, 2023
problame self-assigned this Nov 29, 2023
problame added a commit that referenced this issue Nov 30, 2023
These will help us answer questions such as:
- when & at what rate do calculations get started after PS restart?
- how often is the api to get current incrementally-computed logical
  size called, and does it return Exact vs Approximate?

I'd also be interested in a histogram of how much wall clock
time size calculations take, but I don't know good bucket sizes,
and logging it would introduce yet another per-timeline log
message during startup; I don't think that's worth it just yet.

Context

- https://neondb.slack.com/archives/C033RQ5SPDH/p1701197668789769
- #5962
- #5963
- #5955
- neondatabase/cloud#7408
problame added a commit that referenced this issue Nov 30, 2023

hlinnaka commented Nov 30, 2023

Idea 3: Store the logical size persistently as a separate key-value pair in the storage.

Whenever a relation is extended or truncated, also update the logical size key-value pair during WAL ingestion.

That makes it fast to access the logical size, at any point in time, with no special caching required. The downside is that it adds work to the WAL ingestion codepath instead. Don't know how significant that is, but given how much trouble the logical size calculations are causing us, it might be the right tradeoff.
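
A minimal sketch of what Idea 3 could look like, with a hypothetical in-memory key-value store standing in for the real storage layer: WAL ingestion updates a dedicated logical-size key on every relation extension/truncation, and reading the logical size becomes a plain point lookup.

```rust
use std::collections::HashMap;

const LOGICAL_SIZE_KEY: &str = "logical_size";

struct KvStore {
    map: HashMap<String, u64>,
}

impl KvStore {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Called from the WAL ingestion path whenever a relation grows or shrinks.
    fn apply_relation_size_change(&mut self, delta_bytes: i64) {
        let current = self.map.get(LOGICAL_SIZE_KEY).copied().unwrap_or(0);
        let updated = (current as i64 + delta_bytes).max(0) as u64;
        self.map.insert(LOGICAL_SIZE_KEY.to_string(), updated);
    }

    /// Fast at any point in time: no lazy recomputation, no special caching.
    fn logical_size(&self) -> u64 {
        self.map.get(LOGICAL_SIZE_KEY).copied().unwrap_or(0)
    }
}

fn main() {
    let mut store = KvStore::new();
    store.apply_relation_size_change(8192); // relation extended by one page
    store.apply_relation_size_change(-8192); // relation truncated again
    assert_eq!(store.logical_size(), 0);
}
```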

problame added a commit that referenced this issue Dec 1, 2023
problame added a commit that referenced this issue Dec 1, 2023
fixes #5963

On top of #6000

Will ship this in a release after #600
problame added a commit that referenced this issue Dec 4, 2023
…6000)

Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.

Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefinitely because there are TBD internal SLOs tied to metrics
collection happening in a timely manner
(neondatabase/cloud#7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}`).
PRs #5995 and #6018 add metrics for this.

A first investigation of the ratio led to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: #5993
- In interaction between Pageservers and Safekeepers:
#5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs are fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice, we compute all logical sizes
anyways, because consumption metrics collects it.
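
A hypothetical tokio-based sketch of that eager-spawn-plus-retry scheme (placeholder names, not the actual task_mgr code):

```rust
use std::time::Duration;

/// Stand-in for the real calculation, which can fail transiently and is retried.
async fn try_calculate_initial_logical_size() -> Result<u64, &'static str> {
    Ok(123 * 1024 * 1024)
}

async fn initial_logical_size_task() {
    let mut backoff = Duration::from_secs(1);
    loop {
        match try_calculate_initial_logical_size().await {
            Ok(size) => {
                println!("initial logical size: {size} bytes");
                return;
            }
            Err(err) => {
                eprintln!("initial logical size calculation failed: {err}; retrying");
                tokio::time::sleep(backoff).await;
                // Long back-off, capped, so retries don't pile up.
                backoff = (backoff * 2).min(Duration::from_secs(600));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Spawned eagerly as part of timeline activation, not lazily on the first
    // get_current_logical_size() call.
    tokio::spawn(initial_logical_size_task()).await.unwrap();
}
```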

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(#5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.
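
A minimal sketch of the limiting-plus-priority-boosting behavior, assuming tokio; only the name `GetLogicalSizePriority` comes from this PR, the semaphore size and everything else are placeholders:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[derive(Clone, Copy)]
enum GetLogicalSizePriority {
    /// Walreceiver needs an exact value for PageserverFeedback: skip the limiter.
    User,
    /// Consumption metrics etc.: queue behind the background-loop limiter.
    Background,
}

async fn calculate_initial_logical_size(
    limiter: Arc<Semaphore>,
    priority: GetLogicalSizePriority,
) -> u64 {
    // High-priority callers bypass the concurrency limiter so the exact value
    // is not provided any later than before this change.
    let _permit = match priority {
        GetLogicalSizePriority::User => None,
        GetLogicalSizePriority::Background => Some(limiter.acquire_owned().await.unwrap()),
    };
    // Stand-in for the actual (expensive) calculation.
    123 * 1024 * 1024
}

#[tokio::main]
async fn main() {
    // E.g. at most 4 background calculations running concurrently.
    let limiter = Arc::new(Semaphore::new(4));

    // Consumption metrics: queues behind the limiter.
    let bg = tokio::spawn(calculate_initial_logical_size(
        Arc::clone(&limiter),
        GetLogicalSizePriority::Background,
    ));

    // Walreceiver: skips the limiter to get an exact value quickly.
    let size = calculate_initial_logical_size(limiter, GetLogicalSizePriority::User).await;
    println!("walreceiver sees logical size: {size} bytes");
    println!("background calculation got: {} bytes", bg.await.unwrap());
}
```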

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file]: The consumption metrics worker persists
its reporting interval across PS restarts; delete the state file on disk to
get predictable (and, I believe, worst-case in terms of concurrency during
PS restart) behavior.

Before this patch, all of these timelines would do their initial
logical size calculation in parallel, leading to extreme thrashing in
the page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique of the above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority
logical size computations.
If there are many high-priority logical size calculations requested,
we'll fall over like we did if #5479 is applied without this PR.
And currently, at the very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size (neondatabase/cloud#8351), spread load across more PSes,
etc.)

Conclusion
----------

Even with the remarks from the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
#6010
3. it gets us close to shipping #5479 and #4744
@problame
Contributor Author

Meeting notes today:

problame removed their assignment Mar 25, 2024
problame self-assigned this Apr 12, 2024

jcsp commented Jun 19, 2024

We anticipate persisting snapshots of timeline logical sizes to remote storage in the near future to enable hibernated timelines (#8088), which should also enable us to ensure that we always have a logical size for a timeline. This may lag ingest a little after restart, but it will eliminate the 0-logical-size phase.

problame removed their assignment Jun 19, 2024