concurrency-limit initial logical size calculation #5955

Closed

Conversation

problame (Contributor)

Before this patch, there was no concurrency limit on initial logical
size computations.

In an experiment with a PS hosting 20k tenants, 1 timeline each,
all tenants inactive in the SKs / not present in the storage broker,
all logical size calculations are spawned by MetricsCollection,
i.e., the consumption metrics worker.

Before this patch, these timelines would all do their initial logical
size calculation in parallel, leading to extreme thrashing in page cache
and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open` system calls/second to ~500
`open` system calls/second during loading).

This patch uses the existing background tasks semaphore to limit
concurrency, which is generally the right call for background activity.
However, due to logical size's involvement in PageserverFeedback towards
safekeepers, I think we need a priority-boosting mechanism, e.g., if
we're still calculating but walreceiver is actively asking, skip the
semaphore. That's fairly easy to implement, but I want some feedback
on the general idea first before implementing it.
See also the FIXME in the block comment added in this commit.
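
For illustration, here is a minimal sketch (not the code in this patch) of the
general shape, with a hypothetical `priority_boost` flag standing in for the
walreceiver bypass discussed above:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Hypothetical stand-in for the expensive initial logical size walk.
async fn expensive_size_walk() -> anyhow::Result<u64> {
    Ok(0)
}

/// Sketch: cap concurrent initial-size calculations at the semaphore's
/// permit count, optionally skipping the queue (the FIXME idea above).
async fn initial_logical_size(
    background_permits: Arc<Semaphore>,
    priority_boost: bool, // e.g. walreceiver is actively waiting for the value
) -> anyhow::Result<u64> {
    // Holding the permit for the whole computation is what limits concurrency;
    // it is released when `_permit` is dropped at the end of the function.
    let _permit = if priority_boost {
        None
    } else {
        Some(background_permits.acquire_owned().await?)
    };
    expensive_size_walk().await
}
```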

NB: when evaluating, keep in mind that the consumption metrics worker
persists its interval across restarts; delete the state file on disk
to get predictable (and, I believe, worst-case in terms of concurrency
during PS restart) behavior.

//
// FIXME: with the current code, walreceiver requests would also hit this semaphore
// and get queued behind other background operations. That's bad because walreceiver_connection
// will push the not-precise value as `current_timeline_size` in the `PageserverFeedback`
@koivunej (Contributor), Nov 28, 2023
I think it would be zero; there is no not-precise value anymore. If I recall correctly, this will lead to no throttling, or to not registering the new value if a higher value has already been calculated.

Comment on lines +1843 to +1845
// Example query to show different causes of initial size calculation spawning:
//
// https://neonprod.grafana.net/explore?panes=%7B%22wSx%22:%7B%22datasource%22:%22grafanacloud-logs%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22sum%20by%20%28task_kind%29%20%28count_over_time%28%7Bneon_service%3D%5C%22pageserver%5C%22,%20neon_region%3D%5C%22us-west-2%5C%22%7D%20%7C%3D%20%60logical%20size%20computation%20from%20context%20of%20task%20kind%60%20%7C%20regexp%20%60logical%20size%20computation%20from%20context%20of%20task%20kind%20%28%3FP%3Ctask_kind%3E.%2A%29%60%20%5B1m%5D%29%29%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22grafanacloud-logs%22%7D,%22editorMode%22:%22code%22,%22step%22:%221m%22%7D%5D,%22range%22:%7B%22from%22:%221700637500615%22,%22to%22:%221700639648743%22%7D%7D%7D&schemaVersion=1&orgId=1
Contributor
Would not include this; it's of no use here.

Suggested change
(remove the three quoted example-query comment lines above)

2394 tests run: 2298 passed, 1 failed, 95 skipped (full report)


Failures on Postgres 15

  • test_branching_with_pgbench[cascade-1-10]: debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_branching_with_pgbench[debug-pg15-cascade-1-10]"
Flaky tests (5)

Postgres 16

  • test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: debug
  • test_emergency_mode: release
  • test_pitr_gc: debug

Postgres 14

  • test_gc_cutoff: debug
  • test_remote_storage_upload_queue_retries: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results
85445cd at 2023-11-28T16:33:51.260Z

@problame self-assigned this Nov 29, 2023
problame added a commit that referenced this pull request Nov 30, 2023
These will help us answer questions such as:
- when, and at what rate, do calculations get started after PS restart?
- how often is the API for getting the current incrementally-computed logical
  size called, and does it return Exact or Approximate?

I'd also be interested in a histogram of how much wall-clock
time size calculations take, but I don't know good bucket sizes,
and logging it would introduce yet another per-timeline log
message during startup; I don't think that's worth it just yet.

Context

- https://neondb.slack.com/archives/C033RQ5SPDH/p1701197668789769
- #5962
- #5963
- #5955
- neondatabase/cloud#7408
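
As an illustration of what such metrics could look like (hypothetical metric
names, not necessarily the ones the commit adds), labelled counters per causing
task kind and per returned accuracy might be registered with the prometheus
crate roughly like this:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Hypothetical metric names; the commit's actual metrics may differ.
static INITIAL_SIZE_CALC_STARTED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_initial_logical_size_calculations_started_total",
        "Initial logical size calculations started, by the task kind that caused them",
        &["task_kind"]
    )
    .expect("failed to register metric")
});

static CURRENT_SIZE_READS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_current_logical_size_reads_total",
        "Reads of the incrementally computed logical size, by returned accuracy",
        &["accuracy"] // "exact" or "approximate"
    )
    .expect("failed to register metric")
});

// A call site would then bump the counter like this:
fn on_initial_size_calculation_started(task_kind: &str) {
    INITIAL_SIZE_CALC_STARTED
        .with_label_values(&[task_kind])
        .inc();
}
```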
problame added a commit that referenced this pull request Nov 30, 2023
@problame (Contributor, Author)

v2 in #6000

@problame closed this Nov 30, 2023
problame added a commit that referenced this pull request Dec 1, 2023