storage controller: test for large shard counts #7475
Conversation
2955 tests run: 2821 passed, 0 failed, 134 skipped (full report)

Flaky tests (2): Postgres 15, Postgres 14

Code coverage* (full report)

* collected from Rust tests only. The comment gets automatically updated with the latest test results.

a2ff93b at 2024-04-30T15:35:30.523Z :recycle:
force-pushed from d63de7a to ea786ac
On my workstation:
force-pushed from ea786ac to 7040a39
## Problem

Storage controller memory can spike very high if we have many tenants and they all try to reconcile at the same time.

Related:
- #7463
- #7460

Not closing those issues in this PR, because the test coverage for them will be in #7475.

## Summary of changes

- Add a CLI arg `--reconciler-concurrency`, defaulted to 128.
- Add a semaphore to `Service` with this many units.
- In `maybe_reconcile_shard`, try to acquire a semaphore unit. If we can't get one, return a `ReconcileWaiter` for a future sequence number, and push the `TenantShardId` onto a channel of delayed IDs.
- In `process_result`, consume from the channel of delayed IDs if semaphore units are available, and call `maybe_reconcile_shard` again for those delayed shards. (A sketch of this gating pattern follows below.)

This has been tested in #7475, but that PR will land separately because it contains other changes and its test needs stabilizing. This change is worth merging sooner, because it fixes a practical issue with larger shard counts.
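The gating pattern described above might look roughly like this. This is a minimal sketch using tokio: `run_reconciler` and the `TenantShardId` type are simplified placeholders, and the real code returns a `ReconcileWaiter` rather than spawning directly.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore, TryAcquireError};

#[derive(Clone, Copy, Debug)]
struct TenantShardId(u32); // placeholder for the real ID type

async fn run_reconciler(_shard: TenantShardId) {
    // stand-in for the actual Reconciler work
}

struct Service {
    // Bounds the number of concurrently running Reconcilers
    // (--reconciler-concurrency, default 128).
    reconciler_concurrency: Arc<Semaphore>,
    // Shards that couldn't get a unit wait here until process_result
    // retries them.
    delayed_reconcile_tx: mpsc::Sender<TenantShardId>,
}

impl Service {
    fn maybe_reconcile_shard(&self, shard_id: TenantShardId) {
        match Arc::clone(&self.reconciler_concurrency).try_acquire_owned() {
            Ok(permit) => {
                // Run the Reconciler; dropping the permit at the end
                // releases the semaphore unit.
                tokio::spawn(async move {
                    run_reconciler(shard_id).await;
                    drop(permit);
                });
            }
            Err(TryAcquireError::NoPermits) => {
                // Saturated: queue the shard for a later attempt.
                let _ = self.delayed_reconcile_tx.try_send(shard_id);
            }
            Err(TryAcquireError::Closed) => {
                // Semaphore closed: we are shutting down.
            }
        }
    }
}
```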
These are testability/logging improvements spun off from #7475:

- Don't log warnings for shutdown errors in the compute hook.
- Revise logging around heartbeats and `reconcile_all` so that we aren't emitting such a large volume of INFO messages under normal quiet conditions.
- Clean up the `last_error` of `TenantShard` to hold a `ReconcileError` instead of a `String`, and use that properly typed error to suppress reconciler cancel errors during `reconcile_all_now`. This is important for tests that call it iteratively, as otherwise they would get 500 errors when some in-flight reconciler was cancelled (perhaps because a state change on the tenant shard started a new reconciler). A sketch of the typed-error idea follows below.
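For illustration, here is a hypothetical shape for the typed error; the real `ReconcileError` in the codebase may differ. The point is that a structured error lets callers distinguish cancellation from genuine failure without string-matching:

```rust
// Hypothetical shape; the real ReconcileError may differ.
#[derive(Debug)]
enum ReconcileError {
    // The reconciler was cancelled, e.g. superseded by a newer
    // reconciliation for the same shard: not a real failure.
    Cancel,
    // A genuine error, carried here as a boxed source error.
    Other(Box<dyn std::error::Error + Send + Sync>),
}

// With `last_error: Option<String>`, reconcile_all_now could only
// string-match to tell cancellations apart from failures. With a
// typed error, suppressing Cancel is explicit:
fn should_return_500(last_error: &Option<ReconcileError>) -> bool {
    matches!(last_error, Some(ReconcileError::Other(_)))
}
```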
## Problem

Previously, we tried to send compute notifications in `startup_reconcile` before completing that function, with a time limit. Any notifications that don't happen within the time limit result in tenants having their `pending_compute_notification` flag set, which causes them to spawn a `Reconciler` the next time the background reconciler loop runs.

This causes two problems:
- Spawning a lot of reconcilers after startup caused a spike in memory (this is addressed in #7493).
- After #7493, spawning lots of reconcilers will block some other operations, e.g. a tenant creation might fail for lack of reconciler semaphore units while the controller is busy running all the Reconcilers for its startup compute notifications.

When the code was first written, `ComputeHook` didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in #7088, we can use it to avoid waiting for notifications to complete in `startup_reconcile`.

Related to: #7460

## Summary of changes

- Add a `notify_background` method to `ComputeHook`.
- Call this from `startup_reconcile` instead of doing notifications inline.
- Process completions from `notify_background` in `process_results`; if a notification failed, set the `pending_compute_notification` flag on the shard.

The result is that we will only spawn lots of Reconcilers if the compute notifications _fail_, not just because they take a significant amount of time. Test coverage for this case is in #7475. (A sketch of the background-notification shape follows below.)
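A minimal sketch of what the background-notification shape could look like, assuming tokio channels; `send_notification`, the `ComputeHook` body, and the result tuple are simplified placeholders, not the real API. The real `ComputeHook` also provides the per-shard ordering from #7088, which this sketch omits:

```rust
use std::sync::Arc;
use tokio::sync::mpsc;

#[derive(Clone, Copy, Debug)]
struct TenantShardId(u32); // placeholder for the real ID type

struct ComputeHook;

impl ComputeHook {
    async fn send_notification(&self, _shard: TenantShardId) -> Result<(), ()> {
        Ok(()) // stand-in for the real notification request
    }

    // Kick off notifications without blocking the caller
    // (startup_reconcile); outcomes flow back over the channel and are
    // consumed in process_results, which sets
    // pending_compute_notification on failure so a later Reconciler
    // retries.
    fn notify_background(
        self: Arc<Self>,
        shards: Vec<TenantShardId>,
        result_tx: mpsc::Sender<(TenantShardId, bool)>,
    ) {
        for shard_id in shards {
            let hook = Arc::clone(&self);
            let tx = result_tx.clone();
            tokio::spawn(async move {
                let ok = hook.send_notification(shard_id).await.is_ok();
                let _ = tx.send((shard_id, ok)).await;
            });
        }
    }
}
```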
force-pushed from 7040a39 to b360551
force-pushed from b360551 to cb9b4fa
Nice test!
When running on slower nodes in CI, this demonstrated that the default heartbeat period is currently not long enough to gracefully handle a pageserver restart. Opened #7552 to track this.
Looks good to me once tests pass.
## Problem

`init_tenant_mgr` blocks the rest of pageserver startup, including starting the admin API. This was noticeable in #7475, where the `init_tenant_mgr` runtime could be long enough to trip the controller's 30-second heartbeat timeout.

## Summary of changes

- When detaching tenants during startup, spawn the background deletes as background tasks instead of doing them inline.
- Write all configs before spawning any tenants, so that the config writes aren't fighting tenants for system resources.
- Write configs with some concurrency (16) rather than writing them all sequentially, as in the sketch below.
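A sketch of the bounded-concurrency write, assuming the `futures` crate; the limit of 16 comes from the PR text, while `TenantId` and `write_tenant_config` are placeholders:

```rust
use futures::stream::{self, StreamExt};

#[derive(Clone, Copy, Debug)]
struct TenantId(u32); // placeholder for the real ID type

async fn write_tenant_config(_id: TenantId) -> std::io::Result<()> {
    Ok(()) // stand-in for writing the on-disk tenant config
}

const CONFIG_WRITE_CONCURRENCY: usize = 16;

// Write all tenant configs with bounded concurrency rather than one
// at a time, so startup isn't serialized on config I/O.
async fn write_all_configs(tenant_ids: Vec<TenantId>) {
    stream::iter(tenant_ids)
        .for_each_concurrent(CONFIG_WRITE_CONCURRENCY, |id| async move {
            if let Err(e) = write_tenant_config(id).await {
                eprintln!("failed to write config for {id:?}: {e}");
            }
        })
        .await;
}
```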
Problem
Storage controller was observed to have unexpectedly large memory consumption when loaded with many thousands of shards.
This was recently fixed:
...but we need a general test that the controller is well behaved with thousands of shards.
Closes: #7460
Closes: #7463
Summary of changes
In the test with 8000 shards, resident memory works out to about 20 KiB per shard. This is not really per-shard memory: the primary source of memory growth is the number of concurrent network/db clients we create.
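As a rough sanity check of that figure: 8000 shards × 20 KiB ≈ 160,000 KiB ≈ 156 MiB of resident memory for the whole test.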
With 8000 shards, the test takes 125s to run on my workstation.