
storage controller: test for large shard counts #7475

Merged: 15 commits into main from jcsp/storcon-scale-test, Apr 30, 2024

Conversation

@jcsp (Contributor) commented on Apr 23, 2024

Problem

Storage controller was observed to have unexpectedly large memory consumption when loaded with many thousands of shards.

This was recently fixed:

...but we need a general test that the controller is well behaved with thousands of shards.

Closes: #7460
Closes: #7463

Summary of changes

  • Add a new test, test_storage_controller_many_tenants, to exercise the system's behaviour with a more substantial workload. This test measures memory consumption and reproduces "storage controller using ~500kb memory per shard" (#7460) before the other changes in this PR.
  • Tweak reconcile_all's return value so that it is nonzero if it spawns no reconcilers but would have spawned some had they not been blocked by the reconcile concurrency limit. This makes the test's reconcile_until_idle behave as expected, i.e. it does not complete until the system is nice and calm (see the sketch after this list).
  • Fix an issue where migrating a tenant to a location that was not already its secondary would leave a spurious secondary location behind (an existing low-impact bug that tripped up the test's consistency checks).
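
For illustration, a minimal sketch of that return-value behaviour. `ReconcileOutcome` and the standalone `reconcile_all`/`reconcile_until_idle` helpers below are hypothetical simplifications, not the controller's actual code:

```rust
// Delayed reconciles (blocked by the concurrency limit) still count as pending
// work, so callers polling "until idle" don't stop while shards are queued.
#[derive(Clone, Copy)]
enum ReconcileOutcome {
    Spawned,   // a Reconciler was started for this shard
    Delayed,   // would reconcile, but blocked by the reconcile concurrency limit
    NotNeeded, // shard already matches its intended state
}

fn reconcile_all(outcomes: &[ReconcileOutcome]) -> usize {
    outcomes
        .iter()
        .filter(|o| matches!(o, ReconcileOutcome::Spawned | ReconcileOutcome::Delayed))
        .count()
}

fn reconcile_until_idle(mut poll: impl FnMut() -> usize) {
    // Keep polling until a pass neither spawns nor delays any reconciles.
    while poll() > 0 {
        std::thread::sleep(std::time::Duration::from_millis(100));
    }
}
```

The key point is that a pass which spawns nothing but still has shards waiting behind the concurrency limit reports a nonzero count.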

In the test with 8000 shards, the resident memory per shard is about 20KiB. This is not really per-shard memory: the primary source of memory growth is the number of concurrent network/db clients we create.

With 8000 shards, the test takes 125s to run on my workstation.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp added the run-benchmarks label (indicates to CI that benchmarks should be run for this PR) on Apr 23, 2024
github-actions bot commented on Apr 23, 2024

2955 tests run: 2821 passed, 0 failed, 134 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]: release

Code coverage* (full report)

  • functions: 28.1% (6574 of 23355 functions)
  • lines: 46.9% (46728 of 99639 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a2ff93b at 2024-04-30T15:35:30.523Z

@jcsp (Contributor, Author) commented on Apr 24, 2024

On my workstation:

  • Runtime of the test is 1125.35s (8000 tenants with 2 shards each).
  • Most of that runtime is the reconciles that happen after turning the world upside-down by marking all the nodes offline and then active again. I'm not sure exactly where the slowness is coming from: spot-checking individual shards' migrations, I see them taking less than a second each, so at 16,000 shards with the default reconcile concurrency of 128 we should be doing them all in ~120 seconds, but it's taking several times longer than that. Not a blocker for this PR, but worth figuring out in future, as the rate at which we execute migrations will limit how fast we can drain nodes when we're in a hurry.

jcsp added a commit that referenced this pull request Apr 25, 2024
## Problem

Storage controller memory can spike very high if we have many tenants
and they all try to reconcile at the same time.

Related:
- #7463
- #7460

Not closing those issues in this PR, because the test coverage for them
will be in #7475

## Summary of changes

- Add a CLI arg `--reconciler-concurrency`, defaulted to 128
- Add a semaphore to Service with this many units
- In `maybe_reconcile_shard`, try to acquire a semaphore unit. If we can't get one, return a ReconcileWaiter for a future sequence number and push the TenantShardId onto a channel of delayed IDs (see the sketch below).
- In `process_result`, consume from the channel of delayed IDs if there are semaphore units available and call `maybe_reconcile_shard` again for these delayed shards.

This has been tested in #7475, but that PR will land separately because it contains other changes and its test needs stabilizing. This change is worth merging sooner because it fixes a practical issue with larger shard counts.
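
A simplified sketch of the semaphore-and-delayed-channel mechanism described above. The `Service` fields and control flow here are assumptions for illustration; the real `maybe_reconcile_shard` also hands back a ReconcileWaiter for a future sequence number, which is omitted, and a tokio runtime is assumed:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

#[derive(Clone, Copy, Debug)]
struct TenantShardId(u64);

struct Service {
    // Sized from --reconciler-concurrency (default 128).
    reconcile_permits: Arc<Semaphore>,
    delayed_tx: mpsc::UnboundedSender<TenantShardId>,
    delayed_rx: mpsc::UnboundedReceiver<TenantShardId>,
}

impl Service {
    fn maybe_reconcile_shard(&self, shard: TenantShardId) {
        match self.reconcile_permits.clone().try_acquire_owned() {
            Ok(permit) => {
                // Spawn the Reconciler; the permit is released when the task finishes.
                tokio::spawn(async move {
                    // ... run the reconcile for `shard` here ...
                    let _ = shard;
                    drop(permit);
                });
            }
            Err(_) => {
                // Concurrency limit reached: remember this shard and retry it later.
                let _ = self.delayed_tx.send(shard);
            }
        }
    }

    fn process_result(&mut self) {
        // After a reconcile completes, pull delayed shards while permits are free.
        while self.reconcile_permits.available_permits() > 0 {
            match self.delayed_rx.try_recv() {
                Ok(shard) => self.maybe_reconcile_shard(shard),
                Err(_) => break,
            }
        }
    }
}
```
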
jcsp added a commit that referenced this pull request Apr 26, 2024
These are testability/logging improvements spun off from #7475:

- Don't log warnings for shutdown errors in the compute hook
- Revise logging around heartbeats and reconcile_all so that we aren't emitting such a large volume of INFO messages under normal quiet conditions
- Clean up the `last_error` of TenantShard to hold a ReconcileError instead of a String, and use that properly typed error to suppress reconciler cancel errors during reconcile_all_now (see the sketch below). This is important for tests that call it iteratively, as they would otherwise get 500 errors when some in-flight reconciler was cancelled (perhaps due to a state change on the tenant shard starting a new reconciler).
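
A tiny illustrative sketch of why the typed error helps; the variant names here are assumptions, not the actual `ReconcileError` definition:

```rust
// With a typed error, reconcile_all_now can tell an expected cancellation
// (a newer change superseded the in-flight Reconciler) from a real failure,
// and avoid returning a 500 for the former.
#[derive(Debug)]
enum ReconcileError {
    Cancel,        // the in-flight reconcile was cancelled; expected, not a failure
    Other(String), // a genuine error worth surfacing to the caller
}

fn should_fail_request(last_error: &ReconcileError) -> bool {
    !matches!(last_error, ReconcileError::Cancel)
}
```
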
jcsp added a commit that referenced this pull request Apr 29, 2024
…7495)

## Problem

Previously, we tried to send compute notifications in startup_reconcile before completing that function, with a time limit. Any notifications that didn't happen within the time limit resulted in tenants having their `pending_compute_notification` flag set, which causes them to spawn a Reconciler the next time the background reconciler loop runs.

This causes two problems:
- Spawning a lot of reconcilers after startup caused a spike in memory (this is addressed in #7493)
- After #7493, spawning lots of reconcilers will block some other operations, e.g. a tenant creation might fail for lack of reconciler semaphore units while the controller is busy running all the Reconcilers for its startup compute notifications

When the code was first written, ComputeHook didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in #7088, we can use it to avoid waiting for notifications to complete in startup_reconcile.

Related to: #7460

## Summary of changes

- Add a `notify_background` method to ComputeHook
- Call this from startup_reconcile instead of doing notifications inline
- Process completions from `notify_background` in `process_results`, and if a notification failed, set the `pending_compute_notification` flag on the shard (see the sketch below)

The result is that we will only spawn lots of Reconcilers if the compute notifications _fail_, not just because they take some significant amount of time.

Test coverage for this case is in #7475.
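
A rough sketch of that flow. Apart from `notify_background`, `process_results`, `pending_compute_notification`, and `TenantShardId`, the names and types below are assumptions, and a tokio runtime is assumed:

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TenantShardId(u64);

struct NotifyResult {
    shard: TenantShardId,
    ok: bool,
}

// Fire-and-forget: startup_reconcile no longer waits for notifications to finish.
// ComputeHook's internal ordering (from #7088) keeps per-shard notifications in order.
fn notify_background(
    shards: Vec<TenantShardId>,
    results_tx: mpsc::UnboundedSender<NotifyResult>,
) {
    tokio::spawn(async move {
        for shard in shards {
            let ok = send_compute_notification(shard).await;
            let _ = results_tx.send(NotifyResult { shard, ok });
        }
    });
}

// Drained from the controller's result-processing loop: a failed notification
// flags the shard so a Reconciler is spawned for it on a later background pass.
fn process_results(
    results_rx: &mut mpsc::UnboundedReceiver<NotifyResult>,
    pending_compute_notification: &mut HashSet<TenantShardId>,
) {
    while let Ok(result) = results_rx.try_recv() {
        if !result.ok {
            pending_compute_notification.insert(result.shard);
        }
    }
}

// Placeholder for the real compute hook request.
async fn send_compute_notification(_shard: TenantShardId) -> bool {
    true
}
```
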
@jcsp added the a/test (Area: related to testing) and c/storage/controller (Component: Storage Controller) labels on Apr 29, 2024
@jcsp changed the title from "storage controller: test & improvements for large tenant counts" to "storage controller: test for large shard counts" on Apr 29, 2024
@jcsp marked this pull request as ready for review April 29, 2024 09:15
@jcsp requested a review from a team as a code owner April 29, 2024 09:15
@jcsp requested a review from VladLazar April 29, 2024 09:15
@VladLazar (Contributor) left a comment

Nice test!

storage_controller/src/service.rs: review thread (resolved)
storage_controller/src/service.rs: review thread (outdated, resolved)
@jcsp requested a review from VladLazar April 29, 2024 17:38
@jcsp (Contributor, Author) commented on Apr 30, 2024

When running on slower nodes in CI, this demonstrated that the default heartbeat period is currently not long enough to gracefully handle a pageserver restart.

Opened #7552 to track this.

@VladLazar (Contributor) left a comment

Looks good to me once tests pass.

jcsp added a commit that referenced this pull request Apr 30, 2024
## Problem

`init_tenant_mgr` blocks the rest of pageserver startup, including starting the admin API.

This was noticeable in #7475, where the init_tenant_mgr runtime could be long enough to trip the controller's 30-second heartbeat timeout.

## Summary of changes

- When detaching tenants during startup, spawn the background deletes as background tasks instead of doing them inline
- Write all configs before spawning any tenants, so that the config writes aren't fighting tenants for system resources
- Write configs with some concurrency (16) rather than writing them all sequentially (see the sketch below)
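
A sketch of the bounded-concurrency config writes, using `buffer_unordered` from the futures crate; the function name and file name are placeholders rather than the actual pageserver code:

```rust
use std::path::PathBuf;

use futures::stream::{self, StreamExt};

// Write each tenant's config file with at most 16 writes in flight,
// instead of writing them one after another.
async fn write_all_configs(tenant_dirs: Vec<PathBuf>) -> std::io::Result<()> {
    let results: Vec<std::io::Result<()>> = stream::iter(tenant_dirs)
        .map(|dir| async move {
            // One tenant's config write; buffer_unordered caps concurrency at 16.
            tokio::fs::write(dir.join("config"), b"...").await
        })
        .buffer_unordered(16)
        .collect()
        .await;

    // Surface the first error, if any.
    results
        .into_iter()
        .collect::<std::io::Result<Vec<_>>>()
        .map(|_| ())
}
```
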
@jcsp enabled auto-merge (squash) April 30, 2024 14:59
@jcsp merged commit a74b600 into main Apr 30, 2024
57 checks passed
@jcsp deleted the jcsp/storcon-scale-test branch April 30, 2024 15:21
Labels: a/test (Area: related to testing), c/storage/controller (Component: Storage Controller), run-benchmarks