
storage controller: test for large shard counts #7475

Merged: 15 commits into main from jcsp/storcon-scale-test, Apr 30, 2024

Conversation

@jcsp (Contributor) commented on Apr 23, 2024

Problem

Storage controller was observed to have unexpectedly large memory consumption when loaded with many thousands of shards.

This was recently fixed:

...but we need a general test that the controller is well behaved with thousands of shards.

Closes: #7460
Closes: #7463

Summary of changes

  • Add a new test, test_storage_controller_many_tenants, to exercise the system's behaviour with a more substantial workload. This test measures memory consumption and reproduces "storage controller using ~500kb memory per shard" (#7460) before the other changes in this PR.
  • Tweak reconcile_all's return value so that it is nonzero if it spawns no reconcilers but would have spawned some had they not been blocked by the reconcile concurrency limit. This makes the test's reconcile_until_idle behave as expected, i.e. it does not complete until the system is nice and calm (see the sketch after this list).
  • Fix an issue where migrating a tenant to a location that was not already its secondary would leave a spurious secondary location behind (an existing low-impact bug that tripped up the test's consistency checks).
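
For illustration, a minimal sketch of that return-value behaviour. `ReconcileOutcome` and the standalone `reconcile_all`/`reconcile_until_idle` helpers below are hypothetical simplifications, not the controller's actual code:

```rust
// Delayed reconciles (blocked by the concurrency limit) still count as pending
// work, so callers polling "until idle" don't stop while shards are queued.
#[derive(Clone, Copy)]
enum ReconcileOutcome {
    Spawned,   // a Reconciler was started for this shard
    Delayed,   // would reconcile, but blocked by the reconcile concurrency limit
    NotNeeded, // shard already matches its intended state
}

fn reconcile_all(outcomes: &[ReconcileOutcome]) -> usize {
    outcomes
        .iter()
        .filter(|o| matches!(o, ReconcileOutcome::Spawned | ReconcileOutcome::Delayed))
        .count()
}

fn reconcile_until_idle(mut poll: impl FnMut() -> usize) {
    // Keep polling until a pass neither spawns nor delays any reconciles.
    while poll() > 0 {
        std::thread::sleep(std::time::Duration::from_millis(100));
    }
}
```

The key point is that a pass which spawns nothing but still has shards waiting behind the concurrency limit reports a nonzero count.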

In the test with 8000 shards, the resident memory per shard is about 20KiB. This is not really per-shard memory: the primary source of memory growth is the number of concurrent network/db clients we create.

With 8000 shards, the test takes 125s to run on my workstation.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp added the run-benchmarks label (indicates to CI that benchmarks should be run for this PR) on Apr 23, 2024
github-actions bot commented on Apr 23, 2024

2955 tests run: 2821 passed, 0 failed, 134 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]: release

Code coverage* (full report)

  • functions: 28.1% (6574 of 23355 functions)
  • lines: 46.9% (46728 of 99639 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a2ff93b at 2024-04-30T15:35:30.523Z

@jcsp (Contributor, Author) commented on Apr 24, 2024

On my workstation:

  • Runtime of the test is 1125.35s (8000 tenants with 2 shards each).
  • Most of that runtime is the reconciles that happen after turning the world upside-down by marking all the nodes offline and then active again. I'm not sure exactly where the slowness is coming from: spot-checking individual shards' migrations, I see them taking less than a second each, so at 16,000 shards with the default reconcile concurrency of 128 we should be doing them all in ~120 seconds, but it's taking several times longer than that. Not a blocker for this PR, but worth figuring out in future, as the rate at which we execute migrations will limit how fast we can drain nodes when we're in a hurry.

jcsp added a commit that referenced this pull request Apr 25, 2024
## Problem

Storage controller memory can spike very high if we have many tenants
and they all try to reconcile at the same time.

Related:
- #7463
- #7460

Not closing those issues in this PR, because the test coverage for them
will be in #7475

## Summary of changes

- Add a CLI arg `--reconciler-concurrency`, defaulted to 128
- Add a semaphore to Service with this many units
- In `maybe_reconcile_shard`, try to acquire a semaphore unit. If we can't get one, return a ReconcileWaiter for a future sequence number and push the TenantShardId onto a channel of delayed IDs (see the sketch below).
- In `process_result`, consume from the channel of delayed IDs if there are semaphore units available and call `maybe_reconcile_shard` again for these delayed shards.

This has been tested in #7475, but that PR will land separately because it contains other changes and its test needs stabilizing. This change is worth merging sooner because it fixes a practical issue with larger shard counts.
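
A simplified sketch of the semaphore-and-delayed-channel mechanism described above. The `Service` fields and control flow here are assumptions for illustration; the real `maybe_reconcile_shard` also hands back a ReconcileWaiter for a future sequence number, which is omitted, and a tokio runtime is assumed:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

#[derive(Clone, Copy, Debug)]
struct TenantShardId(u64);

struct Service {
    // Sized from --reconciler-concurrency (default 128).
    reconcile_permits: Arc<Semaphore>,
    delayed_tx: mpsc::UnboundedSender<TenantShardId>,
    delayed_rx: mpsc::UnboundedReceiver<TenantShardId>,
}

impl Service {
    fn maybe_reconcile_shard(&self, shard: TenantShardId) {
        match self.reconcile_permits.clone().try_acquire_owned() {
            Ok(permit) => {
                // Spawn the Reconciler; the permit is released when the task finishes.
                tokio::spawn(async move {
                    // ... run the reconcile for `shard` here ...
                    let _ = shard;
                    drop(permit);
                });
            }
            Err(_) => {
                // Concurrency limit reached: remember this shard and retry it later.
                let _ = self.delayed_tx.send(shard);
            }
        }
    }

    fn process_result(&mut self) {
        // After a reconcile completes, pull delayed shards while permits are free.
        while self.reconcile_permits.available_permits() > 0 {
            match self.delayed_rx.try_recv() {
                Ok(shard) => self.maybe_reconcile_shard(shard),
                Err(_) => break,
            }
        }
    }
}
```
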
jcsp added a commit that referenced this pull request Apr 26, 2024
These are testability/logging improvements spun off from #7475:

- Don't log warnings for shutdown errors in the compute hook
- Revise logging around heartbeats and reconcile_all so that we aren't emitting such a large volume of INFO messages under normal quiet conditions
- Clean up the `last_error` of TenantShard to hold a ReconcileError instead of a String, and use that properly typed error to suppress reconciler cancel errors during reconcile_all_now (see the sketch below). This is important for tests that call it iteratively, as they would otherwise get 500 errors when some in-flight reconciler was cancelled (perhaps due to a state change on the tenant shard starting a new reconciler).
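
A tiny illustrative sketch of why the typed error helps; the variant names here are assumptions, not the actual `ReconcileError` definition:

```rust
// With a typed error, reconcile_all_now can tell an expected cancellation
// (a newer change superseded the in-flight Reconciler) from a real failure,
// and avoid returning a 500 for the former.
#[derive(Debug)]
enum ReconcileError {
    Cancel,        // the in-flight reconcile was cancelled; expected, not a failure
    Other(String), // a genuine error worth surfacing to the caller
}

fn should_fail_request(last_error: &ReconcileError) -> bool {
    !matches!(last_error, ReconcileError::Cancel)
}
```
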
jcsp added a commit that referenced this pull request Apr 29, 2024
…7495)

## Problem

Previously, we tried to send compute notifications in startup_reconcile before completing that function, with a time limit. Any notifications that didn't happen within the time limit resulted in tenants having their `pending_compute_notification` flag set, which causes them to spawn a Reconciler the next time the background reconciler loop runs.

This causes two problems:
- Spawning a lot of reconcilers after startup caused a spike in memory (this is addressed in #7493)
- After #7493, spawning lots of reconcilers will block some other operations, e.g. a tenant creation might fail for lack of reconciler semaphore units while the controller is busy running all the Reconcilers for its startup compute notifications

When the code was first written, ComputeHook didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in #7088, we can use it to avoid waiting for notifications to complete in startup_reconcile.

Related to: #7460

## Summary of changes

- Add a `notify_background` method to ComputeHook
- Call this from startup_reconcile instead of doing notifications inline
- Process completions from `notify_background` in `process_results`, and if a notification failed, set the `pending_compute_notification` flag on the shard (see the sketch below)

The result is that we will only spawn lots of Reconcilers if the compute notifications _fail_, not just because they take some significant amount of time.

Test coverage for this case is in #7475.
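
A rough sketch of that flow. Apart from `notify_background`, `process_results`, `pending_compute_notification`, and `TenantShardId`, the names and types below are assumptions, and a tokio runtime is assumed:

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TenantShardId(u64);

struct NotifyResult {
    shard: TenantShardId,
    ok: bool,
}

// Fire-and-forget: startup_reconcile no longer waits for notifications to finish.
// ComputeHook's internal ordering (from #7088) keeps per-shard notifications in order.
fn notify_background(
    shards: Vec<TenantShardId>,
    results_tx: mpsc::UnboundedSender<NotifyResult>,
) {
    tokio::spawn(async move {
        for shard in shards {
            let ok = send_compute_notification(shard).await;
            let _ = results_tx.send(NotifyResult { shard, ok });
        }
    });
}

// Drained from the controller's result-processing loop: a failed notification
// flags the shard so a Reconciler is spawned for it on a later background pass.
fn process_results(
    results_rx: &mut mpsc::UnboundedReceiver<NotifyResult>,
    pending_compute_notification: &mut HashSet<TenantShardId>,
) {
    while let Ok(result) = results_rx.try_recv() {
        if !result.ok {
            pending_compute_notification.insert(result.shard);
        }
    }
}

// Placeholder for the real compute hook request.
async fn send_compute_notification(_shard: TenantShardId) -> bool {
    true
}
```
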
@jcsp added the a/test (Area: related to testing) and c/storage/controller (Component: Storage Controller) labels on Apr 29, 2024
@jcsp changed the title from "storage controller: test & improvements for large tenant counts" to "storage controller: test for large shard counts" on Apr 29, 2024
@jcsp marked this pull request as ready for review April 29, 2024 09:15
@jcsp requested a review from a team as a code owner April 29, 2024 09:15
@jcsp requested a review from VladLazar April 29, 2024 09:15
@VladLazar (Contributor) left a comment

Nice test!

storage_controller/src/service.rs: review thread (resolved)
storage_controller/src/service.rs: review thread (outdated, resolved)
@jcsp requested a review from VladLazar April 29, 2024 17:38
@jcsp (Contributor, Author) commented on Apr 30, 2024

When running on slower nodes in CI, this demonstrated that the default heartbeat period is currently not long enough to gracefully handle a pageserver restart.

Opened #7552 to track this.

@VladLazar (Contributor) left a comment

Looks good to me once tests pass.

jcsp added a commit that referenced this pull request Apr 30, 2024
## Problem

`init_tenant_mgr` blocks the rest of pageserver startup, including starting the admin API.

This was noticeable in #7475, where the init_tenant_mgr runtime could be long enough to trip the controller's 30-second heartbeat timeout.

## Summary of changes

- When detaching tenants during startup, spawn the background deletes as background tasks instead of doing them inline
- Write all configs before spawning any tenants, so that the config writes aren't fighting tenants for system resources
- Write configs with some concurrency (16) rather than writing them all sequentially (see the sketch below)
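
A sketch of the bounded-concurrency config writes, using `buffer_unordered` from the futures crate; the function name and file name are placeholders rather than the actual pageserver code:

```rust
use std::path::PathBuf;

use futures::stream::{self, StreamExt};

// Write each tenant's config file with at most 16 writes in flight,
// instead of writing them one after another.
async fn write_all_configs(tenant_dirs: Vec<PathBuf>) -> std::io::Result<()> {
    let results: Vec<std::io::Result<()>> = stream::iter(tenant_dirs)
        .map(|dir| async move {
            // One tenant's config write; buffer_unordered caps concurrency at 16.
            tokio::fs::write(dir.join("config"), b"...").await
        })
        .buffer_unordered(16)
        .collect()
        .await;

    // Surface the first error, if any.
    results
        .into_iter()
        .collect::<std::io::Result<Vec<_>>>()
        .map(|_| ())
}
```
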
@jcsp enabled auto-merge (squash) April 30, 2024 14:59
@jcsp merged commit a74b600 into main Apr 30, 2024
57 checks passed
@jcsp deleted the jcsp/storcon-scale-test branch April 30, 2024 15:21
Labels: a/test (Area: related to testing), c/storage/controller (Component: Storage Controller), run-benchmarks