
controller: limit Reconciler concurrency #7493

Merged (3 commits, Apr 25, 2024)
Conversation

@jcsp (Contributor) commented Apr 24, 2024

Problem

Storage controller memory can spike very high if we have many tenants and they all try to reconcile at the same time.

Related:

Not closing those issues in this PR, because the test coverage for them will be in #7475

Summary of changes

  • Add a CLI arg --reconciler-concurrency, defaulted to 128
  • Add a semaphore to Service with this many units
  • In maybe_reconcile_shard, try to acquire semaphore unit. If we can't get one, return a ReconcileWaiter for a future sequence number, and push the TenantShardId onto a channel of delayed IDs.
  • In process_result, consume from the channel of delayed IDs if there are semaphore units available and call maybe_reconcile_shard again for these delayed shards.
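The steps above can be sketched as a small model. This is an illustrative simplification, not the real storage controller code: `ReconcileLimiter`, `try_reconcile`, and `on_reconcile_done` are hypothetical names, a plain counter and `VecDeque` stand in for the async semaphore and the channel of delayed `TenantShardId`s, and a `u64` stands in for the shard ID.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical model of the PR's scheme: a counting limit on concurrent
// reconcilers, with shards that can't get a unit pushed onto a queue of
// delayed IDs to be retried when a unit frees up.
struct ReconcileLimiter {
    units: Mutex<usize>,           // free "semaphore" units
    delayed: Mutex<VecDeque<u64>>, // stand-in for the channel of delayed TenantShardIds
}

impl ReconcileLimiter {
    fn new(concurrency: usize) -> Arc<Self> {
        Arc::new(Self {
            units: Mutex::new(concurrency),
            delayed: Mutex::new(VecDeque::new()),
        })
    }

    /// Analogue of maybe_reconcile_shard: try to take a unit; on failure,
    /// record the shard as delayed and report that no reconciler was spawned
    /// (the real code returns a ReconcileWaiter for a future sequence number).
    fn try_reconcile(&self, shard_id: u64) -> bool {
        let mut units = self.units.lock().unwrap();
        if *units > 0 {
            *units -= 1;
            true
        } else {
            self.delayed.lock().unwrap().push_back(shard_id);
            false
        }
    }

    /// Analogue of process_result: return the unit, then immediately retry
    /// one delayed shard if any are waiting.
    fn on_reconcile_done(&self) -> Option<u64> {
        *self.units.lock().unwrap() += 1;
        let next = self.delayed.lock().unwrap().pop_front();
        next.filter(|&id| self.try_reconcile(id))
    }
}
```

With `concurrency = 2`, a third shard is delayed rather than spawning a reconciler, and completing any running reconcile pulls it off the delayed queue.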

This has been tested in #7475, but that PR will land separately because it contains other changes and its test needs stabilizing. This change is worth merging sooner because it fixes a practical issue with larger shard counts.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/controller Component: Storage Controller labels Apr 24, 2024
@jcsp jcsp requested a review from a team as a code owner April 24, 2024 08:44
@jcsp jcsp requested a review from arssher April 24, 2024 08:44
@jcsp jcsp requested review from petuhovskiy and VladLazar and removed request for arssher April 24, 2024 09:09

github-actions bot commented Apr 24, 2024

2766 tests run: 2645 passed, 0 failed, 121 skipped (full report)


Code coverage* (full report)

  • functions: 28.1% (6481 of 23054 functions)
  • lines: 47.0% (46060 of 98092 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
4f025d2 at 2024-04-25T08:06:52.653Z :recycle:

Review threads (resolved):
  • storage_controller/src/tenant_shard.rs
  • storage_controller/src/service.rs
@jcsp (Contributor, Author) commented Apr 25, 2024

Two commits added:

  • One to improve behavior (the earlier code worked, but under some conditions would end up waiting for a background reconcile to kick the delayed shards)
  • One for the review comments on naming etc.

@jcsp jcsp requested a review from VladLazar April 25, 2024 07:21
@jcsp jcsp merged commit e8814b6 into main Apr 25, 2024
53 checks passed
@jcsp jcsp deleted the jcsp/storcon-concurrency-limit branch April 25, 2024 09:46
jcsp added a commit that referenced this pull request Apr 29, 2024
…7495)

## Problem

Previously, we tried to send compute notifications in startup_reconcile
before completing that function, with a time limit. Any notifications
that don't complete within the time limit leave tenants with their
`pending_compute_notification` flag set, which causes them to spawn a
Reconciler the next time the background reconciler loop runs.

This causes two problems:
- Spawning a lot of reconcilers after startup caused a spike in memory
(this is addressed in #7493)
- After #7493, spawning lots of
reconcilers will block some other operations, e.g. a tenant creation
might fail due to lack of reconciler semaphore units while the
controller is busy running all the Reconcilers for its startup compute
notifications.

When the code was first written, ComputeHook didn't have internal
ordering logic to ensure that notifications for a shard were sent in the
right order. Since that was added in
#7088, we can use it to avoid
waiting for notifications to complete in startup_reconcile.

Related to: #7460

## Summary of changes

- Add a `notify_background` method to ComputeHook.
- Call this from startup_reconcile instead of doing notifications inline
- Process completions from `notify_background` in `process_results`, and
if a notification failed then set the `pending_compute_notification`
flag on the shard.

The result is that we will only spawn lots of Reconcilers if the compute
notifications _fail_, not just because they take some significant amount
of time.
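The completion-processing step can be sketched as follows. This is a hedged illustration, not the real ComputeHook API: `NotifyResult`, `Shard`, and `process_results` are hypothetical names, and a std mpsc channel stands in for whatever mechanism `notify_background` actually uses to report completions.

```rust
use std::sync::mpsc;

// Hypothetical completion record from a background compute notification.
#[derive(Debug, PartialEq)]
struct NotifyResult {
    shard_id: u64,
    ok: bool,
}

// Minimal stand-in for a tenant shard's notification state.
struct Shard {
    id: u64,
    pending_compute_notification: bool,
}

/// Sketch of the process_results side: drain completed notifications and
/// set pending_compute_notification only for the ones that failed, so a
/// Reconciler is spawned later only on actual failure.
fn process_results(shards: &mut [Shard], rx: &mpsc::Receiver<NotifyResult>) {
    for result in rx.try_iter() {
        if !result.ok {
            if let Some(shard) = shards.iter_mut().find(|s| s.id == result.shard_id) {
                // Failed notification: flag the shard so the background
                // reconcile loop retries it.
                shard.pending_compute_notification = true;
            }
        }
    }
}
```

The design point is that slow-but-successful notifications no longer set the flag at all, so startup no longer produces a burst of Reconcilers merely because notifications took time.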

Test coverage for this case is in
#7475
jcsp added a commit that referenced this pull request Apr 30, 2024
## Problem

Storage controller was observed to have unexpectedly large memory
consumption when loaded with many thousands of shards.

This was recently fixed:
- #7493

...but we need a general test that the controller is well behaved with
thousands of shards.

Closes: #7460
Closes: #7463

## Summary of changes

- Add test test_storage_controller_many_tenants to exercise the system's
behaviour with a more substantial workload. This test measures memory
consumption and reproduces #7460 before the other changes in this PR.
- Tweak reconcile_all's return value to make it nonzero if it spawns no
reconcilers, but _would_ have spawned some reconcilers if they weren't
blocked by the reconcile concurrency limit. This makes the test's
reconcile_until_idle behave as expected (i.e. not complete until the
system is nice and calm).
- Fix an issue where tenant migrations would leave a spurious secondary
location when migrated to some location that was not already their
secondary (this was an existing low-impact bug that tripped up the
test's consistency checks).
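The return-value tweak can be illustrated with a toy model. This is an assumption-laden sketch, not the controller's actual signature: `reconcile_all` here just counts work, where the real function returns a value the test's `reconcile_until_idle` polls.

```rust
/// Hypothetical model: reconcile_all reports how many reconciles it spawned
/// *plus* how many it wanted to spawn but deferred behind the concurrency
/// limit. A "wait until idle" loop that treats zero as idle therefore keeps
/// polling while shards are blocked on semaphore units, instead of exiting
/// early.
fn reconcile_all(wants_reconcile: usize, free_units: usize) -> usize {
    let spawned = wants_reconcile.min(free_units);
    let deferred = wants_reconcile - spawned;
    spawned + deferred
}

/// The system is idle only when nothing was spawned and nothing is blocked.
fn is_idle(wants_reconcile: usize, free_units: usize) -> bool {
    reconcile_all(wants_reconcile, free_units) == 0
}
```

Under the old behaviour, the deferred count would have been dropped and `is_idle` would return true while blocked work remained queued.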

On the test with 8000 shards, the resident memory per shard is about
20KiB. This is not really per-shard memory: the primary source of memory
growth is the number of concurrent network/db clients we create.

With 8000 shards, the test takes 125s to run on my workstation.