Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage controller: robustness improvements #7027

Merged
merged 16 commits into from Mar 7, 2024
Merged

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Mar 5, 2024

Problem

Closes: #6847
Closes: #7006

Summary of changes

  • Pageserver API calls are wrapped in timeout/retry logic: this prevents a reconciler getting hung on a pageserver API hang, and prevents reconcilers having to totally retry if one API call returns a retryable error (e.g. 503).
  • Add a cancellation token to Node, so that when we mark a node offline we will cancel any API calls in progress to that node, and avoid issuing any more API calls to that offline node.
  • If the dirty locations of a shard are all on offline nodes, then don't spawn a reconciler
  • In re-attach, if we have no observed state object for a tenant then construct one with conf: None (which means "unknown"). Then in Reconciler, implement a TODO for scanning such locations before running, so that we will avoid spuriously incrementing a generation in the case of a node that was offline while we started (this is the case that tripped up test_basebackup_with_high_slru_count fails due to attachment service reconcile race #7006)
  • Refactoring: make Node contents private (and thereby guarantee that updates to availability mode reliably update the cancellation token.)
  • Refactoring: don't pass the whole map of nodes into Reconciler (and thereby remove a bunch of .expect() calls)

Some of this was discovered/tested with a new failure injection test that will come in a separate PR, once it is stable enough for CI.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug t/feature Issue type: feature, for new features or requests c/storage/controller Component: Storage Controller labels Mar 5, 2024
Copy link

github-actions bot commented Mar 5, 2024

2567 tests run: 2434 passed, 0 failed, 133 skipped (full report)


Flaky tests (3)

Postgres 15

  • test_pg_regress[None]: debug
  • test_empty_tenant_size: debug

Postgres 14

  • test_multi_attach: debug

Code coverage* (full report)

  • functions: 28.8% (7009 of 24365 functions)
  • lines: 47.5% (43162 of 90949 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
d59b3d1 at 2024-03-07T14:52:07.086Z :recycle:

@jcsp jcsp added the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Mar 5, 2024
@jcsp jcsp requested a review from VladLazar March 5, 2024 18:07
@jcsp jcsp marked this pull request as ready for review March 5, 2024 18:07
@jcsp jcsp requested a review from a team as a code owner March 5, 2024 18:07
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice

control_plane/attachment_service/src/tenant_state.rs Outdated Show resolved Hide resolved
control_plane/attachment_service/src/service.rs Outdated Show resolved Hide resolved
@jcsp jcsp enabled auto-merge (squash) March 6, 2024 17:02
@jcsp jcsp removed the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Mar 7, 2024
@jcsp jcsp merged commit d5a6a2a into main Mar 7, 2024
58 checks passed
@jcsp jcsp deleted the jcsp/ha-testing-fixes-1 branch March 7, 2024 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/controller Component: Storage Controller t/bug Issue Type: Bug t/feature Issue type: feature, for new features or requests
Projects
None yet
2 participants