Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: find gc cutoff points without holding Tenant::gc_cs #7585

Merged
merged 14 commits into from
May 3, 2024

Conversation

koivunej
Copy link
Contributor

@koivunej koivunej commented May 2, 2024

The current implementation of finding timeline gc cutoff Lsn(s) is done while holding Tenant::gc_cs. In recent incidents long create branch times were caused by holding the Tenant::gc_cs over extremely long Timeline::find_lsn_by_timestamp. The fix is to find the GC cutoff values before taking the Tenant::gc_cs lock. This change is safe to do because the GC cutoff values and the branch points have no dependencies on each other. In the case of Timeline::find_gc_cutoff taking a long time with this change, we should no longer see Tenant::gc_cs interfering with branch creation.

Additionally, the Tenant::refresh_gc_info is now tolerant of timeline deletions (or any other failures to find the pitr_cutoff). This helps with the synthetic size calculation being constantly completed instead of having a break for a timely timeline deletion.

Fixes: #7560
Fixes: #7587

Copy link

github-actions bot commented May 2, 2024

2880 tests run: 2759 passed, 0 failed, 121 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 28.0% (6620 of 23603 functions)
  • lines: 46.7% (47077 of 100708 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
90ece17 at 2024-05-03T12:15:53.035Z :recycle:

@koivunej koivunej force-pushed the joonas/find_gc_cutoffs_without_gc_cs branch from 3c979ad to 7d161ca Compare May 2, 2024 11:55
@koivunej koivunej marked this pull request as ready for review May 2, 2024 13:23
@koivunej koivunej requested a review from a team as a code owner May 2, 2024 13:23
@koivunej koivunej requested review from arpad-m and removed request for a team May 2, 2024 13:23
@problame problame requested review from problame and removed request for arpad-m May 3, 2024 08:33
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refresh_gc_info_internal: I don't like that we now have a timelines variable that only contains the timelines for which we produced GcInfo, but later also use self.timelines. I think it's correct, but, it's a super easy footgun down the line, because they can be so easily confused. Maybe newtype magic could help.

But, this is not a hard ask. A comment warning future developers about this would be nice, though.


The only thing I'm insistent on is to factor out the questionmark business, as asked for in #7585 (comment)

Approving on that condition.

pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Show resolved Hide resolved
@koivunej koivunej force-pushed the joonas/find_gc_cutoffs_without_gc_cs branch from 7d161ca to c4183eb Compare May 3, 2024 09:36
Base automatically changed from joonas/split_update_gc_info to main May 3, 2024 10:11
@koivunej koivunej force-pushed the joonas/find_gc_cutoffs_without_gc_cs branch from d923e01 to 873f4ab Compare May 3, 2024 11:19
@koivunej koivunej merged commit ed9a114 into main May 3, 2024
50 checks passed
@koivunej koivunej deleted the joonas/find_gc_cutoffs_without_gc_cs branch May 3, 2024 11:57
koivunej added a commit that referenced this pull request May 6, 2024
#7585 introduced test case for deletions while synthetic size is being
calculated. The test has a race against deletion, but we only accept one
outcome. Fix it to accept 404 as well, as we cannot control from outside
which outcome happens.

Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7456/8970595458/index.html#/testresult/32a5b2f8c4094bdb
conradludgate pushed a commit that referenced this pull request May 8, 2024
The current implementation of finding timeline gc cutoff Lsn(s) is done
while holding `Tenant::gc_cs`. In recent incidents long create branch
times were caused by holding the `Tenant::gc_cs` over extremely long
`Timeline::find_lsn_by_timestamp`. The fix is to find the GC cutoff
values before taking the `Tenant::gc_cs` lock. This change is safe to do
because the GC cutoff values and the branch points have no dependencies
on each other. In the case of `Timeline::find_gc_cutoff` taking a long
time with this change, we should no longer see `Tenant::gc_cs`
interfering with branch creation.

Additionally, the `Tenant::refresh_gc_info` is now tolerant of timeline
deletions (or any other failures to find the pitr_cutoff). This helps
with the synthetic size calculation being constantly completed instead
of having a break for a timely timeline deletion.

Fixes: #7560
Fixes: #7587
conradludgate pushed a commit that referenced this pull request May 8, 2024
#7585 introduced test case for deletions while synthetic size is being
calculated. The test has a race against deletion, but we only accept one
outcome. Fix it to accept 404 as well, as we cannot control from outside
which outcome happens.

Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7456/8970595458/index.html#/testresult/32a5b2f8c4094bdb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants