Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica #7368

Merged
merged 4 commits into from
May 21, 2024

Conversation

knizhnik
Contributor

Refers to #6211 and #6357.

Problem

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist


github-actions bot commented Apr 12, 2024

3090 tests run: 2963 passed, 0 failed, 127 skipped (full report)


Code coverage* (full report)

  • functions: 31.4% (6414 of 20429 functions)
  • lines: 48.1% (49319 of 102626 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
f02aeb5 at 2024-05-21T12:28:39.906Z :recycle:

Contributor

@MMeent MMeent left a comment


The Compute-related changes are OK after fixing the style issue.

pgxn/neon/walproposer_pg.c (outdated)
@knizhnik
Contributor Author

This PR does not pass tests because, on its own, it is not enough to fix the problem of accessing a too-old page version that GC has already collected. #6718 should be committed first.

@arssher arssher mentioned this pull request Apr 13, 2024
5 tasks
Contributor

@arssher arssher left a comment


Generally LGTM, thanks, but a few minor remarks.

safekeeper/src/send_wal.rs (outdated)
safekeeper/src/send_wal.rs (outdated)
safekeeper/src/timeline.rs (outdated)
@VladLazar
Contributor

Removing my review request, since it's unclear to me what the state of this pr is. Please re-request review if needed.

@VladLazar VladLazar removed their request for review May 3, 2024 09:10
@arssher
Contributor

arssher commented May 6, 2024

Generally LGTM. However, the new test still fails, which is interesting. Note that getting to the root problem was nontrivial because nested errors masked each other: 1) first, select sum(abalance) from pgbench_accounts failed with "tried to request a page version that was garbage collected", which terminated the loop; 2) this killed the endpoints, which made the run_pgbench thread fail and raise PytestUnhandledThreadExceptionWarning; 3) finally, the assert_no_errors check also failed during shutdown because of the same "tried to request a page version that was garbage collected" error.

So the root cause is exactly what this PR tries to prevent: "tried to request a page version that was garbage collected". Why does it fail? Because #7377 is missing (the PR has not been rebased yet), as a comment above shows.

@knizhnik knizhnik force-pushed the propagate_replica_flush_lsn_to_ps branch from 13bcccb to f692109 Compare May 14, 2024 11:46
@arssher arssher force-pushed the propagate_replica_flush_lsn_to_ps branch 3 times, most recently from 2e736c0 to f0f36be Compare May 20, 2024 12:51
arssher and others added 4 commits May 21, 2024 14:43
To avoid the pageserver GC'ing data needed by a standby, propagate the standby
apply LSN through the standby -> safekeeper -> broker -> pageserver flow and
hold off GC for it. Each GC iteration resets the value, so the horizon is
removed when the standby goes away; pushes are assumed to happen at least once
between GC iterations. As a safety guard, the maximum allowed lag relative to
the normal GC horizon is hardcoded as 10GB.
Add a test for the feature.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
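
The GC hold-off described in the first commit message can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `Lsn` as a plain `u64`, the `StandbyHorizon` type, `effective_gc_cutoff`, and the constant name are all hypothetical, and treating 10GB as 10 * 2^30 bytes is an assumption.

```rust
/// Log sequence number, modelled here as a plain byte offset into the WAL
/// (an assumption for this sketch).
type Lsn = u64;

/// Hypothetical constant mirroring the "10GB" safety guard from the commit
/// message; the exact byte value is assumed.
const MAX_ALLOWED_STANDBY_LAG: u64 = 10 * 1024 * 1024 * 1024;

/// Hypothetical holder for the most recent standby apply LSN pushed to the
/// pageserver via the safekeeper and broker.
#[derive(Default)]
struct StandbyHorizon {
    apply_lsn: Option<Lsn>,
}

impl StandbyHorizon {
    /// Record the latest apply LSN reported for the standby.
    fn push(&mut self, lsn: Lsn) {
        self.apply_lsn = Some(lsn);
    }

    /// Called once per GC iteration: returns the current value and resets it,
    /// so a standby that went away stops holding GC back on the next round.
    fn take_for_gc(&mut self) -> Option<Lsn> {
        self.apply_lsn.take()
    }
}

/// Returns the LSN below which GC may remove old page versions.
fn effective_gc_cutoff(normal_cutoff: Lsn, standby_apply_lsn: Option<Lsn>) -> Lsn {
    match standby_apply_lsn {
        // No push arrived since the last GC iteration: use the normal horizon.
        None => normal_cutoff,
        Some(apply_lsn) => {
            // The standby may hold GC back by at most MAX_ALLOWED_STANDBY_LAG.
            let floor = normal_cutoff.saturating_sub(MAX_ALLOWED_STANDBY_LAG);
            // Hold GC at the apply LSN, but never below the floor and never
            // above the normal horizon.
            apply_lsn.max(floor).min(normal_cutoff)
        }
    }
}
```

Resetting the value on every GC iteration trades a little safety for simplicity: no explicit "standby disconnected" message is needed, at the cost of relying on at least one push arriving between iterations.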
Hot standby feedback xmins can be greater than next_xid because nextXid is
updated only sparsely on the pageserver (to do fewer writes, it advances
nextXid in steps of 1024). ProcessStandbyHSFeedback ignores such xids from the
future; to fix this, clamp the received xmin to next_xid.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
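
The clamping described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged change: `xid_precedes` and `clamp_xmin` are hypothetical names, XIDs are modelled as bare 32-bit values compared modulo 2^32 in the style of PostgreSQL's wraparound-aware comparison, and the special handling PostgreSQL applies to other reserved XIDs is omitted.

```rust
/// Wraparound-aware "a is older than b" for 32-bit transaction IDs:
/// the difference is interpreted as a signed 32-bit number.
fn xid_precedes(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}

/// Clamp an xmin received in hot standby feedback so that it never lies
/// ahead of the next_xid known on this side.
fn clamp_xmin(xmin: u32, next_xid: u32) -> u32 {
    // 0 is the invalid XID, meaning "no xmin reported"; pass it through.
    if xmin == 0 {
        return 0;
    }
    if xid_precedes(next_xid, xmin) {
        // xmin is "from the future" relative to the sparsely updated
        // next_xid; use next_xid so the feedback is not ignored.
        next_xid
    } else {
        xmin
    }
}
```

For example, with `next_xid = 1024`, a reported xmin of 1500 would be clamped to 1024, while an xmin of 100 is passed through unchanged.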
@arssher arssher force-pushed the propagate_replica_flush_lsn_to_ps branch from f0f36be to f02aeb5 Compare May 21, 2024 11:44
@arssher arssher merged commit d43dcce into main May 21, 2024
55 checks passed
@arssher arssher deleted the propagate_replica_flush_lsn_to_ps branch May 21, 2024 13:21

6 participants