On-demand WAL download for walsender #6872
FYI there are failpoints: https://github.com/neondatabase/neon/blob/main/libs/utils/src/failpoint_support.rs
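The pause-then-resume pattern that the test relies on can be modelled with a tiny failpoint registry. This is a hypothetical Python illustration only: the linked `failpoint_support.rs` is Rust, and the `FailpointRegistry` class, its `"off"`/`"pause"`/`"return"` actions and the `sk-pause-wal-send` name are assumptions, not neon's API. A thread that hits a paused failpoint blocks on a condition variable until the failpoint is reconfigured.

```python
import threading


class FailpointRegistry:
    """Hypothetical model of named failpoints; not neon's actual failpoint API."""

    def __init__(self):
        self._actions = {}                  # failpoint name -> configured action
        self._cond = threading.Condition()  # lets paused threads wait for reconfiguration

    def configure(self, name, action):
        """Set a failpoint's action: "off", "pause" or "return" (hypothetical actions)."""
        with self._cond:
            if action == "off":
                self._actions.pop(name, None)
            else:
                self._actions[name] = action
            # Wake paused threads so they re-check their failpoint.
            self._cond.notify_all()

    def hit(self, name):
        """Evaluate a failpoint; returns True if the caller should bail out early."""
        with self._cond:
            # Block for as long as this failpoint is configured to pause.
            while self._actions.get(name) == "pause":
                self._cond.wait()
            return self._actions.get(name) == "return"


# Usage: a stand-in "safekeeper sender" that stops at a failpoint before sending WAL.
fp = FailpointRegistry()
fp.configure("sk-pause-wal-send", "pause")
sent = []


def send_wal():
    if not fp.hit("sk-pause-wal-send"):
        sent.append("wal record")


t = threading.Thread(target=send_wal)
t.start()
t.join(timeout=0.2)
print(t.is_alive(), sent)  # True [] : still paused, nothing sent
fp.configure("sk-pause-wal-send", "off")
t.join()
print(sent)                # ['wal record']
```

A condition variable is used rather than polling so that a paused sender consumes no CPU and resumes as soon as the failpoint is switched off.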
2886 tests run: 2759 passed, 0 failed, 127 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
f28aeaf at 2024-05-06T10:38:43.471Z
Not sure what you mean by "Fails to reproduce the issue": the test passes, and before the failpoint is disabled, the vanilla compute log is full of errors.
Yes, I've since commented in the issue. Are the logs really a problem? It seems like this self-heals when the safekeeper comes back.
Ah, I see, 6371
Nope, I just meant that it does indeed reproduce it.
So is this worth fixing, then, if it self-heals?
Yes, for two reasons:
1) If a subscriber lags a lot, downloading everything might take a long time, and currently this blocks writes because walproposer performs the download.
2) The compute can run out of disk space.
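To make the trade-off concrete, here is a hypothetical Python model of reading WAL on demand instead of having walproposer download the whole backlog up front. `OnDemandWalReader`, `fetch_remote` and the use of small integers as LSNs are illustrative assumptions (real LSNs are byte positions in the WAL stream), and this is not the PR's C implementation. The reader serves records the compute already has and fetches only the missing span, without persisting it.

```python
class OnDemandWalReader:
    """Hypothetical model: serve WAL locally when present, fetch missing ranges on demand."""

    def __init__(self, local, fetch_remote):
        self.local = local                # LSN -> record already on the compute
        self.fetch_remote = fetch_remote  # callable(start, end) -> {LSN: record}, e.g. a safekeeper
        self.remote_fetches = 0           # number of remote round trips made so far

    def read(self, start, end):
        """Return records for LSNs in [start, end), fetching only the span that is missing."""
        missing = [lsn for lsn in range(start, end) if lsn not in self.local]
        fetched = {}
        if missing:
            self.remote_fetches += 1
            # Fetch just the missing span when a reader asks for it, instead of
            # downloading the whole backlog up front. Results are not stored in
            # self.local, so local disk usage does not grow.
            fetched = self.fetch_remote(min(missing), max(missing) + 1)
        return [self.local.get(lsn, fetched.get(lsn)) for lsn in range(start, end)]


remote_wal = {lsn: f"rec{lsn}" for lsn in range(10)}  # stand-in for a safekeeper's WAL
reader = OnDemandWalReader(
    local={lsn: remote_wal[lsn] for lsn in (8, 9)},   # compute holds only the tail
    fetch_remote=lambda s, e: {lsn: remote_wal[lsn] for lsn in range(s, e)},
)
print(reader.read(6, 10))  # ['rec6', 'rec7', 'rec8', 'rec9']
```

Because fetched records are returned rather than stored, a lagging subscriber costs network round trips instead of local disk space, and nothing has to block writes while a backlog is copied.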
Needs a few minor cosmetic fixes, and then it can be merged.
The direction is right, but there are some places to fix. I'll also try running it to see how it works. Also, pgindent should be run (see the pgindent target in the neon extension Makefile).
Yes, I ran pgindent locally, but it reformatted a lot more than just this PR, so I was going to make a follow-up PR to reformat the whole extension.
The walproposer code should already be decently formatted. You can run pgindent on selected files by slightly adjusting the command the Makefile uses; that's what I did last time.
- Remove the wrong bump of mineLastElectedTerm before the election.
- Fix updating mineLastElectedTerm for sync-safekeepers.
- Add some comments.
- Publish the donor LSN in shmem as well.
- Set seg.ws_tli in remote reads, since xlogreader uses it for reporting, but don't set ws_file, as nothing uses it and we don't have it in the remote case.
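The first item, not bumping mineLastElectedTerm before the election, follows a general rule of term-based elections: a proposer may record a term as its own only after a quorum has voted for it. The sketch below is a simplified, hypothetical Python model of that rule (`Proposer`, `elect` and `votes_granted` are invented for illustration); it is not walproposer's code and ignores the sync-safekeepers path.

```python
from dataclasses import dataclass


@dataclass
class Proposer:
    """Hypothetical, simplified model of term election; not walproposer's actual code."""
    n_safekeepers: int
    mine_last_elected_term: int = 0  # recorded only after an election is won

    def elect(self, safekeeper_terms, votes_granted):
        """Propose a new term; record it as ours only if a quorum voted for it."""
        # Propose a term higher than any term the safekeepers have seen.
        prop_term = max(safekeeper_terms) + 1
        quorum = self.n_safekeepers // 2 + 1
        if votes_granted < quorum:
            # Lost or incomplete election: do not bump mine_last_elected_term.
            return None
        self.mine_last_elected_term = prop_term
        return prop_term


p = Proposer(n_safekeepers=3)
print(p.elect([4, 5, 5], votes_granted=1), p.mine_last_elected_term)  # None 0
print(p.elect([4, 5, 5], votes_granted=2), p.mine_last_elected_term)  # 6 6
```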
I made one more pass, fixing the mineLastElectedTerm handling and doing some cleanup in f28aeaf; please have a look.
It LGTM now, but let's let it soak on staging this week.
Thank you very much for the reviews and the help!
Problem
There's allegedly a bug where connecting a subscriber before WAL has been downloaded from the safekeeper causes an error.
Summary of changes
Adds support for pausing safekeepers from sending WAL to computes, then creates a compute and attaches a subscriber while it's in this paused state. Fails to reproduce the issue, but probably a good test to have.
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>