Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On-demand WAL download for walsender #6872

Merged
merged 2 commits into from
May 6, 2024
Merged

Conversation

save-buffer
Copy link
Contributor

@save-buffer save-buffer commented Feb 22, 2024

Problem

There's allegedly a bug where if we connect a subscriber before WAL is downloaded from the safekeeper, it creates an error.

Summary of changes

Adds support for pausing safekeepers from sending WAL to computes, and then creates a compute and attaches a subscriber while it's in this paused state. Fails to reproduce the issue, but probably a good test to have

@save-buffer save-buffer requested a review from a team as a code owner February 22, 2024 01:18
@save-buffer save-buffer requested review from petuhovskiy and removed request for a team February 22, 2024 01:18
@save-buffer save-buffer marked this pull request as draft February 22, 2024 01:19
@petuhovskiy
Copy link
Member

FYI there are failpoints: https://github.com/neondatabase/neon/blob/main/libs/utils/src/failpoint_support.rs
I never used them, but they look similar to what you are trying to do with REPLICATION_PAUSED

Copy link

github-actions bot commented Feb 22, 2024

2886 tests run: 2759 passed, 0 failed, 127 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_vm_bit_clear_on_heap_lock: debug

Postgres 14

  • test_partial_evict_tenant[relative_spare]: release
  • test_scrubber_tenant_snapshot[4]: debug

Code coverage* (full report)

  • functions: 31.4% (6241 of 19870 functions)
  • lines: 47.1% (46731 of 99180 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
f28aeaf at 2024-05-06T10:38:43.471Z :recycle:

Copy link
Contributor

@arssher arssher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by "Fails to reproduce the issue"; test passes, and before fp is disabled vanilla compute log is full of errors.

safekeeper/src/send_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/send_wal.rs Show resolved Hide resolved
test_runner/regress/test_logical_replication.py Outdated Show resolved Hide resolved
@save-buffer
Copy link
Contributor Author

Yes I've since commented in the issue. Are the logs really an issue? It seems like this problem self-heals when the safekeeper comes back

@arssher
Copy link
Contributor

arssher commented Mar 4, 2024

Yes I've since commented in the issue.

Ah, I see, 6371

Are the logs really an issue?

Nope, I just meant that it indeed reproduces it.

@save-buffer
Copy link
Contributor Author

So is this something worth fixing then, if it self-heals?

@arssher
Copy link
Contributor

arssher commented Mar 4, 2024

So is this something worth fixing then, if it self-heals?

Yes. 1) If subscriber lags a lot it might take a long time to download everything, and currently it blocks writing because walproposer does this 2) compute can run out of disk space.

safekeeper/src/send_wal.rs Outdated Show resolved Hide resolved
@save-buffer save-buffer changed the title Add test to exercise creating subscription before safekeeper ready On-demand WAL download for walsendee Apr 5, 2024
@save-buffer save-buffer changed the title On-demand WAL download for walsendee On-demand WAL download for walsender Apr 5, 2024
Copy link
Member

@petuhovskiy petuhovskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs fixing some minor cosmetic issues and can be merged

Copy link
Contributor

@arssher arssher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direction is right, but there are some places to fix. I'll also try to run and see how it works. Also pgindent should be run (see pgindent target in neon extension makefile).

pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
pgxn/neon/walproposer.c Outdated Show resolved Hide resolved
@save-buffer
Copy link
Contributor Author

Also pgindent should be run

Yes I ran pgindent locally but it reformatted a lot more stuff than just this PR, so I was going to make a followup PR to reformat the whole extension

@arssher
Copy link
Contributor

arssher commented May 1, 2024

Yes I ran pgindent locally but it reformatted a lot more stuff than just this PR

walproposer code should be decent. You can run pgindent on selected files by slightly adjusting the command makefile uses, that's what I did the last time.

pgxn/neon/walsender_hooks.c Outdated Show resolved Hide resolved
- Remove wrong bump of mineLastElectedTerm before election.
- Fix updating mineLastElectedTerm for sync-safekeepers.
- add some comments
- publish donor lsn in shmem as well.
- set seg.ws_tli in remote reads, xlogreader uses it for reporting
- but don't set ws_file as it is not used by anything and we don't have
  it in remote case.
Copy link
Contributor

@arssher arssher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made one more pass, making fix to mineLastElectedTerm and doing some cleanup at
f28aeaf
please have a look.

It LGTM now. But let's let it soak on staging during this week.

@save-buffer save-buffer merged commit 7dd58e1 into main May 6, 2024
52 of 53 checks passed
@save-buffer save-buffer deleted the sasha_replication_tests branch May 6, 2024 17:54
@save-buffer
Copy link
Contributor Author

Thank you very much for the reviews and the help!

conradludgate pushed a commit that referenced this pull request May 8, 2024
## Problem
There's allegedly a bug where if we connect a subscriber before WAL is
downloaded from the safekeeper, it creates an error.

## Summary of changes
Adds support for pausing safekeepers from sending WAL to computes, and
then creates a compute and attaches a subscriber while it's in this
paused state. Fails to reproduce the issue, but probably a good test to
have

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants