Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not produce error if gin page is not restored in redo #7876

Merged
merged 6 commits into from
May 29, 2024
Merged

Conversation

knizhnik
Copy link
Contributor

Problem

See https://github.com/neondatabase/cloud/issues/10845

Summary of changes

Do not report error if GIN page is not restored

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested a review from a team as a code owner May 24, 2024 13:55
@knizhnik knizhnik requested a review from save-buffer May 24, 2024 13:55
@kelvich
Copy link
Contributor

kelvich commented May 24, 2024

Thank you! As usual: can you please add a test that fails without this patch and passes with it

@kelvich
Copy link
Contributor

kelvich commented May 24, 2024

@hlinnaka @lubennikovaav @MMeent any thoughts? 3 one-line patches.

@kelvich
Copy link
Contributor

kelvich commented May 24, 2024

FYI: there is a #7875 that is an alternative solution

Copy link
Contributor

@MMeent MMeent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, though preferably we should check we're on a Neon read replica before deciding not to error out.

Copy link

github-actions bot commented May 24, 2024

3150 tests run: 3017 passed, 0 failed, 133 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_timeline_deletion_with_files_stuck_in_upload_queue: debug

Code coverage* (full report)

  • functions: 31.4% (6490 of 20663 functions)
  • lines: 48.4% (50217 of 103724 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
4236b6a at 2024-05-29T19:17:05.571Z :recycle:

@knizhnik
Copy link
Contributor Author

I think this is fine, though preferably we should check we're on a Neon read replica before deciding not to error out.

I know how to check that:

  1. We are using Neon SMGR
  2. It is read only replica
    If combination of both this checks means that it is "Neon replica", then I can add this check.
    Although it is now quite clear to me why do we need to check of it is Neon in Neon clone of Postgres?

One more moment: this redo code is also used by walredo process at PS. It is definitely not a replica. But we should not also produce error in this case (most likely it will never happen because this WAL record is always applied to target page).

@knizhnik
Copy link
Contributor Author

Sorry, but this PR requires 3 more approvals for Postgres submodules.

@kelvich
Copy link
Contributor

kelvich commented May 26, 2024

Can you please add test that fails without this patch and passes with it.

And +1 on adding a check for replica.

Although it is now quite clear to me why do we need to check of it is Neon in Neon clone of Postgres?

E.g. to not to lose ability to run make check tests.

@knizhnik
Copy link
Contributor Author

Can you please add test that fails without this patch and passes with it.

And +1 on adding a check for replica.

Although it is now quite clear to me why do we need to check of it is Neon in Neon clone of Postgres?

E.g. to not to lose ability to run make check tests.

I first of all thought about checking RecoveryInProgress(). But it has no sense because this function is called only from redo handler (so it is always called in recovery).
Do we special need to check if it is replica?
But first of all I do not know right way to check it:(

There is HotStandbyState standbyState;
and

#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)

But it seems to be not correct because replica has to apply WAL records even before it reaches consistent snapshot.

And do not remember about walredo process: it is not a replica, but we still need to mute this error.

@kelvich
Copy link
Contributor

kelvich commented May 26, 2024

And do not remember about walredo process: it is not a replica, but we still need to mute this error.

hm, do we? is that normal for XLogReadBufferForRedo in wal-redo to return values other than BLK_RESTORED?

RecoveryInProgress() is also used to prohibit RW transactions on replica and to return cannot execute %s in a read-only transaction error. So RecoveryInProgress() && uses_neon_smgr should be okay?

@knizhnik
Copy link
Contributor Author

walredo is using it own filter:

static bool
redo_block_filter(XLogReaderState *record, uint8 block_id)
{
       ...
	/*
	 * If this block isn't one we are currently restoring, then return 'true'
	 * so that this gets ignored
	 */
	return !BUFFERTAGS_EQUAL(target_tag, target_redo_tag);
}

It restores only pages requested for walredo.
In case of WAL records affecting multiple pages, we also skipping some pages.
But XLOG_GIN_VACUUM_PAGE updates only one record so it is not a problem.

Still not sure which check to insert here...

@kelvich
Copy link
Contributor

kelvich commented May 26, 2024

ok, let's skip a check then. but test is still needed -- I think it shouldn't be a problem to reproduce that error?

@knizhnik
Copy link
Contributor Author

ok, let's skip a check then. but test is still needed -- I think it shouldn't be a problem to reproduce that error?

Test added

@kelvich
Copy link
Contributor

kelvich commented May 29, 2024

@neondatabase/compute pls ack it -- there is a customer who needs this fix

Copy link
Contributor

@MMeent MMeent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message in the gin redo code is now incorrect. Please revert the check to != BLK_RESTORED

@knizhnik
Copy link
Contributor Author

The log message in the gin redo code is now incorrect. Please revert the check to != BLK_RESTORED

Sorry, the simpler fix is, the higher probability to make stupid error:(
It was result of two my attempts to fix unlock of invalid buffer which are somehow combined.
Reverted the change in predicate.

@MMeent
Copy link
Contributor

MMeent commented May 29, 2024

Next time, please include links to the PG PRs if you have them, it helps determine the scope of PRs to review

@knizhnik knizhnik merged commit 7ac11d3 into main May 29, 2024
48 checks passed
@knizhnik knizhnik deleted the fix_gin_redo branch May 29, 2024 19:18
bayandin pushed a commit that referenced this pull request May 31, 2024
## Problem

See neondatabase/cloud#10845

## Summary of changes

Do not report error if GIN page is not restored

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants