Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

Closed
hlinnaka opened this issue May 17, 2024 · 7 comments
Assignees
Labels
t/bug Issue Type: Bug

Comments

@hlinnaka
Copy link
Contributor

Regression test failure seen in read replica:

PG:2024-05-15 07:08:31.982 GMT [240369] ERROR:  [NEON_SMGR] [shard 0] could not read block 4925 in rel 1663/12972/16396.0 from page server at lsn 0/0B0AFDD0
PG:2024-05-15 07:08:31.982 GMT [240369] DETAIL:  page server returned error: Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80

@knizhnik can you update this description with any other details, please?

@hlinnaka hlinnaka added t/bug Issue Type: Bug c/cloud/compute labels May 17, 2024
@hlinnaka
Copy link
Contributor Author

In neon_get_request_lsns, we do this:

	if (RecoveryInProgress())
	{
		/* Request the page at the last replayed LSN. */
		result.request_lsn = GetXLogReplayRecPtr(NULL);
		result.not_modified_since = last_written_lsn;
		result.effective_request_lsn = result.request_lsn;
		Assert(last_written_lsn <= result.request_lsn);

		neon_log(DEBUG1, "neon_get_request_lsns request lsn %X/%X, not_modified_since %X/%X",
				 LSN_FORMAT_ARGS(result.request_lsn), LSN_FORMAT_ARGS(result.not_modified_since));
	}

First observation: In a debug build, we should hit that assertion instead of hitting essentially the same check in the pageserver.

I think you could hit that case in the follow scenario:

  1. Startup process starts to replay a record that modifies multiple pages, say 1 and 2.
  2. It locks page 1, applies the changes, updates last-written LSN for page 1, and releases the lock. It gets blocked reading page 2.
  3. Another backend runs a query that uses a lot of buffers, pushing page 1 out of the buffer cache
  4. Another backend runs a query that reads page 1

In step 4, it would see last_written_lsn as already updated for the WAL record's LSN. But because the WAL record hasn't been fully replayed yet, GetXLogReplayRecPtr() returns the LSN before that WAL record.

We have this later in neon_read_at_lsn:


	/*
	 * The redo process does not lock pages that it needs to replay but are
	 * not in the shared buffers, so a concurrent process may request the page
	 * after redo has decided it won't redo that page and updated the LwLSN
	 * for that page. If we're in hot standby we need to take care that we
	 * don't return until after REDO has finished replaying up to that LwLSN,
	 * as the page should have been locked up to that point.
	 *
	 * See also the description on neon_redo_read_buffer_filter below.
	 *
	 * NOTE: It is possible that the WAL redo process will still do IO due to
	 * concurrent failed read IOs. Those IOs should never have a request_lsn
	 * that is as large as the WAL record we're currently replaying, if it
	 * weren't for the behaviour of the LwLsn cache that uses the highest
	 * value of the LwLsn cache when the entry is not found.
	 */
	if (RecoveryInProgress() && !(MyBackendType == B_STARTUP))
		XLogWaitForReplayOf(request_lsns.request_lsn);

That seems to addressing the same issue, but the damage has already been done, because we already set request_lsns.request_lsn to too low a value.

We should create a test case for this. Might be tricky..

@hlinnaka hlinnaka self-assigned this May 17, 2024
@hlinnaka
Copy link
Contributor Author

Ok, I was able to write a python test to reproduce this.

@hlinnaka
Copy link
Contributor Author

I don't have a fix yet, and don't fully understand the problem yet, but looking at the code, I can two things which look a bit suspicious:

  1. In neon_redo_read_buffer_filter, we do this:
	/* In both cases st lwlsn past this WAL record */
	SetLastWrittenLSNForBlock(end_recptr, rinfo, forknum, blkno);

	/*
	 * we don't have the buffer in memory, update lwLsn past this record, also
	 * evict page from file cache
	 */
	if (no_redo_needed)
		lfc_evict(rinfo, forknum, blkno);

This is the function that checks if a page needs to be read into memory in WAL redo to update the in-cache copy of the page, or if we can just skip the page if it's not in cache. It's necessary to update the last-written LSN when the page isn't cached, but it's pretty surprising that we update it also when it is. If the page gets evicted from the page cache between this and the ReadBuffer() call to actually read it into the page cache, will the startup process request the already-updated version of the page from the pageserver? Is that OK? The redo functions can generally handle seeing a page version that's newer than the record we're replaying, because that's normal during crash recovery. But might something get confused if that happens during hot standby?

Also, what if a backend reads the same page, with the already updated last-written LSN, before the startup process does? Is that OK?

Also, because it's just a cache, after we update the last-written cache for one page, we might see that newer value for reads on different pages too. Unlikely in practice, but the cache might overflow at the crucial moment so that the new last-written value makes its way to the per-relation or global last-written LSN value, and is used for a subsequent read of a different page. Maybe that's OK? I'm not sure. If a WAL record modifies two pages, is it possible that a backend would read the old version of one of the pages, and new version of another? That's usually handled by the redo functions acquiring and holding locks on pages in the right order, but might this allow you to see the new version of one page before the redo function has acquired the lock on it?

Could we update the last-written LSN value only when we skip updating the page? Any disadvantage?

In short, this feels wrong, but I cannot put a finger on a exact scenario where it would lead to problems.

  1. neon_wallog_page() does two things: 1. it creates a full-page image WAL record of the page, if necessary. 2. It updates the last-written LSN of the page. There's an early exit from the function if ShutdownRequestPending or XLogInsertAllowed(), which makes sense for the WAL-logging part, but isn't that wrong for updating the last-written LSN? Surely we should still maintain the last-written LSN cache, even if shutdown has been requested? You're going to be shut down soon, but you might still request pages until then. And shouldn't we update the last-written LSN in a standby (XLogInsertAllowed() == false) the same as in a primary? I guess we don't need to, because we update it when the record is replayed already (see point 1), but feels iffy. Would be nice to at least add an assertion there, that the last-written LSN for the page being evicted is already up to date.

@knizhnik
Copy link
Contributor

  1. In neon_redo_read_buffer_filter, we do this:

Originally this code was written by @MMeent and looks slightly different: it set lowland to end_recptr for absent page and to prev_end_recptr for present page. Current version was committed by me in [195d493].

Looking once again at the code and reading your comments, I have to agree that setting lwlsn to end_recptr in both cases is not correct. There is no any lock between setting this lwlsn and reading buffer. So we can fetch more recent version of the page and then get the error trying to apply wal record once again.

Should we ever set lwlsn when page is absent? Usually we update lwlsn when page is evicted in neon_wallog_page. But for replica it doesn't happen because of the following check in neon_wallog_page:

	/* Don't log any pages if we're not allowed to do so. */
	if (!XLogInsertAllowed())
		return;

Please notice that XLogInsertAllowed returns false in recovery mode.
It was the source of the issue #4516 which was intended to be fixed by #4579.
And this check was added by @MMeent in [e6ec240]

It seems to me that we should remove this check and remove assignment of lwlsn for present pages.
But may be it violates some other scenarios. Will be better to get comments from @MMeent...

@knizhnik
Copy link
Contributor

  1. neon_wallog_page() does two things:

I wrote my comment above before reading 2. And come to the same conclusion.

knizhnik pushed a commit that referenced this issue May 18, 2024
@knizhnik
Copy link
Contributor

knizhnik commented May 18, 2024

I have implemented proposed changes in #7794
And it pass test from #7795

hlinnaka added a commit that referenced this issue May 22, 2024
Don't set last-written LSN of a page when the record is replayed, only
when the page is evicted from cache. For comparison, we don't update
the last-written LSN on every page modification on the primary either,
only when the page is evicted. Do update the last-written LSN when the
page update is skipped in WAL redo, however.

In neon_get_request_lsns(), don't be surprised if the last-written LSN
is equal to the record being replayed. Use the LSN of the record being
replayed as the request LSN in that case. Add a long comment
explaining how that can happen.

In neon_wallog_page, update last-written LSN also when Shutdown has
been requested. We might still fetch and evict pages for a while,
after shutdown has been requested, so we better continue to do that
correctly.

Enable the check that we don't evict a page with zero LSN also in
standby, but make it a LOG message instead of PANIC

Fixes issue #7791
hlinnaka added a commit that referenced this issue May 22, 2024
Don't set last-written LSN of a page when the record is replayed, only
when the page is evicted from cache. For comparison, we don't update
the last-written LSN on every page modification on the primary either,
only when the page is evicted. Do update the last-written LSN when the
page update is skipped in WAL redo, however.

In neon_get_request_lsns(), don't be surprised if the last-written LSN
is equal to the record being replayed. Use the LSN of the record being
replayed as the request LSN in that case. Add a long comment
explaining how that can happen.

In neon_wallog_page, update last-written LSN also when Shutdown has
been requested. We might still fetch and evict pages for a while,
after shutdown has been requested, so we better continue to do that
correctly.

Enable the check that we don't evict a page with zero LSN also in
standby, but make it a LOG message instead of PANIC

Fixes issue #7791
hlinnaka added a commit that referenced this issue May 22, 2024
Don't set last-written LSN of a page when the record is replayed, only
when the page is evicted from cache. For comparison, we don't update
the last-written LSN on every page modification on the primary either,
only when the page is evicted. Do update the last-written LSN when the
page update is skipped in WAL redo, however.

In neon_get_request_lsns(), don't be surprised if the last-written LSN
is equal to the record being replayed. Use the LSN of the record being
replayed as the request LSN in that case. Add a long comment
explaining how that can happen.

In neon_wallog_page, update last-written LSN also when Shutdown has
been requested. We might still fetch and evict pages for a while,
after shutdown has been requested, so we better continue to do that
correctly.

Enable the check that we don't evict a page with zero LSN also in
standby, but make it a LOG message instead of PANIC

Fixes issue #7791
@hlinnaka
Copy link
Contributor Author

This was fixed by #7795

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants