Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

hlinnaka · 2024-05-17T12:19:56Z

Regression test failure seen in read replica:

PG:2024-05-15 07:08:31.982 GMT [240369] ERROR:  [NEON_SMGR] [shard 0] could not read block 4925 in rel 1663/12972/16396.0 from page server at lsn 0/0B0AFDD0
PG:2024-05-15 07:08:31.982 GMT [240369] DETAIL:  page server returned error: Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80

@knizhnik can you update this description with any other details, please?

The text was updated successfully, but these errors were encountered:

hlinnaka · 2024-05-17T12:27:42Z

In neon_get_request_lsns, we do this:

	if (RecoveryInProgress())
	{
		/* Request the page at the last replayed LSN. */
		result.request_lsn = GetXLogReplayRecPtr(NULL);
		result.not_modified_since = last_written_lsn;
		result.effective_request_lsn = result.request_lsn;
		Assert(last_written_lsn <= result.request_lsn);

		neon_log(DEBUG1, "neon_get_request_lsns request lsn %X/%X, not_modified_since %X/%X",
				 LSN_FORMAT_ARGS(result.request_lsn), LSN_FORMAT_ARGS(result.not_modified_since));
	}

First observation: In a debug build, we should hit that assertion instead of hitting essentially the same check in the pageserver.

I think you could hit that case in the follow scenario:

Startup process starts to replay a record that modifies multiple pages, say 1 and 2.
It locks page 1, applies the changes, updates last-written LSN for page 1, and releases the lock. It gets blocked reading page 2.
Another backend runs a query that uses a lot of buffers, pushing page 1 out of the buffer cache
Another backend runs a query that reads page 1

In step 4, it would see last_written_lsn as already updated for the WAL record's LSN. But because the WAL record hasn't been fully replayed yet, GetXLogReplayRecPtr() returns the LSN before that WAL record.

We have this later in neon_read_at_lsn:


	/*
	 * The redo process does not lock pages that it needs to replay but are
	 * not in the shared buffers, so a concurrent process may request the page
	 * after redo has decided it won't redo that page and updated the LwLSN
	 * for that page. If we're in hot standby we need to take care that we
	 * don't return until after REDO has finished replaying up to that LwLSN,
	 * as the page should have been locked up to that point.
	 *
	 * See also the description on neon_redo_read_buffer_filter below.
	 *
	 * NOTE: It is possible that the WAL redo process will still do IO due to
	 * concurrent failed read IOs. Those IOs should never have a request_lsn
	 * that is as large as the WAL record we're currently replaying, if it
	 * weren't for the behaviour of the LwLsn cache that uses the highest
	 * value of the LwLsn cache when the entry is not found.
	 */
	if (RecoveryInProgress() && !(MyBackendType == B_STARTUP))
		XLogWaitForReplayOf(request_lsns.request_lsn);

That seems to addressing the same issue, but the damage has already been done, because we already set request_lsns.request_lsn to too low a value.

We should create a test case for this. Might be tricky..

hlinnaka · 2024-05-17T14:34:55Z

Ok, I was able to write a python test to reproduce this.

hlinnaka · 2024-05-17T18:04:18Z

I don't have a fix yet, and don't fully understand the problem yet, but looking at the code, I can two things which look a bit suspicious:

In neon_redo_read_buffer_filter, we do this:

	/* In both cases st lwlsn past this WAL record */
	SetLastWrittenLSNForBlock(end_recptr, rinfo, forknum, blkno);

	/*
	 * we don't have the buffer in memory, update lwLsn past this record, also
	 * evict page from file cache
	 */
	if (no_redo_needed)
		lfc_evict(rinfo, forknum, blkno);

This is the function that checks if a page needs to be read into memory in WAL redo to update the in-cache copy of the page, or if we can just skip the page if it's not in cache. It's necessary to update the last-written LSN when the page isn't cached, but it's pretty surprising that we update it also when it is. If the page gets evicted from the page cache between this and the ReadBuffer() call to actually read it into the page cache, will the startup process request the already-updated version of the page from the pageserver? Is that OK? The redo functions can generally handle seeing a page version that's newer than the record we're replaying, because that's normal during crash recovery. But might something get confused if that happens during hot standby?

Also, what if a backend reads the same page, with the already updated last-written LSN, before the startup process does? Is that OK?

Also, because it's just a cache, after we update the last-written cache for one page, we might see that newer value for reads on different pages too. Unlikely in practice, but the cache might overflow at the crucial moment so that the new last-written value makes its way to the per-relation or global last-written LSN value, and is used for a subsequent read of a different page. Maybe that's OK? I'm not sure. If a WAL record modifies two pages, is it possible that a backend would read the old version of one of the pages, and new version of another? That's usually handled by the redo functions acquiring and holding locks on pages in the right order, but might this allow you to see the new version of one page before the redo function has acquired the lock on it?

Could we update the last-written LSN value only when we skip updating the page? Any disadvantage?

In short, this feels wrong, but I cannot put a finger on a exact scenario where it would lead to problems.

neon_wallog_page() does two things: 1. it creates a full-page image WAL record of the page, if necessary. 2. It updates the last-written LSN of the page. There's an early exit from the function if ShutdownRequestPending or XLogInsertAllowed(), which makes sense for the WAL-logging part, but isn't that wrong for updating the last-written LSN? Surely we should still maintain the last-written LSN cache, even if shutdown has been requested? You're going to be shut down soon, but you might still request pages until then. And shouldn't we update the last-written LSN in a standby (XLogInsertAllowed() == false) the same as in a primary? I guess we don't need to, because we update it when the record is replayed already (see point 1), but feels iffy. Would be nice to at least add an assertion there, that the last-written LSN for the page being evicted is already up to date.

knizhnik · 2024-05-18T11:10:47Z

In neon_redo_read_buffer_filter, we do this:

Originally this code was written by @MMeent and looks slightly different: it set lowland to end_recptr for absent page and to prev_end_recptr for present page. Current version was committed by me in [195d493].

Looking once again at the code and reading your comments, I have to agree that setting lwlsn to end_recptr in both cases is not correct. There is no any lock between setting this lwlsn and reading buffer. So we can fetch more recent version of the page and then get the error trying to apply wal record once again.

Should we ever set lwlsn when page is absent? Usually we update lwlsn when page is evicted in neon_wallog_page. But for replica it doesn't happen because of the following check in neon_wallog_page:

	/* Don't log any pages if we're not allowed to do so. */
	if (!XLogInsertAllowed())
		return;

Please notice that XLogInsertAllowed returns false in recovery mode.
It was the source of the issue #4516 which was intended to be fixed by #4579.
And this check was added by @MMeent in [e6ec240]

It seems to me that we should remove this check and remove assignment of lwlsn for present pages.
But may be it violates some other scenarios. Will be better to get comments from @MMeent...

knizhnik · 2024-05-18T11:13:45Z

neon_wallog_page() does two things:

I wrote my comment above before reading 2. And come to the same conclusion.

refer #7791

knizhnik · 2024-05-18T12:02:35Z

I have implemented proposed changes in #7794
And it pass test from #7795

Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however. In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen. In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly. Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC Fixes issue #7791

hlinnaka · 2024-05-25T07:21:15Z

This was fixed by #7795

hlinnaka added t/bug Issue Type: Bug c/cloud/compute labels May 17, 2024

hlinnaka self-assigned this May 17, 2024

hlinnaka mentioned this issue May 17, 2024

Fix and add test case for issue 7791 #7795

Merged

This was referenced May 17, 2024

zenith_test_evict mode and clear_buffer_cache() invalidate buffers incorrectly. #7802

Closed

Fix race condition in XLogWaitForReplayOf() #7804

Open

knizhnik pushed a commit that referenced this issue May 18, 2024

Update lwlsn for replica too in neon_wallog_page

0a0e834

refer #7791

hlinnaka closed this as completed May 25, 2024

stepashka added the c/compute Component: compute, excluding postgres itself label Jun 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

hlinnaka commented May 17, 2024

hlinnaka commented May 17, 2024

hlinnaka commented May 17, 2024

hlinnaka commented May 17, 2024

knizhnik commented May 18, 2024

knizhnik commented May 18, 2024

knizhnik commented May 18, 2024 •

edited

hlinnaka commented May 25, 2024

Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

Bad request: invalid request with request LSN 0/B0AFDD0 and not_modified_since 0/B0AFE80 #7791

Comments

hlinnaka commented May 17, 2024

hlinnaka commented May 17, 2024

hlinnaka commented May 17, 2024

hlinnaka commented May 17, 2024

knizhnik commented May 18, 2024

knizhnik commented May 18, 2024

knizhnik commented May 18, 2024 • edited

hlinnaka commented May 25, 2024

knizhnik commented May 18, 2024 •

edited