Don't trust unvalidated xl_tot_len.
xl_tot_len comes first in a WAL record.  Usually we don't trust it to be
the true length until we've validated the record header.  If the record
header was split across two pages, previously we wouldn't do the
validation until after we'd already tried to allocate enough memory to
hold the record, which was bad because the length might actually be garbage
bytes from a recycled WAL file, so we could try to allocate a lot of
memory.  Release 15 made it worse.

Since 70b4f82, we'd at least generate an end-of-WAL condition if the
garbage 4-byte value happened to be > 1GB, but otherwise we'd still
bogusly try to allocate up to 1GB of memory.  That was an
improvement, but unfortunately release 15 tries to allocate another
object before that check, so you could get a FATAL error and recovery
could fail.
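
For context, the pre-existing 1GB ceiling behaves roughly like PostgreSQL's
AllocSizeIsValid()/MaxAllocSize check from memutils.h.  A minimal sketch of
that guard follows; the constant matches upstream, but the function name and
shape here are illustrative, not the real API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Mirrors MaxAllocSize from src/include/utils/memutils.h: 1GB minus one. */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

/*
 * A garbage xl_tot_len read from a recycled page is an arbitrary 32-bit
 * value.  A ceiling like this only rejects values above 1GB; any garbage
 * value up to 1GB would still trigger a bogus (if survivable) allocation,
 * which is why the guard alone was not a complete fix.
 */
static bool
alloc_size_is_valid(uint32_t request)
{
	return (size_t) request <= MAX_ALLOC_SIZE;
}
```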

We can fix both variants of the problem more fundamentally using
pre-existing page-level validation, if we just re-order some logic.

The new order of operations in the split-header case defers all memory
allocation based on xl_tot_len until we've read the following page.  At
that point we know that its first few bytes are not recycled data, by
checking its xlp_pageaddr, and that its xlp_rem_len agrees with
xl_tot_len on the preceding page.  That is strong evidence that
xl_tot_len was truly the start of a record that was logged.
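
The cross-check described above can be sketched with simplified stand-ins
for the real WAL structs (the field names xl_tot_len, xlp_pageaddr and
xlp_rem_len match xlogreader's types; everything else — the struct shapes
and the helper — is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for XLogRecord and XLogPageHeaderData. */
typedef struct
{
	uint32_t	xl_tot_len;		/* total record length, first field in WAL */
} FakeXLogRecord;

typedef struct
{
	uint64_t	xlp_pageaddr;	/* address this page claims to start at */
	uint32_t	xlp_rem_len;	/* bytes of a continued record on this page */
} FakeXLogPageHeader;

/*
 * Before trusting xl_tot_len from a header split across a page boundary,
 * require that the following page (a) really belongs at the expected
 * address, i.e. is not data left over from a recycled segment, and
 * (b) reports a continuation length that agrees with xl_tot_len.
 */
static bool
split_header_is_trustworthy(const FakeXLogRecord *rec,
							const FakeXLogPageHeader *next_page,
							uint64_t expected_pageaddr,
							uint32_t bytes_on_first_page)
{
	if (next_page->xlp_pageaddr != expected_pageaddr)
		return false;			/* recycled page: header bytes are garbage */
	if (next_page->xlp_rem_len != rec->xl_tot_len - bytes_on_first_page)
		return false;			/* lengths disagree: don't allocate */
	return true;				/* safe to allocate xl_tot_len bytes now */
}
```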

This problem was most likely to occur on a standby, because
walreceiver.c recycles WAL files without zeroing out trailing regions of
each page.  We could fix that too, but it wouldn't protect us from rare
crash scenarios where the trailing zeroes don't make it to disk.

With reliable xl_tot_len validation in place, the ancient policy of
considering malloc failure to indicate corruption at end-of-WAL seems
quite surprising, but changing that is left for later work.

Also included is a new TAP test to exercise various cases of end-of-WAL
detection by writing contrived data into the WAL from Perl.

Back-patch to 12.  We decided not to put this change into the final
release of 11.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Michael Paquier <michael@paquier.xyz>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code)
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
macdice committed Sep 22, 2023
1 parent 755eb44 commit bae868c
Showing 4 changed files with 569 additions and 56 deletions.
123 changes: 67 additions & 56 deletions src/backend/access/transam/xlogreader.c
@@ -192,6 +192,9 @@ XLogReaderFree(XLogReaderState *state)
* XLOG_BLCKSZ, and make sure it's at least 5*Max(BLCKSZ, XLOG_BLCKSZ) to start
* with. (That is enough for all "normal" records, but very large commit or
* abort records might need more space.)
*
* Note: This routine should *never* be called for xl_tot_len until the header
* of the record has been fully validated.
*/
static bool
allocate_recordbuf(XLogReaderState *state, uint32 reclength)
@@ -201,25 +204,6 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

#ifndef FRONTEND

/*
* Note that in much unlucky circumstances, the random data read from a
* recycled segment can cause this routine to be called with a size
* causing a hard failure at allocation. For a standby, this would cause
* the instance to stop suddenly with a hard failure, preventing it to
* retry fetching WAL from one of its sources which could allow it to move
* on with replay without a manual restart. If the data comes from a past
* recycled segment and is still valid, then the allocation may succeed
* but record checks are going to fail so this would be short-lived. If
* the allocation fails because of a memory shortage, then this is not a
* hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
*/
if (!AllocSizeIsValid(newSize))
return false;

#endif

if (state->readRecordBuf)
pfree(state->readRecordBuf);
state->readRecordBuf =
@@ -669,41 +653,26 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
}
else
{
/* XXX: more validation should be done here */
if (total_len < SizeOfXLogRecord)
{
report_invalid_record(state,
"invalid record length at %X/%X: expected at least %u, got %u",
LSN_FORMAT_ARGS(RecPtr),
(uint32) SizeOfXLogRecord, total_len);
goto err;
}
/* We'll validate the header once we have the next page. */
gotheader = false;
}

/*
* Find space to decode this record. Don't allow oversized allocation if
* the caller requested nonblocking. Otherwise, we *have* to try to
* decode the record now because the caller has nothing else to do, so
* allow an oversized record to be palloc'd if that turns out to be
* necessary.
* Try to find space to decode this record, if we can do so without
* calling palloc. If we can't, we'll try again below after we've
* validated that total_len isn't garbage bytes from a recycled WAL page.
*/
decoded = XLogReadRecordAlloc(state,
total_len,
!nonblocking /* allow_oversized */ );
if (decoded == NULL)
false /* allow_oversized */ );
if (decoded == NULL && nonblocking)
{
/*
* There is no space in the decode buffer. The caller should help
* with that problem by consuming some records.
* There is no space in the circular decode buffer, and the caller is
* only reading ahead. The caller should consume existing records to
* make space.
*/
if (nonblocking)
return XLREAD_WOULDBLOCK;

/* We failed to allocate memory for an oversized record. */
report_invalid_record(state,
"out of memory while trying to decode a record of length %u", total_len);
goto err;
return XLREAD_WOULDBLOCK;
}

len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
@@ -718,16 +687,11 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
assembled = true;

/*
* Enlarge readRecordBuf as needed.
* We always have space for a couple of pages, enough to validate a
* boundary-spanning record header.
*/
if (total_len > state->readRecordBufSize &&
!allocate_recordbuf(state, total_len))
{
/* We treat this as a "bogus data" condition */
report_invalid_record(state, "record length %u at %X/%X too long",
total_len, LSN_FORMAT_ARGS(RecPtr));
goto err;
}
Assert(state->readRecordBufSize >= XLOG_BLCKSZ * 2);
Assert(state->readRecordBufSize >= len);

/* Copy the first fragment of the record from the first page. */
memcpy(state->readRecordBuf,
@@ -824,8 +788,35 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
goto err;
gotheader = true;
}
} while (gotlen < total_len);

/*
* We might need a bigger buffer. We have validated the record
* header, in the case that it split over a page boundary. We've
* also cross-checked total_len against xlp_rem_len on the second
* page, and verified xlp_pageaddr on both.
*/
if (total_len > state->readRecordBufSize)
{
char save_copy[XLOG_BLCKSZ * 2];

/*
* Save and restore the data we already had. It can't be more
* than two pages.
*/
Assert(gotlen <= lengthof(save_copy));
Assert(gotlen <= state->readRecordBufSize);
memcpy(save_copy, state->readRecordBuf, gotlen);
if (!allocate_recordbuf(state, total_len))
{
/* We treat this as a "bogus data" condition */
report_invalid_record(state, "record length %u at %X/%X too long",
total_len, LSN_FORMAT_ARGS(RecPtr));
goto err;
}
memcpy(state->readRecordBuf, save_copy, gotlen);
buffer = state->readRecordBuf + gotlen;
}
} while (gotlen < total_len);
Assert(gotheader);

record = (XLogRecord *) state->readRecordBuf;
@@ -867,6 +858,28 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
}

/*
* If we got here without a DecodedXLogRecord, it means we needed to
	 * validate total_len before trusting it, but by now we've done that.
*/
if (decoded == NULL)
{
Assert(!nonblocking);
decoded = XLogReadRecordAlloc(state,
total_len,
true /* allow_oversized */ );
if (decoded == NULL)
{
/*
* We failed to allocate memory for an oversized record. As
* above, we currently treat this as a "bogus data" condition.
*/
report_invalid_record(state,
"out of memory while trying to decode a record of length %u", total_len);
goto err;
}
}

if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
{
/* Record the location of the next record. */
@@ -895,8 +908,6 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
state->decode_queue_head = decoded;
return XLREAD_SUCCESS;
}
else
return XLREAD_FAIL;

err:
if (assembled)
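
The grow-the-buffer-while-keeping-the-prefix step introduced above can be
sketched in isolation.  The names and the use of realloc() here are
illustrative — the real code copies through a small stack buffer because
allocate_recordbuf() frees the old buffer before allocating the new one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Grow *buf to at least need_len bytes while preserving the first got_len
 * bytes already assembled from earlier pages.  Only called after the
 * record header has been fully validated, so need_len can be trusted.
 */
static bool
grow_preserving_prefix(char **buf, size_t *buf_size,
					   size_t got_len, size_t need_len)
{
	char	   *newbuf;

	if (need_len <= *buf_size)
		return true;			/* already big enough */
	assert(got_len <= *buf_size);
	newbuf = realloc(*buf, need_len);
	if (newbuf == NULL)
		return false;			/* caller reports "record too long" */
	*buf = newbuf;
	*buf_size = need_len;
	return true;
}
```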
41 changes: 41 additions & 0 deletions src/test/perl/PostgreSQL/Test/Utils.pm
@@ -71,6 +71,7 @@ our @EXPORT = qw(
chmod_recursive
check_pg_config
dir_symlink
scan_server_header
system_or_bail
system_log
run_log
@@ -702,6 +703,46 @@ sub chmod_recursive

=pod

=item scan_server_header(header_path, regexp)

Returns an array that stores all the matches of the given regular expression
within the PostgreSQL installation's C<header_path>. This can be used to
retrieve specific value patterns from the installation's header files.

=cut

sub scan_server_header
{
my ($header_path, $regexp) = @_;

my ($stdout, $stderr);
my $result = IPC::Run::run [ 'pg_config', '--includedir-server' ], '>',
\$stdout, '2>', \$stderr
or die "could not execute pg_config";
chomp($stdout);
$stdout =~ s/\r$//;

open my $header_h, '<', "$stdout/$header_path" or die "$!";

my @match = undef;
while (<$header_h>)
{
my $line = $_;

if (@match = $line =~ /^$regexp/)
{
last;
}
}

close $header_h;
die "could not find match in header $header_path\n"
unless @match;
return @match;
}

=pod

=item check_pg_config(regexp)

Return the number of matches of the given regular expression
1 change: 1 addition & 0 deletions src/test/recovery/meson.build
@@ -44,6 +44,7 @@ tests += {
't/036_truncated_dropped.pl',
't/037_invalid_database.pl',
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
],
},
}
