Skip to content

Commit

Permalink
Apply a better fix to mdunlinkfork().
Browse files Browse the repository at this point in the history
Replace the stopgap fix I made in 0e758ae with a cleaner one.

The real problem with 4ab5dae is that it contorted this function's
logic substantially, by introducing a third code path that required
different behavior in the function's main loop.  That seems quite
unnecessary on closer inspection: the new IsBinaryUpgrade case can
just share the behavior of the other immediate-unlink cases.  Hence,
revert 4ab5dae and most of 0e758ae (keeping the latter's
save/restore errno fix), and add IsBinaryUpgrade to the set of
conditions tested to choose immediate unlink.

Also fix some additional places with sloppy handling of errno,
to ensure we have an invariant that we always continue processing
after any non-ENOENT failure of do_truncate.  I doubt that that's
fixing any bug of field importance, so I don't feel it necessary to
back-patch; but we might as well get it right while we're here.

Also improve the comments, which had drifted a bit from what the
code actually does, and neglected to mention some important
considerations.

Back-patch to v15, not because this is fixing any bug but because
it doesn't seem like a good idea for v15's mdunlinkfork logic to be
significantly different from both v14 and v16.

Discussion: https://postgr.es/m/3797575.1667924888@sss.pgh.pa.us
  • Loading branch information
tglsfdc committed Nov 9, 2022
1 parent e70cd16 commit 7b66105
Showing 1 changed file with 45 additions and 37 deletions.
82 changes: 45 additions & 37 deletions src/backend/storage/smgr/md.c
Expand Up @@ -257,11 +257,23 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
* next checkpoint, we prevent reassignment of the relfilenode number until
* it's safe, because relfilenode assignment skips over any existing file.
*
* Additional segments, if any, are truncated and then unlinked. The reason
* for truncating is that other backends may still hold open FDs for these at
* the smgr level, so that the kernel can't remove the file yet. We want to
* reclaim the disk space right away despite that.
*
* We do not need to go through this dance for temp relations, though, because
* we never make WAL entries for temp rels, and so a temp rel poses no threat
* to the health of a regular rel that has taken over its relfilenode number.
* The fact that temp rels and regular rels have different file naming
* patterns provides additional safety.
* patterns provides additional safety. Other backends shouldn't have open
* FDs for them, either.
*
* We also don't do it while performing a binary upgrade. There is no reuse
* hazard in that case, since after a crash or even a simple ERROR, the
* upgrade fails and the whole cluster must be recreated from scratch.
* Furthermore, it is important to remove the files from disk immediately,
* because we may be about to reuse the same relfilenode number.
*
* All the above applies only to the relation's main fork; other forks can
* just be removed immediately, since they are not needed to prevent the
Expand All @@ -274,6 +286,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
* for later, since during redo there's no possibility of creating a
* conflicting relation.
*
* Note: we currently just never warn about ENOENT at all. We could warn in
* the main-fork, non-isRedo case, but it doesn't seem worth the trouble.
*
* Note: any failure should be reported as WARNING not ERROR, because
* we are usually not in a transaction anymore when this is called.
*/
Expand Down Expand Up @@ -319,19 +334,19 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
char *path;
int ret;
BlockNumber segno = 0;
int save_errno;

path = relpath(rnode, forkNum);

/*
* Delete or truncate the first segment.
* Truncate and then unlink the first segment, or just register a request
* to unlink it later, as described in the comments for mdunlink().
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
if (isRedo || IsBinaryUpgrade || forkNum != MAIN_FORKNUM ||
RelFileNodeBackendIsTemp(rnode))
{
if (!RelFileNodeBackendIsTemp(rnode))
{
int save_errno;

/* Prevent other backends' fds from holding on to the disk space */
ret = do_truncate(path);

Expand All @@ -344,63 +359,56 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
ret = 0;

/* Next unlink the file, unless it was already found to be missing */
if (ret == 0 || errno != ENOENT)
if (ret >= 0 || errno != ENOENT)
{
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
{
save_errno = errno;
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
segno++;
errno = save_errno;
}
}
}
else
{
/* Prevent other backends' fds from holding on to the disk space */
ret = do_truncate(path);

/*
* Except during a binary upgrade, register request to unlink first
* segment later, rather than now.
*
* If we're performing a binary upgrade, the dangers described in the
* header comments for mdunlink() do not exist, since after a crash or
* even a simple ERROR, the upgrade fails and the whole new cluster
* must be recreated from scratch. And, on the other hand, it is
* important to remove the files from disk immediately, because we may
* be about to reuse the same relfilenode.
*/
if (!IsBinaryUpgrade)
{
register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
segno++;
}
/* Register request to unlink first segment later */
save_errno = errno;
register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
errno = save_errno;
}

/*
* Delete any remaining segments (we might or might not have dealt with
* the first one above).
* Delete any additional segments.
*
* Note that because we loop until getting ENOENT, we will correctly
* remove all inactive segments as well as active ones. Ideally we'd
* continue the loop until getting exactly that errno, but that risks an
* infinite loop if the problem is directory-wide (for instance, if we
* suddenly can't read the data directory itself). We compromise by
* continuing after a non-ENOENT truncate error, but stopping after any
* unlink error. If there is indeed a directory-wide problem, additional
* unlink attempts wouldn't work anyway.
*/
if (ret >= 0)
if (ret >= 0 || errno != ENOENT)
{
char *segpath = (char *) palloc(strlen(path) + 12);
BlockNumber segno;

/*
* Note that because we loop until getting ENOENT, we will correctly
* remove all inactive segments as well as active ones.
*/
for (;; segno++)
for (segno = 1;; segno++)
{
if (segno == 0)
strcpy(segpath, path);
else
sprintf(segpath, "%s.%u", path, segno);
sprintf(segpath, "%s.%u", path, segno);

if (!RelFileNodeBackendIsTemp(rnode))
{
/*
* Prevent other backends' fds from holding on to the disk
* space.
* space. We're done if we see ENOENT, though.
*/
if (do_truncate(segpath) < 0 && errno == ENOENT)
break;
Expand Down

0 comments on commit 7b66105

Please sign in to comment.