Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in XLogWaitForReplayOf() #7804

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hlinnaka
Copy link
Contributor

ConditionVariablePrepareToSleep has this comment on it:

  • Caution: "before entering the loop" means you must test the exit
  • condition between calling ConditionVariablePrepareToSleep and calling
  • ConditionVariableSleep. If that is inconvenient, omit calling
  • ConditionVariablePrepareToSleep.

We were not obeying that: we did not test the exit condition correctly between the ConditionVariablePrepareToSleep and ConditionVariableSleep calls, because the test that we had in between them only checked the local 'replayRecPtr' variable, without updating it from shared memory.

That wasn't too serious, because the loop includes a 10 second timeout, and would retry and succeed if the original update was missed. Still, better fix it.

To fix, just remove the ConditionVariablePrepareToSleep() call. As the comment says, that's also correct, and even more efficient if we assume that sleeping is rare.

@hlinnaka
Copy link
Contributor Author

I spotted this while debugging #7791, but I don't think this has been causing any noticeable problems in practice.

Copy link

github-actions bot commented May 18, 2024

3078 tests run: 2952 passed, 0 failed, 126 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Postgres 14

  • test_secondary_background_downloads: debug

Code coverage* (full report)

  • functions: 31.3% (6382 of 20401 functions)
  • lines: 47.7% (48630 of 101897 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
8dc1c48 at 2024-05-19T18:55:45.098Z :recycle:

@hlinnaka hlinnaka force-pushed the fix-XLogWaitForReplayOf-race branch from 0d47a19 to 1bb66fe Compare May 18, 2024 20:16
@hlinnaka hlinnaka marked this pull request as ready for review May 19, 2024 17:42
@hlinnaka hlinnaka requested a review from a team as a code owner May 19, 2024 17:42
@hlinnaka hlinnaka requested a review from knizhnik May 19, 2024 17:42
Copy link
Contributor

@knizhnik knizhnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why removing replayRecPtr = GetXLogReplayRecPtr(NULL); assignment in XLogWaitForReplayOf is correct. Can it cause unintentional wait for 10 seconds?

Aslo not quite clear to me why bufmgr fixes are present only pg16 version?

ConditionVariablePrepareToSleep has this comment on it:

> * Caution: "before entering the loop" means you *must* test the exit
> * condition between calling ConditionVariablePrepareToSleep and calling
> * ConditionVariableSleep.  If that is inconvenient, omit calling
> * ConditionVariablePrepareToSleep.

We were not obeying that: we did not test the exit condition correctly
between the ConditionVariablePrepareToSleep and ConditionVariableSleep
calls, because the test that we had in between them only checked the
local 'replayRecPtr' variable, without updating it from shared memory.

That wasn't too serious, because the loop includes a 10 second
timeout, and would retry and succeed if the original update was
missed. Still, better fix it.

To fix, just remove the ConditionVariablePrepareToSleep() call. As the
comment says, that's also correct, and even more efficient if we
assume that sleeping is rare.
@hlinnaka hlinnaka force-pushed the fix-XLogWaitForReplayOf-race branch from 1bb66fe to 8dc1c48 Compare May 19, 2024 18:07
@hlinnaka
Copy link
Contributor Author

I am not sure why removing replayRecPtr = GetXLogReplayRecPtr(NULL); assignment in XLogWaitForReplayOf is correct. Can it cause unintentional wait for 10 seconds?

It seems correct to me. Because I'm also removing the call to ConditionVariablePrepareToSleep in this PR, the ConditionVariableTimedSleep call in the first iteration of the loop will return immediately.

Aslo not quite clear to me why bufmgr fixes are present only pg16 version?

That was a mistake, there were not supposed to be any bufmgr changs for pg16 either. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants