pageserver: fix cont lsn jump on vectored read path #7412
2766 tests run: 2648 passed, 0 failed, 118 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
11a1d23 at 2024-04-18T15:24:36.919Z
The test looks convincing enough, but I cannot make the connection to the actual vectored read path change (`min`).
Does the old `Timeline::get_value_reconstruct_data` have this problem? There, we move to ancestor if `cont_lsn` is less than `ancestor_lsn`, and `cont_lsn` is originally `request_lsn + 1`. Okay, I think these are now aligned.
It's quite tricky to see the alignment between the read paths since they're structured differently. I think going through the example in the cover letter is the best way of understanding the issue. The test produces a very similar setup, albeit with a lot of boilerplate. Let me know if you wish to chat about it - the bug is quite obvious once you know about it :)
I'll merge now to get some validation for the change before next week's release.
Great explanation and test case! It took me a while to understand why the non-vectored path is not suffering from the same bug. I believe it's because `get_reconstruct_data` has this in the loop:
Wouldn't hurt to explicitly test for the same scenario in the non-vectored case too. Or maybe we already have a test for that? It would be useful to cross-check that
Right. The non-vectored read path does not touch the
We already have this cross checking mechanism in place (see For context, I caught this bug with the cross validation mechanism on my wip branch to migrate the get page requests to use the vectored path. The non-vectored path was returning the correct result, but it's a fair point that the test could be updated to exercise the non-vectored read path too.
Problem
Vectored read path may return an image that's newer than the request lsn under certain circumstances.
The vectored read path inspects each ancestor timeline one by one starting from the current one.
When moving into the ancestor timeline, the current code resets the current search lsn (called `cont_lsn` in code) to the lsn of the ancestor timeline (here).
For instance, if the request lsn was 200, we would:
cont_lsn=500
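To make the failure mode concrete, here is a minimal toy sketch of the descent described above. It is not the pageserver code: the function names, the plain `u64` LSNs, the inclusive search bound, and the ancestor images at LSN 100 and 400 are all assumptions made for illustration. Only the request lsn of 200 and `cont_lsn=500` come from the example.

```rust
/// Toy model: return the newest image whose LSN is at or below `cont_lsn`.
/// (Hypothetical helper; real layers are not a flat slice of LSNs.)
fn newest_image_at_or_below(image_lsns: &[u64], cont_lsn: u64) -> Option<u64> {
    image_lsns.iter().copied().filter(|&lsn| lsn <= cont_lsn).max()
}

/// Problematic descent: on entering the ancestor, the search lsn is reset
/// to the ancestor lsn and the request lsn is no longer taken into account.
fn buggy_ancestor_lookup(
    image_lsns: &[u64],
    _request_lsn: u64,
    ancestor_lsn: u64,
) -> Option<u64> {
    let cont_lsn = ancestor_lsn;
    newest_image_at_or_below(image_lsns, cont_lsn)
}

fn main() {
    // Assumed ancestor contents: images at LSN 100 and 400.
    let ancestor_images = [100, 400];
    let found = buggy_ancestor_lookup(&ancestor_images, 200, 500);
    // The image at 400 is newer than the request lsn of 200.
    println!("{:?}", found); // prints "Some(400)"
}
```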
Christian and I find it very unlikely that this has happened in prod, since the vectored read path is always used at the last record lsn.
This issue was found by a regression test during the work to migrate get page handling to use the vectored implementation. I've applied my fix to that wip branch and it fixed the issue.
Summary of changes
The fix is to set the current search lsn to the minimum of the requested lsn and the ancestor lsn.
Hence, at step 2 above we would set the current search lsn to 200 and ignore the images above that.
A test illustrating the bug is also included. It fails without the patch and passes with it.
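The clamping rule can be sketched as follows. This is an illustrative model rather than the actual patch: `search_lsn_for_ancestor` and `visible_images` are hypothetical names, LSNs are plain `u64` values, the bound is treated as inclusive, and the ancestor images at LSN 100 and 400 are assumed for the example.

```rust
use std::cmp::min;

/// Search lsn to use when descending into an ancestor timeline:
/// never look above the request lsn, and never above the ancestor lsn.
fn search_lsn_for_ancestor(request_lsn: u64, ancestor_lsn: u64) -> u64 {
    min(request_lsn, ancestor_lsn)
}

/// Toy model: images that the search is allowed to consider.
fn visible_images(image_lsns: &[u64], cont_lsn: u64) -> Vec<u64> {
    image_lsns.iter().copied().filter(|&lsn| lsn <= cont_lsn).collect()
}

fn main() {
    // Assumed ancestor contents: images at LSN 100 and 400.
    let ancestor_images = [100, 400];

    // Request lsn 200, ancestor lsn 500: the search lsn becomes 200.
    let cont_lsn = search_lsn_for_ancestor(200, 500);
    println!("cont_lsn = {}", cont_lsn); // prints "cont_lsn = 200"

    // The image at 400 is now ignored; only the one at 100 is eligible.
    println!("{:?}", visible_images(&ancestor_images, cont_lsn)); // prints "[100]"
}
```

When the request lsn is above the ancestor lsn, the minimum is the ancestor lsn, so in this model that case behaves the same as before the fix.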