-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] Fix corner case assertion failure in tabix read #9304
Conversation
Better attempt at fixing this issue compared to #9298, will benchmark. |
Never mind. Still doesn't work. |
I think we're in this edge case when |
I do think the solution you've described in the PR message seems better, though, if we can make this work. |
how about this:
|
I don't think in the change you've proposed, |
er, yeah. bufferLen - 1, I think. |
2cb39ba
to
ea811da
Compare
Ok, I think this is better. Every read, we cache both file pointers that can refer to the end of the current block. If we would use the smaller one, use the bigger one instead. |
think change is fine as is. feel like I've addressed the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change needs additional thought. I'm convinced that the previous code was completely wrong when lines are bigger than the buffer (as you pointed out, bufferCursor was never zero) but the behavior of this code isn't quite as simple as I'd like -- I want to write down the invariants about what we're assuming and what we're doing in a comment in each block.
I'm also now having second thoughts about whether using block end / next block offsets is simpler than maintaining a simple invariant, which is that we've never fully exhausted a block when we return from readline().
We also need tests. I think you can probably generate a failing case by concatenating individually-bgzipped lines?
We should also test the case where a GVCF has a big line - you can probably add an info field that's a string, and just append 128k or so of zeroes (will compress well, but span multiple blocks)
bc3d027
to
dd757b1
Compare
Cotton has a vcf he generated for testing BGZipCodec: see BGZipCodecSuite. This has a lot of the edge cases. |
I'll tabix it and see what happens. |
I just realized, |
I'll push my notes up that represent my understanding of how all this works and the key invariants as I see them. |
Thanks, Chris! We can also discuss at 1. |
And my latest changes broke it, but I have a fix, that pretty much does exactly what we both want I think. |
96c6850
to
8c3bcd4
Compare
We have to keep track of the block offsets. We actually can't refresh the buffer, reason being, the stupid empty block at the end of bgzip files. UUUUUUUUGGGGGGGGGGGGGGGGGGGGGGGGGG |
We very sanely skip empty blocks, and also, very sanely decompress the next block after reading the current block. This screws up |
Welp, I'm mad. Sorry for the spam tim. |
The refreshBuffer there throws an exception? |
No, we just automatically end up reading more than we expected. Another possible solution is to make the Investigating. |
We end up getting exceptions like this, happens every time we get to the end of the file:
These virtual offsets correspond to these pairs of absolute|decompressed offset
Looking at these, they differ by 28, which is the size of the empty bgzip block at the end. And so when we refresh: we go too far. |
ack. This stuff is a huge pain. |
This seems like it has to be the solution. I also suspect empty blocks in the middle will be a problem. |
#9309 up to solve these issues. If I can get all the tests to pass, I'll feel pretty good about stacking this on top of that. |
Yes, agreed. |
Hilariously, because this continues to be the most fun(/s), these fixes cannot exist without one another. |
8c3bcd4
to
16ff6de
Compare
Okay. Hopefully this is good. I also pushed an extra branch up with a small debugging class I wrote to decompose a virtual offset in messages. That work is in c323fa3. Let me know if you want to add it. |
16ff6de
to
2afa381
Compare
It doesn't work. Could use some help with handling the BGZip changes along with GenericLines. |
Ack! ok, will look tomorrow, thanks for the hard work on this. |
The exact issue here is double counting some lines in the BGzip tests in particular. |
The tabix line iterator works by building a list of virtual offsets from the index that correspond to a requested interval. We recently discovered an issue that would lead to a runtime exception if the following conditions held: * An offset pair ended exactly on a block boundary. * The block boundary was exactly the start of a line. In a blocked file with virtual offsets, there are two ways to point to the start of every block, (previous block start offset, previous block size) or (current block start offset, 0). Because of the way curOff is calculated in TabixLineIterator, this would lead to a situation where curOff was the (previous block start offset, previous block size) value, causing the jump to next chunk comparision `!TbiOrd.less64(curOff, offsets(i)._2)` to fail when it should succeed, causing an extra line to be read, which then makes the assertion check in the next iteration. The key invariant of TabixLineIterator must be to keep bufferCursor inside the block currently being read. We therefore make a check to see if we need to refresh the buffer in any circumstance where we read a line. Coupled with a reflow of control in readLine to increase readability and fix a bug where lines greater than 64k would not be read properly.
I used something a little more interesting than just a repeated character. Unzip the file if you want to see :P
This assertion is a sanity check to make sure that if the reader reads to the end of a line, it does so exactly. We don't position the reader exactly, so the assertion is invalid in our case. We also don't care, if the offset is a little past the end of a line (as it will basically only happen for empty blocks). We're always going to either stop reading or seek at this point.
2afa381
to
68a604e
Compare
Can't fail the assertion if it's not there to check. We still need the other parts of this change, because otherwise, we were reading too much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, very clear comments. Hopefully this is the last time we'll be dealing with this code for a long while.
Thanks Chris!
The tabix line iterator works by building a list of virtual offsets from
the index that correspond to a requested interval. We recently
discovered an issue that would lead to a runtime exception if the
following conditions held:
In a blocked file with virtual offsets, there are two ways to point to
the start of every block, (previous block start offset, previous block size) or
(current block start offset, 0).
Because of the way curOff is calculated in TabixLineIterator, this would
lead to a situation where curOff was the (previous block start offset, previous block size)
value, causing the jump to next chunk comparision
!TbiOrd.less64(curOff, offsets(i)._2)
to fail when it should succeed, causing an extra line to be read, which
then makes the assertion check in the next iteration.
The key invariant of TabixLineIterator must be to keep bufferCursor
inside the block currently being read. We therefore make a check to see
if we need to refresh the buffer in any circumstance where we read
a line. Coupled with a reflow of control in readLine to increase
readability and fix a bug where lines greater than 64k would not be read
properly.