Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Fix corner case assertion failure in tabix read #9304

Merged
merged 3 commits into from
Aug 20, 2020

Conversation

chrisvittal
Copy link
Collaborator

@chrisvittal chrisvittal commented Aug 18, 2020

The tabix line iterator works by building a list of virtual offsets from
the index that correspond to a requested interval. We recently
discovered an issue that would lead to a runtime exception if the
following conditions held:

  • An offset pair ended exactly on a block boundary.
  • The block boundary was exactly the start of a line.

In a blocked file with virtual offsets, there are two ways to point to
the start of every block, (previous block start offset, previous block size) or
(current block start offset, 0).

Because of the way curOff is calculated in TabixLineIterator, this would
lead to a situation where curOff was the (previous block start offset, previous block size)
value, causing the jump to next chunk comparision !TbiOrd.less64(curOff, offsets(i)._2)
to fail when it should succeed, causing an extra line to be read, which
then makes the assertion check in the next iteration.

The key invariant of TabixLineIterator must be to keep bufferCursor
inside the block currently being read. We therefore make a check to see
if we need to refresh the buffer in any circumstance where we read
a line. Coupled with a reflow of control in readLine to increase
readability and fix a bug where lines greater than 64k would not be read
properly.

@chrisvittal
Copy link
Collaborator Author

Better attempt at fixing this issue compared to #9298, will benchmark.

@chrisvittal
Copy link
Collaborator Author

Never mind. Still doesn't work.

@tpoterba
Copy link
Contributor

I think we're in this edge case when bufferCursor == bufferLen where we're returning a str. We could set a flag and use that in the control flow below?

@tpoterba
Copy link
Contributor

I do think the solution you've described in the PR message seems better, though, if we can make this work.

@tpoterba
Copy link
Contributor

how about this:

@@ -400,7 +400,10 @@ final class TabixLineIterator(
         }
       } else {
         val str = decodeString(start, bufferCursor)
-        bufferCursor += 1
+        if (bufferCursor == bufferLen) {
+          refreshBuffer(0)
+          bufferCursor = 0
+        } else bufferCursor += 1
         return str
       }
     }

@chrisvittal chrisvittal reopened this Aug 19, 2020
@chrisvittal
Copy link
Collaborator Author

I don't think in the change you've proposed, buferCursor == bufferLen is ever true. That branch is taken when bufferCursor < bufLen.

@tpoterba
Copy link
Contributor

er, yeah. bufferLen - 1, I think.

@chrisvittal
Copy link
Collaborator Author

Ok, I think this is better. Every read, we cache both file pointers that can refer to the end of the current block. If we would use the smaller one, use the bigger one instead.

@chrisvittal chrisvittal marked this pull request as ready for review August 19, 2020 13:38
@chrisvittal chrisvittal dismissed tpoterba’s stale review August 19, 2020 13:50

think change is fine as is. feel like I've addressed the comments.

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change needs additional thought. I'm convinced that the previous code was completely wrong when lines are bigger than the buffer (as you pointed out, bufferCursor was never zero) but the behavior of this code isn't quite as simple as I'd like -- I want to write down the invariants about what we're assuming and what we're doing in a comment in each block.

I'm also now having second thoughts about whether using block end / next block offsets is simpler than maintaining a simple invariant, which is that we've never fully exhausted a block when we return from readline().

We also need tests. I think you can probably generate a failing case by concatenating individually-bgzipped lines?

We should also test the case where a GVCF has a big line - you can probably add an info field that's a string, and just append 128k or so of zeroes (will compress well, but span multiple blocks)

@tpoterba
Copy link
Contributor

Cotton has a vcf he generated for testing BGZipCodec: see BGZipCodecSuite. This has a lot of the edge cases.

@chrisvittal
Copy link
Collaborator Author

Cotton has a vcf he generated for testing BGZipCodec: see BGZipCodecSuite. This has a lot of the edge cases.

I'll tabix it and see what happens.

@chrisvittal
Copy link
Collaborator Author

I just realized, refreshBuffer is generally always accompanied by a (possibly implicit), bufferCursor = start, so that's a simplification that I'm going to try.

@chrisvittal
Copy link
Collaborator Author

I'll push my notes up that represent my understanding of how all this works and the key invariants as I see them.

@tpoterba
Copy link
Contributor

Thanks, Chris! We can also discuss at 1.

@chrisvittal
Copy link
Collaborator Author

And my latest changes broke it, but I have a fix, that pretty much does exactly what we both want I think.

@chrisvittal
Copy link
Collaborator Author

We have to keep track of the block offsets. We actually can't refresh the buffer, reason being, the stupid empty block at the end of bgzip files.

UUUUUUUUGGGGGGGGGGGGGGGGGGGGGGGGGG

@chrisvittal
Copy link
Collaborator Author

We very sanely skip empty blocks, and also, very sanely decompress the next block after reading the current block. This screws up virtualFileOffsetAtLastRead

@chrisvittal
Copy link
Collaborator Author

Welp, I'm mad. Sorry for the spam tim.

@tpoterba
Copy link
Contributor

We have to keep track of the block offsets. We actually can't refresh the buffer, reason being, the stupid empty block at the end of bgzip files.

The refreshBuffer there throws an exception?

@chrisvittal
Copy link
Collaborator Author

No, we just automatically end up reading more than we expected.

Another possible solution is to make the read methods on bgzipinputstream lazy so that they don't decompress the next block until it's requested, but then we might not handle empty blocks in the middle of the file properly.

Investigating.

@chrisvittal
Copy link
Collaborator Author

We end up getting exceptions like this, happens every time we get to the end of the file:

RuntimeException: error reading tabix-indexed file src/test/resources/gvcfs/recoding/HG00187.hg38.g.vcf.gz: i=0, curOff=2469855232, expected=2468020224

These virtual offsets correspond to these pairs of absolute|decompressed offset

2469855232 => 37687|0
2468020224 => 37659|0

Looking at these, they differ by 28, which is the size of the empty bgzip block at the end. And so when we refresh: we go too far.

@tpoterba
Copy link
Contributor

ack. This stuff is a huge pain.

@tpoterba
Copy link
Contributor

Another possible solution is to make the read methods on bgzipinputstream lazy so that they don't decompress the next block until it's requested, but then we might not handle empty blocks in the middle of the file properly.

This seems like it has to be the solution. I also suspect empty blocks in the middle will be a problem.

@chrisvittal
Copy link
Collaborator Author

#9309 up to solve these issues. If I can get all the tests to pass, I'll feel pretty good about stacking this on top of that.

@tpoterba
Copy link
Contributor

#9309 up to solve these issues. If I can get all the tests to pass, I'll feel pretty good about stacking this on top of that.

Yes, agreed.

@chrisvittal
Copy link
Collaborator Author

Hilariously, because this continues to be the most fun(/s), these fixes cannot exist without one another.

@chrisvittal
Copy link
Collaborator Author

Okay. Hopefully this is good. I also pushed an extra branch up with a small debugging class I wrote to decompose a virtual offset in messages. That work is in c323fa3. Let me know if you want to add it.

@chrisvittal
Copy link
Collaborator Author

It doesn't work. Could use some help with handling the BGZip changes along with GenericLines.

@tpoterba
Copy link
Contributor

Ack! ok, will look tomorrow, thanks for the hard work on this.

@chrisvittal
Copy link
Collaborator Author

The exact issue here is double counting some lines in the BGzip tests in particular.

The tabix line iterator works by building a list of virtual offsets from
the index that correspond to a requested interval. We recently
discovered an issue that would lead to a runtime exception if the
following conditions held:

* An offset pair ended exactly on a block boundary.
* The block boundary was exactly the start of a line.

In a blocked file with virtual offsets, there are two ways to point to
the start of every block, (previous block start offset, previous block size) or
(current block start offset, 0).

Because of the way curOff is calculated in TabixLineIterator, this would
lead to a situation where curOff was the (previous block start offset, previous block size)
value, causing the jump to next chunk comparision `!TbiOrd.less64(curOff, offsets(i)._2)`
to fail when it should succeed, causing an extra line to be read, which
then makes the assertion check in the next iteration.

The key invariant of TabixLineIterator must be to keep bufferCursor
inside the block currently being read. We therefore make a check to see
if we need to refresh the buffer in any circumstance where we read
a line. Coupled with a reflow of control in readLine to increase
readability and fix a bug where lines greater than 64k would not be read
properly.
I used something a little more interesting than just a repeated
character. Unzip the file if you want to see :P
This assertion is a sanity check to make sure that if the reader reads
to the end of a line, it does so exactly. We don't position the reader
exactly, so the assertion is invalid in our case. We also don't care, if
the offset is a little past the end of a line (as it will basically only
happen for empty blocks). We're always going to either stop reading or
seek at this point.
@chrisvittal
Copy link
Collaborator Author

Can't fail the assertion if it's not there to check. We still need the other parts of this change, because otherwise, we were reading too much.

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, very clear comments. Hopefully this is the last time we'll be dealing with this code for a long while.

Thanks Chris!

@danking danking merged commit da03a4b into hail-is:main Aug 20, 2020
@tpoterba tpoterba mentioned this pull request Aug 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants