Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Rewrite VCF INFO Parser to not use htsjdk #5828

Merged
merged 16 commits into from Apr 13, 2019

Conversation

@chrisvittal
Copy link
Collaborator

@chrisvittal chrisvittal commented Apr 9, 2019

We should now never call into htsjdk during line parsing, only during header parsing.

@chrisvittal chrisvittal changed the title Better vcf parsing [hail] Rewrite VCF INFO Parser to not use htsjdk Apr 9, 2019
@chrisvittal chrisvittal requested a review from jigold Apr 9, 2019
@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
Copy link
Collaborator

@tpoterba tpoterba Apr 9, 2019

this feels like the wrong place for hasRSID - shouldn't it be handled in the same place as the filter/qual?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

No. It doesn’t appear in the same place in a vcf line as filter and qual

Copy link
Collaborator

@jigold jigold Apr 9, 2019

Is the issue that you have to parse the ID before REF and ALT?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

Yes

@cseed
Copy link
Collaborator

@cseed cseed commented Apr 9, 2019

Any performance numbers?

@chrisvittal chrisvittal force-pushed the better-vcf-parsing branch from 3928789 to a89928f Apr 9, 2019
@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Apr 9, 2019

If there is an improvement, it is very small, running more tests now, but it looks like the new code is in the noise compared to other factors.

@chrisvittal chrisvittal dismissed tpoterba’s stale review Apr 9, 2019

I believe rsid parsing is taking place in the correct spot. Other code was dead.

@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Apr 9, 2019

More extensive cloud tests are showing speedups in the combiner pipeline compared to master, about 15-20 seconds per partition, but that adds up quickly at scale.

It was just a data class at that point.
Copy link
Collaborator

@jigold jigold left a comment

I did a careful first pass. Giving you my feedback for now. Will look at it more carefully a second time since there's a lot of changes here.

true
else {
val c = line(p)
c == '\t' || c == ';' || c == ','
Copy link
Collaborator

@jigold jigold Apr 9, 2019

When would a comma be the end of an array field?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

I copied the name from endFormatArrayField, it ends an array element.

Copy link
Collaborator

@jigold jigold Apr 10, 2019

That's really confusing.

@@ -93,16 +120,31 @@ final class VCFLine(val line: String, arrayElementsRequired: Boolean) {
}
}

def endFilterArrayField(p: Int): Boolean = endInfoField
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Is this for column 7 of the VCF? Piggy backing on the info field parsing?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

They're the same, I wanted this name to make sense when parsing FILTERS.

def nextField(): Unit = {
if (pos == line.length)
parseError("unexpected end of line")
assert(line(pos) == '\t')
pos += 1 // tab
}

def nextInfoField(): Unit = {
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Do you need a hasNextInfoField?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

No, the error handling could probably be cleaner, but if the assertion below fails then it's not a valid VCF. We carefully control when the next functions are called during parsing so we don't need to ask if there is a next field.

def parseAddInfoInt(rvb: RegionValueBuilder) {
if (!infoFieldMissing()) {
rvb.setPresent()
rvb.addInt(parseInfoInt())
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Is the default to set these to missing in the rvb? I feel like there should be an rvb.setMissing() here somewhere.

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

There is. I set everything to missing in parseAddInfo below.

def parseAddInfoFloat(rvb: RegionValueBuilder) {
if (!infoFieldMissing()) {
rvb.setPresent()
rvb.addFloat(parseInfoString().toFloat)
Copy link
Collaborator

@jigold jigold Apr 9, 2019

What happens if these are parse errors? Will it be an incomprehensible error message? Is this what we do for parsing the genotype fields?

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Can you get rid of the code above and just keep this while loop?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

No, The first key is special as it may be . to indicate that the whole INFO record is missing.

Copy link
Collaborator

@jigold jigold Apr 10, 2019

Sorry I meant this:

    if (infoType.hasField(key)) {
       rvb.setFieldIndex(infoType.fieldIdx(key))
       if (infoFlagFieldNames.contains(key))
         rvb.addBoolean(true)
       else
         parseAddInfoField(rvb, infoType.fieldType(key))
     }
     skipInfoField()

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 10, 2019

this is the alternative, thoughts?

    var key = parseInfoKey()
    if (key == ".") {
      if (endField()) {
        rvb.endStruct()
        return
      } else
        parseError(s"invalid INFO key $key")
    }

    while (!endField()) {
      if (key == ".") {
        parseError(s"invalid INFO key $key")
      }
      if (infoType.hasField(key)) {
        rvb.setFieldIndex(infoType.fieldIdx(key))
        if (infoFlagFieldNames.contains(key))
          rvb.addBoolean(true)
        else
          parseAddInfoField(rvb, infoType.fieldType(key))
      }
      skipInfoField()
      if (!endField()) {
        nextInfoField()
        key = parseInfoKey()
      }
    }

Copy link
Collaborator

@jigold jigold Apr 11, 2019

This seems better. So we just skip a field silently if it's not in the infoType?

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Is the issue that you have to parse the ID before REF and ALT?

@@ -813,6 +1161,7 @@ object LoadVCF {
arrayElementsRequired: Boolean,
skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Is this for the prune fields push down?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

Yes, it is possible that the field won't be present in the final matrix and so we just need to skip over it.

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)
Copy link
Collaborator

@jigold jigold Apr 9, 2019

Should this not be just set to missing?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

htsjdk doesn't, it sets it to -10

Copy link
Collaborator

@jigold jigold Apr 10, 2019

That's annoying.

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN
Copy link
Collaborator

@jigold jigold Apr 9, 2019

I don't think your float parsing handles these correctly.

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

You are correct. Right now it handles it exactly as java would, to be properly VCF 4.3 compliant, we would need to handle, NaN, Inf, and -Inf. Which we don't even handle properly right now, just their lower case versions.

Copy link
Collaborator

@jigold jigold Apr 10, 2019

Can you make an issue then?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 10, 2019

Futhermore, VCF 4.2 makes no claims about the string representation of floats at all. VCF is garbage.

Copy link
Collaborator Author

@chrisvittal chrisvittal left a comment

Left some replies.

else
parseAddInfoField(rvb, infoType.fieldType(key))
}
skipInfoField()
Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

It mirrors skipFormatField

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

No, The first key is special as it may be . to indicate that the whole INFO record is missing.

@@ -813,6 +1161,7 @@ object LoadVCF {
arrayElementsRequired: Boolean,
skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")
Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

Yes, it is possible that the field won't be present in the final matrix and so we just need to skip over it.

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

Yes

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)
Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

htsjdk doesn't, it sets it to -10

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN
Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 9, 2019

You are correct. Right now it handles it exactly as java would, to be properly VCF 4.3 compliant, we would need to handle, NaN, Inf, and -Inf. Which we don't even handle properly right now, just their lower case versions.

@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Apr 9, 2019

@cseed

import hail as hl
hl.import_vcf(PATH, reference_genome='GRCh38').write('/tmp/vcfmt', overwrite=True)

On my laptop with a warmish filesystem cache takes 1 minute for this code and 1:20 for current master. That's actually pretty good.

Copy link
Collaborator

@jigold jigold left a comment

Do we have tests that are complicated enough to fully test these changes? Maybe a complicated VEP signature?

true
else {
val c = line(p)
c == '\t' || c == ';' || c == ','
Copy link
Collaborator

@jigold jigold Apr 10, 2019

That's really confusing.

}

def parseStringInfoArrayElement() {
if (infoArrayFieldMissing()) {
Copy link
Collaborator

@jigold jigold Apr 10, 2019

ok.

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator

@jigold jigold Apr 10, 2019

Sorry I meant this:

    if (infoType.hasField(key)) {
       rvb.setFieldIndex(infoType.fieldIdx(key))
       if (infoFlagFieldNames.contains(key))
         rvb.addBoolean(true)
       else
         parseAddInfoField(rvb, infoType.fieldType(key))
     }
     skipInfoField()

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)
Copy link
Collaborator

@jigold jigold Apr 10, 2019

That's annoying.

@@ -1320,3 +1707,15 @@ class VCFsReader(
}
}
}

class BufferedLineIterator(bit: BufferedIterator[String]) extends htsjdk.tribble.readers.LineIterator {
Copy link
Collaborator

@jigold jigold Apr 10, 2019

Can you move this to the top of the file?

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 10, 2019

sure

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN
Copy link
Collaborator

@jigold jigold Apr 10, 2019

Can you make an issue then?

if (endInfoArrayField())
parseError("empty integer")
var mul = 1
if (line(pos) == '-') {
Copy link
Collaborator

@jigold jigold Apr 10, 2019

ok.

@chrisvittal chrisvittal dismissed jigold’s stale review Apr 10, 2019

Please reply to discussion of key parsing. Otherwise, I think this is good. I'll work on tests.

chrisvittal added 2 commits Apr 11, 2019
The MatrixTables checked in here were the result of using the old parser
to read vcfs and write matrix tables.
@chrisvittal chrisvittal force-pushed the better-vcf-parsing branch from af5ac2f to c7d1a63 Apr 11, 2019
@chrisvittal
Copy link
Collaborator Author

@chrisvittal chrisvittal commented Apr 11, 2019

I added some old vs. new tests.

@@ -289,6 +289,22 @@ def test_import_vcfs_subset(self):
self.assertTrue(vcf2._same(filter1))
self.assertEqual(len(parts), vcf2.n_partitions())

def test_vcf_parser_golden_master(self):
# the three matrix tables referenced here were generated using the old VCF parser
# parser
Copy link
Collaborator

@jigold jigold Apr 11, 2019

delete

Copy link
Collaborator Author

@chrisvittal chrisvittal Apr 11, 2019

ok, I still feel that knowing how these files were generated if this test ever fails will be useful.

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##FORMAT=<ID=CNL,Number=.,Type=Integer>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
Copy link
Collaborator

@jigold jigold Apr 11, 2019

Does this cover all possible field types? For backwards compatibility add the fields with nans etc. What about adding an example with DB=0? I see arrays of doubles? Do we also need arrays of strings? Can we also get a completely missing INFO field .? Just trying to make sure we got all possible code paths.

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator

@jigold jigold Apr 11, 2019

This seems better. So we just skip a field silently if it's not in the infoType?

Copy link
Collaborator

@jigold jigold left a comment

This looks pretty good. See comments.

jigold
jigold approved these changes Apr 12, 2019
@danking danking merged commit 2571917 into hail-is:master Apr 13, 2019
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants