Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Rewrite VCF INFO Parser to not use htsjdk #5828

Merged
merged 16 commits into from Apr 13, 2019

Conversation

Projects
None yet
5 participants
@chrisvittal
Copy link
Collaborator

commented Apr 9, 2019

We should now never call into htsjdk during line parsing, only during header parsing.

@chrisvittal chrisvittal changed the title Better vcf parsing [hail] Rewrite VCF INFO Parser to not use htsjdk Apr 9, 2019

@chrisvittal chrisvittal requested a review from jigold Apr 9, 2019

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 9, 2019

Collaborator

this feels like the wrong place for hasRSID - shouldn't it be handled in the same place as the filter/qual?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

No. It doesn’t appear in the same place in a vcf line as filter and qual

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Is the issue that you have to parse the ID before REF and ALT?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

Yes

Show resolved Hide resolved hail/src/main/scala/is/hail/io/vcf/HtsjdkRecordReader.scala Outdated
@cseed

This comment has been minimized.

Copy link
Collaborator

commented Apr 9, 2019

Any performance numbers?

@chrisvittal chrisvittal force-pushed the chrisvittal:better-vcf-parsing branch from 3928789 to a89928f Apr 9, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 9, 2019

If there is an improvement, it is very small, running more tests now, but it looks like the new code is in the noise compared to other factors.

I believe rsid parsing is taking place in the correct spot. Other code was dead.

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 9, 2019

More extensive cloud tests are showing speedups in the combiner pipeline compared to master, about 15-20 seconds per partition, but that adds up quickly at scale.

Remove HtsjdkRecordReader
It was just a data class at that point.
@jigold
Copy link
Collaborator

left a comment

I did a careful first pass. Giving you my feedback for now. Will look at it more carefully a second time since there's a lot of changes here.

true
else {
val c = line(p)
c == '\t' || c == ';' || c == ','

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

When would a comma be the end of an array field?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

I copied the name from endFormatArrayField, it ends an array element.

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

That's really confusing.

@@ -93,16 +120,31 @@ final class VCFLine(val line: String, arrayElementsRequired: Boolean) {
}
}

def endFilterArrayField(p: Int): Boolean = endInfoField

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Is this for column 7 of the VCF? Piggy backing on the info field parsing?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

They're the same, I wanted this name to make sense when parsing FILTERS.

def nextField(): Unit = {
if (pos == line.length)
parseError("unexpected end of line")
assert(line(pos) == '\t')
pos += 1 // tab
}

def nextInfoField(): Unit = {

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Do you need a hasNextInfoField?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

No, the error handling could probably be cleaner, but if the assertion below fails then it's not a valid VCF. We carefully control when the next functions are called during parsing so we don't need to ask if there is a next field.

def parseAddInfoInt(rvb: RegionValueBuilder) {
if (!infoFieldMissing()) {
rvb.setPresent()
rvb.addInt(parseInfoInt())

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Is the default to set these to missing in the rvb? I feel like there should be an rvb.setMissing() here somewhere.

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

There is. I set everything to missing in parseAddInfo below.

def parseAddInfoFloat(rvb: RegionValueBuilder) {
if (!infoFieldMissing()) {
rvb.setPresent()
rvb.addFloat(parseInfoString().toFloat)

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

What happens if these are parse errors? Will it be an incomprehensible error message? Is this what we do for parsing the genotype fields?

skipInfoField()

while (!endField()) {
nextInfoField()

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Can you get rid of the code above and just keep this while loop?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

No, The first key is special as it may be . to indicate that the whole INFO record is missing.

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

Sorry I meant this:

    if (infoType.hasField(key)) {
       rvb.setFieldIndex(infoType.fieldIdx(key))
       if (infoFlagFieldNames.contains(key))
         rvb.addBoolean(true)
       else
         parseAddInfoField(rvb, infoType.fieldType(key))
     }
     skipInfoField()

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 10, 2019

Author Collaborator

this is the alternative, thoughts?

    var key = parseInfoKey()
    if (key == ".") {
      if (endField()) {
        rvb.endStruct()
        return
      } else
        parseError(s"invalid INFO key $key")
    }

    while (!endField()) {
      if (key == ".") {
        parseError(s"invalid INFO key $key")
      }
      if (infoType.hasField(key)) {
        rvb.setFieldIndex(infoType.fieldIdx(key))
        if (infoFlagFieldNames.contains(key))
          rvb.addBoolean(true)
        else
          parseAddInfoField(rvb, infoType.fieldType(key))
      }
      skipInfoField()
      if (!endField()) {
        nextInfoField()
        key = parseInfoKey()
      }
    }

This comment has been minimized.

Copy link
@jigold

jigold Apr 11, 2019

Collaborator

This seems better. So we just skip a field silently if it's not in the infoType?

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Is the issue that you have to parse the ID before REF and ALT?

@@ -813,6 +1161,7 @@ object LoadVCF {
arrayElementsRequired: Boolean,
skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Is this for the prune fields push down?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

Yes, it is possible that the field won't be present in the final matrix and so we just need to skip over it.

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

Should this not be just set to missing?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

htsjdk doesn't, it sets it to -10

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

That's annoying.

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN

This comment has been minimized.

Copy link
@jigold

jigold Apr 9, 2019

Collaborator

I don't think your float parsing handles these correctly.

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

You are correct. Right now it handles it exactly as java would, to be properly VCF 4.3 compliant, we would need to handle, NaN, Inf, and -Inf. Which we don't even handle properly right now, just their lower case versions.

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

Can you make an issue then?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 10, 2019

Author Collaborator

Futhermore, VCF 4.2 makes no claims about the string representation of floats at all. VCF is garbage.

@chrisvittal
Copy link
Collaborator Author

left a comment

Left some replies.

else
parseAddInfoField(rvb, infoType.fieldType(key))
}
skipInfoField()

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

It mirrors skipFormatField

skipInfoField()

while (!endField()) {
nextInfoField()

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

No, The first key is special as it may be . to indicate that the whole INFO record is missing.

@@ -813,6 +1161,7 @@ object LoadVCF {
arrayElementsRequired: Boolean,
skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

Yes, it is possible that the field won't be present in the final matrix and so we just need to skip over it.

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

Yes

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

htsjdk doesn't, it sets it to -10

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 9, 2019

Author Collaborator

You are correct. Right now it handles it exactly as java would, to be properly VCF 4.3 compliant, we would need to handle, NaN, Inf, and -Inf. Which we don't even handle properly right now, just their lower case versions.

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 9, 2019

@cseed

import hail as hl
hl.import_vcf(PATH, reference_genome='GRCh38').write('/tmp/vcfmt', overwrite=True)

On my laptop with a warmish filesystem cache takes 1 minute for this code and 1:20 for current master. That's actually pretty good.

@jigold
Copy link
Collaborator

left a comment

Do we have tests that are complicated enough to fully test these changes? Maybe a complicated VEP signature?

true
else {
val c = line(p)
c == '\t' || c == ';' || c == ','

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

That's really confusing.

}

def parseStringInfoArrayElement() {
if (infoArrayFieldMissing()) {

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

ok.

skipInfoField()

while (!endField()) {
nextInfoField()

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

Sorry I meant this:

    if (infoType.hasField(key)) {
       rvb.setFieldIndex(infoType.fieldIdx(key))
       if (infoFlagFieldNames.contains(key))
         rvb.addBoolean(true)
       else
         parseAddInfoField(rvb, infoType.fieldType(key))
     }
     skipInfoField()
if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

That's annoying.

@@ -1320,3 +1707,15 @@ class VCFsReader(
}
}
}

class BufferedLineIterator(bit: BufferedIterator[String]) extends htsjdk.tribble.readers.LineIterator {

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

Can you move this to the top of the file?

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 10, 2019

Author Collaborator

sure

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

Can you make an issue then?

if (endInfoArrayField())
parseError("empty integer")
var mul = 1
if (line(pos) == '-') {

This comment has been minimized.

Copy link
@jigold

jigold Apr 10, 2019

Collaborator

ok.

Please reply to discussion of key parsing. Otherwise, I think this is good. I'll work on tests.

chrisvittal added some commits Apr 10, 2019

Refactor common code, add old vs new output checking
The MatrixTables checked in here were the result of using the old parser
to read vcfs and write matrix tables.

@chrisvittal chrisvittal force-pushed the chrisvittal:better-vcf-parsing branch from af5ac2f to c7d1a63 Apr 11, 2019

@chrisvittal

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 11, 2019

I added some old vs. new tests.

@@ -289,6 +289,22 @@ def test_import_vcfs_subset(self):
self.assertTrue(vcf2._same(filter1))
self.assertEqual(len(parts), vcf2.n_partitions())

def test_vcf_parser_golden_master(self):
# the three matrix tables referenced here were generated using the old VCF parser
# parser

This comment has been minimized.

Copy link
@jigold

jigold Apr 11, 2019

Collaborator

delete

This comment has been minimized.

Copy link
@chrisvittal

chrisvittal Apr 11, 2019

Author Collaborator

ok, I still feel that knowing how these files were generated if this test ever fails will be useful.

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##FORMAT=<ID=CNL,Number=.,Type=Integer>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

This comment has been minimized.

Copy link
@jigold

jigold Apr 11, 2019

Collaborator

Does this cover all possible field types? For backwards compatibility add the fields with nans etc. What about adding an example with DB=0? I see arrays of doubles? Do we also need arrays of strings? Can we also get a completely missing INFO field .? Just trying to make sure we got all possible code paths.

skipInfoField()

while (!endField()) {
nextInfoField()

This comment has been minimized.

Copy link
@jigold

jigold Apr 11, 2019

Collaborator

This seems better. So we just skip a field silently if it's not in the infoType?

@jigold
Copy link
Collaborator

left a comment

This looks pretty good. See comments.

@jigold

jigold approved these changes Apr 12, 2019

@danking danking merged commit 2571917 into hail-is:master Apr 13, 2019

1 check passed

hail-ci-0-1 successful build
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.