Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Rewrite VCF INFO Parser to not use htsjdk #5828

Merged
merged 16 commits into from
Apr 13, 2019

Conversation

chrisvittal
Copy link
Collaborator

We should now never call into htsjdk during line parsing, only during header parsing.

@chrisvittal chrisvittal changed the title Better vcf parsing [hail] Rewrite VCF INFO Parser to not use htsjdk Apr 9, 2019
@chrisvittal chrisvittal requested a review from jigold April 9, 2019 01:04
@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like the wrong place for hasRSID - shouldn't it be handled in the same place as the filter/qual?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It doesn’t appear in the same place in a vcf line as filter and qual

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue that you have to parse the ID before REF and ALT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@cseed
Copy link
Collaborator

cseed commented Apr 9, 2019

Any performance numbers?

@chrisvittal
Copy link
Collaborator Author

If there is an improvement, it is very small, running more tests now, but it looks like the new code is in the noise compared to other factors.

@chrisvittal chrisvittal dismissed tpoterba’s stale review April 9, 2019 16:25

I believe rsid parsing is taking place in the correct spot. Other code was dead.

@chrisvittal
Copy link
Collaborator Author

More extensive cloud tests are showing speedups in the combiner pipeline compared to master, about 15-20 seconds per partition, but that adds up quickly at scale.

It was just a data class at that point.
Copy link
Collaborator

@jigold jigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a careful first pass. Giving you my feedback for now. Will look at it more carefully a second time since there's a lot of changes here.

true
else {
val c = line(p)
c == '\t' || c == ';' || c == ','
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would a comma be the end of an array field?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the name from endFormatArrayField, it ends an array element.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's really confusing.

@@ -93,16 +120,31 @@ final class VCFLine(val line: String, arrayElementsRequired: Boolean) {
}
}

def endFilterArrayField(p: Int): Boolean = endInfoField
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for column 7 of the VCF? Piggy backing on the info field parsing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're the same, I wanted this name to make sense when parsing FILTERS.

def nextField(): Unit = {
if (pos == line.length)
parseError("unexpected end of line")
assert(line(pos) == '\t')
pos += 1 // tab
}

def nextInfoField(): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need a hasNextInfoField?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the error handling could probably be cleaner, but if the assertion below fails then it's not a valid VCF. We carefully control when the next functions are called during parsing so we don't need to ask if there is a next field.

def parseAddInfoInt(rvb: RegionValueBuilder) {
if (!infoFieldMissing()) {
rvb.setPresent()
rvb.addInt(parseInfoInt())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the default to set these to missing in the rvb? I feel like there should be an rvb.setMissing() here somewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is. I set everything to missing in parseAddInfo below.

def parseAddInfoFloat(rvb: RegionValueBuilder) {
if (!infoFieldMissing()) {
rvb.setPresent()
rvb.addFloat(parseInfoString().toFloat)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if these are parse errors? Will it be an incomprehensible error message? Is this what we do for parsing the genotype fields?

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you get rid of the code above and just keep this while loop?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, The first key is special as it may be . to indicate that the whole INFO record is missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant this:

    if (infoType.hasField(key)) {
       rvb.setFieldIndex(infoType.fieldIdx(key))
       if (infoFlagFieldNames.contains(key))
         rvb.addBoolean(true)
       else
         parseAddInfoField(rvb, infoType.fieldType(key))
     }
     skipInfoField()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the alternative, thoughts?

    var key = parseInfoKey()
    if (key == ".") {
      if (endField()) {
        rvb.endStruct()
        return
      } else
        parseError(s"invalid INFO key $key")
    }

    while (!endField()) {
      if (key == ".") {
        parseError(s"invalid INFO key $key")
      }
      if (infoType.hasField(key)) {
        rvb.setFieldIndex(infoType.fieldIdx(key))
        if (infoFlagFieldNames.contains(key))
          rvb.addBoolean(true)
        else
          parseAddInfoField(rvb, infoType.fieldType(key))
      }
      skipInfoField()
      if (!endField()) {
        nextInfoField()
        key = parseInfoKey()
      }
    }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems better. So we just skip a field silently if it's not in the infoType?

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue that you have to parse the ID before REF and ALT?

@@ -813,6 +1161,7 @@ object LoadVCF {
arrayElementsRequired: Boolean,
skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for the prune fields push down?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is possible that the field won't be present in the final matrix and so we just need to skip over it.

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be just set to missing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

htsjdk doesn't, it sets it to -10

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's annoying.

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think your float parsing handles these correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Right now it handles it exactly as java would, to be properly VCF 4.3 compliant, we would need to handle, NaN, Inf, and -Inf. Which we don't even handle properly right now, just their lower case versions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make an issue then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Futhermore, VCF 4.2 makes no claims about the string representation of floats at all. VCF is garbage.

Copy link
Collaborator Author

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some replies.

else
parseAddInfoField(rvb, infoType.fieldType(key))
}
skipInfoField()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mirrors skipFormatField

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, The first key is special as it may be . to indicate that the whole INFO record is missing.

@@ -813,6 +1161,7 @@ object LoadVCF {
arrayElementsRequired: Boolean,
skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is possible that the field won't be present in the final matrix and so we just need to skip over it.

@@ -831,7 +1181,7 @@ object LoadVCF {
val vcfLine = new VCFLine(line, arrayElementsRequired)
rvb.start(t.physicalType)
rvb.startStruct()
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

htsjdk doesn't, it sets it to -10

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Right now it handles it exactly as java would, to be properly VCF 4.3 compliant, we would need to handle, NaN, Inf, and -Inf. Which we don't even handle properly right now, just their lower case versions.

@chrisvittal
Copy link
Collaborator Author

@cseed

import hail as hl
hl.import_vcf(PATH, reference_genome='GRCh38').write('/tmp/vcfmt', overwrite=True)

On my laptop with a warmish filesystem cache takes 1 minute for this code and 1:20 for current master. That's actually pretty good.

jigold
jigold previously requested changes Apr 10, 2019
Copy link
Collaborator

@jigold jigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have tests that are complicated enough to fully test these changes? Maybe a complicated VEP signature?

true
else {
val c = line(p)
c == '\t' || c == ';' || c == ','
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's really confusing.

}

def parseStringInfoArrayElement() {
if (infoArrayFieldMissing()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant this:

    if (infoType.hasField(key)) {
       rvb.setFieldIndex(infoType.fieldIdx(key))
       if (infoFlagFieldNames.contains(key))
         rvb.addBoolean(true)
       else
         parseAddInfoField(rvb, infoType.fieldType(key))
     }
     skipInfoField()

if (c.hasQual) {
val qstr = l.parseString()
if (qstr == ".")
rvb.addDouble(-10.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's annoying.

@@ -1320,3 +1707,15 @@ class VCFsReader(
}
}
}

class BufferedLineIterator(bit: BufferedIterator[String]) extends htsjdk.tribble.readers.LineIterator {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the top of the file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

case (s: String, _: TFloat64) =>
val d = s match {
case "nan" => Double.NaN
case "-nan" => Double.NaN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make an issue then?

if (endInfoArrayField())
parseError("empty integer")
var mul = 1
if (line(pos) == '-') {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@chrisvittal chrisvittal dismissed jigold’s stale review April 10, 2019 21:41

Please reply to discussion of key parsing. Otherwise, I think this is good. I'll work on tests.

The MatrixTables checked in here were the result of using the old parser
to read vcfs and write matrix tables.
@chrisvittal
Copy link
Collaborator Author

I added some old vs. new tests.

@@ -289,6 +289,22 @@ def test_import_vcfs_subset(self):
self.assertTrue(vcf2._same(filter1))
self.assertEqual(len(parts), vcf2.n_partitions())

def test_vcf_parser_golden_master(self):
# the three matrix tables referenced here were generated using the old VCF parser
# parser
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I still feel that knowing how these files were generated if this test ever fails will be useful.

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##FORMAT=<ID=CNL,Number=.,Type=Integer>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cover all possible field types? For backwards compatibility add the fields with nans etc. What about adding an example with DB=0? I see arrays of doubles? Do we also need arrays of strings? Can we also get a completely missing INFO field .? Just trying to make sure we got all possible code paths.

skipInfoField()

while (!endField()) {
nextInfoField()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems better. So we just skip a field silently if it's not in the infoType?

jigold
jigold previously requested changes Apr 11, 2019
Copy link
Collaborator

@jigold jigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. See comments.

@danking danking merged commit 2571917 into hail-is:master Apr 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants