[hail] Rewrite VCF INFO Parser to not use htsjdk #5828
danking merged 16 commits into hail-is:master from
  rvb.start(t.physicalType)
  rvb.startStruct()
- present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, skipInvalidLoci)
+ present = vcfLine.parseAddVariant(rvb, rgBc.map(_.value), contigRecoding, hasRSID, skipInvalidLoci)
This feels like the wrong place for hasRSID. Shouldn't it be handled in the same place as the filter/qual?
No. It doesn't appear in the same place in a VCF line as filter and qual.
Is the issue that you have to parse the ID before REF and ALT?
Any performance numbers?
3928789 to a89928f
If there is an improvement, it is very small. I'm running more tests now, but the new code looks to be in the noise compared to other factors.
I believe rsid parsing is taking place in the correct spot. Other code was dead.
More extensive cloud tests are showing speedups in the combiner pipeline compared to master: about 15-20 seconds per partition, which adds up quickly at scale.
It was just a data class at that point.
jigold left a comment
I did a careful first pass. Giving you my feedback for now. I will look at it more carefully a second time, since there are a lot of changes here.
  true
else {
  val c = line(p)
  c == '\t' || c == ';' || c == ','
When would a comma be the end of an array field?
I copied the name from endFormatArrayField; the comma ends an array element.
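A minimal Python sketch of the delimiter check discussed above, for readers who want to see why a comma counts as an end marker. The names `end_info_array_field` and `split_info_array` are hypothetical, and the first branch (end of line) is an assumption about the part of the Scala condition that the diff context cuts off.

```python
def end_info_array_field(line, p):
    """Return True if position p ends an INFO array element."""
    if p == len(line):  # assumed first branch: ran off the end of the line
        return True
    # Tab ends the INFO column, ';' ends the key=value pair, ',' ends one element.
    return line[p] in "\t;,"


def split_info_array(line, start):
    """Collect comma-separated elements from `start`; return (elements, stop position)."""
    elements = []
    pos = start
    while True:
        begin = pos
        while not end_info_array_field(line, pos):
            pos += 1
        elements.append(line[begin:pos])
        if pos < len(line) and line[pos] == ",":
            pos += 1  # step over the element separator and keep going
        else:
            return elements, pos  # stopped on ';', tab, or end of line
```

The same predicate serves both the element scan and the field scan, which is why the comma appears alongside the tab and semicolon.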
  }
}

def endFilterArrayField(p: Int): Boolean = endInfoField
Is this for column 7 of the VCF? Piggybacking on the INFO field parsing?
They're the same; I wanted this name to make sense when parsing FILTERS.
  pos += 1 // tab
}

def nextInfoField(): Unit = {
Do you need a hasNextInfoField?
No. The error handling could probably be cleaner, but if the assertion below fails, then it's not a valid VCF. We carefully control when the next functions are called during parsing, so we don't need to ask whether there is a next field.
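To illustrate the point about not needing a hasNext check, here is a rough Python sketch of a cursor whose `next_info_field` simply fails on malformed input. `InfoCursor` and its method names are hypothetical stand-ins, not the Scala class from this PR.

```python
class InfoCursor:
    """Hypothetical cursor over one VCF line; `pos` indexes into `line`."""

    def __init__(self, line, pos=0):
        self.line = line
        self.pos = pos

    def end_field(self):
        # A column ends at a tab or at the end of the line.
        return self.pos == len(self.line) or self.line[self.pos] == "\t"

    def skip_info_field(self):
        # Advance to the ';' ending this key=value pair, or to the end of the column.
        while not self.end_field() and self.line[self.pos] != ";":
            self.pos += 1

    def next_info_field(self):
        # Only legal when sitting on ';'; anything else means a malformed line.
        if self.end_field() or self.line[self.pos] != ";":
            raise ValueError(f"expected ';' at position {self.pos}")
        self.pos += 1
```

Because the caller checks `end_field()` before calling `next_info_field()`, the failure path is only reached for an invalid line.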
def parseAddInfoInt(rvb: RegionValueBuilder) {
  if (!infoFieldMissing()) {
    rvb.setPresent()
    rvb.addInt(parseInfoInt())
Is the default to set these to missing in the rvb? I feel like there should be an rvb.setMissing() here somewhere.
There is. I set everything to missing in parseAddInfo below.
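The "missing by default" pattern described here can be sketched in a few lines of Python. This is only an illustration: a plain list with `None` slots stands in for the RegionValueBuilder, and `new_info_row` and `parse_add_info_int` are hypothetical names.

```python
def new_info_row(n_fields):
    """Start with every INFO slot missing (None), as parseAddInfo is said to do."""
    return [None] * n_fields


def parse_add_info_int(row, index, text):
    """Fill slot `index` only when the field is not the VCF missing marker '.'."""
    if text != ".":
        row[index] = int(text)
```

With this layout, the per-type helpers never need an explicit "set missing" branch; a slot that is never written stays missing.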
def parseAddInfoFloat(rvb: RegionValueBuilder) {
  if (!infoFieldMissing()) {
    rvb.setPresent()
    rvb.addFloat(parseInfoString().toFloat)
What happens if these are parse errors? Will it be an incomprehensible error message? Is this what we do for parsing the genotype fields?
skipInfoField()

while (!endField()) {
  nextInfoField()
Can you get rid of the code above and just keep this while loop?
No, the first key is special, as it may be "." to indicate that the whole INFO record is missing.
Sorry, I meant this:
if (infoType.hasField(key)) {
  rvb.setFieldIndex(infoType.fieldIdx(key))
  if (infoFlagFieldNames.contains(key))
    rvb.addBoolean(true)
  else
    parseAddInfoField(rvb, infoType.fieldType(key))
}
skipInfoField()
This is the alternative. Thoughts?
var key = parseInfoKey()
if (key == ".") {
  if (endField()) {
    rvb.endStruct()
    return
  } else
    parseError(s"invalid INFO key $key")
}
while (!endField()) {
  if (key == ".") {
    parseError(s"invalid INFO key $key")
  }
  if (infoType.hasField(key)) {
    rvb.setFieldIndex(infoType.fieldIdx(key))
    if (infoFlagFieldNames.contains(key))
      rvb.addBoolean(true)
    else
      parseAddInfoField(rvb, infoType.fieldType(key))
  }
  skipInfoField()
  if (!endField()) {
    nextInfoField()
    key = parseInfoKey()
  }
}
This seems better. So we just skip a field silently if it's not in the infoType?
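For readers following the key-parsing discussion, a simplified Python sketch of the behavior under debate: a lone "." means the whole INFO column is missing, "." anywhere else is an error, flags become true, and keys outside the requested type are skipped silently. It splits strings instead of walking a cursor, and `parse_info`, `info_types`, and `flag_names` are hypothetical names rather than anything in the PR.

```python
def parse_info(info_text, info_types, flag_names):
    """Parse an INFO column into a dict; `info_types` maps key -> converter."""
    row = {name: None for name in info_types}  # every field starts missing
    if info_text == ".":
        return row  # the whole INFO record is missing
    for item in info_text.split(";"):
        key, _, value = item.partition("=")
        if key == ".":
            raise ValueError(f"invalid INFO key {key}")
        if key not in info_types:
            continue  # field not requested: skip it silently
        if key in flag_names:
            row[key] = True  # flags carry no value; presence means true
        elif value != ".":
            row[key] = info_types[key](value)  # '.' leaves the field missing
    return row
```

The silent skip is what lets a pruned type (for example, one without a given INFO key) reuse the same parser without extra branches.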
  arrayElementsRequired: Boolean,
  skipInvalidLoci: Boolean
): ContextRDD[RVDContext, RegionValue] = {
  val hasRSID = t.isInstanceOf[TStruct] && t.asInstanceOf[TStruct].hasField("rsid")
Is this for the prune fields push down?
Yes. The field may not be present in the final matrix, in which case we just need to skip over it.
if (c.hasQual) {
  val qstr = l.parseString()
  if (qstr == ".")
    rvb.addDouble(-10.0)
Should this not be just set to missing?
htsjdk doesn't; it sets it to -10.
case (s: String, _: TFloat64) =>
  val d = s match {
    case "nan" => Double.NaN
    case "-nan" => Double.NaN
I don't think your float parsing handles these correctly.
You are correct. Right now it handles these exactly as Java would. To be properly VCF 4.3 compliant, we would need to handle NaN, Inf, and -Inf, which we don't handle properly right now; we only handle their lowercase versions.
Can you make an issue then?
Furthermore, VCF 4.2 makes no claims about the string representation of floats at all. VCF is garbage.
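As a sketch of what case-insensitive handling of the special values might look like, here is a small Python function. `parse_vcf_float` and the exact set of accepted spellings are assumptions for illustration; the thread itself only names NaN, Inf, and -Inf and their lowercase forms.

```python
import math

# Assumed spellings, matched case-insensitively after stripping an optional sign.
_SPECIALS = {"nan": math.nan, "inf": math.inf, "infinity": math.inf}


def parse_vcf_float(text):
    """Parse a VCF float, accepting special values regardless of letter case."""
    lowered = text.lower()
    sign = 1.0
    if lowered.startswith(("+", "-")):
        sign = -1.0 if lowered[0] == "-" else 1.0
        lowered = lowered[1:]
    if lowered in _SPECIALS:
        return sign * _SPECIALS[lowered]
    return float(text)  # ordinary decimals; raises ValueError on garbage
```

Python's own `float` already accepts these spellings, so the explicit table is there to mirror what a JVM parser would need: as far as I know, Java's `Double.parseDouble` recognizes only the exact spellings `NaN` and `Infinity`.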
chrisvittal left a comment
Left some replies.
  else
    parseAddInfoField(rvb, infoType.fieldType(key))
}
skipInfoField()
It mirrors skipFormatField
import hail as hl
hl.import_vcf(PATH, reference_genome='GRCh38').write('/tmp/vcfmt', overwrite=True)
On my laptop with a warmish filesystem cache, this takes 1 minute with this code and 1:20 on current master. That's actually pretty good.
jigold left a comment
Do we have tests that are complicated enough to fully test these changes? Maybe a complicated VEP signature?
}

def parseStringInfoArrayElement() {
  if (infoArrayFieldMissing()) {
  }
}

class BufferedLineIterator(bit: BufferedIterator[String]) extends htsjdk.tribble.readers.LineIterator {
Can you move this to the top of the file?
if (endInfoArrayField())
  parseError("empty integer")
var mul = 1
if (line(pos) == '-') {
Please reply to the discussion of key parsing. Otherwise, I think this is good. I'll work on tests.
The MatrixTables checked in here were the result of using the old parser to read VCFs and write matrix tables.
af5ac2f to c7d1a63
I added some old vs. new tests.
def test_vcf_parser_golden_master(self):
    # the three matrix tables referenced here were generated using the old VCF parser
    # parser
OK. I still feel that knowing how these files were generated will be useful if this test ever fails.
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##FORMAT=<ID=CNL,Number=.,Type=Integer>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
Does this cover all possible field types? For backwards compatibility, add the fields with NaNs, etc. What about adding an example with DB=0? I see arrays of doubles; do we also need arrays of strings? Can we also get a completely missing INFO field (.)? Just trying to make sure we cover all possible code paths.
jigold left a comment
This looks pretty good. See comments.
We should now never call into htsjdk during line parsing, only during header parsing.