GIAB VCF import fails because htsjdk "repairs" header in vcf #6012

Closed
marksantcroos opened this issue May 1, 2019 · 3 comments · Fixed by #6013

Comments

@marksantcroos
Contributor

marksantcroos commented May 1, 2019

Hi folks,

While evaluating Hail to see whether it fits my use case (a variant frequency database), I ran into an issue with importing VCF files from GIAB. It turns out that these declare the PS ##FORMAT entry with type String.

Subsequently, Hail fails to import these with the error:

is.hail.utils.HailException: HG001.vcf.gz:column 492: invalid character 'P' in integer literal

This is because htsjdk, by default, "repairs" such header lines to match the VCF "standard", so the PS values are then parsed as integers even though the file declares them as strings.

htsjdk exposes codec.disableOnTheFlyModifications to turn this behaviour off; it could be called from somewhere around https://github.com/hail-is/hail/blob/master/hail/src/main/scala/is/hail/io/vcf/LoadVCF.scala#L1143.
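
To make the failure mode concrete, here is a minimal Python sketch of what I believe is going on. It is not htsjdk's or Hail's actual code: the function names, the `repair` flag (standing in for the htsjdk toggle), and the sample value `PATMAT` (picked only because it starts with 'P', like in the error message) are all assumptions for illustration.

```python
import re

# Types the VCF spec assigns to some reserved FORMAT keys (small subset).
SPEC_FORMAT_TYPES = {"PS": "Integer", "DP": "Integer", "GQ": "Integer"}

FORMAT_RE = re.compile(r"##FORMAT=<ID=([^,]+),Number=([^,]+),Type=([^,]+),")


def read_format_types(header_lines, repair=True):
    """Return {FORMAT id: type}.

    With repair=True, reserved keys are forced to the spec type, which
    mimics the on-the-fly header "repair" (hypothetical stand-in).
    """
    types = {}
    for line in header_lines:
        m = FORMAT_RE.match(line)
        if not m:
            continue
        key, declared = m.group(1), m.group(3)
        if repair and key in SPEC_FORMAT_TYPES:
            declared = SPEC_FORMAT_TYPES[key]
        types[key] = declared
    return types


def parse_sample(format_keys, sample, types):
    """Parse one sample column according to the (possibly repaired) types."""
    out = {}
    for key, raw in zip(format_keys.split(":"), sample.split(":")):
        # int() raises ValueError when a string value meets an Integer type.
        out[key] = int(raw) if types[key] == "Integer" else raw
    return out


header = [
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set">',
]

# Header taken as declared: PS stays a String and the record parses.
as_declared = read_format_types(header, repair=False)
record = parse_sample("GT:PS", "0|1:PATMAT", as_declared)

# Header "repaired": PS becomes Integer and the same record fails.
repaired = read_format_types(header, repair=True)
try:
    parse_sample("GT:PS", "0|1:PATMAT", repaired)
    failed = False
except ValueError:
    failed = True
```

With the repair switched off the header is trusted as written, which is the behaviour I would like to be able to select.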

Ideally I would also like to expose this toggle in Hail's import_vcf method.
I'll create a PR to do so ASAP.

Comments/questions?

Thanks!

Regards,

Mark

@tpoterba
Contributor

tpoterba commented May 1, 2019

Oh, we've been mad about this for a long time: #2822

If you can turn this behavior off, we're happy to accept that PR!

@tpoterba
Contributor

tpoterba commented May 1, 2019

We're also pretty close to getting rid of htsjdk entirely, but this is a good immediate solution.

@marksantcroos
Contributor Author

Ah, I missed #2822 in my search. That's why I included as many strings that might match search queries as I could in my ticket, for future generations :-)

Anyway, the PR is there now. Please be gentle, these are my first lines of Scala ...
