vcf_filter converts reference, sample, and vcfProcessLog fields incorrectly #83

Closed
cbare opened this Issue Jan 10, 2013 · 4 comments

@cbare
cbare commented Jan 10, 2013

When vcf_filter writes the filtered VCF file, it seems to use Python-style serialization for some of the header fields, rather than the proper VCF format. For example, a ##reference field that looked like this in the original file:

##reference=<ID="hg36.1 m5:9ebc6df9496613f373e73396d5b3b6b6 sp:homo sapiens",source="http://www.hgsc.bcm.tmc.edu/collaborations/human-reference/hsap36.1-hg18.fasta">

...ends up looking like this in the output:

##reference={'source': '"http://www.hgsc.bcm.tmc.edu/collaborations/human-reference/hsap36.1-hg18.fasta"', 'ID': '"hg36.1 m5:9ebc6df9496613f373e73396d5b3b6b6 sp:homo sapiens"'}

SAMPLE is similarly affected, as is the ##vcfProcessLog field that gets added.

For reference, the command line I used to call vcf_filter was:

vcf_filter.py examples/example1.vcf dps --depth-per-sample 7 > tmp.vcf
@cbare
cbare commented Jan 10, 2013

I came up with an easy repro with one of the example vcf files included in PyVCF/vcf/test:

vcf_filter.py vcf/test/example-4.1-bnd.vcf sq --site-quality 30 > tmp.vcf

Now, note that tmp.vcf has these headers:

##PEDIGREE={'Derived': '"Tumor"', 'Original': '"Germline"'}
##SAMPLE={'SampleName': ''}
##SAMPLE={'SampleName': ''}
@martijnvermaat martijnvermaat added a commit to martijnvermaat/PyVCF that referenced this issue Jan 10, 2013
@martijnvermaat martijnvermaat Correctly write meta lines with dictionary value
Write meta lines with a dictionary-like value as

    ##meta=<field=value,field=value,...>

instead of as the Python dictionary string representation. This is a
fix for jamescasbon#83 and a generalization of jamescasbon#81. A
regression compared to jamescasbon#81 is that the order of fields in
a `contig` line is no longer defined.
9d43fa9
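The difference between the two output styles described in the commit message can be sketched as follows. This is only an illustration, not the code from the commit: `format_meta_line` is a hypothetical helper name, and the values are assumed to keep their surrounding double quotes, as they appear in the output reported above.

```python
def format_meta_line(key, value):
    """Hypothetical helper (not PyVCF's writer) that formats one meta line."""
    if isinstance(value, dict):
        # Dictionary-like values become <field=value,field=value,...>
        fields = ",".join("%s=%s" % (k, v) for k, v in value.items())
        return "##%s=<%s>" % (key, fields)
    # Plain string values are written unchanged
    return "##%s=%s" % (key, value)


# Values keep their double quotes, as in the headers reported in this issue
pedigree = {"Derived": '"Tumor"', "Original": '"Germline"'}

# Formatting the dict directly yields its Python repr, which is the reported bug
naive = "##PEDIGREE=%s" % pedigree

# Formatting field by field yields the VCF-style line
fixed = format_meta_line("PEDIGREE", pedigree)
```

Here `naive` starts with `##PEDIGREE={`, while `fixed` uses the angle-bracket form. The field order in `fixed` follows the dictionary's iteration order, which is why the commit message notes that the order is no longer defined for a plain dictionary.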
@cbare
cbare commented Jan 10, 2013

Nice work, Martijn!

What do people think of using OrderedDict in _parser.vcf_metadata_parser.read_meta_hash() to preserve the ordering of the key/value pairs? I guess this is only cosmetic, but for SAMPLEs, people usually put ID first, which is nice.

Also, would allowing dictionaries in the reference field cause any problems? TCGA allows that in their VCF specification (See: Table 2: Examples of generic meta-information fields), like so:

##reference=<ID=hg18, Source=file://seq/references/1000GenomesPilot-NCBI36.fasta>
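A minimal sketch of the OrderedDict idea, assuming a simplified parser: `read_meta_hash_sketch` is a hypothetical stand-in, not PyVCF's actual `read_meta_hash`, and its regular expressions only handle simple unquoted or double-quoted values.

```python
import re
from collections import OrderedDict


def read_meta_hash_sketch(line):
    """Hypothetical stand-in for read_meta_hash; returns (key, ordered fields)."""
    match = re.match(r"##(\w+)=<(.*)>\s*$", line)
    if match is None:
        raise ValueError("not a dictionary-style meta line: %r" % line)
    key, body = match.groups()

    fields = OrderedDict()
    # A value is either a double-quoted string (which may contain commas)
    # or a run of characters up to the next comma.
    for item in re.finditer(r'([^=,\s]+)=("[^"]*"|[^,]*)', body):
        fields[item.group(1)] = item.group(2).strip()
    return key, fields


key, fields = read_meta_hash_sketch(
    "##reference=<ID=hg18, Source=file://seq/references/1000GenomesPilot-NCBI36.fasta>"
)
```

With an OrderedDict, iterating over `fields` gives `ID` before `Source`, matching the input line, so a writer could emit the fields in their original order. On Python 3.7 and later a plain dict also preserves insertion order, so the explicit OrderedDict mainly matters on older interpreters.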
@martijnvermaat
Collaborator

Good suggestion, I added it to the pull request.

Apart from that, as you show above, the SAMPLE meta lines are not parsed correctly (so they also cannot be written back correctly). Another issue should be created for that.

@jamescasbon
Owner

Thanks for the report and the pull request.

I'm assuming this is fixed; please reopen if not.

@gotgenes gotgenes pushed a commit to gotgenes/PyVCF that referenced this issue May 13, 2014
@martijnvermaat martijnvermaat Correctly write meta lines with dictionary value
dc1a367