Bug in vcf.utils.walk_together(..) even when .vcfs are sorted correctly #140

Closed
bw2 opened this Issue Feb 6, 2014 · 5 comments

Comments

Projects
None yet
3 participants

bw2 commented Feb 6, 2014

Thanks for creating this library and continuing to improve it. I just came across the following issue. walk_together uses a purely alphabetical ordering of chromosome names and so isn't aware that chr9 < chr10 or that chrY < chrM. As a result, it almost always produces unexpected behavior which, even when discovered, isn't useful in any situations I'm aware of, so I'm posting it as a bug.

For example, if you pass it 2 vcf files that have 4 variants each:

file1.vcf:
chr9 100 ...
chr9 200 ...
chr10 300 ...
chr10 400 ...

file2.vcf:
chr9 100 ...
chr9 210 ...
chr10 300 ...
chr10 410 ...

it will emit:
(file1-record1, file2-record2) # expected
(file1-record2, None) # expected
(file1-record3, None) # expecting (None, file2-record2) instead
(file1-record4, None) # expecting (file1-record3, file2-record3) instead
(None, file2-record2)
(None, file2-record3)
(None, file2-record4)

I'm not sure about the best way to make this function aware of the user's desired contig ordering, but as a simple solution it could take a list of contigs in their expected order (and then only emit records in those contigs).

Also, I think it would be nice if walk_together could raise an exception if any of the VCFs have conflicting contig orders.

-Ben

Ps. Below is a demonstration with 2 .vcf files with 17 variants each:

test1.vcf:
image

test2.vcf:
image

results for walk_together(test1, test2):
image

bw2 pushed a commit to bw2/PyVCF that referenced this issue Feb 6, 2014

Fix for issue #140, added vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.

bw2 pushed a commit to bw2/PyVCF that referenced this issue Feb 6, 2014

Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.
Owner

jamescasbon commented Feb 6, 2014

See #20

bow commented Feb 6, 2014

Just as a side note that I hope can be useful:

We've also just figured this out in our team (@ffinfo has the credit here). We've been trying to isolate a weird behavior in one of our scripts and it came down to this.

What we found out was that the ordering follows a regular string ordering due to the way __cmp__ is implemented here.

In short:

>>> ('chr9', 200) < ('chr9', 210)
True
>>> ('chr10', 300) < ('chr9', 210)
True
>>> 'chr10' < 'chr9'
True

So one way custom ordering can be implemented is if we allow the __cmp__ function to be overwritten or for the user to supply their own sorting function.

Alternatively, we can also allow the custom function to be plugged into walk_together instead. The __cmp__ behavior affects min in here, so perhaps there is a way to allow min to be replaced by something else?

bw2 commented Feb 6, 2014

@bow, pull request #143 is based on the same idea - it allows the user to define a vcf_record_sort_key which is used to order vcf records similar to how the 'key' arg in used by python sorted(..) method.

bw2 pushed a commit to bw2/PyVCF that referenced this issue Feb 7, 2014

jamescasbon pushed a commit that referenced this issue Feb 10, 2014

Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.
Owner

jamescasbon commented Feb 10, 2014

Merged #143

bow commented Feb 10, 2014

@datagram, ah ~ I didn't realize this was already written as a pull request. It looks ok, and seeing that it's been merged is good I think :).

gotgenes pushed a commit to gotgenes/PyVCF that referenced this issue May 13, 2014

Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.

gotgenes pushed a commit to gotgenes/PyVCF that referenced this issue May 13, 2014

gitanoqevaporelmundoentero pushed a commit to gitanoqevaporelmundoentero/PyVCF that referenced this issue Nov 26, 2014

Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.

gitanoqevaporelmundoentero pushed a commit to gitanoqevaporelmundoentero/PyVCF that referenced this issue Nov 26, 2014

gitanoqevaporelmundoentero pushed a commit to gitanoqevaporelmundoentero/PyVCF that referenced this issue Nov 26, 2014

Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.

gitanoqevaporelmundoentero pushed a commit to gitanoqevaporelmundoentero/PyVCF that referenced this issue Nov 26, 2014

gitanoqevaporelmundoentero pushed a commit to gitanoqevaporelmundoentero/PyVCF that referenced this issue Nov 26, 2014

Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.

gitanoqevaporelmundoentero pushed a commit to gitanoqevaporelmundoentero/PyVCF that referenced this issue Nov 26, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment