Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in vcf.utils.walk_together(..) even when .vcfs are sorted correctly #140

Closed
bw2 opened this issue Feb 6, 2014 · 5 comments
Closed

Comments

@bw2
Copy link

bw2 commented Feb 6, 2014

Thanks for creating this library and continuing to improve it. I just came across the following issue. walk_together uses a purely alphabetical ordering of chromosome names and so isn't aware that chr9 < chr10 or that chrY < chrM. As a result, it almost always produces unexpected behavior which, even when discovered, isn't useful in any situations I'm aware of, so I'm posting it as a bug.

For example, if you pass it 2 vcf files that have 4 variants each:

file1.vcf:
chr9 100 ...
chr9 200 ...
chr10 300 ...
chr10 400 ...

file2.vcf:
chr9 100 ...
chr9 210 ...
chr10 300 ...
chr10 410 ...

it will emit:
(file1-record1, file2-record2) # expected
(file1-record2, None) # expected
(file1-record3, None) # expecting (None, file2-record2) instead
(file1-record4, None) # expecting (file1-record3, file2-record3) instead
(None, file2-record2)
(None, file2-record3)
(None, file2-record4)

I'm not sure about the best way to make this function aware of the user's desired contig ordering, but as a simple solution it could take a list of contigs in their expected order (and then only emit records in those contigs).

Also, I think it would be nice if walk_together could raise an exception if any of the VCFs have conflicting contig orders.

-Ben

Ps. Below is a demonstration with 2 .vcf files with 17 variants each:

test1.vcf:
image

test2.vcf:
image

results for walk_together(test1, test2):
image

@jamescasbon
Copy link
Owner

See #20

@bow
Copy link

bow commented Feb 6, 2014

Just as a side note that I hope can be useful:

We've also just figured this out in our team (@ffinfo has the credit here). We've been trying to isolate a weird behavior in one of our scripts and it came down to this.

What we found out was that the ordering follows a regular string ordering due to the way __cmp__ is implemented here.

In short:

>>> ('chr9', 200) < ('chr9', 210)
True
>>> ('chr10', 300) < ('chr9', 210)
True
>>> 'chr10' < 'chr9'
True

So one way custom ordering can be implemented is if we allow the __cmp__ function to be overwritten or for the user to supply their own sorting function.

Alternatively, we can also allow the custom function to be plugged into walk_together instead. The __cmp__ behavior affects min in here, so perhaps there is a way to allow min to be replaced by something else?

@bw2
Copy link
Author

bw2 commented Feb 6, 2014

@bow, pull request #143 is based on the same idea - it allows the user to define a vcf_record_sort_key which is used to order vcf records similar to how the 'key' arg in used by python sorted(..) method.

jamescasbon pushed a commit that referenced this issue Feb 10, 2014
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.
@jamescasbon
Copy link
Owner

Merged #143

@bow
Copy link

bow commented Feb 10, 2014

@datagram, ah ~ I didn't realize this was already written as a pull request. It looks ok, and seeing that it's been merged is good I think :).

gotgenes pushed a commit to gotgenes/PyVCF that referenced this issue May 13, 2014
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue jamescasbon#140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.
gotgenes pushed a commit to gotgenes/PyVCF that referenced this issue May 13, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants