Fix for issue #140, add vcf_record_sort_key arg #143

Closed
wants to merge 7 commits into
from

Conversation

Projects
None yet
5 participants
@bw2

bw2 commented Feb 6, 2014

  • Added 'vcf_record_sort_key' which allows users to specify any chromosome ordering.
  • Fixed issue #140 by emitting all records from the current chromosome before moving on to the next one. This takes care of issue #140 in the typical case (when all files have records for all contigs), but still fails in some edge cases (when vcf files have differing sets of contigs and where the alphabetical order of these contigs is not the desired order). In those edge cases, the user can set the 'vcf_record_sort_key' arg to explicitly defining the chromosome order - in which case the issue would be completely solved.
Fix for issue #140, add vcf_record_sort_key arg
- Added 'vcf_record_sort_key' to allow user to specify arbitrary chromosome ordering.
- Fixed issue #140 by making sure to emit all records from the current chromosome before moving on to the next one. This takes care of the problem in most typical cases (eg. when all files have records for all contigs), but not in some edge cases, in which case the 'vcf_record_sort_key' arg can be used to fully solve the problem by explicitly defining the chromosome order.
vcf/utils.py
-import operator
+def walk_together(*readers, **kwargs):
+ """ Simultaneously iteratate two or more VCF readers and return
+ lists of concurrent records from each

This comment has been minimized.

Show comment Hide comment
@jamescasbon

jamescasbon Feb 6, 2014

Owner

This docstring is over indented.

@jamescasbon

jamescasbon Feb 6, 2014

Owner

This docstring is over indented.

vcf/utils.py
+ if 'vcf_record_sort_key' in kwargs:
+ get_key = kwargs['vcf_record_sort_key']
+ else:
+ get_key = lambda r: (r.CHROM, r.POS)

This comment has been minimized.

Show comment Hide comment
@jamescasbon

jamescasbon Feb 6, 2014

Owner

The default should include REF and ALT

@jamescasbon

jamescasbon Feb 6, 2014

Owner

The default should include REF and ALT

@jamescasbon

This comment has been minimized.

Show comment Hide comment
@jamescasbon

jamescasbon Feb 6, 2014

Owner

Thanks, can you check your whitespace? Seems to be 8 when it should be 4 spaces.

Also, can you add some tests for this?

Owner

jamescasbon commented Feb 6, 2014

Thanks, can you check your whitespace? Seems to be 8 when it should be 4 spaces.

Also, can you add some tests for this?

Ben Weisburd
Fixed spacing and wrapping in utils.py, removed test for old walk_tog…
…ether arg (eq function), fixed edge case in _AltRecord
vcf/utils.py
else:
- eq_func = operator.eq
+ get_key = lambda r: (r.CHROM, r.POS) #, r.REF, r.ALT)

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 6, 2014

I would argue for the default key being (r.CHROM, r.POS) because

  • adding REF and ALT to the key breaks the "nrecs == 5" assert in test_walk(..) - (it makes nrecs == 8)
  • GATK LocusWalkers use CHROM/POS only. Also, I think its more intuitive for all the records at a given CHROM/POS to be emitted together at which point the user can decide if they care about possible multiple alleles at a site. Generally, it seems multi-allelic sites are still not typically distinguished from bi-allelic - for example, VarScan2 and GATK UnifiedGenotyper only emit one ALT allele per site by default.
  • The definitions of cmp and eq in model.Record appear to be contradictory on this issue ( cmp uses only CHROM and POS, while eq also adds ALT and REF).
@bw2

bw2 Feb 6, 2014

I would argue for the default key being (r.CHROM, r.POS) because

  • adding REF and ALT to the key breaks the "nrecs == 5" assert in test_walk(..) - (it makes nrecs == 8)
  • GATK LocusWalkers use CHROM/POS only. Also, I think its more intuitive for all the records at a given CHROM/POS to be emitted together at which point the user can decide if they care about possible multiple alleles at a site. Generally, it seems multi-allelic sites are still not typically distinguished from bi-allelic - for example, VarScan2 and GATK UnifiedGenotyper only emit one ALT allele per site by default.
  • The definitions of cmp and eq in model.Record appear to be contradictory on this issue ( cmp uses only CHROM and POS, while eq also adds ALT and REF).
@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 7, 2014

I'm done with adding tests and fixing white-space. All tests are passing for all tox python versions. Let me know if you see any other problems.
thanks
-Ben

bw2 commented Feb 7, 2014

I'm done with adding tests and fixing white-space. All tests are passing for all tox python versions. Let me know if you see any other problems.
thanks
-Ben

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 10, 2014

Ps. If you're ok with the above patch, I'd like to submit a followup one which adds a simple check to walk_together to confirm that chromosomes are not ordered in contradictory ways among different readers (eg. chr1, chr10, chr2.. in one reader and chr1, chr2, chr3.. in another).

bw2 commented Feb 10, 2014

Ps. If you're ok with the above patch, I'd like to submit a followup one which adds a simple check to walk_together to confirm that chromosomes are not ordered in contradictory ways among different readers (eg. chr1, chr10, chr2.. in one reader and chr1, chr2, chr3.. in another).

@jamescasbon

This comment has been minimized.

Show comment Hide comment
@jamescasbon

jamescasbon Feb 10, 2014

Owner

Merged, thanks!

Would like a patch as you describe. Also, should check for contig ordering if specified (see #20)

Owner

jamescasbon commented Feb 10, 2014

Merged, thanks!

Would like a patch as you describe. Also, should check for contig ordering if specified (see #20)

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 10, 2014

Thanks for merging.

Regarding #20, I wanted to look at the source code, but couldn't find where the "sort" functionality is.. if I understand correctly, the suggestion is to get chromosome order based on ##contig fields in the header?
If so, I would argue for always using the actual order of the records rather than what's in the header. Just last week I ran into an issue where the GATK LeftAlign walker added ##contig headers in an order that was different from the actual ordering of the records (actual order was chr1, chr2, chr3 and the headers were added as chr1, chr10, chr2..). Since walk_together reads all the records anyway, I would argue for ignoring the ##contig meta data.

bw2 commented Feb 10, 2014

Thanks for merging.

Regarding #20, I wanted to look at the source code, but couldn't find where the "sort" functionality is.. if I understand correctly, the suggestion is to get chromosome order based on ##contig fields in the header?
If so, I would argue for always using the actual order of the records rather than what's in the header. Just last week I ran into an issue where the GATK LeftAlign walker added ##contig headers in an order that was different from the actual ordering of the records (actual order was chr1, chr2, chr3 and the headers were added as chr1, chr10, chr2..). Since walk_together reads all the records anyway, I would argue for ignoring the ##contig meta data.

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 11, 2014

I've pushed the additional feature to the same branch.

bw2 commented Feb 11, 2014

I've pushed the additional feature to the same branch.

@martijnvermaat

This comment has been minimized.

Show comment Hide comment
@martijnvermaat

martijnvermaat Feb 11, 2014

Collaborator

Note that VCF files can also be indexed using tabix giving zero cost jumping to any position. The fetch method makes use of this.

Of course we should never require files to be indexed, but we might want to consider how we can use an available index to alleviate sorting issues. This might well be beyond the scope of this pull request though.

Collaborator

martijnvermaat commented Feb 11, 2014

Note that VCF files can also be indexed using tabix giving zero cost jumping to any position. The fetch method makes use of this.

Of course we should never require files to be indexed, but we might want to consider how we can use an available index to alleviate sorting issues. This might well be beyond the scope of this pull request though.

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 11, 2014

Hi martijnvermaat, since only sorted files can be indexed, are you suggesting getting the sort order from the index? I agree this could be a useful optimization, but not currently useful for my situation (not dealing with huge files).

bw2 commented Feb 11, 2014

Hi martijnvermaat, since only sorted files can be indexed, are you suggesting getting the sort order from the index? I agree this could be a useful optimization, but not currently useful for my situation (not dealing with huge files).

@martijnvermaat

This comment has been minimized.

Show comment Hide comment
@martijnvermaat

martijnvermaat Feb 11, 2014

Collaborator

@datagram Indeed, my suggestion was only regarding the order of the chromosomes.

Collaborator

martijnvermaat commented Feb 11, 2014

@datagram Indeed, my suggestion was only regarding the order of the chromosomes.

@bw2 bw2 closed this Feb 12, 2014

@bw2 bw2 reopened this Feb 12, 2014

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 12, 2014

Accidentally clicked "Close" instead of "Comment".

After thinking about your suggestion again, it could be nice to have a GATK-style sharding mechanism where the indexes are used to divide up pieces of the VCF file(s) among multiple threads..

bw2 commented Feb 12, 2014

Accidentally clicked "Close" instead of "Comment".

After thinking about your suggestion again, it could be nice to have a GATK-style sharding mechanism where the indexes are used to divide up pieces of the VCF file(s) among multiple threads..

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Mar 7, 2014

Hi James, I've checked in the 2nd patch I've discussed above which lets walk_together confirm that chromosomes are not ordered in contradictory ways among different readers (eg. chr1, chr10, chr2.. in one reader and chr1, chr2, chr3.. in another). The code is checked in to this same branch. Would you be willing to do another merge?

thanks
-Ben

bw2 commented Mar 7, 2014

Hi James, I've checked in the 2nd patch I've discussed above which lets walk_together confirm that chromosomes are not ordered in contradictory ways among different readers (eg. chr1, chr10, chr2.. in one reader and chr1, chr2, chr3.. in another). The code is checked in to this same branch. Would you be willing to do another merge?

thanks
-Ben

@bw2

This comment has been minimized.

Show comment Hide comment
@bw2

bw2 Feb 16, 2015

Hi James, I wanted to sort VCF files and noticed this hasn't been merged. Would you be willing to merge these changes?
Thanks
-Ben

bw2 commented Feb 16, 2015

Hi James, I wanted to sort VCF files and noticed this hasn't been merged. Would you be willing to merge these changes?
Thanks
-Ben

@martijnvermaat

This comment has been minimized.

Show comment Hide comment
@martijnvermaat

martijnvermaat Feb 16, 2015

Collaborator

Hi @bw2, I'd have a look but it's not clear to me exactly which part of this PR is unmerged. Probably just this commit?

It doesn't apply cleanly on the current master branch (yes, it has been a while...). If you can rebase it, I can have a look.

Collaborator

martijnvermaat commented Feb 16, 2015

Hi @bw2, I'd have a look but it's not clear to me exactly which part of this PR is unmerged. Probably just this commit?

It doesn't apply cleanly on the current master branch (yes, it has been a while...). If you can rebase it, I can have a look.

@sambrightman

This comment has been minimized.

Show comment Hide comment
@sambrightman

sambrightman Jan 28, 2017

@bw2 is this going to be cleaned up and rebased? Perhaps better to split off the new commits into a fresh PR and close this one?

@bw2 is this going to be cleaned up and rebased? Perhaps better to split off the new commits into a fresh PR and close this one?

@martijnvermaat

This comment has been minimized.

Show comment Hide comment
@martijnvermaat

martijnvermaat Jan 29, 2017

Collaborator

Yes, at this point it would be better just to submit a new PR. We'd be happy to have it.

Collaborator

martijnvermaat commented Jan 29, 2017

Yes, at this point it would be better just to submit a new PR. We'd be happy to have it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment