Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Add custom equality function as walk_together argument #132

Merged
merged 1 commit into from

3 participants

@bow
bow commented

The walk_together function is very useful for iterating over multiple VCF files. However, it breaks in some cases (e.g. comparing multiple single - sample VCF files with one containing reference calls) due to its definition of VCF record equality (which sometimes may be too strict).

This pull request adds the ability to use user-defined equality functions (e.g. if we want to compare only on Record positions and reference bases), while still leaving the original behavior intact. I should mention that this comes at a cost of a rather clunky usage of **kwargs in walk_together, but I think this is an acceptable trade off.

New test cases against this change have also been added and feedbacks are very welcomed :).

@jamescasbon
Owner

This looks good to me, @martijnvermaat any thoughts?

@martijnvermaat
Collaborator

I think this is good to go, thanks for the PR @bow!

@martijnvermaat martijnvermaat merged commit cede181 into from
@bow
bow commented

Thank you @jamescasbon & @martijnvermaat for merging :).

Slightly related to this, I've been thinking about the behavior of _Record.__eq__ (https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py#L151), actually. Perhaps it's a good idea to have it always return False if any one of the operand is None? That way we can avoid the TypeError that's always raised whenever you compare to None. I could open a new issue / PR for this if that's preferred (or use this one along with the issue below).

Also, a note on the failing build. I've overlooked the fact that Python2.6 doesn't have self.assertRaisesRegexp which I use in one of the tests. I can remove that bit and use just self.assertRaises instead.

@martijnvermaat
Collaborator

Apparently some of the unit tests were never run on Travis (including this one), which is why I didn't notice the fialure on Python 2.6. I fixed this in 322c212 and b8c0af7 (although some of the unittest mechanics are really beyond me I have to admit).

This was also a nice opportunity for me to try the Tox tool. It works pretty well.

Could you open a new issue or pull request for changing _Record.__eq__?

@bow
bow commented

Great, thanks for the test fix :). I'll open a new PR for the __eq__ change soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Dec 3, 2013
  1. @bow
This page is out of date. Refresh to see the latest.
Showing with 72 additions and 6 deletions.
  1. +36 −0 vcf/test/test_vcf.py
  2. +22 −0 vcf/test/walk_refcall.vcf
  3. +14 −6 vcf/utils.py
View
36 vcf/test/test_vcf.py
@@ -950,6 +950,42 @@ def test_walk(self):
assert recs[0] is not None
assert recs[1] is not None
+ # case with working custom equality function
+
+ # without custom function, exception should be raised
+
+ reader1 = vcf.Reader(fh('example-4.0.vcf'))
+ reader2 = vcf.Reader(fh('walk_refcall.vcf'))
+ self.assertRaisesRegexp(AttributeError, "'NoneType' object has no "
+ "attribute 'type'", next, utils.walk_together(reader1, reader2))
+
+ # with custom function, iteration works
+
+ reader1 = vcf.Reader(fh('example-4.0.vcf'))
+ reader2 = vcf.Reader(fh('walk_refcall.vcf'))
+
+ def custom_eq(rec1, rec2):
+ # check for equality only on CHROM, POS, and REF
+ if rec1 is None or rec2 is None:
+ return False
+ return rec1.CHROM == rec2.CHROM and rec1.POS == rec2.POS and \
+ rec1.REF == rec2.REF
+
+ nrecs, ncomps = 0, 0
+ for x in utils.walk_together(reader1, reader2, eq_func=custom_eq):
+ assert len(x) == 2
+ # avoid assert() when one record is None
+ if x[0] is not None and x[1] is not None:
+ assert (custom_eq(x[0], x[1]) and custom_eq(x[1], x[0]))
+ ncomps += 1
+ # still increment counter to ensure iteration is finished for all
+ # records
+ nrecs += 1
+ # check number of records total
+ assert nrecs == 5
+ # check how many records found in all files
+ assert ncomps == 4
+
def test_trim(self):
tests = [('TAA GAA', 'T G'),
('TA TA', 'T T'),
View
22 vcf/test/walk_refcall.vcf
@@ -0,0 +1,22 @@
+##fileformat=VCFv4.0
+##fileDate=20090805
+##source=myImputationProgramV3.1
+##reference=1000GenomesPilot-NCBI36
+##phasing=partial
+##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
+##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
+##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
+##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
+##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
+##FILTER=<ID=q10,Description="Quality below 10">
+##FILTER=<ID=s50,Description="Less than 50% of samples have data">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
+##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
+20 14370 rs6054257 G . 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 0|0:48:8:51,51 0/0:43:5:.,.
+20 17330 . T . 3.0 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|0:3:5:65,3 0/0:41:3
+20 1110696 rs6040355 A . 1e+03 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 0|0:21:6:23,27 0|0:2:0:18,2 0/0:35:4
+20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
View
20 vcf/utils.py
@@ -2,14 +2,25 @@
Utilities for VCF files.
"""
+import operator
-def walk_together(*readers):
+
+def walk_together(*readers, **kwargs):
""" Simultaneously iteratate two or more VCF readers and return
lists of concurrent records from each
reader, with None if no record present. Caller must check the
inputs are sorted in the same way and use the same reference
otherwise behaviour is undefined.
"""
+ # if defined, custom equality functions must take the same arguments
+ # as operator.eq
+ if 'eq_func' in kwargs:
+ eq_func = kwargs['eq_func']
+ # by default, we use the equality operator (==), which compares
+ # equality in CHROM, POS, REF, and ALT
+ else:
+ eq_func = operator.eq
+
# if one of the VCFs has no records, StopIteration is
# raised immediately, so we need to check for that and
# deal appropriately
@@ -23,15 +34,12 @@ def walk_together(*readers):
while True:
min_next = min([x for x in nexts if x is not None])
- # this line uses equality on Records, which checks the ALTs
- # not sure what to do with records that have overlapping but different
- # variation
- yield [x if x is None or x == min_next else None for x in nexts]
+ yield [x if x is None or eq_func(x, min_next) else None for x in nexts]
# update nexts that we just yielded
for i, n in enumerate(nexts):
- if n is not None and n == min_next:
+ if n is not None and eq_func(n, min_next):
try:
nexts[i] = readers[i].next()
except StopIteration:
Something went wrong with that request. Please try again.