Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Patch for Writer so it keeps non-GT data when writing records without GT data. #88

Merged
merged 2 commits into from

3 participants

@bow

Some programs (e.g. bcftools) may output VCF files whose samples do not have any
GT field value
. When dealing with files like these, PyVCF's writer
(previously) removes all non-GT data and replace it with './.' since the
_format_sample function immediately returns upon failing to find GT data.

This fix addresses the issue, so that the Writer keeps any non-GT data intact.

I've added a test case for this type of file as well, so we can test for this behavior in the future. However, I've also noted that without the code change the file still pass the test (albeit having its PL data replaced with ./.). So I tweaked _Call.__eq__ making it compare the data attributes when doing the equality testing.

My setup is Arch Linux 64-bit running Python 2.7.3 and all tests pass.

vcf/model.py
@@ -33,7 +33,8 @@ def __eq__(self, other):
"""
return (self.site == other.site
and self.sample == other.sample
- and self.gt_type == other.gt_type)
+ and self.gt_type == other.gt_type
@martijnvermaat Collaborator

The gt_type comparison can now probably be left out, since it is calculated from data.

Although I'm not sure about this change to __eq__, it might be a bit unexpected for some. In comparing _Record, we also don't check QUAL, FILTER, and INFO for example.

If we just want to have a test case for writing data, it could also be done by looking for that explicitely instead of using __eq__ on the calls.

@bow
bow added a note

Hmm..

My concern is that previously, the code might results in behaviors like this:

>>> call1 = Call(sample=-, CallData(PL=[221, 33, 0]))  # not an actual instantiation, assume this is already instantiated for now
>>> call2 = CallData(PL=None)  # likewise
>>> call1 == call2
True
>>> call1.data == call2.data
False

I assume that for two sample calls to be the same, the data values must be the same as well, so I switched this. Are there any unexpected behaviors that I should watch out for in this comparison?

@martijnvermaat Collaborator

Well, it all depends on what we want equality to mean. At this point, equality on records is equality on their variants and equality on calls is equality on their record, sample and genotype.

Literally comparing the entire datastructures would break things like utils.walk_together since it assumes equality on records is equality on their variants.

I agree the current behaviour might also give unexpected results though, as you show.

@bow
bow added a note

Ah,

I understand. I was aiming for a more strict definition of equal, which after more consideration, may be less practical than the current implementation.

Given the possible unexpected results (e.g. above), though, I think it may still be better to notify the users somehow. Using warnings may be too 'noisy' here, perhaps an extra line or two in the docs? Is there a separate repository for the docs hosted on https://pyvcf.readthedocs.org?

Anyway, I'll do a rebase to undo the __eq__ change, remove the new test file (re: the discussion on the pull request's page), and try to come up with another test to detect the pull request's bug in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@martijnvermaat
Collaborator

Looks good! For reading records without genotypes, I previously added a test file (bcftools.vcf). Do you think your test could use that, or is the new test file needed?

@bow

Ah, my bad. Indeed, bcftools.vcf should suffice for testing this behavior. I'll remove the new nogt.vcf test and use bcftools.vcf instead.

However, the tests previously 'hid' this error (due to the _Call.__eq__ comparison). Would adding new simple string equality tests be useful? I imagine we could also test writing by checking, for each record line, whether the string is the same in the input file and output file or not.

@martijnvermaat
Collaborator

That would be fine (bug make sure there is no undeterministic ordering of the fields involved), but I think you can also compare on something like data['PL'].

bow added some commits
@bow bow Fix bug that removes sample data when GT field is not present
Some programs (e.g. bcftools) may output VCF files whose samples do not have its
GT field value. When dealing with files like these, PyVCF's writer will
(previously) remove all non-GT data and replace it with './.' since the
`_format_sample` function immediately returns upon failing to find GT data.

This fix addresses the issue, so that the Writer keeps any non-GT data intact.
e3e5484
@bow bow Update writer unit tests to test call data equality
Samples written by the writer should have the exact same data before and after
they are parsed. Previously this was not tested, since call data equality
testing only checks for the sample name, genotype, and record (and not other
data fields).
3540bb7
@bow

Ok, done :).

I didn't resort to the string comparison since I found out sometimes the writer outputs read integers as floating points (e.g. 1 becomes 1.0 after writing) so I ended up adding simple loops in the test suite to test the data over.

Anyway, should be fixed now. Thanks for the inputs!

@jamescasbon
Owner

Thanks, I'll merge. It's likely this will have to change further if we want to support adding data to records as per #82

@jamescasbon jamescasbon merged commit dfc938b into jamescasbon:master
@bow

Thanks, James. I'll keep an eye for updates to #82 in that case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jan 30, 2013
  1. @bow

    Fix bug that removes sample data when GT field is not present

    bow authored
    Some programs (e.g. bcftools) may output VCF files whose samples do not have its
    GT field value. When dealing with files like these, PyVCF's writer will
    (previously) remove all non-GT data and replace it with './.' since the
    `_format_sample` function immediately returns upon failing to find GT data.
    
    This fix addresses the issue, so that the Writer keeps any non-GT data intact.
  2. @bow

    Update writer unit tests to test call data equality

    bow authored
    Samples written by the writer should have the exact same data before and after
    they are parsed. Previously this was not tested, since call data equality
    testing only checks for the sample name, genotype, and record (and not other
    data fields).
This page is out of date. Refresh to see the latest.
Showing with 34 additions and 3 deletions.
  1. +24 −3 vcf/parser.py
  2. +10 −0 vcf/test/test_vcf.py
View
27 vcf/parser.py
@@ -620,9 +620,30 @@ def _format_info(self, info):
return ';'.join([self._stringify_pair(x,y) for x, y in info.iteritems()])
def _format_sample(self, fmt, sample):
- if getattr(sample.data, 'GT', None) is None:
- return "./."
- return ':'.join([self._stringify(x) for x in sample.data])
+ try:
+ # Try to get the GT value first.
+ gt = getattr(sample.data, 'GT')
+ # PyVCF stores './.' GT values as None, so we need to revert it back
+ # to './.' when writing.
+ if gt is None:
+ gt = './.'
+ except AttributeError:
+ # Failing that, try to check whether 'GT' is specified in the FORMAT
+ # field. If yes, use the recommended empty value ('./.')
+ if 'GT' in fmt:
+ gt = './.'
+ # Otherwise use an empty string as the value
+ else:
+ gt = ''
+ # If gt is an empty string (i.e. not stored), write all other data
+ if not gt:
+ return ':'.join([self._stringify(x) for x in sample.data])
+ # Otherwise use the GT values from above and combine it with the rest of
+ # the data.
+ # Note that this follows the VCF spec, where GT is always the first
+ # item whenever it is present.
+ else:
+ return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])
def _stringify(self, x, none='.', delim=','):
if type(x) == type([]):
View
10 vcf/test/test_vcf.py
@@ -232,6 +232,11 @@ def testWrite(self):
for l, r in zip(records, reader2):
self.assertEquals(l.samples, r.samples)
+ # test for call data equality, since equality on the sample calls
+ # may not always mean their data are all equal
+ for l_call, r_call in zip(l.samples, r.samples):
+ self.assertEqual(l_call.data, r_call.data)
+
class TestBcfToolsOutputWriter(unittest.TestCase):
@@ -256,6 +261,11 @@ def testWrite(self):
for l, r in zip(records, reader2):
self.assertEquals(l.samples, r.samples)
+ # test for call data equality, since equality on the sample calls
+ # may not always mean their data are all equal
+ for l_call, r_call in zip(l.samples, r.samples):
+ self.assertEqual(l_call.data, r_call.data)
+
class TestWriterDictionaryMeta(unittest.TestCase):
Something went wrong with that request. Please try again.