Cannot fetch() whole chromosome #123

Closed
nkoelling opened this Issue Oct 6, 2013 · 2 comments

Projects

None yet

3 participants

In tabix, one can grab all variants for an entire chromosome using tabix.fetch(chr, None, None). In PyVCF, this call fails because of the line "start = start - 1" in the fetch() wrapper, which assumes that start is an integer.

I worked around this by setting the .reader property to the correct call to _tabix.fetch() manually, but it would obviously be better to make fetch() work with start=None in the first place.

I guess a very simple solution would be this, although I actually suspect the line should not be required in the first place:

    if not start is None:
        start = start - 1
Owner

Yes, I've always felt that was wrong. But so is one based in indexing. Any
chance of a PR?
On 6 Oct 2013 17:45, "nkoelling" notifications@github.com wrote:

In tabix, one can grab all variants for an entire chromosome using
tabix.fetch(chr, None, None). In PyVCF, this call fails because of the line
"start = start - 1" in the fetch() wrapper, which assumes that start is an
integer.

I worked around this by setting the .reader property to the correct call
to _tabix.fetch() manually, but it would obviously be better to make
fetch() work with start=None in the first place.

I guess a very simple solution would be this, although I actually suspect
the line should not be required in the first place:

if not start is None:
    start = start - 1


Reply to this email directly or view it on GitHubhttps://github.com/jamescasbon/PyVCF/issues/123
.

@gotgenes gotgenes added a commit to gotgenes/PyVCF that referenced this issue May 14, 2014
@gotgenes gotgenes Reader.fetch uses zero-based, half-open coordinates.
This change is to bring consistency with pysam, which uses the
zero-based, half-open coordinate system for Tabixfile.fetch. See
http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.Tabixfile.fetch

Previously, PyVCF's Reader.fetch declared no particular coordinate
system. Since the method quietly deducted 1 from the start position,
apparently it assumed users were going to input a one-based coordinate
there. However, users familiar with pysam's Tabixfile for other formats
get an unexpected surprise when variants ahead of the start coordinate
start getting returned by Reader.fetch.

As _Call.start and _Call.end are in the ZBHO coordinate system, it adds
to the consistency that fetch take start and end coordinates in ZBHO, so
the same _Call instance could be retrieved using its .CHROM, .start, and
.end coordinates.

This change also removes the prior behavior of fetch of returning a
single _Call instance if given only chrom and start coordinates, by
implicitly doing a Tabixfile.fetch(chrom, start-1, start). The new
behavior when omitting the end parameter is to return an iterator of
_Call instances starting at start and continuing through the end of the
chromosome chrom. Again, this is the behavior consistent with
pysam.Tabixfile.fetch, and is what users ought to expect.

This change also allows the user to omit both the start and end
positions. In this case, an iterable of _Call instances for all records
for the particular chromosome chrom will be returned, which again, is
consistent with Tabixfile.fetch. This behavior also resolves Issue #123
"Cannot fetch() whole chromosome".
da24dd7
@gotgenes gotgenes added a commit to gotgenes/PyVCF that referenced this issue May 14, 2014
@gotgenes gotgenes Reader.fetch uses zero-based, half-open coordinates.
These changes make the behavior of Reader.fetch consistent with with
pysam.Tabixfile, which uses the zero-based, half-open coordinate system
for Tabixfile.fetch. See
http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.Tabixfile.fetch

Previously, PyVCF's Reader.fetch declared no particular coordinate
system. Since the method quietly deducted 1 from the start position,
apparently it assumed users were going to input a one-based coordinate
there. However, users familiar with pysam's Tabixfile for other formats
get an unexpected surprise when variants ahead of the start coordinate
start getting returned by Reader.fetch.

As _Call.start and _Call.end are in the ZBHO coordinate system, it adds
to the consistency that fetch take start and end coordinates in ZBHO, so
the same _Call instance could be retrieved using its .CHROM, .start, and
.end coordinates.

This change also removes the prior behavior of fetch of returning a
single _Call instance if given only chrom and start coordinates, by
implicitly doing a Tabixfile.fetch(chrom, start-1, start). The new
behavior when omitting the end parameter is to return an iterator of
_Call instances starting at start and continuing through the end of the
chromosome chrom. Again, this is the behavior consistent with
pysam.Tabixfile.fetch, and is what users ought to expect.

This change also allows the user to omit both the start and end
positions. In this case, an iterable of _Call instances for all records
for the particular chromosome chrom will be returned, which again, is
consistent with Tabixfile.fetch. This behavior also resolves Issue #123
"Cannot fetch() whole chromosome".
98e8d42
@gotgenes gotgenes added a commit to gotgenes/PyVCF that referenced this issue May 14, 2014
@gotgenes gotgenes Reader.fetch uses zero-based, half-open coordinates.
These changes make the behavior of Reader.fetch consistent with with
pysam.Tabixfile, which uses the zero-based, half-open coordinate system
for Tabixfile.fetch. See
http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.Tabixfile.fetch

Previously, PyVCF's Reader.fetch declared no particular coordinate
system. Since the method quietly deducted 1 from the start position,
apparently it assumed users were going to input a one-based coordinate
there. However, users familiar with pysam's Tabixfile for other formats
get an unexpected surprise when variants ahead of the start coordinate
start getting returned by Reader.fetch.

As _Record.start and _Record.end are in the ZBHO coordinate system, it
adds to the consistency that fetch take start and end coordinates in
ZBHO, so the same _Record instance could be retrieved using its .CHROM,
.start, and .end coordinates.

This change also removes the prior behavior of fetch of returning a
single _Record instance if given only chrom and start coordinates, by
implicitly doing a Tabixfile.fetch(chrom, start-1, start). The new
behavior when omitting the end parameter is to return an iterator of
_Record instances starting at start and continuing through the end of
the chromosome chrom. Again, this is the behavior consistent with
pysam.Tabixfile.fetch, and is what users ought to expect.

This change also allows the user to omit both the start and end
positions. In this case, an iterable of _Record instances for all
records for the particular chromosome chrom will be returned, which
again, is consistent with Tabixfile.fetch. This behavior also resolves
Issue #123 "Cannot fetch() whole chromosome".
221d7a4
@martijnvermaat martijnvermaat added a commit that referenced this issue May 14, 2014
@gotgenes @martijnvermaat gotgenes + martijnvermaat Reader.fetch uses zero-based, half-open coordinates.
These changes make the behavior of Reader.fetch consistent with with
pysam.Tabixfile, which uses the zero-based, half-open coordinate system
for Tabixfile.fetch. See
http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.Tabixfile.fetch

Previously, PyVCF's Reader.fetch declared no particular coordinate
system. Since the method quietly deducted 1 from the start position,
apparently it assumed users were going to input a one-based coordinate
there. However, users familiar with pysam's Tabixfile for other formats
get an unexpected surprise when variants ahead of the start coordinate
start getting returned by Reader.fetch.

As _Record.start and _Record.end are in the ZBHO coordinate system, it
adds to the consistency that fetch take start and end coordinates in
ZBHO, so the same _Record instance could be retrieved using its .CHROM,
.start, and .end coordinates.

This change also removes the prior behavior of fetch of returning a
single _Record instance if given only chrom and start coordinates, by
implicitly doing a Tabixfile.fetch(chrom, start-1, start). The new
behavior when omitting the end parameter is to return an iterator of
_Record instances starting at start and continuing through the end of
the chromosome chrom. Again, this is the behavior consistent with
pysam.Tabixfile.fetch, and is what users ought to expect.

This change also allows the user to omit both the start and end
positions. In this case, an iterable of _Record instances for all
records for the particular chromosome chrom will be returned, which
again, is consistent with Tabixfile.fetch. This behavior also resolves
Issue #123 "Cannot fetch() whole chromosome".
4fba62c
Collaborator

Should be fixed with #156.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment