Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use zero-based, half-open coordinates for Reader.fetch #156

Closed

Conversation

gotgenes
Copy link

This pull request currently sits atop PR #155. It includes one commit beyond #155, 221d7a4, that provides the changes for this pull request. Please accept #155 prior to accepting this one.

I have filed this pull request because Reader.fetch currently uses one-based indexing, which is inconsistent with pysam's Tabixfile.fetch (on which Reader.fetch relies). This caused a critical issue in one of my projects where variants were getting included that overlapped what I thought was my zero-based start index. Needless to say, I was less than pleased at the implicit and inconsistent behavior of Reader.fetch; this PR corrects this.

I quote the contents of the commit message containing the changes below

Reader.fetch uses zero-based, half-open coordinates.

These changes make the behavior of Reader.fetch consistent with with pysam.Tabixfile, which uses the zero-based, half-open coordinate system for Tabixfile.fetch. See http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.Tabixfile.fe

Previously, PyVCF's Reader.fetch declared no particular coordinate system. Since the method quietly deducted 1 from the start position, apparently it assumed users were going to input a one-based coordinate there. However, users familiar with pysam's Tabixfile for other formats get an unexpected surprise when variants ahead of the start coordinate start getting returned by Reader.fetch.

As _Record.start and _Record.end are in the ZBHO coordinate system, it adds to the consistency that fetch take start and end coordinates in ZBHO, so the same _Record instance could be retrieved using its .CHROM, .start, and .end coordinates.

This change also removes the prior behavior of fetch of returning a single _Record instance if given only chrom and start coordinates, by implicitly doing a Tabixfile.fetch(chrom, start-1, start). The new behavior when omitting the end parameter is to return an iterator of _Record instances starting at start and continuing through the end of the chromosome chrom. Again, this is the behavior consistent with pysam.Tabixfile.fetch, and is what users ought to expect.

This change also allows the user to omit both the start and end positions. In this case, an iterable of _Record instances for all records for the particular chromosome chrom will be returned, which again, is consistent with Tabixfile.fetch. This behavior also resolves Issue #123 "Cannot fetch() whole chromosome".

Decorates tests that are potentially skipped, as well as broken tests
that are always skipped, as being skipped, rather than indicating
falsely that these tests have passed (the result of premature return
statements prior to any assertions in the tests).

This introduces another dependency for Python 2.6, the unittest2 module,
which back-ported this functionality from Python 2.7 and Python 3.
These changes make the behavior of Reader.fetch consistent with with
pysam.Tabixfile, which uses the zero-based, half-open coordinate system
for Tabixfile.fetch. See
http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.Tabixfile.fetch

Previously, PyVCF's Reader.fetch declared no particular coordinate
system. Since the method quietly deducted 1 from the start position,
apparently it assumed users were going to input a one-based coordinate
there. However, users familiar with pysam's Tabixfile for other formats
get an unexpected surprise when variants ahead of the start coordinate
start getting returned by Reader.fetch.

As _Record.start and _Record.end are in the ZBHO coordinate system, it
adds to the consistency that fetch take start and end coordinates in
ZBHO, so the same _Record instance could be retrieved using its .CHROM,
.start, and .end coordinates.

This change also removes the prior behavior of fetch of returning a
single _Record instance if given only chrom and start coordinates, by
implicitly doing a Tabixfile.fetch(chrom, start-1, start). The new
behavior when omitting the end parameter is to return an iterator of
_Record instances starting at start and continuing through the end of
the chromosome chrom. Again, this is the behavior consistent with
pysam.Tabixfile.fetch, and is what users ought to expect.

This change also allows the user to omit both the start and end
positions. In this case, an iterable of _Record instances for all
records for the particular chromosome chrom will be returned, which
again, is consistent with Tabixfile.fetch. This behavior also resolves
Issue jamescasbon#123 "Cannot fetch() whole chromosome".
martijnvermaat added a commit that referenced this pull request May 14, 2014
Use zero-based, half-open coordinates for Reader.fetch
@martijnvermaat
Copy link
Collaborator

Agreed. Many thanks, especially for the detailed commit message and docstring!

@martijnvermaat
Copy link
Collaborator

(Merged after rebasing.)

@gotgenes gotgenes deleted the feature-zbho_tabix_coordinates branch May 14, 2014 15:27
gotgenes added a commit to gotgenes/PyVCF that referenced this pull request May 14, 2014
This corrects several lines that relate to the changes to fetch brought
in by Pull Request jamescasbon#156.
@gotgenes
Copy link
Author

Thanks for accepting this PR, @martijnvermaat! I have submitted PR #157 because I missed the documentation tucked away inside vcf/__init__.py in this PR. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants