Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for wiggle coverage tracks. #73

Merged
merged 4 commits into from Feb 2, 2023
Merged

Add support for wiggle coverage tracks. #73

merged 4 commits into from Feb 2, 2023

Conversation

sadams2013
Copy link
Contributor

This adds support for wiggle coverage tracks (WIG files) by adding:

  1. Entries for the '.wig' file extension.
  2. A new reader class (WigReader) that parses and slices wig files.
  3. Unit tests for the WigReader class.
  4. Basic test data with two regions extracted from NIST data for NA12878 (genome in a bottle).

@jrobinso
Copy link
Contributor

Thanks for the PR! A couple of quick comments. This is from code inspection, I didn't try anything, so I could be missing something.

  • Wig files are used for all sorts of data, not just coverage, so I suggest renaming "coverage.py" to "wig.py", or something generic.

  • igv.js supports a "track" line

  • The wig parser doesn't look complete. See http://rohsdb.cmb.usc.edu/goldenPath/help/wiggle.html

    • It doesn't appear that "fixed step" is implemented. This option has a start and step size, in addition to an optional "span". It looks like a fixed step wig will be interpreted as a variable step

    • the end position is not accounted for in "slice". in particular span is not accounted for. This will affect wig features whose start position is < region start but which overlap due to "span".

I don't think you can avoid parsing the wig lines to account for variable & fixed step, and the optional "span" variable. You might find this helpful (skip to line 397): https://github.com/igvteam/igv.js/blob/master/js/feature/decode/ucsc.js.

I think extending "FeatureReader" to handle wig files might be a better approach. To do this you would add a "parse" method in feature.py similar to the igv.js javascript referenced above. See the parse_bed method for an example The parse_wigmethod would return a Feature with chr, start, end, and line => line being the literal line of text parsed. One advantage to this, other than fitting in with the existing framework, is an interval tree is built for the features to support fast slicing. In the implementation here a linear search is done through the features for each chromosome for every region. With a large wig file, and many regions, this could get slow.

@sadams2013
Copy link
Contributor Author

Thank you for reviewing!

I agree - the implementation is lacking for several use cases. My approach was a bit myopic in looking to support my own (narrow) use case.

I will work through your suggestions and recommit.

…rser, and handle wig parsing for more diverse use cases
@sadams2013
Copy link
Contributor Author

Notes for recent commit:

  1. Renamed coverage to wig throughout
  2. I agree that implementing a wig parser in the FeatureReader class would be more succinct. However, I found that handling span headers would require modification (i.e. for a wig file with multiple spans). Therefore, I retained and modified the WigReader class with the following:
    • The parser addresses the full wig spec
    • For each fixed or variable span, creates a FeatureTree with positions (properly accounting for start/end)
    • Slices query all spans for matching regions with the FeatureTree query method
  3. I added a new test case that accounts for wig files with fixed and variable span headers.
  4. I had also missed the stream function that allows for reading from remote sources. That has been implemented within the wig parser.

The initial parse will still be slow for very large files - I'm not sure of a way around that; but this should be reasonable for slices. Perhaps a better implementation would be tabix indexed bedgraph files.

My use case is pretty narrow, for which this is perfect. No hard feelings if you'd prefer not to merge.

@jrobinso
Copy link
Contributor

OK looks good overall. I'll try to get to it this week, if I don't ping this conversation, it will just mean something else distracted me.

Bedgraph is a good idea, that could be just simple since really its just a bed3+1 file. Also I should probably add bigwig support. But this is a good start.

@jrobinso
Copy link
Contributor

@sadams2013 Apologies I got slammed by unexpected issues, although I don't know why I don't expect them, but anyway I didn't get to this. On the list for next week, please ping me if it's not merged by Friday.

@jrobinso
Copy link
Contributor

jrobinso commented Feb 2, 2023

This looks good. Merging.

@jrobinso jrobinso merged commit 3ad8406 into igvteam:master Feb 2, 2023
jrobinso added a commit that referenced this pull request Feb 2, 2023
@jrobinso
Copy link
Contributor

jrobinso commented Feb 2, 2023

I found 1 bug, there was a logic error that caused the entire wig file to be read on every call to "slice". Fixed now, see commit #99223383 We can never have a "region2" without "region"

@sadams2013
Copy link
Contributor Author

Great catch - thank you for fixing and merging!

@jrobinso
Copy link
Contributor

jrobinso commented Feb 2, 2023

Thank you for the PR. I learned some new Python. You might have noticed its not my first language, I have to google every 3rd line but am getting better. I did not know about the possibility of multiple return values.

@jrobinso
Copy link
Contributor

jrobinso commented Feb 6, 2023

This is now released to pypi as 1.7.0

@sadams2013
Copy link
Contributor Author

sadams2013 commented Feb 6, 2023

I found the code base to be very well written and easy to understand.

Thank you for merging and releasing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants