Add API to get contig from region of named chromosome #1348
Codecov Report
@@           Coverage Diff           @@
##             main    #1348   +/-   ##
=======================================
  Coverage   99.67%   99.68%
=======================================
  Files         113      113
  Lines        3699     3751   +52
  Branches      475      487   +12
=======================================
+ Hits         3687     3739   +52
  Misses          8        8
  Partials        4        4
Ah, there's a slight issue. If the recombination map is longer than the chromosome, the current behavior of stdpopsim is to use the recombination map length for simulations instead of the chromosome length. The behavior in this PR, however, is to clip the recombination map to the size of the contig. I can modify the PR to keep the current behavior (when the contig is the entire chromosome and the recombination map is longer), but this seems like an odd edge case to document, especially if the plan is to deprecate genetic maps that don't match the assembly. So I've kept the new "clip-to-contig" behavior. There's no effect on any of the tests, aside from the occasional stochastic failure of
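The two behaviors can be contrasted with a toy piecewise-constant map. This is only a sketch: `clip_rate_map` is a hypothetical pure-Python helper, not stdpopsim or msprime code, and the positions and rates are made up. It shows what "clip-to-contig" means when the map runs past the end of the chromosome.

```python
def clip_rate_map(positions, rates, left, right):
    """Restrict a piecewise-constant rate map to [left, right).

    positions: sorted breakpoints, with len(positions) == len(rates) + 1
    rates: rate on each interval [positions[i], positions[i + 1])
    Hypothetical helper for illustration only.
    """
    if not (positions[0] <= left < right <= positions[-1]):
        raise ValueError("requested region is not covered by the map")
    new_positions, new_rates = [left], []
    for start, end, rate in zip(positions[:-1], positions[1:], rates):
        lo, hi = max(start, left), min(end, right)
        if lo < hi:  # this map interval overlaps the requested region
            new_rates.append(rate)
            new_positions.append(hi)
    return new_positions, new_rates


# A map that extends to 1200 although the chromosome is only 1000 long.
positions = [0, 400, 900, 1200]
rates = [1e-8, 2e-8, 5e-9]
chromosome_length = 1000

clipped_positions, clipped_rates = clip_rate_map(
    positions, rates, 0, chromosome_length
)
print(clipped_positions)  # [0, 400, 900, 1000]
print(clipped_rates)      # [1e-08, 2e-08, 5e-09]
```

Under the old behavior the simulated length would have been 1200 (the map length); with clipping it is the contig length, 1000.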
This is great, thanks @nspope!
I'm lukewarm on all the clipping though. Can these just raise errors instead? Are there any reasonable use cases for mismatched masks, recombination maps, and genome coordinates?
Thanks @grahamgower -- with regard to clipping, say there's an available chromosome-wide mask and annotation, and the user wants to use these whilst simulating some small region. One option is that the user clips the mask/annotation to the region themselves, and stdpopsim raises errors if any of the provided intervals fall outside the region. Another option is that stdpopsim does the clipping and warns if something extreme happens, like all mask/annotation intervals falling outside of the region (this is the current strategy in the PR). I could go either way -- not sure what the right trade-off is between ease of use and preventing misspecification. Another possibility is to go the latter route (stdpopsim does the clipping-to-region) but first check that the mask/annotation matches the chromosome boundaries and raise an error if not.
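To make the error-raising options concrete, here is a minimal sketch in plain Python. `require_within` is a hypothetical validator (not part of stdpopsim) and the coordinates are invented. The same check serves as the strict per-region test in the first option and as the chromosome-boundary pre-check in the last one.

```python
def require_within(intervals, left, right, label):
    """Raise ValueError if any (start, end) interval leaves [left, right)."""
    for start, end in intervals:
        if start >= end or start < left or end > right:
            raise ValueError(
                f"{label} interval ({start}, {end}) is not inside "
                f"[{left}, {right})"
            )


chrom_length = 1_000_000          # made-up chromosome length
region = (200_000, 300_000)       # made-up region to simulate
mask = [(150_000, 250_000), (280_000, 290_000)]  # chromosome-wide mask

# Pre-check from the last option: the mask is valid on the chromosome
# scale, so this passes silently and clipping could proceed.
require_within(mask, 0, chrom_length, "mask")

# First option: strict check against the region itself. This raises,
# because the first mask interval starts before the region does.
try:
    require_within(mask, *region, "mask")
    strict_ok = True
except ValueError as err:
    strict_ok = False
    print(err)
```

The strict check forces the user to pre-clip a chromosome-wide mask, whereas the pre-check only rejects masks that cannot belong to the chromosome at all.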
I think the last round of commits should address your comments, @grahamgower. @andrewkern and/or @petrelharp, could you also take a look? It'd be good to get another opinion with regard to (1) keeping the coordinate system of the parent chromosome for adding DFEs; (2) clipping masks/maps/DFEs to the contig boundaries.
LGTM!
just some minor changes here but looks good to me
we'll also want to add some docs for this including a use case example
in addition to use case, we'll need to make it clear in the docs that simulating a region with selection is not the same as simulating a chromosome with selection and cropping to the region.
@andrewkern do you mind taking a look at the additions to the tutorial? Otherwise I think this is good to go!
looks good to me @nspope! can you rebase and squash these changes before we merge?
Make sure to convert boundaries to int
Add test to bump coverage
And another small test to bump coverage
Update test to use HapMapII_GRCh8 to avoid stochastic failure
Store contig origin as tuple rather than string
Use numpy ops for interval clipping
Mark pytest fixtures for masking tests
Disable length multiplier with left,right coordinates
Clean up interaction b/w length_multiplier and left,right
Add docs
8616a56 to 9c5f10f
"Squash and merge" will do this for you.
Mostly addresses #1346, #670, #401 (and supersedes #402) by adding a way to get a contig from an arbitrary interval of a named chromosome.
Thanks to msprime.RateMap.slice this was pretty straightforward, except for the handling of coordinates when DFEs are added. I opted to use the same coordinate system as the chromosome from which the contig is extracted, so a DFE meant for the middle of the contig is specified in the chromosome's own coordinates. IMO this is cleaner than requiring the user to manually shift the DFE coordinates (and is less confusing with DFE bed files, annotations, etc).
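A toy illustration of that coordinate convention, in plain Python rather than stdpopsim calls. The contig bounds, the DFE interval, and both helper functions are hypothetical. The point is only that the user supplies chromosome coordinates and any shift to contig-relative offsets would happen internally.

```python
# Hypothetical contig cut from a longer chromosome.
contig_left, contig_right = 10_000, 20_000

# DFE interval as the user would give it: in chromosome coordinates.
dfe_interval = (14_000, 16_000)


def offset_within_contig(interval, left):
    """Translate a chromosome-coordinate interval to offsets from the contig start."""
    start, end = interval
    return (start - left, end - left)


def is_centered(interval, left, right):
    """True if the interval leaves equal flanks on both sides of the contig."""
    start, end = interval
    return (start - left) == (right - end)


print(offset_within_contig(dfe_interval, contig_left))        # (4000, 6000)
print(is_centered(dfe_interval, contig_left, contig_right))   # True
```

With this convention, a BED file of DFE or annotation intervals for the whole chromosome can be reused unchanged for any sub-region.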
Previously, if an added DFE had an interval that fell outside the contig, a rather ambiguous SLiM error would get raised. Now, the added DFE intervals are silently clipped to the contig boundaries. If all the intervals fall outside of these boundaries, a warning is raised and a DFE with empty intervals is added.
The same goes for masks (previously an error would get thrown from tskit if a mask interval overlapped a contig boundary): these are now clipped, with a warning if all masked intervals fall outside of the contig boundaries.
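The clip-and-warn behavior described above can be sketched as follows. This is a plain-Python approximation under assumed names (`clip_intervals` and its warning text are hypothetical); the commit list in this thread mentions that the PR itself uses numpy operations for interval clipping.

```python
import warnings


def clip_intervals(intervals, left, right, label="DFE"):
    """Clip (start, end) intervals to [left, right), dropping empty results.

    Warns if intervals were supplied but none overlap the contig.
    """
    clipped = []
    for start, end in intervals:
        lo, hi = max(start, left), min(end, right)
        if lo < hi:  # keep only the part that overlaps the contig
            clipped.append((lo, hi))
    if intervals and not clipped:
        warnings.warn(
            f"No {label} intervals overlap [{left}, {right}); "
            "using empty intervals.",
            UserWarning,
        )
    return clipped


# Contig spanning [10_000, 20_000) in chromosome coordinates.
partly_inside = [(5_000, 12_000), (15_000, 25_000), (30_000, 40_000)]
print(clip_intervals(partly_inside, 10_000, 20_000))
# [(10000, 12000), (15000, 20000)]

# All intervals outside the contig: a warning and an empty result.
print(clip_intervals([(30_000, 40_000)], 10_000, 20_000, label="mask"))
# []
```

Intervals that straddle a boundary are trimmed silently; only the case where nothing survives triggers a warning.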