Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API to get contig from region of named chromosome #1348

Merged
merged 2 commits into from
Sep 15, 2022

Conversation

nspope
Copy link
Collaborator

@nspope nspope commented Sep 3, 2022

Mostly addresses #1346, #670, #401 (and supersedes #402) by adding a way to get a contig from an arbitrary interval of a named chromosome.

Thanks to msprime.RateMap.slice this was pretty straightforward, except for the handling of coordinates when DFEs are added. I opted to use the same coordinate system as the chromosome from which the contig is extracted, so for example the following adds a DFE in the middle of the contig:

species = stdpopsim.get_species("HomSap")
contig = species.get_contig(chromosome="chr1", left=1e6, right=1.1e6)
contig.add_dfe(intervals=[[1.05e6, 1.06e6]], dfe=...)

IMO this is cleaner than requiring the user to manually shift the DFE coordinates (and is less confusing with DFE bed files, annotations, etc).

Previously, if an added DFE had an interval that fell outside the contig, a rather ambiguous SLiM error would get raised. Now, the added DFE intervals are silently clipped to the contig boundaries. If all the intervals fall outside of these boundaries, a warning is raised and a DFE with empty intervals is added.

The same goes for masks (previously an error would get thrown from tskit if a mask interval overlapped a contig boundary): these are now clipped, with a warning if all masked intervals fall outside of the contig boundaries.

@nspope nspope changed the title Add API to get contig from interval of named chromosome Add API to get contig from region of named chromosome Sep 3, 2022
@codecov
Copy link

codecov bot commented Sep 3, 2022

Codecov Report

Merging #1348 (8616a56) into main (2bfa88b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 8616a56 differs from pull request most recent head 9c5f10f. Consider uploading reports for the commit 9c5f10f to get more accurate results

@@           Coverage Diff           @@
##             main    #1348   +/-   ##
=======================================
  Coverage   99.67%   99.68%           
=======================================
  Files         113      113           
  Lines        3699     3751   +52     
  Branches      475      487   +12     
=======================================
+ Hits         3687     3739   +52     
  Misses          8        8           
  Partials        4        4           
Impacted Files Coverage Δ
stdpopsim/genetic_maps.py 100.00% <ø> (ø)
stdpopsim/species.py 97.89% <ø> (ø)
stdpopsim/cli.py 99.77% <100.00%> (+<0.01%) ⬆️
stdpopsim/genomes.py 100.00% <100.00%> (ø)
stdpopsim/utils.py 100.00% <100.00%> (ø)
stdpopsim/warning_categories.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@nspope nspope marked this pull request as draft September 3, 2022 01:31
@nspope nspope marked this pull request as ready for review September 3, 2022 02:10
@nspope nspope marked this pull request as draft September 3, 2022 02:32
@nspope nspope marked this pull request as ready for review September 3, 2022 03:54
@nspope
Copy link
Collaborator Author

nspope commented Sep 3, 2022

Ah, there's a slight issue. If the recombination map is longer than the chromosome, the current behavior of stdpopsim is to use the recombination map length for simulations instead of the chromosome length. But, the behavior in this PR is to clip the recombination map to the size of the contig. I can modify the PR to keep the current behavior (when the contig is the entire chromosome and the recombination map is longer) but this seems like an odd edge case to document. Especially if the plan is to deprecate genetic maps that don't match the assembly. So I've kept the new "clip-to-contig" behavior.

There's no effect on any of the tests, aside from the occasional stochastic failure of test_slim_engine.py::TestRecombinationMap::test_chr1 that uses "HapMapII_GRCh37" and explicitly checks that simulated tree sequence breakpoints fall within the recombination map length. Updating this test to use "HapMapII_GRCh38" eliminates the stochastic failure.

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @nspope!

I'm luke warm on all the clipping though. Can these just raise errors instead? Are there any reasonable use cases for mismatched masks, recombination maps, and genome coordinates?

stdpopsim/cli.py Outdated Show resolved Hide resolved
stdpopsim/cli.py Outdated Show resolved Hide resolved
stdpopsim/genomes.py Outdated Show resolved Hide resolved
stdpopsim/genomes.py Outdated Show resolved Hide resolved
stdpopsim/genomes.py Outdated Show resolved Hide resolved
stdpopsim/utils.py Outdated Show resolved Hide resolved
tests/test_masking.py Show resolved Hide resolved
@nspope
Copy link
Collaborator Author

nspope commented Sep 3, 2022

Thanks @grahamgower -- with regards to clipping, say there's an available chromosome-wide mask and annotation, and the user wants to use these whilst simulating some small region. One option is that the user clips the mask/annotation to the region themselves, and stdpopsim raises errors if any of the provided intervals fall outside the region. Another option is that stdpopsim does the clipping and warns if something extreme happens, like all mask/annotation intervals falling outside of the region (this is the current strategy in the PR). I could go either way -- not sure what is the right trade-off between ease-of-use and preventing miss-specification.

Another possibility is to go the latter route (stdpopsim does the clipping-to-region) but first check that the mask/annotation matches the chromosome boundaries and raise an error if not.

@nspope
Copy link
Collaborator Author

nspope commented Sep 9, 2022

I think the last round of commits should address your comments, @grahamgower.

@andrewkern and/or @petrelharp, could you also take a look? It'd be good to get another opinion with regards to (1) keeping coordinate system of parent chromosome for adding DFEs; (2) clipping masks/maps/dfes to the contig boundaries.

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@andrewkern andrewkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor changes here but looks good to me

stdpopsim/cli.py Outdated Show resolved Hide resolved
stdpopsim/genetic_maps.py Outdated Show resolved Hide resolved
stdpopsim/genomes.py Show resolved Hide resolved
@andrewkern
Copy link
Member

we'll also want to add some docs for this including a use case example

@nspope
Copy link
Collaborator Author

nspope commented Sep 14, 2022

in addition to use case, we'll need to make it clear in the docs that simulating a region with selection is not the same as simulating a chromosome with selection and cropping to the region.

@nspope
Copy link
Collaborator Author

nspope commented Sep 14, 2022

@andrewkern do you mind taking a look at the additions to the tutorial?

Otherwise I think this is good to go!

@andrewkern
Copy link
Member

looks good to me @nspope! can you rebase and squash these changes before we merge?

petrelharp and others added 2 commits September 15, 2022 09:38
Make sure to convert boundaries to int

Add test to bump coverage

And another small test to bump coverage

Update test to use HapMapII_GRCh8 to avoid stochastic failure

Store contig origin as tuple rather than string

Use numpy ops for interval clipping

Mark pytest fixtures for masking tests

Disable length multiplier with left,right coordinates

Clean up interaction b/w length_multiplier and left,right

Add docs
@petrelharp
Copy link
Contributor

rebase and squash these changes before we merge?

"Squash and merge" will do this for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants