-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Annot fixup #960
Annot fixup #960
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the direction we should take @andrewkern. We still don't know how the annotations are going to be used by the BGS models, so I think it's premature to do something concrete here where we choose which annotations folks can use. I think we really need to make some BGS models (for 2+ species), apply them to the appropriate annotations, then step back and look at how we create an API (both internal and external).
maintenance/annotation_maint.py
Outdated
# this is fragile-- is ensembl_havana always a feature? | ||
exons = gff[ | ||
np.where( | ||
np.logical_and(gff.source == "ensembl_havana", gff.type == "exon") | ||
) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ensembl_havana source
is only available for human, mouse, zebrafish and rat. But there's also ensembl
and havana
sources in this gff, so we certainly should be judicious in our choice. I chose the ensembl_havana source in the analysis2 repo by guessing --- I have no clue what we actually want. Is it even possible to know what is going to be in a gff without opening it and looking?
I hate to say it, but I think we need to keep all the annotation info, so that BGS model implementers can choose this stuff downstream. Probably what we want the maintenance code to do is to split the gff up into smaller pieces, by chromosome, and possibly other fields, so that loading/parsing is quick for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so while choosing this one gene model track -- ensemble_havana -- is fragile, i don't think there is any reason we want all the annotations. at most i reckon we want genes and conserved noncoding regions. CNSs will be challenging to get for many non-human species, so that leaves us with genes as the major locus of selection and I think that's good.
with respect to splitting the gff-- it seems quiet fast now to load and the files are small. humans will have more annotation than anything else so this is the slowest example we'll work with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion in #391 suggests we'll quickly outgrow just exons. I chose the ensembl_havana source and exon type as an example, fully expecting someone to come back and say: "we don't want that, we want xyz
instead", or "we need abc
as well". I really do think we want to let the BGS models guide us here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason folks might want access to other annotation types/sources is to do simulation-based inference. If we lock ourselves to one or just a few things, this use case becomes difficult or impossible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just had a look back at #391, and yeah I agaree with what is being said there. we've already got this handled i reckon though in the current implementation of Annotation
, because species can be associated with more than one set of annotations. so it's simple enough for us to create a list of features like UTR5, UTR3, CDS, etc. then on the maintenance side add those as annotations and push them to AWS.
right now, i see this single case of the human genome with only exons as a proof of concept that we can tweak going forward, but i think all the major bits are here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we decide we want users to access as many annotation types as they want, then we will have to make interval operations in the user side of things -- at least merging and subtracting. I think also we will ultimately want this, but don't see why we can't just merge this and open follow up issues to expand?
maintenance/annotation_maint.py
Outdated
return list(iter_merged(intervals, closed=closed)) | ||
|
||
|
||
def test_merged(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should go in tests/test_maintenance.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think tests/test_annotation.py
is actually the right place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but the test function can be removed from here now.
@@ -24,3 +24,4 @@ numpy | |||
scikit-allel | |||
zarr>=2.4 | |||
biopython | |||
boto3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for? I didn't see it in the imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah thats for the AWS push script I wrote. still not sure how to share that script cause of passwords.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for the aws_push_annotations.py
script that i've written to deal with getting these info to AWS. problem that we haven't yet figured out is how to share those passwords securely and have the code on github
I'm with @grahamgower here - I think we should get something simple working in terms of simulations first before making a more general annotations API. |
so i think we're chasing our own tail here. if we merge this code then we have a good place to move forward for building the simulation API. we need intervals do do selection sims, this provides them in a general way. we can nitpick the maintenance side later as to what sorts of annotations we provide. |
We had decided a few weeks ago to avoid this complexity and just make some simple gene annotations based on a RateMap that we pass around between ourselves, so that we can get the simulations going though. I guess we can see how much work it is to go in either direction from here, though. |
7116fe6
to
03a1564
Compare
tests/test_annotations.py
Outdated
def setup_module(): | ||
destination = pathlib.Path("_test_cache/zipfiles/") | ||
def setUpModule(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conventions for this in pytest are setup_module()
/teardown_module()
. https://docs.pytest.org/en/latest/how-to/xunit_setup.html?highlight=setup_module
tests/test_annotations.py
Outdated
import stdpopsim | ||
from stdpopsim import utils | ||
import tests | ||
|
||
import unittest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've recently removed unittest
imports so that we use pytest more extensively. Pytest uses raw asserts, test classes don't need to inherit from unittest.TestCase
anymore, and warnings are filtered using a class/method decorator. This leads to clearer tests, usually.
tests/test_annotations.py
Outdated
@@ -130,15 +133,15 @@ def test_correct_url(self): | |||
# The destination file will be missing. | |||
with pytest.raises(FileNotFoundError): | |||
an.download() | |||
mocked_get.assert_called_once_with(an.zarr_url, filename=mock.ANY) | |||
mocked_get.assert_called_once_with(an.intervals_url, filename=unittest.mock.ANY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/unittest.mock/mock/
tests/test_annotations.py
Outdated
@pytest.mark.xfail # HomSap annotation not currently available | ||
class TestGetChromosomeAnnotations(tests.CacheReadingTest): | ||
# @pytest.mark.xfail # HomSap annotation not currently available | ||
class TestGetChromosomeAnnotations(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really should inherit from tests.CacheReadingTest
. (and unittest.TestCase isn't needed anymore)
maintenance/annotation_maint.py
Outdated
return list(iter_merged(intervals, closed=closed)) | ||
|
||
|
||
def test_merged(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but the test function can be removed from here now.
maintenance/annotation_maint.py
Outdated
for an in spc.annotations: | ||
GFF_URL = an.url | ||
GFF_SHA256 = an.gff_sha256 | ||
CHROM_IDS = [chrom.id for chrom in spc.genome.chromosomes] | ||
genome_version = os.path.basename(GFF_URL).split(".")[1] | ||
logger.info(f"Downloading GFF file {spc.id}") | ||
tmp_path = f"{spc.id}.tmp.gff.gz" | ||
gff = get_gff_recarray(GFF_URL, GFF_SHA256) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the UPPERCASE variables aren't really needed here. Just replace the use of these variables with an.url
and an.gff_sha256
, etc.
I'm not sure what's going on with the failing test |
03a1564
to
05677dd
Compare
dcb2415
to
7d67c85
Compare
Codecov Report
@@ Coverage Diff @@
## main #960 +/- ##
==========================================
+ Coverage 98.65% 99.51% +0.86%
==========================================
Files 89 89
Lines 2899 2887 -12
Branches 348 346 -2
==========================================
+ Hits 2860 2873 +13
+ Misses 31 6 -25
Partials 8 8
Continue to review full report at Codecov.
|
this was an annoying logger thing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiousity, does this generalise beyond HomSap?
gff_sha256="313ad46bd4af78b45b9f5d8407bbcbd3f87f4be0747060e84b3b5eb931530ec1", | ||
intervals_url=( | ||
"https://stdpopsim.s3-us-west-2.amazonaws.com/" | ||
"annotations/HomSap/HomSap.GRCh38.tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep the ensembl release number in this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup.
tests/test_annotations.py
Outdated
with pytest.raises(OSError): | ||
an.download() | ||
|
||
@pytest.mark.xfail # HomSap annotation not currently available | ||
# @pytest.mark.xfail # HomSap annotation not currently available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the test now passes, the xfail decorator can be removed. Ditto for test below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -818,9 +818,10 @@ def test_dry_run_quiet(self): | |||
filename = path / "output.trees" | |||
cmd = ( | |||
f"{sys.executable} -m stdpopsim -q HomSap -D -L 1000 " | |||
"-o {filename} 2" | |||
f"-o {filename} 2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
tests/test_cli.py
Outdated
) | ||
subprocess.run(cmd, stderr=stderr, shell=True, check=True) | ||
print(cmd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from debugging? I guess this can be removed?
It won't generalize that well because the GFF files from different species have different annotations. After we merge this I'll go ahead and add Drosophila annotations-- that should be instructive |
…ervals Graham had produced passing tests, now returns intervals pushed annotation maintenance tests to correct place dammit i put the test in the wrong file some churn getting this up to date: working clean up before rebase to kill bug? found dry run bug clean up edits from Graham
9770ae0
to
66a68cb
Compare
gonna try to use the "Rebase and merge" button... will that do bad things? |
gonna go for it |
ugh that didn't quite work out... looks like it didn't squash any of my commits... sorry team, i won't try that again |
okay here is a PR that fixes up the Annotation class to return non-overlapping intervals using @grahamgower's approach from the Analysis2 repo.
basics points to this PR:
Annotation.get_chromosome_annotations()
returns this array for use with SLiMthings to do:
gff.source == "ensembl_havana"
requirement is. should we worry?