Annot fixup #960

andrewkern · 2021-06-16T20:51:37Z

okay here is a PR that fixes up the Annotation class to return non-overlapping intervals using @grahamgower's approach from the Analysis2 repo.

basic points of this PR:

refactored the data type stored to be a numpy array of intervals. the Annotation.get_chromosome_annotations() returns this array for use with SLiM
the annotation_maint.py script now does the download of the GFF and the interval merge, before tarring stuff up for aws
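
For readers unfamiliar with the interval merge mentioned above, here is a minimal sketch of turning overlapping intervals into a non-overlapping numpy array. It is an illustration under assumptions, not the code in this PR: `merge_intervals` is a hypothetical name, intervals are assumed to be half-open `[start, end)` pairs, and abutting intervals are joined.

```python
import numpy as np


def merge_intervals(intervals):
    """Merge overlapping or abutting half-open [start, end) intervals.

    Hypothetical helper for illustration; not the PR's iter_merged().
    """
    intervals = np.asarray(intervals, dtype=np.int64).reshape(-1, 2)
    if len(intervals) == 0:
        return intervals
    # Sort by start so that overlapping intervals become neighbours.
    intervals = intervals[np.argsort(intervals[:, 0], kind="stable")]
    merged = [intervals[0].tolist()]
    for start, end in intervals[1:].tolist():
        if start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return np.array(merged, dtype=np.int64)


# Made-up exon coordinates, for illustration only.
exons = [[100, 200], [150, 300], [400, 500], [500, 550], [10, 20]]
merged = merge_intervals(exons)
print(merged)
```

The result is an `(n, 2)` integer array sorted by start position, which is the general shape of data described in the first point above.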

things to do:

i don't know how fragile the gff.source == "ensembl_havana" requirement is. should we worry?
the way things are written now, species can have more than one annotation. it would be cool to extend this to other choices
add a front end for doing the download of genome annotations in the same way that we have for genome assemblies

grahamgower

I'm not sure this is the direction we should take @andrewkern. We still don't know how the annotations are going to be used by the BGS models, so I think it's premature to do something concrete here where we choose which annotations folks can use. I think we really need to make some BGS models (for 2+ species), apply them to the appropriate annotations, then step back and look at how we create an API (both internal and external).

grahamgower · 2021-06-17T08:19:38Z

maintenance/annotation_maint.py

+            # this is fragile-- is ensembl_havana always a feature?
+            exons = gff[
+                np.where(
+                    np.logical_and(gff.source == "ensembl_havana", gff.type == "exon")
+                )
+            ]


The ensembl_havana source is only available for human, mouse, zebrafish and rat. But there's also ensembl and havana sources in this gff, so we certainly should be judicious in our choice. I chose the ensembl_havana source in the analysis2 repo by guessing --- I have no clue what we actually want. Is it even possible to know what is going to be in a gff without opening it and looking?

I hate to say it, but I think we need to keep all the annotation info, so that BGS model implementers can choose this stuff downstream. Probably what we want the maintenance code to do is to split the gff up into smaller pieces, by chromosome, and possibly other fields, so that loading/parsing is quick for the user.
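
On the question of knowing what is in a GFF: one cheap way to look is to tally the source and type columns (columns 2 and 3 of a GFF3 record) before deciding on a filter. The sketch below uses only the standard library and made-up records; `count_source_type` is a hypothetical helper, not part of stdpopsim or this PR.

```python
import collections
import io


def count_source_type(lines):
    """Count (source, type) pairs in GFF3 text (columns 2 and 3)."""
    counts = collections.Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GFF3 record
        counts[(fields[1], fields[2])] += 1
    return counts


# Made-up records, for illustration only.
example = io.StringIO(
    "##gff-version 3\n"
    "1\tensembl_havana\texon\t100\t200\t.\t+\t.\tParent=t1\n"
    "1\tensembl_havana\texon\t300\t400\t.\t+\t.\tParent=t1\n"
    "1\thavana\texon\t150\t250\t.\t-\t.\tParent=t2\n"
    "1\tensembl\tgene\t100\t400\t.\t+\t.\tID=g1\n"
)
counts = count_source_type(example)
for (source, feature_type), n in sorted(counts.items()):
    print(source, feature_type, n)
```

For a real, gzipped file the same function could be fed `gzip.open(path, "rt")` instead of the `io.StringIO` object.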

so while choosing this one gene model track -- ensembl_havana -- is fragile, i don't think there is any reason we want all the annotations. at most i reckon we want genes and conserved noncoding regions. CNSs will be challenging to get for many non-human species, so that leaves us with genes as the major locus of selection and I think that's good.

with respect to splitting the gff-- it seems quite fast now to load and the files are small. humans will have more annotation than anything else so this is the slowest example we'll work with.

The discussion in #391 suggests we'll quickly outgrow just exons. I chose the ensembl_havana source and exon type as an example, fully expecting someone to come back and say: "we don't want that, we want xyz instead", or "we need abc as well". I really do think we want to let the BGS models guide us here.

One reason folks might want access to other annotation types/sources is to do simulation-based inference. If we lock ourselves to one or just a few things, this use case becomes difficult or impossible.

just had a look back at #391, and yeah I agree with what is being said there. we've already got this handled i reckon though in the current implementation of Annotation, because species can be associated with more than one set of annotations. so it's simple enough for us to create a list of features like UTR5, UTR3, CDS, etc. then on the maintenance side add those as annotations and push them to AWS.

right now, i see this single case of the human genome with only exons as a proof of concept that we can tweak going forward, but i think all the major bits are here.

if we decide we want users to access as many annotation types as they want, then we will have to support interval operations on the user side of things -- at least merging and subtracting. I think we will ultimately want this too, but don't see why we can't just merge this and open follow up issues to expand?
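
To make the "subtracting" operation concrete, here is a rough sketch of interval subtraction. It assumes both inputs are sorted, non-overlapping lists of half-open `[start, end)` pairs (for instance, the output of a merge step); `subtract_intervals` is a hypothetical name and the coordinates are made up.

```python
def subtract_intervals(a, b):
    """Return the parts of intervals `a` that are not covered by `b`.

    Both inputs must be sorted, non-overlapping half-open [start, end) pairs.
    """
    result = []
    j = 0
    for start, end in a:
        pos = start
        # Skip intervals in b that end before this interval begins.
        while j < len(b) and b[j][1] <= start:
            j += 1
        k = j
        # Walk over the intervals in b that overlap [start, end).
        while k < len(b) and b[k][0] < end:
            if b[k][0] > pos:
                result.append((pos, b[k][0]))
            pos = max(pos, b[k][1])
            k += 1
        if pos < end:
            result.append((pos, end))
    return result


# Made-up coordinates: remove masked regions from two gene intervals.
genes = [(0, 100), (200, 300)]
masked = [(10, 20), (50, 60), (90, 250)]
remaining = subtract_intervals(genes, masked)
print(remaining)
```

Because both lists are sorted, the index `j` only moves forward, so the whole pass is linear in the total number of intervals.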

grahamgower · 2021-06-17T08:23:37Z

maintenance/annotation_maint.py

+    return list(iter_merged(intervals, closed=closed))
+
+
+def test_merged():


The test should go in tests/test_maintenance.py.

i think tests/test_annotation.py is actually the right place

Ok, but the test function can be removed from here now.

grahamgower · 2021-06-17T08:25:30Z

requirements/development.txt

@@ -24,3 +24,4 @@ numpy
 scikit-allel
 zarr>=2.4
 biopython
+boto3


What's this for? I didn't see it in the imports.

ah that's for the AWS push script I wrote. still not sure how to share that script because of the passwords.

this is for the aws_push_annotations.py script that i've written to deal with getting this info to AWS. the problem we haven't yet figured out is how to share those passwords securely while keeping the code on github

jeromekelleher · 2021-06-17T16:36:49Z

I'm with @grahamgower here - I think we should get something simple working in terms of simulations first before making a more general annotations API.

andrewkern · 2021-06-17T17:22:15Z

so i think we're chasing our own tail here. if we merge this code then we have a good place to move forward for building the simulation API. we need intervals to do selection sims, this provides them in a general way.

we can nitpick the maintenance side later as to what sorts of annotations we provide.

jeromekelleher · 2021-06-17T17:40:45Z

We had decided a few weeks ago to avoid this complexity and just make some simple gene annotations based on a RateMap that we pass around between ourselves, so that we can get the simulations going though. I guess we can see how much work it is to go in either direction from here, though.

grahamgower · 2021-06-18T07:21:24Z

tests/test_annotations.py

-def setup_module():
-    destination = pathlib.Path("_test_cache/zipfiles/")
+def setUpModule():


The conventions for this in pytest are setup_module()/teardown_module(). https://docs.pytest.org/en/latest/how-to/xunit_setup.html?highlight=setup_module

grahamgower · 2021-06-18T07:24:54Z

tests/test_annotations.py

 import stdpopsim
 from stdpopsim import utils
 import tests
-
+import unittest


I've recently removed unittest imports so that we use pytest more extensively. Pytest uses raw asserts, test classes don't need to inherit from unittest.TestCase anymore, and warnings are filtered using a class/method decorator. This leads to clearer tests, usually.

grahamgower · 2021-06-18T07:25:55Z

tests/test_annotations.py

@@ -130,15 +133,15 @@ def test_correct_url(self):
            # The destination file will be missing.
            with pytest.raises(FileNotFoundError):
                an.download()
-        mocked_get.assert_called_once_with(an.zarr_url, filename=mock.ANY)
+        mocked_get.assert_called_once_with(an.intervals_url, filename=unittest.mock.ANY)


s/unittest.mock/mock/
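
As a sketch of what the suggested spelling looks like, assuming the test module imports `mock` via `from unittest import mock`: `mock.ANY` compares equal to any value, so the assertion pins down the URL while ignoring the exact file name. The `download` function and URL below are hypothetical stand-ins, not the code under test.

```python
from unittest import mock


def download(fetch, url, cache_dir):
    """Hypothetical stand-in for a method that calls fetch(url, filename=...)."""
    fetch(url, filename=f"{cache_dir}/annotation.tar.gz")


mocked_get = mock.Mock()
url = "https://example.com/annotation.tar.gz"  # placeholder URL
download(mocked_get, url, "/tmp/some/cache")

# mock.ANY matches whatever file name was passed; only the URL is checked.
mocked_get.assert_called_once_with(url, filename=mock.ANY)
```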

grahamgower · 2021-06-18T07:26:50Z

tests/test_annotations.py

-@pytest.mark.xfail  # HomSap annotation not currently available
-class TestGetChromosomeAnnotations(tests.CacheReadingTest):
+# @pytest.mark.xfail  # HomSap annotation not currently available
+class TestGetChromosomeAnnotations(unittest.TestCase):


I think this really should inherit from tests.CacheReadingTest. (and unittest.TestCase isn't needed anymore)

grahamgower · 2021-06-18T07:27:45Z

maintenance/annotation_maint.py

+    return list(iter_merged(intervals, closed=closed))
+
+
+def test_merged():


Ok, but the test function can be removed from here now.

grahamgower · 2021-06-18T07:30:16Z

maintenance/annotation_maint.py

+        for an in spc.annotations:
+            GFF_URL = an.url
+            GFF_SHA256 = an.gff_sha256
+            CHROM_IDS = [chrom.id for chrom in spc.genome.chromosomes]
+            genome_version = os.path.basename(GFF_URL).split(".")[1]
+            logger.info(f"Downloading GFF file {spc.id}")
+            tmp_path = f"{spc.id}.tmp.gff.gz"
+            gff = get_gff_recarray(GFF_URL, GFF_SHA256)


I think the UPPERCASE variables aren't really needed here. Just replace the use of these variables with an.url and an.gff_sha256, etc.

grahamgower · 2021-06-18T07:32:39Z

I'm not sure what's going on with the failing test tests/test_cli.py::TestDryRun::test_dry_run_quiet. That test checks that there's no output when the -q flag is used, so probably there's some warning or print statement leaking into the output. If you modify the test to print the output, I'm sure you'll narrow it down.

codecov · 2021-10-19T05:25:33Z

Codecov Report

Merging #960 (52c2e59) into main (e1c1e72) will increase coverage by 0.86%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #960      +/-   ##
==========================================
+ Coverage   98.65%   99.51%   +0.86%     
==========================================
  Files          89       89              
  Lines        2899     2887      -12     
  Branches      348      346       -2     
==========================================
+ Hits         2860     2873      +13     
+ Misses         31        6      -25     
  Partials        8        8

Impacted Files                             Coverage Δ
stdpopsim/annotations.py                   95.23% <100.00%> (+31.60%) ⬆️
stdpopsim/catalog/HomSap/annotations.py    100.00% <100.00%> (ø)
stdpopsim/species.py                       97.91% <0.00%> (+7.29%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

andrewkern · 2021-10-19T05:25:46Z

> I'm not sure what's going on with the failing test tests/test_cli.py::TestDryRun::test_dry_run_quiet. That test checks that there's no output when the -q flag is used, so probably there's some warning or print statement leaking into the output. If you modify the test to print the output, I'm sure you'll narrow it down.

this was an annoying logger thing

grahamgower

Out of curiosity, does this generalise beyond HomSap?

grahamgower · 2021-10-19T13:12:41Z

stdpopsim/catalog/HomSap/annotations.py

+    gff_sha256="313ad46bd4af78b45b9f5d8407bbcbd3f87f4be0747060e84b3b5eb931530ec1",
+    intervals_url=(
+        "https://stdpopsim.s3-us-west-2.amazonaws.com/"
+        "annotations/HomSap/HomSap.GRCh38.tar.gz"


Should we keep the ensembl release number in this?
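
The maintenance script reviewed earlier takes `genome_version` from the second dot-separated field of the GFF file name. A rough sketch of how a release number could be carried into the tarball name as well, assuming (hypothetically) a GFF file name of the form `<species>.<assembly>.<release>.gff3.gz`; the URL and release number below are placeholders, not values from this PR.

```python
import os

# Placeholder URL; the file name pattern is an assumption for illustration.
gff_url = "https://example.org/gff3/homo_sapiens/Homo_sapiens.GRCh38.104.gff3.gz"

parts = os.path.basename(gff_url).split(".")
assembly, release = parts[1], parts[2]

# Two candidate names for the uploaded intervals tarball.
without_release = f"HomSap.{assembly}.tar.gz"
with_release = f"HomSap.{assembly}.{release}.tar.gz"
print(without_release)
print(with_release)
```

Whether the extra field is worth having depends on whether more than one release per assembly is ever expected to be hosted, which this thread leaves open.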

grahamgower · 2021-10-19T13:16:34Z

tests/test_annotations.py

        with pytest.raises(OSError):
            an.download()

-    @pytest.mark.xfail  # HomSap annotation not currently available
+    # @pytest.mark.xfail  # HomSap annotation not currently available


If the test now passes, the xfail decorator can be removed. Ditto for test below.

grahamgower · 2021-10-19T13:17:33Z

tests/test_cli.py

@@ -818,9 +818,10 @@ def test_dry_run_quiet(self):
                filename = path / "output.trees"
                cmd = (
                    f"{sys.executable} -m stdpopsim -q HomSap -D -L 1000 "
-                    "-o {filename} 2"
+                    f"-o {filename} 2"


Nice catch!
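
For anyone wondering why the one-character change matters: adjacent string literals are concatenated, but the `f` prefix applies to each literal separately, so a placeholder in an unprefixed literal is left as literal text. A small self-contained demonstration (the command string mirrors the diff above; nothing is executed):

```python
import pathlib
import sys

filename = pathlib.Path("output.trees")

# Only the first literal is an f-string, so "{filename}" stays as-is.
broken = (
    f"{sys.executable} -m stdpopsim -q HomSap -D -L 1000 "
    "-o {filename} 2"
)
# With the prefix on both literals, the path is interpolated.
fixed = (
    f"{sys.executable} -m stdpopsim -q HomSap -D -L 1000 "
    f"-o {filename} 2"
)
print(broken)
print(fixed)
```

With the broken version, the command would have been asked to write to a file literally named `{filename}` rather than to the intended path.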

grahamgower · 2021-10-19T13:18:06Z

tests/test_cli.py

                )
                subprocess.run(cmd, stderr=stderr, shell=True, check=True)
+                print(cmd)


Leftover from debugging? I guess this can be removed?

andrewkern · 2021-10-19T17:11:54Z

It won't generalize that well because the GFF files from different species have different annotations. After we merge this I'll go ahead and add Drosophila annotations-- that should be instructive

…ervals Graham had produced
passing tests, now returns intervals
pushed annotation maintenance tests to correct place
dammit i put the test in the wrong file
some churn getting this up to date: working
clean up before rebase to kill bug?
found dry run bug
clean up edits from Graham

andrewkern · 2021-10-19T18:50:03Z

gonna try to use the "Rebase and merge" button... will that do bad things?

andrewkern · 2021-10-19T18:51:18Z

gonna go for it

andrewkern · 2021-10-19T18:53:10Z

ugh that didn't quite work out... looks like it didn't squash any of my commits... sorry team, i won't try that again

petrelharp · 2021-10-22T08:24:21Z

FYI, there's a "squash and merge" button:

andrewkern requested review from grahamgower and mufernando June 16, 2021 20:51

grahamgower reviewed Jun 17, 2021

andrewkern force-pushed the annot_fixup branch 2 times, most recently from 7116fe6 to 03a1564 Compare June 17, 2021 18:13

grahamgower reviewed Jun 18, 2021

andrewkern force-pushed the annot_fixup branch from 03a1564 to 05677dd Compare September 28, 2021 17:03

andrewkern force-pushed the annot_fixup branch from dcb2415 to 7d67c85 Compare October 19, 2021 04:23

grahamgower reviewed Oct 19, 2021

andrewkern force-pushed the annot_fixup branch from 9770ae0 to 66a68cb Compare October 19, 2021 17:15

izabelcavassim mentioned this pull request Oct 19, 2021

Add both annotation and DFE to the catalog documentation #1041

Closed

andrewkern added 3 commits October 19, 2021 11:05

wanted to clean up the maintenance interface for the annotations

01cf886

url swap

81c73bf

file naming woes from AWS

52c2e59

andrewkern merged commit da3e351 into popsim-consortium:main Oct 19, 2021

andrewkern deleted the annot_fixup branch October 19, 2021 18:52
