Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annot fixup #960

Merged
merged 4 commits into from
Oct 19, 2021
Merged

Annot fixup #960

merged 4 commits into from
Oct 19, 2021

Conversation

andrewkern
Copy link
Member

@andrewkern andrewkern commented Jun 16, 2021

okay here is a PR that fixes up the Annotation class to return non-overlapping intervals using @grahamgower's approach from the Analysis2 repo.

basics points to this PR:

  1. refactored the data type stored to be a numpy array of intervals. the Annotation.get_chromosome_annotations() returns this array for use with SLiM
  2. the annotation_maint.py script now does the download of the GFF and the interval merge, before tarring stuff up for aws

things to do:

  1. i don't know how fragile the gff.source == "ensembl_havana" requirement is. should we worry?
  2. the way things are written now, species can have more than one annotation. it would be cool to extend this to other choices
  3. add a front end for doing the download of genome annotations in the same way that we have for genome assemblies

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the direction we should take @andrewkern. We still don't know how the annotations are going to be used by the BGS models, so I think it's premature to do something concrete here where we choose which annotations folks can use. I think we really need to make some BGS models (for 2+ species), apply them to the appropriate annotations, then step back and look at how we create an API (both internal and external).

Comment on lines 121 to 104
# this is fragile-- is ensembl_havana always a feature?
exons = gff[
np.where(
np.logical_and(gff.source == "ensembl_havana", gff.type == "exon")
)
]
Copy link
Member

@grahamgower grahamgower Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ensembl_havana source is only available for human, mouse, zebrafish and rat. But there's also ensembl and havana sources in this gff, so we certainly should be judicious in our choice. I chose the ensembl_havana source in the analysis2 repo by guessing --- I have no clue what we actually want. Is it even possible to know what is going to be in a gff without opening it and looking?

I hate to say it, but I think we need to keep all the annotation info, so that BGS model implementers can choose this stuff downstream. Probably what we want the maintenance code to do is to split the gff up into smaller pieces, by chromosome, and possibly other fields, so that loading/parsing is quick for the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so while choosing this one gene model track -- ensemble_havana -- is fragile, i don't think there is any reason we want all the annotations. at most i reckon we want genes and conserved noncoding regions. CNSs will be challenging to get for many non-human species, so that leaves us with genes as the major locus of selection and I think that's good.

with respect to splitting the gff-- it seems quiet fast now to load and the files are small. humans will have more annotation than anything else so this is the slowest example we'll work with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion in #391 suggests we'll quickly outgrow just exons. I chose the ensembl_havana source and exon type as an example, fully expecting someone to come back and say: "we don't want that, we want xyz instead", or "we need abc as well". I really do think we want to let the BGS models guide us here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason folks might want access to other annotation types/sources is to do simulation-based inference. If we lock ourselves to one or just a few things, this use case becomes difficult or impossible.

Copy link
Member Author

@andrewkern andrewkern Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just had a look back at #391, and yeah I agaree with what is being said there. we've already got this handled i reckon though in the current implementation of Annotation, because species can be associated with more than one set of annotations. so it's simple enough for us to create a list of features like UTR5, UTR3, CDS, etc. then on the maintenance side add those as annotations and push them to AWS.

right now, i see this single case of the human genome with only exons as a proof of concept that we can tweak going forward, but i think all the major bits are here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we decide we want users to access as many annotation types as they want, then we will have to make interval operations in the user side of things -- at least merging and subtracting. I think also we will ultimately want this, but don't see why we can't just merge this and open follow up issues to expand?

return list(iter_merged(intervals, closed=closed))


def test_merged():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should go in tests/test_maintenance.py.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think tests/test_annotation.py is actually the right place

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but the test function can be removed from here now.

@@ -24,3 +24,4 @@ numpy
scikit-allel
zarr>=2.4
biopython
boto3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for? I didn't see it in the imports.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thats for the AWS push script I wrote. still not sure how to share that script cause of passwords.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for the aws_push_annotations.py script that i've written to deal with getting these info to AWS. problem that we haven't yet figured out is how to share those passwords securely and have the code on github

@jeromekelleher
Copy link
Member

I'm with @grahamgower here - I think we should get something simple working in terms of simulations first before making a more general annotations API.

@andrewkern
Copy link
Member Author

so i think we're chasing our own tail here. if we merge this code then we have a good place to move forward for building the simulation API. we need intervals do do selection sims, this provides them in a general way.

we can nitpick the maintenance side later as to what sorts of annotations we provide.

@jeromekelleher
Copy link
Member

We had decided a few weeks ago to avoid this complexity and just make some simple gene annotations based on a RateMap that we pass around between ourselves, so that we can get the simulations going though. I guess we can see how much work it is to go in either direction from here, though.

@andrewkern andrewkern force-pushed the annot_fixup branch 2 times, most recently from 7116fe6 to 03a1564 Compare June 17, 2021 18:13
def setup_module():
destination = pathlib.Path("_test_cache/zipfiles/")
def setUpModule():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conventions for this in pytest are setup_module()/teardown_module(). https://docs.pytest.org/en/latest/how-to/xunit_setup.html?highlight=setup_module

import stdpopsim
from stdpopsim import utils
import tests

import unittest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recently removed unittest imports so that we use pytest more extensively. Pytest uses raw asserts, test classes don't need to inherit from unittest.TestCase anymore, and warnings are filtered using a class/method decorator. This leads to clearer tests, usually.

@@ -130,15 +133,15 @@ def test_correct_url(self):
# The destination file will be missing.
with pytest.raises(FileNotFoundError):
an.download()
mocked_get.assert_called_once_with(an.zarr_url, filename=mock.ANY)
mocked_get.assert_called_once_with(an.intervals_url, filename=unittest.mock.ANY)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/unittest.mock/mock/

Comment on lines 153 to 157
@pytest.mark.xfail # HomSap annotation not currently available
class TestGetChromosomeAnnotations(tests.CacheReadingTest):
# @pytest.mark.xfail # HomSap annotation not currently available
class TestGetChromosomeAnnotations(unittest.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this really should inherit from tests.CacheReadingTest. (and unittest.TestCase isn't needed anymore)

return list(iter_merged(intervals, closed=closed))


def test_merged():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but the test function can be removed from here now.

Comment on lines 112 to 119
for an in spc.annotations:
GFF_URL = an.url
GFF_SHA256 = an.gff_sha256
CHROM_IDS = [chrom.id for chrom in spc.genome.chromosomes]
genome_version = os.path.basename(GFF_URL).split(".")[1]
logger.info(f"Downloading GFF file {spc.id}")
tmp_path = f"{spc.id}.tmp.gff.gz"
gff = get_gff_recarray(GFF_URL, GFF_SHA256)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the UPPERCASE variables aren't really needed here. Just replace the use of these variables with an.url and an.gff_sha256, etc.

@grahamgower
Copy link
Member

I'm not sure what's going on with the failing test tests/test_cli.py::TestDryRun::test_dry_run_quiet. That test checks that there's no output when the -q flag is used, so probably there's some warning or print statment leaking into the output. If you modify the test to print the output, I'm sure you'll narrow it down.

@codecov
Copy link

codecov bot commented Oct 19, 2021

Codecov Report

Merging #960 (52c2e59) into main (e1c1e72) will increase coverage by 0.86%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #960      +/-   ##
==========================================
+ Coverage   98.65%   99.51%   +0.86%     
==========================================
  Files          89       89              
  Lines        2899     2887      -12     
  Branches      348      346       -2     
==========================================
+ Hits         2860     2873      +13     
+ Misses         31        6      -25     
  Partials        8        8              
Impacted Files Coverage Δ
stdpopsim/annotations.py 95.23% <100.00%> (+31.60%) ⬆️
stdpopsim/catalog/HomSap/annotations.py 100.00% <100.00%> (ø)
stdpopsim/species.py 97.91% <0.00%> (+7.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e1c1e72...52c2e59. Read the comment docs.

@andrewkern
Copy link
Member Author

I'm not sure what's going on with the failing test tests/test_cli.py::TestDryRun::test_dry_run_quiet. That test checks that there's no output when the -q flag is used, so probably there's some warning or print statment leaking into the output. If you modify the test to print the output, I'm sure you'll narrow it down.

this was an annoying logger thing

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiousity, does this generalise beyond HomSap?

gff_sha256="313ad46bd4af78b45b9f5d8407bbcbd3f87f4be0747060e84b3b5eb931530ec1",
intervals_url=(
"https://stdpopsim.s3-us-west-2.amazonaws.com/"
"annotations/HomSap/HomSap.GRCh38.tar.gz"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the ensembl release number in this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup.

with pytest.raises(OSError):
an.download()

@pytest.mark.xfail # HomSap annotation not currently available
# @pytest.mark.xfail # HomSap annotation not currently available
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test now passes, the xfail decorator can be removed. Ditto for test below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -818,9 +818,10 @@ def test_dry_run_quiet(self):
filename = path / "output.trees"
cmd = (
f"{sys.executable} -m stdpopsim -q HomSap -D -L 1000 "
"-o {filename} 2"
f"-o {filename} 2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

)
subprocess.run(cmd, stderr=stderr, shell=True, check=True)
print(cmd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from debugging? I guess this can be removed?

@andrewkern
Copy link
Member Author

It won't generalize that well because the GFF files from different species have different annotations. After we merge this I'll go ahead and add Drosophila annotations-- that should be instructive

…ervals Graham had produced

passing tests, now returns intervals

pushed annotation maintenance tests to correct place

dammit i put the test in the wrong file

some churn getting this up to date:

working

clean up before rebase to kill bug?

found dry run bug

clean up edits from Graham
@andrewkern
Copy link
Member Author

gonna try to use the "Rebase and merge" button... will that do bad things?

@andrewkern
Copy link
Member Author

gonna go for it

@andrewkern andrewkern merged commit da3e351 into popsim-consortium:main Oct 19, 2021
@andrewkern andrewkern deleted the annot_fixup branch October 19, 2021 18:52
@andrewkern
Copy link
Member Author

ugh that didn't quite work out... looks like it didn't squash any of my commits... sorry team, i won't try that again

@petrelharp
Copy link
Contributor

FYI, there's a "squash and merge" button:
Screenshot from 2021-10-22 01-23-58

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants