Basic infrastructure for pulling genome data from Ensembl #563

jeromekelleher · 2020-06-15T13:23:42Z

This seems a like a reasonable place to start with integrating with upstream genome data providers.

jeromekelleher · 2020-06-15T13:24:05Z

Not sure how to test this, btw.

grahamgower · 2020-06-17T15:37:16Z

I guess you could test the taxon_id via the REST API? E.g.: query ext=f"/info/assembly/{taxon_id}", get the assembly_accession field, then query ext=f"/info/genomes/accession/{assembly_accession}" and finally translate the name field into a 3+3 to compare with the species id.

jeromekelleher · 2020-06-17T16:25:55Z

There's some basic infrastructure here for pulling the genome/assembly data for a species from Ensembl here. The idea is that we use the REST API to get information on the assembly, and then write this info into a Python file. We can then easily read this in, and use it to populate the objects.

I decided that writing a simple Python file was better than trying to write the actual object definitions here because (a) it's a bit faffy writing out Python source code; and (b) we're still combining information from multiple sources, as the mutation/recombination rates aren't going to be available from Ensembl. We could equally use a JSON file, it probably doesn't matter. I just thought that a Black formatting python file like this would be an unequivocal and diff-friendly way to stating the information.

If we like this approach, then I think what we should do is to make each species module in the catalog a package directory, where the __init__.py brings together all the information. For now, the only extra information in the directory will be the genome_data.py file, but we might want others later.

There's a little bit more infrastructure needed to go over all the species automatically, but it's straightforward. Do we like the general layout, before I go any further?

Pinging @grahamgower @andrewkern @ndukler

jeromekelleher · 2020-06-17T16:27:05Z

Oh yeah, Ensembl doesn't use the "chr" prefix and we currently do. Hooray.

ndukler · 2020-06-17T16:42:11Z

Heh. I use a package in R called GenomeInfoDb that can detect chromosome name formats and convert between them to avoid this sort of thing when I work in R. Either we can switch everything to ensembl nomenclature or we can implement a similar functionality for our package. They have the mapping files here and we could probably re-implement their functionality in stdpopsim. That would also let people specify chromsomes from the command line or API however they want and we could convert them all to a single format internally.

maintenance/__init__.py

grahamgower · 2020-06-17T16:57:36Z

Very nice @jeromekelleher, I like it a lot!

ndukler · 2020-06-17T17:04:25Z

Very nice @jeromekelleher, I like it a lot!

Seconded!

ndukler · 2020-06-17T17:39:04Z

Is it worth me re-implementing the chromosome renaming functionality @jeromekelleher ? Or should we just agree that everyone always has to use ensembl nomenclature? Or just set up a specific ensembl to ucsc conversion when the genome data is imported and force everywhere else to be UCSC?

stdpopsim/catalog/HomSap_genome_data.py

jeromekelleher · 2020-06-18T06:13:50Z

I've pushed through the changes and it's mostly working pretty well. There's some details with some assemblies like PonAbe that we need to make decisions on, but there's no big issues.

The only real roadblock is the chromosome naming conventions. We either need to

(a) find some mapping from the chromosome names to their synonyms (there's surely a web service to do this, right?) and build synonyms into the Chromosome object
(b) Do it ourselves based on the data @ndukler pointed out above; or
(c) Standardise on just the Ensembl IDs, and fix the breakage in our tests (and break existing code).

I think (a) would be the ideal solution. Anyone like to do a bit of digging to see if we can find the required data? I'd be surprised if it wasn't in Ensembl somewhere.

grahamgower

This is looking very useful. I think using the karyotype field of the /info/assembly/ endpoint should clean up the Pongo chromosomes.

grahamgower · 2020-06-18T18:41:40Z

maintenance/ensembl.py

+        data["chromosomes"] = {
+            name: chromosomes[name] for name in sorted(chromosomes.keys())}


I think the "karyotype" output field gives a more natural ordering than sorting.

>>> sorted(["1", "2", "10"]) ['1', '10', '2']

Yes, excellent idea. I'll change that.

grahamgower · 2020-06-18T18:42:27Z

stdpopsim/catalog/PonAbe/__init__.py

+    # JK: I've no idea what this chr is. Is this the correct thing
+    # to do?
+    "Un": 0,


Maybe "Un" means unplaced?

stdpopsim/species.py

ndukler · 2020-06-18T22:59:45Z

(a) find some mapping from the chromosome names to their synonyms (there's surely a web service to do this, right?) and build synonyms into the Chromosome object

Do you want me to prototype this? I looked around for a web service but there genuinely does not seem to be one. There is something in Galaxy here but it seems pretty limited. There are also some lists here. I can just grab the tables from one of the places I mentioned then set up a system for automatically attaching the synonyms to each chromosome when they are constructed. Then get_chromosome() can search the synonyms as well.

jeromekelleher · 2020-06-19T07:51:21Z

That sounds good @ndukler, but I worry a bit about how extensive the species data will be and potential conflicts with the Ensembl data model. I'll ping a few more people to see if anyone knows, so maybe don't put too much effort into it.

jeromekelleher · 2020-06-19T20:09:35Z

Some helpful comments on twitter here

jeromekelleher · 2020-06-23T20:10:16Z

I've update this with your ideas @ndukler - it's still pretty messy tbh. I don't really know where we should get the data from - perhaps the best thing is to git submodule the R code? Or, maybe we should pull in the data using the R package by running R from Python? Then we can checkout a particular release tag from GenomeInfoDb.

I had a go at getting this information from UCSC via the database API too, but everything so bloody inconsistent :(

codecov · 2020-06-24T18:39:01Z

Codecov Report

Merging #563 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #563   +/-   ##
=======================================
  Coverage   99.64%   99.64%           
=======================================
  Files          23       29    +6     
  Lines        1973     1997   +24     
  Branches      203      206    +3     
=======================================
+ Hits         1966     1990   +24     
  Misses          3        3           
  Partials        4        4

Impacted Files	Coverage Δ
stdpopsim/catalog/AraTha/__init__.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/AraTha/genome_data.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/CanFam/__init__.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/CanFam/genome_data.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/DroMel/__init__.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/DroMel/genome_data.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/EscCol/__init__.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/EscCol/genome_data.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/HomSap/__init__.py	`100.00% <100.00%> (ø)`
stdpopsim/catalog/HomSap/genome_data.py	`100.00% <100.00%> (ø)`
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfe1c6a...97e6efa. Read the comment docs.

jeromekelleher · 2020-06-24T18:40:44Z

This is nearly done, I think, so could use some review from those in the know. We can get some synonym information from Ensembl. It's not that comprehensive, but it's good enough - we can just say we only allow Ensembl IDs for the things we don't have (like DroMel, weirdly). I fixed the the test cases that broke because of this.

The only thing that seriously broken now is the recombination maps. These have UCSC IDs in them, which is fun. Hmm, I should be able to fix this.. wait on.

jeromekelleher · 2020-06-24T19:21:42Z

OK! Recombination maps are fixed, so I think this is ready for review and final cleanups now.

Pinging @andrewkern, @grahamgower, @ndukler, @apragsdale

grahamgower

Looks great @jeromekelleher. The only blocker I see here is that DroMel needs a _recombination_rate_data dict like the other species, to set the recombination rate to zero for Y and Mt.

Also, how do you feel about having a consistent synonym for Mt across all species?

stdpopsim/catalog/DroMel/genome_data.py

stdpopsim/catalog/AraTha/__init__.py

stdpopsim/catalog/DroMel/__init__.py

stdpopsim/genomes.py

stdpopsim/catalog/AraTha/genome_data.py

jeromekelleher · 2020-07-07T17:15:46Z

Notes from the call 07/07

Let's not worry about getting all the extra synoyms (unless Ensembl adds them)
We can information on sex chrs as a follow up, once this is completed.

reedacartwright · 2020-07-07T21:07:41Z

What information do we actually need for a sex chromosome? In my mutation calling software, what is important is to know the expected ploidy of an individual at a locus 0, 1, and 2 and how it is inherited. Chromosomes X, Y, etc. are used as shortcuts for determining the polidy of males and females and how it can be transmitted to offspring. This is then used to prune a family-history graph into the proper transmission structure.

In that sense, stdpopsim would need a low level API that specifies the polidy and transmission rules for chromosomes/fragments in species that have defined sexes. On top of that would be user-friendly labels like Autosomal, X, Y, Z, etc. And on top of that would be annotating a genome based on these labels.

jeromekelleher · 2020-07-08T08:47:44Z

Good questions @reedacartwright! I think how we actually simulate sex chrs is a big question. All we're interested in for now, I think, is to basically record the information that something is not an autosome, so that we can emit a warning.

andrewkern · 2020-07-16T15:42:48Z

guys i've spent a bit of time playing with this code this AM. first off @jeromekelleher this is awesome.

second, i think we should merge this as is and then fine tune things as we try to maintain it. undoubtedly we will come on myriad issues trying to align ourselves with ensembl (e.g. the synonym stuff), so i vote we merge this and give it a go while our user base is still small.

This commit changes the basic data associated with species from being locally curated information to being derived from Ensembl.

jeromekelleher · 2020-07-17T12:56:59Z

I've updated this with the bug-fixes, and opened issues to track the outstanding stuff raised above. In the interest of making progress I suggest we merge this much.

@grahamgower, would you mind taking a last look and merging if you're happy please?

grahamgower · 2020-07-17T13:30:06Z

Looks good @jeromekelleher! Merging.

jeromekelleher force-pushed the emsembl-glue branch 2 times, most recently from 26414c7 to d3ae40c Compare June 17, 2020 16:18

jeromekelleher changed the title ~~Taxon_id for species.~~ Basic infrastructure for pulling genome data from Ensembl Jun 17, 2020

grahamgower reviewed Jun 17, 2020

View reviewed changes

maintenance/__init__.py Outdated Show resolved Hide resolved

andrewkern reviewed Jun 17, 2020

View reviewed changes

stdpopsim/catalog/HomSap_genome_data.py Outdated Show resolved Hide resolved

grahamgower reviewed Jun 18, 2020

View reviewed changes

grahamgower requested changes Jun 24, 2020

View reviewed changes

stdpopsim/catalog/DroMel/genome_data.py Show resolved Hide resolved

stdpopsim/catalog/AraTha/__init__.py Outdated Show resolved Hide resolved

stdpopsim/catalog/DroMel/__init__.py Outdated Show resolved Hide resolved

ndukler reviewed Jun 25, 2020

View reviewed changes

stdpopsim/genomes.py Show resolved Hide resolved

ndukler reviewed Jun 25, 2020

View reviewed changes

stdpopsim/catalog/AraTha/genome_data.py Show resolved Hide resolved

andrewkern mentioned this pull request Jul 16, 2020

Annotation WIP #560

Merged

This was referenced Jul 17, 2020

Allow case-insensitive matching of chromsome IDs #569

Closed

Add default synonyms for chromosome names #570

Closed

Use genome data from Ensembl.

97e6efa

This commit changes the basic data associated with species from being locally curated information to being derived from Ensembl.

jeromekelleher force-pushed the emsembl-glue branch from 5483631 to 97e6efa Compare July 17, 2020 12:55

grahamgower merged commit e148606 into popsim-consortium:master Jul 17, 2020

jeromekelleher deleted the emsembl-glue branch July 17, 2020 14:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Basic infrastructure for pulling genome data from Ensembl #563

Basic infrastructure for pulling genome data from Ensembl #563

jeromekelleher commented Jun 15, 2020

jeromekelleher commented Jun 15, 2020

grahamgower commented Jun 17, 2020

jeromekelleher commented Jun 17, 2020

jeromekelleher commented Jun 17, 2020

ndukler commented Jun 17, 2020 •

edited

Loading

grahamgower commented Jun 17, 2020

ndukler commented Jun 17, 2020

ndukler commented Jun 17, 2020

jeromekelleher commented Jun 18, 2020

grahamgower left a comment

grahamgower Jun 18, 2020

jeromekelleher Jun 18, 2020

grahamgower Jun 18, 2020

ndukler commented Jun 18, 2020 •

edited

Loading

jeromekelleher commented Jun 19, 2020

jeromekelleher commented Jun 19, 2020

jeromekelleher commented Jun 23, 2020

codecov bot commented Jun 24, 2020 •

edited

Loading

jeromekelleher commented Jun 24, 2020 •

edited

Loading

jeromekelleher commented Jun 24, 2020

grahamgower left a comment

jeromekelleher commented Jul 7, 2020

reedacartwright commented Jul 7, 2020

jeromekelleher commented Jul 8, 2020

andrewkern commented Jul 16, 2020

jeromekelleher commented Jul 17, 2020

grahamgower commented Jul 17, 2020

		data["chromosomes"] = {
		name: chromosomes[name] for name in sorted(chromosomes.keys())}

Basic infrastructure for pulling genome data from Ensembl #563

Basic infrastructure for pulling genome data from Ensembl #563

Conversation

jeromekelleher commented Jun 15, 2020

jeromekelleher commented Jun 15, 2020

grahamgower commented Jun 17, 2020

jeromekelleher commented Jun 17, 2020

jeromekelleher commented Jun 17, 2020

ndukler commented Jun 17, 2020 • edited Loading

grahamgower commented Jun 17, 2020

ndukler commented Jun 17, 2020

ndukler commented Jun 17, 2020

jeromekelleher commented Jun 18, 2020

grahamgower left a comment

Choose a reason for hiding this comment

grahamgower Jun 18, 2020

Choose a reason for hiding this comment

jeromekelleher Jun 18, 2020

Choose a reason for hiding this comment

grahamgower Jun 18, 2020

Choose a reason for hiding this comment

ndukler commented Jun 18, 2020 • edited Loading

jeromekelleher commented Jun 19, 2020

jeromekelleher commented Jun 19, 2020

jeromekelleher commented Jun 23, 2020

codecov bot commented Jun 24, 2020 • edited Loading

Codecov Report

jeromekelleher commented Jun 24, 2020 • edited Loading

jeromekelleher commented Jun 24, 2020

grahamgower left a comment

Choose a reason for hiding this comment

jeromekelleher commented Jul 7, 2020

reedacartwright commented Jul 7, 2020

jeromekelleher commented Jul 8, 2020

andrewkern commented Jul 16, 2020

jeromekelleher commented Jul 17, 2020

grahamgower commented Jul 17, 2020

ndukler commented Jun 17, 2020 •

edited

Loading

ndukler commented Jun 18, 2020 •

edited

Loading

codecov bot commented Jun 24, 2020 •

edited

Loading

jeromekelleher commented Jun 24, 2020 •

edited

Loading