Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic infrastructure for pulling genome data from Ensembl #563

Merged
merged 1 commit into from
Jul 17, 2020

Conversation

jeromekelleher
Copy link
Member

This seems a like a reasonable place to start with integrating with upstream genome data providers.

@jeromekelleher
Copy link
Member Author

Not sure how to test this, btw.

@grahamgower
Copy link
Member

I guess you could test the taxon_id via the REST API? E.g.: query ext=f"/info/assembly/{taxon_id}", get the assembly_accession field, then query ext=f"/info/genomes/accession/{assembly_accession}" and finally translate the name field into a 3+3 to compare with the species id.

@jeromekelleher jeromekelleher force-pushed the emsembl-glue branch 2 times, most recently from 26414c7 to d3ae40c Compare June 17, 2020 16:18
@jeromekelleher jeromekelleher changed the title Taxon_id for species. Basic infrastructure for pulling genome data from Ensembl Jun 17, 2020
@jeromekelleher
Copy link
Member Author

There's some basic infrastructure here for pulling the genome/assembly data for a species from Ensembl here. The idea is that we use the REST API to get information on the assembly, and then write this info into a Python file. We can then easily read this in, and use it to populate the objects.

I decided that writing a simple Python file was better than trying to write the actual object definitions here because (a) it's a bit faffy writing out Python source code; and (b) we're still combining information from multiple sources, as the mutation/recombination rates aren't going to be available from Ensembl. We could equally use a JSON file, it probably doesn't matter. I just thought that a Black formatting python file like this would be an unequivocal and diff-friendly way to stating the information.

If we like this approach, then I think what we should do is to make each species module in the catalog a package directory, where the __init__.py brings together all the information. For now, the only extra information in the directory will be the genome_data.py file, but we might want others later.

There's a little bit more infrastructure needed to go over all the species automatically, but it's straightforward. Do we like the general layout, before I go any further?

Pinging @grahamgower @andrewkern @ndukler

@jeromekelleher
Copy link
Member Author

Oh yeah, Ensembl doesn't use the "chr" prefix and we currently do. Hooray.

@ndukler
Copy link
Contributor

ndukler commented Jun 17, 2020

Heh. I use a package in R called GenomeInfoDb that can detect chromosome name formats and convert between them to avoid this sort of thing when I work in R. Either we can switch everything to ensembl nomenclature or we can implement a similar functionality for our package. They have the mapping files here and we could probably re-implement their functionality in stdpopsim. That would also let people specify chromsomes from the command line or API however they want and we could convert them all to a single format internally.

@grahamgower
Copy link
Member

Very nice @jeromekelleher, I like it a lot!

@ndukler
Copy link
Contributor

ndukler commented Jun 17, 2020

Very nice @jeromekelleher, I like it a lot!

Seconded!

@ndukler
Copy link
Contributor

ndukler commented Jun 17, 2020

Is it worth me re-implementing the chromosome renaming functionality @jeromekelleher ? Or should we just agree that everyone always has to use ensembl nomenclature? Or just set up a specific ensembl to ucsc conversion when the genome data is imported and force everywhere else to be UCSC?

@jeromekelleher
Copy link
Member Author

I've pushed through the changes and it's mostly working pretty well. There's some details with some assemblies like PonAbe that we need to make decisions on, but there's no big issues.

The only real roadblock is the chromosome naming conventions. We either need to

  • (a) find some mapping from the chromosome names to their synonyms (there's surely a web service to do this, right?) and build synonyms into the Chromosome object
  • (b) Do it ourselves based on the data @ndukler pointed out above; or
  • (c) Standardise on just the Ensembl IDs, and fix the breakage in our tests (and break existing code).

I think (a) would be the ideal solution. Anyone like to do a bit of digging to see if we can find the required data? I'd be surprised if it wasn't in Ensembl somewhere.

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very useful. I think using the karyotype field of the /info/assembly/ endpoint should clean up the Pongo chromosomes.

Comment on lines 97 to 98
data["chromosomes"] = {
name: chromosomes[name] for name in sorted(chromosomes.keys())}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "karyotype" output field gives a more natural ordering than sorting.

>>> sorted(["1", "2", "10"])
['1', '10', '2']

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, excellent idea. I'll change that.

Comment on lines 50 to 52
# JK: I've no idea what this chr is. Is this the correct thing
# to do?
"Un": 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "Un" means unplaced?

stdpopsim/species.py Show resolved Hide resolved
@ndukler
Copy link
Contributor

ndukler commented Jun 18, 2020

  • (a) find some mapping from the chromosome names to their synonyms (there's surely a web service to do this, right?) and build synonyms into the Chromosome object

Do you want me to prototype this? I looked around for a web service but there genuinely does not seem to be one. There is something in Galaxy here but it seems pretty limited. There are also some lists here. I can just grab the tables from one of the places I mentioned then set up a system for automatically attaching the synonyms to each chromosome when they are constructed. Then get_chromosome() can search the synonyms as well.

@jeromekelleher
Copy link
Member Author

That sounds good @ndukler, but I worry a bit about how extensive the species data will be and potential conflicts with the Ensembl data model. I'll ping a few more people to see if anyone knows, so maybe don't put too much effort into it.

@jeromekelleher
Copy link
Member Author

Some helpful comments on twitter here

@jeromekelleher
Copy link
Member Author

I've update this with your ideas @ndukler - it's still pretty messy tbh. I don't really know where we should get the data from - perhaps the best thing is to git submodule the R code? Or, maybe we should pull in the data using the R package by running R from Python? Then we can checkout a particular release tag from GenomeInfoDb.

I had a go at getting this information from UCSC via the database API too, but everything so bloody inconsistent :(

@codecov
Copy link

codecov bot commented Jun 24, 2020

Codecov Report

Merging #563 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #563   +/-   ##
=======================================
  Coverage   99.64%   99.64%           
=======================================
  Files          23       29    +6     
  Lines        1973     1997   +24     
  Branches      203      206    +3     
=======================================
+ Hits         1966     1990   +24     
  Misses          3        3           
  Partials        4        4           
Impacted Files Coverage Δ
stdpopsim/catalog/AraTha/__init__.py 100.00% <100.00%> (ø)
stdpopsim/catalog/AraTha/genome_data.py 100.00% <100.00%> (ø)
stdpopsim/catalog/CanFam/__init__.py 100.00% <100.00%> (ø)
stdpopsim/catalog/CanFam/genome_data.py 100.00% <100.00%> (ø)
stdpopsim/catalog/DroMel/__init__.py 100.00% <100.00%> (ø)
stdpopsim/catalog/DroMel/genome_data.py 100.00% <100.00%> (ø)
stdpopsim/catalog/EscCol/__init__.py 100.00% <100.00%> (ø)
stdpopsim/catalog/EscCol/genome_data.py 100.00% <100.00%> (ø)
stdpopsim/catalog/HomSap/__init__.py 100.00% <100.00%> (ø)
stdpopsim/catalog/HomSap/genome_data.py 100.00% <100.00%> (ø)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfe1c6a...97e6efa. Read the comment docs.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jun 24, 2020

This is nearly done, I think, so could use some review from those in the know. We can get some synonym information from Ensembl. It's not that comprehensive, but it's good enough - we can just say we only allow Ensembl IDs for the things we don't have (like DroMel, weirdly). I fixed the the test cases that broke because of this.

The only thing that seriously broken now is the recombination maps. These have UCSC IDs in them, which is fun. Hmm, I should be able to fix this.. wait on.

@jeromekelleher
Copy link
Member Author

OK! Recombination maps are fixed, so I think this is ready for review and final cleanups now.

Pinging @andrewkern, @grahamgower, @ndukler, @apragsdale

Copy link
Member

@grahamgower grahamgower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @jeromekelleher. The only blocker I see here is that DroMel needs a _recombination_rate_data dict like the other species, to set the recombination rate to zero for Y and Mt.

Also, how do you feel about having a consistent synonym for Mt across all species?

stdpopsim/catalog/DroMel/genome_data.py Show resolved Hide resolved
stdpopsim/catalog/AraTha/__init__.py Outdated Show resolved Hide resolved
stdpopsim/catalog/DroMel/__init__.py Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member Author

Notes from the call 07/07

  • Let's not worry about getting all the extra synoyms (unless Ensembl adds them)
  • We can information on sex chrs as a follow up, once this is completed.

@reedacartwright
Copy link

What information do we actually need for a sex chromosome? In my mutation calling software, what is important is to know the expected ploidy of an individual at a locus 0, 1, and 2 and how it is inherited. Chromosomes X, Y, etc. are used as shortcuts for determining the polidy of males and females and how it can be transmitted to offspring. This is then used to prune a family-history graph into the proper transmission structure.

In that sense, stdpopsim would need a low level API that specifies the polidy and transmission rules for chromosomes/fragments in species that have defined sexes. On top of that would be user-friendly labels like Autosomal, X, Y, Z, etc. And on top of that would be annotating a genome based on these labels.

@jeromekelleher
Copy link
Member Author

Good questions @reedacartwright! I think how we actually simulate sex chrs is a big question. All we're interested in for now, I think, is to basically record the information that something is not an autosome, so that we can emit a warning.

@andrewkern
Copy link
Member

guys i've spent a bit of time playing with this code this AM. first off @jeromekelleher this is awesome.

second, i think we should merge this as is and then fine tune things as we try to maintain it. undoubtedly we will come on myriad issues trying to align ourselves with ensembl (e.g. the synonym stuff), so i vote we merge this and give it a go while our user base is still small.

@andrewkern andrewkern mentioned this pull request Jul 16, 2020
This commit changes the basic data associated with species from being
locally curated information to being derived from Ensembl.
@jeromekelleher
Copy link
Member Author

I've updated this with the bug-fixes, and opened issues to track the outstanding stuff raised above. In the interest of making progress I suggest we merge this much.

@grahamgower, would you mind taking a last look and merging if you're happy please?

@grahamgower
Copy link
Member

Looks good @jeromekelleher! Merging.

@grahamgower grahamgower merged commit e148606 into popsim-consortium:master Jul 17, 2020
@jeromekelleher jeromekelleher deleted the emsembl-glue branch July 17, 2020 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants