global: support for genomic identifiers
* Adds SRA, BioProject, and BioSample identification.

* Adds Ensembl, UniProt, RefSeq, and GenBank/RefSeq genome assembly identification.

* Adds ENA BioProject format identification.
Alan Rubin authored and lnielsen committed Aug 17, 2018
1 parent f6420f9 commit fb0afb3
Showing 5 changed files with 234 additions and 12 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -42,6 +42,7 @@ htmlcov/
nosetests.xml
coverage.xml
*,cover
.pytest_cache

# Translations
*.mo
13 changes: 7 additions & 6 deletions AUTHORS.rst
@@ -14,9 +14,10 @@
Authors
=======

- Adrian Pawel Baran <adrian.pawel.baran@cern.ch>
- Alexander Ioannidis <a.ioannidis@cern.ch>
- Jiri Kuncar <jiri.kuncar@cern.ch>
- Lars Holm Nielsen <lars.holm.nielsen@cern.ch>
- Pedro Gaudencio <pmgaudencio@gmail.com>
- Tibor Simko <tibor.simko@cern.ch>
- Adrian Pawel Baran
- Alan Rubin
- Alexander Ioannidis
- Jiri Kuncar
- Lars Holm Nielsen
- Pedro Gaudencio
- Tibor Simko
7 changes: 4 additions & 3 deletions docs/index.rst
@@ -1,6 +1,7 @@
..
This file is part of IDUtils
Copyright (C) 2015 CERN.
Copyright (C) 2015-2018 CERN.
Copyright (C) 2018 Alan Rubin.
IDUtils is free software; you can redistribute it and/or modify
it under the terms of the Revised BSD License; see LICENSE file for
@@ -26,7 +27,7 @@ Features
- Generation of resolving links for persistent identifiers.
- Supported schemes: ISBN10, ISBN13, ISSN, ISTC, DOI, Handle, EAN8, EAN13, ISNI,
ORCID, ARK, PURL, LSID, URN, Bibcode, arXiv, PubMed ID, PubMed Central ID,
GND.
GND, SRA, BioProject, BioSample, Ensembl, UniProt, RefSeq, Genome Assembly.

Installation
============
@@ -42,7 +43,7 @@ API
===

.. automodule:: idutils
:members: is_isbn10, is_isbn13, is_isbn, is_issn, is_istc, is_doi, is_handle, is_ean8, is_ean13, is_ean, is_isni, is_orcid, is_purl, is_url, is_lsid, is_urn, is_ads, is_arxiv_post_2007, is_arxiv_pre_2007, is_arxiv, is_pmid, is_pmcid, is_gnd, detect_identifier_schemes, normalize_doi, normalize_handle, normalize_ads, normalize_orcid, normalize_gnd, normalize_pmid, normalize_arxiv, normalize_pid, to_url
:members: is_isbn10, is_isbn13, is_isbn, is_issn, is_istc, is_doi, is_handle, is_ean8, is_ean13, is_ean, is_isni, is_orcid, is_purl, is_url, is_lsid, is_urn, is_ads, is_arxiv_post_2007, is_arxiv_pre_2007, is_arxiv, is_pmid, is_pmcid, is_gnd, is_sra, is_bioproject, is_biosample, is_ensembl, is_uniprot, is_refseq, is_genome, detect_identifier_schemes, normalize_doi, normalize_handle, normalize_ads, normalize_orcid, normalize_gnd, normalize_pmid, normalize_arxiv, normalize_pid, to_url


.. include:: ../CHANGES.rst
194 changes: 193 additions & 1 deletion idutils/__init__.py
@@ -1,7 +1,8 @@
# -*- coding: utf-8 -*-
#
# This file is part of IDUtils
# Copyright (C) 2015, 2016 CERN.
# Copyright (C) 2015-2018 CERN.
# Copyright (C) 2018 Alan Rubin.
#
# IDUtils is free software; you can redistribute it and/or modify
# it under the terms of the Revised BSD License; see LICENSE file for
@@ -23,6 +24,123 @@

from .version import __version__

ENSEMBL_PREFIXES = (
"ENSPMA", # Petromyzon marinus (Lamprey)
"ENSNGA", # Nannospalax galili (Upper Galilee mountains blind mole rat)
"ENSOPR", # Ochotona princeps (Pika)
"ENSMNE", # Macaca nemestrina (Pig-tailed macaque)
"MGP_C57BL6NJ_", # Mus musculus (Mouse C57BL/6NJ)
"MGP_LPJ_", # Mus musculus (Mouse LP/J)
"FB", # Drosophila melanogaster (Fruitfly)
"ENSORL", # Oryzias latipes (Medaka)
"ENSONI", # Oreochromis niloticus (Tilapia)
"ENSOCU", # Oryctolagus cuniculus (Rabbit)
"ENSXET", # Xenopus tropicalis (Xenopus)
"ENSRRO", # Rhinopithecus roxellana (Golden snub-nosed monkey)
"ENSCAT", # Cercocebus atys (Sooty mangabey)
"ENSAME", # Ailuropoda melanoleuca (Panda)
"MGP_CASTEiJ_", # Mus musculus castaneus (Mouse CAST/EiJ)
"ENSCSAV", # Ciona savignyi
"ENSMAU", # Mesocricetus auratus (Golden Hamster)
"ENSFAL", # Ficedula albicollis (Flycatcher)
"ENSTRU", # Takifugu rubripes (Fugu)
"ENSPTR", # Pan troglodytes (Chimpanzee)
"ENSTTR", # Tursiops truncatus (Dolphin)
"ENSCJA", # Callithrix jacchus (Marmoset)
"ENSSAR", # Sorex araneus (Shrew)
"ENSVPA", # Vicugna pacos (Alpaca)
"ENSLAC", # Latimeria chalumnae (Coelacanth)
"ENSPVA", # Pteropus vampyrus (Megabat)
"ENSPAN", # Papio anubis (Olive baboon)
"ENSHGLF", # Heterocephalus glaber (Naked mole-rat female)
"MGP_PWKPhJ_", # Mus musculus musculus (Mouse PWK/PhJ)
"MGP_NZOHlLtJ_", # Mus musculus (Mouse NZO/HlLtJ)
"ENSCAF", # Canis lupus familiaris (Dog)
"MGP_AJ_", # Mus musculus (Mouse A/J)
"ENSMOD", # Monodelphis domestica (Opossum)
"ENSMGA", # Meleagris gallopavo (Turkey)
"ENSPCO", # Propithecus coquereli (Coquerel's sifaka)
"ENSFDA", # Fukomys damarensis (Damara mole rat)
"ENSBTA", # Bos taurus (Cow)
"ENSGAL", # Gallus gallus (Chicken)
"ENSLAF", # Loxodonta africana (Elephant)
"ENSGGO", # Gorilla gorilla gorilla (Gorilla)
"ENSCAP", # Cavia aperea (Brazilian guinea pig)
"ENSMMU", # Macaca mulatta (Macaque)
"ENSAPL", # Anas platyrhynchos (Duck)
"ENSCEL", # Caenorhabditis elegans (Caenorhabditis elegans)
"ENSMEU", # Notamacropus eugenii (Wallaby)
"ENSCGR", # Cricetulus griseus (Chinese hamster CriGri)
"ENSANA", # Aotus nancymaae (Ma's night monkey)
"ENSGMO", # Gadus morhua (Cod)
"ENSPEM", # Peromyscus maniculatus bairdii (Northern American deer mouse)
"MGP_C3HHeJ_", # Mus musculus (Mouse C3H/HeJ)
"ENSTGU", # Taeniopygia guttata (Zebra Finch)
"ENSSCE", # Saccharomyces cerevisiae (Saccharomyces cerevisiae)
"ENSOGA", # Otolemur garnettii (Bushbaby)
"ENSACA", # Anolis carolinensis (Anole lizard)
"ENSTSY", # Carlito syrichta (Tarsier)
"ENSTBE", # Tupaia belangeri (Tree Shrew)
"MGP_AKRJ_", # Mus musculus (Mouse AKR/J)
"ENSDAR", # Danio rerio (Zebrafish)
"ENSMUS", # Mus musculus (Mouse)
"ENSETE", # Echinops telfairi (Lesser hedgehog tenrec)
"ENSSBO", # Saimiri boliviensis boliviensis (Bolivian squirrel monkey)
"ENS", # Homo sapiens (Human)
"ENSCGR", # Cricetulus griseus (Chinese hamster CHOK1GS)
"ENSFCA", # Felis catus (Cat)
"MGP_BALBcJ_", # Mus musculus (Mouse BALB/cJ)
"MGP_PahariEiJ_", # Mus pahari (Shrew mouse)
"ENSCSA", # Chlorocebus sabaeus (Vervet-AGM)
"ENSCCA", # Cebus capucinus imitator (Capuchin)
"ENSOAR", # Ovis aries (Sheep)
"ENSCHI", # Capra hircus (Goat)
"ENSDOR", # Dipodomys ordii (Kangaroo rat)
"ENSCHO", # Choloepus hoffmanni (Sloth)
"ENSSHA", # Sarcophilus harrisii (Tasmanian devil)
"ENSMPU", # Mustela putorius furo (Ferret)
"ENSNLE", # Nomascus leucogenys (Gibbon)
"ENSXMA", # Xiphophorus maculatus (Platyfish)
"ENSSSC", # Sus scrofa (Pig)
"ENSEEU", # Erinaceus europaeus (Hedgehog)
"ENSPSI", # Pelodiscus sinensis (Chinese softshell turtle)
"MGP_DBA2J_", # Mus musculus (Mouse DBA/2J)
"ENSAMX", # Astyanax mexicanus (Cave fish)
"MGP_WSBEiJ_", # Mus musculus domesticus (Mouse WSB/EiJ)
"ENSJJA", # Jaculus jaculus (Lesser Egyptian jerboa)
"ENSCIN", # Ciona intestinalis
"ENSPPA", # Pan paniscus (Bonobo)
"MGP_SPRETEiJ_", # Mus spretus (Algerian mouse)
"ENSCAN", # Colobus angolensis palliatus (Angola colobus)
"MGP_NODShiLtJ_", # Mus musculus (Mouse NOD/ShiLtJ)
"ENSCLA", # Chinchilla lanigera (Long-tailed chinchilla)
"ENSCPO", # Cavia porcellus (Guinea Pig)
"ENSDNO", # Dasypus novemcinctus (Armadillo)
"ENSPFO", # Poecilia formosa (Amazon molly)
"ENSMIC", # Microcebus murinus (Mouse Lemur)
"MGP_FVBNJ_", # Mus musculus (Mouse FVB/NJ)
"MGP_CBAJ_", # Mus musculus (Mouse CBA/J)
"ENSSTO", # Ictidomys tridecemlineatus (Squirrel)
"ENSRNO", # Rattus norvegicus (Rat)
"ENSMOC", # Microtus ochrogaster (Prairie vole)
"ENSTNI", # Tetraodon nigroviridis (Tetraodon)
"ENSPPY", # Pongo abelii (Orangutan)
"ENSGAC", # Gasterosteus aculeatus (Stickleback)
"ENSLOC", # Lepisosteus oculatus (Spotted gar)
"ENSODE", # Octodon degus (Degu)
"ENSPCA", # Procavia capensis (Hyrax)
"ENSECA", # Equus caballus (Horse)
"ENSOAN", # Ornithorhynchus anatinus (Platypus)
"MGP_CAROLIEiJ_", # Mus caroli (Ryukyu mouse)
"ENSHGLM", # Heterocephalus glaber (Naked mole-rat male)
"MGP_129S1SvImJ_", # Mus musculus (Mouse 129S1/SvImJ)
"ENSRBI", # Rhinopithecus bieti (Black snub-nosed monkey)
"ENSMLU", # Myotis lucifugus (Microbat)
"ENSMLE", # Mandrillus leucophaeus (Drill)
"ENSMFA", # Macaca fascicularis (Crab-eating macaque)
)
"""List of species-specific prefixes for Ensembl accession numbers."""
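
As a rough illustration of how a prefix tuple like this can be turned into a single pattern, the sketch below joins a small subset of prefixes into an alternation, then requires a feature-type code and exactly 11 digits. `SAMPLE_PREFIXES`, `sample_regexp`, and `sample_is_ensembl` are hypothetical names used only for this example, and the three-prefix subset is an assumption chosen to keep it short.

```python
import re

# Hypothetical subset of the prefix tuple, for illustration only.
# The short "ENS" prefix is deliberately listed first.
SAMPLE_PREFIXES = ("ENS", "ENSMUS", "MGP_AJ_")

# Alternation of prefixes, then a feature-type code, then exactly 11 digits.
# Doubled braces survive str.format() as a literal {11} quantifier.
sample_regexp = re.compile(
    r"({prefixes})(E|FM|G|GT|P|R|T)\d{{11}}$".format(
        prefixes="|".join(SAMPLE_PREFIXES)))


def sample_is_ensembl(val):
    """Return True if val fits the sample Ensembl pattern."""
    return sample_regexp.match(val) is not None
```

Because the regex engine backtracks across alternatives, listing the short `ENS` prefix before `ENSMUS` does not stop a mouse transcript identifier such as `ENSMUST00000017290` from matching.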

doi_regexp = re.compile(
"(doi:\s*|(?:https?://)?(?:dx\.)?doi\.org/)?(10\.\d+(.\d+)*/.+)$",
flags=re.I
@@ -92,6 +210,30 @@

gnd_resolver_url = "http://d-nb.info/gnd/"

sra_regexp = re.compile(r"[SED]R[APRSXZ]\d+$")
"""Sequence Read Archive regular expression."""

bioproject_regexp = re.compile(r"PRJ(NA|EA|EB|DB)\d+$")
"""BioProject regular expression."""

biosample_regexp = re.compile(r"SAM(N|EA|D)\d+$")
"""BioSample regular expression."""

ensembl_regexp = re.compile(
    r"({prefixes})(E|FM|G|GT|P|R|T)\d{{11}}$".format(
        prefixes="|".join(ENSEMBL_PREFIXES)))
"""Ensembl regular expression."""

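# Both accession forms share one group so that the optional ".version"
# suffix and the "$" anchor apply to either alternative.
uniprot_regexp = re.compile(
    r"([A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}|"
    r"[OPQ][0-9][A-Z0-9]{3}[0-9])(\.\d+)?$")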
"""UniProt regular expression."""

refseq_regexp = re.compile(
    r"((AC|NC|NG|NT|NW|NM|NR|XM|XR|AP|NP|YP|XP|WP)_|"
    r"NZ_[A-Z]{4})\d+(\.\d+)?$")
"""RefSeq regular expression."""

genome_regexp = re.compile(r"GC[AF]_\d+\.\d+$")
"""GenBank or RefSeq genome assembly accession."""
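
The sketch below shows how patterns of this shape behave when used with `match()`: the call anchors at the start of the string, and the trailing `$` rejects extra characters at the end. The three patterns are copied from this commit (written as raw strings); the helper `matches` is a hypothetical name introduced only for this example.

```python
import re

# Patterns copied from the commit, written as raw strings.
sra = re.compile(r"[SED]R[APRSXZ]\d+$")
bioproject = re.compile(r"PRJ(NA|EA|EB|DB)\d+$")
genome = re.compile(r"GC[AF]_\d+\.\d+$")


def matches(pattern, val):
    """Return True if pattern.match() accepts val."""
    # match() only anchors at the start; the "$" in each pattern
    # supplies the anchor at the end.
    return pattern.match(val) is not None


print(matches(sra, "SRX3529244"))         # accepted
print(matches(sra, "SRX3529244-1"))       # rejected: trailing characters
print(matches(genome, "GCF_000001405"))   # rejected: version suffix required
```

One caveat worth knowing: in Python, `$` also matches just before a single trailing newline, so `"SRX3529244\n"` is accepted by these patterns. Callers that need a strict check could strip input first or use `fullmatch()` on Python 3.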


def _convert_x_to_10(x):
    """Convert char to int with X being converted to 10."""
@@ -333,6 +475,42 @@ def is_gnd(val):

    return gnd_regexp.match(val)


def is_sra(val):
    """Test if argument is an SRA accession."""
    return sra_regexp.match(val)


def is_bioproject(val):
    """Test if argument is a BioProject accession."""
    return bioproject_regexp.match(val)


def is_biosample(val):
    """Test if argument is a BioSample accession."""
    return biosample_regexp.match(val)


def is_ensembl(val):
    """Test if argument is an Ensembl accession."""
    return ensembl_regexp.match(val)


def is_uniprot(val):
    """Test if argument is a UniProt accession."""
    return uniprot_regexp.match(val)


def is_refseq(val):
    """Test if argument is a RefSeq accession."""
    return refseq_regexp.match(val)


def is_genome(val):
    """Test if argument is a GenBank or RefSeq genome assembly accession."""
    return genome_regexp.match(val)
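
Note that test functions written this way return a match object or `None` rather than a strict boolean, so callers rely on truthiness. The sketch below illustrates this with two of the new tests and a reduced scheme table in the same `(name, test)` shape used in this module. `SAMPLE_SCHEMES` and `detect_schemes` are hypothetical names; this is a simplified stand-in, not the library's `detect_identifier_schemes`, which may apply additional rules.

```python
import re

sra_regexp = re.compile(r"[SED]R[APRSXZ]\d+$")
genome_regexp = re.compile(r"GC[AF]_\d+\.\d+$")


def is_sra(val):
    """Return a match object for an SRA accession, else None."""
    return sra_regexp.match(val)


def is_genome(val):
    """Return a match object for a genome assembly accession, else None."""
    return genome_regexp.match(val)


# Hypothetical, reduced scheme table for illustration only.
SAMPLE_SCHEMES = [
    ("sra", is_sra),
    ("genome", is_genome),
]


def detect_schemes(val):
    """Return the names of all sample schemes whose test accepts val."""
    # A match object is truthy and None is falsy, so the test result
    # can be used directly as the filter condition.
    return [name for name, test in SAMPLE_SCHEMES if test(val)]
```

Wrapping a result in `bool()` gives a strict `True`/`False` where one is needed, for example when serialising results.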


PID_SCHEMES = [
('doi', is_doi),
('ark', is_ark),
@@ -353,6 +531,13 @@ def is_gnd(val):
('gnd', is_gnd),
('url', is_url),
('pmid', is_pmid),
('sra', is_sra),
('bioproject', is_bioproject),
('biosample', is_biosample),
('ensembl', is_ensembl),
('uniprot', is_uniprot),
('refseq', is_refseq),
('genome', is_genome),
]
"""Definition of scheme name and associated test function.
@@ -511,6 +696,13 @@ def normalize_pid(val, scheme):
'pmcid': u'{scheme}://www.ncbi.nlm.nih.gov/pmc/{pid}',
'gnd': u'http://d-nb.info/gnd/{pid}',
'urn': u'{scheme}://nbn-resolving.org/{pid}',
'sra': u'{scheme}://www.ebi.ac.uk/ena/data/view/{pid}',
'bioproject': u'{scheme}://www.ebi.ac.uk/ena/data/view/{pid}',
'biosample': u'{scheme}://www.ebi.ac.uk/ena/data/view/{pid}',
'ensembl': u'{scheme}://www.ensembl.org/id/{pid}',
'uniprot': u'{scheme}://purl.uniprot.org/uniprot/{pid}',
'refseq': u'{scheme}://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val={pid}',
'genome': u'{scheme}://www.ncbi.nlm.nih.gov/assembly/{pid}',
}
"""URL generation configuration for the supported PID providers."""
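
To make the template format concrete, the sketch below fills two of the new templates with `str.format()`. In these templates `{scheme}` stands for the URL scheme (`http` or `https`), not the identifier scheme, and `{pid}` is the identifier itself. `SAMPLE_TEMPLATES` and `build_url` are hypothetical names for this example; returning an empty string for an unknown scheme is an assumption, not a statement about how the library's `to_url` behaves.

```python
# Two templates copied from the commit; the dictionary name is hypothetical.
SAMPLE_TEMPLATES = {
    "ensembl": u"{scheme}://www.ensembl.org/id/{pid}",
    "uniprot": u"{scheme}://purl.uniprot.org/uniprot/{pid}",
}


def build_url(pid, pid_scheme, url_scheme="http"):
    """Build a resolving URL for pid, or return '' if the scheme is unknown."""
    template = SAMPLE_TEMPLATES.get(pid_scheme)
    if template is None:
        # Assumption: unknown schemes yield an empty string.
        return u""
    return template.format(scheme=url_scheme, pid=pid)


print(build_url("ENSG00000012048", "ensembl"))
print(build_url("P02833", "uniprot", url_scheme="https"))
```

Keeping the URL scheme as a placeholder lets the same table serve both `http` and `https` links without duplicating entries.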

31 changes: 29 additions & 2 deletions tests/test_idutils.py
@@ -1,7 +1,8 @@
# -*- coding: utf-8 -*-
#
# This file is part of IDUtils
# Copyright (C) 2015, 2016 CERN.
# Copyright (C) 2015-2018 CERN.
# Copyright (C) 2018 Alan Rubin.
#
# IDUtils is free software; you can redistribute it and/or modify
# it under the terms of the Revised BSD License; see LICENSE file for
@@ -136,7 +137,33 @@
'http://d-nb.info/gnd/4079154-3'),
('4079154-3', ['gnd', ], 'gnd:4079154-3',
'http://d-nb.info/gnd/4079154-3'),

('SRX3529244', ['sra', ], '',
'http://www.ebi.ac.uk/ena/data/view/SRX3529244'),
('SRR6437777', ['sra', ], '',
'http://www.ebi.ac.uk/ena/data/view/SRR6437777'),
('PRJNA224116', ['bioproject', ], '',
'http://www.ebi.ac.uk/ena/data/view/PRJNA224116'),
('SAMN08289383', ['biosample', ], '',
'http://www.ebi.ac.uk/ena/data/view/SAMN08289383'),
('ENSG00000012048', ['ensembl', ], '',
'http://www.ensembl.org/id/ENSG00000012048'),
('ENSMUST00000017290', ['ensembl', ], '',
'http://www.ensembl.org/id/ENSMUST00000017290'),
('P02833', ['uniprot', ], '',
'http://purl.uniprot.org/uniprot/P02833'),
('Q9GYV0', ['uniprot', ], '',
'http://purl.uniprot.org/uniprot/Q9GYV0'),
('NZ_JXSL01000036.1', ['refseq', ], '',
'http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val='
'NZ_JXSL01000036.1'),
('NM_206454', ['refseq', ], '',
'http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_206454'),
('XM_002113800.1', ['refseq', ], '',
'http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=XM_002113800.1'),
('GCA_000002275.2', ['genome', ], '',
'http://www.ncbi.nlm.nih.gov/assembly/GCA_000002275.2'),
('GCF_000001405.38', ['genome', ], '',
'http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38'),
]

