We will start by creating a support function to perform a web request:

In [1]:
import requests
ensembl_server = 'http://rest.ensembl.org'

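def do_request(server, service, *args, **kwargs):
    # Positional arguments are appended to the URL as extra path segments
    url_params = ''
    for a in args:
        if a is not None:
            url_params += '/' + a
    # Keyword arguments are sent as query-string parameters
    req = requests.get('%s/%s%s' % (server, service, url_params),
                       params=kwargs,
                       headers={'Content-Type': 'application/json'})

    # Raise an exception on HTTP error responses (4xx/5xx)
    req.raise_for_status()
    return req.json()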

We start by importing the *requests* library and specifying the root URL. Then, we create a
simple function that takes the service to be called, appends any positional arguments as path
segments, and generates a complete URL. It also passes optional keyword arguments as query
parameters and sets the `Content-Type` header to **JSON** (so that the server answers in JSON by
default). It returns the decoded JSON response and raises an exception if the request fails.
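
To make the URL construction easier to inspect, here is a minimal sketch that mirrors the path-building logic of `do_request` without sending anything over the network. The `build_url` helper is hypothetical (it is not part of the recipe), and it uses the standard library's `urlencode`, whose query-string encoding may differ in detail from what *requests* produces.

```python
from urllib.parse import urlencode


def build_url(server, service, *args, **kwargs):
    """Return the URL that a do_request-style call would target (hypothetical helper)."""
    # Positional arguments become extra path segments; None values are skipped.
    path = ''.join('/' + a for a in args if a is not None)
    url = '%s/%s%s' % (server, service, path)
    # Keyword arguments become the query string.
    if kwargs:
        url += '?' + urlencode(kwargs)
    return url


print(build_url('http://rest.ensembl.org', 'info/species'))
print(build_url('http://rest.ensembl.org', 'lookup/symbol', 'homo_sapiens', 'LCT'))
```

Printing the URL before issuing a request can help when a call fails and you want to check which path was actually built.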

Then, we will check all the available species on the server.

In [2]:
answer = do_request(ensembl_server, 'info/species')
for i, sp in enumerate(answer['species']):
    print(i, sp['name'])

0 aotus_nancymaae
1 homo_sapiens
2 ursus_maritimus
3 mus_musculus_c3hhej
4 gopherus_agassizii
5 zalophus_californianus
6 cavia_aperea
7 xenopus_tropicalis
8 gallus_gallus
9 chrysemys_picta_bellii
10 dasypus_novemcinctus
11 sinocyclocheilus_grahami
12 takifugu_rubripes
13 bison_bison_bison
14 oryctolagus_cuniculus
15 pogona_vitticeps
16 canis_lupus_familiarisbasenji
17 petromyzon_marinus
18 oreochromis_aureus
19 mus_musculus_akrj
20 cebus_imitator
21 strigops_habroptila
22 cyprinus_carpio_huanghe
23 procavia_capensis
24 paramormyrops_kingsleyae
25 ciona_savignyi
26 gouania_willdenowi
27 zonotrichia_albicollis
28 sus_scrofa_rongchang
29 salmo_trutta
30 betta_splendens
31 mus_musculus_aj
32 otolemur_garnettii
33 cynoglossus_semilaevis
34 poecilia_formosa
35 mus_pahari
36 chinchilla_lanigera
37 cyprinus_carpio_germanmirror
38 sus_scrofa
39 esox_lucius
40 ficedula_albicollis
41 pelusios_castaneus
42 apteryx_rowi
43 sinocyclocheilus_anshuiensis
44 pan_troglodytes
45 salvator_merianae
46 maca

Note that this will construct a URL starting with the http://rest.ensembl.org/info/species
prefix for the REST request. By the way, the preceding link will not work in your browser;
it should only be used via the REST API.

Now, let’s try to find any HGNC databases on the server related to human data:

In [3]:
ext_dbs = do_request(ensembl_server, 'info/external_dbs', 'homo_sapiens', filter='HGNC%')
print(ext_dbs)

[{'release': '1', 'display_name': 'HGNC Symbol', 'name': 'HGNC', 'description': None}, {'description': 'transcript name from HGNC', 'release': '1', 'display_name': 'Transcript name', 'name': 'HGNC_trans_name'}]


We restrict the search to human-related databases (homo_sapiens). We also filter for databases
whose names start with HGNC (the filter uses SQL notation, where % matches any sequence of
characters). HGNC is the HUGO Gene Nomenclature Committee database. We want to make sure that
it’s available because HGNC is responsible for curating human gene names and maintains our
LCT identifier.
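
As an illustration of the SQL-style wildcard, the following sketch emulates `LIKE` matching locally with a regular expression. The `like` and `sql_like_to_regex` helpers are hypothetical and only approximate the server's behavior; for instance, whether the server-side filter is case-sensitive is not something this sketch settles. The database names are taken from outputs shown in this recipe.

```python
import re


def sql_like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')   # % matches any sequence of characters
        elif ch == '_':
            parts.append('.')    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts), re.DOTALL)


def like(pattern, value):
    """Return True if value matches the LIKE pattern in full."""
    return sql_like_to_regex(pattern).fullmatch(value) is not None


names = ['HGNC', 'HGNC_trans_name', 'EntrezGene', 'ArrayExpress']
print([n for n in names if like('HGNC%', n)])
```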

Now that we know that the LCT identifier is probably available, we want to retrieve the Ensembl
ID for the gene, as shown in the following code:

In [4]:
answer = do_request(ensembl_server, 'lookup/symbol', 'homo_sapiens', 'LCT')
print(answer)
lct_id = answer['id']

{'logic_name': 'ensembl_havana_gene_homo_sapiens', 'object_type': 'Gene', 'source': 'ensembl_havana', 'id': 'ENSG00000115850', 'assembly_name': 'GRCh38', 'description': 'lactase [Source:HGNC Symbol;Acc:HGNC:6530]', 'canonical_transcript': 'ENST00000264162.7', 'seq_region_name': '2', 'biotype': 'protein_coding', 'start': 135787850, 'version': 10, 'end': 135837184, 'species': 'homo_sapiens', 'display_name': 'LCT', 'db_type': 'core', 'strand': -1}
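
The lookup record carries the gene's coordinates, so a few derived values can be computed directly from it. The sketch below copies a subset of the fields from the preceding output into a literal dictionary; `gene_length` and `location` are hypothetical helpers, and the location string is just one convenient way to format the coordinates.

```python
# Subset of the lookup response shown above, copied by hand for illustration
lookup = {
    'id': 'ENSG00000115850',
    'display_name': 'LCT',
    'seq_region_name': '2',
    'start': 135787850,
    'end': 135837184,
    'strand': -1,
}


def gene_length(record):
    """Length in base pairs; Ensembl coordinates are 1-based and inclusive."""
    return record['end'] - record['start'] + 1


def location(record):
    """Format the coordinates as chromosome:start-end:strand."""
    return '%s:%d-%d:%d' % (record['seq_region_name'], record['start'],
                            record['end'], record['strand'])


print(gene_length(lookup))
print(location(lookup))
```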


For reference, we can now also retrieve the sequence of the region containing the gene:

In [5]:
lct_seq = do_request(ensembl_server, 'sequence/id', lct_id)
print(lct_seq)

{'seq': 'AACAGTTCCTAGAAAATGGAGCTGTCTTGGCATGTAGTCTTTATTGCCCTGCTAAGTTTTTCATGCTGGGGGTCAGACTGGGAGTCTGATAGAAATTTCATTTCCACCGCTGGTCCTCTAACCAATGACTTGCTGCACAACCTGAGTGGTCTCCTGGGAGACCAGAGTTCTAACTTTGTAGCAGGGGACAAAGACATGTATGTTTGTCACCAGCCACTGCCCACTTTCCTGCCAGAATACTTCAGCAGTCTCCATGCCAGTCAGATCACCCATTATAAGGTATTTCTGTCATGGGCACAGCTCCTCCCAGCAGGAAGCACCCAGAATCCAGACGAGAAAACAGTGCAGTGCTACCGGCGACTCCTCAAGGCCCTCAAGACTGCACGGCTTCAGCCCATGGTCATCCTGCACCACCAGACCCTCCCTGCCAGCACCCTCCGGAGAACCGAAGCCTTTGCTGACCTCTTCGCCGACTATGCCACATTCGCCTTCCACTCCTTCGGGGACCTAGTTGGGATCTGGTTCACCTTCAGTGACTTGGAGGAAGTGATCAAGGAGCTTCCCCACCAGGAATCAAGAGCGTCACAACTCCAGACCCTCAGTGATGCCCACAGAAAAGCCTATGAGATTTACCACGAAAGCTATGCTTTTCAGGGTGAGTACACATTGACCTGATGGTGACCCCTCGGCAACCTTCATCACACACCTTCCCCATCCTCCTTAGAGCAGATTCGACATTTCTCCCAACTCACCTTCAGCAGTCCTCTTATGTCTGTGCATAGGGAGAAATTAATATTGTAAATTGATTTCCCACTGGCGATAGGAAGGGGTAGCTAACATGGCAAAACACTCAGCATTTCCTTTGAAAAATATCTTTGAGGCTCACGCCTGTAATCCTAGCACTTTGGGAGGCCGAGGTGGGCGGATCACTTGAAGTCAGGAGTTCGAGACCAGCCTGGCCAATATGGCAAAACCCCGTCTCTACTAAAAA
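
Printing the whole record floods the output, as seen above. A minimal sketch of a more compact summary follows; `summarize_sequence` is a hypothetical helper that relies only on the `'seq'` key visible in the output, and the sample record holds just the first 40 bases copied from it.

```python
def summarize_sequence(record, preview=30):
    """Return the sequence length and a short preview instead of the full string."""
    seq = record['seq']
    shown = seq[:preview]
    if len(seq) > preview:
        shown += '...'
    return {'length': len(seq), 'preview': shown}


# First 40 bases of the output above, used here as a stand-in for lct_seq
sample_record = {'seq': 'AACAGTTCCTAGAAAATGGAGCTGTCTTGGCATGTAGTCT'}
print(summarize_sequence(sample_record, preview=20))
```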

We can also inspect the cross-references to this gene in other databases known to Ensembl:

You will find different kinds of databases, such as the Vertebrate Genome Annotation (Vega) project, UniProt (a protein sequence database), and WikiGene.

In [6]:
lct_xrefs = do_request(ensembl_server, 'xrefs/id', lct_id)
for xref in lct_xrefs:
    print(xref['db_display_name'])
    print(xref)

LRG display in Ensembl gene
{'display_id': 'LRG_338', 'description': 'Locus Reference Genomic record for LCT', 'version': '0', 'dbname': 'ENS_LRG_gene', 'info_type': 'DIRECT', 'synonyms': [], 'primary_id': 'LRG_338', 'info_text': '', 'db_display_name': 'LRG display in Ensembl gene'}
Expression Atlas
{'display_id': 'ENSG00000115850', 'description': None, 'version': '0', 'dbname': 'ArrayExpress', 'info_type': 'DIRECT', 'synonyms': [], 'primary_id': 'ENSG00000115850', 'db_display_name': 'Expression Atlas', 'info_text': ''}
NCBI gene (formerly Entrezgene)
{'synonyms': [], 'primary_id': '3938', 'info_text': '', 'db_display_name': 'NCBI gene (formerly Entrezgene)', 'version': '0', 'description': 'lactase', 'display_id': 'LCT', 'info_type': 'DEPENDENT', 'dbname': 'EntrezGene'}
HGNC Symbol
{'synonyms': [], 'primary_id': 'HGNC:6530', 'info_text': 'Generated via ensembl_manual', 'db_display_name': 'HGNC Symbol', 'version': '0', 'description': 'lactase', 'display_id': 'LCT', 'dbname': 'HGNC', 'in
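
Since each cross-reference is a dictionary, the list can be condensed into a mapping from database name to identifiers. This is a sketch only: `ids_by_database` is a hypothetical helper, and the sample list keeps just the `dbname` and `primary_id` fields of the four records visible in the output above.

```python
from collections import defaultdict


def ids_by_database(xrefs):
    """Group the primary IDs of cross-references by their source database."""
    grouped = defaultdict(list)
    for xref in xrefs:
        grouped[xref['dbname']].append(xref['primary_id'])
    return dict(grouped)


# Trimmed copies of the records printed above
sample_xrefs = [
    {'dbname': 'ENS_LRG_gene', 'primary_id': 'LRG_338'},
    {'dbname': 'ArrayExpress', 'primary_id': 'ENSG00000115850'},
    {'dbname': 'EntrezGene', 'primary_id': '3938'},
    {'dbname': 'HGNC', 'primary_id': 'HGNC:6530'},
]
for db, ids in ids_by_database(sample_xrefs).items():
    print(db, ids)
```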

Let’s get the orthologues for this gene on the horse genome:

You will get quite a lot of information about an orthologue, such as the taxonomic level of orthology (Boreoeutheria, a clade of placental mammals, is the closest phylogenetic level shared by humans and horses), the Ensembl ID of the orthologue, the dN/dS ratio (the ratio of non-synonymous
to synonymous substitution rates), and the CIGAR string of differences among sequences. By default, you will also get the
alignment of the orthologous sequence, but I have removed it (with sequence='none') to unclog the output.

In [7]:
# sequence='none' omits the aligned sequences from the response
hom_response = do_request(ensembl_server, 'homology/id', lct_id, type='orthologues', sequence='none')
homologies = hom_response['data'][0]['homologies']
for homology in homologies:
    # Print every target species, but only show details for the horse
    print(homology['target']['species'])
    if homology['target']['species'] != 'equus_caballus':
        continue
    print(homology)
    print(homology['taxonomy_level'])
    horse_id = homology['target']['id']


pongo_abelii
gorilla_gorilla
pan_paniscus
pan_troglodytes
nomascus_leucogenys
ornithorhynchus_anatinus
sarcophilus_harrisii
notamacropus_eugenii
choloepus_hoffmanni
dasypus_novemcinctus
erinaceus_europaeus
echinops_telfairi
callithrix_jacchus
cercocebus_atys
macaca_fascicularis
macaca_mulatta
macaca_nemestrina
papio_anubis
mandrillus_leucophaeus
canis_lupus_familiaris
physeter_catodon
mustela_putorius_furo
panthera_leo
felis_catus
vulpes_vulpes
sus_scrofa
sus_scrofa
tursiops_truncatus
ailuropoda_melanoleuca
panthera_pardus
mesocricetus_auratus
mus_pahari
balaenoptera_musculus
mus_caroli
marmota_marmota_marmota
capra_hircus
procavia_capensis
cavia_porcellus
ochotona_princeps
dipodomys_ordii
loxodonta_africana
delphinapterus_leucas
sorex_araneus
oryctolagus_cuniculus
oryctolagus_cuniculus
otolemur_garnettii
mus_spretus
microcebus_murinus
phocoena_sinus
vombatus_ursinus
mus_spicilegus
tupaia_belangeri
urocitellus_parryii
bos_grunniens
cricetulus_griseus_chok1gshd
monodelphis_domestica
mon

We could have acquired the orthologues directly for the horse genome by specifying a `target_species` parameter on `do_request`. However, this code allows you to inspect
all the available orthologues.
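
The loop above leaves `horse_id` undefined if no horse entry is present. A slightly more defensive variant is sketched below. The `find_orthologue` helper is hypothetical, and the toy response only imitates the nesting used in the code above; the chimpanzee ID and taxonomy level are placeholders, not real values.

```python
def find_orthologue(hom_response, species):
    """Return (target ID, taxonomy level) for a species, or None if it is absent."""
    for homology in hom_response['data'][0]['homologies']:
        target = homology['target']
        if target['species'] == species:
            return target['id'], homology.get('taxonomy_level')
    return None


# Toy response with the same nesting as the real one; placeholder values are marked
toy_response = {'data': [{'homologies': [
    {'target': {'species': 'pan_troglodytes', 'id': 'PLACEHOLDER_CHIMP_ID'},
     'taxonomy_level': 'PLACEHOLDER_LEVEL'},
    {'target': {'species': 'equus_caballus', 'id': 'ENSECAG00000018594'},
     'taxonomy_level': 'Boreoeutheria'},
]}]}

print(find_orthologue(toy_response, 'equus_caballus'))
print(find_orthologue(toy_response, 'danio_rerio'))
```

Returning `None` for a missing species lets the caller decide how to handle the gap instead of failing later with a `NameError`.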

Finally, let’s look for the horse_id Ensembl record:

In [8]:
horse_req = do_request(ensembl_server, 'lookup/id', horse_id)
print(horse_req)

{'db_type': 'core', 'display_name': 'LCT', 'species': 'equus_caballus', 'strand': -1, 'version': 3, 'object_type': 'Gene', 'source': 'ensembl', 'start': 19678126, 'logic_name': 'ensembl', 'id': 'ENSECAG00000018594', 'biotype': 'protein_coding', 'canonical_transcript': 'ENSECAT00000020097.3', 'end': 19724999, 'assembly_name': 'EquCab3.0', 'description': 'lactase [Source:VGNC Symbol;Acc:VGNC:19613]', 'seq_region_name': '18'}
