# Introduction: Use DDOT to download and process a human-focused Gene Ontology (GO)

The Gene Ontology (GO) is a general-purpose structure curated to describe all species. This notebook extracts the portion of GO that is relevant to human biology by performing the following steps:

1. Download the GO structure and gene-term annotations
2. Remove redundant GO terms and terms that are not relevant to human, i.e. terms that contain no human genes
3. Concatenate all three branches of GO (biological process, molecular function, cellular component) into a unified ontology with an artificial root 'GO:00SUPER'
4. Convert gene IDs and symbols using the mygene Python package, a client for mygene.info (Installation: https://pypi.org/project/mygene/)
5. Upload ontology to NDEx

It is strongly recommended that you go through the tutorial (DDOT_Tutorial.ipynb) before this notebook.

In [1]:
import requests
import gzip
import pandas as pd
import networkx as nx
import sys

import ddot
from ddot import Ontology

# Set username and password on the Network Data Exchange (NDEx).
* It is strongly recommended that you create a free account on NDEx in order to keep track of your own ontologies.
* Note that there are two NDEx servers: the main server at http://public.ndexbio.org/ and a test server for prototyping your code at http://test.ndexbio.org (also aliased as http://dev2.ndexbio.org). Each server requires a separate user account. Because the main server contains networks from publications, we recommend that you use an account on the test server while you become familiar with DDOT.

In [2]:
# Note: change the server to http://public.ndexbio.org, if this is where you created your NDEx account
ndex_server = 'http://test.ndexbio.org' 

# Set the NDEx user account (replace with your own username and password)
ndex_user, ndex_pass = '<enter your username>', '<enter your account password>'
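Hard-coding a password in a notebook makes it easy to share by accident. The following is a minimal sketch of one alternative, reading the credentials from environment variables. The variable names `NDEX_SERVER`, `NDEX_USER`, and `NDEX_PASS` and the helper `ndex_credentials` are assumptions made for this example; they are not part of DDOT or NDEx.

```python
import os


def ndex_credentials(env=os.environ):
    """Return (server, user, password) read from environment variables.

    NDEX_SERVER, NDEX_USER and NDEX_PASS are hypothetical names chosen
    for this sketch. The server defaults to the NDEx test server.
    """
    server = env.get('NDEX_SERVER', 'http://test.ndexbio.org')
    user = env.get('NDEX_USER')
    password = env.get('NDEX_PASS')
    if not user or not password:
        raise ValueError('Set NDEX_USER and NDEX_PASS before uploading')
    return server, user, password


# Possible usage in place of the assignments above:
# ndex_server, ndex_user, ndex_pass = ndex_credentials()
```

The `env` argument exists only so the helper can be tried with a plain dictionary instead of the real environment.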

# Download and Parse Gene Ontology files

In [3]:
# Download GO obo file
r = requests.get('http://purl.obolibrary.org/obo/go/go-basic.obo')
with open('go-basic.obo', 'wb') as f:
    f.write(r.content)

# Parse OBO file
ddot.parse_obo('go-basic.obo', 'go.tab', 'goID_2_name.tab', 'goID_2_namespace.tab', 'goID_2_alt_id.tab')

# Download gene-term annotations for human
r = requests.get('http://geneontology.org/gene-associations/goa_human.gaf.gz')
with open('goa_human.gaf.gz', 'wb') as f:
    f.write(r.content)
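`requests.get` does not raise an exception when the server answers with an HTTP error, so an error page could be written to disk under the `.gz` name. Calling `r.raise_for_status()` before writing is one safeguard. Another, sketched below with only the standard library, is to check that the downloaded bytes start with the gzip magic number. The helper name `looks_like_gzip` and the sample byte strings are made up for this illustration.

```python
import gzip
import io

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_like_gzip(data):
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# In-memory demonstration: a real gzip payload versus an HTML error page
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'!gaf-version: 2.1\n')
good = buf.getvalue()
bad = b'<html>404 Not Found</html>'

print(looks_like_gzip(good), looks_like_gzip(bad))
```

In the cell above, the same check could be applied to `r.content` before it is written to `goa_human.gaf.gz`.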

# Create Ontology object from Gene Ontology files

In [4]:
hierarchy = pd.read_table('go.tab',
                          sep='\t',
                          header=None,
                          names=['Parent', 'Child', 'Relation', 'Namespace'])
with gzip.open('goa_human.gaf.gz', 'rb') as f:
    mapping = ddot.parse_gaf(f)
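The `mapping` table is used in the next cell through its 'DB Object ID' and 'GO ID' columns. In the GAF format these are the second and fifth tab-separated columns, and lines starting with `!` are comments. The sketch below reads those two columns with the standard library to show what the file contains. It is an illustration only, not a replacement for `ddot.parse_gaf`, and the sample rows use placeholder gene IDs rather than real annotations.

```python
import csv
import io


def read_gaf_pairs(lines):
    """Yield (DB Object ID, GO ID) pairs from GAF-formatted text lines."""
    # Skip header/comment lines, which start with '!'
    data_lines = (line for line in lines if not line.startswith('!'))
    for row in csv.reader(data_lines, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(row) >= 5:
            yield row[1], row[4]  # column 2 = DB Object ID, column 5 = GO ID


# Placeholder rows in GAF column order (truncated after the aspect column)
sample = io.StringIO(
    u"!gaf-version: 2.1\n"
    u"UniProtKB\tX00001\tGENE_A\t\tGO:0000001\tPMID:1\tIDA\t\tP\n"
    u"UniProtKB\tX00002\tGENE_B\t\tGO:0000002\tPMID:2\tIEA\t\tF\n"
)
pairs = list(read_gaf_pairs(sample))
print(pairs)
```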



In [5]:
go_human = Ontology.from_table(
    table=hierarchy,
    parent='Parent',
    child='Child',
    mapping=mapping,
    mapping_child='DB Object ID',
    mapping_parent='GO ID',
    add_root_name='GO:00SUPER',
    ignore_orphan_terms=True)
go_human.clear_node_attr()
go_human.clear_edge_attr()
go_human

Unifying 3 roots into one super-root


19737 genes, 45018 terms, 277105 gene-term relations, 92880 term-term relations
node_attributes: []
edge_attributes: []
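The message "Unifying 3 roots into one super-root" refers to the three GO branches being joined under the artificial root 'GO:00SUPER'. The sketch below illustrates the idea on a toy edge list: a root is any term that appears as a parent but never as a child, and each root receives an edge from the new super-root. The function `add_super_root` and the toy term names are assumptions for this example and do not show how DDOT implements `add_root_name`.

```python
def add_super_root(edges, root_name='GO:00SUPER'):
    """Return a new (parent, child) edge list with one artificial root.

    A root is a term that occurs as a parent but never as a child.
    """
    parents = {parent for parent, child in edges}
    children = {child for parent, child in edges}
    roots = sorted(parents - children)
    return edges + [(root_name, root) for root in roots]


# Toy hierarchy with three separate branches (names are placeholders)
edges = [
    ('BP', 'bp_term'),
    ('MF', 'mf_term'),
    ('CC', 'cc_term'),
    ('bp_term', 'bp_leaf'),
]
unified = add_super_root(edges)
print(unified[-3:])
```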

# Collapse GO with respect to human UniProt IDs

In [6]:
%time go_human = go_human.collapse_ontology(method='mhkramer')
# Re-add the artificial root if it is no longer present after collapsing
if 'GO:00SUPER' not in go_human.terms:
    go_human.add_root('GO:00SUPER', inplace=True)
print(go_human)

collapse command: /cellar/users/mikeyu/anaconda2/lib/python2.7/site-packages/ddot/alignOntology/collapseRedundantNodes /tmp/tmp6hPhd_
CPU times: user 16.8 s, sys: 808 ms, total: 17.6 s
Wall time: 29.6 s
19737 genes, 19873 terms, 222123 gene-term relations, 44842 term-term relations
node_attributes: []
edge_attributes: []
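Collapsing reduces the ontology from 45018 to 19873 terms while keeping all 19737 genes. The sketch below shows one way a single term can be removed without disconnecting the hierarchy: every child of the removed term is linked to every parent of that term. This is a simplified illustration under that assumption, using a plain dictionary; it is not DDOT's implementation and does not decide which terms are redundant.

```python
def remove_term(parents_of, term):
    """Remove `term` in place, linking its children to its parents.

    parents_of maps every node (term or gene) to the set of its parents.
    """
    grandparents = parents_of.pop(term)
    for node, parents in parents_of.items():
        if term in parents:
            parents.discard(term)
            parents.update(grandparents)
    return parents_of


# Toy hierarchy: root -> A -> B, and C has both A and root as parents
hierarchy = {
    'root': set(),
    'A': {'root'},
    'B': {'A'},
    'C': {'A', 'root'},
}
collapsed = remove_term(hierarchy, 'A')
print(collapsed)
```

After the call, both `B` and `C` point directly at `root`, so no gene or term below `A` loses its path to the top of the hierarchy.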


# Add descriptions of GO terms

In [7]:
go_descriptions = pd.read_table('goID_2_name.tab',
                                header=None,
                                names=['Term', 'Term_Description'],
                                index_col=0)
go_human.update_node_attr(go_descriptions)

go_branches = pd.read_table('goID_2_namespace.tab',
                            header=None,
                            names=['Term', 'Branch'],
                            index_col=0)
go_human.update_node_attr(go_branches)

# Upload GO to NDEx (and write to local file)
* GO's annotation files use UniProt IDs for human genes. Use the mygene package to convert UniProt IDs to gene symbols and Entrez gene IDs.

In [8]:
# Install the mygene package. It is recommended that you run this in a separate bash terminal, not in this Jupyter notebook.
# If you want to use a conda virtual environment, activate the environment first.
! pip install mygene



In [9]:
import mygene
mg = mygene.MyGeneInfo()

In [10]:
name = 'Human-specific Gene Ontology'

## Format GO with UniProt IDs and upload to NDEx

In [11]:
go_human_uniprot = go_human.copy()

# Write GO to file
go_human_uniprot.to_table('collapsed_go.uniprot', clixo_format=True)
go_human_uniprot.to_pickle('collapsed_go.uniprot.pkl')

url, G = go_human_uniprot.to_ndex(name='%s, %s' % (name, 'UniProt'),
                                  ndex_server=ndex_server,
                                  ndex_user=ndex_user,
                                  ndex_pass=ndex_pass,
                                  layout=None,
                                  visibility='PUBLIC')
print(url)


http://dev2.ndexbio.org/v2/network/dbe79abb-fa70-11e8-ad43-0660b7976219


## Format GO with gene symbols and upload to NDEx

In [12]:
uniprot_2_symbol_df = mg.querymany(go_human.genes, scopes='uniprot', fields='symbol', species='human', as_dataframe=True)

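def f(x):
    # x holds all hits for one UniProt query
    x = x['symbol']
    if len(x)==1:
        # Single hit: return the symbol itself (positional access)
        return x.iloc[0]
    else:
        # Multiple hits: return all symbols as a list
        return x.tolist()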
uniprot_2_symbol = uniprot_2_symbol_df.dropna(subset=['symbol']).groupby('query').apply(f)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-19737...done.
Finished.
318 input query terms found dup hits:
	[(u'G5E9R7', 2), (u'Q6ZTI6', 2), (u'P62807', 6), (u'P62805', 10), (u'Q5DJT8', 3), (u'P50391', 2), (u
751 input query terms found no hit:
	[u'A0A075B734', u'A0A087WSY4', u'A0A087WUL8', u'A0A087WVM7', u'A0A087WX78', u'A0A087WZG4', u'A0A087X
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
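The output above reports that some UniProt IDs return several hits and some return none. The helper `f` keeps a single symbol as a string and several symbols as a list, and IDs without a symbol are dropped beforehand. The sketch below reproduces that logic on plain Python tuples so the resulting mapping is easy to inspect. The function `collapse_hits` and all IDs and symbols are placeholders invented for this example.

```python
def collapse_hits(hits):
    """Group (query, symbol) hits into a query -> symbol(s) mapping.

    Queries with one symbol map to a string, queries with several map
    to a list, and hits without a symbol (None) are dropped.
    """
    grouped = {}
    for query, symbol in hits:
        if symbol is None:
            continue
        grouped.setdefault(query, []).append(symbol)
    return {q: s[0] if len(s) == 1 else s for q, s in grouped.items()}


hits = [
    ('X00001', 'GENE_A'),   # one hit
    ('X00002', 'GENE_B1'),  # two hits for the same query
    ('X00002', 'GENE_B2'),
    ('X00003', None),       # no symbol found
]
mapping_example = collapse_hits(hits)
print(mapping_example)
```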


In [13]:
go_human_symbol = go_human.delete(to_delete=set(go_human.genes) - set(uniprot_2_symbol.keys()))
go_human_symbol = go_human_symbol.rename(genes=uniprot_2_symbol.to_dict())
print(go_human_symbol)

# Write GO to file
go_human_symbol.to_table('collapsed_go.symbol', clixo_format=True)
go_human_symbol.to_pickle('collapsed_go.symbol.pkl')

url, G = go_human_symbol.to_ndex(name='%s, %s' % (name, 'Symbol'),
                                 ndex_server=ndex_server,
                                 ndex_user=ndex_user,
                                 ndex_pass=ndex_pass,
                                 layout=None,
                                 visibility='PUBLIC')
print(url)

19124 genes, 19873 terms, 220165 gene-term relations, 44842 term-term relations
node_attributes: ['Term_Description', 'Branch']
edge_attributes: []
http://dev2.ndexbio.org/v2/network/3baa3908-fa71-11e8-ad43-0660b7976219


## Format GO with Entrez gene IDs and upload to NDEx

In [14]:
uniprot_2_entrezgene_df = mg.querymany(go_human.genes, scopes='uniprot', fields='entrezgene', species='human', as_dataframe=True)

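def f(x):
    # x holds all hits for one UniProt query; store Entrez IDs as strings
    x = x['entrezgene'].astype(int).astype(str)
    if len(x)==1:
        # Single hit: return the ID itself (positional access)
        return x.iloc[0]
    else:
        # Multiple hits: return all IDs as a list
        return x.tolist()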
uniprot_2_entrezgene = uniprot_2_entrezgene_df.dropna(subset=['entrezgene']).groupby('query').apply(f)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-19737...done.
Finished.
318 input query terms found dup hits:
	[(u'G5E9R7', 2), (u'Q6ZTI6', 2), (u'P62807', 6), (u'P62805', 10), (u'Q5DJT8', 3), (u'P50391', 2), (u
751 input query terms found no hit:
	[u'A0A075B734', u'A0A087WSY4', u'A0A087WUL8', u'A0A087WVM7', u'A0A087WX78', u'A0A087WZG4', u'A0A087X
Pass "returnall=True" to return complete lists of duplicate or missing query terms.


In [15]:
go_human_entrez = go_human.delete(to_delete=set(go_human.genes) - set(uniprot_2_entrezgene.keys()))
go_human_entrez = go_human_entrez.rename(genes=uniprot_2_entrezgene.to_dict())
print(go_human_entrez)

# Write GO to file
go_human_entrez.to_table('collapsed_go.entrez', clixo_format=True)
go_human_entrez.to_pickle('collapsed_go.entrez.pkl')

url, G = go_human_entrez.to_ndex(name='%s, %s' % (name, 'Entrez'),
                                 ndex_server=ndex_server,
                                 ndex_user=ndex_user,
                                 ndex_pass=ndex_pass,
                                 layout=None,
                                 visibility='PUBLIC')
print(url)

18513 genes, 19873 terms, 217021 gene-term relations, 44842 term-term relations
node_attributes: ['Term_Description', 'Branch']
edge_attributes: []
http://dev2.ndexbio.org/v2/network/6c21c12b-fa71-11e8-ad43-0660b7976219
