# Gene Ontology Network

Download the [Gene Ontology](http://geneontology.org/docs/download-ontology/) and convert it into a network graph for inspection, visualisation, and hypothesis generation.

## Downloads

Download the core Gene Ontology in [OBO](http://owlcollab.github.io/oboformat/doc/obo-syntax.html) format.

In [1]:
%%sh
# get the core gene ontology in OBO format
curl -O "http://current.geneontology.org/ontology/go.obo"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 32.3M  100 32.3M    0     0  1791k      0  0:00:18  0:00:18 --:--:-- 2451k
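
To give a feel for what the downloaded file contains, here is a minimal sketch that parses `[Term]` stanzas using only the standard library. The sample stanza is abbreviated and illustrative, and `parse_obo_terms` is a hypothetical helper, not part of any GO tooling; it ignores escaping, trailing modifiers, and other details of the OBO syntax.

```python
# Abbreviated, illustrative [Term] stanza in OBO syntax (not the full record).
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""


def parse_obo_terms(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected here.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop the trailing "! human-readable name" comment, if any.
            value = value.split(" ! ", 1)[0]
            current.setdefault(tag, []).append(value)
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
print(terms[0]["id"], terms[0]["is_a"])
```

For real work a dedicated parser is the safer choice, since it handles the parts of the format this sketch skips.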


Download the human annotations in [GAF](http://geneontology.org/docs/go-annotation-file-gaf-format-2.2/) format.

In [23]:
%%sh
curl -O "http://current.geneontology.org/annotations/goa_human.gaf.gz"
curl -O "http://current.geneontology.org/annotations/goa_human_complex.gaf.gz"
curl -O "http://current.geneontology.org/annotations/goa_human_isoform.gaf.gz"
curl -O "http://current.geneontology.org/annotations/goa_human_rna.gaf.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.3M  100 11.3M    0     0  1048k      0  0:00:11  0:00:11 --:--:-- 2017k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 55972  100 55972    0     0  59940      0 --:--:-- --:--:-- --:--:-- 63245
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2566k  100 2566k    0     0   781k      0  0:00:03  0:00:03 --:--:--  784k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  496k  100  496k    0     0   776k      0 --:--:-- --:--:-- --:--:--  814k
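
GAF files begin with comment lines prefixed by `!`, the first of which normally declares the format version. A quick way to confirm what was downloaded is to read just those lines. The sketch below assumes a gzipped GAF file; `gaf_header` is a hypothetical helper, and the demo file is synthetic so the code runs without the real download.

```python
import gzip
import os
import tempfile


def gaf_header(path):
    """Collect the leading '!' comment lines of a gzipped GAF file."""
    header = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("!"):
                break  # first annotation row reached
            header.append(line.rstrip("\n"))
    return header


# Demo on a tiny synthetic file (made-up content, not real annotations).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gaf.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("!gaf-version: 2.2\n!synthetic header line\nDB\tID\tSYMBOL\n")

header = gaf_header(demo_path)
print(header[0])
```

On the real data this would be called as `gaf_header("goa_human.gaf.gz")`.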


## Ontology

Load the OBO file.

In [3]:
import pronto
import pandas as pd
go = pronto.Ontology("go.obo")

Get the non-obsolete terms.

In [20]:
current_terms = [term for term in go.terms() if not term.obsolete]

Get the GO id, name, and namespace of all terms and turn them into a data frame.

In [35]:

go_terms = pd.DataFrame([[term.id, term.name, term.namespace] for term in current_terms], columns=['Id', 'Label', 'namespace'])
go_terms

| | Id | Label | namespace |
|---|---|---|---|
| 0 | GO:0000001 | mitochondrion inheritance | biological_process |
| 1 | GO:0000002 | mitochondrial genome maintenance | biological_process |
| 2 | GO:0000003 | reproduction | biological_process |
| 3 | GO:0000006 | high-affinity zinc transmembrane transporter a... | molecular_function |
| 4 | GO:0000007 | low-affinity zinc ion transmembrane transporte... | molecular_function |
| ... | ... | ... | ... |
| 43694 | GO:1905213 | negative regulation of mitotic chromosome cond... | biological_process |
| 43695 | GO:1905214 | regulation of RNA binding | biological_process |
| 43696 | GO:1905215 | negative regulation of RNA binding | biological_process |
| 43697 | GO:1905216 | positive regulation of RNA binding | biological_process |
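
Before using the term table as a node list, it can be worth checking that every `Id` is unique and seeing how the terms split across namespaces. The sketch below does this on a stand-in frame built from the first five displayed rows (labels kept as truncated in the display); `sample_terms` is an illustrative name, and the same two calls should apply unchanged to `go_terms`.

```python
import pandas as pd

# Stand-in for go_terms: the first five rows as displayed in the output.
sample_terms = pd.DataFrame(
    [
        ["GO:0000001", "mitochondrion inheritance", "biological_process"],
        ["GO:0000002", "mitochondrial genome maintenance", "biological_process"],
        ["GO:0000003", "reproduction", "biological_process"],
        ["GO:0000006", "high-affinity zinc transmembrane transporter a...", "molecular_function"],
        ["GO:0000007", "low-affinity zinc ion transmembrane transporte...", "molecular_function"],
    ],
    columns=["Id", "Label", "namespace"],
)

# Node ids must be unique for the table to work as a node list.
ids_unique = sample_terms["Id"].is_unique

# Number of terms per namespace.
namespace_counts = sample_terms["namespace"].value_counts()
print(ids_unique)
print(namespace_counts)
```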


Get a list of `is_a` relationships from superclasses.

In [22]:
go_is_a = pd.concat([pd.DataFrame([[term.id, t.id, 'is_a'] for t in term.superclasses(1, with_self=False)], columns=['Source', 'Target', 'relationship']) for term in current_terms])
go_is_a

| | Source | Target | relationship |
|---|---|---|---|
| 0 | GO:0000001 | GO:0048308 | is_a |
| 1 | GO:0000001 | GO:0048311 | is_a |
| 0 | GO:0000002 | GO:0007005 | is_a |
| 0 | GO:0000003 | GO:0008150 | is_a |
| 0 | GO:0000006 | GO:0005385 | is_a |
| ... | ... | ... | ... |
| 0 | GO:1905215 | GO:0051100 | is_a |
| 1 | GO:1905215 | GO:1905214 | is_a |
| 0 | GO:1905216 | GO:0051099 | is_a |
| 1 | GO:1905216 | GO:1905214 | is_a |
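
Because only direct parents are recorded, reaching all ancestors of a term means walking the `is_a` edges repeatedly. The following is a minimal sketch of that walk over plain `(child, parent)` pairs. The first three edges are taken from the output shown here; the fourth uses a made-up parent id so the walk has a second level to climb. `ancestors` is a hypothetical helper.

```python
from collections import defaultdict

# Direct is_a edges as (child, parent) pairs.
IS_A_EDGES = [
    ("GO:0000001", "GO:0048308"),
    ("GO:0000001", "GO:0048311"),
    ("GO:0000002", "GO:0007005"),
    ("GO:0048308", "GO:EXAMPLE"),  # hypothetical parent, not a real GO id
]


def ancestors(term_id, edges):
    """Return every term reachable from term_id by following is_a edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    seen = set()
    stack = [term_id]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("GO:0000001", IS_A_EDGES)))
```

With the real data, the pairs could be taken from the `Source` and `Target` columns of `go_is_a`.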


Get the terms that have relationships other than `is_a`.

In [7]:
terms_with_relationships = [term for term in current_terms if len(term.relationships.items()) > 0]

Build a data frame for each relationship type of each term, then concatenate them all.

In [11]:
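# one data frame per (term, relationship type) pair
rels_list = [[pd.DataFrame([[term.id, t.id, rel.name] for t in targets], columns=['Source', 'Target', 'relationship']) for rel, targets in term.relationships.items()] for term in terms_with_relationships]
# flatten the nested list so that every relationship type of every term is kept
# (taking only the first frame per term would drop the other relationship types)
go_rels = pd.concat([frame for frames in rels_list for frame in frames])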
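
A nested list of per-relationship frames has to be flattened completely, otherwise edges are silently lost for terms with more than one relationship type. The sketch below contrasts the two approaches on toy data. The ids `T1` to `T4` are placeholders, and `edge_frame` is a hypothetical helper.

```python
from itertools import chain

import pandas as pd


def edge_frame(rows):
    """Build an edge table from (source, target, relationship) tuples."""
    return pd.DataFrame(rows, columns=["Source", "Target", "relationship"])


# Outer list: one entry per term. Inner list: one frame per relationship type.
nested = [
    [edge_frame([("T1", "T2", "part of")]), edge_frame([("T1", "T3", "regulates")])],
    [edge_frame([("T4", "T2", "part of")])],
]

# Keeping only the first frame per term loses the "regulates" edge of T1.
first_only = pd.concat([frames[0] for frames in nested], ignore_index=True)

# Flattening the nesting keeps every frame.
all_edges = pd.concat(list(chain.from_iterable(nested)), ignore_index=True)

print(len(first_only), len(all_edges))
```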

The `is_a` edges (`go_is_a`) and the other relationship edges (`go_rels`) are merged into a single data frame in the Network section.

## Annotations

Load the annotation `GAF` file

In [33]:
# GAF 2.2 has 17 tab-separated columns; header lines start with '!'
gaf_columns = ['DB', 'DBObjectID', 'DBObjectSymbol', 'Qualifier', 'GOID', 'DBReference', 'EvidenceCode', 'WithFrom', 'Aspect', 'DBObjectName', 'DBObjectSynonym', 'DBObjectType', 'Taxon', 'Date', 'AssignedBy', 'AnnotationExtension', 'GeneProductFormID']
goa = pd.read_csv('goa_human.gaf.gz', sep='\t', comment='!', header=None, names=gaf_columns, index_col=False)


Represent the annotated gene products as nodes and their GO annotations as edges.

In [47]:
# nodes
goa_proteins = goa[['DBObjectID', 'DBObjectSymbol', 'DBObjectType']].drop_duplicates()
goa_proteins.columns = ['Id','Label','namespace']
goa_proteins


| | Id | Label | namespace |
|---|---|---|---|
| 0 | A0A024RBG1 | NUDT4B | protein |
| 5 | A0A075B6H7 | IGKV3-7 | protein |
| 8 | A0A075B6H8 | IGKV1D-42 | protein |
| 11 | A0A075B6H9 | IGLV4-69 | protein |
| 14 | A0A075B6I0 | IGLV8-61 | protein |
| ... | ... | ... | ... |
| 622517 | Q86VQ1 | GLCCI1 | protein |
| 623171 | Q8N9H9 | C1orf127 | protein |
| 623260 | Q49AJ0 | FAM135B | protein |
| 623970 | Q5SQS7 | SH2D4B | protein |


In [53]:
# edges
goa_qualifiers = goa[['DBObjectID', 'GOID', 'Qualifier']].drop_duplicates()
goa_qualifiers.columns = ['Source','Target', 'relationship']
goa_qualifiers

| | Source | Target | relationship |
|---|---|---|---|
| 0 | A0A024RBG1 | GO:0003723 | enables |
| 1 | A0A024RBG1 | GO:0046872 | enables |
| 2 | A0A024RBG1 | GO:0052840 | enables |
| 3 | A0A024RBG1 | GO:0052842 | enables |
| 4 | A0A024RBG1 | GO:0005829 | located_in |
| ... | ... | ... | ... |
| 624103 | Q9UKU6 | GO:0042277 | enables |
| 624106 | Q9UM73 | GO:0042127 | involved_in |
| 624107 | Q9H4H8 | GO:0007165 | involved_in |
| 624108 | Q9BXS6 | GO:0005730 | is_active_in |
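
In GAF 2.2 the Qualifier column can carry a `NOT` modifier joined to the relation with a pipe, for example `NOT|enables`. If such rows are present, they become edges whose label means the opposite of the plain relation, so it may help to separate the negation flag from the relation before building the network. A minimal sketch follows; `split_qualifier` is a hypothetical helper.

```python
def split_qualifier(qualifier):
    """Split a GAF qualifier such as 'NOT|enables' into (negated, relation)."""
    parts = qualifier.split("|")
    negated = "NOT" in parts
    relation = "|".join(part for part in parts if part != "NOT")
    return negated, relation


print(split_qualifier("enables"))
print(split_qualifier("NOT|involved_in"))
```

Applied to the edge table, this could be used to drop negated annotations or to keep them under a distinct label, depending on what the network is meant to show.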


## Network

Concatenate the GO term nodes with the protein nodes, and the ontology edges with the annotation edges.

In [60]:
goa_terms = pd.concat([go_terms,goa_proteins]).drop_duplicates()
goa_terms

| | Id | Label | namespace |
|---|---|---|---|
| 0 | GO:0000001 | mitochondrion inheritance | biological_process |
| 1 | GO:0000002 | mitochondrial genome maintenance | biological_process |
| 2 | GO:0000003 | reproduction | biological_process |
| 3 | GO:0000006 | high-affinity zinc transmembrane transporter a... | molecular_function |
| 4 | GO:0000007 | low-affinity zinc ion transmembrane transporte... | molecular_function |
| ... | ... | ... | ... |
| 622517 | Q86VQ1 | GLCCI1 | protein |
| 623171 | Q8N9H9 | C1orf127 | protein |
| 623260 | Q49AJ0 | FAM135B | protein |
| 623970 | Q5SQS7 | SH2D4B | protein |


In [59]:
go_relationships = pd.concat([go_is_a, go_rels]).drop_duplicates()
goa_relationships = pd.concat([go_relationships,goa_qualifiers])
goa_relationships

| | Source | Target | relationship |
|---|---|---|---|
| 0 | GO:0000001 | GO:0048308 | is_a |
| 1 | GO:0000001 | GO:0048311 | is_a |
| 0 | GO:0000002 | GO:0007005 | is_a |
| 0 | GO:0000003 | GO:0008150 | is_a |
| 0 | GO:0000006 | GO:0005385 | is_a |
| ... | ... | ... | ... |
| 0 | GO:1905212 | GO:1990956 | positively regulates |
| 0 | GO:1905213 | GO:0007076 | negatively regulates |
| 0 | GO:1905214 | GO:0003723 | regulates |
| 0 | GO:1905215 | GO:0003723 | negatively regulates |
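
Once nodes and edges are combined, a useful sanity check is whether every edge endpoint has a matching row in the node table. Edges that point at missing nodes could arise, for instance, if an annotation targets a term that was filtered out; whether that happens in this data is an assumption worth testing. The sketch below works on plain ids and tuples, and `dangling_endpoints` is a hypothetical helper.

```python
def dangling_endpoints(node_ids, edges):
    """Return the sorted ids used by an edge but missing from the node table."""
    known = set(node_ids)
    missing = set()
    for source, target, _relationship in edges:
        for endpoint in (source, target):
            if endpoint not in known:
                missing.add(endpoint)
    return sorted(missing)


# Toy data: GO:0003723 is deliberately left out of the node list.
toy_nodes = ["GO:0000001", "GO:0048308", "A0A024RBG1"]
toy_edges = [
    ("GO:0000001", "GO:0048308", "is_a"),
    ("A0A024RBG1", "GO:0003723", "enables"),
]

print(dangling_endpoints(toy_nodes, toy_edges))
```

With the data frames, the call could look like `dangling_endpoints(goa_terms["Id"], goa_relationships.itertuples(index=False, name=None))`.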


Write the node and edge tables to CSV files.

In [57]:
go_terms.to_csv('go.terms.csv', index=False)
go_relationships.to_csv('go.relationships.csv', index=False)
goa_terms.to_csv('goa.terms.csv', index=False)
goa_relationships.to_csv('goa.relationships.csv', index=False)
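
The exported edge files can be read back into a simple adjacency structure for quick inspection without any graph library. The sketch below assumes the `Source`, `Target`, and `relationship` header written by the export; `load_adjacency` is a hypothetical helper, and the demo file is synthetic so the code runs on its own.

```python
import csv
import os
import tempfile
from collections import defaultdict


def load_adjacency(edge_csv):
    """Map each Source id to a list of (relationship, Target) pairs."""
    adjacency = defaultdict(list)
    with open(edge_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            adjacency[row["Source"]].append((row["relationship"], row["Target"]))
    return adjacency


# Demo on a small synthetic file with the same header as the exported CSVs.
demo_csv = os.path.join(tempfile.mkdtemp(), "demo.relationships.csv")
with open(demo_csv, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "relationship"])
    writer.writerow(["GO:0000001", "GO:0048308", "is_a"])
    writer.writerow(["A0A024RBG1", "GO:0003723", "enables"])

graph = load_adjacency(demo_csv)
print(graph["A0A024RBG1"])
```

On the real export this would be called as `load_adjacency("goa.relationships.csv")`.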