# Preparing Gene Ontology (GO) Helper Files

In this notebook we generate a series of helper files for downstream analysis:

* **`goID_2_name.tab`** – mapping from every GO term ID to its textual name.
* **`GO_PC_full.txt`** – parent → child edges for *biological‑process*, *molecular function* and *cellular component* terms, containing only `is_a` and `part_of` relationships, with cycles removed.
* **`GO_BP_PC_CC_full.txt`** – combines the ontology parent–child edges with human gene → BP - MF - CC annotations, ready for graph‑based enrichment tools.

The whole workflow is reproducible end‑to‑end. Feel free to adapt paths or filters to your needs.


## Download the latest ontology (`go.obo`)


In [1]:
import pathlib, urllib.request, sys

data_dir = pathlib.Path('GO_files')
data_dir.mkdir(exist_ok=True)

obo_path = data_dir / 'go.obo'
if not obo_path.exists():
    print('Downloading Gene Ontology…')
    urllib.request.urlretrieve('http://purl.obolibrary.org/obo/go.obo', obo_path)
    print('Saved to', obo_path)
else:
    print('Ontology already present – skipping download.')


Downloading Gene Ontology…
Saved to GO_files/go.obo


## Required Python packages


In [2]:
# Add any further packages here; most environments already have them.
import obonet
import networkx as nx
import pandas as pd
print('Versions → obonet', obonet.__version__, '| networkx', nx.__version__, '| pandas', pd.__version__)


Versions → obonet 1.1.1 | networkx 3.1 | pandas 2.0.3


## Load ontology into a NetworkX graph


In [3]:
obo_graph = obonet.read_obo(obo_path)
print(f'Loaded {obo_graph.number_of_nodes():,} nodes and {obo_graph.number_of_edges():,} edges.')


Loaded 40,122 nodes and 79,604 edges.


## Build `goID_2_name.tab`


In [4]:
out_path = data_dir / 'goID_2_name.tab'
with out_path.open('w') as fh:
    for go_id, attrs in obo_graph.nodes(data=True):
        fh.write(f"{go_id}	{attrs.get('name','')}\n")
print('Wrote', out_path)


Wrote GO_files/goID_2_name.tab


## Extract parent → child edges for *all* GO namespaces (BP, MF, CC) (`GO_PC_full.txt`)

In [5]:
# NOTE: This cell now captures edges from **all** three GO namespaces.
allowed_ns = {'biological_process', 'molecular_function', 'cellular_component'}

bp_edges = []
for child, parent, edge_attrs in obo_graph.edges(data=True):
    # Keep GO terms within the three main namespaces only
    if (obo_graph.nodes[child].get('namespace') not in allowed_ns or
        obo_graph.nodes[parent].get('namespace') not in allowed_ns):
        continue

    # Keep edges that do not cross namespaces
    if obo_graph.nodes[child].get('namespace') != obo_graph.nodes[parent].get('namespace'):
        continue

    # Keep strictly hierarchical relations
    relation = edge_attrs.get('relation', 'is_a')
    if relation not in {'is_a', 'part_of'}:
        continue

    # Exclude edges that would introduce a reverse path (cycle)
    if not nx.has_path(obo_graph, parent, child):
        bp_edges.append((parent, child))

pc_path = data_dir / 'GO_PC_full.txt'
with pc_path.open('w') as fh:
    fh.writelines(f"{p}\t{c}\n" for p, c in bp_edges)

print('Parent‑child edges saved:', pc_path, '| count:', len(bp_edges))


Parent‑child edges saved: GO_files/GO_PC_full.txt | count: 77932


## Guarantee that the BP graph is acyclic


In [6]:
G = nx.DiGraph(bp_edges)

# Break any residual cycles that slipped through (rare)
while not nx.is_directed_acyclic_graph(G):
    cycle = next(nx.simple_cycles(G))
    print('Breaking residual cycle:', cycle)
    # Remove the last edge in the cycle
    G.remove_edge(cycle[-1], cycle[0])

# Overwrite file with cleaned edges
with (data_dir / 'GO_PC_full.txt').open('w') as fh:
    fh.writelines(f"{u}\t{v}\n" for u, v in G.edges())

print('Graph is now a DAG →', nx.is_directed_acyclic_graph(G))


Graph is now a DAG → True


## Download human Gene Ontology annotations (GAF format)


In [7]:
gaf_path = data_dir / 'goa_human.gaf.gz'
if not gaf_path.exists():
    print('Downloading GOA Human GAF…')
    urllib.request.urlretrieve('http://current.geneontology.org/annotations/goa_human.gaf.gz', gaf_path)
    print('Saved to', gaf_path)
else:
    print('GAF already present – skipping download.')


Downloading GOA Human GAF…
Saved to GO_files/goa_human.gaf.gz


## Parse and filter GAF for human BP annotations


In [8]:

# Column indices (0-based):
#   2 → DB_Object_Symbol  | gene symbol
#   4 → GO_ID             | GO term
#  12 → Taxon             | organism ID(s)

gaf = pd.read_csv(
    gaf_path,
    sep='\t',
    comment='!',
    header=None,
    compression='gzip',
    low_memory=False
)

# Keep human annotations only (taxon:9606)
gaf = gaf[gaf[12].str.contains('taxon:9606')]

# Rename columns for clarity
gaf = gaf.rename(columns={2: 'gene_symbol', 4: 'go_id'})

# Restrict to the three main GO namespaces
allowed_ns = {'biological_process', 'molecular_function', 'cellular_component'}
allowed_terms = {n for n, d in obo_graph.nodes(data=True) if d.get('namespace') in allowed_ns}
gaf_bp = gaf[gaf['go_id'].isin(allowed_terms)].copy()

print(f'{len(gaf_bp):,} gene→GO annotations retained across BP, MF, CC.')


993,512 gene→GO annotations retained across BP, MF, CC.


## Combine ontology edges with gene annotations (`GO_BP_MF_CP_full.txt`)


In [9]:
bp_file = data_dir / 'GO_BP_MF_CC_full.txt'

ns_map = {
    'biological_process': 'BP',
    'molecular_function': 'MF',
    'cellular_component': 'CC'
}

with bp_file.open('w') as fh:
    # 1) Write the ontology parent→child edges, tagging the namespace (BP/MF/CC)
    for p, c in G.edges():
        tag = ns_map.get(obo_graph.nodes[p].get('namespace'), 'NA')
        fh.write(f"{p}\t{c}\t{tag}\n")

    # 2) Append gene→GO edges (tagged as "gene")
    for _, row in gaf_bp.iterrows():
        fh.write(f"{row.go_id}\t{row.gene_symbol}\tgene\n")

print('Wrote combined file →', bp_file)

Wrote combined file → GO_files/GO_BP_MF_CC_full.txt


## Sanity check: resulting file defines a DAG


In [10]:
df = pd.read_csv(bp_file, sep='\t', header=None, names=['parent','child','edge_type'])
check_graph = nx.DiGraph()
check_graph.add_edges_from(zip(df['parent'], df['child']))
print('Is directed acyclic graph?', nx.is_directed_acyclic_graph(check_graph))


Is directed acyclic graph? True
