# Gene Ontology Annotations (GOA) Preprocessing
Author: Cleverson Matiolli, Ph.D.

This notebook focuses on preprocessing the Gene Ontology Annotations (GOA) to:
1. Obtain the *ground-truth* protein-GO term associations
2. Calculate the *Information Content (IC)* of GO terms

**Key Steps:**
1. Download and parse GOA dataset
2. Filter GOA dataset by Evidence Codes

In [5]:
# Standard libraries
import os
import math
import ftplib
from pathlib import Path
import gzip

# Third-party imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import bitmath

# # Fast I/O imports
# import zstandard as zstd
# import pyarrow.csv as pv
# import pyarrow as pa
# import dask.dataframe as dd
# import cudf
# from dask_cuda import LocalCUDACluster
# from dask.distributed import Client

# Bioinformatics
from Bio import SeqIO
from obonet import read_obo

# Custom libraries
import aid2go.goa as aidgoa
import aid2go.utils as aidutils

# Configuration
pd.options.mode.copy_on_write = True

## 1. Download and parse GOA dataset

- **Source:** ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT.

- **Annotations (goa_uniprot_gcrp.gaf):** Contains all GO annotations for canonical accessions from the UniProt reference proteomes of all species, which provide one protein per gene. The *reference proteomes comprise the protein sequences annotated in Swiss-Prot or the longest TrEMBL transcript if there is no Swiss-Prot record*.

- **Metadata (goa_uniprot_gcrp.gpi):** Contains metadata (name, symbol, synonyms, etc.) for all canonical entries from the UniProt reference proteomes of all species, regardless of whether they have GO annotations.
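For orientation, a GAF 2.2 file is tab-separated with 17 columns, and its header lines start with `!`. The sketch below parses a small in-memory sample using only the standard library. The snake_case column labels, the `parse_gaf` helper, and the sample rows are illustrative assumptions (toy data, not real annotations), and this is not how `aid2go` reads the file.

```python
import csv
import io

# Illustrative snake_case labels for the 17 columns of a GAF 2.2 line.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(handle):
    """Yield one dict per annotation line, skipping '!' header lines."""
    data_lines = (line for line in handle if not line.startswith("!"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 15:  # blank or truncated line (columns 16-17 are optional)
            continue
        row += [""] * (len(GAF_COLUMNS) - len(row))  # pad optional columns
        yield dict(zip(GAF_COLUMNS, row))


def toy_line(protein, go_id, evidence, aspect):
    """Build one made-up GAF line with all 17 fields (toy data only)."""
    fields = [
        "UniProtKB", protein, "GENE", "enables", go_id, "PMID:0", evidence,
        "", aspect, "", "", "protein", "taxon:9606", "20240101", "UniProt",
        "", "",
    ]
    return "\t".join(fields) + "\n"


SAMPLE_GAF = (
    "!gaf-version: 2.2\n"
    + toy_line("TOY00001", "GO:0000001", "IDA", "F")
    + toy_line("TOY00002", "GO:0000002", "IEA", "P")
)

records = list(parse_gaf(io.StringIO(SAMPLE_GAF)))
for rec in records:
    print(rec["db_object_id"], rec["go_id"], rec["evidence_code"])
```

Filtering `!` lines by their first character, rather than treating `!` as a comment marker anywhere in the line, avoids truncating rows whose free-text fields happen to contain that character.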

In [7]:
# Login to FTP GOA directory
ftp_url = "ftp.ebi.ac.uk"
ftp_directory = "/pub/databases/GO/goa/UNIPROT/"
ftp = ftplib.FTP(ftp_url)
ftp.login()  # Anonymous login
ftp.cwd(ftp_directory)

'250 Directory successfully changed.'

In [8]:
# Download annotations (GCRP-SwissProt)
aidutils.download_ftp(
    ftp_url,
    ftp_directory,
    "./data/goa",
    "goa_uniprot_gcrp.gaf.gz",
)


The size of goa_uniprot_gcrp.gaf.gz is 6859.50 MB
Download progress: 100.00% (6859.50 MB)
Downloaded goa_uniprot_gcrp.gaf.gz successfully to data/goa/goa_uniprot_gcrp.gaf.gz


In [9]:
# Download metadata
aidutils.download_ftp(
    ftp_url,
    ftp_directory,
    "./data/goa",
    "goa_uniprot_gcrp.gpi.gz",
)


The size of goa_uniprot_gcrp.gpi.gz is 1489.65 MB
Download progress: 100.00% (1489.65 MB)
Downloaded goa_uniprot_gcrp.gpi.gz successfully to data/goa/goa_uniprot_gcrp.gpi.gz
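`aidutils.download_ftp` is a project helper whose source is not shown in this notebook. As a rough idea of what such a helper could do, the sketch below relies only on standard `ftplib` calls (`voidcmd`, `size`, `retrbinary`). The function name `download_with_progress` and its parameters are hypothetical, and the real helper may work differently.

```python
import ftplib
from pathlib import Path


def download_with_progress(ftp, filename, dest_dir, blocksize=1024 * 1024):
    """Download `filename` from the connection's current directory.

    `ftp` is an already logged-in ftplib.FTP (or compatible) object.
    Returns the path of the downloaded file.
    """
    dest = Path(dest_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)

    ftp.voidcmd("TYPE I")       # binary mode, so SIZE reports a byte count
    total = ftp.size(filename)  # may be None if the server lacks SIZE
    done = 0

    with open(dest, "wb") as out:
        def write_block(block):
            nonlocal done
            out.write(block)
            done += len(block)
            if total:
                print(f"\rDownload progress: {100 * done / total:.2f}%", end="")

        ftp.retrbinary(f"RETR {filename}", write_block, blocksize=blocksize)
    print()
    return dest


# Example call (not executed here, since it needs network access):
# with ftplib.FTP("ftp.ebi.ac.uk") as ftp:
#     ftp.login()  # anonymous login
#     ftp.cwd("/pub/databases/GO/goa/UNIPROT/")
#     download_with_progress(ftp, "goa_uniprot_gcrp.gpi.gz", "./data/goa")
```

Writing each block to disk as it arrives keeps memory use flat, which matters for multi-gigabyte files like the ones above.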


## 2. Filter GOA dataset by Evidence Codes

- **Obsolete Terms**: The Gene Ontology is updated regularly to add new terms and to correct or refine the definitions of existing ones. Terms marked obsolete are no longer valid for annotation, so associations that use them are removed.

- **Code list**: The AID2GO library provides a list of evidence codes (EVS) and their definitions. The evidence codes used for the automatic annotation experiments cover three groups, as defined by CAFA5: **experimental**, **high-throughput**, and **inferred by curator or traceable author statement**.
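For reference, the three groups can be written out as plain dictionaries that map each code to its Gene Ontology definition. The dictionary names below are illustrative and may not match the attributes exposed by `aid2go.goa`. The codes are the ones selected in this notebook.

```python
# Evidence codes grouped as in the text above (dictionary names are illustrative).
EXPERIMENTAL = {
    "EXP": "Inferred from Experiment",
    "IDA": "Inferred from Direct Assay",
    "IPI": "Inferred from Physical Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IEP": "Inferred from Expression Pattern",
}
HIGH_THROUGHPUT = {
    "HTP": "Inferred from High Throughput Experiment",
    "HDA": "Inferred from High Throughput Direct Assay",
    "HMP": "Inferred from High Throughput Mutant Phenotype",
    "HGI": "Inferred from High Throughput Genetic Interaction",
    "HEP": "Inferred from High Throughput Expression Pattern",
}
STATEMENT = {
    "IC": "Inferred by Curator",
    "TAS": "Traceable Author Statement",
}

# Dicts preserve insertion order, so unpacking yields the codes in group order.
selected = [*EXPERIMENTAL, *HIGH_THROUGHPUT, *STATEMENT]
print(selected)
```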

In [10]:
# Get evidence codes (experimental, high-throughput, and curator/author statement)
experiment_codes = list(aidgoa.experimental_evidence_codes.keys())
highthroughput_codes = list(aidgoa.highthroughput_evidence_codes.keys())
expert_codes = list(aidgoa.statement_evidence_codes.keys())

selected_evidence_codes = experiment_codes + highthroughput_codes + expert_codes
print(f"Selected evidence codes: {selected_evidence_codes}")

Selected evidence codes: ['EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'HTP', 'HDA', 'HMP', 'HGI', 'HEP', 'IC', 'TAS']


In [None]:
# Filter GOA dataset
gaf_file = Path("./data/goa/goa_uniprot_gcrp.gaf.gz")
save_path = Path("./data/goa")
save_path.mkdir(parents=True, exist_ok=True)

associations_df, annot_df, evidence_freq = aidgoa.filter_goa_by_evidence(
    gaf_file=gaf_file,
    evidence_codes=selected_evidence_codes,  # experimental + high-throughput + statement codes
    chunk_size=20_000_000,
    remove_obsolete_terms=True,
    generate_annot_file=True,
    save_path=save_path,
)

print(associations_df.shape)
associations_df.head()

Loading and filtering GAF file in chunks:   2%|▏         | 1/59 [00:29<28:51, 29.85s/it]
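The internals of `aidgoa.filter_goa_by_evidence` are not shown here. The general idea of filtering by evidence code can be sketched on a small gzip-compressed sample: stream the file line by line, drop `!` header lines, keep rows whose evidence code (column 7) is in the selected set, and tally how often each kept code occurs. The function name `filter_gaf_by_evidence`, the helper `toy_row`, and the sample rows are hypothetical, and obsolete-term removal is not covered.

```python
import gzip
import io
from collections import Counter


def filter_gaf_by_evidence(handle, evidence_codes):
    """Return (pairs, counts) for rows whose evidence code is selected.

    pairs  -- set of unique (protein accession, GO term) tuples
    counts -- Counter of kept rows per evidence code
    """
    wanted = set(evidence_codes)
    pairs = set()
    counts = Counter()
    for line in handle:
        if line.startswith("!"):  # GAF header/comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:       # blank or truncated line
            continue
        protein, go_id, code = fields[1], fields[4], fields[6]
        if code in wanted:
            pairs.add((protein, go_id))
            counts[code] += 1
    return pairs, counts


def toy_row(protein, go_id, code):
    """Build one made-up GAF line (toy data only)."""
    fields = [
        "UniProtKB", protein, "GENE", "enables", go_id, "PMID:0", code,
        "", "F", "", "", "protein", "taxon:9606", "20240101", "UniProt",
        "", "",
    ]
    return "\t".join(fields) + "\n"


raw = (
    "!gaf-version: 2.2\n"
    + toy_row("TOY1", "GO:0000001", "IDA")
    + toy_row("TOY1", "GO:0000001", "IEA")  # electronic annotation, filtered out
    + toy_row("TOY2", "GO:0000002", "TAS")
)
compressed = gzip.compress(raw.encode("utf-8"))

# gzip.open accepts a file object, so the sample never touches the disk.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    pairs, counts = filter_gaf_by_evidence(handle, ["IDA", "TAS"])

print(sorted(pairs))
print(dict(counts))
```

Streaming keeps only the filtered pairs in memory, which serves the same purpose as the chunked loading used above for the full multi-gigabyte file.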

In [None]:
# Load preprocessed associations
associations_df = pd.read_csv("./data/goa/goa_hc.tsv", sep="\t")
print(associations_df.shape)
associations_df.head()

In [None]:
# Unique GO terms and proteins
unique_goterms = associations_df["go_id"].unique()
print(f"Number of unique GO terms: {len(unique_goterms)}")

unique_proteins = associations_df["uniprot_id"].unique()
print(f"Number of unique proteins: {len(unique_proteins)}")
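The second goal stated at the top of the notebook, the Information Content (IC) of GO terms, can be derived from these associations. The following is a minimal sketch, assuming the common annotation-frequency definition IC(t) = -log2 p(t), where p(t) is the fraction of annotated proteins that carry term t. It does not propagate annotations to ancestor terms, which IC calculations on GO usually do, so it may differ from the method used later in this project. The function name and the toy pairs are hypothetical.

```python
import math
from collections import defaultdict


def information_content(pairs):
    """Map each GO term to log2(N / n_t), which equals -log2(n_t / N).

    N is the number of distinct proteins, n_t the number annotated with term t.
    """
    proteins_by_term = defaultdict(set)
    all_proteins = set()
    for protein, go_id in pairs:
        proteins_by_term[go_id].add(protein)
        all_proteins.add(protein)
    n_total = len(all_proteins)
    return {
        term: math.log2(n_total / len(proteins))
        for term, proteins in proteins_by_term.items()
    }


# Toy (protein, GO term) pairs: 4 proteins, terms of decreasing frequency.
toy_pairs = [
    ("P1", "GO:A"), ("P2", "GO:A"), ("P3", "GO:A"), ("P4", "GO:A"),
    ("P1", "GO:B"), ("P2", "GO:B"),
    ("P1", "GO:C"),
]
ic = information_content(toy_pairs)
print(ic)
```

With the notebook's data, the pairs could be taken from `associations_df[["uniprot_id", "go_id"]].itertuples(index=False)`. Rarer terms receive higher IC, and a term annotated to every protein receives zero.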