# Hardware requirements
- GPU: 8 x A100 80GB
- CPU: AMD, 256 cores; RAM: 2TB
- Disk Storage: 2TB (At least 300GB for this project)
# Code

In [4]:
import sys
import os
import glob
import yaml

In [9]:
!cat config.yaml

with open('config.yaml') as f:
    CONFIG = yaml.safe_load(f)
    
BASE_PATH = CONFIG['base_path']
CONFIG_PATH = os.path.join(BASE_PATH, 'config.yaml')
RAPIDS_ENV = os.path.join(BASE_PATH, CONFIG['rapids-env'])
PYTORCH_ENV = os.path.join(BASE_PATH, CONFIG['pytorch-env'])

base_path: ./ # working dir
cafa_data_path: ../data/raw/cafa6 # working dir
# environments
rapids-env: rapids-env/bin/python
pytorch-env: pytorch-env/bin/python
# artifacts paths
embeds_path: ../features/embeds # path to embeddings 
models_path: ../models # store the models
helpers_path: ../features/helpers # store reformated datasets
temporal_path: ../features/temporal # store external data from FTP (temporal because different report dates are used)


base_models: # all models and postprocessing path
    pb_t5esm4500_raw:
        embeds: 
            - t5
            - esm_small
        conditional: false
        bp: 3000
        mf: 1000
        cc: 500
        
    pb_t5esm4500_cond:
        embeds: 
            - t5
            - esm_small
        conditional: true
        bp: 3000
        mf: 1000
        cc: 500
        
    pb_t54500_raw:
        embeds: 
            - t5
        conditional: false
        bp: 3000
        mf: 1000
        cc: 500
        
    pb_t54500_cond:
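
Each entry under `base_models` lists the embeddings to use, a `conditional` flag, and one number per ontology (`bp`, `mf`, `cc`). The sketch below shows one way such entries could be summarized after the YAML is loaded. The helper name `summarize_base_models` and the inline dict are illustrative and not part of the project; reading `bp`/`mf`/`cc` as per-ontology term counts is an assumption based on the key names.

```python
def summarize_base_models(config):
    """Map each base model name to (embeds, conditional, total_terms)."""
    summary = {}
    for name, spec in config["base_models"].items():
        # Assumption: bp/mf/cc are the numbers of terms kept per ontology.
        total_terms = spec["bp"] + spec["mf"] + spec["cc"]
        summary[name] = (tuple(spec["embeds"]), spec["conditional"], total_terms)
    return summary


# Inline stand-in for the dict returned by yaml.safe_load(config.yaml).
example_config = {
    "base_models": {
        "pb_t5esm4500_raw": {
            "embeds": ["t5", "esm_small"],
            "conditional": False,
            "bp": 3000, "mf": 1000, "cc": 500,
        },
        "pb_t54500_raw": {
            "embeds": ["t5"],
            "conditional": False,
            "bp": 3000, "mf": 1000, "cc": 500,
        },
    }
}

summary = summarize_base_models(example_config)
for name, (embeds, conditional, total_terms) in summary.items():
    print(name, embeds, conditional, total_terms)
```

For the entries shown, the three numbers add up to 4500, which seems to be what the `4500` in the model names refers to.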
  

# 1. Preparation

### 1.1. Setup envs

Create the following Python envs:

* `pytorch-env` - env to deal with all DL models
* `rapids-env`  - env to preprocess via RAPIDS and train py-boost and logregs

**You should run this in a terminal, outside the notebook**

In [None]:
!./create-rapids-env.sh {BASE_PATH}
!./create-pytorch-env.sh {BASE_PATH}

Jupyter detected...
2 channel Terms of Service accepted
Channels:
 - rapidsai
 - conda-forge
 - nvidia
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 25.7.0
    latest version: 25.11.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /data/hien/CAFA-6/U900/rapids-env

  added / updated specs:
    - cuda-version=11.8
    - cudatoolkit=11.8
    - python=3.10


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
  bzip2              conda-forge/linux-64::bzip2-1.0.8-hda65f42_8 
  ca-certificates    conda-forge/noarch::ca-certificates-2025.11.12-hbd8a1cb_0 
  cuda-version       conda-forge/noarch::cuda-version-11.8-h70ddcb2_3 
  cudatoolkit        conda-fo

### 1.2. Get the input data

Here we describe what should be stored in the working dir to reproduce the results.

The following data layout is provided by Kaggle:

    ./Train - cafa train data
    ./Test - cafa test data
    ./sample_submission.tsv - cafa sample submission
    ./IA.txt - cafa information accretion (IA) weights
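
Before running the preparation scripts, it can help to confirm that the layout above is in place. A minimal sketch, assuming the four entries listed are all that is required; `missing_inputs` is a hypothetical helper and the demo uses a temporary directory rather than the real data path:

```python
import os
import tempfile

# Entries from the Kaggle data layout described above.
EXPECTED_INPUTS = ["Train", "Test", "sample_submission.tsv", "IA.txt"]


def missing_inputs(data_dir, expected=EXPECTED_INPUTS):
    """Return the expected entries that are absent from data_dir."""
    return [
        name for name in expected
        if not os.path.exists(os.path.join(data_dir, name))
    ]


# Demo on a throwaway directory that contains only Train/ and IA.txt.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "Train"))
    open(os.path.join(demo_dir, "IA.txt"), "w").close()
    missing = missing_inputs(demo_dir)

print("missing:", missing)
```

In the notebook, the same function could be called with the directory that holds the Kaggle data.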

In [None]:
# Standard data
!{RAPIDS_ENV} standard_data/standard.py
!{RAPIDS_ENV} standard_data/check_ia.py
# If you want to use taxonomy features, add this to preprocess.py
!{RAPIDS_ENV} standard_data/get_tax_list.py

The following are the solution code libraries, scripts, and notebooks used for training:

    ./protlib
    ./protnn
    ./nn_solution
    
And the installed envs:

    ./pytorch-env
    ./rapids-env

### 1.3. Produce the helpers data

First, we preprocess the input data to store everything in a format that is convenient for us to handle and manipulate. Here is the structure:

    ./helpers
        ./fasta - fasta files stored as feather
            ./train_seq.feather
            ./test_seq.feather
        ./real_targets - targets stored as n_proteins x n_terms parquet containing 0/1/NaN values
            ./biological_process
                ./part_0.parquet
                ...
                ./part_14.parquet
                ./nulls.pkl - NaN rate of each term
                ./priors.pkl - prior mean of each term (excluding NaN cells, like np.nanmean)
            ./cellular_component
            ./molecular_function
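
The two pickles stored next to the target parts hold per-term statistics: the NaN rate (`nulls.pkl`) and the mean over known cells only (`priors.pkl`). The sketch below illustrates those two definitions on a toy 0/1/NaN matrix in plain Python; `term_stats` is a hypothetical helper, not the code in `create_helpers.py`, and on real arrays `np.nanmean` over the protein axis gives the same priors.

```python
import math


def term_stats(targets):
    """Compute per-term NaN rate and prior from an n_proteins x n_terms matrix.

    targets is a list of rows; each cell is 0, 1, or NaN (unknown).
    """
    n_proteins = len(targets)
    n_terms = len(targets[0])
    nulls, priors = [], []
    for j in range(n_terms):
        column = [row[j] for row in targets]
        known = [value for value in column if not math.isnan(value)]
        # Share of proteins whose label for this term is unknown.
        nulls.append(1 - len(known) / n_proteins)
        # Mean over known cells only; NaN if the term is never observed.
        priors.append(sum(known) / len(known) if known else math.nan)
    return nulls, priors


nan = math.nan
toy_targets = [
    [1, 0, nan],
    [0, nan, nan],
    [1, 1, 1],
    [1, nan, 0],
]
nulls, priors = term_stats(toy_targets)
print("nulls: ", nulls)
print("priors:", priors)
```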
            

In [None]:
%%time
# parse fasta files and save as feather
!{RAPIDS_ENV} protlib/scripts/parse_fasta.py \
    --config-path {CONFIG_PATH}

# convert targets to parquet and calculate priors
!{RAPIDS_ENV} protlib/scripts/create_helpers.py \
    --config-path {CONFIG_PATH} \
    --batch-size 40000 \
    --propagate

845244it [00:00, 1013180.07it/s]
1941130it [00:00, 2044347.95it/s]
/data/hien/CAFA-6/U900
Propagate:  True
2it [09:49, 294.53s/it]
CPU times: user 3.45 s, sys: 690 ms, total: 4.14 s
Wall time: 10min 39s


### 1.4. Get external data

These datasets are downloaded from an external source and then processed. The first step is downloading and parsing the datasets. After parsing, the script separates the datasets by evidence code. The most important split for us is the kaggle/no-kaggle split: `kaggle` refers to annotations with experimental evidence codes, and `no-kaggle` refers to electronic labeling, which is used as features for the stacker models. Downloading takes quite a long time, while processing takes about 1 hour. The required structure after execution:

    ./temporal - extra data downloaded from http://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/
    ./labels   - extracted and propagated labeling
        ./prop_test_leak_no_dup.tsv - labels from newest version
        ./prop_test_no_kaggle.tsv   - electronic labels test
        ./prop_train_no_kaggle.tsv  - electronic labels train
        
    ./cafa-terms-diff.tsv - labels that differ between the 2 most recent versions
    ./prop_quickgo51.tsv  - labels from old version
    
    
Other files are temporary and are not needed for future work.
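
The kaggle/no-kaggle split described above can be pictured as a filter on the evidence-code column of a GAF file (column 7; column 2 is the protein ID and column 5 the GO ID). This is a minimal sketch, not the parsing script itself: the set of codes treated as experimental is an assumption (the GO experimental evidence codes), sending every other code to `no-kaggle` is also an assumption, and the rows are made up for illustration.

```python
# Assumption: "experimental" means these GO evidence codes.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def split_by_evidence(gaf_lines):
    """Split GAF rows into (kaggle, no_kaggle) lists of (protein, term, code)."""
    kaggle, no_kaggle = [], []
    for line in gaf_lines:
        # GAF comment lines start with '!'; skip them and blank lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        protein, go_id, evidence = fields[1], fields[4], fields[6]
        target = kaggle if evidence in EXPERIMENTAL_CODES else no_kaggle
        target.append((protein, go_id, evidence))
    return kaggle, no_kaggle


def make_row(protein, go_id, evidence, aspect):
    """Build a made-up GAF row with only the columns used here filled in."""
    return "\t".join(
        ["UniProtKB", protein, "", "", go_id, "", evidence, "", aspect]
    )


toy_gaf = [
    "!gaf-version: 2.2",
    make_row("P00001", "GO:0005215", "IDA", "F"),
    make_row("P00001", "GO:0009058", "IEA", "P"),
    make_row("P00002", "GO:0043229", "IEA", "C"),
]
kaggle_rows, no_kaggle_rows = split_by_evidence(toy_gaf)
print(len(kaggle_rows), "experimental;", len(no_kaggle_rows), "other")
```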

First, **run this in a terminal** under **tmux**; it uses 16 threads by default.

Remember to install `aria2c` first, e.g. with `conda install -c conda-forge aria2`

In [None]:
# download external data from ebi.ac.uk
!{RAPIDS_ENV} protlib/scripts/downloads/dw_goant.py --config-path {CONFIG_PATH}

Then, process the files:

In [None]:
# parse the files
!{RAPIDS_ENV} protlib/scripts/parse_go_single.py \
    --file goa_uniprot_all.gaf.226.gz \
    --config-path {CONFIG_PATH} \
    --output ver226

!{RAPIDS_ENV} protlib/scripts/parse_go_single.py \
    --file goa_uniprot_all.gaf.228.gz \
    --config-path {CONFIG_PATH} \
    --output ver228

270it [05:14,  1.18s/it]^C
270it [05:14,  1.17s/it]
Traceback (most recent call last):
  File "/data/hien/CAFA-6/U900/protlib/scripts/parse_go_single.py", line 52, in <module>
    for n, batch in tqdm.tqdm(enumerate(reader)):
  File "/data/hien/.local/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1843, in __next__
    return self.get_chunk()
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1985, in get_chunk
    return self.read(nrows=size)
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_mem

The next step is propagation. Since the ebi.ac.uk datasets contain labels without propagation, we apply the rules provided in the organizers' repo to label more terms. We do this only for the `goa_uniprot_all.gaf.228.gz` dataset, since it is the current dataset during the active competition phase.
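
Propagation here means that a protein annotated with a term is also labeled with that term's ancestors in the GO graph. The sketch below shows the idea on a toy parent map with made-up term IDs. It is not the GPU implementation in `prop_tsv.py`, and which relation types from `go-basic.obo` are followed is left to the organizers' rules.

```python
def ancestors(term, parents):
    """Collect every ancestor of term by walking the parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(labels, parents):
    """Extend each protein's term set with the ancestors of its terms."""
    propagated = {}
    for protein, terms in labels.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        propagated[protein] = full
    return propagated


# Toy graph (hypothetical IDs): T:leaf -> T:mid -> T:root, T:other -> T:root.
toy_parents = {
    "T:leaf": ["T:mid"],
    "T:mid": ["T:root"],
    "T:other": ["T:root"],
}
toy_labels = {"P1": {"T:leaf"}, "P2": {"T:other"}}

propagated = propagate(toy_labels, toy_parents)
for protein in sorted(propagated):
    print(protein, sorted(propagated[protein]))
```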

In [118]:
folder = BASE_PATH + '/temporal/ver228'

for file in glob.glob(folder + '/labels/train*') + glob.glob(folder + '/labels/test*'):
    name = folder + '/labels/prop_' + file.split('/')[-1]

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/prop_tsv.py \
        --path {file} \
        --graph {BASE_PATH}/Train/go-basic.obo \
        --output {name} \
        --device 0 \
        --batch_size 30000 \
        --batch_inner 5000

^C
^C
^C


We used the `cafa-terms-diff` dataset, which represents the difference between our labeling obtained by parsing the `goa_uniprot_all.gaf.228.gz` dataset and the one obtained from `goa_uniprot_all.gaf.226.gz`. This difference is therefore temporal: it reflects how the labels changed between the two releases. We removed duplicated protein/term pairs from the dataset.
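
As a rough illustration of the diff described above, the sketch below keeps the protein/term pairs that appear in the newer labeling but not in the older one, dropping duplicates along the way. It assumes the diff is one-directional (new minus old); `terms_diff` is a hypothetical helper, not the logic of `reproduce_mt.py`, and the pairs are made up.

```python
def terms_diff(old_pairs, new_pairs):
    """Return pairs present in new_pairs but not in old_pairs, without duplicates."""
    old = set(old_pairs)
    seen = set()
    diff = []
    for pair in new_pairs:
        if pair not in old and pair not in seen:
            seen.add(pair)
            diff.append(pair)  # keep first-seen order
    return diff


# Made-up (protein, term) pairs standing in for the two parsed releases.
old_release = [("P1", "GO:A"), ("P2", "GO:B")]
new_release = [
    ("P1", "GO:A"),
    ("P2", "GO:B"),
    ("P2", "GO:C"),
    ("P3", "GO:A"),
    ("P2", "GO:C"),  # duplicate row
]

diff = terms_diff(old_release, new_release)
print(diff)
```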


In [119]:
# create datasets
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/reproduce_mt.py \
    --path {BASE_PATH}/temporal/ver228 \
    --old-path {BASE_PATH}/temporal/ver226 \
    --new-path {BASE_PATH}/temporal/ver228 \
    --graph {BASE_PATH}/Train/go-basic.obo \
    --target {BASE_PATH}/embeds/esm_small/train_ids.npy

# propagate quickgo51.tsv
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/prop_tsv.py \
    --path {BASE_PATH}/temporal/ver228/quickgo51.tsv \
    --graph {BASE_PATH}/Train/go-basic.obo \
    --output {BASE_PATH}/temporal/ver228/prop_quickgo51.tsv \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 5000

/data/hien/CAFA-6/U900
^C
^C
Traceback (most recent call last):
  File "/data/hien/CAFA-6/U900/.//protlib/scripts/prop_tsv.py", line 40, in <module>
    import cupy as cp
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupy/__init__.py", line 30, in <module>
    import cupyx as _cupyx  # NOQA
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupyx/__init__.py", line 8, in <module>
    from cupyx import linalg  # NOQA
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupyx/linalg/__init__.py", line 2, in <module>
    from cupyx.linalg import sparse  # NOQA
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupyx/linalg/sparse/__init__.py", line 3, in <module>
    from cupyx.linalg.sparse._solve import lschol  # NOQA
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupyx/linalg/sparse/_solve.py", line 6, in <module>
    from cupyx.scipy import sparse
  File "/data/hien/CAFA-6/U900/r

In [None]:
# here we mix solutions with cafa-terms-diff and quickgo51 datasets from 1.4
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/make_submission2.py \
    --config-path {CONFIG_PATH} \
    --input-file "sub/submission.tsv" \
    --device 0

Reading main submission: sub/submission-308.tsv
Loading GOA leak...
Loading QuickGO51...
Loading Diff terms...
Concatenating all sources...
Parsing OBO graph for namespaces...
Saving final submission to: ./sub/submission.tsv
Done.
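
The log above only says that the sources are concatenated; how overlapping predictions are resolved is not shown. One plausible way to mix a main submission with external label sources is sketched below, under the loudly stated assumption that the highest score wins for each protein/term pair. `mix_sources` and all rows are hypothetical and may differ from what `make_submission2.py` does.

```python
def mix_sources(main_rows, extra_sources):
    """Merge (protein, term, score) rows, keeping the highest score per pair.

    Assumption: on conflicts the maximum score is kept.
    """
    best = {}
    for rows in [main_rows, *extra_sources]:
        for protein, term, score in rows:
            key = (protein, term)
            if key not in best or score > best[key]:
                best[key] = score
    return [(protein, term, score) for (protein, term), score in sorted(best.items())]


# Made-up rows: model predictions plus two external sources scored 1.0.
main_rows = [("P1", "GO:A", 0.4), ("P2", "GO:B", 0.9)]
leak_rows = [("P1", "GO:A", 1.0)]
diff_rows = [("P3", "GO:C", 1.0)]

mixed = mix_sources(main_rows, [leak_rows, diff_rows])
for row in mixed:
    print(*row, sep="\t")
```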


# Result

The result is stored in `./sub/submission.tsv`.

In [146]:
!head {BASE_PATH}/sub/submission.tsv

P62500	GO:0030217	0.0265
Q06449	GO:0005215	0.015
Q9M4C1	GO:0043229	0.961
P59368	GO:0017080	0.549
Q9Y7B3	GO:0045892	0.016
Q9FMQ1	GO:0009416	0.0185
P42335	GO:0040014	0.004
Q9USW8	GO:0051130	0.0115
P04759	GO:0038023	0.91425
Q2SX36	GO:0009058	0.4485
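
Each row of the submission holds a protein ID, a GO term, and a score, separated by tabs. A small sanity check along those lines is sketched below. `parse_submission` is a hypothetical helper, and the requirement that scores lie in (0, 1] is an assumption rather than something stated in this notebook.

```python
def parse_submission(lines):
    """Parse tab-separated (protein, term, score) rows and validate each one."""
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        protein, term, score = fields[0], fields[1], float(fields[2])
        if not term.startswith("GO:"):
            raise ValueError(f"line {number}: unexpected term {term!r}")
        # Assumption: valid scores lie in (0, 1].
        if not 0.0 < score <= 1.0:
            raise ValueError(f"line {number}: score {score} out of range")
        rows.append((protein, term, score))
    return rows


# Two rows copied from the head of the submission shown above.
sample_lines = [
    "P62500\tGO:0030217\t0.0265\n",
    "Q9M4C1\tGO:0043229\t0.961\n",
]
rows = parse_submission(sample_lines)
print(len(rows), "rows parsed")
```

To check the real file, the lines can be read with `open(path)` and passed to the same function.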
