# Code

Running this notebook end-to-end reproduces the solution; a step-by-step guide is also provided. You can skip some long-running steps by executing the corresponding cells of `Download.ipynb` to download the artifacts instead.

In [4]:
import sys
import os
import glob
import yaml

In [9]:
!cat config.yaml

with open('config.yaml') as f:
    CONFIG = yaml.safe_load(f)
    
BASE_PATH = CONFIG['base_path']
CONFIG_PATH = os.path.join(BASE_PATH, 'config.yaml')
RAPIDS_ENV = os.path.join(BASE_PATH, CONFIG['rapids-env'])
PYTORCH_ENV = os.path.join(BASE_PATH, CONFIG['pytorch-env'])

base_path: ./ # working dir
cafa_data_path: ../data/raw/cafa6 # working dir
# environments
rapids-env: rapids-env/bin/python
pytorch-env: pytorch-env/bin/python
# artifacts paths
embeds_path: ../features/embeds # path to embeddings 
models_path: ../models # store the models
helpers_path: ../features/helpers # store reformated datasets
temporal_path: ../features/temporal # store external data from FTP (temporal because different report dates are used)


base_models: # all models and postprocessing path
    pb_t5esm4500_raw:
        embeds: 
            - t5
            - esm_small
        conditional: false
        bp: 3000
        mf: 1000
        cc: 500
        
    pb_t5esm4500_cond:
        embeds: 
            - t5
            - esm_small
        conditional: true
        bp: 3000
        mf: 1000
        cc: 500
        
    pb_t54500_raw:
        embeds: 
            - t5
        conditional: false
        bp: 3000
        mf: 1000
        cc: 500
        
    pb_t54500_cond:
  

# 1. Preparation

### 1.1. Setup envs

Create the following python envs:

* `pytorch-env` - env to deal with all DL models
* `rapids-env`  - env to preprocess via RAPIDS and train py-boost and logregs

In [6]:
# !./create-rapids-env.sh {BASE_PATH}
# !./create-pytorch-env.sh {BASE_PATH}

Jupyter detected...
2 channel Terms of Service accepted
Channels:
 - rapidsai
 - conda-forge
 - nvidia
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 25.7.0
    latest version: 25.11.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /data/hien/CAFA-6/U900/rapids-env

  added / updated specs:
    - cuda-version=11.8
    - cudatoolkit=11.8
    - python=3.10


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
  bzip2              conda-forge/linux-64::bzip2-1.0.8-hda65f42_8 
  ca-certificates    conda-forge/noarch::ca-certificates-2025.11.12-hbd8a1cb_0 
  cuda-version       conda-forge/noarch::cuda-version-11.8-h70ddcb2_3 
  cudatoolkit        conda-fo

### 1.2. Get the input data

Here we describe what should be stored in the working dir to reproduce the results.

The following data layout is provided by Kaggle:

    ./Train - cafa train data
    ./Test (targets) - cafa test data
    ./sample_submission.tsv - cafa sample submission
    ./IA.txt - cafa IA


The following are the solution code libraries, scripts, and notebooks used for training:

    ./protlib
    ./protnn
    ./nn_solution
    
And the installed envs

    ./pytorch-env
    ./rapids-env

### 1.3. Produce the helpers data

First, we preprocess the input data and store everything in a format that is convenient for us to handle and manipulate. Here is the structure:

    ./helpers
        ./fasta - fasta files stored as feather
            ./train_seq.feather
            ./test_seq.feather
        ./real_targets - targets stored as n_proteins x n_terms parquet containing 0/1/NaN values
            ./biological_process
                ./part_0.parquet
                ...
                ./part_14.parquet
                ./nulls.pkl - NaN rate of each term
                ./priors.pkl - prior mean of each term (excluding NaN cells, like np.nanmean)
            ./cellular_component
            ./molecular_function
            

In [110]:
%%time
# parse fasta files and save as feather
!{RAPIDS_ENV} protlib/scripts/parse_fasta.py \
    --config-path {CONFIG_PATH}

# convert targets to parquet and calculate priors
# batch size 40000
!{RAPIDS_ENV} protlib/scripts/create_helpers.py \
    --config-path {CONFIG_PATH} \
    --batch-size 40000 \
    --propagate

845244it [00:00, 1013180.07it/s]
1941130it [00:00, 2044347.95it/s]
/data/hien/CAFA-6/U900
Propagate:  True
2it [09:49, 294.53s/it]
CPU times: user 3.45 s, sys: 690 ms, total: 4.14 s
Wall time: 10min 39s
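
The per-term statistics stored next to the target parts (`nulls.pkl` and `priors.pkl`) can be illustrated with a small sketch. It is not the code of `create_helpers.py`; the function name `column_stats` and the toy matrix are made up for illustration, and plain Python lists stand in for the parquet parts.

```python
import math


def column_stats(rows):
    """Return (nan_rate, prior) per column of a 0/1/NaN target matrix.

    rows: list of equal-length lists, one list per protein, one cell per term.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    nan_rate, prior = [], []
    for j in range(n_cols):
        # Keep only the cells where the label is known (not NaN).
        known = [row[j] for row in rows if not math.isnan(row[j])]
        nan_rate.append(1.0 - len(known) / n_rows)
        # Mean over known cells only, like np.nanmean; NaN if nothing is known.
        prior.append(sum(known) / len(known) if known else float("nan"))
    return nan_rate, prior


nan = float("nan")
# Hypothetical toy targets: 4 proteins x 3 terms.
targets = [
    [1.0, 0.0, nan],
    [0.0, nan, nan],
    [1.0, 1.0, nan],
    [1.0, nan, nan],
]
nulls, priors = column_stats(targets)
print(nulls)   # NaN rate of each term
print(priors)  # prior mean of each term, NaN cells excluded
```

Excluding NaN cells matters because a term that is unknown for most proteins would otherwise get a prior that is biased toward zero.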


### 1.4. Get external data

These datasets are downloaded from external sources and then processed. The first step is downloading and parsing the datasets. After parsing, the script separates the annotations by evidence code. The most important split for us is the kaggle/no-kaggle split: `kaggle` refers to experimental evidence codes, and `no-kaggle` refers to electronic labeling, which is used as features for the stacker models. Downloading takes quite a long time, while processing takes about 1 hour. The required structure after execution:

    ./temporal - extra data downloaded from http://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/
    ./labels   - extracted and propagated labeling
        ./prop_test_leak_no_dup.tsv - leakage labeling
        ./prop_test_no_kaggle.tsv   - electronic labels test
        ./prop_train_no_kaggle.tsv  - electronic labels train
        
    ./cafa-terms-diff.tsv - reproduced difference between MT's dataset and our parsed labels
    ./prop_quickgo51.tsv  - reproduced MT's quickgo 37 proteins
    
    
Other files are temporary and not needed for future work
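
The kaggle/no-kaggle split described above can be sketched as a simple filter on the evidence code of each annotation. This is only an illustration: the exact set of codes used by the parsing script is not given here, so `EXPERIMENTAL_CODES` is an assumption, and the protein and term names are placeholders.

```python
# Assumed set of experimental evidence codes; the real script may use another set.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def split_by_evidence(annotations):
    """Split (protein, term, evidence_code) rows into kaggle / no-kaggle pairs."""
    kaggle, no_kaggle = [], []
    for protein, term, code in annotations:
        if code in EXPERIMENTAL_CODES:
            kaggle.append((protein, term))
        else:
            # Everything else (e.g. IEA, electronic annotation) becomes a feature source.
            no_kaggle.append((protein, term))
    return kaggle, no_kaggle


# Placeholder annotations, not real GOA rows.
annotations = [
    ("P1", "T_a", "IDA"),
    ("P1", "T_b", "IEA"),
    ("P2", "T_a", "IEA"),
    ("P2", "T_c", "EXP"),
]
kaggle, no_kaggle = split_by_evidence(annotations)
print(kaggle)
print(no_kaggle)
```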

In [100]:
# download external data from ebi.ac.uk
# !{RAPIDS_ENV} protlib/scripts/downloads/dw_goant.py \
#     --config-path {CONFIG_PATH}

# parse the files
!{RAPIDS_ENV} protlib/scripts/parse_go_single.py \
    --file goa_uniprot_all.gaf.226.gz \
    --config-path {CONFIG_PATH} \
    --output ver226

!{RAPIDS_ENV} protlib/scripts/parse_go_single.py \
    --file goa_uniprot_all.gaf.228.gz \
    --config-path {CONFIG_PATH} \
    --output ver228

110it [02:06,  1.14s/it]^C
110it [02:07,  1.16s/it]
Traceback (most recent call last):
  File "/data/hien/CAFA-6/U900/protlib/scripts/parse_go_single.py", line 52, in <module>
    for n, batch in tqdm.tqdm(enumerate(reader)):
  File "/data/hien/.local/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1843, in __next__
    return self.get_chunk()
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1985, in get_chunk
    return self.read(nrows=size)
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_mem

The next step is propagation. Since the ebi.ac.uk datasets contain labels without propagation, we apply the rules provided in the organizer's repo to label more terms. We do this only for the `goa_uniprot_all.gaf.216.gz` dataset, since it was the current dataset during the active competition phase.
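
As a rough illustration of what propagation means, the sketch below adds every ancestor of an annotated term to the label set of the protein. It is a toy version on a hand-written graph, not `prop_tsv.py`: which edge types count as parents is decided by the organizer's rules and is not modelled here, and all names (`ancestors`, `propagate`, the term ids) are hypothetical.

```python
def ancestors(term, parents, cache):
    """Return all ancestors of a term in a DAG given as term -> list of parents."""
    if term in cache:
        return cache[term]
    result = set()
    for parent in parents.get(term, ()):
        result.add(parent)
        result |= ancestors(parent, parents, cache)
    cache[term] = result
    return result


def propagate(labels, parents):
    """Extend each protein's term set with the ancestors of its terms."""
    cache = {}
    propagated = {}
    for protein, terms in labels.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents, cache)
        propagated[protein] = full
    return propagated


# Toy ontology: leaf has two parents, both of which point to root.
parents = {
    "leaf": ["mid_a", "mid_b"],
    "mid_a": ["root"],
    "mid_b": ["root"],
    "other": ["root"],
}
labels = {"P1": {"leaf"}, "P2": {"other"}}
propagated = propagate(labels, parents)
print(sorted(propagated["P1"]))
print(sorted(propagated["P2"]))
```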

In [103]:
folder = BASE_PATH + '/temporal/ver228'

for file in glob.glob(folder + '/labels/train*') + glob.glob(folder + '/labels/test*'):
    name = folder + '/labels/prop_' + file.split('/')[-1]

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/prop_tsv.py \
        --path {file} \
        --graph {BASE_PATH}/Train/go-basic.obo \
        --output {name} \
        --device 0 \
        --batch_size 30000 \
        --batch_inner 5000

  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
100%|█████████████████████████████████████████████| 3/3 [00:32<00:00, 10.93s/it]
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_

The last part is reproducing MT's datasets, which are commonly used in public kernels. We did not use them directly, but we did use the `cafa-terms-diff` dataset, which represents the difference between our labeling, obtained by parsing the `goa_uniprot_all.gaf.216.gz` dataset, and the `all_dict.pkl` dataset provided by MT. As he states in the discussion [here](https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/discussion/404853#2329935), he used the same FTP source as we did, but our source is more recent than the public one, so the difference is effectively temporal. After analysis, we found that we can reproduce it as the difference between the `goa_uniprot_all.gaf.216.gz` and `goa_uniprot_all.gaf.214.gz` sources, so we create the `cafa-terms-diff` dataset with the given script. The only difference between the source in the Kaggle script and the one used here is deduplication: we removed duplicated protein/term pairs from the dataset, which has almost zero impact on the metric value (less than 1e-4).
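
The idea of a temporal difference with deduplication can be sketched as a set difference over protein/term pairs. This is a minimal illustration on placeholder data, not the logic of `reproduce_mt.py`; the function name `release_diff` is hypothetical.

```python
def release_diff(new_rows, old_rows):
    """Return pairs present in the newer release but not the older one, deduplicated.

    The order of first appearance in new_rows is preserved.
    """
    old = set(old_rows)
    seen = set()
    diff = []
    for pair in new_rows:
        if pair in old or pair in seen:
            continue
        seen.add(pair)
        diff.append(pair)
    return diff


# Placeholder (protein, term) pairs for two releases.
old_release = [("P1", "T1"), ("P2", "T2")]
new_release = [("P1", "T1"), ("P1", "T3"), ("P1", "T3"), ("P3", "T2"), ("P2", "T2")]
terms_diff = release_diff(new_release, old_release)
print(terms_diff)
```

The duplicated `("P1", "T3")` pair appears only once in the result, which corresponds to the deduplication mentioned above.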


In [107]:
# create datasets
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/reproduce_mt.py \
    --path {BASE_PATH}/temporal/ver228 \
    --old-path {BASE_PATH}/temporal/ver226 \
    --new-path {BASE_PATH}/temporal/ver228 \
    --graph {BASE_PATH}/Train/go-basic.obo \
    --target {BASE_PATH}/embeds/esm_small/train_ids.npy

# propagate quickgo51.tsv
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/prop_tsv.py \
    --path {BASE_PATH}/temporal/ver228/quickgo51.tsv \
    --graph {BASE_PATH}/Train/go-basic.obo \
    --output {BASE_PATH}/temporal/ver228/prop_quickgo51.tsv \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 5000

/data/hien/CAFA-6/U900
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
  mat.scatter_add((sample_ont['id'].values, sample_ont['term_id'].values), 1)
100%|█████████████████████████████████████████████| 3/3 [00:35<00:00, 11.71s/it]


### 1.5. Preparation step for neural networks

Produce some helpers for training the NN model. This creates the following data:

    ./helpers/feats
        ./train_ids_cut43k.npy
        ./Y_31466_labels.npy
        ./Y_31466_sparse_float32.npz

In [111]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/nn_solution/prepare.py \
    --config-path {CONFIG_PATH}

(537027, 3)
term
GO:0005515    33713
GO:0005634    13283
GO:0005829    13040
GO:0005886    10150
GO:0005737     9442
              ...  
GO:0010622        1
GO:0042357        1
GO:0009421        1
GO:0008764        1
GO:0061336        1
Name: count, Length: 26125, dtype: int64
aspect
BPO    16858
CCO     2651
MFO     6616
Name: term, dtype: int64
CPU times: user 8.97 ms, sys: 9 ms, total: 18 ms
Wall time: 1.83 s
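
The printed summary above (term frequencies, then the number of distinct terms per aspect) can be reproduced on a small scale with the standard library. This is a sketch of that kind of summary, assuming rows of the form `(protein, term, aspect)`; it is not taken from `prepare.py`, and the rows are placeholders.

```python
from collections import Counter, defaultdict


def summarize(rows):
    """Return term frequencies and the number of distinct terms per aspect."""
    term_counts = Counter(term for _protein, term, _aspect in rows)
    terms_by_aspect = defaultdict(set)
    for _protein, term, aspect in rows:
        terms_by_aspect[aspect].add(term)
    unique_per_aspect = {a: len(t) for a, t in sorted(terms_by_aspect.items())}
    return term_counts, unique_per_aspect


# Placeholder rows, not real training annotations.
rows = [
    ("P1", "T1", "MFO"),
    ("P2", "T1", "MFO"),
    ("P3", "T1", "MFO"),
    ("P1", "T2", "BPO"),
    ("P2", "T3", "BPO"),
]
term_counts, unique_per_aspect = summarize(rows)
print(term_counts.most_common(1))
print(unique_per_aspect)
```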


# 2. Embeddings

In [24]:
!mkdir embeds

### 2.1. T5 pretrained inference

    ./embeds
        ./t5
            ./train_ids.npy
            ./train_embeds.npy
            ./test_ids.npy
            ./test_embeds.npy

In [59]:
%%time
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/t5.py \
    --config-path {CONFIG_PATH} \
    --device 0

100%|████████████████████████████████| 82404/82404 [00:00<00:00, 1586618.68it/s]
Number of sequences in test: 224309
100%|██████████████████████████████| 224309/224309 [00:00<00:00, 1409253.02it/s]
CPU times: user 101 ms, sys: 53.7 ms, total: 155 ms
Wall time: 13.8 s


### 2.2. ESM pretrained inference

    ./embeds
        ./esm_small
            ./train_ids.npy
            ./train_embeds.npy
            ./test_ids.npy
            ./test_embeds.npy

In [60]:
%%time
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/esm2sm.py \
    --config-path {CONFIG_PATH} \
    --device 0

100%|████████████████████████████████| 82404/82404 [00:00<00:00, 1396609.88it/s]
Number of sequences in test: 224309
100%|██████████████████████████████| 224309/224309 [00:00<00:00, 1110830.53it/s]
CPU times: user 87.3 ms, sys: 171 ms, total: 258 ms
Wall time: 15.4 s
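
Each embedding folder stores ids next to the embedding rows, and some base models in `config.yaml` list both `t5` and `esm_small`. The sketch below shows one way two embedding sets could be joined into a single feature row per protein. It assumes that row `i` of an embeds file belongs to id `i` of the matching ids file, and it does not assume that both folders share the same row order. Plain lists stand in for the `.npy` arrays, and `concat_by_id` is a hypothetical helper, not part of `protlib`.

```python
def concat_by_id(ids_a, embeds_a, ids_b, embeds_b):
    """Concatenate two embedding tables row-wise, aligned on protein id.

    The output follows the order of ids_a.
    """
    row_b = {protein_id: i for i, protein_id in enumerate(ids_b)}
    return [
        list(embeds_a[i]) + list(embeds_b[row_b[protein_id]])
        for i, protein_id in enumerate(ids_a)
    ]


# Placeholder ids and tiny 2-d / 1-d "embeddings".
t5_ids = ["P1", "P2", "P3"]
t5_embeds = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
esm_ids = ["P3", "P1", "P2"]  # deliberately in a different order
esm_embeds = [[30.0], [10.0], [20.0]]

features = concat_by_id(t5_ids, t5_embeds, esm_ids, esm_embeds)
print(features)
```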


# 3. Base models

In [26]:
!mkdir models

### 3.1. Train and inference py-boost models

GBDT models description:

1) Features: T5 + taxon, targets: multilabel

2) Features: T5 + taxon, targets: conditional

3) Features: T5 + ESM + taxon, targets: multilabel

4) Features: T5 + ESM + taxon, targets: conditional

The pipeline and hyperparameters are the same for all models. The target has 4500 outputs: BP 3000, MF 1000, CC 500. All models can be run in parallel to save time. We used a single V100 32GB, and it takes about 15 hours to train the 5-fold CV loop for each model type. 32GB of GPU RAM is required; otherwise OOM will occur. The structure is:
    
    ./models
        ./pb_t54500_raw
            ./models_0.pkl
            ...
            ./models_4.pkl
            ./oof_pred.pkl
            ./test_pred.pkl
        ./pb_t54500_cond
            ...
        ./pb_t5esm4500_raw
            ...
        ./pb_t5esm4500_cond
            ...

In [84]:
for model_name in ['pb_t54500_raw', 'pb_t54500_cond', 'pb_t5esm4500_raw', 'pb_t5esm4500_cond']:

    print(f'Training {model_name}')

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/train_pb.py \
        --config-path {CONFIG_PATH} \
        --model-name {model_name} \
        --device 0


Training pb_t54500_raw
True
trg filled
trg filled
trg filled
(82404, 1056) (224309, 1056)
(65906,) (16498,)
[18:51:25] Stdout logging level is INFO.
[18:51:25] GDBT train starts. Max iter 20000, early stopping rounds 300
[18:51:29] Iter 0; Sample 0, BCE = 0.01847760927347621; 
[18:51:45] Iter 100; Sample 0, BCE = 0.016180626734453446; 
[18:52:01] Iter 200; Sample 0, BCE = 0.015253586837214339; 
[18:52:17] Iter 300; Sample 0, BCE = 0.014672117275444232; 
[18:52:33] Iter 400; Sample 0, BCE = 0.014263558791376245; 
[18:52:49] Iter 500; Sample 0, BCE = 0.013957509020161131; 
[18:53:05] Iter 600; Sample 0, BCE = 0.013711631127375232; 
[18:53:22] Iter 700; Sample 0, BCE = 0.013512374108627436; 
[18:53:38] Iter 800; Sample 0, BCE = 0.013347417028317163; 
[18:53:54] Iter 900; Sample 0, BCE = 0.013204780972392393; 
[18:54:10] Iter 1000; Sample 0, BCE = 0.013082082929463185; 
[18:54:26] Iter 1100; Sample 0, BCE = 0.012978140873277224; 
[18:54:42] Iter 1200; Sample 0, BCE = 0.012886041055240865; 
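
The 4500 outputs are split across ontologies as 3000 + 1000 + 500. How the terms inside each ontology are chosen is not stated here; the sketch below assumes the most frequent terms are kept, which is a common choice and only an assumption about this pipeline. `top_k_terms` and the toy rows are hypothetical.

```python
from collections import Counter

# Sizes copied from the base_models entries in config.yaml.
TARGET_SIZES = {"bp": 3000, "mf": 1000, "cc": 500}
total_outputs = sum(TARGET_SIZES.values())


def top_k_terms(rows, k_per_aspect):
    """Pick the k most frequent terms per aspect from (protein, term, aspect) rows."""
    counts = {aspect: Counter() for aspect in k_per_aspect}
    for _protein, term, aspect in rows:
        if aspect in counts:
            counts[aspect][term] += 1
    return {
        aspect: [term for term, _ in counts[aspect].most_common(k)]
        for aspect, k in k_per_aspect.items()
    }


# Placeholder rows with small k values so the effect is visible.
rows = [
    ("P1", "B1", "bp"), ("P2", "B1", "bp"), ("P3", "B1", "bp"),
    ("P1", "B2", "bp"), ("P2", "B2", "bp"),
    ("P3", "B3", "bp"),
    ("P1", "M1", "mf"), ("P2", "M1", "mf"),
    ("P3", "M2", "mf"),
]
selected = top_k_terms(rows, {"bp": 2, "mf": 1})
print(total_outputs)
print(selected)
```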

### 3.2. Train and inference logreg models

Logistic Regression models description:

1) Features: T5 + taxon, targets: multilabel

2) Features: T5 + taxon, targets: conditional


The pipeline and hyperparameters are the same for all models. The target has 13500 outputs: BP 10000, MF 2000, CC 1500. All models can be run in parallel to save time. We used a single V100 32GB, and training the 5-fold CV loop takes about 10 hours for model 1 and 2 hours for model 2. 32GB of GPU RAM is required; otherwise OOM will occur. The structure is:

    ./helpers
        ./folds_gkf.npy
    ./models
        ./lin_t5_raw
            ./models_0.pkl
            ...
            ./models_4.pkl
            ./oof_pred.pkl
            ./test_pred.pkl
        ./lin_t5_cond
            ...

In [85]:
for model_name in ['lin_t5_raw', 'lin_t54500_cond']:

    print(f'Training {model_name}')

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/train_lin.py \
        --config-path {CONFIG_PATH} \
        --model-name {model_name} \
        --device 0


Training lin_t5_raw
True
trg filled
trg filled
trg filled
(82404, 1056) (224309, 1056)
(65906,) (16498,)
  1%|▌                                        | 2/135 [00:56<1:01:46, 27.87s/it]^C
  1%|▌                                        | 2/135 [01:14<1:23:04, 37.48s/it]
Traceback (most recent call last):
  File "/data/hien/CAFA-6/U900/.//protlib/scripts/train_lin.py", line 121, in <module>
    model.fit(X[tr_idx], Y[tr_idx])
  File "/data/hien/CAFA-6/U900/protlib/models/logreg.py", line 67, in fit
    delta[:, k] = cp.dot(cp.linalg.inv(hess), grad[:, k])
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupy/linalg/_solve.py", line 261, in inv
    cupyx.lapack.gesv(a.T, b.T)
  File "/data/hien/CAFA-6/U900/rapids-env/lib/python3.10/site-packages/cupyx/lapack.py", line 69, in gesv
    getrf(handle, n, n, a.data.ptr, n, dwork.data.ptr, dipiv.data.ptr,
KeyboardInterrupt
Training lin_t54500_cond
^C


### 3.3. Train and inference NN models

Structure is:

    ./models
        ./nn_serg
            ./model_0_0.pt
            ...
            ./model_11_4.pt
            ./pytorch-keras-etc-3-blend-cafa-metric-etc.pkl 

In [None]:
# first, create train folds (the same as used for pb_t54500_cond model)
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/create_gkf.py \
    --config-path {CONFIG_PATH}

# train models
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/train_models.py \
    --config-path {CONFIG_PATH} \
    --device 0

# inference models
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/inference_models.py \
    --config-path {CONFIG_PATH} \
    --device 0

# reformat to use in stack
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/make_pkl.py \
    --config-path {CONFIG_PATH}

# 4. Final model

### 4.1. Train GCN models

This step trains 3 independent stacking models, one per ontology. Models are trained on a single V100 GPU, which takes about 13 hours for BP, 4 hours for MF, and 2 hours for CC. 32 GB of GPU RAM is required. Training can run in parallel if 2 GPUs are available: BP on one, MF and CC on the other. Structure:

    ./models
        ./gcn
            ./bp
                ./checkpoint.pth
            ./mf
                ./checkpoint.pth
            ./cc
                ./checkpoint.pth
                

In [None]:
%%time

for ont in ['bp', 'mf', 'cc']:
    !{PYTORCH_ENV} {BASE_PATH}/protnn/scripts/train_gcn.py \
        --config-path {CONFIG_PATH} \
        --ontology {ont} \
        --device 0

### 4.2. Inference GCN models and TTA

Inference and Test-Time-Augmentation. Structure:

    ./models
        ./gcn
            ./pred_tta_0.tsv
            ...
            ./pred_tta_3.tsv


In [112]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/protnn/scripts/predict_gcn.py \
    --config-path {CONFIG_PATH} \
    --device 0

[['./models/pb_t54500_cond', [3000, 1000, 500], True], ['./models/pb_t54500_raw', [3000, 1000, 500], False], ['./models/lin_t5_cond', [10000, 2000, 1500], True], ['./models/lin_t5_raw', [10000, 2000, 1500], False]]
  x0.index_reduce(1, dst, x0[:, src], reduce='mean', include_self=False),
100%|███████████████████████████████████████| 1753/1753 [23:34<00:00,  1.24it/s]
[['./models/pb_t5esm4500_cond', [3000, 1000, 500], True], ['./models/pb_t54500_raw', [3000, 1000, 500], False], ['./models/lin_t5_cond', [10000, 2000, 1500], True], ['./models/lin_t5_raw', [10000, 2000, 1500], False]]
 74%|████████████████████████████▉          | 1300/1753 [17:45<06:03,  1.25it/s]^C
CPU times: user 14.8 s, sys: 3.35 s, total: 18.1 s
Wall time: 42min 10s


### 4.3. Postprocessing and build submission file

Here we do the following:

1) Average TTA predictions
2) Perform min prop
3) Perform max prop
4) Average min/max prop steps, add external leakage data and make submission
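
Steps 1 to 4 can be sketched on a toy hierarchy. The sketch averages the TTA predictions, blends each score with the maximum over its children and with the minimum over its parents using the 0.3/0.7 weights from the cell comments, and then averages the two results. Several details are assumptions: the term's own score is included in the max/min, only one pass is made, the final average is unweighted, and the external leakage data is left out. All names are hypothetical.

```python
def average_ttas(tta_preds):
    """Step 1: mean score per term across TTA runs (all runs share the same terms)."""
    terms = tta_preds[0].keys()
    return {t: sum(p[t] for p in tta_preds) / len(tta_preds) for t in terms}


def blend_step(pred, relatives, reduce_fn, lr=0.7):
    """Return (1 - lr) * pred + lr * reduce_fn(own score and relatives' scores)."""
    out = {}
    for term, score in pred.items():
        pool = [score] + [pred[r] for r in relatives.get(term, ())]
        out[term] = (1.0 - lr) * score + lr * reduce_fn(pool)
    return out


# Toy hierarchy: root has two children, a and b.
children = {"root": ["a", "b"]}
parents = {"a": ["root"], "b": ["root"]}

# Placeholder scores for one protein from two TTA runs.
ttas = [
    {"root": 0.2, "a": 0.6, "b": 0.1},
    {"root": 0.4, "a": 0.8, "b": 0.3},
]

avg = average_ttas(ttas)                         # step 1
max_children = blend_step(avg, children, max)    # pulls parents up toward children
min_parents = blend_step(avg, parents, min)      # pulls children down toward parents
final = {t: 0.5 * (max_children[t] + min_parents[t]) for t in avg}  # step 4 average
print(final)
```

In this toy case the parent score rises toward its strongest child while that child's score drops toward the parent, which moves the predictions closer to being consistent with the hierarchy.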

Structure:

    ./models
        ./postproc
            ./pred.tsv     - avg TTA
            ./pred_min.tsv - min prop
            ./pred_max.tsv - max prop
            
    ./sub
        ./submission.tsv   - final results

In [113]:
# since we have 4 TTA predictions, we need to aggregate all as an average
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/collect_ttas.py \
    --config-path {CONFIG_PATH} \
    --device 0

# create 0.3 * pred + 0.7 * max children propagation
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/step.py \
    --config-path {CONFIG_PATH} \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 3000 \
    --lr 0.7 \
    --direction min

# create 0.3 * pred + 0.7 * min parents propagation
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/step.py \
    --config-path {CONFIG_PATH} \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 3000 \
    --lr 0.7 \
    --direction max

# here we average min prop and max prop solutions, mix with cafa-terms-diff and quickgo51 datasets from 1.4
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/make_submission.py \
    --config-path {CONFIG_PATH} \
    --device 0 \
    --max-rate 0.5

0it [00:00, ?it/s]
[DEBUG] Starting ID mapping check:
 - Total number of proteins in the prediction file (sub.tsv): 0           O14503
1           O14503
2           O14503
3           O14503
4           O14503
             ...  
27772794    Q08118
27774851    Q8MKL1
27774852    Q8VE95
27774853    Q8CDV0
27774854    Q8GWI2
Name: entry_id, Length: 19852943, dtype: object
 - Total number of proteins present in the labels (validation): entry_id
A0A098D1J7        0
A0A0P0XII1        1
A0A0R4IB93        2
A0A1D8PDP8        3
A0A1W2P7I0        4
              ...  
Q9ZW22        79026
Q9ZW81        79027
R9QMR2        79028
T2HG31        79029
W8JIS5        79030
Name: index, Length: 79031, dtype: int64
 - Total number of prediction rows: 19852943
 - Number of NaN rows (unmatched IDs): 0

1it [00:15, 15.95s/it]                                    | 0/3 [00:00<?, ?it/s]
2it [00:27, 13.21s/it]                            | 1/3 [00:12<00:25, 12.80s/it]
3it [00:38, 12.48s/it]█████████████               | 2/3 [00:24<00:11, 11.91s/it]

# Result

The result is stored in `./sub/submission.tsv`.

In [114]:
!head {BASE_PATH}/sub/submission.tsv

Q6DER1	GO:0005654	0.066
Q6B8W6	GO:0031974	0.063
P54956	GO:0051179	0.012
O02747	GO:1900428	0.009
Q66HF8	GO:0006720	0.025
Q09265	GO:0019538	0.004
O01835	GO:0010605	0.636
Q4R4R0	GO:0043933	0.028
A5DNX9	GO:0010605	0.023
B4KA44	GO:0051234	0.024
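
The rows above are tab-separated triples of protein id, GO term, and score, with no header. A minimal writer for that layout could look like the sketch below. The three-decimal formatting is inferred from the printed rows rather than from `make_submission.py`, and `write_submission` is a hypothetical helper.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (protein, term, score) rows as tab-separated lines without a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for protein, term, score in rows:
        writer.writerow([protein, term, f"{score:.3f}"])


# Two rows taken from the output above, written to an in-memory buffer.
buffer = io.StringIO()
write_submission(
    [("Q6DER1", "GO:0005654", 0.066), ("O01835", "GO:0010605", 0.636)],
    buffer,
)
text = buffer.getvalue()
print(text, end="")
```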
