# Benchmarking ProstT5 target class bias
In previous projects we observed that ProstT5 tends to be biased towards majority target classes.
This notebook investigates that tendency.

## Prep test data
For testing, use the 100 longest entries in SCOPe.

In [None]:
mkdir -p data/scopeseq-2.08/
wget -P data/scopeseq-2.08/ https://scop.berkeley.edu/downloads/scopeseq-2.08/astral-scopedom-seqres-gd-all-2.08-stable.fa

In [None]:
# The SCOPe FASTA file wraps sequences over multiple lines in lower case; we want one upper-case line per sequence
from pathlib import Path

scope_path = Path('data/scopeseq-2.08/astral-scopedom-seqres-gd-all-2.08-stable.fa')
assert scope_path.exists()
scope_entries = list()
n_longest = 100

# read FASTA and save the longest sequences on the fly
# TODO: use doubly linked list or dequeue and 
with open(scope_path, 'r') as fasta_in:
    header = ''
    seq = ''
    for line in fasta_in:
        if line.startswith('>'):
            if seq != '':
                scope_entries.append((header, seq))
            header = line.strip().removeprefix('>')
            seq = ''
        else:
            seq += line.strip().upper()
    scope_entries.append((header, seq))

scope_entries = sorted(scope_entries, key=lambda x: len(x[1]), reverse=True)

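# write the n longest sequences to file (slicing is safe if fewer than n entries exist)
out_path = Path('data/prostt5_bias/test.fasta')
out_path.parent.mkdir(parents=True, exist_ok=True)
with open(out_path, 'w') as fasta_out:
    for header, seq in scope_entries[:n_longest]:
        fasta_out.write('>{}\n{}\n'.format(header, seq))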
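
The cell above loads every entry and sorts the full list. As a minimal sketch of keeping only the longest sequences while reading, the standard library's `heapq.nlargest` can consume a generator of FASTA records. The helper names `iter_fasta` and `n_longest_entries` are hypothetical and not part of this notebook; the sketch assumes the same multi-line, lower-case FASTA layout as the SCOPe file.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped lines and upper-casing."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks).upper()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # emit the final record, if any
    if header is not None:
        yield header, ''.join(chunks).upper()


def n_longest_entries(lines: Iterable[str], n: int) -> List[Tuple[str, str]]:
    """Return the n longest records, longest first, holding at most n in the heap."""
    return heapq.nlargest(n, iter_fasta(lines), key=lambda entry: len(entry[1]))


# Small in-memory example; an open file handle works the same way.
demo = ['>a', 'acg', 'tt', '>b', 'a', '>c', 'acgtacgt']
longest = n_longest_entries(demo, 2)
```

Passing an open file handle instead of `demo` avoids materializing all entries at once, which matters little for SCOPe but may help with larger inputs.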

Conclusion: the longest SCOPe entries range from 1205 to 1664 residues in length, so this dataset does not fit our needs.
However, UniProtKB lists only 70 entries longer than 6000 residues with a PDB entry and 4 with an AFDB entry.
For simplicity, I selected 6 entries:

In [None]:
DATA_PATH=data/prostt5_bias_test/pdb
mkdir -p $DATA_PATH
wget -P $DATA_PATH https://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/data/structures/divided/pdb/oi/pdb1oig.ent.gz https://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/data/structures/divided/pdb/ia/pdb2iak.ent.gz https://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/data/structures/divided/pdb/ko/pdb1koa.ent.gz https://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/data/structures/divided/pdb/ej/pdb3ejf.ent.gz https://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/data/structures/divided/pdb/nh/pdb7nh7.ent.gz https://ftp.ebi.ac.uk/pub/databases/rcsb/pdb-remediated/data/structures/divided/pdb/ar/pdb1ark.ent.gz

In [None]:
DB_PATH=data/prostt5_bias_test/dbs/foldseek
mkdir -p $DB_PATH

# conda activate prostt5
foldseek createdb $DATA_PATH/* $DB_PATH/baseline

foldseek convert2fasta $DB_PATH/baseline $(dirname $DATA_PATH)/baseline.aa.fasta

In [None]:
TEST_DATA=data/prostt5_bias_test/baseline.aa.fasta
PROSTT5_MODEL=/home/sukhwan/foldseek_ctranslate/foldseek/weights/model/
# generate foldseek databases with the ctranslate2 branch, using different split lengths for ProstT5
srun -p gpu --gres=gpu:1 -c 4 -t 1-0 --pty /bin/bash
for SPLIT_LENGTH in {500..6000..1250}; do
    lib/foldseek createdb $TEST_DATA $DB_PATH/prostt5_bias_$SPLIT_LENGTH --prostt5-model $PROSTT5_MODEL --prostt5-split-length $SPLIT_LENGTH;
done

In [None]:
# TODO: calculate value for 3Di class balance

# TODO: plot class imbalance value (y) versus split length (x)

# TODO: compare versus "ground truth" (vanilla foldseek generated 3Di from .pdb or .mmcif files)
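
One possible way to approach the first TODO is the normalized Shannon entropy of the 3Di state frequencies: 1.0 means all states are used equally often, 0.0 means a single state dominates completely. This is an illustrative choice, not necessarily the metric intended here. The names `THREE_DI_STATES` and `class_balance` are hypothetical, and the sketch assumes the 3Di strings have already been exported as plain text using the 20 amino-acid letters (case-insensitive).

```python
import math
from collections import Counter

# Assumption: 3Di states are encoded with the 20 standard amino-acid letters.
THREE_DI_STATES = 'ACDEFGHIKLMNPQRSTVWY'


def class_balance(seq_3di: str, alphabet: str = THREE_DI_STATES) -> float:
    """Normalized Shannon entropy of state frequencies, in the range [0, 1]."""
    # Count only characters that belong to the alphabet; ignore anything else.
    counts = Counter(c for c in seq_3di.upper() if c in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError('sequence contains no characters from the alphabet')
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Divide by the maximum entropy, reached when all states are equally frequent.
    return entropy / math.log(len(alphabet))


balanced = class_balance(THREE_DI_STATES)  # every state exactly once
skewed = class_balance('dddddddddd')       # a single state only
```

Computing this value per split length, and for the baseline database built from the structure files, would give the y-values for the plot and the comparison listed in the TODOs.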