#### CAFA6-03-Enhanced feature extraction

Improvement from CAFA6-02: 77 hand-crafted features --> 100 automatic features.

Using: Meta-AI ESM-2 + PCA

ESM-2 is a protein language model (PLM) created by Meta AI. It is trained on amino acid sequences, the same kind of input we have here, and outputs per-residue embeddings, which we mean-pool into a single vector per protein.

Future direction: Use MLP instead of OneVsRestClassifier(LogisticRegression)

References:
- https://www.kaggle.com/code/analyticaobscura/cafa-6-decoding-protein-mysteries
- (ESM-2 embeddings 320 features) https://www.kaggle.com/code/dalloliogm/compute-protein-embeddings-with-esm2-esm-c/notebook
- (ProtT5 embeddings 1024 features) https://www.kaggle.com/code/ahsuna123/t5-embedding-calculation-cafa-6/output?select=train_ids.npy
- (MLP with ESM2) https://www.kaggle.com/code/jwang2025learning/cafa-6-function-prediction-using-prott5?scriptVersionId=282801093

---

## Example with ESM-2 (esm2_t6_8M_UR50D)

---

In [2]:
import torch
import esm

# Load lightest ESM-2
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Example sequence
sequences = [("protein1", "MKTAYIAKQRQISFVKSHFSRQDILD")]
batch_labels, batch_strs, batch_tokens = batch_converter(sequences)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[6], return_contacts=False)
    token_representations = results["representations"][6]

# Mean-pool over sequence tokens
sequence_embedding = token_representations[:, 1:-1].mean(1)

print(sequence_embedding.shape)  # (1, 320)

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt" to C:\Users\MS/.cache\torch\hub\checkpoints\esm2_t6_8M_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t6_8M_UR50D-contact-regression.pt" to C:\Users\MS/.cache\torch\hub\checkpoints\esm2_t6_8M_UR50D-contact-regression.pt
torch.Size([1, 320])
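
The slice `[:, 1:-1]` in the example drops the BOS/EOS tokens of a single sequence. When several sequences of different lengths share a batch, the shorter ones are padded, so a plain mean over positions would also average padding. A minimal sketch of length-aware mean pooling follows, using random NumPy arrays as stand-ins for `token_representations`; the helper name `masked_mean_pool` and the toy shapes are assumptions, not part of this notebook.

```python
import numpy as np

def masked_mean_pool(token_reps, seq_lens):
    """Mean-pool per-residue embeddings, skipping BOS/EOS and padding.

    token_reps: array of shape (batch, max_len + 2, dim); position 0 is BOS,
                position seq_len + 1 is EOS, everything after is padding.
    seq_lens:   true residue count of each sequence in the batch.
    """
    pooled = np.zeros((token_reps.shape[0], token_reps.shape[2]))
    for i, n in enumerate(seq_lens):
        # Residues of sequence i live at positions 1..n (inclusive)
        pooled[i] = token_reps[i, 1 : n + 1].mean(axis=0)
    return pooled

rng = np.random.default_rng(0)
seq_lens = [26, 10]  # two toy sequences of different length
token_reps = rng.normal(size=(2, max(seq_lens) + 2, 320))  # hypothetical model output
pooled = masked_mean_pool(token_reps, seq_lens)
print(pooled.shape)
```

With real ESM-2 output, the lengths could be taken from the input sequences themselves, and the tensor converted with `.numpy()` before pooling.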


## Step 1: Load CAFA6 files

---

In [1]:
from Bio import SeqIO  # parse fasta file
import pandas as pd
import numpy as np

In [4]:
# CAFA6 file paths
TRAIN_TERMS = "data/Train/train_terms.tsv"
TRAIN_SEQ = "data/Train/train_sequences.fasta"
TRAIN_TAXONOMY = "data/Train/train_taxonomy.tsv"
TEST_SEQ = "data/Test/testsuperset.fasta"

In [5]:
# Load CAFA6 files into dataFrame
train_terms_df = pd.read_csv(TRAIN_TERMS, sep="\t")  # identifier --> label
train_taxonomy_df = pd.read_csv(TRAIN_TAXONOMY, sep="\t", names=['EntryID', 'taxonomyID'])  # identifier --> taxonomy

def load_fasta_to_dataframe(file_path, is_train=True):
    records = []
    parser = SeqIO.parse(file_path, "fasta")
    for record in parser:
        entry_id = record.id.split('|')[1] if is_train and '|' in record.id else record.id.split()[0]
        records.append({'EntryID': entry_id, 'sequence': str(record.seq)})
    return pd.DataFrame(records)

train_sequences_df = load_fasta_to_dataframe(TRAIN_SEQ, is_train=True)  # identifier --> input: amino acid sequence

In [6]:
# Group the dataFrames above into a single data frame
protein_labels = train_terms_df.groupby('EntryID')['term'].apply(list).reset_index(name='labels')  # turn all EntryID duplicates into one EntryID with their terms forming a list
train_df_eda = pd.merge(train_sequences_df, train_taxonomy_df, on='EntryID', how='left')
train_df_eda = pd.merge(train_df_eda, protein_labels, on='EntryID', how='inner')

In [7]:
train_df_eda

| | EntryID | sequence | taxonomyID | labels |
|---|---|---|---|---|
| 0 | A0A0C5B5G6 | MRWQEMGYIFYPRKLR | 9606 | [GO:0001649, GO:0033687, GO:0005615, GO:000563... |
| 1 | A0JNW5 | MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL... | 9606 | [GO:0120013, GO:0034498, GO:0005769, GO:012000... |
| 2 | A0JP26 | MVAEVCSMPAASAVKKPFDLRSKMGKWCHHRFPCCRGSGKSNMGTS... | 9606 | [GO:0005515] |
| 3 | A0PK11 | MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDL... | 9606 | [GO:0007605, GO:0005515] |
| 4 | A1A4S6 | MGLQPLEFSDCYLDSPWFRERIRAHEAELERTNKFIKELIKDGKNL... | 9606 | [GO:0005829, GO:0010008, GO:0005515, GO:000509... |
| ... | ... | ... | ... | ... |
| 82399 | Q9UTM1 | MSKLKAQSALQKLIESQKNPNANEDGYFRRKRLAKKERPFEPKKLV... | 284812 | [GO:0005730] |
| 82400 | Q9Y7I1 | MSSNSNTDHSTGDNRSKSEKQTDLRNALRETESHGMPPLRGPAGFP... | 284812 | [GO:0005634, GO:0005829] |
| 82401 | Q9Y7P7 | MRSNNSSLVHCCWVSPPSLTRLPAFPSPRILSPCYCYNKRIRPFRG... | 284812 | [GO:0005634, GO:0005829] |
| 82402 | Q9Y7Q3 | MHSSRRKYNDMWTARLLIRSDQKEEKYPSFKKNAGKAINAHLIPKL... | 284812 | [GO:0005634, GO:0005739, GO:0005829] |


In [8]:
# Dict {record id: sequence}; train ids keep the full FASTA header id (e.g. 'sp|...|...'), test ids are bare accessions
train_sequences = {rec.id: str(rec.seq) for rec in SeqIO.parse(TRAIN_SEQ, 'fasta')}
test_sequences  = {rec.id: str(rec.seq) for rec in SeqIO.parse(TEST_SEQ,  'fasta')}

print(f'Loaded {len(train_sequences)} train and {len(test_sequences)} test sequences')

Loaded 82404 train and 224309 test sequences


In [9]:
print("Train dict:", list(train_sequences.items())[0])
print("Test dict:", list(test_sequences.items())[0])

Train dict: ('sp|A0A0C5B5G6|MOTSC_HUMAN', 'MRWQEMGYIFYPRKLR')
Test dict: ('A0A0C5B5G6', 'MRWQEMGYIFYPRKLR')
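
The two dictionaries are keyed differently: train keys carry the full `sp|accession|name` header, while test keys are bare accessions. A small sketch of a helper that maps both forms to the same accession is shown below; the name `to_entry_id` is hypothetical and only illustrates the idea, it is not used elsewhere in this notebook.

```python
def to_entry_id(header_id: str) -> str:
    """Return the accession from a FASTA record id.

    'sp|A0A0C5B5G6|MOTSC_HUMAN' -> 'A0A0C5B5G6'
    'A0A0C5B5G6'                -> 'A0A0C5B5G6'
    """
    parts = header_id.split("|")
    # UniProt-style headers have the accession as the second '|' field
    return parts[1] if len(parts) >= 3 else header_id.split()[0]

raw_ids = ["sp|A0A0C5B5G6|MOTSC_HUMAN", "A0A0C5B5G6"]
normalized = [to_entry_id(r) for r in raw_ids]
print(normalized)  # ['A0A0C5B5G6', 'A0A0C5B5G6']
```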


## Step 2: Feature extraction

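We apply PCA to the ESM-2 embeddings to reduce them from 320 features to 100.

WHY?

OneVsRestClassifier(LogisticRegression) fits one independent binary linear model per GO term, so training cost grows with both the number of terms and the number of input features. With the full 320 ESM-2 features, training takes too long. In short, this design is inefficient for a label space with thousands of GO terms.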

Future direction: Use MLP
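
As a rough illustration of that direction, the sketch below trains a single multi-label MLP on random stand-in data with scikit-learn's `MLPClassifier`, which accepts a multi-hot label matrix directly. The toy shapes, the hidden layer size, and the choice of `MLPClassifier` are all assumptions; this notebook does not train an MLP.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 100))                     # stand-in for PCA-reduced embeddings
Y_toy = (rng.random(size=(200, 5)) < 0.3).astype(int)   # stand-in multi-hot labels (5 terms)

# One network with a shared hidden layer predicts all terms at once,
# instead of one separate linear model per term.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200, random_state=42)
mlp.fit(X_toy, Y_toy)

probs = mlp.predict_proba(X_toy[:3])  # one probability per (protein, term)
print(probs.shape)
```

On random labels the optimizer may emit a convergence warning; that is expected for this toy data and says nothing about real embeddings.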
  
---

In [11]:
# Embeddings file paths
ESM_EMBEDDINGS = "ESM2"  # directory with the precomputed ESM-2 embeddings (no trailing slash)
TRAIN_EMBEDDINGS = ESM_EMBEDDINGS + "/protein_embeddings_train.npy"
TRAIN_IDS = ESM_EMBEDDINGS + "/train_ids.npy"
TEST_EMBEDDINGS = ESM_EMBEDDINGS + "/protein_embeddings_test.npy"
TEST_IDS = ESM_EMBEDDINGS + "/test_ids.npy"

In [12]:
# Load embeddings
X_train = np.load(TRAIN_EMBEDDINGS)
train_ids = np.load(TRAIN_IDS)
X_test = np.load(TEST_EMBEDDINGS)
test_ids = np.load(TEST_IDS)

In [13]:
print("X_train shape:", X_train.shape)
print("train_ids shape:", train_ids.shape)
print("X_test shape:", X_test.shape)
print("test_ids shape:", test_ids.shape)

X_train shape: (82404, 320)
train_ids shape: (82404,)
X_test shape: (224309, 320)
test_ids shape: (224309,)


In [14]:
from sklearn.decomposition import PCA

pca = PCA(n_components=100, random_state=42)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced  = pca.transform(X_test)

In [15]:
print("X_train_reduced shape:", X_train_reduced.shape)
print("X_test_reduced shape:", X_test_reduced.shape)

X_train_reduced shape: (82404, 100)
X_test_reduced shape: (224309, 100)
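
Whether 100 components is a good cut-off depends on how much variance they retain, which is not reported here. A minimal sketch of that check on random correlated toy data (not the real embeddings) is shown below; on the real data the same attribute could be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy stand-in for the (n_proteins, 320) embedding matrix, with correlated columns
X_toy = rng.normal(size=(500, 320)) @ rng.normal(size=(320, 320))

pca_toy = PCA(n_components=100, random_state=42).fit(X_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
print(f"Variance retained by 100 components: {cumulative[-1]:.3f}")

# Smallest number of components that reaches 90% variance, if 100 is enough
if cumulative[-1] >= 0.90:
    n_90 = int(np.argmax(cumulative >= 0.90)) + 1
    print("Components needed for 90% variance:", n_90)
else:
    print("100 components retain less than 90% of the variance")
```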


💡 As shown above, the reduced features `X_train_reduced` have shape (82404, 100), and `train_ids` stores the 82404 matching record IDs, which the next step assumes are in the same row order.

## Step 3: Label encoding and Training

We split the labels into three subsets, one per ontology: MF, BP, and CC.
- As a result, each sequence gets three multi-hot vectors (the multi-label analogue of one-hot encoding):
    - For MF: [0 1 0 1 ... 0] of length num_unique(MF_GO_terms)
    - For BP: [0 1 0 1 ... 0] of length num_unique(BP_GO_terms)
    - For CC: [0 1 0 1 ... 0] of length num_unique(CC_GO_terms)
- We then train one model per ontology (three in total). For each test example, we run all three models and combine their predictions.
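
The multi-hot encoding described above can be seen on a three-protein toy example. The label lists below are made up for illustration; only `MultiLabelBinarizer` itself is what the training cell uses.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [
    ["GO:0005515", "GO:0005634"],  # protein with two terms in this aspect
    [],                            # protein with no term in this aspect
    ["GO:0005634"],                # protein with one term
]

mlb_toy = MultiLabelBinarizer()
y_toy = mlb_toy.fit_transform(toy_labels)

print(list(mlb_toy.classes_))  # ['GO:0005515', 'GO:0005634']
print(y_toy)
# [[1 1]
#  [0 0]
#  [0 1]]
```

Each column corresponds to one entry of `classes_`, so a protein without any term in the aspect becomes an all-zero row rather than being dropped.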

---

In [16]:
# Create three label sets
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from tqdm import tqdm

mlb_dict = dict()
models = dict()

entry_ids = [s.split('|')[1] for s in train_ids]  # 82404 entryIDs in the original order

for aspect in tqdm(['P', 'C', 'F'], desc="Training models"):
    # Filter the train_terms_df based on aspect
    ont_terms_df = train_terms_df[train_terms_df['aspect'] == aspect]

    # Group the dataFrame based on the EntryID, turn all the GO terms to a list, finally turns this dataFrame to a dict {entryID: terms}
    protein_terms = ont_terms_df.groupby('EntryID')['term'].apply(list).to_dict()

    # Create a list of labels for this aspect, if an entryID doesn't exist in this aspect, give it a []
    # This ensures y_train is of shape (82404, ...)
    labels = [protein_terms.get(entry_id, []) for entry_id in entry_ids]

    # Multi-hot encoding, use sparse representation
    mlb = MultiLabelBinarizer(sparse_output=True)
    y_train = mlb.fit_transform(labels)
    
    print(f"y_train shape for {aspect} ontology: {y_train.shape} \t\t Number of unique {aspect} terms: {y_train.shape[1]}")

    # Save to dict
    mlb_dict[aspect] = mlb
    model = OneVsRestClassifier(LogisticRegression(max_iter=600, solver='lbfgs', C=0.5, random_state=42), n_jobs=-1)
    model.fit(X_train_reduced, y_train)
    models[aspect] = model
    print(f"Model for {aspect} trained successfully.")

Training models:   0%|          | 0/3 [00:00<?, ?it/s]

y_train shape for P ontology: (82404, 16858) 		 Number of unique P terms: 16858


Training models:  33%|███▎      | 1/3 [35:04<1:10:08, 2104.17s/it]

Model for P trained successfully.
y_train shape for C ontology: (82404, 2651) 		 Number of unique C terms: 2651


Training models:  67%|██████▋   | 2/3 [40:35<17:41, 1061.29s/it]

Model for C trained successfully.
y_train shape for F ontology: (82404, 6616) 		 Number of unique F terms: 6616


Training models: 100%|██████████| 3/3 [54:07<00:00, 1082.63s/it]

Model for F trained successfully.





## Step 4: Inference and Submission

In [17]:
BATCH_SIZE = 5000  # avoid memory overflow
submission_list = []

for i in tqdm(range(0, len(test_ids), BATCH_SIZE), desc="Predicting on Test Set"):
    batch_entry_ids = test_ids[i : i + BATCH_SIZE]

    # Slice PCA-reduced features
    X_batch = X_test_reduced[i : i + BATCH_SIZE]

    for aspect, model in models.items():
        mlb = mlb_dict[aspect]
        y_pred_proba = model.predict_proba(X_batch)

        for j, entry_id in enumerate(batch_entry_ids):
            probs = y_pred_proba[j]
            candidate_indices = np.where(probs > 0.02)[0]

            for idx in candidate_indices:
                submission_list.append((entry_id, mlb.classes_[idx], round(probs[idx], 3)))

Predicting on Test Set: 100%|██████████| 45/45 [41:42<00:00, 55.60s/it]
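
The inner loops over proteins and candidate indices run in pure Python. One possible alternative is to gather all above-threshold entries of a batch at once with `np.nonzero`. The sketch below uses small made-up arrays as stand-ins for `mlb.classes_`, `batch_entry_ids`, and `y_pred_proba`; it has not been timed against the loop version.

```python
import numpy as np

classes = np.array(["GO:0005515", "GO:0005634", "GO:0005829"])  # stand-in for mlb.classes_
batch_ids = np.array(["P1", "P2"])                               # stand-in for batch_entry_ids
y_pred_proba = np.array([[0.50, 0.01, 0.03],
                         [0.00, 0.20, 0.015]])

# Row/column indices of every probability above the 0.02 threshold
rows, cols = np.nonzero(y_pred_proba > 0.02)

batch_rows = list(zip(
    batch_ids[rows].tolist(),
    classes[cols].tolist(),
    np.round(y_pred_proba[rows, cols], 3).tolist(),
))
print(batch_rows)
# [('P1', 'GO:0005515', 0.5), ('P1', 'GO:0005829', 0.03), ('P2', 'GO:0005634', 0.2)]
```

The resulting tuples have the same (protein, term, probability) layout as the entries appended to `submission_list`, so they could be added with `submission_list.extend(batch_rows)`.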


In [18]:
submission_df = pd.DataFrame(submission_list, columns=['Protein Id', 'GO Term Id', 'Prediction'])

print("Applying 1500 prediction limit per protein...")
submission_df = submission_df.sort_values(by=['Protein Id', 'Prediction'], ascending=[True, False])
final_submission_df = submission_df.groupby('Protein Id').head(1500).reset_index(drop=True)
final_submission_df.to_csv('submission.tsv', sep='\t', index=False, header=False)

print("\nSubmission file 'submission.tsv' created successfully.")
print(f"Total predictions in final submission: {len(final_submission_df):,}")
print("Submission DataFrame Head:")
display(final_submission_df.head())

Applying 1500 prediction limit per protein...

Submission file 'submission.tsv' created successfully.
Total predictions in final submission: 5,214,598
Submission DataFrame Head:


| | Protein Id | GO Term Id | Prediction |
|---|---|---|---|
| 0 | A0A017SE81 | GO:0005886 | 0.151 |
| 1 | A0A017SE81 | GO:0005515 | 0.135 |
| 2 | A0A017SE81 | GO:0005829 | 0.096 |
| 3 | A0A017SE81 | GO:0005739 | 0.079 |
| 4 | A0A017SE81 | GO:0005789 | 0.076 |
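
The per-protein cap works by sorting predictions in descending order within each protein and keeping the first rows of each group. A toy illustration with a cap of 2 instead of 1500 follows; the proteins and scores are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Protein Id": ["P1", "P1", "P1", "P2"],
    "GO Term Id": ["GO:0005515", "GO:0005634", "GO:0005829", "GO:0005515"],
    "Prediction": [0.2, 0.9, 0.5, 0.1],
})

cap = 2  # toy value; the submission cell uses 1500
capped = (
    toy.sort_values(by=["Protein Id", "Prediction"], ascending=[True, False])
       .groupby("Protein Id")
       .head(cap)                 # keep the top `cap` rows of each protein
       .reset_index(drop=True)
)
print(capped)
```

Here P1 keeps its two highest-scoring terms (0.9 and 0.5) and P2 keeps its single prediction.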
