#### CAFA6-09-Model Improvement

Improvement over CAFA6-08: use a model ensemble (1D-CNN + MLP)

Future direction: Use model ensemble

References:
- https://www.kaggle.com/code/analyticaobscura/cafa-6-decoding-protein-mysteries
- (ESM-2 embeddings 320 features) https://www.kaggle.com/code/dalloliogm/compute-protein-embeddings-with-esm2-esm-c/notebook
- (MLP with ESM2) https://www.kaggle.com/code/jwang2025learning/cafa-6-function-prediction-using-prott5?scriptVersionId=282801093
- (1D-CNN) https://www.kaggle.com/code/momerer/cafa-6-protein-function-prediction-with-1d-cnn#----5.-GENERATING-PREDICTIONS----
- (Idea) https://www.kaggle.com/code/nina2025/cafa-6-protein-function-prediction

---

In [1]:
!pip install biopython > /dev/null

## Step 1: Load CAFA6 files

---

In [2]:
from Bio import SeqIO  # parse fasta file
import pandas as pd
import numpy as np

In [3]:
# CAFA6 file paths
TRAIN_TERMS = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_terms.tsv"
TRAIN_SEQ = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_sequences.fasta"
TRAIN_TAXONOMY = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_taxonomy.tsv"
TEST_SEQ = "/kaggle/input/cafa-6-protein-function-prediction/Test/testsuperset.fasta"

In [4]:
# Map each FASTA record ID to its sequence: {record_id: sequence}
train_sequences = {rec.id: str(rec.seq) for rec in SeqIO.parse(TRAIN_SEQ, 'fasta')}
test_sequences  = {rec.id: str(rec.seq) for rec in SeqIO.parse(TEST_SEQ,  'fasta')}

print(f'Loaded {len(train_sequences)} train and {len(test_sequences)} test sequences')

Loaded 82404 train and 224309 test sequences


In [5]:
print("Train dict:", list(train_sequences.items())[0])
print("Test dict:", list(test_sequences.items())[0])

Train dict: ('sp|A0A0C5B5G6|MOTSC_HUMAN', 'MRWQEMGYIFYPRKLR')
Test dict: ('A0A0C5B5G6', 'MRWQEMGYIFYPRKLR')


In [6]:
train_ids = [i.split('|')[1] for i in train_sequences.keys()]
test_ids = list(test_sequences.keys())
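
The train headers use the UniProt `db|accession|entry_name` form while the test headers are bare accessions, which is why only the train keys are split on `|`. A minimal sketch of that parsing; `to_accession` is a hypothetical helper (not part of the notebook) that accepts both forms:

```python
def to_accession(fasta_id: str) -> str:
    """Return the UniProt accession from a FASTA record id.

    Hypothetical helper: handles the 'sp|ACCESSION|NAME' form seen in the
    train file and the bare accession form seen in the test file.
    """
    parts = fasta_id.split("|")
    # 'sp|A0A0C5B5G6|MOTSC_HUMAN' -> ['sp', 'A0A0C5B5G6', 'MOTSC_HUMAN']
    return parts[1] if len(parts) >= 3 else fasta_id


train_example = to_accession("sp|A0A0C5B5G6|MOTSC_HUMAN")
test_example = to_accession("A0A0C5B5G6")
print(train_example, test_example)
```

Both calls return the same accession, so train and test IDs end up in the same namespace.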

## Step 2: Feature extraction
  
---

In [7]:
# Embeddings file paths
ESM_EMBEDDINGS = "/kaggle/input/esm2-t6-8m-ur50d-cafa6"
TRAIN_EMBEDDINGS = ESM_EMBEDDINGS + "/protein_embeddings_train.npy"
TEST_EMBEDDINGS = ESM_EMBEDDINGS + "/protein_embeddings_test.npy"

In [8]:
# Load embeddings
X_train = np.load(TRAIN_EMBEDDINGS)
X_test = np.load(TEST_EMBEDDINGS)

In [9]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (82404, 320)
X_test shape: (224309, 320)


In [10]:
# Normalization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
print("X_train_norm shape:", X_train_norm.shape)
print("X_test_norm shape:", X_test_norm.shape)

X_train_norm shape: (82404, 320)
X_test_norm shape: (224309, 320)
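
The important detail in the normalization cell is that the statistics come from the train set only and are then reused for the test set. A pure-Python sketch of the same idea for a single feature column, assuming population standard deviation (ddof=0); the helper names and toy values are illustrative only:

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) of one feature column, using the population std."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns so the division below is always defined.
    return mean, (std or 1.0)


def standardize(column, mean, std):
    """Apply previously fitted statistics to a column."""
    return [(v - mean) / std for v in column]


train_col = [1.0, 2.0, 3.0, 4.0]   # toy "train" feature values
test_col = [2.5, 5.0]              # toy "test" feature values

mean, std = fit_standardizer(train_col)        # fit on train only
train_norm = standardize(train_col, mean, std)
test_norm = standardize(test_col, mean, std)   # reuse the train statistics
```

After this, the train column has zero mean and unit variance, while the test column generally does not, which is expected.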


## Step 3: Customized MLP

---

In [11]:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [12]:
class MLPClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.ReLU(),
            nn.BatchNorm1d(2048),
            nn.Dropout(0.3),

            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(0.25),

            nn.Linear(1024, output_dim)  # logits
        )

    def forward(self, x):
        return self.net(x)

## Step 4: Customized CNN1D

Reference: https://www.kaggle.com/code/momerer/cafa-6-protein-function-prediction-with-1d-cnn#----5.-GENERATING-PREDICTIONS----

---

In [13]:
class CNN1D(nn.Module):
    def __init__(self, input_dim, num_classes):
        super(CNN1D, self).__init__()
        self.conv_block1 = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2)
        )
        self.conv_block2 = nn.Sequential(
            nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2)
        )
        
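        # The convolutions keep the length; each MaxPool1d halves it (rounding down),
        # so two pools leave input_dim // 4 positions per channel.
        flattened_size = 64 * (input_dim // 4)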
        
        self.fc_block = nn.Sequential(
            nn.Linear(in_features=flattened_size, out_features=1024),
            nn.ReLU(),
            nn.Dropout(p=0.4), # Added Dropout to prevent overfitting
            nn.Linear(in_features=1024, out_features=num_classes)
        )

    def forward(self, x):
        # (batch_size, embed_size) -> (batch_size, 1, embed_size)
        x = x.unsqueeze(1)  # (batch_size, 1, 320)
        x = self.conv_block1(x)  # (batch_size, 32, 160)
        x = self.conv_block2(x)  # (batch_size, 64, 80)
        x = torch.flatten(x, 1)  # (batch_size, 64*80)
        x = self.fc_block(x)  # (batch_size, num_classes)
        return x
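
The size of the first fully connected layer follows from the layer arithmetic: the padded convolutions preserve the length and each pooling layer halves it, rounding down. A small sketch of that calculation, assuming dilation 1 and floor mode (the PyTorch defaults); the function names are illustrative:

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D conv or pooling layer (dilation 1, floor mode)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def cnn_flattened_size(input_dim, out_channels=64):
    """Flattened feature count after the two conv blocks of CNN1D."""
    n = conv1d_out_len(input_dim, kernel_size=5, padding=2)  # conv_block1 conv
    n = conv1d_out_len(n, kernel_size=2, stride=2)           # conv_block1 pool
    n = conv1d_out_len(n, kernel_size=3, padding=1)          # conv_block2 conv
    n = conv1d_out_len(n, kernel_size=2, stride=2)           # conv_block2 pool
    return out_channels * n


flat_320 = cnn_flattened_size(320)
flat_321 = cnn_flattened_size(321)
print(flat_320, flat_321)
```

For the 320-dimensional embeddings this gives 64 × 80 = 5120, which matches the `in_features=5120` in the printed model. An input length that is not a multiple of 4 is rounded down by the pooling layers.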

## Step 5: Label encoding + MLP training

We split the labels into three subsets, one per ontology: MF, BP, and CC.
- For each sequence this gives three multi-hot vectors (the multi-label analogue of one-hot encoding):
    - For MF: [0 1 0 1 ... 0] of length num_unique(MF_GO_terms)
    - For BP: [0 1 0 1 ... 0] of length num_unique(BP_GO_terms)
    - For CC: [0 1 0 1 ... 0] of length num_unique(CC_GO_terms)
- We then train one model per ontology (three in total). At inference time, all three models predict for each test sequence and their predictions are gathered.
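
The encoding described above can be sketched in plain Python. This mirrors what a multi-label binarizer does (sorted class list, one column per class); the helper names are hypothetical and the GO IDs are toy values picked for illustration:

```python
def fit_classes(label_lists):
    """Sorted list of every distinct label seen in the training data."""
    return sorted({term for terms in label_lists for term in terms})


def multi_hot(label_lists, classes):
    """Encode each label list as a 0/1 vector aligned with `classes`."""
    index = {term: i for i, term in enumerate(classes)}
    rows = []
    for terms in label_lists:
        row = [0] * len(classes)
        for term in terms:
            row[index[term]] = 1
        rows.append(row)
    return rows


# Toy annotations for one ontology; the second protein has no term
# in this ontology and therefore gets an all-zero row.
labels = [
    ["GO:0005515", "GO:0003824"],
    [],
    ["GO:0003824"],
]
classes = fit_classes(labels)
encoded = multi_hot(labels, classes)
print(classes)
print(encoded)
```

Proteins without any annotation in an ontology still get a row, which keeps the label matrix aligned with the embedding matrix.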

---

In [14]:
# Create three label sets
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm

mlp_y_trains = dict()
mlp_mlb_dict = dict()
mlp_models = dict()

train_terms_df = pd.read_csv(TRAIN_TERMS, sep="\t")

for aspect in ['P', 'C', 'F']:
    # Filter the train_terms_df based on aspect
    ont_terms_df = train_terms_df[train_terms_df['aspect'] == aspect]

    # Group by EntryID, collect each protein's GO terms into a list, then convert the result to a dict {EntryID: [terms]}
    protein_terms = ont_terms_df.groupby('EntryID')['term'].apply(list).to_dict()

    # Create a list of labels for this aspect, if an entryID doesn't exist in this aspect, give it a []
    # This ensures y_train is of shape (82404, ...)
    labels = [protein_terms.get(entry_id, []) for entry_id in train_ids]

    # Multi-hot encoding, use sparse representation
    mlb = MultiLabelBinarizer(sparse_output=True)
    y_train = mlb.fit_transform(labels)
    mlp_y_trains[aspect] = y_train
    
    print(f"y_train shape for {aspect} ontology: {y_train.shape} \t\t Number of unique {aspect} terms: {y_train.shape[1]}")

    # Save to dict
    mlp_mlb_dict[aspect] = mlb
    model = MLPClassifier(input_dim=X_train.shape[1], output_dim=y_train.shape[1])
    mlp_models[aspect] = model

y_train shape for P ontology: (82404, 16858) 		 Number of unique P terms: 16858
y_train shape for C ontology: (82404, 2651) 		 Number of unique C terms: 2651
y_train shape for F ontology: (82404, 6616) 		 Number of unique F terms: 6616


In [15]:
mlp_models

{'P': MLPClassifier(
   (net): Sequential(
     (0): Linear(in_features=320, out_features=2048, bias=True)
     (1): ReLU()
     (2): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (3): Dropout(p=0.3, inplace=False)
     (4): Linear(in_features=2048, out_features=1024, bias=True)
     (5): ReLU()
     (6): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (7): Dropout(p=0.25, inplace=False)
     (8): Linear(in_features=1024, out_features=16858, bias=True)
   )
 ),
 'C': MLPClassifier(
   (net): Sequential(
     (0): Linear(in_features=320, out_features=2048, bias=True)
     (1): ReLU()
     (2): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (3): Dropout(p=0.3, inplace=False)
     (4): Linear(in_features=2048, out_features=1024, bias=True)
     (5): ReLU()
     (6): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (7): Dropout(p=0.25

In [16]:
mlp_y_trains

{'P': <Compressed Sparse Row sparse matrix of dtype 'int64'
 	with 250805 stored elements and shape (82404, 16858)>,
 'C': <Compressed Sparse Row sparse matrix of dtype 'int64'
 	with 157770 stored elements and shape (82404, 2651)>,
 'F': <Compressed Sparse Row sparse matrix of dtype 'int64'
 	with 128452 stored elements and shape (82404, 6616)>}

In [17]:
# DataLoader
from torch.utils.data import DataLoader, TensorDataset

X_train_tensor = torch.tensor(X_train_norm, dtype=torch.float32)

loaders = {}

for aspect in ['P', 'C', 'F']:
    # Convert sparse CSR → dense numpy → float tensor
    y_dense = mlp_y_trains[aspect].toarray().astype('float32')
    y_tensor = torch.tensor(y_dense, dtype=torch.float32)

    dataset = TensorDataset(X_train_tensor, y_tensor)

    loaders[aspect] = DataLoader(dataset, batch_size=128, shuffle=True)

    print(aspect, "loader ready:", y_tensor.shape)

P loader ready: torch.Size([82404, 16858])
C loader ready: torch.Size([82404, 2651])
F loader ready: torch.Size([82404, 6616])
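
Densifying the label matrices is the main memory cost of this step. A quick back-of-the-envelope sketch using the shapes printed above and 4 bytes per float32 value; `dense_gib` is an illustrative helper:

```python
def dense_gib(rows, cols, bytes_per_value=4):
    """Size in GiB of a dense rows x cols array (float32 by default)."""
    return rows * cols * bytes_per_value / 2**30


# Label-matrix shapes as printed by the notebook.
shapes = {"P": (82404, 16858), "C": (82404, 2651), "F": (82404, 6616)}
dense_sizes = {aspect: dense_gib(r, c) for aspect, (r, c) in shapes.items()}

for aspect, gib in dense_sizes.items():
    print(f"{aspect}: {gib:.2f} GiB as dense float32")
print(f"total: {sum(dense_sizes.values()):.2f} GiB")
```

The three dense tensors together take roughly 8 GiB of host memory, and the notebook builds them once for the MLP loaders and again for the CNN loaders. If memory becomes tight, one option (an assumption, not something the notebook does) is to keep the CSR matrix and densify only the rows of each batch.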


In [18]:
# Training
import torch.optim as optim

criterion = nn.BCEWithLogitsLoss()

def train_one_model(model, loader, epochs=5, lr=1e-3, device='cuda'):
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)

    model.train()
    for ep in range(epochs):
        total_loss = 0.0

        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            logits = model(X_batch)               # shape: (batch, num_labels)
            loss = criterion(logits, y_batch)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {ep+1}/{epochs}   Loss = {total_loss/len(loader):.4f}")

    return model

In [19]:
# Train 3 models
trained_mlp_models = dict()

for aspect in ['P', 'C', 'F']:
    print("\nTraining", aspect, "...")
    trained_mlp_models[aspect] = train_one_model(
        model=mlp_models[aspect],
        loader=loaders[aspect],
        epochs=8,
        lr=1e-3,
        device=device
    )


Training P ...
Epoch 1/8   Loss = 0.0666
Epoch 2/8   Loss = 0.0018
Epoch 3/8   Loss = 0.0015
Epoch 4/8   Loss = 0.0015
Epoch 5/8   Loss = 0.0015
Epoch 6/8   Loss = 0.0014
Epoch 7/8   Loss = 0.0014
Epoch 8/8   Loss = 0.0013

Training C ...
Epoch 1/8   Loss = 0.0676
Epoch 2/8   Loss = 0.0038
Epoch 3/8   Loss = 0.0036
Epoch 4/8   Loss = 0.0034
Epoch 5/8   Loss = 0.0033
Epoch 6/8   Loss = 0.0032
Epoch 7/8   Loss = 0.0031
Epoch 8/8   Loss = 0.0031

Training F ...
Epoch 1/8   Loss = 0.0663
Epoch 2/8   Loss = 0.0016
Epoch 3/8   Loss = 0.0014
Epoch 4/8   Loss = 0.0013
Epoch 5/8   Loss = 0.0013
Epoch 6/8   Loss = 0.0012
Epoch 7/8   Loss = 0.0011
Epoch 8/8   Loss = 0.0011
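
The losses look tiny largely because the label matrices are extremely sparse and `BCEWithLogitsLoss` (default mean reduction) averages over every (protein, term) cell. As a reference point, the sketch below computes the loss of a trivial predictor that outputs the global positive rate for every cell, using the stored-element counts and shapes printed above. The baseline itself is an illustration, not part of the notebook:

```python
import math


def constant_baseline_bce(num_positives, rows, cols):
    """Mean binary cross-entropy when every cell is predicted as the base rate p."""
    p = num_positives / (rows * cols)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


# (stored positives, rows, cols) per aspect, taken from the printed outputs.
stats = {
    "P": (250805, 82404, 16858),
    "C": (157770, 82404, 2651),
    "F": (128452, 82404, 6616),
}
baselines = {aspect: constant_baseline_bce(*s) for aspect, s in stats.items()}

for aspect, value in baselines.items():
    print(f"{aspect}: constant-prediction BCE = {value:.4f}")
```

This gives roughly 0.0017 (P), 0.0059 (C), and 0.0022 (F), so the final training losses of 0.0013, 0.0031, and 0.0011 improve on the constant baseline, but the raw loss value alone says little about prediction quality. A held-out validation metric would be more informative.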


## Step 6: Label encoding + CNN1D model training

---

In [20]:
# Create three label sets
cnn_y_trains = dict()
cnn_mlb_dict = dict()
cnn_models = dict()

train_terms_df = pd.read_csv(TRAIN_TERMS, sep="\t")

for aspect in ['P', 'C', 'F']:
    ont_terms_df = train_terms_df[train_terms_df['aspect'] == aspect]
    protein_terms = ont_terms_df.groupby('EntryID')['term'].apply(list).to_dict()
    labels = [protein_terms.get(entry_id, []) for entry_id in train_ids]

    mlb = MultiLabelBinarizer(sparse_output=True)
    y_train = mlb.fit_transform(labels)
    cnn_y_trains[aspect] = y_train

    print(f"{aspect} y_train shape:", y_train.shape)

    cnn_mlb_dict[aspect] = mlb
    model = CNN1D(input_dim=X_train.shape[1], num_classes=y_train.shape[1])
    cnn_models[aspect] = model

P y_train shape: (82404, 16858)
C y_train shape: (82404, 2651)
F y_train shape: (82404, 6616)


In [21]:
cnn_models

{'P': CNN1D(
   (conv_block1): Sequential(
     (0): Conv1d(1, 32, kernel_size=(5,), stride=(1,), padding=(2,))
     (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (2): ReLU()
     (3): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   )
   (conv_block2): Sequential(
     (0): Conv1d(32, 64, kernel_size=(3,), stride=(1,), padding=(1,))
     (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (2): ReLU()
     (3): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   )
   (fc_block): Sequential(
     (0): Linear(in_features=5120, out_features=1024, bias=True)
     (1): ReLU()
     (2): Dropout(p=0.4, inplace=False)
     (3): Linear(in_features=1024, out_features=16858, bias=True)
   )
 ),
 'C': CNN1D(
   (conv_block1): Sequential(
     (0): Conv1d(1, 32, kernel_size=(5,), stride=(1,), padding=(2,))
     (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=

In [22]:
cnn_y_trains

{'P': <Compressed Sparse Row sparse matrix of dtype 'int64'
 	with 250805 stored elements and shape (82404, 16858)>,
 'C': <Compressed Sparse Row sparse matrix of dtype 'int64'
 	with 157770 stored elements and shape (82404, 2651)>,
 'F': <Compressed Sparse Row sparse matrix of dtype 'int64'
 	with 128452 stored elements and shape (82404, 6616)>}

In [23]:
from torch.utils.data import DataLoader, TensorDataset

X_train_tensor = torch.tensor(X_train_norm, dtype=torch.float32)
cnn_loaders = {}

for aspect in ['P', 'C', 'F']:
    y_dense = cnn_y_trains[aspect].toarray().astype('float32')
    y_tensor = torch.tensor(y_dense, dtype=torch.float32)

    dataset = TensorDataset(X_train_tensor, y_tensor)

    cnn_loaders[aspect] = DataLoader(dataset, batch_size=128, shuffle=True)

    print(aspect, "loader ready:", y_tensor.shape)

P loader ready: torch.Size([82404, 16858])
C loader ready: torch.Size([82404, 2651])
F loader ready: torch.Size([82404, 6616])


In [24]:
criterion = nn.BCEWithLogitsLoss()

def train_one_cnn(model, loader, epochs=8, lr=1e-3, device='cuda'):
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    model.train()
    for ep in range(epochs):
        total_loss = 0.0
        
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            logits = model(X_batch)
            loss = criterion(logits, y_batch)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg = total_loss / len(loader)
        print(f"Epoch {ep+1}/{epochs}   Loss = {avg:.4f}")
    
    return model

In [25]:
trained_cnn_models = dict()

for aspect in ['P', 'C', 'F']:
    print("\nTraining CNN for", aspect)
    trained_cnn_models[aspect] = train_one_cnn(
        model=cnn_models[aspect],
        loader=cnn_loaders[aspect],
        epochs=8,
        lr=1e-3,
        device=device
    )


Training CNN for P
Epoch 1/8   Loss = 0.0046
Epoch 2/8   Loss = 0.0016
Epoch 3/8   Loss = 0.0015
Epoch 4/8   Loss = 0.0015
Epoch 5/8   Loss = 0.0014
Epoch 6/8   Loss = 0.0014
Epoch 7/8   Loss = 0.0014
Epoch 8/8   Loss = 0.0013

Training CNN for C
Epoch 1/8   Loss = 0.0065
Epoch 2/8   Loss = 0.0036
Epoch 3/8   Loss = 0.0034
Epoch 4/8   Loss = 0.0033
Epoch 5/8   Loss = 0.0032
Epoch 6/8   Loss = 0.0031
Epoch 7/8   Loss = 0.0030
Epoch 8/8   Loss = 0.0029

Training CNN for F
Epoch 1/8   Loss = 0.0042
Epoch 2/8   Loss = 0.0014
Epoch 3/8   Loss = 0.0013
Epoch 4/8   Loss = 0.0012
Epoch 5/8   Loss = 0.0011
Epoch 6/8   Loss = 0.0011
Epoch 7/8   Loss = 0.0010
Epoch 8/8   Loss = 0.0010


## Step 7: Inference and Submission

In [26]:
BATCH_SIZE = 5000  # avoid memory overflow
mlp_submission_list = []

for i in tqdm(range(0, len(test_ids), BATCH_SIZE), desc="Predicting with MLP"):
    batch_entry_ids = test_ids[i : i + BATCH_SIZE]

    # Slice features
    X_batch = X_test_norm[i : i + BATCH_SIZE]
    X_batch = torch.tensor(X_batch, dtype=torch.float32, device=device)

    # For each ontology aspect (P, F, C)
    for aspect, model in trained_mlp_models.items():
        model.eval()

        # Forward pass (logits → probabilities)
        with torch.no_grad():
            logits = model(X_batch)
            probs = torch.sigmoid(logits).cpu().numpy()  # ndarray (batch, num_labels)

        mlb = mlp_mlb_dict[aspect]  # MultiLabelBinarizer for this aspect

        # Loop over proteins in the batch
        for j, entry_id in enumerate(batch_entry_ids):
            prob_vec = probs[j]

            # threshold at 0.02
            candidate_indices = np.where(prob_vec > 0.02)[0]

            for idx in candidate_indices:
                mlp_submission_list.append(
                    (entry_id, mlb.classes_[idx], round(prob_vec[idx], 3))
                )

Predicting with MLP: 100%|██████████| 45/45 [01:12<00:00,  1.61s/it]
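
The per-protein Python loop does one `np.where` call per row. A possible alternative, sketched below under the assumption that `probs` is a 2-D array of shape (batch, num_labels), is to threshold the whole batch at once with `np.nonzero`. The function name and toy inputs are illustrative and the GO IDs are placeholders:

```python
import numpy as np


def threshold_batch(probs, entry_ids, classes, threshold=0.02):
    """Return (entry_id, term, score) for every cell above the threshold."""
    # For a 2-D input, np.nonzero returns row and column indices in row-major order.
    rows, cols = np.nonzero(probs > threshold)
    return [
        (entry_ids[r], str(classes[c]), round(float(probs[r, c]), 3))
        for r, c in zip(rows, cols)
    ]


probs = np.array([[0.01, 0.50, 0.03],
                  [0.00, 0.01, 0.90]], dtype=np.float32)
entry_ids = ["P1", "P2"]
classes = np.array(["GO:0000001", "GO:0000002", "GO:0000003"])

rows_out = threshold_batch(probs, entry_ids, classes)
print(rows_out)
```

The result has the same tuple layout as `mlp_submission_list`, so the merging step would not need to change.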


In [31]:
print(f"{len(mlp_submission_list):,}")

5,786,687


In [34]:
BATCH_SIZE = 5000
cnn_submission_list = []

for i in tqdm(range(0, len(test_ids), BATCH_SIZE), desc="Predicting with CNN"):
    batch_entry_ids = test_ids[i : i + BATCH_SIZE]

    # Slice features
    X_batch = X_test_norm[i : i + BATCH_SIZE]
    X_batch = torch.tensor(X_batch, dtype=torch.float32, device=device)

    # For each ontology aspect (P, F, C)
    for aspect, model in trained_cnn_models.items():
        model.eval()

        with torch.no_grad():
            logits = model(X_batch)                       # (batch, num_labels)
            probs = torch.sigmoid(logits).cpu().numpy()  # ndarray

        mlb = cnn_mlb_dict[aspect]

        # Loop over proteins in batch
        for j, entry_id in enumerate(batch_entry_ids):
            prob_vec = probs[j]

            # threshold at 0.02 (same as the MLP)
            candidate_indices = np.where(prob_vec > 0.02)[0]

            for idx in candidate_indices:
                cnn_submission_list.append(
                    (entry_id, mlb.classes_[idx], round(prob_vec[idx], 3))
                )

Predicting with CNN: 100%|██████████| 45/45 [01:10<00:00,  1.57s/it]


In [35]:
print(f"{len(cnn_submission_list):,}")

5,198,930


In [37]:
# Ensemble the models
merged_dict = {}  # key: (entry_id, GO), value: list of scores

# Add MLP predictions
for entry_id, go_term, score in tqdm(mlp_submission_list, desc="Merging MLP"):
    merged_dict.setdefault((entry_id, go_term), []).append(score)

# Add CNN predictions
for entry_id, go_term, score in tqdm(cnn_submission_list, desc="Merging CNN"):
    merged_dict.setdefault((entry_id, go_term), []).append(score)

# Final merged list
submission_list = []

for (entry_id, go_term), scores in tqdm(merged_dict.items(), desc="Averaging"):
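    # Mean over the models that emitted this pair; a pair that only one model
    # scored above the 0.02 threshold keeps that model's score unchanged.
    final_score = round(sum(scores) / len(scores), 3)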
    submission_list.append((entry_id, go_term, final_score))

Merging MLP: 100%|██████████| 5786687/5786687 [00:12<00:00, 462712.66it/s] 
Merging CNN: 100%|██████████| 5198930/5198930 [00:05<00:00, 903228.67it/s] 
Averaging: 100%|██████████| 7796023/7796023 [00:56<00:00, 138840.31it/s]


In [43]:
list(merged_dict.items())[2344]

(('O60284', 'GO:0006355'), [0.076, 0.059])
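
Because each model's list only contains pairs above the 0.02 threshold, the merge averages over the models that emitted a pair, not over both models. The sketch below contrasts that with averaging over all models and counting a missing pair as 0 (an approximation, since a missing pair only means the score was at most 0.02). The function names are illustrative, and the second MLP row is a made-up example:

```python
from collections import defaultdict


def average_over_present(*model_predictions):
    """Mean over the models that emitted the pair (the merge used above)."""
    scores = defaultdict(list)
    for preds in model_predictions:
        for entry_id, go_term, score in preds:
            scores[(entry_id, go_term)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}


def average_over_all(*model_predictions):
    """Mean over all models, treating a missing pair as score 0."""
    n_models = len(model_predictions)
    totals = defaultdict(float)
    for preds in model_predictions:
        for entry_id, go_term, score in preds:
            totals[(entry_id, go_term)] += score
    return {key: total / n_models for key, total in totals.items()}


mlp = [("O60284", "GO:0006355", 0.076), ("O60284", "GO:0005515", 0.040)]
cnn = [("O60284", "GO:0006355", 0.059)]

present = average_over_present(mlp, cnn)
overall = average_over_all(mlp, cnn)
print(present)
print(overall)
```

The two variants agree for pairs both models predicted and differ for pairs only one model predicted: the first keeps the single score, the second halves it.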

In [44]:
submission_df = pd.DataFrame(
    submission_list,
    columns=['Protein Id', 'GO Term Id', 'Prediction']
)

submission_df = submission_df.sort_values(
    by=['Protein Id', 'Prediction'],
    ascending=[True, False]
)

# Limit 1500 predictions per protein
final_submission_df = (
    submission_df.groupby('Protein Id')
    .head(1500)
    .reset_index(drop=True)
)

final_submission_df.to_csv('submission.tsv', sep='\t', index=False, header=False)
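
The sort followed by `groupby(...).head(1500)` keeps the highest-scoring predictions per protein. The same logic in plain Python, as a minimal sketch on toy rows; the helper name and the GO IDs are placeholders:

```python
from itertools import groupby
from operator import itemgetter


def top_k_per_protein(rows, k):
    """Keep the k highest-scoring (protein, term, score) rows per protein."""
    # Sort by protein ascending, then by score descending.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    kept = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        kept.extend(list(group)[:k])
    return kept


rows = [
    ("A", "GO:1", 0.2),
    ("A", "GO:2", 0.9),
    ("A", "GO:3", 0.5),
    ("B", "GO:1", 0.1),
]
top2 = top_k_per_protein(rows, k=2)
print(top2)
```

With a cap of 1500 and a threshold of 0.02, the final row count here equals the merged row count, so no protein in this run reached the cap.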

In [45]:
print("\nSubmission file 'submission.tsv' created successfully.")
print(f"Total predictions in final submission: {len(final_submission_df):,}")
print("Submission DataFrame Head:")
display(final_submission_df.head())


Submission file 'submission.tsv' created successfully.
Total predictions in final submission: 7,796,023
Submission DataFrame Head:


| | Protein Id | GO Term Id | Prediction |
|---|---|---|---|
| 0 | A0A017SE81 | GO:0005515 | 0.213 |
| 1 | A0A017SE81 | GO:0004745 | 0.149 |
| 2 | A0A017SE81 | GO:0004318 | 0.138 |
| 3 | A0A017SE81 | GO:0003824 | 0.127 |
| 4 | A0A017SE81 | GO:0016616 | 0.113 |
