# Logistic Regression with ESM-2 embeddings

Idea:
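- Model: OneVsRestClassifier wrapping a logistic-regression classifier (SGDClassifier with loss='log_loss' in the code below) --> one binary classifier per GO term
- Features: ESM-2 embeddings (1280-D) --> reduced to 100 dimensions with PCA(n_components=100), fitted on the training set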
- Labels: Three sets for three ontologies (P, C, F)
    - P has 16858 classes
    - C has 2651 classes
    - F has 6616 classes
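
The idea above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's training code: the array sizes, the `toy_` names, and the use of `LogisticRegression` as the base estimator are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 200 "proteins", 32-D "embeddings", 5 "GO terms"
X_toy = rng.normal(size=(200, 32))
W_toy = rng.normal(size=(32, 5))
Y_toy = (X_toy @ W_toy > 1.0).astype(int)  # multi-hot label matrix

# PCA for dimensionality reduction, then one binary classifier per label column
toy_pipeline = make_pipeline(
    PCA(n_components=10, random_state=42),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
toy_pipeline.fit(X_toy, Y_toy)

# One probability per (sample, label) pair
toy_proba = toy_pipeline.predict_proba(X_toy)
print(toy_proba.shape)
```

Each column of `toy_proba` comes from an independent binary classifier, so the values in a row do not need to sum to 1.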

Future direction:
- Model --> MLP
- Features --> full embeddings
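
One possible shape for the MLP direction, assuming scikit-learn's `MLPClassifier` (which accepts a multi-hot label matrix). The data, layer size, and iteration count here are placeholders; with thousands of GO terms a dedicated deep-learning framework may be a better fit.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 300 samples, 64-D features, 4 labels
X_toy = rng.normal(size=(300, 64))
Y_toy = (X_toy @ rng.normal(size=(64, 4)) > 0).astype(int)

# A single hidden layer; multi-label targets are handled natively
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=200, random_state=42)
mlp.fit(X_toy, Y_toy)

mlp_proba = mlp.predict_proba(X_toy)
print(mlp_proba.shape)
```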

References:
- (EDA + OneVsRestClassifier) https://www.kaggle.com/code/analyticaobscura/cafa-6-decoding-protein-mysteries
- (ESM-2 320-D embeddings) https://www.kaggle.com/code/dalloliogm/compute-protein-embeddings-with-esm2-esm-c/notebook
- (Optional ProtT5 1024-D embeddings) https://www.kaggle.com/code/ahsuna123/t5-embedding-calculation-cafa-6/output?select=train_ids.npy

---

In [1]:
!pip install biopython > /dev/null

## Step 1: Load CAFA6 files

---

In [2]:
# CAFA6 file paths
TRAIN_TERMS = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_terms.tsv"
TRAIN_SEQ = "/kaggle/input/cafa-6-protein-function-prediction/Train/train_sequences.fasta"
TEST_SEQ = "/kaggle/input/cafa-6-protein-function-prediction/Test/testsuperset.fasta"

In [3]:
from Bio import SeqIO 

# Map each FASTA record ID to its sequence: {record_id: sequence}
train_sequences = {rec.id: str(rec.seq) for rec in SeqIO.parse(TRAIN_SEQ, 'fasta')}
test_sequences  = {rec.id: str(rec.seq) for rec in SeqIO.parse(TEST_SEQ,  'fasta')}

print(f'Loaded {len(train_sequences)} train and {len(test_sequences)} test sequences')

Loaded 82404 train and 224309 test sequences


In [4]:
print("Train dict:", list(train_sequences.items())[0])
print("Test dict:", list(test_sequences.items())[0])

Train dict: ('sp|A0A0C5B5G6|MOTSC_HUMAN', 'MRWQEMGYIFYPRKLR')
Test dict: ('A0A0C5B5G6', 'MRWQEMGYIFYPRKLR')


In [5]:
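# Train record IDs look like 'sp|A0A0C5B5G6|MOTSC_HUMAN'; keep only the accession (second field)
train_ids = [i.split('|')[1] for i in train_sequences.keys()]
# Test record IDs are already bare accessions
test_ids = list(test_sequences.keys())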

In [6]:
print("train_ids[0:10]:", train_ids[0:10])
print("test_ids[0:10]:", test_ids[0:10])

train_ids[0:10]: ['A0A0C5B5G6', 'A0JNW5', 'A0JP26', 'A0PK11', 'A1A4S6', 'A1A519', 'A1L190', 'A1L3X0', 'A1X283', 'A2A2Y4']
test_ids[0:10]: ['A0A0C5B5G6', 'A0A1B0GTW7', 'A0JNW5', 'A0JP26', 'A0PK11', 'A1A4S6', 'A1A519', 'A1L190', 'A1L3X0', 'A1X283']
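
The two ID formats shown above could also be handled by one helper. This is a small sketch; `extract_accession` is a hypothetical name, and it assumes that pipe-delimited IDs always carry the accession in the second field.

```python
def extract_accession(record_id: str) -> str:
    """Return the accession from a FASTA record ID."""
    parts = record_id.split("|")
    # 'sp|ACC|NAME' style IDs hold the accession in the second field;
    # bare IDs (as in the test file) are returned unchanged.
    return parts[1] if len(parts) >= 3 else record_id


print(extract_accession("sp|A0A0C5B5G6|MOTSC_HUMAN"))
print(extract_accession("A0A0C5B5G6"))
```

Using the same function for both files avoids an `IndexError` if a record ID has no `|` separator.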


## Step 2: Feature extraction

---

In [7]:
# Embeddings file paths
ESM_EMBEDDINGS = "/kaggle/input/cafa6-esm2-650m-embedding/esm2_650M"
TRAIN_EMBEDDINGS = ESM_EMBEDDINGS + "/train_sequences_emb.npy"
TEST_EMBEDDINGS = ESM_EMBEDDINGS + "/testsuperset_emb.npy"

In [8]:
import numpy as np

# Load precomputed embeddings (assumed to hold one row per sequence, in FASTA order)
X_train = np.load(TRAIN_EMBEDDINGS)
X_test = np.load(TEST_EMBEDDINGS)

In [9]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (82404, 1280)
X_test shape: (224309, 1280)
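
The row counts above match the number of sequences loaded in Step 1, which the later steps rely on. A check along these lines makes that explicit. `check_alignment` is a hypothetical helper, and matching counts are necessary but not sufficient: the row order is still an assumption about how the embedding files were produced.

```python
import numpy as np


def check_alignment(ids, embeddings, name="split"):
    """Raise ValueError unless `embeddings` is 2-D with one row per ID."""
    if embeddings.ndim != 2:
        raise ValueError(f"{name}: expected a 2-D array, got shape {embeddings.shape}")
    if embeddings.shape[0] != len(ids):
        raise ValueError(
            f"{name}: {len(ids)} IDs but {embeddings.shape[0]} embedding rows"
        )
    return embeddings.shape[1]  # embedding dimensionality


# Toy stand-ins (assumptions) for train_ids and X_train
toy_ids = ["P1", "P2", "P3"]
toy_emb = np.zeros((3, 8), dtype=np.float32)
embedding_dim = check_alignment(toy_ids, toy_emb, name="toy")
print(embedding_dim)
```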


In [10]:
from sklearn.decomposition import PCA

# Fit PCA on the training embeddings only, then apply the same projection to the test set
pca = PCA(n_components=100, random_state=42)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced  = pca.transform(X_test)

In [11]:
print("X_train_reduced shape:", X_train_reduced.shape)
print("X_test_reduced shape:", X_test_reduced.shape)

X_train_reduced shape: (82404, 100)
X_test_reduced shape: (224309, 100)
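
Reducing 1280 dimensions to 100 discards some variance, and `explained_variance_ratio_` shows how much is kept. The sketch below uses synthetic low-rank data (an assumption, not the ESM-2 embeddings), so its numbers say nothing about the real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumption): 8 latent factors embedded in 64 dimensions, plus small noise
latent = rng.normal(size=(500, 8))
X_toy = latent @ rng.normal(size=(8, 64)) + 0.1 * rng.normal(size=(500, 64))

pca_toy = PCA(n_components=16, random_state=42).fit(X_toy)

# Cumulative share of variance retained by the first k components
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
print(f"Variance retained by 16 components: {cumulative[-1]:.3f}")

# Smallest number of components whose cumulative share reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95%:", n_for_95)
```

Running the same two lines on the fitted `pca` object from the cell above would show how much variance the 100 retained components cover.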


## Step 3: Label encoding and Training

---

In [12]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from tqdm import tqdm

mlb_dict = dict()
models = dict()

train_terms_df = pd.read_csv(TRAIN_TERMS, sep="\t")

for aspect in tqdm(['P', 'C', 'F'], desc="Training models"):
    # Filter the train_terms_df based on aspect
    ont_terms_df = train_terms_df[train_terms_df['aspect'] == aspect]

    # Group by EntryID and collect each protein's GO terms into a list: {EntryID: [terms]}
    protein_terms = ont_terms_df.groupby('EntryID')['term'].apply(list).to_dict()

    # Build the label list in train_ids order; a protein with no terms in this aspect gets []
    # so that y_train always has one row per training protein (82404 rows)
    labels = [protein_terms.get(entry_id, []) for entry_id in train_ids]

    # Multi-hot encode the labels as a sparse matrix
    mlb = MultiLabelBinarizer(sparse_output=True)
    y_train = mlb.fit_transform(labels)
    
    print(f"y_train shape for {aspect} ontology: {y_train.shape} \t Number of unique {aspect} terms: {y_train.shape[1]}")

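    # Keep the fitted binarizer so predicted column indices can be mapped back to GO terms
    mlb_dict[aspect] = mlb

    # Logistic regression trained with SGD: one binary classifier per GO term
    model = OneVsRestClassifier(
        SGDClassifier(
            loss='log_loss',
            alpha=1e-4,
            max_iter=1000,
            tol=1e-3,
            random_state=42
        ),
        n_jobs=-1  # fit the per-term classifiers in parallel
    )
    model.fit(X_train_reduced, y_train)
    models[aspect] = model
    print(f"Model for {aspect} trained successfully.")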

Training models:   0%|          | 0/3 [00:00<?, ?it/s]

y_train shape for P ontology: (82404, 16858) 	 Number of unique P terms: 16858


Training models:  33%|███▎      | 1/3 [28:58<57:57, 1738.65s/it]

Model for P trained successfully.
y_train shape for C ontology: (82404, 2651) 	 Number of unique C terms: 2651


Training models:  67%|██████▋   | 2/3 [33:34<14:38, 878.41s/it] 

Model for C trained successfully.
y_train shape for F ontology: (82404, 6616) 	 Number of unique F terms: 6616


Training models: 100%|██████████| 3/3 [44:52<00:00, 897.52s/it]

Model for F trained successfully.
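
The label-encoding step can be seen in isolation on a toy example. The protein IDs and GO term IDs below are placeholders; the point is that a protein with no terms in an aspect still gets a row, filled with zeros.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins (assumptions): three proteins, one of which has no annotations
toy_ids = ["P1", "P2", "P3"]
toy_terms = {"P1": ["GO:0000001", "GO:0000002"], "P3": ["GO:0000002"]}

# Same pattern as the training cell: missing proteins map to an empty list
toy_labels = [toy_terms.get(pid, []) for pid in toy_ids]

toy_mlb = MultiLabelBinarizer(sparse_output=True)
Y_toy = toy_mlb.fit_transform(toy_labels)

print(toy_mlb.classes_)  # column order of the label matrix
print(Y_toy.toarray())   # dense view, only sensible for small examples
```

`classes_` is what maps a predicted column index back to a GO term at inference time.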





## Step 4: Inference and Submission

---

In [13]:
BATCH_SIZE = 5000  # predict in batches to limit peak memory use
submission_list = []

for i in tqdm(range(0, len(test_ids), BATCH_SIZE), desc="Predicting on Test Set"):
    batch_entry_ids = test_ids[i : i + BATCH_SIZE]

    X_batch = X_test_reduced[i : i + BATCH_SIZE]

    for aspect, model in models.items():
        mlb = mlb_dict[aspect]
        y_pred_proba = model.predict_proba(X_batch)

        for j, entry_id in enumerate(batch_entry_ids):
            probs = y_pred_proba[j]
            # Keep only the GO terms whose predicted probability exceeds 0.02
            candidate_indices = np.where(probs > 0.02)[0]

            for idx in candidate_indices:
                submission_list.append((entry_id, mlb.classes_[idx], round(probs[idx], 3)))

Predicting on Test Set: 100%|██████████| 45/45 [13:58<00:00, 18.64s/it]
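
The per-protein loop above could also be expressed with a single `np.nonzero` call per batch. This is a sketch of an alternative, not the code that produced the timings shown; `threshold_to_rows` is a hypothetical helper and the inputs are toy values.

```python
import numpy as np


def threshold_to_rows(entry_ids, classes, proba, threshold=0.02):
    """Return (entry_id, term, score) tuples for every probability above threshold."""
    # Row and column indices of all entries above the threshold, in row-major order
    rows, cols = np.nonzero(proba > threshold)
    return [
        (entry_ids[r], str(classes[c]), round(float(proba[r, c]), 3))
        for r, c in zip(rows, cols)
    ]


# Toy stand-ins (assumptions) for a batch of IDs, mlb.classes_, and predict_proba output
toy_entry_ids = ["P1", "P2"]
toy_classes = np.array(["GO:0000001", "GO:0000002"])
toy_proba = np.array([[0.50, 0.01], [0.03, 0.90]])

toy_rows = threshold_to_rows(toy_entry_ids, toy_classes, toy_proba)
print(toy_rows)
```

Finding all candidate positions in one call avoids a separate `np.where` per protein; the list comprehension still visits each kept prediction once.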


In [14]:
submission_df = pd.DataFrame(submission_list, columns=['Protein Id', 'GO Term Id', 'Prediction'])
submission_df.to_csv('submission_no_limit.tsv', sep='\t', index=False, header=False)

print("Applying 1500 prediction limit per protein...")
# Sort each protein's predictions by descending score, then keep its top 1500
submission_df = submission_df.sort_values(by=['Protein Id', 'Prediction'], ascending=[True, False])
final_submission_df = submission_df.groupby('Protein Id').head(1500).reset_index(drop=True)
final_submission_df.to_csv('submission.tsv', sep='\t', index=False, header=False)

print("\nSubmission file 'submission.tsv' created successfully.")
print(f"Total predictions in final submission: {len(final_submission_df):,}")
print("Submission DataFrame Head:")
display(final_submission_df.head())

Applying 1500 prediction limit per protein...

Submission file 'submission.tsv' created successfully.
Total predictions in final submission: 4,543,007
Submission DataFrame Head:


|   | Protein Id | GO Term Id | Prediction |
|---|------------|------------|------------|
| 0 | A0A017SE81 | GO:0005886 | 0.106 |
| 1 | A0A017SE81 | GO:0005515 | 0.086 |
| 2 | A0A017SE81 | GO:0005739 | 0.082 |
| 3 | A0A017SE81 | GO:0005777 | 0.081 |
| 4 | A0A017SE81 | GO:0005829 | 0.077 |
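
The per-protein cap works by sorting and then taking the head of each group. The toy example below uses a cap of 2 instead of 1500 so the effect is visible; the IDs and scores are made up.

```python
import pandas as pd

# Toy predictions (assumptions): P1 has three rows, P2 has one
toy_df = pd.DataFrame(
    {
        "Protein Id": ["P1", "P1", "P1", "P2"],
        "GO Term Id": ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000001"],
        "Prediction": [0.2, 0.9, 0.5, 0.4],
    }
)

TOY_LIMIT = 2  # stands in for the 1500-prediction limit

# Highest scores first within each protein, then keep the first TOY_LIMIT rows per group
capped = (
    toy_df.sort_values(by=["Protein Id", "Prediction"], ascending=[True, False])
    .groupby("Protein Id")
    .head(TOY_LIMIT)
    .reset_index(drop=True)
)
print(capped)
```

Because the sort happens before `head`, the rows dropped for each protein are always its lowest-scoring ones.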
