In [1]:
import numpy as np
import pandas as pd

In [2]:
# Read dataset
df = pd.read_csv("train.csv")
df

Unnamed: 0,EntryID,sequence,taxonomyID,labels,seq_length,num_labels
0,A0A0C5B5G6,MRWQEMGYIFYPRKLR,9606,"['GO:0001649', 'GO:0033687', 'GO:0005615', 'GO...",16,14
1,A0JNW5,MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...,9606,"['GO:0120013', 'GO:0034498', 'GO:0005769', 'GO...",1464,8
2,A0JP26,MVAEVCSMPAASAVKKPFDLRSKMGKWCHHRFPCCRGSGKSNMGTS...,9606,['GO:0005515'],581,1
3,A0PK11,MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDL...,9606,"['GO:0007605', 'GO:0005515']",232,2
4,A1A4S6,MGLQPLEFSDCYLDSPWFRERIRAHEAELERTNKFIKELIKDGKNL...,9606,"['GO:0005829', 'GO:0010008', 'GO:0005515', 'GO...",786,5
...,...,...,...,...,...,...
82399,Q9UTM1,MSKLKAQSALQKLIESQKNPNANEDGYFRRKRLAKKERPFEPKKLV...,284812,['GO:0005730'],112,1
82400,Q9Y7I1,MSSNSNTDHSTGDNRSKSEKQTDLRNALRETESHGMPPLRGPAGFP...,284812,"['GO:0005634', 'GO:0005829']",78,2
82401,Q9Y7P7,MRSNNSSLVHCCWVSPPSLTRLPAFPSPRILSPCYCYNKRIRPFRG...,284812,"['GO:0005634', 'GO:0005829']",117,2
82402,Q9Y7Q3,MHSSRRKYNDMWTARLLIRSDQKEEKYPSFKKNAGKAINAHLIPKL...,284812,"['GO:0005634', 'GO:0005739', 'GO:0005829']",149,3


# Baseline experiment

Data preparation
- Feature extraction: from the sequence only (amino acid composition, AAC)
- Label encoding

Training
- One-vs-Rest classifier (Logistic Regression)


---

## Data preparation

### Feature Extraction: Turning sequence into numeric features

To produce fixed-size representations of amino acid sequences of varying length, the following methods are common in protein function prediction:
- Amino Acid Composition (AAC) `--> chosen as the baseline`
  - Count the frequency of each of the 20 standard amino acids.
- k-mer frequencies (dipeptides or tripeptides) `--> later tested for improvement`
  - Count how often each possible 2-mer or 3-mer appears.
- Physicochemical properties `--> later tested for improvement`
  - Compute features like:
    - Hydrophobicity
    - Charge
    - Molecular weight
    - Fraction of polar, aromatic, small amino acids
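The k-mer option listed above can be sketched as follows. This is a minimal illustration, not the notebook's later implementation: the function name `kmer_frequencies` and the choice to skip k-mers containing non-standard residues are assumptions made here. It returns a plain list so that it runs without extra dependencies.

```python
from itertools import product

STANDARD_AAS = "ACDEFGHIKLMNPQRSTVWY"  # same 20-letter alphabet as the AAC baseline


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies as a list of length 20**k.

    k-mers containing a non-standard residue are skipped
    (an assumption of this sketch).
    """
    kmers = ["".join(p) for p in product(STANDARD_AAS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    # Slide a window of width k over the sequence
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


dipeptide_vec = kmer_frequencies("MRWQEMGYIFYPRKLR", k=2)
print(len(dipeptide_vec))  # 400 features for dipeptides
```

The feature count grows as 20^k (400 for dipeptides, 8000 for tripeptides), which is one reason AAC is a reasonable first baseline.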

In [None]:
# AAC
from collections import Counter

standard_aas = 'ACDEFGHIKLMNPQRSTVWY'  # one-letter codes of the 20 standard amino acids

def extract_sequence_features(seq):
    """Convert a protein sequence into a 20-dim AAC vector."""
    length = len(seq)
    counts = Counter(seq)
    return np.array([counts.get(aa, 0) / length for aa in standard_aas])

# Example: Apply to dataframe
X_aac = np.stack(df['sequence'].apply(extract_sequence_features))

print("Shape of AAC feature matrix:", X_aac.shape, "\n---")
print(X_aac[0])

Shape of AAC feature matrix: (82404, 20) 
---
[0.     0.     0.     0.0625 0.0625 0.0625 0.     0.0625 0.0625 0.0625
 0.125  0.     0.0625 0.0625 0.1875 0.     0.     0.     0.0625 0.125 ]


### Label Encoding (Failed attempt): Turning GO terms to one-hot vectors

In [None]:
# # labels (GO terms)
# import ast
# from sklearn.preprocessing import MultiLabelBinarizer

# # Convert string to list
# df['labels'] = df['labels'].apply(ast.literal_eval)
# mlb = MultiLabelBinarizer()
# Y = mlb.fit_transform(df['labels'])
# print(Y.shape, Y.dtype)

(82404, 26125) int64


In [None]:
# # Since the memory requirement is too much --> Create a sparse version for Y
# from scipy.sparse import csr_matrix

# Y_sparse = csr_matrix(Y)  # stores only non-zero entries
# print(Y.shape, Y.dtype)

(82404, 26125) int64


💡 Why this failed:
- The original dense Y takes up too much memory --> unable to train
- The sparse Y allows training --> but training takes too long
- Bottom line: a target matrix of shape (82404, 26125) is too large and extremely sparse, and it makes the model's final layer enormous.
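A quick back-of-the-envelope calculation shows why the dense matrix does not fit in memory. The helper names below are hypothetical, and the CSR estimate assumes 32-bit index arrays, which is an approximation of how SciPy stores small enough matrices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory of a dense array: one item per cell (8 bytes for int64)."""
    return n_rows * n_cols * itemsize


def csr_bytes(nnz, n_rows, itemsize=8, index_itemsize=4):
    """Approximate memory of a CSR matrix: data + column indices + row pointers."""
    return nnz * (itemsize + index_itemsize) + (n_rows + 1) * index_itemsize


n_samples, n_labels = 82404, 26125  # shape printed above

print(f"Dense int64 Y: {dense_bytes(n_samples, n_labels) / 1e9:.1f} GB")
print(f"Dense 1-byte Y: {dense_bytes(n_samples, n_labels, itemsize=1) / 1e9:.2f} GB")
```

For the sparse estimate, `nnz` is the total number of (protein, term) pairs, which could be taken from `df['num_labels'].sum()` if that column counts the labels per protein as the table suggests.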

### Label Encoding: ontology-specific encoding

An output dimension of 26125 per sequence is too large, so we split the labels into three subsets, one per GO ontology: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Each sequence then gets three multi-hot vectors (the multi-label analogue of one-hot encoding):
- For MF: [0 1 0 1 ... 0] of length num_unique(MF_GO_terms)
- For BP: [0 1 0 1 ... 0] of length num_unique(BP_GO_terms)
- For CC: [0 1 0 1 ... 0] of length num_unique(CC_GO_terms)

We then train one model per ontology, three in total. To predict for a single example in the test set, we run all three models and gather their predictions.
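Gathering the three sets of predictions could look like the sketch below. The function name, the 0.5 threshold, and the toy scores are all hypothetical; in the notebook the scores for one protein would presumably come from something like `models[aspect].predict_proba(X_test)[i]` and the term names from `mlb_dict[aspect].classes_`.

```python
def gather_predictions(scores_by_aspect, terms_by_aspect, threshold=0.5):
    """Merge per-ontology scores for one protein into a single {GO term: score} dict."""
    merged = {}
    for aspect, scores in scores_by_aspect.items():
        # Scores are assumed to be in the same order as the term list of that aspect
        for term, score in zip(terms_by_aspect[aspect], scores):
            if score >= threshold:
                merged[term] = score
    return merged


# Toy inputs: made-up scores, with GO IDs used only as example labels
terms_by_aspect = {
    "F": ["GO:0005515"],
    "C": ["GO:0005634", "GO:0005829"],
    "P": ["GO:0007605"],
}
scores_by_aspect = {"F": [0.9], "C": [0.7, 0.2], "P": [0.4]}

predicted = gather_predictions(scores_by_aspect, terms_by_aspect)
print(predicted)  # {'GO:0005515': 0.9, 'GO:0005634': 0.7}
```

Because the three ontologies have disjoint term sets, merging into one dictionary cannot overwrite a term from another model.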

In the implementation, however, each label matrix must still have 82,404 rows (one per training sequence), including rows for proteins that have no terms in that ontology.

In [None]:
import pandas as pd
train_terms_df = pd.read_csv("data/Train/train_terms.tsv", sep="\t")

# Create three label sets
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

mlb_dict = dict()
models = dict()

X_train = X_aac

for aspect in ['P', 'C', 'F']:
    ont_terms_df = train_terms_df[train_terms_df['aspect'] == aspect]
    
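    # Map each protein to its list of terms in this aspect.
    # NOTE: proteins with no term in this aspect are absent from the dict,
    # so labels_list can have fewer rows than X_train.
    protein_terms = ont_terms_df.groupby('EntryID')['term'].apply(list).to_dict()
    labels_list = list(protein_terms.values())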
    
    mlb = MultiLabelBinarizer(sparse_output=True)
    y_train_ont = mlb.fit_transform(labels_list)
    
    print(f"Number of unique {aspect} terms: {y_train_ont.shape[1]}")
    
    mlb_dict[aspect] = mlb
    model = OneVsRestClassifier(LogisticRegression(max_iter=100, solver='lbfgs', C=0.5, random_state=42), n_jobs=-1)
    model.fit(X_train, y_train_ont)
    models[aspect] = model
    print(f"Model for {aspect} trained successfully.")

Number of unique P terms: 16858


ValueError: Found input variables with inconsistent numbers of samples: [82404, 59958]

💡 16858 + 2651 + 6616 = 26125, which matches the total number of unique GO terms

💡 Currently the number of rows (`shape[0]`) of the label matrix is not correct: only 59958 proteins have P terms, while `X_train` has 82404 rows --> refer to this implementation: https://www.kaggle.com/code/analyticaobscura/cafa-6-decoding-protein-mysteries
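One possible way to fix the row mismatch (not necessarily what the linked notebook does) is to build the label list in the order of the feature rows and give proteins without terms an empty list. The helper `align_labels` and the toy data below are hypothetical; the sketch assumes the IDs in `train_terms.tsv` match `df['EntryID']`.

```python
def align_labels(entry_ids, protein_terms):
    """Return one term list per entry ID, in order; proteins without terms get []."""
    return [protein_terms.get(entry_id, []) for entry_id in entry_ids]


# Toy stand-ins: in the notebook these would be df['EntryID'] and the
# per-aspect protein_terms dict built inside the loop
entry_ids = ["A0JP26", "A0PK11", "Q9UTM1"]
protein_terms = {"A0PK11": ["GO:0007605"]}  # only one protein has a term here

labels_list = align_labels(entry_ids, protein_terms)
print(labels_list)  # [[], ['GO:0007605'], []]
```

Passing such a list to `MultiLabelBinarizer.fit_transform` yields one row per entry, with all-zero rows for the empty lists, so the label matrix has the same number of rows as `X_train`.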

# Model training

Configuration:
- CPU: Intel Core i5-1135G7 (Family 6 Model 140), 1 processor, ~2.8 GHz, 4 cores / 8 threads
- RAM: 8 GB (approx. 1.3 GB available for programs)
- GPU: Integrated (Intel Iris Xe Graphics)
- OS: Windows 11 Home Single Language, 64-bit
- Python: 3.11.5