# CAFA-5 Protein Function Prediction - Kaggle Environment

+ In this phase we focus on protein embeddings extraction. Protein embeddings are numerical representations of protein sequences, we convert each protein into a vector of numbers (a list of 1024 floats) that captures its biological meaning. These vectors allow us to compare proteins, implement machine learning models, and detect functional similarities. 

+ We used a pretrained Protein Language Model called ProtBert which was trained on millions of protein sequences. 

More specifically:

1. We read protein sequences from a fasta file.
2. We split each amino acid with spaces and used ProtBert's tokenizer to convert it into a model-friendly format.
3. Then, we passed each tokenized protein into ProtBert. It processed the sequence and returned embeddings for each amino acid.
4. We took the mean of all amino acid embeddings to get a single vector per protein.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/cafa-5-protein-function-prediction/sample_submission.tsv
/kaggle/input/cafa-5-protein-function-prediction/IA.txt
/kaggle/input/cafa-5-protein-function-prediction/Test (Targets)/testsuperset.fasta
/kaggle/input/cafa-5-protein-function-prediction/Test (Targets)/testsuperset-taxon-list.tsv
/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv
/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta
/kaggle/input/cafa-5-protein-function-prediction/Train/train_taxonomy.tsv
/kaggle/input/cafa-5-protein-function-prediction/Train/go-basic.obo


We installed the necessary libraries.


In [None]:
!pip install biopython progressbar transformers


### Step 1: Load Protein Sequences from FASTA File

In this step, we load the raw protein sequences from the FASTA file provided in the CAFA5 dataset.

- We use the `Bio.SeqIO` module from the Biopython library to parse the `.fasta` file.
- Each protein sequence is stored in a dictionary `sequences`, where:
  - The key is the protein's unique identifier (record ID)
  - The value is the corresponding amino acid sequence as a string

📦 **Output:**  
A Python dictionary mapping each protein ID to its amino acid sequence.  
This will be used as input to the Protein Language Model in the next step.

✅ **Example:**  
```python
{'P12345': 'MKWVTFISLLFLFSSAYS', 'P67890': 'GHHHHHHHHHHHHH', ...}


In [None]:
import torch
import numpy as np
import pandas as pd
from Bio import SeqIO
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader
import progressbar

# Load sequences
train_fasta_path = '/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta'
sequences = {record.id: str(record.seq) for record in SeqIO.parse(train_fasta_path, 'fasta')}
print(f"Loaded {len(sequences)} protein sequences.")


Loaded 142246 protein sequences.


In [5]:
# Paths (use kaggle's built-in paths)
train_fasta_path = '/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta'
train_terms_path = '/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv'


### Model Setup: ProtBERT for Protein Embeddings

In this step, we set up the environment and load the pretrained **ProtBERT** model to generate protein sequence embeddings:

- `device`: Automatically selects GPU (`cuda`) if available, otherwise uses CPU.
- `MODEL_NAME`: Specifies the pretrained model (`Rostlab/prot_bert_bfd`) from HuggingFace.
- `tokenizer`: Converts protein sequences (amino acid strings) into token IDs that the model understands.
- `model`: Loads the ProtBERT model, moves it to the selected device, converts it to half-precision (`.half()`) for faster inference, and switches it to evaluation mode.
- `max_len`: Sets the maximum sequence length (tokens) for padding/truncation.
- `batch_size`: Defines how many sequences will be processed together during inference.


In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MODEL_NAME = 'Rostlab/prot_bert_bfd'

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, do_lower_case=False)
model = AutoModel.from_pretrained(MODEL_NAME).half().to(device)
model.eval()

# Configuration
max_len = 512
batch_size = 128

### Creating a Protein Dataset and DataLoader

This section prepares the protein sequences for input into the ProtBERT model:

- **`ProteinDataset` class**: 
  - A custom PyTorch dataset that stores protein IDs and their sequences.
  - `__getitem__`: returns a tuple (protein ID, sequence) at the given index.

- **`collate_fn` function**:
  - Custom function for batching data during loading.
  - Takes a batch of sequences and uses the `tokenizer` to convert them into token IDs, applying:
    - Padding (`max_length`)
    - Truncation (if too long)
    - Output format suitable for PyTorch tensors on the selected `device`.

- **`DataLoader`**:
  - Batches the dataset for efficient model inference.
  - Uses the `collate_fn` for proper tokenization.
  - `shuffle=False` ensures consistent order, and `num_workers=0` avoids CUDA errors during tokenization.


In [None]:
# Protein dataset
class ProteinDataset(Dataset):
    def __init__(self, seq_dict):
        self.ids = list(seq_dict.keys())
        self.sequences = list(seq_dict.values())

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        return self.ids[idx], self.sequences[idx]

# Collate function with tokenizer
def collate_fn(batch):
    pids, sequences = zip(*batch)
    encoded = tokenizer(
        list(sequences),
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_tensors='pt'
    )
    return list(pids), {k: v.to(device) for k, v in encoded.items()}

# Create dataset and dataloader
dataset = ProteinDataset(sequences)
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=False,
    collate_fn=collate_fn
)

### Embedding Extraction from ProtBERT

This section performs the actual embedding extraction using the ProtBERT model:

- **`with torch.no_grad()`**:
  - Disables gradient computation to save memory and speed up inference.

- **Progress bar**:
  - Tracks the loading and processing progress across protein batches.

- **Loop through DataLoader**:
  - Each batch contains `pids` (protein IDs) and `inputs` (tokenized sequences).
  - On the first batch, the tokenized input is decoded and printed for debugging.

- **Model inference**:
  - The model outputs token-level embeddings for each sequence.
  - `mean(dim=1)` computes the average embedding across all tokens in a sequence, producing a single fixed-size vector per protein.

- **Store results**:
  - The resulting embeddings are stored in `embeddings_dict` keyed by protein ID.

- **Save to CSV**:
  - The dictionary is converted into a DataFrame.
  - The result is saved as `protein_embeddings.csv` for later use in downstream analysis or classification tasks.


In [None]:
# Extraction of embeddings
embeddings_dict = {}
with torch.no_grad():
    bar = progressbar.ProgressBar(maxval=len(loader)).start()
    for idx, (pids, inputs) in enumerate(loader):

        # Debug print only for the first batch
        if idx == 0:
            print("🧬 Decoded input for first sequence:")
            print(tokenizer.decode(inputs['input_ids'][0]))

        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1).float().cpu().numpy()
        embeddings_dict.update(dict(zip(pids, embeddings)))
        bar.update(idx + 1)
    bar.finish()

# Save to CSV
embedding_df = pd.DataFrame.from_dict(embeddings_dict, orient='index')
embedding_df.index.name = 'Protein Id'
embedding_df.reset_index(inplace=True)
embedding_df.to_csv('protein_embeddings.csv', index=False)


  0% |                                                                                            |

🧬 Decoded input for first sequence:
[CLS] [UNK] [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

100% |############################################################################################|
