# Using Pre-trained Models for Protein Data Analysis
## Morning Session: Understanding and Using Pre-trained Models

### 1. Introduction
Pre-trained models have become a cornerstone in modern protein analysis. They offer several key advantages:

- **Why use pre-trained models?**
  * Save computational resources (training from scratch can take weeks/months)
  * Leverage existing knowledge (models trained on millions of protein sequences)
  * Faster development time (focus on fine-tuning rather than architecture design)
  * Better performance on small datasets (transfer learning benefits)

- **Common pre-trained protein models:**
  * ESM (Evolutionary Scale Modeling) - Meta's protein language model
  * ProtBERT - BERT architecture adapted for proteins
  * ProtT5 - T5 architecture for protein sequences
  * UniRep - Universal protein representations
  * SaProt - Structure-aware protein language model

### Key Concepts:
- **Embeddings**: Numerical representations of protein sequences
- **Fine-tuning**: Adapting pre-trained models for specific tasks
- **Transfer Learning**: Using knowledge from one task to improve another

### Pre-Exercise: Select Computing Device
We first need to determine whether to use GPU (CUDA) or CPU for our computations.
Note: For this tutorial, we'll use CPU to ensure compatibility across all systems.

In [None]:
import torch

# To use a GPU when available: torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")  # forced to CPU so the notebook runs on every system

### Exercise 1: Loading Pre-trained Models
In this exercise, we'll load ESM-2, a widely used protein language model for sequence analysis.
 
**Important Notes:**
- ESM2_t6_8M_UR50D is a small variant (6 layers, about 8M parameters) suitable for laptops
- The larger ESM2_t33_650M_UR50D model (33 layers, about 650M parameters) offers better performance but requires more resources
- The model is loaded from Hugging Face's model hub
- Hint: https://huggingface.co/docs/transformers/en/model_doc/esm

In [None]:
from transformers import _____, _____

model_name = "facebook/esm2_t6_8M_UR50D"  # esm2_t33_650M_UR50D is recommended, but t6_8M is smaller and manageable on a laptop.

# Complete the code to load the model and tokenizer
model = _____.from_pretrained(model_name)
tokenizer = _____.from_pretrained(model_name)

### Exercise 2: Generate Embeddings
Embeddings are dense vector representations of protein sequences. They capture:
- Amino acid properties
- Local structure information
- Evolutionary relationships

The embedding process involves:
1. Tokenization (converting amino acids to tokens)
2. Forward pass through the model
3. Extraction of the embedding vectors
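
Step 1 can be illustrated without loading a model. The sketch below uses a hypothetical vocabulary (`TOY_VOCAB`, with made-up ids that are not the real ESM-2 token ids) to show how residues become integer ids, with a start and an end marker added around the sequence.

```python
# Toy tokenizer: the vocabulary and ids below are illustrative assumptions,
# not the actual ESM-2 token ids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>"]
TOY_VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def toy_tokenize(sequence):
    """Map each residue to an integer id, wrapped in <cls> ... <eos>."""
    unk = TOY_VOCAB["<unk>"]
    ids = [TOY_VOCAB["<cls>"]]
    ids += [TOY_VOCAB.get(residue, unk) for residue in sequence.upper()]
    ids.append(TOY_VOCAB["<eos>"])
    return ids


token_ids = toy_tokenize("MLEL")
print(token_ids)  # [0, 14, 13, 7, 13, 2]
```

Because of the two special tokens, a sequence of `n` residues yields `n + 2` token ids in this sketch, so the per-token embedding matrix has two more rows than the protein has residues.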


In [None]:
sequence = "MLELLPTAVEGVSQAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAIAFTMYLSMLLGYGLTMVPFGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGTILALVGADGIMIGTGLVGALTKVYSYRFVWWAISTAAMLYILYVLFFGFTSKAESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLLFMVLDVSAKVGFGLILLRSRAIFGEAEAPEPSAGDGAAATSD"

# Tokenize
inputs = tokenizer(sequence, return_tensors='pt')

# Get embeddings
outputs = _____(**inputs)

embeddings = outputs._____ # Select the appropriate output

### Data Pipeline Architecture

```mermaid
graph LR
    A[Raw Protein Data] --> B[ProteinDataset]
    B --> C[DataLoader]
    C --> D[ProteinClassifier]
    D --> E[Training Loop]
```

**Detailed Pipeline Steps:**
1. **Raw Protein Data**: Sequences in FASTA format or similar
2. **ProteinDataset**: Custom PyTorch Dataset class for efficient data handling
3. **DataLoader**: Batches data and enables parallel processing
4. **ProteinClassifier**: Model architecture combining pre-trained model and task-specific layers
5. **Training Loop**: Iterative process of model optimization
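
For step 1, raw FASTA text has to be turned into plain sequence strings before it can feed a dataset. A minimal sketch, assuming a simple FASTA layout (a `>` header line followed by one or more sequence lines); `parse_fasta` is a hypothetical helper, not part of PyTorch or Transformers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>protein_a
MLELLPTAVEG
VSQAQITGRP
>protein_b
KVFGRCELAAAMKRHGLDNYR
"""
records = parse_fasta(example)
fasta_sequences = [seq for _, seq in records]
print(fasta_sequences)
```

The resulting list of strings has the same shape as the `sequences` argument that `ProteinDataset` expects.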

### Exercise 3: Prepare a Simple Fine-tuning Dataset
 
The `ProteinDataset` class is crucial for efficient data handling in PyTorch. It serves several important purposes:
 
1. **Data Organization**: 
   - Keeps sequences and labels paired
   - Handles data preprocessing consistently

2. **Memory Efficiency**:
   - Loads data on-demand rather than all at once
   - Essential when working with large protein datasets
 
3. **Tokenization**:
   - Converts amino acid sequences into model-readable format
   - Applies consistent preprocessing across all sequences


In [None]:
import torch
from torch.utils.data import Dataset

class ProteinDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
        
    def __len__(self):
        return _____
        
    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        label = self.labels[idx]
        
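        # Tokenize sequence (relies on the `tokenizer` loaded in Exercise 1;
        # pad to a fixed length so that batches can be stacked)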
        inputs = _____
        
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels': torch.tensor(label)
        }

### DataLoader Usage Example
The DataLoader is a crucial component that:
- Handles batch creation
- Shuffles data for better training
- Enables parallel data loading
- Manages memory efficiently

**Key Parameters:**
- `batch_size`: Controls memory usage and training stability
- `shuffle`: Randomizes data order for better training
- `num_workers`: Enables parallel data loading
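
What `batch_size` and `shuffle` control can be shown with a conceptual sketch in plain Python. This is not how PyTorch implements `DataLoader`; `iterate_batches` is a hypothetical helper, and `num_workers` is not modeled here.

```python
import random


def iterate_batches(items, batch_size, shuffle=False, seed=None):
    """Yield lists of at most batch_size items, optionally in shuffled order."""
    indices = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [items[i] for i in indices[start:start + batch_size]]


toy_sequences = ["MLELL", "KVFGR", "MAEGE", "ACDEF", "GHIKL"]
batches = list(iterate_batches(toy_sequences, batch_size=2))
print(batches)  # three batches of sizes 2, 2 and 1
```

With five items and `batch_size=2`, the last batch is smaller than the others; shuffling changes which items share a batch but not the batch sizes.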

Example of how `ProteinDataset` is used in the complete pipeline:


In [None]:
from torch.utils.data import DataLoader

sequences = ["MLELL...", "KVFGR...", ...]  # Your protein sequences
labels = [0, 1, ...]  # Your corresponding labels

dataset = ProteinDataset(sequences, labels)

# Create DataLoader for batch processing
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True
)

# Now train_loader will yield batches like:
# {
#     'input_ids': tensor of shape [batch_size, seq_length],
#     'attention_mask': tensor of shape [batch_size, seq_length],
#     'labels': tensor of shape [batch_size]
# }
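
The batch shapes above assume every sequence in a batch has the same token length, since tensors of different lengths cannot be stacked into one `[batch_size, seq_length]` tensor. A sketch of padding and the matching attention mask, written with plain lists so the logic is visible; `pad_batch` is a hypothetical helper and `pad_id=1` is an illustrative assumption rather than the tokenizer's actual pad id.

```python
def pad_batch(token_id_lists, pad_id=1, max_length=None):
    """Pad (and optionally truncate) token id lists to one common length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    if max_length is None:
        max_length = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        ids = ids[:max_length]                 # truncate overly long inputs
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[0, 14, 13, 2], [0, 12, 2]])
print(input_ids)       # [[0, 14, 13, 2], [0, 12, 2, 1]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice the Hugging Face tokenizer can do this itself through its `padding`, `truncation` and `max_length` arguments, as used later in `predict_sequence`.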

### Exercise 4: Create a Classification Head
 
The `ProteinClassifier` architecture combines:
 
1. **Pre-trained Model**:
   - Provides protein sequence understanding
   - Weights can be frozen to preserve learned features (the class in this exercise leaves them trainable)
 
2. **Custom Classification Head**:
   - Task-specific layers
   - Trainable parameters for your specific problem
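
Freezing is done by switching off gradients for the pre-trained part. A minimal sketch, using a small stand-in `backbone` instead of the real ESM model (an assumption, so that nothing has to be downloaded); the layer sizes and learning rate are arbitrary.

```python
import torch
import torch.nn as nn

# Stand-in for the pre-trained encoder; a real run would use the loaded ESM model.
backbone = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
head = nn.Linear(8, 2)  # task-specific classification head
toy_model = nn.Sequential(backbone, head)

# Freeze the backbone so that only the head receives gradient updates.
for param in backbone.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in toy_model.parameters() if p.requires_grad]
toy_optimizer = torch.optim.Adam(trainable, lr=1e-4)

n_trainable = sum(p.numel() for p in trainable)
n_frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {n_trainable}, frozen: {n_frozen}")
```

For the classifier in this exercise, the same loop could be applied to `model.pretrained_model.parameters()`. Whether to freeze is a trade-off: frozen weights train faster and are less prone to overfitting on small datasets, while full fine-tuning can adapt the representation to the task.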
 
**Architecture Decisions:**
- Using mean pooling for sequence representation
- Two-layer classification head with ReLU activation
- Output dimension matches number of classes
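
One caveat on mean pooling: a plain `torch.mean` over the sequence dimension also averages padding positions when a batch is padded. A common variant weights the average by the attention mask. The sketch below is not part of the original exercise; `masked_mean_pool` is a hypothetical helper.

```python
import torch


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens only.

    token_embeddings: [batch, seq_len, hidden]
    attention_mask:   [batch, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Two sequences, three positions, hidden size 2. The last position of the
# second sequence is padding and should not affect its average.
emb = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                    [[2.0, 2.0], [4.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
pooled = masked_mean_pool(emb, mask)
print(pooled)  # both rows equal [3., 3.]
```

This matters most when sequences are padded to a long fixed length, because a short protein's embedding would otherwise be dominated by padding positions.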

In [None]:
import torch
import torch.nn as nn

class ProteinClassifier(nn.Module):
    def __init__(self, pretrained_model, num_labels):
        super().__init__()
        self.pretrained_model = pretrained_model
        self.classifier = nn.Sequential(
            _____(pretrained_model.config.hidden_size, 256),
            nn.ReLU(),
            _____(256, num_labels)
        )
        
    def forward(self, input_ids, attention_mask):
        outputs = self.pretrained_model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]  # token-level hidden states: [batch, seq_len, hidden]
        pooled_output = torch.mean(sequence_output, dim=1)  # average over token positions
        return self.classifier(pooled_output)

Example of how `ProteinClassifier` fits in the pipeline:


In [None]:
# Initialize the model
pretrained_model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
num_labels = 2  # binary classification example
model = ProteinClassifier(pretrained_model, num_labels)

# The model processes data like this:
# 1. Takes tokenized sequences from ProteinDataset
# 2. Passes them through pre-trained ESM model
# 3. Applies classification head to get predictions

### Exercise 5: Training Loop Implementation

The training loop is where the actual learning happens. Key components include:

1. **Optimization Process**:
   - Forward pass: Generate predictions
   - Loss calculation: Measure prediction error
   - Backward pass: Calculate gradients
   - Parameter updates: Improve model

2. **Training Considerations**:
   - Learning rate selection
   - Batch size impact
   - Gradient clipping (if needed)
   - Model checkpointing


In [None]:
def train_model(model, train_loader, optimizer, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = _____(input_ids, attention_mask)
            loss = _____
            
            loss.backward()
            optimizer.step()
           
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Now we should be able to take a pre-trained model and start fine-tuning it on our data:

In [None]:
# Example usage with dummy data
sequences = [
    "MLELLPTAVEGVSQAQITGRP",
    "KVFGRCELAAAMKRHGLDNYR"
]
labels = [0, 1]  # Binary classification example

# Create dataset and dataloader
dataset = ProteinDataset(sequences, labels)
train_loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Initialize model
pretrained_model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = ProteinClassifier(pretrained_model, num_labels=2)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Train
train_model(model, train_loader, optimizer, num_epochs=2)
torch.save(model.state_dict(), 'model_weights.pth')
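
Gradient clipping, listed under the training considerations, is not used in `train_model`. A minimal sketch of where it would go, using a tiny stand-in `nn.Linear` classifier and toy data (both assumptions, not the tutorial's model); `max_norm=1.0` is an arbitrary illustrative value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in classifier and toy data (assumptions for illustration only).
toy_model = nn.Linear(4, 2)
features = torch.eye(4)                  # four one-hot "samples"
targets = torch.tensor([0, 0, 1, 1])

criterion = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.1)

losses = []
for step in range(50):
    toy_optimizer.zero_grad()
    logits = toy_model(features)         # forward pass
    loss = criterion(logits, targets)    # loss calculation
    loss.backward()                      # backward pass
    # Rescale gradients so that their overall norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
    toy_optimizer.step()                 # parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The clipping call sits between `loss.backward()` and `optimizer.step()`, since it modifies the gradients that the optimizer is about to use.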

Let's take a break here. After the break, we continue with making predictions.

### Exercise 6: Making Predictions
 
 Making predictions involves several important considerations:
 
 1. **Single vs. Batch Predictions**:
    - Single: Better for quick testing
    - Batch: More efficient for large-scale predictions
 
 2. **Model Evaluation Mode**:
    - Disables dropout
    - Uses running statistics for batch normalization
 
 3. **Prediction Output**:
    - Raw logits vs. probabilities
    - Confidence scores
    - Class predictions
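
The relationship between raw logits, probabilities, confidence and the class prediction can be shown with a small softmax written in plain Python. This is a sketch for illustration; the prediction functions in this exercise use `torch.softmax` and `torch.argmax`, and the logit values below are made up.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0]                        # raw model outputs for two classes
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # class prediction (argmax)
confidence = max(probs)                    # confidence score
print(predicted_class, round(confidence, 3))  # 0 0.881
```

The predicted class is the same whether the argmax is taken over logits or probabilities; softmax only matters when the probabilities themselves are needed, for example as confidence scores.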

In [None]:
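def predict_sequence(model, tokenizer, sequence, device=None):
    """
    Make prediction for a single protein sequence
    """
    # Default to the device the model already lives on
    if device is None:
        device = next(model.parameters()).device

    # Set model to evaluation mode
    model.eval()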
    
    # Tokenize sequence
    inputs = tokenizer(
        sequence,
        padding='max_length',
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )
    
    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Make prediction
    with torch.no_grad():
        outputs = model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask']
        )
        
        # Get probabilities
        probs = torch.softmax(outputs, dim=1)
        
        # Get predicted class
        predicted_class = torch.argmax(probs, dim=1)
    
    return {
        'predicted_class': predicted_class.item(),
        'probabilities': probs.squeeze().cpu().numpy()
    }

In [None]:
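def predict_batch(model, tokenizer, sequences, batch_size=32, device=None):
    """
    Make predictions for a list of protein sequences.
    Note: ProteinDataset as defined above uses the global tokenizer,
    so the tokenizer argument is unused here.
    """
    # Default to the device the model already lives on
    if device is None:
        device = next(model.parameters()).device
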
    # Create dataset
    dataset = ProteinDataset(sequences, labels=[0]*len(sequences))  # dummy labels
    dataloader = DataLoader(dataset, batch_size=batch_size)
    
    # Lists to store predictions
    all_predictions = []
    all_probabilities = []
    
    # Set model to evaluation mode
    model.eval()
    
    with torch.no_grad():
        for batch in dataloader:
            # Move batch to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            
            # Get predictions
            outputs = model(input_ids, attention_mask)
            probs = torch.softmax(outputs, dim=1)
            predictions = torch.argmax(probs, dim=1)
            
            # Store results
            all_predictions.extend(predictions.cpu().numpy())
            all_probabilities.extend(probs.cpu().numpy())
    
    return {
        'predictions': all_predictions,
        'probabilities': all_probabilities
    }

### Model Loading and Inference
 
 When loading a trained model, consider:
 
 1. **Model State**:
    - Architecture must match training
    - Weights must be compatible
 
 2. **Device Placement**:
    - CPU vs. GPU considerations
    - Memory management
 
 3. **Inference Settings**:
    - Batch size optimization
    - Memory vs. speed tradeoffs
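
The first two points can be checked with a small save and load round trip. The sketch below uses a stand-in architecture from a hypothetical `build_model` function and an in-memory buffer instead of a `.pth` file, so it runs without a trained model on disk.

```python
import io

import torch
import torch.nn as nn


def build_model():
    # Stand-in architecture; it must be rebuilt identically before loading weights.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


trained = build_model()
buffer = io.BytesIO()                    # in-memory stand-in for a .pth file
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

restored = build_model()
# map_location lets weights saved on a GPU load on a CPU-only machine.
state_dict = torch.load(buffer, map_location="cpu")
restored.load_state_dict(state_dict)
restored.eval()

x = torch.ones(1, 4)
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
print(same_output)  # True
```

If the rebuilt architecture differs from the one that was saved, `load_state_dict` raises a `RuntimeError` listing the missing and unexpected keys, which is usually the first sign of a mismatch.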

In [None]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
from transformers import EsmModel, EsmTokenizer

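# Load your saved model
def load_model(model_path, device=None):
    """
    Load a saved model
    """
    # Use the GPU only when one is actually available
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the pretrained model first
    pretrained_model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
    
    # Initialize your classifier
    model = ProteinClassifier(pretrained_model, num_labels=2)
    
    # Load the saved weights; map_location remaps tensors saved on another device
    model.load_state_dict(torch.load(model_path, map_location=device))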
    
    # Move to device
    model = model.to(device)
    
    return model

Let's see if we can get some predictions:

In [None]:
model_path = "path/to/your/saved/model.pth"
model = load_model(model_path)
tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Example sequences
test_sequences = [
    "MLELLPTAVEGVSQAQITGRP",
    "KVFGRCELAAAMKRHGLDNYR",
    "MAEGEITTFTALTEKFNLPPG"
]

# 1. Single sequence prediction
print("\nSingle Sequence Prediction:")
result = predict_sequence(model, tokenizer, test_sequences[0])
print(f"Predicted class: {result['predicted_class']}")
print(f"Class probabilities: {result['probabilities']}")

# 2. Batch prediction
print("\nBatch Prediction:")
results = predict_batch(model, tokenizer, test_sequences)

# Create DataFrame for nice output
df = pd.DataFrame({
    'Sequence': test_sequences,
    'Predicted_Class': results['predictions'],
    'Probability_Class_0': [prob[0] for prob in results['probabilities']],
    'Probability_Class_1': [prob[1] for prob in results['probabilities']]
})

print(df)

### Take-home Exercise 7: Advanced Prediction Features
 
These advanced features enhance the practical utility of your model:
 
 1. **Confidence Thresholding**:
    - Reduces false predictions
    - Handles uncertainty
    - Important for production systems
 
 2. **Result Persistence**:
    - Structured data storage
    - Analysis-ready format
    - Reproducibility support

Only make predictions if the confidence is above a certain threshold:

In [None]:
def predict_with_confidence(model, tokenizer, sequence, confidence_threshold=0.8):
    """
    Make prediction only if confidence exceeds threshold
    """
    result = predict_sequence(model, tokenizer, sequence)
    max_prob = np.max(result['probabilities'])
    
    if max_prob >= confidence_threshold:
        return {
            'prediction': result['predicted_class'],
            'confidence': max_prob,
            'status': 'confident'
        }
    else:
        return {
            'prediction': None,
            'confidence': max_prob,
            'status': 'uncertain'
        }
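
`predict_with_confidence` handles one sequence at a time. For batch output, the same rule can be applied to each row of class probabilities. A sketch with made-up probabilities; `label_confidence` is a hypothetical helper that expects rows shaped like the `'probabilities'` entries returned by `predict_batch`.

```python
def label_confidence(probabilities, threshold=0.8):
    """Return one status dict per row of class probabilities."""
    rows = []
    for probs in probabilities:
        confidence = float(max(probs))
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        is_confident = confidence >= threshold
        rows.append({
            "prediction": predicted if is_confident else None,
            "confidence": confidence,
            "status": "confident" if is_confident else "uncertain",
        })
    return rows


toy_probabilities = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]]
labelled = label_confidence(toy_probabilities)
for row in labelled:
    print(row)
```

The threshold of 0.8 is a starting point rather than a rule; a sensible value depends on how costly a wrong prediction is compared with leaving a sequence unlabelled.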

Save your predictions to a CSV file for later:

In [None]:
def save_predictions(sequences, predictions, probabilities, output_file):
    """
    Save predictions to CSV file
    """
    results_df = pd.DataFrame({
        'sequence': sequences,
        'predicted_class': predictions,
        'probability_class_0': [p[0] for p in probabilities],
        'probability_class_1': [p[1] for p in probabilities]
    })
    
    results_df.to_csv(output_file, index=False)
    return results_df

A more complete pipeline with error handling:

In [None]:
def prediction_pipeline(
    model_path,
    input_file,
    output_file,
    batch_size=32,
    confidence_threshold=0.8
):
    """
    Complete prediction pipeline with error handling
    """
    try:
        # Load model
        model = load_model(model_path)
        tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")  # same checkpoint as the model in load_model
        
        # Read sequences
        df = pd.read_csv(input_file)
        sequences = df['sequence'].tolist()
        
        # Make predictions
        results = predict_batch(model, tokenizer, sequences, batch_size)
        
        # Filter by confidence
        confident_mask = [max(probs) >= confidence_threshold 
                         for probs in results['probabilities']]
        
        # Save results
        results_df = save_predictions(
            sequences,
            results['predictions'],
            results['probabilities'],
            output_file
        )
        
        print(f"Processed {len(sequences)} sequences")
        print(f"Confident predictions: {sum(confident_mask)}")
        
        return results_df
        
    except Exception as e:
        print(f"Error in prediction pipeline: {str(e)}")
        return None

Usage example:

In [None]:
# Example usage of the complete pipeline
input_file = "protein_sequences.csv"
output_file = "predictions.csv"
model_path = "trained_model.pth"

results = prediction_pipeline(
    model_path=model_path,
    input_file=input_file,
    output_file=output_file,
    batch_size=32,
    confidence_threshold=0.8
)

if results is not None:
    print("\nFirst few predictions:")
    print(results.head())