# Concept Interventions with CB-LBSTER

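This tutorial demonstrates how to perform concept interventions using LBSTER's Concept Bottleneck models (CB-LBSTER). A concept intervention overwrites the model's value for a named property of a protein sequence (a concept, such as `helix_fraction`) and regenerates part of the sequence conditioned on the new value, so that specific biological properties can be modified in a controlled manner.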

## Setup and Installation

First, make sure LBSTER is installed. From the root of a local clone of the repository, an editable install is:

```bash
pip install -e .
```

Import necessary libraries:

In [1]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from Bio.SeqUtils.ProtParam import ProteinAnalysis

In [None]:
from lobster.model import LobsterCBMPMLM
from lobster.concepts._utils import supported_biopython_concepts
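
Before running the rest of the notebook, it can help to confirm that the packages imported above are available in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is illustrative and not part of LBSTER. Note that Biopython is imported as `Bio`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names used in this tutorial ("Bio" is Biopython, "lobster" is LBSTER).
required = ["torch", "numpy", "pandas", "matplotlib", "seaborn", "Bio", "lobster"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it into the same environment that runs the notebook kernel.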

## Loading a Concept Bottleneck Model

Let's load a pre-trained Concept Bottleneck Model:

In [None]:
# Choose a concept bottleneck model
model_name = "asalam91/cb_lobster_24M"  # 24M parameter CB model

In [None]:
# Load the model
model = LobsterCBMPMLM(model_name)
model.eval()  # Set to evaluation mode

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"Model loaded on {device}")

In [None]:
# Get list of available concepts
print(f"Total number of concepts: {len(model.concept_names)}")
print("Example concepts:", model.concept_names[:10])
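
With many concepts available, it is often easier to search the names than to read the whole list. The sketch below assumes only that `model.concept_names` is a list of strings (the tutorial later calls `.index()` on it); the helper `find_concepts` and the example names are hypothetical.

```python
def find_concepts(concept_names, query):
    """Return (index, name) pairs whose name contains `query`, case-insensitively."""
    query = query.lower()
    return [(i, name) for i, name in enumerate(concept_names) if query in name.lower()]


# Hypothetical concept names for illustration; pass model.concept_names in practice.
example_names = ["molecular_weight", "helix_fraction", "sheet_fraction", "gravy"]
print(find_concepts(example_names, "fraction"))
# [(1, 'helix_fraction'), (2, 'sheet_fraction')]
```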

## Define Sample Sequences

We'll use a few sample protein sequences:

In [None]:
sequences = [
    "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR",  # Hemoglobin alpha
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH",  # Hemoglobin beta
    "EQKLISEEDLMAMVKQTLNSNLQFIHFIQKLINSQISLLIGKLFKKFNARIAKISAKEELRKHIAEQLNREVDYLEAKYAKKNREEMRKLEKEISQIKEDLKKTVESLQAKIQDLSKKYPGADAKKMEEQRQQLEEQKNKLQAEIENLLNSIDHAKKLKEEIAQLQEEISQLEDENEKLRRDIENQKENNKLLEEELTKLQAENSSLRKELEALTERLQDLYESLKLKDDDAVN",  # Tropomyosin
]

In [None]:
sequence_names = ["Hemoglobin α", "Hemoglobin β", "Tropomyosin"]

## Analyzing Original Concepts

First, let's analyze the concepts in the original sequences:

In [None]:
def extract_concepts(model, sequences, device="cuda"):
    """Extract concepts from sequences using the model."""
    concepts_list = []
    
    with torch.no_grad():
        for seq in sequences:
            # Tokenize and process the sequence
            tokens = model.tokenizer(seq, return_tensors="pt").to(device)
            
            # Get concepts
            outputs = model.model(
                input_ids=tokens["input_ids"],
                attention_mask=tokens["attention_mask"],
                inference=True
            )
            
            # Extract the concepts
            seq_concepts = outputs["concepts"].cpu().numpy().squeeze()
            concepts_list.append(seq_concepts)
    
    return np.array(concepts_list)

In [None]:
# Extract concepts from original sequences
original_concepts = extract_concepts(model, sequences, device)
print(f"Concepts shape: {original_concepts.shape}")

In [None]:
# Display top 5 concepts for each sequence
for i, seq_name in enumerate(sequence_names):
    # Get the top 5 concept indices for this sequence
    top_concept_indices = np.argsort(original_concepts[i])[-5:][::-1]
    
    # Display the top concepts and their values
    print(f"\nTop concepts for {seq_name}:")
    for idx in top_concept_indices:
        print(f"  {model.concept_names[idx]}: {original_concepts[i][idx]:.4f}")
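
The `np.argsort(...)[-5:][::-1]` idiom above sorts all values in ascending order, keeps the last five, and reverses them so the largest comes first. As a dependency-free sketch of the same ranking, `heapq.nlargest` from the standard library can be used; the helper name and the numbers below are placeholders, not model output.

```python
import heapq


def top_k_concepts(values, names, k=5):
    """Return the k (name, value) pairs with the largest values, largest first."""
    return heapq.nlargest(k, zip(names, values), key=lambda pair: pair[1])


# Placeholder names and values, not model output.
names = ["a", "b", "c", "d"]
values = [0.2, 0.9, 0.1, 0.5]
print(top_k_concepts(values, names, k=2))
# [('b', 0.9), ('d', 0.5)]
```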

## Concept Intervention

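Now, let's perform interventions on specific concepts. The function below masks a fraction `p_mask` of the input tokens, computes the concept values from the unmasked sequence, overwrites one of them with `intervention_value`, and lets the model fill in the masked positions conditioned on the modified concepts. Unmasked positions are copied from the input unchanged.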

In [None]:
def intervene_and_generate(model, sequence, concept_index, intervention_value, device="cuda", p_mask=0.25):
    """Intervene on a concept and generate a new sequence."""
    with torch.no_grad():
        # Tokenize and process the sequence
        tokens = model.tokenizer(sequence, return_tensors="pt").to(device)
        
        # Create a masked version of the input
        masked_tokens = model._mask_inputs(tokens["input_ids"], p_mask=p_mask)
        
        # Get concepts from the original sequence
        outputs = model.model(
            input_ids=tokens["input_ids"],
            attention_mask=tokens["attention_mask"],
            inference=True
        )
        
        # Get the original concepts
        original_concepts = outputs["concepts"]
        
        # Create a modified concept tensor
        modified_concepts = original_concepts.clone()
        modified_concepts[0, concept_index] = intervention_value
        
        # Generate with the modified concepts
        logits = model.model(
            input_ids=masked_tokens,
            concepts=modified_concepts,
            inference=True,
            attention_mask=tokens["attention_mask"]
        )["logits"]
        
        # Get the predicted tokens
        pred_tokens = logits.argmax(dim=-1)
        
        # Create the new sequence using the masked tokens as a template
        mask = (masked_tokens == model.tokenizer.mask_token_id).int()
        new_tokens = (masked_tokens * (1 - mask)) + (pred_tokens * mask)
        
        # Decode to get the new sequence
        new_sequence = model.tokenizer.decode(new_tokens[0]).replace(" ", "")
        
        # Clean the sequence to include only valid amino acids
        valid_aa = "ARNDCQEGHILKMFPSTWYV"
        new_sequence = "".join([aa for aa in new_sequence if aa in valid_aa])
        
        return new_sequence
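
The step that builds `new_tokens` is worth isolating: the model predicts a token at every position, but a prediction is only used where the input was masked. The following sketch reproduces that arithmetic on plain Python lists; the mask token id `32` and the token values are hypothetical.

```python
def fill_masked(masked_tokens, predicted_tokens, mask_token_id):
    """Keep unmasked tokens and take the prediction only where the input was masked."""
    mask = [int(tok == mask_token_id) for tok in masked_tokens]
    return [
        m_tok * (1 - m) + p_tok * m
        for m_tok, p_tok, m in zip(masked_tokens, predicted_tokens, mask)
    ]


MASK_ID = 32  # hypothetical mask token id
masked = [5, MASK_ID, 7, MASK_ID, 9]
predicted = [1, 11, 2, 13, 3]  # a prediction exists at every position
print(fill_masked(masked, predicted, MASK_ID))
# [5, 11, 7, 13, 9]
```

Because only masked positions can change, `p_mask` bounds how far the generated sequence can move away from the original.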

## Intervention Examples

Let's try intervening on the "helix_fraction" concept for the hemoglobin alpha sequence:

In [None]:
# Choose a sequence to modify
sequence_idx = 0  # Hemoglobin alpha
sequence = sequences[sequence_idx]
sequence_name = sequence_names[sequence_idx]

In [None]:
# Choose a concept to modify
concept_name = "helix_fraction"
concept_idx = model.concept_names.index(concept_name)

In [None]:
print(f"Original {concept_name} for {sequence_name}: {original_concepts[sequence_idx][concept_idx]:.4f}")

In [None]:
# Intervene with different values
intervention_values = [0.1, 0.3, 0.5, 0.7, 0.9]
generated_sequences = []

In [None]:
for value in intervention_values:
    new_seq = intervene_and_generate(model, sequence, concept_idx, value, device)
    generated_sequences.append(new_seq)
    print(f"\nIntervention value: {value}")
    print(f"Generated sequence: {new_seq[:50]}...")
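
Since only the first 50 residues are printed, differences between the original and a generated sequence are easy to miss. A minimal sketch for listing substitutions position by position is shown below. It assumes the two sequences have equal length, which may not hold if the cleaning step dropped non-standard tokens; the helper name and the example variant are made up for illustration.

```python
def changed_positions(original, generated):
    """Return (position, original_residue, new_residue) for each substitution."""
    if len(original) != len(generated):
        raise ValueError("sequences must have equal length for a position-wise comparison")
    return [(i, a, b) for i, (a, b) in enumerate(zip(original, generated)) if a != b]


original = "MVLSPADKTN"
generated = "MVLAPADKSN"  # made-up variant, not model output
diffs = changed_positions(original, generated)
identity = 1 - len(diffs) / len(original)
print(diffs)
# [(3, 'S', 'A'), (8, 'T', 'S')]
print(f"identity: {identity:.2f}")
# identity: 0.80
```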

## Analyzing the Effects of Interventions

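Let's analyze how the interventions affected the properties of the generated sequences as measured independently of the model. Here the measurement is BioPython's `secondary_structure_fraction`, a composition-based estimate computed from the amino acids in the sequence rather than a structure prediction. The concept values the model works with may be on a different scale than this estimate, so compare trends rather than absolute numbers: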

In [None]:
def calculate_helix_fraction(sequence):
    """Calculate the helix fraction using BioPython."""
    try:
        analysis = ProteinAnalysis(sequence)
        helix, turn, sheet = analysis.secondary_structure_fraction()
        return helix
    except Exception as e:
        print(f"Error analyzing sequence: {e}")
        return None
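
To make clear what a composition-based estimate measures, the sketch below computes the fraction of residues that belong to a fixed set. The residue set is an assumption for illustration only: the set BioPython uses for the helix fraction depends on the installed version, so consult its documentation rather than relying on this one.

```python
def residue_fraction(sequence, residues):
    """Fraction of positions in `sequence` whose residue is in `residues`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(aa in residues for aa in sequence) / len(sequence)


# Assumed helix-favoring residue set, for illustration only.
HELIX_RESIDUES = set("EMAL")
print(residue_fraction("MEALKKGG", HELIX_RESIDUES))
# 0.5
```

An estimate of this kind responds only to amino acid composition, so two sequences with the same composition in a different order receive the same value.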

In [None]:
# Calculate actual helix fractions for original and generated sequences
original_helix = calculate_helix_fraction(sequence)
generated_helix = [calculate_helix_fraction(seq) for seq in generated_sequences]

In [None]:
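print(f"\nOriginal {sequence_name} helix fraction: {original_helix:.4f}")
print("Helix fractions of generated sequences:")
for value, helix in zip(intervention_values, generated_helix):
    # calculate_helix_fraction returns None on failure, so guard the formatting
    helix_str = f"{helix:.4f}" if helix is not None else "n/a"
    print(f"  Intervention value {value}: {helix_str}")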

In [None]:
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(intervention_values, generated_helix, 'o-', markersize=10, linewidth=2)
plt.axhline(y=original_helix, color='r', linestyle='--', label=f'Original sequence ({original_helix:.4f})')
plt.xlabel('Intervention value for helix_fraction', fontsize=12)
plt.ylabel('Actual helix fraction of generated sequence', fontsize=12)
plt.title(f'Effect of helix_fraction intervention on {sequence_name}', fontsize=14)
plt.grid(alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
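
A plot shows the trend visually; a single number can summarize it. The sketch below computes the Pearson correlation between intervention values and measured fractions with the standard library only. The measured values are placeholders chosen for illustration and are not results from the model; in practice, pass `intervention_values` and `generated_helix`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


targets = [0.1, 0.3, 0.5, 0.7, 0.9]
measured = [0.30, 0.31, 0.33, 0.36, 0.40]  # placeholder numbers, not model output
print(f"Pearson r = {pearson(targets, measured):.3f}")
```

A value near 1 would indicate that the measured property rises with the intervention value; with only five points, treat it as a rough indicator.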

## Multi-Concept Intervention

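Now, let's try intervening on multiple concepts simultaneously. The function below follows the same steps as `intervene_and_generate`, but overwrites several concept values before the masked positions are filled in: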

In [None]:
def multi_concept_intervene(model, sequence, concept_indices, intervention_values, device="cuda", p_mask=0.25):
    """Intervene on multiple concepts and generate a new sequence."""
    with torch.no_grad():
        # Tokenize and process the sequence
        tokens = model.tokenizer(sequence, return_tensors="pt").to(device)
        
        # Create a masked version of the input
        masked_tokens = model._mask_inputs(tokens["input_ids"], p_mask=p_mask)
        
        # Get concepts from the original sequence
        outputs = model.model(
            input_ids=tokens["input_ids"],
            attention_mask=tokens["attention_mask"],
            inference=True
        )
        
        # Get the original concepts
        original_concepts = outputs["concepts"]
        
        # Create a modified concept tensor
        modified_concepts = original_concepts.clone()
        for idx, value in zip(concept_indices, intervention_values):
            modified_concepts[0, idx] = value
        
        # Generate with the modified concepts
        logits = model.model(
            input_ids=masked_tokens,
            concepts=modified_concepts,
            inference=True,
            attention_mask=tokens["attention_mask"]
        )["logits"]
        
        # Get the predicted tokens
        pred_tokens = logits.argmax(dim=-1)
        
        # Create the new sequence using the masked tokens as a template
        mask = (masked_tokens == model.tokenizer.mask_token_id).int()
        new_tokens = (masked_tokens * (1 - mask)) + (pred_tokens * mask)
        
        # Decode to get the new sequence
        new_sequence = model.tokenizer.decode(new_tokens[0]).replace(" ", "")
        
        # Clean the sequence to include only valid amino acids
        valid_aa = "ARNDCQEGHILKMFPSTWYV"
        new_sequence = "".join([aa for aa in new_sequence if aa in valid_aa])
        
        return new_sequence
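
`multi_concept_intervene` takes parallel lists of indices and values, which are easy to misalign. One option is to describe an intervention as a dictionary from concept name to target value and convert it just before the call. This is a sketch, assuming `model.concept_names` is a list of strings; the helper `resolve_interventions` and the example names are hypothetical.

```python
def resolve_interventions(concept_names, interventions):
    """Map {concept name: target value} to parallel lists of indices and values."""
    unknown = [name for name in interventions if name not in concept_names]
    if unknown:
        raise KeyError(f"unknown concepts: {unknown}")
    indices = [concept_names.index(name) for name in interventions]
    values = list(interventions.values())
    return indices, values


# Hypothetical concept names; pass model.concept_names in practice.
example_names = ["molecular_weight", "helix_fraction", "gravy"]
indices, values = resolve_interventions(
    example_names, {"helix_fraction": 0.9, "gravy": 0.1}
)
print(indices, values)
# [1, 2] [0.9, 0.1]
```

The two returned lists can be passed as `concept_indices` and `intervention_values`.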

In [None]:
# Choose two concepts to modify simultaneously
concept1_name = "helix_fraction"
concept2_name = "gravy"  # hydrophobicity

In [None]:
concept1_idx = model.concept_names.index(concept1_name)
concept2_idx = model.concept_names.index(concept2_name)

In [None]:
print(f"Original {concept1_name} for {sequence_name}: {original_concepts[sequence_idx][concept1_idx]:.4f}")
print(f"Original {concept2_name} for {sequence_name}: {original_concepts[sequence_idx][concept2_idx]:.4f}")

In [None]:
# Generate with high helix, low hydrophobicity
high_helix_low_hydro = multi_concept_intervene(
    model, sequence, [concept1_idx, concept2_idx], [0.9, 0.1], device
)

In [None]:
# Generate with low helix, high hydrophobicity  
low_helix_high_hydro = multi_concept_intervene(
    model, sequence, [concept1_idx, concept2_idx], [0.1, 0.9], device
)

In [None]:
# Analyze results
def calculate_gravy(sequence):
    """Calculate the GRAVY (hydrophobicity) using BioPython."""
    try:
        analysis = ProteinAnalysis(sequence)
        return analysis.gravy()
    except Exception as e:
        print(f"Error analyzing sequence: {e}")
        return None
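
GRAVY (grand average of hydropathy) is the mean of per-residue hydropathy values, so it is straightforward to compute by hand. The sketch below uses the Kyte-Doolittle scale, which to our understanding is the scale BioPython's `gravy()` applies by default; treat that as an assumption and check it against your installed version. The function name is illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_by_hand(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues of `sequence`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


print(round(gravy_by_hand("IR"), 2))    # 0.0 (4.5 and -4.5 cancel)
print(round(gravy_by_hand("AAAA"), 2))  # 1.8
```

Positive values indicate a more hydrophobic sequence on average, negative values a more hydrophilic one.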

In [None]:
# Calculate actual properties
original_helix = calculate_helix_fraction(sequence)
original_gravy = calculate_gravy(sequence)

In [None]:
high_helix_low_hydro_helix = calculate_helix_fraction(high_helix_low_hydro)
high_helix_low_hydro_gravy = calculate_gravy(high_helix_low_hydro)

In [None]:
low_helix_high_hydro_helix = calculate_helix_fraction(low_helix_high_hydro)
low_helix_high_hydro_gravy = calculate_gravy(low_helix_high_hydro)

In [None]:
print("\nMulti-concept intervention results:")
print(f"Original sequence: Helix={original_helix:.4f}, GRAVY={original_gravy:.4f}")
print(f"High helix, low hydro: Helix={high_helix_low_hydro_helix:.4f}, GRAVY={high_helix_low_hydro_gravy:.4f}")
print(f"Low helix, high hydro: Helix={low_helix_high_hydro_helix:.4f}, GRAVY={low_helix_high_hydro_gravy:.4f}")
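
The two settings above are opposite corners of a grid. To explore every combination of a few target values per concept, the settings can be enumerated with `itertools.product`. This sketch only builds the settings and does not call the model; `intervention_grid` is a hypothetical helper.

```python
from itertools import product


def intervention_grid(concept_values):
    """Yield one {concept name: value} dict per combination of the given values."""
    names = list(concept_values)
    for combo in product(*(concept_values[name] for name in names)):
        yield dict(zip(names, combo))


grid = list(intervention_grid({"helix_fraction": [0.1, 0.9], "gravy": [0.1, 0.9]}))
for setting in grid:
    print(setting)
```

Each yielded dictionary describes one intervention; look up the index of each name with `model.concept_names.index(...)` and pass the indices and values to `multi_concept_intervene`.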

## Conclusion

In this tutorial, we've demonstrated how to:

1. Load and use LBSTER's Concept Bottleneck Models
2. Extract and analyze the concepts in protein sequences
3. Perform single-concept interventions to modify specific properties
4. Perform multi-concept interventions to simultaneously adjust multiple properties
5. Analyze the effects of interventions on the actual properties of generated sequences

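Concept interventions provide a tool for controlled protein sequence design: you set target values for specific properties, and the model regenerates only the masked positions, leaving the rest of the sequence unchanged. Whether a generated sequence retains the overall structure and function of the original protein is not checked in this tutorial and has to be validated separately.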