# Embedding Extraction with LBSTER

This tutorial demonstrates how to extract embeddings from protein sequences using pre-trained LBSTER models.

## Setup and Installation

First, install LBSTER. From the root of a local clone of the repository, an editable install is:

```bash
pip install -e .
```

We'll start by importing the necessary libraries:

In [1]:
import torch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [2]:
from lobster.model import LobsterPMLM, LobsterCBMPMLM


## Loading a Pre-trained Model

LBSTER provides several pre-trained models. Let's load a masked language model:

In [3]:
# Choose a model to use
model_name = "asalam91/lobster_24M"  # 24M parameter model

In [4]:
# Load the model
model = LobsterPMLM(model_name)
model.eval()  # Set to evaluation mode

LobsterPMLM(
  (_transform_fn): PmlmTokenizerTransform()
  (model): LMBaseForMaskedLM(
    (LMBase): LMBaseModel(
      (embeddings): LMBaseEmbeddings(
        (word_embeddings): Embedding(32, 408, padding_idx=1)
        (dropout): Dropout(p=0.1, inplace=False)
        (position_embeddings): Embedding(512, 408, padding_idx=1)
      )
      (encoder): LMBaseEncoder(
        (layer): ModuleList(
          (0-9): 10 x LMBaseLayer(
            (attention): LMBaseAttention(
              (self): LMBaseSelfAttention(
                (query): Linear(in_features=408, out_features=408, bias=False)
                (key): Linear(in_features=408, out_features=408, bias=False)
                (value): Linear(in_features=408, out_features=408, bias=False)
                (dropout): Dropout(p=0.0, inplace=False)
                (rotary_embeddings): RotaryEmbedding()
              )
              (output): LMBaseSelfOutput(
                (dense): Linear(in_features=408, out_features=408, bias=True)


In [5]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"Model loaded on {device}")

Model loaded on cpu


## Sample Protein Sequences

Let's define some sample protein sequences to extract embeddings from:

In [6]:
sequences = [
    "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR",  # Hemoglobin alpha
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH",  # Hemoglobin beta
    "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL",  # T4 Lysozyme
    "MEAPAAGAAPPPGPALGNGVAGAGGEAAAAPGGGGEAPARKRGRPGGDNHGPGREARDGPRERLGAGPADAGPGAPGSQHPGGRGRGGGPGLSTLPGGGPGPGGFGPLGFPMRGRGGPGPGGFGPRGGPGAAGFPTRGRGGGPGPDGF",  # Heterogeneous Nuclear Ribonucleoprotein A1
]
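
Before tokenizing, it can help to confirm that each sequence uses only the 20 standard amino-acid letters and fits within the model's context. The sketch below is not part of LBSTER: `check_sequence`, `max_length`, and `reserved_tokens` are hypothetical names, the 512 limit is an assumption read from the `position_embeddings` table in the printed model, and the two reserved positions assume the tokenizer adds start and end tokens.

```python
# Hypothetical helper, not part of LBSTER: basic sanity checks for a protein sequence.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequence(seq: str, max_length: int = 512, reserved_tokens: int = 2) -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if not seq:
        problems.append("sequence is empty")
    # Any letter outside the 20 standard residues is reported.
    unknown = sorted(set(seq.upper()) - STANDARD_AMINO_ACIDS)
    if unknown:
        problems.append(f"non-standard residues: {''.join(unknown)}")
    # Assumption: the tokenizer adds special tokens, so leave room for them.
    if len(seq) + reserved_tokens > max_length:
        problems.append(f"length {len(seq)} exceeds {max_length - reserved_tokens} residues")
    return problems


example = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
print(check_sequence(example))   # []
print(check_sequence("MKTX*"))   # ['non-standard residues: *X']
```

Running `check_sequence` over every entry in `sequences` before the extraction loop surfaces typos or over-long inputs early, rather than as a tokenizer or model error.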

## Extracting Embeddings

Now we'll extract embeddings from these sequences using our model:

In [7]:
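# Turn off gradient calculation for inference
with torch.no_grad():
    # Get embeddings for each sequence
    embeddings = []
    for seq in sequences:
        # Tokenize and process the sequence
        tokens = model.tokenizer(seq, return_tensors="pt").to(device)

        # Run the model and request the per-layer hidden states.
        # The default output is not a plain tensor, so indexing it with
        # [:, 0, :] raises "TypeError: tuple indices must be integers or
        # slices, not tuple".
        # Assumption: the wrapped model follows the Hugging Face convention
        # for `output_hidden_states`.
        outputs = model.model(
            input_ids=tokens["input_ids"],
            attention_mask=tokens["attention_mask"],
            output_hidden_states=True,
        )

        # Last layer's hidden states: (batch, sequence_length, hidden_size)
        last_hidden_state = outputs["hidden_states"][-1]

        # Extract the [CLS] token embedding (position 0)
        cls_embedding = last_hidden_state[:, 0, :].cpu().numpy()
        embeddings.append(cls_embedding.squeeze())

    # Convert list to numpy array
    embeddings = np.array(embeddings)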

In [None]:
print(f"Embedding shape: {embeddings.shape}")
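
The loop above keeps only the vector at the `[CLS]` position. A common alternative is to average the per-token vectors while ignoring padding. The sketch below illustrates that pooling step on a toy array standing in for model output; `masked_mean_pool` is a hypothetical helper, and the `(batch, length, dim)` layout is an assumption about the hidden states.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over positions where attention_mask == 1.

    hidden: (batch, length, dim); attention_mask: (batch, length) of 0/1.
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, length, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)          # (batch, 1), avoids division by zero
    return summed / counts


# Toy stand-in for model output: 2 sequences, 4 positions, 3 hidden dimensions.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])  # second sequence is padded
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)  # (2, 3)
```

Note that this average includes any special tokens the mask marks as real positions; whether to exclude them is a design choice that depends on the downstream task.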

## Visualizing Embeddings

Let's visualize the embeddings using PCA to reduce dimensions:

In [None]:
# Reduce dimensions with PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

In [None]:
# Plot the embeddings and label each point in a single cell, so that
# every call draws on the same figure
labels = ["Hemoglobin α", "Hemoglobin β", "T4 Lysozyme", "hnRNP A1"]

plt.figure(figsize=(10, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100)

# Add labels
for i, seq_name in enumerate(labels):
    plt.annotate(seq_name, (embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=12)

plt.title("PCA of Protein Embeddings", fontsize=14)
plt.xlabel("PC1", fontsize=12)
plt.ylabel("PC2", fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
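
To make the PCA step less of a black box, the sketch below reproduces the projection with a plain SVD on random data standing in for the embeddings. `pca_svd` is a hypothetical helper, and the 408-dimensional toy input simply mirrors the hidden size shown in the printed model. Component signs may differ from those returned by scikit-learn.

```python
import numpy as np


def pca_svd(x: np.ndarray, n_components: int = 2):
    """Project the rows of x onto the top principal components using an SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions; s holds the singular values.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained_ratio = (s ** 2) / np.sum(s ** 2)
    return projected, explained_ratio


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 408))  # stand-in for 4 protein embeddings
projected, ratio = pca_svd(toy_embeddings)
print(projected.shape)  # (4, 2)
print(ratio.round(3))   # fraction of variance per component; the last is ~0
```

With only four sequences, the centered data has rank at most three, so at most three components carry any variance. A 2D plot of four points is best read as a quick sanity check rather than evidence of structure in the embedding space.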

## Using Concept Bottleneck Models

LBSTER also includes concept bottleneck models, which expose interpretable concepts for each sequence:

In [None]:
# Load a concept bottleneck model
cb_model_name = "asalam91/cb_lobster_24M"
cb_model = LobsterCBMPMLM(cb_model_name)
cb_model.eval()
cb_model = cb_model.to(device)

In [None]:
# Extract concepts
concepts = []
with torch.no_grad():
    for seq in sequences:
        # Tokenize and process the sequence
        tokens = cb_model.tokenizer(seq, return_tensors="pt").to(device)
        
        # Get concepts
        outputs = cb_model.model(
            input_ids=tokens["input_ids"],
            attention_mask=tokens["attention_mask"],
            inference=True
        )
        
        # Extract the concepts
        seq_concepts = outputs["concepts"].cpu().numpy().squeeze()
        concepts.append(seq_concepts)
    
    # Convert list to numpy array
    concepts = np.array(concepts)

In [None]:
print(f"Concepts shape: {concepts.shape}")

## Analyzing Top Concepts

In [None]:
# Look up the concept names, keeping one name per extracted concept value
concept_names = cb_model.concept_names[:concepts.shape[1]]

In [None]:
for i, seq_name in enumerate(["Hemoglobin α", "Hemoglobin β", "T4 Lysozyme", "hnRNP A1"]):
    # Get the top 5 concept indices for this sequence
    top_concept_indices = np.argsort(concepts[i])[-5:][::-1]
    
    # Display the top concepts and their values
    print(f"\nTop concepts for {seq_name}:")
    for idx in top_concept_indices:
        print(f"  {concept_names[idx]}: {concepts[i][idx]:.4f}")
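
The `np.argsort(...)[-5:][::-1]` idiom used above can be hard to read at a glance. The toy example below walks through it with `k = 3`; the scores and the names `c0` to `c5` are made up for illustration and are not real concept names.

```python
import numpy as np

scores = np.array([0.10, 0.90, 0.30, 0.70, 0.50, 0.20])
names = ["c0", "c1", "c2", "c3", "c4", "c5"]  # hypothetical concept names

k = 3
ascending = np.argsort(scores)  # indices ordered from smallest to largest score
top_k = ascending[-k:][::-1]    # last k indices, reversed so the largest comes first

for idx in top_k:
    print(f"{names[idx]}: {scores[idx]:.4f}")
# c1: 0.9000
# c3: 0.7000
# c4: 0.5000
```

This ranks by raw value, so concepts with large negative values never appear in the list; sorting by `np.abs(scores)` instead would be one way to surface them.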

## Conclusion

In this tutorial, we've demonstrated how to:

1. Load pre-trained LBSTER models
2. Extract embeddings from protein sequences
3. Visualize these embeddings using PCA
4. Extract and analyze interpretable concepts using the concept bottleneck models

These embeddings can be used for various downstream tasks such as clustering, classification, or visualization of protein sequences.
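
As one illustration of such a downstream use, the sketch below computes pairwise cosine similarity between embedding vectors, a common first step before clustering or nearest-neighbor search. `cosine_similarity_matrix` is a hypothetical helper, and the toy vectors stand in for real embeddings.

```python
import numpy as np


def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim)
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 0. 1.]]
```

Applied to the `embeddings` array from this tutorial, the result would be a 4 x 4 matrix in which entry `(i, j)` compares sequence `i` with sequence `j`.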