# UME Multi-Modal Embeddings Tutorial

This notebook shows how to use the Universal Molecular Encoder (UME) to generate embeddings for different molecular modalities: amino acids, SMILES, and nucleotides. Stay tuned for 3D coordinate embeddings and more!

### Load from checkpoint

In [2]:
from lobster.model import UME

ume = UME.from_pretrained("ume-mini-base-12M") 

print(f"Supported modalities: {ume.modalities}")
print(f"Vocab size: {len(ume.get_vocab())}")
print(f"Embedding dimension: {ume.embedding_dim}")

Supported modalities: ['SMILES', 'amino_acid', 'nucleotide', '3d_coordinates']
Vocab size: 1280
Embedding dimension: 768


### 1. Protein sequences

Embed sample protein sequences to get either one embedding per sequence or per-residue (token-level) embeddings.

In [None]:
# Example protein sequences
protein_sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",  # Sample protein fragment
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"  # Hemoglobin beta chain
]

# Get embeddings for protein sequences
protein_embeddings = ume.embed_sequences(
    protein_sequences, 
    modality="amino_acid"
)

print(f"Protein embeddings shape: {protein_embeddings.shape}")


# Get token-level embeddings (without aggregation)
protein_residue_embeddings = ume.embed_sequences(
    protein_sequences, 
    modality="amino_acid", 
    aggregate=False
)

print(f"Protein token-level embeddings shape: {protein_residue_embeddings.shape}")
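A per-sequence embedding is commonly obtained by pooling the per-token embeddings. The sketch below illustrates masked mean pooling on small stand-in lists. It shows the general idea only and is not confirmed to be the aggregation UME applies; `masked_mean_pool`, `token_embeddings`, and `attention_mask` are hypothetical names introduced for this example.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("token_embeddings and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Stand-in data (not real UME output): 4 token positions, embedding dimension 3.
token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, excluded by the mask
]
attention_mask = [1, 1, 1, 0]

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # [2.0, 2.0, 2.0]
```

Excluding padded positions matters when sequences of different lengths share a batch; otherwise short sequences would be averaged together with padding vectors.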

### 2. SMILES
SMILES strings are a text-based representation of molecular structures. Here we embed common drug molecules.


In [None]:
# Example SMILES strings for common molecules
smiles_examples = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # Caffeine
    "C1=CC(=C(C=C1CCN)O)O",  # Dopamine
    "C1=CC=C(C(=C1)C(=O)O)O",  # Salicylic acid
    "CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@@H]([C@H]([C@@H]1O)O)N1C=NC2=C1N=CN=C2N)OC1=CC=CC=C1"  # Remdesivir-like phosphoramidate prodrug (not the exact remdesivir structure)
]

# Get embeddings for SMILES
smiles_embeddings = ume.embed_sequences(
    smiles_examples, 
    modality="SMILES"
)

print(f"SMILES embeddings shape: {smiles_embeddings.shape}")
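A common next step with per-molecule embeddings is to compare them. The sketch below computes cosine similarity between vectors using only the standard library. The vectors are stand-ins rather than real UME output, and `cosine_similarity` is a hypothetical helper written for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Stand-in embeddings (assumption: real ones would have `embedding_dim` entries).
emb_a = [3.0, 4.0]
emb_b = [6.0, 8.0]   # same direction as emb_a, different magnitude
emb_c = [-4.0, 3.0]  # orthogonal to emb_a

sim_same_direction = cosine_similarity(emb_a, emb_b)
sim_orthogonal = cosine_similarity(emb_a, emb_c)
print(sim_same_direction, sim_orthogonal)
```

Assuming `smiles_embeddings` is a 2-D tensor with one row per molecule, as the `.shape` printout suggests, a row could be passed in via `smiles_embeddings[0].tolist()`.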

### 3. Nucleotides

Embed example DNA/RNA sequences.

In [None]:
# Example DNA/RNA sequences
nucleotide_sequences = [
    "ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC",  
    "GATTACACAGTGCTTGACCCGATCGATCGATCGATCGATCGATCGATCGA",  
    "AUGCUAUGCUAGCUAGCUAGCUAGCUAUGCUAGCUAUGCUAGCUAUC"  # RNA sequence 
]

# Get embeddings for nucleotide sequences
nucleotide_embeddings = ume.embed_sequences(
    nucleotide_sequences, 
    modality="nucleotide"
)
print(f"Nucleotide embeddings shape: {nucleotide_embeddings.shape}")
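Because the list above mixes DNA (with `T`) and RNA (with `U`), it can help to check sequences before embedding them. The sketch below classifies a sequence by its alphabet. `classify_nucleotide_sequence` is a hypothetical helper, and whether UME's tokenizer treats `U` and `T` differently is not stated in this notebook, so this is only a sanity check, not a required preprocessing step.

```python
DNA_ALPHABET = set("ACGTN")  # N = any base (IUPAC ambiguity code)
RNA_ALPHABET = set("ACGUN")


def classify_nucleotide_sequence(seq: str) -> str:
    """Return 'DNA', 'RNA', or 'invalid' based on the characters in `seq`.

    Sequences containing neither T nor U are reported as 'DNA'.
    """
    symbols = set(seq.upper())
    if not symbols:
        return "invalid"
    if symbols <= DNA_ALPHABET:
        return "DNA"
    if symbols <= RNA_ALPHABET:
        return "RNA"
    return "invalid"  # mixes T and U, or contains other characters


examples = [
    "ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC",
    "GATTACACAGTGCTTGACCCGATCGATCGATCGATCGATCGATCGATCGA",
    "AUGCUAUGCUAGCUAGCUAGCUAGCUAUGCUAGCUAUGCUAGCUAUC",
]
kinds = [classify_nucleotide_sequence(s) for s in examples]
print(kinds)  # ['DNA', 'DNA', 'RNA']
```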

## Using Embeddings for Downstream Tasks
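Quick example of using molecular embeddings for a classification task. The dataset is a toy: with five molecules and `test_size=0.2`, the test split holds a single sample, so the reported metrics only demonstrate the workflow.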

In [None]:
# Dummy classification setup 
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# SMILES with some property labels
inputs = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "COC1=CC=C(CCN)C=C1",
    "C1=CC=C(C(=C1)C(=O)O)O",
    "CC12CCC(CC1)CC(C2)C(C)CN",
]
labels = [0, 1, 0, 1, 0]  # Binary classification example

# Get embeddings
X = ume.embed_sequences(inputs, modality="SMILES").cpu().numpy()
y = np.array(labels)

# Train a simple classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
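With very small datasets, a single train/test split leaves almost nothing to evaluate on. One alternative is leave-one-out evaluation, where each sample is held out once. The sketch below illustrates the protocol with a 1-nearest-neighbour rule in pure Python. Note that this swaps in a different classifier from the random forest above, uses made-up 2-D stand-in vectors instead of UME embeddings, and `leave_one_out_accuracy` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leave_one_out_accuracy(X: List[Sequence[float]], y: List[int]) -> float:
    """Hold out each sample once and predict it with the label of its
    nearest neighbour among the remaining samples."""
    if len(X) != len(y) or len(X) < 2:
        raise ValueError("need at least two samples and matching labels")
    correct = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(others, key=lambda j: euclidean(X[i], X[j]))
        correct += int(y[nearest] == y[i])
    return correct / len(X)


# Stand-in "embeddings": two well-separated clusters, one per class.
X_demo = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y_demo = [0, 0, 0, 1, 1, 1]

loo_accuracy = leave_one_out_accuracy(X_demo, y_demo)
print(loo_accuracy)  # 1.0
```

With scikit-learn, `sklearn.model_selection.LeaveOneOut` combined with `cross_val_score` applies the same protocol to other estimators, including the random forest used above.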

## Evaluate on Molecular Property Prediction Tasks

The linear-probe callbacks below evaluate UME on molecular property prediction benchmarks. Note that training and evaluating the probes on these tasks will take a few minutes.

In [None]:
import pandas as pd

from lobster.callbacks import CalmLinearProbeCallback, MoleculeACELinearProbeCallback

molecule_ace_probe = MoleculeACELinearProbeCallback(
    max_length=ume.embedding_dim
)
molecule_ace_scores = molecule_ace_probe.evaluate(ume)

In [None]:
pd.DataFrame(molecule_ace_scores).head()

In [None]:
calm_probe = CalmLinearProbeCallback(
    max_length=ume.embedding_dim
)
calm_scores = calm_probe.evaluate(ume)

In [None]:
pd.DataFrame(calm_scores).head()