## Notes
- This notebook assumes `foundry install rfd3 ligandmpnn rf3` has already been run to download the checkpoints into the directory stored in `default_ckpt`.
- The MPNN sequence constraints are implemented in `passes_constraints`.
- `refolded.cif` is the final output structure: the complex of binder and target segment, where chain A is the binder and chain B is the target segment.
- `final.fasta` is the final FASTA file for the binder.


## Changelog
14 Jan 2026
- Added `SKIP_RFD3`: when `True`, RFD3 generation is skipped and existing outputs are loaded from disk.
- Added AtomArrayPlus → AtomArray conversion before MPNN.
- Export best sequence to `final.fasta` and all passing sequences to `final_passed.fasta`.



In [102]:
import os
from pathlib import Path

# Optional local mirrors (leave blank unless configured)
os.environ.setdefault('CCD_MIRROR_PATH', '')
os.environ.setdefault('PDB_MIRROR_PATH', '')

# Ensure Foundry can find checkpoints
default_ckpt = Path('/mnt/e/Code/RFDiffusion3/checkpoints')
if not default_ckpt.exists():
    default_ckpt = Path(r'E:\Code\RFDiffusion3\checkpoints')

if default_ckpt.exists():
    os.environ.setdefault('FOUNDRY_CHECKPOINT_DIRS', str(default_ckpt))
    print(f'Using checkpoints at: {default_ckpt}')
else:
    print('Checkpoint directory not found. Set FOUNDRY_CHECKPOINT_DIRS and run `foundry install rfd3 ligandmpnn rf3`.')

# Output directory (per-target)
RESULTS_ROOT = Path('/mnt/e/Code/RFDiffusion3/Results')

# Set True to skip RFD3 generation and reuse existing outputs
SKIP_RFD3 = True


Using checkpoints at: /mnt/e/Code/RFDiffusion3/checkpoints


In [103]:
import subprocess
import torch

# Improve matmul performance on supported GPUs
torch.set_float32_matmul_precision('high')


# Basic GPU visibility check
try:
    subprocess.run(['nvidia-smi'], check=True)
except FileNotFoundError:
    print('nvidia-smi not found. Ensure NVIDIA drivers are installed and visible.')
except subprocess.CalledProcessError as exc:
    print('nvidia-smi failed with exit code', exc.returncode)

print('torch.cuda.is_available():', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))



Wed Jan 14 18:03:51 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.52.01              Driver Version: 591.74         CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   44C    P8             19W /  285W |   10103MiB /  12282MiB |     19%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

# Example: End-To-End Binder Design Pipeline

## Overview

This notebook demonstrates an end-to-end protein binder workflow using three models from the Institute for Protein Design:

| Step | Model | Purpose |
|------|-------|---------|
| 1. **Binder backbone generation** | RFD3 | Generate a binder backbone conditioned on a target structure |
| 2. **Sequence design** | MPNN | Design amino acid sequences for the generated backbone |
| 3. **Structure prediction** | RF3 | Re-fold the designed sequence to validate designability |

This version is configured to design a short peptide binder (7-20 residues) against the target structure `MIN_ANCHOR_AF-P51679-F1-model_v6_trimmed_bundle.pdb`.


In [104]:
import warnings
warnings.filterwarnings('ignore', module='atomworks')

# Shared utilities for visualization (from AtomWorks)
from atomworks.io.utils.visualize import view

## Section 1: Binder Backbone Generation with RFD3

RFdiffusion3 (RFD3) can design binders by conditioning on a target structure.

**Key settings:**
- `input`: target PDB
- `contig`: binder length range and target residue span
- `infer_ori_strategy`: `hotspots` when hotspots are provided, otherwise `com` (center of mass)
- `is_non_loopy`: encourages more structured binders
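
The `contig` setting above combines the binder length range with the target residue span. A minimal sketch of that composition follows, using the same string layout as the notebook's own cell. The helper name `build_contig` and its validation checks are hypothetical, and reading `/0` as a chain break between binder and target is an assumption carried over from earlier RFdiffusion contig syntax, not something this notebook confirms.

```python
def build_contig(binder_length_range: str, target_chain: str, target_range: str) -> str:
    """Compose a contig string: binder length range, separator, target span.

    Hypothetical helper. The layout mirrors the f-string used in this notebook;
    the "/0" token is assumed to separate the binder from the target chain.
    """
    # Validate the binder length range, e.g. "7-20"
    lo, hi = (int(x) for x in binder_length_range.split("-"))
    if lo < 1 or lo > hi:
        raise ValueError(f"Invalid binder length range: {binder_length_range!r}")

    # Validate the target residue span, e.g. "170-212"
    start, end = (int(x) for x in target_range.split("-"))
    if start > end:
        raise ValueError(f"Invalid target range: {target_range!r}")

    return f"{binder_length_range},/0,{target_chain}{target_range}"


print(build_contig("7-20", "A", "170-212"))  # 7-20,/0,A170-212
```

Validating the ranges before building the string catches reversed spans early, before any model weights are loaded.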


In [105]:
from lightning.fabric import seed_everything
from rfd3.engine import RFD3InferenceConfig, RFD3InferenceEngine

# Set seed for reproducibility
seed_everything(0)

# Target configuration
target_pdb = r"/mnt/e/Code/BindCraft/InputTargets/MIN_ANCHOR_AF-P51679-F1-model_v6_trimmed_bundle.pdb"
target_name = Path(target_pdb).stem
output_dir = RESULTS_ROOT / target_name
output_dir.mkdir(parents=True, exist_ok=True)
rfd3_out_dir = output_dir / 'rfd3'
rfd3_out_dir.mkdir(parents=True, exist_ok=True)
binder_length_range = "7-20"
target_chain = "A"
target_range = "170-212"  # contiguous region present in PDB
contig = f"{binder_length_range},/0,{target_chain}{target_range}"

# Optional: define hotspots to bias the interface (residue or atom-level selection)
# Example: {"A56": "CG,OH", "A115": "CG,SD"}
select_hotspots = {
    "A208": "ALL",
    "A209": "ALL",
    "A210": "ALL",
    "A211": "ALL",
    "A212": "ALL",
}

infer_ori_strategy = "hotspots" if select_hotspots else "com"

spec = {
    "dialect": 2,
    "input": target_pdb,
    "contig": contig,
    "infer_ori_strategy": infer_ori_strategy,
    "is_non_loopy": True,
}
if select_hotspots:
    spec["select_hotspots"] = select_hotspots

# Configure RFD3 inference
config = RFD3InferenceConfig(
    specification=spec,
    skip_existing=False,  # overwrite existing outputs in out_dir
    diffusion_batch_size=2, # number of designs to generate in parallel
)

if not SKIP_RFD3:
    # Initialize engine and run generation
    model = RFD3InferenceEngine(**config)
    outputs = model.run(
        inputs=None,      # Use specification above
        out_dir=str(rfd3_out_dir),  # write RFD3 outputs
        n_batches=32,      # Number of batches to generate
    )
else:
    print('Skipping RFD3 generation; reusing existing outputs.')
    outputs = None


Seed set to 0


Skipping RFD3 generation; reusing existing outputs.


In [106]:
from pathlib import Path
from atomworks.ml.utils.condition import load_atom_array_with_conditions_from_cif

# Inspect RFD3 outputs and extract the generated structures
if outputs:
    for idx, data in outputs.items():
        print(f"Batch {idx}: {len(data)} structure(s)")
        print(f"  Output type: {type(data[0]).__name__}")
        print(f"  AtomArray: {data[0].atom_array}")

    # Extract the first generated structure for downstream use
    first_key = next(iter(outputs.keys()))
    atom_array = outputs[first_key][0].atom_array
else:
    # Load the last CIF output on disk (sorted by filename, not by modification time)
    cif_files = sorted(Path(rfd3_out_dir).glob("*.cif.gz"))
    if not cif_files:
        raise RuntimeError(
            "No RFD3 CIFs found in "
            f"{rfd3_out_dir}. Run RFD3 once or set the path correctly."
        )
    latest = cif_files[-1]
    print(f"Loading CIF output from {latest}")
    atom_array = load_atom_array_with_conditions_from_cif(latest)

# Visualize the generated structure
# view(atom_array)




Loading CIF output from /mnt/e/Code/RFDiffusion3/Results/MIN_ANCHOR_AF-P51679-F1-model_v6_trimmed_bundle/rfd3/_9_model_1.cif.gz


In [107]:
# Convert AtomArrayPlus -> AtomArray for MPNN compatibility
if hasattr(atom_array, 'as_atom_array'):
    atom_array = atom_array.as_atom_array()



---

## Section 2: Sequence Design with MPNN

ProteinMPNN designs amino acid sequences intended to fold into the RFD3-generated binder backbone.

**Constraints applied after design:**
- No Arg/Lys/Pro at positions 1-2
- Even number of Cys (zero allowed)
- Optional: positions 1-2 must both be A, E, or D

**Key Parameters:**
- `batch_size`: Number of sequences to generate per structure
- `remove_waters`: Whether to exclude water molecules from context
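
When many sequences are rejected, it helps to know which rule rejected them. The sketch below restates the post-design constraints listed above as a function that returns the reasons for failure instead of a single boolean. The name `constraint_failures` is hypothetical and is not part of the notebook or of any MPNN API; the example sequences are toy inputs.

```python
def constraint_failures(seq: str, require_aed_prefix: bool = False) -> list[str]:
    """Return the list of violated constraints (an empty list means the sequence passes).

    Hypothetical diagnostic helper mirroring the rules described above.
    """
    seq = seq.upper()
    if not seq:
        return ["empty sequence"]

    reasons = []
    # No Arg/Lys/Pro at positions 1-2
    for pos, aa in enumerate(seq[:2], start=1):
        if aa in "RKP":
            reasons.append(f"{aa} at position {pos}")
    # The cysteine count must be even
    n_cys = seq.count("C")
    if n_cys % 2 == 1:
        reasons.append(f"odd cysteine count ({n_cys})")
    # Optional: positions 1-2 must both be A, E, or D
    if require_aed_prefix and (len(seq) < 2 or not set(seq[:2]) <= set("AED")):
        reasons.append("positions 1-2 are not both A/E/D")
    return reasons


print(constraint_failures("LPKG"))   # ['P at position 2']
print(constraint_failures("AECGC"))  # []
```

Printing the reasons per sequence makes it clear whether a single rule, such as a proline at position 2, accounts for most rejections.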


In [None]:
from mpnn.inference_engines.mpnn import MPNNInferenceEngine

# Configure MPNN inference engine
# See mpnn.utils.inference.MPNN_GLOBAL_INFERENCE_DEFAULTS for all options
engine_config = {
    "model_type": "protein_mpnn",  # protein-only binder design
    "is_legacy_weights": True,       # Required for now for protein_mpnn
    "out_directory": str(output_dir / 'mpnn'),  # write MPNN outputs
    "write_structures": True,
    "write_fasta": True,
}


# Configure per-input inference options
# See mpnn.utils.inference.MPNN_PER_INPUT_INFERENCE_DEFAULTS for all options
input_configs = [
    {
        "batch_size": 32,         # Number of sequences to design per structure
        "remove_waters": True,
    }
]

# Run sequence design on the RFD3-generated binder backbone
model = MPNNInferenceEngine(**engine_config)
mpnn_outputs = model.run(input_dicts=input_configs, atom_arrays=[atom_array])


In [None]:
from biotite.structure import get_residue_starts
from biotite.sequence import ProteinSequence

def atom_array_to_sequence(atom_array):
    res_starts = get_residue_starts(atom_array)
    return ''.join(
        ProteinSequence.convert_letter_3to1(res_name)
        for res_name in atom_array.res_name[res_starts]
    )


# Set of constraints to filter designed sequences
def passes_constraints(seq, require_aed_prefix=False):
    seq = seq.upper()
    if not seq:
        return False
    # No Arg/Lys/Pro at N-terminus (positions 1-2)
    if seq[0] in "RKP":
        return False
    if len(seq) >= 2 and seq[1] in "RKP":
        return False
    # No odd number of cysteines
    if seq.count('C') % 2 == 1:
        return False
    # Optional: enforce A/E/D at positions 1-2
    if require_aed_prefix and (len(seq) < 2 or not ({seq[0], seq[1]} <= {'A', 'E', 'D'})):
        return False
    return True

require_aed_prefix = False  # set True to require A/E/D at positions 1-2

sequences = []
valid_indices = []

for i, item in enumerate(mpnn_outputs):
    seq = atom_array_to_sequence(item.atom_array)
    sequences.append(seq)
    if passes_constraints(seq, require_aed_prefix=require_aed_prefix):
        valid_indices.append(i)

print(f"Generated {len(sequences)} designed sequences")
print(f"Valid sequences (constraints passed): {len(valid_indices)}")

for i, seq in enumerate(sequences):
    status = "OK" if i in valid_indices else "FAIL"
    print(f"Sequence {i+1} [{status}]: {seq}")

if not valid_indices:
    raise ValueError('No sequences passed constraints. Increase MPNN batch_size or relax constraints.')

selected_idx = valid_indices[0]
selected_sequence = sequences[selected_idx]
designed_atom_array = mpnn_outputs[selected_idx].atom_array
print(f"Selected sequence {selected_idx+1}: {selected_sequence}")



Generated 32 designed sequences
Valid sequences (constraints passed): 0
Sequence 1 [FAIL]: LPLGLLLLGLLALLLLLLSLLALKYLREYTENGVTHRELVFEENPLLQELLYLLLLLLLGVL
Sequence 2 [FAIL]: LPLGAAAAALAAAAAALAASLLERYLREYTVDGVTHRDIIFEEDALSKALAFALLTLLLGVL
Sequence 3 [FAIL]: KPKGKKKKKKKKEKLKKLEKMKKKYLRKYKVNGKTKEKIIFKKKKKKKKKKYKKYKKKKGKK
Sequence 4 [FAIL]: APLGAALAAALLAALAAAAALLARYLREHTVDGVTHRALVFTEDAAAAAAAYAAALLALGVA
Sequence 5 [FAIL]: GPEGERRRLEEERRRLLEEEMEERYLREYTENGVTHTDLIFLENPEEKEREYEERLRREGVE
Sequence 6 [FAIL]: LPAGARLAAEARERLERERDLEARYLREYTEDGVTHTALVFTEDAEAQKALYLAYRLAEGVL
Sequence 7 [FAIL]: KPKGEEKKKKEELEKKEKESMKKKYLREYTKDGVTKKKIIFKKDAKKKKKKYKKYKKKKGVK
Sequence 8 [FAIL]: KKKGKEKKEKLKELKEKKEKMKKKYKRKYTENGVTYTKLVFKKKAKKKKKLYKKKKKLKGVL
Sequence 9 [FAIL]: GPAGAAAAAAAAAAAAAAAALLATYTREYTENGVTHKEVVFLEDPEAQKAAYAALLAYLGVT
Sequence 10 [FAIL]: APLGGALLLALLAALLTAADLEKKYTRTYTENGVTHTDLIFTENAAEKEALYKALKLLLGVL
Sequence 11 [FAIL]: GPLGLLLLTLLLLTLLLLSELEKKYLREYTENGVTHRELIFTENPEEKKLAYELLKLFLGVI
Sequence 12 [FAIL]: LPKG

ValueError: No sequences passed constraints. Increase MPNN batch_size or relax constraints.

---

## Section 3: Structure Prediction with RF3

RF3 (RoseTTAFold 3) predicts protein structures from sequences. By re-folding the MPNN-designed sequence, we can validate whether the design is likely to adopt the intended backbone structure.

**Outputs:** `RF3Output` objects containing:
- `atom_array`: Predicted structure as Biotite AtomArray
- `summary_confidences`: Overall confidence metrics (pLDDT, PAE, pTM, etc.)
- `confidences`: Per-atom/residue confidence scores

**Confidence Metrics:**
| Metric | Description |
|--------|-------------|
| pLDDT | Per-residue confidence (0-1, higher is better) |
| PAE | Predicted Aligned Error (lower is better) |
| pTM | Predicted TM-score |
| ranking_score | Overall model quality score |
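
The metrics in the table can be combined into a simple acceptance gate. The sketch below works on plain dictionaries that use the same keys this notebook reads from `summary_confidences` (`overall_plddt`, `overall_pae`, `ptm`, `ranking_score`, `has_clash`). The function names and all threshold values are illustrative assumptions, not recommendations from RF3 or its authors.

```python
def passes_confidence(summary: dict, min_plddt: float = 0.80,
                      max_pae: float = 5.0, min_ptm: float = 0.50) -> bool:
    """Return True if a summary-confidence dict clears all (hypothetical) thresholds."""
    if summary.get("has_clash", False):
        return False
    return (
        summary["overall_plddt"] >= min_plddt   # higher is better
        and summary["overall_pae"] <= max_pae   # lower is better
        and summary["ptm"] >= min_ptm           # higher is better
    )


def best_by_ranking(summaries: list[dict]):
    """Return the passing summary with the highest ranking_score, or None if none pass."""
    passing = [s for s in summaries if passes_confidence(s)]
    return max(passing, key=lambda s: s["ranking_score"]) if passing else None


# Toy values for illustration only; these are not real RF3 results
example = [
    {"overall_plddt": 0.90, "overall_pae": 3.0, "ptm": 0.70, "ranking_score": 0.80, "has_clash": False},
    {"overall_plddt": 0.95, "overall_pae": 2.0, "ptm": 0.80, "ranking_score": 0.90, "has_clash": True},
    {"overall_plddt": 0.60, "overall_pae": 9.0, "ptm": 0.40, "ranking_score": 0.50, "has_clash": False},
]
print(best_by_ranking(example)["ranking_score"])  # 0.8
```

Rejecting clashing models before ranking keeps a high `ranking_score` from masking a physically implausible prediction.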

In [None]:
from rf3.inference_engines.rf3 import RF3InferenceEngine
from rf3.utils.inference import InferenceInput

# Initialize RF3 inference engine
inference_engine = RF3InferenceEngine(ckpt_path='rf3', verbose=False)

# Re-fold the MPNN-designed sequence to validate it adopts the intended structure
input_structure = InferenceInput.from_atom_array(designed_atom_array, example_id='example_protein')
rf3_outputs = inference_engine.run(inputs=input_structure)

# Outputs: dict mapping example_id -> list[RF3Output] (multiple models per input)
print(f"Output keys: {rf3_outputs.keys()}")
print(f"Number of models for 'example_protein': {len(rf3_outputs['example_protein'])}")


In [None]:
# Extract the top-ranked prediction
rf3_output = rf3_outputs["example_protein"][0]

# Inspect RF3Output structure
print("RF3Output contains:")
print(f"  - atom_array: {len(rf3_output.atom_array)} atoms")
print(f"  - summary_confidences: {list(rf3_output.summary_confidences.keys())}")
print(f"  - confidences: {list(rf3_output.confidences.keys()) if rf3_output.confidences else None}")

# Visualize the predicted structure
view(rf3_output.atom_array)

In [None]:
# Summary confidences: overall model quality metrics
summary = rf3_output.summary_confidences

print("=== Summary Confidences ===")
print(f"  Overall pLDDT:    {summary['overall_plddt']:.3f}")
print(f"  Overall PAE:      {summary['overall_pae']:.2f} A")
print(f"  Overall PDE:      {summary['overall_pde']:.3f}")
print(f"  pTM:              {summary['ptm']:.3f}")
print(f"  ipTM:             {summary.get('iptm', 'N/A (single chain)')}")
print(f"  Ranking score:    {summary['ranking_score']:.3f}")
print(f"  Has clash:        {summary['has_clash']}")

In [None]:
# Detailed per-atom/residue confidences
conf = rf3_output.confidences

print("=== Per-Atom/Residue Confidences ===")
print(f"  atom_plddts:      {len(conf['atom_plddts'])} values (one per atom)")
print(f"  atom_chain_ids:   {len(conf['atom_chain_ids'])} values")
print(f"  token_chain_ids:  {len(conf['token_chain_ids'])} values (one per residue)")
print(f"  token_res_ids:    {len(conf['token_res_ids'])} values")
print(f"  PAE matrix:       {len(conf['pae'])}x{len(conf['pae'][0])}")

# Preview first 10 atom pLDDT scores
import numpy as np
print(f"\nFirst 10 atom pLDDTs: {np.round(conf['atom_plddts'][:10], 2).tolist()}")

---

## Section 4: Validation and Export

The final step compares the RF3-predicted structure against the original RFD3-generated backbone. A low backbone RMSD indicates the designed sequence is likely to fold into the intended structure (high designability).
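
For reference, the backbone RMSD after optimal superposition can be written as follows, assuming both structures contain the same $N$ backbone atoms in the same order. The symbols $x_i$, $y_i$, $R$, and $t$ are notation introduced here, not names from the notebook.

$$
\mathrm{RMSD} = \min_{R,\,t} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert x_i - (R\,y_i + t) \right\rVert^2}
$$

Here $x_i$ are the backbone atom coordinates of the RFD3-generated structure, $y_i$ are the corresponding coordinates of the RF3 prediction, and $R$ and $t$ are the rotation and translation found by the superposition step.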

In [None]:
from biotite.structure import rmsd, superimpose
from atomworks.constants import PROTEIN_BACKBONE_ATOM_NAMES
import numpy as np

# Get structures for comparison
aa_generated = atom_array              # Original RFD3 backbone (from Section 1)
aa_refolded = rf3_output.atom_array    # RF3-predicted structure

# Filter to backbone atoms (N, CA, C, O)
bb_generated = aa_generated[np.isin(aa_generated.atom_name, PROTEIN_BACKBONE_ATOM_NAMES)]
bb_refolded = aa_refolded[np.isin(aa_refolded.atom_name, PROTEIN_BACKBONE_ATOM_NAMES)]

# Superimpose structures and calculate RMSD
bb_refolded_fitted, _ = superimpose(bb_generated, bb_refolded)
rmsd_value = rmsd(bb_generated, bb_refolded_fitted)

print(f"Backbone RMSD: {rmsd_value:.2f} A")
print(f"\nInterpretation: {'Excellent' if rmsd_value < 1.0 else 'Good' if rmsd_value < 2.0 else 'Moderate'} designability")

In [None]:
from atomworks.io.utils.io_utils import to_cif_file

# Export structures to CIF format for visualization in PyMOL/ChimeraX
generated_path = output_dir / 'generated.cif'
refolded_path = output_dir / 'refolded.cif'
to_cif_file(aa_generated, str(generated_path))
to_cif_file(aa_refolded, str(refolded_path))

print('Exported structures:')
print(f'  - {generated_path}: Original RFD3 backbone')
print(f'  - {refolded_path}:  RF3-predicted structure')

# Export all passing sequences to FASTA (after RF3)
final_passed = output_dir / "final_passed.fasta"
with open(final_passed, "w") as f:
    for i in valid_indices:
        f.write(f">{target_name}_pass_{i+1}\n{sequences[i]}\n")
print(f"Wrote {len(valid_indices)} passing sequences to {final_passed}")

# Export the selected sequence (the one refolded by RF3) to final.fasta
final_fasta = output_dir / "final.fasta"
with open(final_fasta, "w") as f:
    f.write(f">{target_name}_final_{selected_idx+1}\n{selected_sequence}\n")
print(f"Wrote selected sequence to {final_fasta}")
