GitHub - iamhamzaabdullah/BindFM: The first open foundation model for binding prediction across all five molecular modalities — protein, RNA, DNA, small molecule. Built from scratch. No pretrained encoders.

██████╗ ██╗███╗   ██╗██████╗ ███████╗███╗   ███╗
██╔══██╗██║████╗  ██║██╔══██╗██╔════╝████╗ ████║
██████╔╝██║██╔██╗ ██║██║  ██║█████╗  ██╔████╔██║
██╔══██╗██║██║╚██╗██║██║  ██║██╔══╝  ██║╚██╔╝██║
██████╔╝██║██║ ╚████║██████╔╝██║     ██║ ╚═╝ ██║
╚═════╝ ╚═╝╚═╝  ╚═══╝╚═════╝ ╚═╝     ╚═╝     ╚═╝

Universal Biomolecular Binding Foundation Model

Built atom-by-atom. From scratch. Five modalities. One representation.

Terminal Bio · Lead Researcher: Hamza Abdullah

The Problem with Every Other Approach

Every serious attempt at universal binding prediction hits the same wall: incompatible representation spaces.

ESM-C lives in protein sequence space. Small molecule encoders live in SMILES graph space. RNA models live in covariance structure space. When you stack these together and call it "universal," you're not unifying anything — you're building a routing table.

BindFM takes the only principled approach: represent every molecular entity — protein, RNA, DNA, small molecule — as the same thing it actually is: a graph of atoms in 3D space. One encoder. Shared weights. Universal representation.

A protein carbon and an RNA carbon pass through identical learned transformations. This is not a design constraint. This is the entire point.

What BindFM Does

Input Output

Any two molecules in any format:

Protein sequence (MKTLL...)
RNA/DNA sequence (GGTTGG...)
SMILES string (CC(=O)Oc1...)
PDB file path (complex.pdb)
Modified aptamer ([LNA-A][2F-C]...)

Three co-trained predictions:

Binding affinity — log Kd, P(bind), kon/koff, t½, uncertainty
3D complex structure — atom coordinates via SE(3)-equivariant flow matching
De novo binder design — generated candidates conditioned on target Kd

Binding Modalities Covered

#	Modality	Examples
1	Protein ↔ Small Molecule	Drug-target binding, covalent inhibitors, fragments
2	Protein ↔ Protein	PPI inhibitors, antibody-antigen, SKEMPI2 ΔΔG mutations
3	Protein ↔ Nucleic Acid	RNA-binding proteins, CRISPR-Cas9, splicing factors
4	Nucleic ↔ Small Molecule	Aptamer-drug binding, riboswitches, G-quadruplex ligands
5	Nucleic ↔ Nucleic	siRNA-mRNA, aptamer secondary structure, miRNA targeting

Special depth on therapeutic aptamers. Full support for LNA, 2'F, 2'OMe, phosphorothioate, FANA, and morpholino modifications — not as special cases, but through explicit modification tokens in the universal atom representation.

Architecture

                          ╔══════════════════════════════════════╗
                          ║        197-dim Atom Tokenizer        ║
                          ║  element · hybridization · chirality ║
                          ║  ring · H-bond · charge · mod-type   ║
                          ╚═══════════════╦══════════════════════╝
                                          │
              ┌───────────────────────────┼───────────────────────────┐
              │                           │                           │
        Entity A ──────────────►  ╔═══════════════╗  ◄────────────── Entity B
   (protein/RNA/DNA/              ║ Shared EGNN   ║               (any modality)
    small molecule)               ║   Encoder     ║
                                  ║               ║
                                  ║  8 equivariant║
                                  ║  EGNN layers  ║
                                  ║               ║
                                  ║  Same weights ║
                                  ║  both sides   ║
                                  ╚═══════════════╝
                                  h_a [N_a, 512]      h_b [N_b, 512]
                                  x_a [N_a, 3]        x_b [N_b, 3]
                                          │
                          ╔═══════════════╩═════════════════╗
                          ║         PairFormer Trunk         ║
                          ║                                  ║
                          ║  32 layers of:                   ║
                          ║  · outer product mean            ║
                          ║  · triangle multiplicative update ║
                          ║  · triangle self-attention       ║
                          ║  · single representation FFN     ║
                          ╚══════════╦══════════════════════╝
                                     │
         ╔═══════════════════════════╬══════════════════════════════╗
         ▼                           ▼                              ▼
  ┌─────────────┐           ┌─────────────────┐           ┌─────────────────┐
  │  Affinity   │           │    Structure    │           │   Generative   │
  │    Head     │           │      Head       │           │     Head        │
  │─────────────│           │─────────────────│           │─────────────────│
  │ log_kd [nM] │           │ Euler ODE       │           │ Flow matching   │
  │ P(bind)     │           │ flow matching   │           │ in (feat,coord) │
  │ kon / koff  │           │ 3D complex      │           │ space           │
  │ half-life   │           │ coordinates     │           │                 │
  │ uncertainty │           │ (Å)             │           │ conditioned on  │
  │ (Kendall)   │           │                 │           │ target Kd       │
  └─────────────┘           └─────────────────┘           └─────────────────┘

Core Design Decisions

Why EGNN (not SE(3)-Transformer or NequIP)? Satorras et al. 2021's EGNN achieves the same E(n)-equivariance as more complex architectures through a beautiful simplification: update coordinates as a weighted sum of displacement vectors. No spherical harmonics. No Clebsch-Gordan coefficients. Same theoretical guarantees, dramatically simpler implementation, proven on molecular binding tasks.

Why shared encoder weights? Because that's the only way two molecules end up in the same representation space. When Entity A and Entity B share an encoder, their atom embeddings are directly comparable. The trunk can compute meaningful pairwise interactions between a protein residue's nitrogen and an aptamer's phosphate backbone — not just their respective pre-computed embeddings.

Why flow matching (not diffusion)? OT-CFM (Lipman et al. 2022) uses straight interpolation paths between noise and data. This means fewer NFE at inference, better mode coverage, and a cleaner loss landscape for multi-task co-training.

Why train from scratch? Because there's no pretrained model that can be fine-tuned to this. ESM-C, Evo2, RNA-FM — all of these learned representations shaped by evolutionary signals, not binding physics. Borrowing them would mean inheriting the wrong inductive biases for every modality boundary crossing.

Installation

git clone https://github.com/iamhamzaabdullah/BindFM.git
cd BindFM

Minimal (CPU / testing):

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install numpy scipy biopython

Full (GPU / training):

pip install -r requirements.txt

Verify:

python3 quickstart.py --device cpu --size small --skip-training
# ✓ Tokenizer        ATOM_FEAT_DIM=197
# ✓ Encoder          DualEntityEncoder — 2 entities, shared weights
# ✓ Trunk            PairFormerTrunk
# ✓ Heads            Affinity + Structure + Generative
# ✓ Full forward pass (aspirin ↔ peptide)
# ✓ Thrombin aptamer ↔ thrombin protein
# ✓ HIV TAR RNA ↔ argininamide
# ✓ ALL TESTS PASSED

Usage

Predict Binding Affinity

from inference.api import BindFMPredictor
from model.bindfm import BindFMConfig, BindFM

# Load trained checkpoint
predictor = BindFMPredictor.from_checkpoint("checkpoints/bindfm_stage3_best.pt")

# --- or initialize with random weights for testing ---
predictor = BindFMPredictor.from_config(BindFMConfig.small())

# Auto-detects input format (SMILES / sequence / PDB path)
result = predictor.predict_affinity(
    binder = "GGTTGGTGTGGTTGG",        # thrombin DNA aptamer
    target = "MKTLLLTLVVVTIVCLDLGYT",  # thrombin (fragment)
)
print(result)

BindFM Affinity Prediction
  Binding probability:  0.912
  Kd:                   8.4 nM
  kon:                  10^6.3 M⁻¹s⁻¹
  koff:                 10^-1.6 s⁻¹
  Residence time (t½):  18.7 min
  Uncertainty:          ±0.38 log units

Predict 3D Complex Structure

struct = predictor.predict_structure(
    binder      = "CC(=O)Oc1ccccc1C(=O)O",   # aspirin SMILES
    target      = "MKTLLLTLVVVTIVCLDL",
    n_steps     = 100,                         # Euler ODE steps
    output_pdb  = "aspirin_complex.pdb",
)
# StructureResult: 21 binder atoms, 162 target atoms

Generate De Novo Aptamers

candidates = predictor.generate_binders(
    target       = "MKTLLLTLVVVTIVCLDLGYT",
    modality     = "aptamer",        # or: "dna_aptamer", "protein", "small_mol"
    n_candidates = 100,
    target_kd_nM = 10.0,             # condition generation on desired affinity
    n_steps      = 200,
)

for c in candidates[:5]:
    print(c)
# Candidate #1: GCGGTTGGTGTGGTTGGCG...
#   Pred. Kd:  6.2 nM
#   P(bind):   0.953

Screen a Compound Library

library = [
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
    # ... thousands of compounds
]

hits = predictor.screen_library(
    library    = library,
    target     = "MKTLLLTLVVVTIVCLDLGYT",
    top_k      = 10,
    assay_type = "Kd",
    verbose    = True,
)
# Screened 1000/1000...
# Rank 1  CC(=O)Oc1ccccc1...  Kd=4.1 nM  P(bind)=0.97

Explicit Modality Hints

# For ambiguous sequences or modified aptamers:
result = predictor.predict_affinity(
    binder      = "[LNA-G][LNA-G]TTGGTGTGGTTGG",
    target      = "MKTLLLTLVVVTIVCLDLGYT",
    binder_hint = "aptamer",    # force RNA aptamer interpretation
    target_hint = "protein",
    assay_type  = "SPR_Kd",     # Kd | Ki | IC50 | EC50 | SPR_Kd | ITC | EMSA
)

Training

1. Data Setup

# Download all training sources (~500 GB)
bash scripts/download_data.sh --data-dir ./data

# Build PDB structure index (100k structures)
python3 scripts/download_pdb_subset.py \
    --output-dir ./data/pdb \
    --max-structures 100000

# Merge all affinity sources into unified CSV
python3 scripts/build_affinity_index.py --data-dir ./data

Training data sources

Source	Modality	Scale
PDBbind 2020	Protein–ligand	~19k complexes with Kd
BindingDB	Protein–SM, Protein–Protein	~2.8M measurements
ChEMBL 33	Protein–SM	~20M bioactivity points
AptaBase	Aptamer–Protein	SELEX-derived binding data
SKEMPI2	Protein–Protein	7,085 ΔΔG on mutation
RNAcompete	RNA–Protein	244 RBP binding profiles
CovalentDB	Covalent inhibitors	Warhead + target pairs
PDB co-crystals	All modalities	~150k complex structures

2. Curriculum Training (4 Stages)

# Stage 0 — Geometry pretraining: learn what molecules look like in 3D
# Denoising objective on single-molecule coordinates
# Hardware: 4×A100, ~2 weeks
python3 training/train.py \
    --config configs/training_configs.yaml \
    --key full_stage0 \
    --data-dir ./data \
    --checkpoint-dir ./checkpoints

# Stage 1 — Complex structural pretraining: learn binding poses
# Flow matching on PDB co-crystal complexes, all modalities
# Hardware: 8×A100, ~3 weeks
python3 training/train.py --key full_stage1 \
    --prev-checkpoint ./checkpoints/bindfm_stage0_final.pt [...]

# Stage 2 — Affinity regression: introduce Kd/Ki/IC50 signal
# Multi-task: regression + binary classification + uncertainty
# Hardware: 4×A100, ~1 week
python3 training/train.py --key full_stage2 \
    --prev-checkpoint ./checkpoints/bindfm_stage1_final.pt [...]

# Stage 3 — Joint fine-tuning + generation
# All heads active. Generative flow matching co-trained end-to-end.
# Hardware: 8×A100, ~2 weeks
python3 training/train.py --key full_stage3 \
    --prev-checkpoint ./checkpoints/bindfm_stage2_final.pt [...]

Using make:

make train-small    # Small model, all 4 stages
make train-medium   # Medium model
make train-full     # Full model (A100 cluster required)

Curriculum Rationale

The 4-stage curriculum mirrors how a physicist would think about this problem:

Stage 0 — Learn molecular geometry without any binding signal. No label noise, no task complexity. Just: what do atoms look like in 3D space?
Stage 1 — Learn what bound complexes look like structurally. The model learns interface geometry before it sees any affinity numbers.
Stage 2 — Introduce quantitative affinity. The encoder and trunk are now primed with structural intuition, so affinity signals propagate meaningfully.
Stage 3 — Unlock everything simultaneously. Generation is co-trained with affinity and structure, ensuring generated binders are both chemically valid and predicted to bind.

Benchmarks

python3 -c "
from benchmarks.evaluate import BindFMEvaluator
from model.bindfm import BindFM

model     = BindFM.load('checkpoints/bindfm_stage3_best.pt')
evaluator = BindFMEvaluator(model, data_dir='./data')
results   = evaluator.run_all()
evaluator.save_results(results, 'benchmark_results.json')
"

Benchmark	Task	Primary Metric	Target
PDBbind Core Set	Protein–ligand Kd	Pearson R	> 0.80
CASF-2016	Scoring / ranking / docking	Scoring R	> 0.78
BindingDB held-out	Generalization to new targets	RMSE (log Kd)	< 0.80
SKEMPI2 ΔΔG	Mutation effect on PPI	Spearman R	> 0.65
AptaBase	Aptamer–protein binding	AUROC	> 0.85
Virtual Screening	Enrichment in compound library	EF @ 1%	> 20×
Novel Scaffold	Chemotype generalization	Pearson R	> 0.65
Cross-Modality	Train SM → test aptamer	Pearson R	> 0.60
Allosteric Sites	Non-orthosteric binding	AUROC	> 0.75

Model Sizes

Variant	Encoder	Trunk	Parameters	Hardware	Stage 0–3 Time
Small	3L, d=64	3L, d_s=128	~2.1M	Free T4 (Colab)	40 min
Medium	6L, d=128	12L, d_s=256	~45M	4×A100	1 week
Full	8L, d=256	32L, d_s=512	~380M	8×A100	8 weeks

Micro-Run on Free GPU

Run the small model end-to-end on Google Colab or Kaggle in ~40 minutes:

The notebook trains the nano model (~2.1M params) on synthetic data and runs all three heads. It's designed to be a legible, self-contained demonstration of every architectural component.

Repository Structure

BindFM/
│
├── model/                          # Core neural network
│   ├── tokenizer.py                # 197-dim universal atom featurizer
│   ├── encoder.py                  # SE(3)-equivariant EGNN encoder (shared weights)
│   ├── trunk.py                    # PairFormer binding trunk (triangle updates)
│   ├── heads.py                    # Affinity + Structure + Generative heads
│   └── bindfm.py                   # Complete BindFM model + BindFMConfig
│
├── data/
│   ├── parsers.py                  # SMILES (RDKit), PDB (biopython), SELEX, SDF
│   └── dataset.py                  # Stage 0–3 curriculum Dataset classes
│
├── training/
│   └── train.py                    # BindFMTrainer: AMP, grad accum, curriculum
│
├── inference/
│   └── api.py                      # BindFMPredictor: clean public API
│
├── benchmarks/
│   └── evaluate.py                 # 9 standardized benchmark evaluations
│
├── configs/
│   ├── training_configs.yaml       # All hyperparameters (3 sizes × 4 stages)
│   └── config_loader.py
│
├── scripts/
│   ├── download_data.sh            # Master data download
│   ├── download_pdb_subset.py      # RCSB PDB API downloader
│   ├── build_affinity_index.py     # Merge all affinity sources → unified CSV
│   └── preprocessing_utils.py     # 9 preprocessing routines
│
├── notebooks/
│   └── BindFM_MicroRun.ipynb       # Colab/Kaggle micro-run (T4, ~40 min)
│
├── quickstart.py                   # End-to-end smoke test (real molecules)
├── Makefile                        # All operations as one-liners
├── requirements.txt
├── CITATION.cff
└── CONTRIBUTING.md

Comparison to Related Work

System	Modalities	Affinity	Structure	Generation	Aptamers	Open
BindFM	All 5	✓	✓	✓	✓ (modified)	✓
AlphaFold3	Protein+SM+NA	✗	✓	✗	partial	✗
Boltz-2	Protein+SM	✓	✓	✗	✗	✓
DiffDock	Protein+SM	✗	✓	✗	✗	✓
AptaBLE	Aptamer+Protein	partial	✗	✗	✓	✓
RoseTTAFold-All-Atom	Protein+SM+NA	✗	✓	✗	partial	✓

The closest work is EPT (Nature Comms 2025) — an all-atom equivariant pretrained transformer. BindFM differs in: (1) explicit aptamer modification tokens, (2) co-trained generative head, (3) kinetic outputs (kon/koff), and (4) it's genuinely universal rather than protein-centric with nucleic acid support bolted on.

Contributing

See CONTRIBUTING.md. Issues and PRs welcome.

Key areas where contributions are most valuable:

Modified nucleotide chemistry: extending the modification token vocabulary
SELEX data processing: parsing diverse raw SELEX experimental formats
Benchmark additions: allosteric site annotations, aptamer SPR validation sets
Inference optimization: batching strategies, ONNX export

Citation

@software{bindfm2025,
  author       = {Abdullah, Hamza},
  title        = {{BindFM}: Universal Biomolecular Binding Foundation Model},
  year         = {2025},
  institution  = {Terminal Bio},
  url          = {https://github.com/iamhamzaabdullah/BindFM},
  note         = {SE(3)-equivariant · atom-level · five binding modalities · from scratch}
}

If you use BindFM in academic work, please also cite the core architectural papers:

Core citations (EGNN, AlphaFold2 PairFormer, OT-CFM, Kendall uncertainty)

@inproceedings{satorras2021egnn,
  title     = {{E(n)} Equivariant Graph Neural Networks},
  author    = {Satorras, Victor Garcia and Hoogeboom, Emiel and Welling, Max},
  booktitle = {ICML},
  year      = {2021}
}

@article{jumper2021alphafold,
  title   = {Highly accurate protein structure prediction with {AlphaFold}},
  author  = {Jumper, John and others},
  journal = {Nature},
  volume  = {596},
  pages   = {583--589},
  year    = {2021}
}

@article{lipman2022flow,
  title   = {Flow Matching for Generative Modeling},
  author  = {Lipman, Yaron and Chen, Ricky T. Q. and others},
  year    = {2022},
  journal = {arXiv:2210.02747}
}

@inproceedings{kendall2018uncertainty,
  title     = {Multi-Task Learning Using Uncertainty to Weigh Losses},
  author    = {Kendall, Alex and Gal, Yarin},
  booktitle = {CVPR},
  year      = {2018}
}

Terminal Bio · MIT License · 2025

The goal is a single model that knows what binds to what. Every molecule. Every modality. One representation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Problem with Every Other Approach

What BindFM Does

Binding Modalities Covered

Architecture

Core Design Decisions

Installation

Usage

Predict Binding Affinity

Predict 3D Complex Structure

Generate De Novo Aptamers

Screen a Compound Library

Explicit Modality Hints

Training

1. Data Setup

2. Curriculum Training (4 Stages)

Curriculum Rationale

Benchmarks

Model Sizes

Micro-Run on Free GPU

Repository Structure

Comparison to Related Work

Contributing

Citation

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
configs		configs
inference		inference
model		model
notebooks		notebooks
scripts		scripts
training		training
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
__init__.py		__init__.py
quickstart.py		quickstart.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

The Problem with Every Other Approach

What BindFM Does

Binding Modalities Covered

Architecture

Core Design Decisions

Installation

Usage

Predict Binding Affinity

Predict 3D Complex Structure

Generate De Novo Aptamers

Screen a Compound Library

Explicit Modality Hints

Training

1. Data Setup

2. Curriculum Training (4 Stages)

Curriculum Rationale

Benchmarks

Model Sizes

Micro-Run on Free GPU

Repository Structure

Comparison to Related Work

Contributing

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages