DemoDiff is a diffusion-based molecular foundation model for in-context inverse molecular design. It leverages graph diffusion transformers to generate molecules based on contextual examples, enabling few-shot molecular design across diverse chemical tasks without task-specific fine-tuning.
- In-Context Learning: Generate molecules using only contextual examples, with no fine-tuning required (see the sketch after this list)
- Graph-Based Tokenization: Novel molecular graph tokenization with BPE-style vocabulary
- Comprehensive Benchmarks: 30+ downstream tasks covering drug discovery, docking, and polymer design
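To make the in-context workflow concrete, the sketch below packs (molecule, score) demonstration pairs into a fixed token budget before conditioning the model, mirroring the per-molecule limit listed in the model specs. The `pack_context` helper and the toy token IDs are illustrative only; the actual context handling lives in `inference.py` and `utils/`.

```python
# Schematic of few-shot context packing for in-context design. The helper
# and the toy token IDs are illustrative, not DemoDiff's actual API.
from typing import List, Tuple

MAX_TOKENS_PER_MOL = 150  # per-molecule limit from the model specs

def pack_context(demos: List[Tuple[List[int], float]],
                 budget: int = 1024) -> List[Tuple[List[int], float]]:
    """Keep as many (token_ids, score) demonstrations as fit the budget."""
    packed, used = [], 0
    for tokens, score in demos:
        if len(tokens) > MAX_TOKENS_PER_MOL:
            continue  # molecule exceeds the per-molecule token limit
        if used + len(tokens) > budget:
            break  # context budget exhausted
        packed.append((tokens, score))
        used += len(tokens)
    return packed

# Toy pre-tokenized molecules paired with property scores.
demos = [([1, 7, 7, 8], 0.12), ([1, 5, 5, 5, 5, 5, 5, 8], 0.87)]
print(pack_context(demos, budget=16))
```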
The main interface for using DemoDiff on molecular design tasks.
Core Scripts:
- `inference.py`: Main inference script for molecular generation and evaluation
- `prepare_model.py`: Downloads the pretrained DemoDiff-0.7B model from HuggingFace (see the download sketch after this list)
- `prepare_data_and_oracle.py`: Downloads benchmark datasets and oracles
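For reference, the download step that `prepare_model.py` automates can be approximated with `huggingface_hub` directly; the repo ID below is a placeholder, since the real one is set inside the script.

```python
# Rough equivalent of the prepare_model.py download step. The repo ID is
# a placeholder -- prepare_model.py sets the actual HuggingFace repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/DemoDiff-0.7B",  # hypothetical repo ID
    local_dir="pretrained/",           # where checkpoints are expected
)
print(f"Checkpoint files in {local_dir}")
```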
Key Components:
- `context_data/`: 30+ benchmark tasks organized by category (downloaded by `prepare_data_and_oracle.py`)
- `docking/`: Molecular docking infrastructure built on AutoDock Vina (see the docking sketch after this list)
- `oracle/`: Property prediction oracles for specialized tasks (downloaded by `prepare_data_and_oracle.py`)
- `pretrained/`: Model checkpoints and configuration files
- `models/`: Neural network architectures
- `utils/`: Molecular processing and tokenization utilities
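As a standalone illustration of the docking infrastructure, a single scoring call with the official `vina` Python bindings (AutoDock Vina ≥ 1.2) looks roughly like the sketch below. The receptor/ligand files and box parameters are hypothetical; the repo's `docking/` module manages these per target.

```python
# Minimal AutoDock Vina scoring sketch using the official `vina` Python
# bindings (AutoDock Vina >= 1.2). File names and box parameters are
# hypothetical; DemoDiff's docking/ module handles them per target.
from vina import Vina

v = Vina(sf_name="vina")                # standard Vina scoring function
v.set_receptor("receptor.pdbqt")        # prepared receptor (hypothetical)
v.set_ligand_from_file("ligand.pdbqt")  # prepared ligand (hypothetical)
v.compute_vina_maps(center=[15.0, 12.0, 9.0], box_size=[20, 20, 20])
v.dock(exhaustiveness=8, n_poses=5)     # run the docking search
print(v.energies(n_poses=1))            # best-pose affinity (kcal/mol)
```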
Coming soon
- `processed/vocab3000ring300/`: Cached metadata
- `tokenizer/vocab3000ring300/`: Molecular graph tokenization files (see the sketch after this list)
  - `pretrain-token.node`: Node vocabulary
  - `pretrain-token.edge`: Edge vocabulary
  - `pretrain-token.ring`: Ring structure vocabulary
- `config.yaml`: Hyperparameters, training settings, and tokenization config
- Model Type: Discrete diffusion transformer with marginal transitions
- Architecture: 24-layer transformer with 16 attention heads and a hidden dimension of 1280
- Tokenization: Graph BPE with a 3000-token vocabulary plus 300 ring tokens
- Training: 500 diffusion steps with a cosine noise schedule (see the schedule sketch after this list)
- Context Length: Up to 150 tokens per molecule
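For reference, a cosine noise schedule over 500 steps in the style of Nichol & Dhariwal (2021) can be computed as below; whether DemoDiff uses exactly this parameterization is set in `config.yaml`.

```python
# Cosine noise schedule in the style of Nichol & Dhariwal (2021), shown
# for 500 steps. DemoDiff's exact parameterization is set in config.yaml.
import math

def cosine_alpha_bar(t: int, T: int = 500, s: float = 0.008) -> float:
    """Cumulative signal level alpha_bar(t) under the cosine schedule."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

for t in (0, 125, 250, 375, 500):
    print(t, round(cosine_alpha_bar(t), 4))  # decays from 1.0 toward ~0.0
```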