DemoDiff: Graph Diffusion Transformers are In-Context Molecular Designers

DemoDiff is a diffusion-based molecular foundation model for in-context inverse molecular design. It leverages graph diffusion transformers to generate molecules based on contextual examples, enabling few-shot molecular design across diverse chemical tasks without task-specific fine-tuning.

Links: arXiv · HuggingFace Model · HuggingFace Data (Downstream)

🌟 Key Features

  • In-Context Learning: Generate molecules using only contextual examples (no fine-tuning required)
  • Graph-Based Tokenization: Novel molecular graph tokenization with BPE-style vocabulary
  • Comprehensive Benchmarks: 30+ downstream tasks covering drug discovery, docking, and polymer design
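The BPE-style vocabulary idea can be illustrated with plain string BPE. This is a simplified sketch on character sequences; DemoDiff applies the analogous merge procedure to molecular graph substructures, not strings:

```python
from collections import Counter

def bpe_merges(sequences, num_merges):
    """Learn BPE-style merges: repeatedly fuse the most frequent adjacent
    token pair into a single new token (string version, for illustration)."""
    seqs = [list(s) for s in sequences]
    vocab = {tok for s in seqs for tok in s}
    merges = []
    for _ in range(num_merges):
        # count all adjacent token pairs across the corpus
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # greedily replace every occurrence of the winning pair
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, vocab, seqs

# toy SMILES-like strings: the most frequent pair ("C", "C") is merged first
merges, vocab, seqs = bpe_merges(["CCO", "CCN", "CCC"], 1)
# seqs -> [["CC", "O"], ["CC", "N"], ["CC", "C"]]
```

In the graph setting, the merged units are frequent subgraphs (e.g., rings and common fragments) rather than character bigrams, which is what the 3000-token vocabulary plus ring tokens capture.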

📁 Project Structure

🎯 downstream/ - Inference & Evaluation

The main interface for using DemoDiff on molecular design tasks.

Core Scripts:

  • inference.py: Main inference script for molecular generation and evaluation
  • prepare_model.py: Downloads the pretrained DemoDiff-0.7B model from HuggingFace
  • prepare_data_and_oracle.py: Downloads benchmark datasets and property oracles

Key Components:

  • context_data/: 30+ benchmark tasks organized by category (downloaded via prepare_data_and_oracle.py)
  • docking/: Molecular docking infrastructure built on AutoDock Vina
  • oracle/: Property-prediction oracles for specialized tasks (downloaded via prepare_data_and_oracle.py)
  • pretrained/: Model checkpoints and configuration files
  • models/: Neural network architectures
  • utils/: Molecular processing and tokenization utilities

🧬 pretrain/ - Model Training Pipeline

Coming soon

📊 data/ - Tokenization & Preprocessing

  • processed/vocab3000ring300/: Cached metadata
  • tokenizer/vocab3000ring300/: Molecular graph tokenization files
    • pretrain-token.node: Node vocabulary
    • pretrain-token.edge: Edge vocabulary
    • pretrain-token.ring: Ring structure vocabulary
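As a loose sketch of how such vocabulary files might be consumed — assuming, hypothetically, a one-token-per-line format where line index is token id (check the repo's tokenizer utilities in downstream/utils/ for the real format):

```python
import os
import tempfile
from pathlib import Path

def load_vocab(path):
    """Map each token to its line index.
    NOTE: the on-disk format of pretrain-token.* is an assumption here
    (one token per line); the repo's actual format may differ."""
    lines = Path(path).read_text().splitlines()
    return {tok: i for i, tok in enumerate(lines) if tok}

# toy stand-in for data/tokenizer/vocab3000ring300/pretrain-token.node
with tempfile.NamedTemporaryFile("w", suffix=".node", delete=False) as f:
    f.write("C\nN\nO\nc1ccccc1\n")
vocab = load_vocab(f.name)
os.unlink(f.name)
# vocab -> {"C": 0, "N": 1, "O": 2, "c1ccccc1": 3}
```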

⚙️ configs/ - Model Configuration

  • config.yaml: Hyperparameters, training settings, and tokenization config
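As a rough illustration, such a config might group the hyperparameters listed under Model Architecture below like this — the actual keys and layout in the repo's config.yaml almost certainly differ:

```yaml
# Hypothetical layout -- the real config.yaml keys may differ.
model:
  num_layers: 24
  num_heads: 16
  hidden_dim: 1280
diffusion:
  steps: 500
  noise_schedule: cosine
tokenizer:
  vocab_size: 3000
  ring_tokens: 300
  max_tokens_per_molecule: 150
```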

🔧 Model Architecture

  • Model Type: Discrete diffusion transformer with marginal transitions
  • Architecture: 24 transformer layers, 16 attention heads, hidden dimension 1280
  • Tokenization: Graph BPE with a 3000-token vocabulary plus 300 ring tokens
  • Training: 500 diffusion steps with a cosine noise schedule
  • Context Length: Up to 150 tokens per molecule
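A cosine noise schedule over T = 500 steps can be sketched with the standard cumulative form ᾱ(t) of Nichol & Dhariwal. This is a generic sketch of the schedule family, not DemoDiff's exact discrete-diffusion implementation:

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal retention alpha_bar(t) under a cosine schedule.
    Generic sketch; the discrete-diffusion variant used here may differ."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 500  # diffusion steps, as listed above
alpha_bar = [cosine_alpha_bar(t, T) for t in range(T + 1)]
# alpha_bar decays smoothly from 1.0 (clean data) toward 0.0 (pure noise)
```

The cosine shape spends more steps at low noise levels than a linear schedule, which tends to help generation quality.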
