DemoDiff: Graph Diffusion Transformers are In-Context Molecular Designers

DemoDiff is a diffusion-based molecular foundation model for in-context inverse molecular design. It leverages graph diffusion transformers to generate molecules based on contextual examples, enabling few-shot molecular design across diverse chemical tasks without task-specific fine-tuning.

Links: arXiv · HuggingFace Model · HuggingFace Data (Downstream)

🌟 Key Features

  • In-Context Learning: Generate molecules using only contextual examples (no fine-tuning required)
  • Graph-Based Tokenization: Novel molecular graph tokenization with BPE-style vocabulary
  • Comprehensive Benchmarks: 30+ downstream tasks covering drug discovery, docking, and polymer design
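The BPE-style vocabulary idea can be illustrated with plain string BPE. This is a simplified sketch on character sequences; DemoDiff applies the analogous merge procedure to molecular graph substructures, not strings:

```python
from collections import Counter

def bpe_merges(sequences, num_merges):
    """Learn BPE-style merges: repeatedly fuse the most frequent adjacent
    token pair into a single new token (string version, for illustration)."""
    seqs = [list(s) for s in sequences]
    vocab = {tok for s in seqs for tok in s}
    merges = []
    for _ in range(num_merges):
        # count all adjacent token pairs across the corpus
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # greedily replace every occurrence of the winning pair
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, vocab, seqs

# toy SMILES-like strings: the most frequent pair ("C", "C") is merged first
merges, vocab, seqs = bpe_merges(["CCO", "CCN", "CCC"], 1)
# seqs -> [["CC", "O"], ["CC", "N"], ["CC", "C"]]
```

In the graph setting, the merged units are frequent subgraphs (e.g., rings and common fragments) rather than character bigrams, which is what the 3000-token vocabulary plus ring tokens capture.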

📁 Project Structure

🎯 downstream/ - Inference & Evaluation

The main interface for using DemoDiff on molecular design tasks.

Core Scripts:

  • inference.py: Main inference script for molecular generation and evaluation
  • prepare_model.py: Downloads the pretrained DemoDiff-0.7B model from HuggingFace
  • prepare_data_and_oracle.py: Downloads benchmark datasets and property oracles

Key Components:

  • context_data/: 30+ benchmark tasks organized by category (downloaded via prepare_data_and_oracle.py)
  • docking/: Molecular docking infrastructure built on AutoDock Vina
  • oracle/: Property-prediction oracles for specialized tasks (downloaded via prepare_data_and_oracle.py)
  • pretrained/: Model checkpoints and configuration files
  • models/: Neural network architectures
  • utils/: Molecular processing and tokenization utilities

🧬 pretrain/ - Model Training Pipeline

Coming soon

📊 data/ - Tokenization & Preprocessing

  • processed/vocab3000ring300/: Cached metadata
  • tokenizer/vocab3000ring300/: Molecular graph tokenization files
    • pretrain-token.node: Node vocabulary
    • pretrain-token.edge: Edge vocabulary
    • pretrain-token.ring: Ring structure vocabulary
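As a loose sketch of how such vocabulary files might be consumed — assuming, hypothetically, a one-token-per-line format where line index is token id (check the repo's tokenizer utilities in downstream/utils/ for the real format):

```python
import os
import tempfile
from pathlib import Path

def load_vocab(path):
    """Map each token to its line index.
    NOTE: the on-disk format of pretrain-token.* is an assumption here
    (one token per line); the repo's actual format may differ."""
    lines = Path(path).read_text().splitlines()
    return {tok: i for i, tok in enumerate(lines) if tok}

# toy stand-in for data/tokenizer/vocab3000ring300/pretrain-token.node
with tempfile.NamedTemporaryFile("w", suffix=".node", delete=False) as f:
    f.write("C\nN\nO\nc1ccccc1\n")
vocab = load_vocab(f.name)
os.unlink(f.name)
# vocab -> {"C": 0, "N": 1, "O": 2, "c1ccccc1": 3}
```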

⚙️ configs/ - Model Configuration

  • config.yaml: Hyperparameters, training settings, and tokenization config
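As a rough illustration, such a config might group the hyperparameters listed under Model Architecture below like this — the actual keys and layout in the repo's config.yaml almost certainly differ:

```yaml
# Hypothetical layout -- the real config.yaml keys may differ.
model:
  num_layers: 24
  num_heads: 16
  hidden_dim: 1280
diffusion:
  steps: 500
  noise_schedule: cosine
tokenizer:
  vocab_size: 3000
  ring_tokens: 300
  max_tokens_per_molecule: 150
```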

🔧 Model Architecture

  • Model Type: Discrete diffusion transformer with marginal transitions
  • Architecture: 24 transformer layers, 16 attention heads, hidden dimension 1280
  • Tokenization: Graph BPE with a 3000-token vocabulary plus 300 ring tokens
  • Training: 500 diffusion steps with a cosine noise schedule
  • Context Length: Up to 150 tokens per molecule
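A cosine noise schedule over T = 500 steps can be sketched with the standard cumulative form ᾱ(t) of Nichol & Dhariwal. This is a generic sketch of the schedule family, not DemoDiff's exact discrete-diffusion implementation:

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal retention alpha_bar(t) under a cosine schedule.
    Generic sketch; the discrete-diffusion variant used here may differ."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 500  # diffusion steps, as listed above
alpha_bar = [cosine_alpha_bar(t, T) for t in range(T + 1)]
# alpha_bar decays smoothly from 1.0 (clean data) toward 0.0 (pure noise)
```

The cosine shape spends more steps at low noise levels than a linear schedule, which tends to help generation quality.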
