Golem

Descriptor-based pretraining for Graph Transformers. Inspired by CheMeleon, with improvements including NaN-aware validity masking and scaling, and isoform enumeration for data augmentation.

Golem pretrains a gt-pyg GraphTransformerNet backbone to predict Mordred 2D molecular descriptors, with optional 3D descriptor targets and ECFP-latent alignment.

Installation

Prerequisites

Python 3.10+
pip (included with Python)
A checkout of the gt-pyg package

golem imports gt_pyg at runtime, so gt-pyg must be installed in the same environment before running golem pretrain.

Setup

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

# Install gt-pyg first.
# Local development checkout:
python -m pip install -e /path/to/gt-pyg

# Or install gt-pyg from GitHub instead:
# python -m pip install "gt-pyg @ git+https://github.com/pgniewko/gt-pyg.git"

# Install golem (editable)
python -m pip install -e .

# (Optional) Install dev dependencies
python -m pip install -e ".[dev]"

If gt-pyg is checked out as a sibling directory of golem (that is, at ../gt-pyg), the full setup is:

cd /path/to/golem
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ../gt-pyg
python -m pip install -e .
# Optional: install dev dependencies
python -m pip install -e ".[dev]"

Verify installation

golem --help
# Should show 'pretrain' and 'report' commands

python -c "from gt_pyg import GraphTransformerNet; print('gt-pyg OK')"
python -c "from golem.config import PretrainConfig; print('golem OK')"
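
The same check can be scripted when several environments need to be verified. The following is a minimal sketch that only tests whether the two top-level packages can be located; the helper name `check_importable` is hypothetical and not part of golem.

```python
import importlib.util


def check_importable(module_names):
    """Return {name: bool} telling whether each top-level module can be located.

    Hypothetical helper for illustration; it does not import the modules,
    it only asks the import system whether they are installed.
    """
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # gt_pyg must be present in the same environment as golem.
    for name, ok in check_importable(["gt_pyg", "golem"]).items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```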

Running Pretraining

Quick smoke run (~1 minute)

golem pretrain \
  --smiles data/openadmet/expansion_rx/train_test_smiles.smi \
  --output experiments/test_pretrain \
  --max-epochs 10 \
  --subsample 0.1 \
  --no-isoforms

Production run

golem pretrain \
  --smiles data/openadmet/expansion_rx/train_test_smiles.smi \
  --config configs/golem-2d.yaml \
  --output experiments/pretrain

Config files in configs/ are intended to contain overrides over the defaults in golem.config.PretrainConfig, not a full copy of every setting.
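
To illustrate what "overrides over the defaults" means, here is a minimal sketch of layering a nested override dictionary (as one would get from parsing a YAML file) on top of dataclass defaults. The classes `ExampleConfig` and `TrainingDefaults`, the `training` grouping, and the function `apply_overrides` are hypothetical stand-ins, not the actual layout of `golem.config.PretrainConfig`; only the default values are taken from the CLI options table.

```python
from dataclasses import dataclass, field, fields, is_dataclass, replace


@dataclass
class TrainingDefaults:  # hypothetical section, not golem's real schema
    max_epochs: int = 500
    batch_size: int = 128
    lr: float = 1e-4


@dataclass
class ExampleConfig:  # hypothetical stand-in for PretrainConfig
    seed: int = 42
    training: TrainingDefaults = field(default_factory=TrainingDefaults)


def apply_overrides(config, overrides):
    """Return a copy of `config` with a nested dict of overrides applied.

    Keys that are absent from `overrides` keep their default values.
    Unknown keys raise, so typos in a config file are caught early.
    """
    known = {f.name for f in fields(config)}
    changes = {}
    for key, value in overrides.items():
        if key not in known:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(config, key)
        if is_dataclass(current) and isinstance(value, dict):
            changes[key] = apply_overrides(current, value)  # recurse into sections
        else:
            changes[key] = value
    return replace(config, **changes)


# Layering order assumed here: defaults, then YAML overrides, then CLI overrides.
yaml_overrides = {"training": {"max_epochs": 200}}
cli_overrides = {"training": {"batch_size": 64}}
resolved = apply_overrides(apply_overrides(ExampleConfig(), yaml_overrides), cli_overrides)
```

In this sketch `resolved.training.max_epochs` comes from the YAML layer, `resolved.training.batch_size` from the CLI layer, and `resolved.training.lr` and `resolved.seed` stay at their defaults.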

Optional ECFP-latent alignment can be enabled in YAML:

ecfp_latent_alignment:
  enabled: true

Optional 3D descriptor targets can be enabled in YAML:

descriptors:
  include_3d_targets: true

Set descriptors.include_2d_targets: false together with descriptors.include_3d_targets: true to train on 3D descriptors only. To optimize only the ECFP-latent alignment objective while keeping the descriptor heads active, set descriptors.loss_weight: 0.0 and enable ecfp_latent_alignment.

The 3D pipeline choices are fixed: ElectroShape uses Gasteiger charges, conformers are embedded with ETKDGv3 and optimized with MMFF (falling back to UFF), and the single lowest-energy conformer out of conformers.n_generate attempts is used for 3D descriptors.

If conformer generation or a 3D descriptor family fails, the molecule is kept and the affected 3D targets are masked in the same way as invalid 2D descriptor entries. 3D descriptor columns that are invalid for every molecule are dropped from the dataset, and the run fails if no descriptor columns remain.
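
The masking behavior described above can be sketched in plain Python. This is an illustration of the idea, not golem's implementation: the function names are hypothetical, and treating every non-finite value (NaN or infinity) as invalid is an assumption.

```python
import math


def validity_mask(targets):
    """True where a target is a finite number (assumption: NaN and inf are invalid)."""
    return [[t is not None and math.isfinite(t) for t in row] for row in targets]


def drop_all_invalid_columns(targets, mask):
    """Drop columns that are invalid for every molecule; fail if none remain."""
    keep = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    if not keep:
        raise ValueError("no descriptor columns remain")
    kept_targets = [[row[j] for j in keep] for row in targets]
    kept_mask = [[row[j] for j in keep] for row in mask]
    return kept_targets, kept_mask, keep


def masked_mse(predictions, targets, mask):
    """Mean squared error over valid entries only; masked entries contribute nothing."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(predictions, targets, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


nan = float("nan")
# Two molecules, three descriptor columns; the last column failed for both.
raw_targets = [[1.0, nan, nan], [3.0, 2.0, nan]]
targets, mask, kept = drop_all_invalid_columns(raw_targets, validity_mask(raw_targets))
loss = masked_mse([[1.0, 5.0], [1.0, 2.0]], targets, mask)
```

The first molecule stays in the dataset even though one of its remaining targets is invalid; only that entry is excluded from the loss.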

CLI options

| Flag | Description | Default |
| --- | --- | --- |
| `--smiles` | Path to SMILES file (`.smi` or `.csv`) | Required |
| `--config` | Path to YAML config file | Built-in defaults |
| `--output` | Output directory for checkpoints and logs | Required |
| `--max-epochs` | Override max training epochs | 500 |
| `--batch-size` | Override batch size | 128 |
| `--lr` | Override learning rate | 1e-4 |
| `--num-workers` | Override PyG data loading workers | 0 |
| `--subsample` | Subsample fraction (e.g. 0.1 for 10%) | None (use all) |
| `--seed` | Override random seed | 42 |
| `--no-isoforms` | Disable isoform enumeration | Off (isoform enumeration enabled) |
| `--verbose` | Show DEBUG-level logs on console | Off |

What pretraining produces

After a run completes, the output directory contains:

experiments/pretrain/
  best_checkpoint.pt        # Best model by validation objective
  last_checkpoint.pt        # Most recent completed-epoch weights
  resolved_config.yaml      # Full resolved config used for the run
  pretrain_report.html      # HTML dashboard with training curves and metrics (not tracked)
  metrics.csv               # Per-epoch objective, descriptor, RMSE, LR, and optional alignment metrics (not tracked)
  pretrain.log              # Full log output (not tracked)
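
As an example of consuming these outputs, the sketch below picks the best epoch from `metrics.csv`. The column names `epoch` and `val_objective` are assumptions made for illustration (check the header of your own `metrics.csv`), and the sketch assumes that a lower value is better.

```python
import csv


def best_epoch(lines, metric_column, epoch_column="epoch"):
    """Return (epoch, value) for the row with the lowest `metric_column`.

    `lines` is any iterable of CSV text lines, for example an open file.
    Rows where the metric is empty are skipped.
    """
    rows = [row for row in csv.DictReader(lines) if row.get(metric_column)]
    if not rows:
        raise ValueError(f"no rows with a value for {metric_column!r}")
    best = min(rows, key=lambda row: float(row[metric_column]))
    return int(best[epoch_column]), float(best[metric_column])


# Usage against a real run (column names are hypothetical):
# with open("experiments/pretrain/metrics.csv", newline="") as handle:
#     print(best_epoch(handle, "val_objective"))
```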

Generating Reports

After a pretraining run completes, an HTML report with training curves is automatically generated in the output directory. You can also regenerate or create a report from any existing experiment directory:

golem report experiments/pretrain

This reads metrics.csv and resolved_config.yaml from the experiment directory and produces a single-file HTML dashboard (pretrain_report.html) with:

Training & validation objective curves
Training & validation descriptor-loss curves
Validation RMSE curve
Learning rate schedule
Optional ECFP-latent alignment chart when those metrics are present
Summary cards (best epoch, best val loss, elapsed time, architecture)
Epoch-by-epoch table with the best row highlighted

Note: the generated HTML references Chart.js from a CDN, so it is not fully offline/self-contained.

To write the report to a custom path:

golem report experiments/pretrain --output path/to/report.html

Key module responsibilities

| Module | What it does |
| --- | --- |
| `cli.py` | Parses CLI args, merges config, calls `pretrain()` |
| `config.py` | Defines `PretrainConfig` dataclass tree; merges defaults / YAML overrides / CLI |
| `conformers.py` | Builds the lowest-energy RDKit conformer used for optional 3D descriptor targets |
| `isoforms.py` | Enumerates tautomers, protonation states, and neutralized forms per molecule |
| `descriptors.py` | Computes 2D/3D descriptor targets; provides `NaNAwareStandardScaler` |
| `pretrain.py` | Orchestrates the full pipeline: load SMILES → split parents → expand isoforms within each split → descriptors → scale → train → checkpoint |
| `utils.py` | Shared utilities: seeding, train/val/test splitting, PyG DataLoader creation, SMILES file loading |
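
The "split parents → expand isoforms within each split" ordering in the pipeline matters because all isoforms of one parent molecule then land in the same split. A minimal sketch of that ordering follows; the function names, the 80/10/10 fractions, and the toy enumerator are assumptions for illustration and are not taken from golem.

```python
import random


def split_parents(parents, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle parent molecules and cut them into train/val/test (hypothetical fractions)."""
    shuffled = list(parents)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def expand_within_splits(splits, enumerate_isoforms):
    """Expand each parent into its isoforms without moving anything between splits."""
    return {
        name: [(parent, iso) for parent in parents for iso in enumerate_isoforms(parent)]
        for name, parents in splits.items()
    }


def toy_enumerator(parent):
    """Stand-in for a real tautomer/protonation enumerator."""
    return [parent, parent + "_isoform"]


parents = [f"P{i}" for i in range(10)]
expanded = expand_within_splits(split_parents(parents), toy_enumerator)
```

Expanding first and splitting afterwards would allow two isoforms of the same parent to end up on both sides of the train/validation boundary, which is what this ordering avoids.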

Where things live

| Looking for... | Go to |
| --- | --- |
| How the model is constructed | `pretrain.py` model creation section |
| The masked MSE pretraining loss | `pretrain.py:_train_one_epoch()` |
| NaN handling / validity masking | `descriptors.py:compute_mordred_descriptors()` |
| Scaler fit (train-only, no leakage) | `pretrain.py:pretrain()` step 5 |
| Config defaults | `config.py` dataclass definitions |
| Production config overrides | `configs/golem-2d.yaml` |
| Pretraining pipeline flow | `pretrain.py` module docstring |
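
For orientation, the idea behind a NaN-aware scaler that is fit on the training split only can be sketched as follows. This is not the code of `NaNAwareStandardScaler` in `descriptors.py`; the class name `NaNAwareScalerSketch`, the use of the population standard deviation, and the fallback to a scale of 1.0 for constant columns are assumptions.

```python
import math


class NaNAwareScalerSketch:
    """Per-column standardization that ignores NaN entries when fitting."""

    def fit(self, rows):
        self.means, self.stds = [], []
        for j in range(len(rows[0])):
            valid = [row[j] for row in rows if not math.isnan(row[j])]
            mean = sum(valid) / len(valid) if valid else 0.0
            var = sum((v - mean) ** 2 for v in valid) / len(valid) if valid else 0.0
            std = math.sqrt(var)
            self.means.append(mean)
            self.stds.append(std if std > 0 else 1.0)  # avoid division by zero
        return self

    def transform(self, rows):
        # NaN inputs stay NaN, so the validity mask still applies after scaling.
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


nan = float("nan")
train = [[1.0, nan], [3.0, 4.0]]
val = [[5.0, nan]]

scaler = NaNAwareScalerSketch().fit(train)  # statistics come from train only
val_scaled = scaler.transform(val)          # validation data never influences them
```

Fitting on the training rows and reusing the same statistics for validation and test data is what keeps information from those splits out of the scaling step.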
