<a href="https://colab.research.google.com/github/hesther/teaching/blob/main/demos/short_demo_chemtorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ChemTorch demo
Connect to a T4 GPU for the best experience!
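
In Colab, the GPU is usually selected under Runtime > Change runtime type. The sketch below is one way to confirm that a GPU is visible before installing anything. It uses only the standard library and the `nvidia-smi` tool; the helper name `describe_gpu` is made up for this illustration and is not part of ChemTorch.

```python
import shutil
import subprocess


def describe_gpu() -> str:
    """Return the GPU name(s) reported by nvidia-smi, or a note if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return "no GPU runtime detected (nvidia-smi not found)"
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return "nvidia-smi is present but reported an error"
    return result.stdout.strip() or "nvidia-smi listed no GPUs"


print(describe_gpu())
```

If this reports that no GPU was found, the training commands further down should still run, but they will report `Using device: cpu` and take longer.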

In [1]:
!pip install rdkit numpy==1.26.4 scikit-learn pandas
!pip install torch==2.6.0
!pip install hydra-core
!pip install torch_geometric
!pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-2.6.0+cu124.html
!pip install wandb
!pip install lightning
!pip install ipykernel

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0)
  Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0)
  Using cached nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.6.0)
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0)
  Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.6.0)
  Using cached nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (fro

In [2]:
!git clone https://github.com/heid-lab/chemtorch.git
%cd chemtorch
!pip install .

Cloning into 'chemtorch'...
remote: Enumerating objects: 346, done.
remote: Counting objects: 100% (346/346), done.
remote: Compressing objects: 100% (270/270), done.
remote: Total 346 (delta 76), reused 325 (delta 61), pack-reused 0 (from 0)
Receiving objects: 100% (346/346), 270.50 KiB | 10.02 MiB/s, done.
Resolving deltas: 100% (76/76), done.
/content/chemtorch
Processing /content/chemtorch
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: chemtorch
  Building wheel for chemtorch (pyproject.toml) ... done
  Created wheel for chemtorch: filename=chemtorch-2025.6.23-py3-none-any.whl size=105831 sha256=846f3a00292182bc61a1743767f4cede353f556a8483648597f4e6e03f2b4d2e
  Stored in directory: /tmp/pip-ephem-wheel-cache-h406z4r8/wheels/bb/49/2b/a719528ad2395fd98ddd8c902be8719ebca66ffd3e5d093b26
Successfully b

Next, download some datasets to experiment with:

In [3]:
!git clone https://github.com/heid-lab/reaction_database.git
!ln -s reaction_database/data data

Cloning into 'reaction_database'...
remote: Enumerating objects: 37054, done.
remote: Counting objects: 100% (37054/37054), done.
remote: Compressing objects: 100% (37002/37002), done.
remote: Total 37054 (delta 35), reused 37038 (delta 25), pack-reused 0 (from 0)
Receiving objects: 100% (37054/37054), 38.39 MiB | 15.72 MiB/s, done.
Resolving deltas: 100% (35/35), done.
Updating files: 100% (35801/35801), done.


## Training

### Graph neural networks

Let's start with a GNN trained on condensed graphs of reaction (CGRs) to predict reaction barrier heights, using 5% of the RDB7 dataset:

In [6]:
!python scripts/main.py +experiment=graph dataset.subsample=0.05

Using device: cuda
INFO: Data ingestor instantiated successfully
INFO: Data ingestor finished successfully
INFO: Data module factory instantiated successfully
INFO: Precomputing 1014 items...
INFO: Precomputation finished in 4.29s.
INFO: Precomputing 60 items...
INFO: Precomputation finished in 0.35s.
INFO: Precomputing 119 items...
INFO: Precomputation finished in 0.52s.
INFO: Data modules instantiated successfully
INFO: Dataloaders instantiated successfully
INFO: Updating global config with properties of train dataset:
INFO: Final config:
data_ingestor:
  data_source:
    _target_: chemtorch.data_ingestor.data_source.SingleCSVSource
    data_path: data/rdb7/barriers/forward_reverse_spiekermann_splits/data.csv
  column_mapper:
    _target_: chemtorch.data_ingestor.column_mapper.ColumnFilterAndRename
    column_mapping:
      smiles: rxn_smiles
      label: ea
  data_splitter:
    _target_: chemtorch.data_ingestor.data_splitter.IndexSplitter
    split_index_path: data/rdb7/barriers/for

### Language models

ChemTorch supports many different data modalities and tasks. For example, to train a language model for reaction classification using SMILES strings as the representation (on 5% of the USPTO-1K dataset, for 2 epochs), run:

In [7]:
!python scripts/main.py +experiment=token dataset.subsample=0.05 routine.epochs=2

Using device: cuda
INFO: Data ingestor instantiated successfully
INFO: Data ingestor finished successfully
INFO: Data module factory instantiated successfully
INFO: Precomputing 20030 items...
INFO: Precomputation finished in 9.51s.
INFO: Precomputing 2226 items...
INFO: Precomputation finished in 1.03s.
INFO: Precomputing 2226 items...
INFO: Precomputation finished in 1.21s.
INFO: Data modules instantiated successfully
INFO: Dataloaders instantiated successfully
INFO: Updating global config with properties of train dataset:
INFO: Final config:
data_ingestor:
  data_source:
    _target_: chemtorch.data_ingestor.data_source.PreSplitCSVSource
    data_folder: data/uspto-1k/classes/pre_split
  column_mapper:
    _target_: chemtorch.data_ingestor.column_mapper.ColumnFilterAndRename
    column_mapping:
      smiles: reaction
      label: labels
  _target_: chemtorch.data_ingestor.SimpleDataIngestor
dataset:
  representation:
    tokenizer:
      _target_: chemtorch.tokenizer.simple_tokenize

### Reaction fingerprints

How about an MLP on a reaction fingerprint for reaction classification?

In [8]:
!python scripts/main.py +experiment=fingerprint dataset.subsample=0.001

Using device: cuda
INFO: Data ingestor instantiated successfully
INFO: Data ingestor finished successfully
INFO: Data module factory instantiated successfully
INFO: Precomputing 401 items...
INFO: Precomputation finished in 4.79s.
INFO: Precomputing 45 items...
INFO: Precomputation finished in 0.59s.
INFO: Precomputing 45 items...
INFO: Precomputation finished in 0.77s.
INFO: Data modules instantiated successfully
INFO: Dataloaders instantiated successfully
INFO: Updating global config with properties of train dataset:
INFO: Final config:
data_ingestor:
  data_source:
    _target_: chemtorch.data_ingestor.data_source.PreSplitCSVSource
    data_folder: data/uspto-1k/classes/pre_split
  column_mapper:
    _target_: chemtorch.data_ingestor.column_mapper.ColumnFilterAndRename
    column_mapping:
      smiles: reaction
      label: labels
  _target_: chemtorch.data_ingestor.SimpleDataIngestor
dataset:
  representation:
    _target_: chemtorch.representation.fingerprint.drfp.DRFP
    _recurs

Let's train a barrier height prediction model on our own custom data. First, download an example CSV file:

In [11]:
!wget https://raw.githubusercontent.com/hesther/rxn_workshop/refs/heads/main/data/e2sn2/train_full.csv

--2025-06-24 11:30:14--  https://raw.githubusercontent.com/hesther/rxn_workshop/refs/heads/main/data/e2sn2/train_full.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 522030 (510K) [text/plain]
Saving to: ‘train_full.csv’


2025-06-24 11:30:14 (141 MB/s) - ‘train_full.csv’ saved [522030/522030]
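
Before pointing a data ingestor at the downloaded file, it can help to confirm that the columns you plan to map actually exist. The following is a minimal sketch using only the standard library. The helper `missing_columns` is hypothetical (it is not a ChemTorch function), and the column names `AAM` and `ea` are the ones used in the training command in this section.

```python
import csv
from pathlib import Path


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with Path(csv_path).open(newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]


# Only inspect the file if it has been downloaded into the working directory.
if Path("train_full.csv").exists():
    print(missing_columns("train_full.csv", ["AAM", "ea"]))
```

An empty list means every requested column was found in the header; any names that are printed would need a different entry in `column_mapping`.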



You can either drop the CSV file into the `data` folder and create a config file in `conf/data_ingestor`, or reuse an existing data ingestor and override its path and column names on the command line:

In [9]:
!python scripts/main.py +experiment=graph data_ingestor=rdb7_fwd data_ingestor.data_source.data_path=train_full.csv data_ingestor.column_mapper.column_mapping.label=ea data_ingestor.column_mapper.column_mapping.smiles=AAM

Using device: cuda
INFO: Data ingestor instantiated successfully
INFO: Data ingestor finished successfully
INFO: Data module factory instantiated successfully
INFO: Precomputing 2016 items...
INFO: Precomputation finished in 6.82s.
INFO: Precomputing 112 items...
INFO: Precomputation finished in 0.37s.
INFO: Precomputing 112 items...
INFO: Precomputation finished in 0.37s.
INFO: Data modules instantiated successfully
INFO: Dataloaders instantiated successfully
INFO: Updating global config with properties of train dataset:
INFO: Final config:
data_ingestor:
  data_source:
    _target_: chemtorch.data_ingestor.data_source.SingleCSVSource
    data_path: train_full.csv
  column_mapper:
    _target_: chemtorch.data_ingestor.column_mapper.ColumnFilterAndRename
    column_mapping:
      smiles: AAM
      label: ea
  data_splitter:
    _target_: chemtorch.data_ingestor.data_splitter.RatioSplitter
    train_ratio: 0.9
    val_ratio: 0.05
    test_ratio: 0.05
  _target_: chemtorch.data_ingestor.
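
Long override lists like the one in the command above are easy to mistype. One option, sketched here under the assumption that you launch training from Python rather than typing the command by hand, is to keep the overrides in a dictionary and assemble the command from it. The override keys are copied from that command; the helper `build_command` is hypothetical and not part of ChemTorch.

```python
import shlex

# Override keys copied from the training command; values match the downloaded file.
overrides = {
    "data_ingestor": "rdb7_fwd",
    "data_ingestor.data_source.data_path": "train_full.csv",
    "data_ingestor.column_mapper.column_mapping.label": "ea",
    "data_ingestor.column_mapper.column_mapping.smiles": "AAM",
}


def build_command(experiment, overrides):
    """Assemble the argument list for scripts/main.py with key=value overrides."""
    args = ["python", "scripts/main.py", f"+experiment={experiment}"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


command = build_command("graph", overrides)
print(shlex.join(command))
```

The printed string reproduces the command used in this section. The `command` list could also be passed to `subprocess.run` from inside the `chemtorch` directory.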

## 3D information
We can also train on 3D data, here using 5% of the dataset:

In [4]:
!python scripts/main.py +experiment=xyz dataset.subsample=0.05

Using device: cpu
INFO: Data ingestor instantiated successfully
INFO: Data ingestor finished successfully
INFO: Data module factory instantiated successfully
INFO: Precomputing 537 items...
INFO: Precomputation finished in 0.37s.
INFO: Precomputing 30 items...
INFO: Precomputation finished in 0.02s.
INFO: Precomputing 30 items...
INFO: Precomputation finished in 0.02s.
INFO: Data modules instantiated successfully
INFO: Dataloaders instantiated successfully
INFO: Updating global config with properties of train dataset:
INFO: Final config:
data_ingestor:
  data_source:
    _target_: chemtorch.data_ingestor.data_source.SingleCSVSource
    data_path: data/rdb7/barriers/forward/data.csv
  column_mapper:
    _target_: chemtorch.data_ingestor.column_mapper.ColumnFilterAndRename
    column_mapping:
      smiles: smiles
      reaction_dir: rxn
      label: dE0
  data_splitter:
    _target_: chemtorch.data_ingestor.data_splitter.RatioSplitter
    train_ratio: 0.9
    val_ratio: 0.05
    test_rat
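
The `RatioSplitter` ratios shown in the configs (0.9 for training, 0.05 for validation) are consistent with the item counts in the `Precomputing ... items` log lines. The sketch below reproduces that arithmetic. Rounding to the nearest integer and giving the remainder to the test set are assumptions of this illustration, not a description of how ChemTorch computes its splits, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_items, train_ratio=0.9, val_ratio=0.05):
    """Split n_items into (train, val, test) counts; test takes the remainder.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    n_train = round(n_items * train_ratio)
    n_val = round(n_items * val_ratio)
    n_test = n_items - n_train - n_val
    return n_train, n_val, n_test


# Totals taken from the log lines of the 3D run and the custom-data run.
print(split_sizes(537 + 30 + 30))      # (537, 30, 30)
print(split_sizes(2016 + 112 + 112))   # (2016, 112, 112)
```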