Despite rapid advances in molecular and materials machine learning, most models lack physical transferability: they fit correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs. Yet bonding, charge redistribution, orbital hybridization, and electronic coupling all emerge from these two-body interactions that define local quantum fields in many-body systems.
We introduce QuantumCanvas, a large-scale multimodal benchmark that treats two-body quantum systems as foundational units of matter. The dataset spans 2,850 element–element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel image representations derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections. These physically grounded images encode spatial, angular, and electrostatic symmetries without explicit coordinates, providing an interpretable visual modality for quantum learning.
Benchmarking eight architectures across 18 targets, we report MAEs of 0.201 eV on energy gap with GATv2, 0.265 eV on HOMO and 0.274 eV on LUMO with EGNN, and 0.008 Å on bond length with DimeNet. For energy-related quantities, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, while a multimodal fusion model achieves a 2.15 eV Mermin free-energy MAE. Pretraining on QuantumCanvas further improves convergence stability and generalization when fine-tuned on QM9, MD17, and CrysMTM.
By unifying orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions through coupled visual and numerical modalities.
```bash
python build_dataset.py
```

This creates dataset_combined.npz (31.9 MB) with all 2850 samples in one file.

Custom output:

```bash
python build_dataset.py /path/to/raw_data my_dataset.npz
```

PyTorch (for CNNs/ViTs):
```python
from pytorch_dataset import TwoBodyDataset
from torch.utils.data import DataLoader

dataset = TwoBodyDataset('dataset_combined.npz', target_label='e_g_ev')
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for images, targets in loader:
    outputs = model(images)  # images: [32, 10, 32, 32]
```

PyTorch Geometric (for GNNs):
```python
from pytorch_geometric_dataset import TwoBodyGraphDataset
from torch_geometric.loader import DataLoader

dataset = TwoBodyGraphDataset('dataset_combined.npz', target_labels=['e_g_ev'])
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    outputs = gnn_model(batch.x, batch.edge_index, batch.edge_attr)
```

```
dataset_combined.npz  ← Single file with all 2850 samples (31.9 MB)
├── images:      [2850, 10, 32, 32]  - All image tensors
├── geometries:  [2850, 2, 4]        - All 3D coordinates
├── elements:    list of 2850 element pairs
├── labels:      list of 2850 label dicts
├── metadata:    list of 2850 metadata dicts
└── pair_names:  list of 2850 system names

analysis/
├── all_labels.csv               ← All 37 labels in CSV format
├── geometry_data.csv            ← 3D coordinates for all systems
└── labels_detailed_summary.txt  ← Complete label statistics
```
```python
import numpy as np

# Load entire dataset
data = np.load('dataset_combined.npz', allow_pickle=True)

# Access sample 0
image = data['images'][0]          # [10, 32, 32]
geometry = data['geometries'][0]   # [2, 4]
elements = data['elements'][0]     # ['Ag', 'Al']
labels = data['labels'][0]         # {dict} 37 labels
pair_name = data['pair_names'][0]  # 'Ag_Al'
band_gap = labels['e_g_ev']
```

| Ch | Name | Description |
|---|---|---|
| 0-1 | O-Map | Orbital features (radial, angular) |
| 2-3 | RIP-GAF | Rotation-invariant orbitals (s/p, d/f) |
| 4-5 | RIP-MTF | Multipole moments (dipole, quadrupole) |
| 6-7 | COM | Density features (charge, orbital) |
| 8-9 | Q-Image | Charge distribution (positive, negative) |
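
The channel layout in the table above can be turned into named slices so that a model or plot can address one representation at a time. This is a minimal sketch, not part of the released loaders: `CHANNEL_GROUPS` and `split_channels` are hypothetical names, and a random array stands in for `data['images'][0]`.

```python
import numpy as np

# Channel index ranges taken from the table above (end index is exclusive).
CHANNEL_GROUPS = {
    "O-Map": slice(0, 2),
    "RIP-GAF": slice(2, 4),
    "RIP-MTF": slice(4, 6),
    "COM": slice(6, 8),
    "Q-Image": slice(8, 10),
}


def split_channels(image):
    """Split a [10, H, W] image into a dict of [2, H, W] arrays, one per group."""
    if image.shape[0] != 10:
        raise ValueError(f"expected 10 channels, got {image.shape[0]}")
    return {name: image[idx] for name, idx in CHANNEL_GROUPS.items()}


# Stand-in for data['images'][0]; replace with a real sample.
image = np.random.default_rng(0).random((10, 32, 32)).astype(np.float32)
groups = split_channels(image)
for name, arr in groups.items():
    print(name, arr.shape)
```

Keeping the groups as slices returns views rather than copies, which is convenient for channel-ablation experiments (for example, zeroing one group before feeding the image to a CNN).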
Energy (8 float)
- total_energy_ev, e_homo_ev, e_lumo_ev, e_g_ev (band gap)
- band_energy_ev, mermin_free_energy_ev, repulsive_energy_ev, fermi_level_ev

Charge (4 float)
- q_absmean, q_maxabs, q_std, total_charge

Electronic (10 mixed)
- i_ev, a_ev, chi_ev, mu_ev, eta_ev (float)
- n_levels, max_occupancy (float)
- softness_evinv, electrophilicity_ev (float)
- metal_like (bool: 0/1) 🔵
- no_virtual_in_basis (bool: 0/1) 🔵

Dipole (4 float)
- dipole_mag_d, dipole_x_d, dipole_y_d, dipole_z_d

Geometric (1 float)
- distance_ang (bond length)

Convergence (7 mixed)
- geom_opt_step, scc_last_iter (float)
- scc_last_total_elec_eh, scc_last_diff_elec, scc_last_error (float)
- geom_converged, scc_converged (bool: 0/1) 🔵

System (3 mixed)
- n_atoms (float)
- system_id_guess (string)

Note:
- 🔵 = Boolean labels (0/1 values for classification)
- 3D coordinates are in the data['geometries'] array (data['geometry'] in per-sample files), element symbols in data['elements']
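
Because the labels are stored as one dict per sample, a training script usually needs them stacked into arrays, with float targets kept apart from the boolean ones. The following is a rough sketch under assumptions: `labels_to_array` is a hypothetical helper, missing values are assumed to appear as `None` or absent keys, and the toy dicts below are made-up stand-ins for `data['labels']`.

```python
import numpy as np

# Boolean labels as listed above; everything else numeric is treated as regression.
BOOLEAN_LABELS = ["metal_like", "no_virtual_in_basis", "geom_converged", "scc_converged"]


def labels_to_array(label_dicts, names):
    """Stack the requested labels into a [n_samples, n_labels] float array.

    Missing or None entries become NaN so they can be masked later.
    """
    out = np.full((len(label_dicts), len(names)), np.nan, dtype=np.float64)
    for i, labels in enumerate(label_dicts):
        for j, name in enumerate(names):
            value = labels.get(name)
            if value is not None:
                out[i, j] = float(value)
    return out


# Toy stand-in for data['labels']; the values are invented for illustration.
toy_labels = [
    {"e_g_ev": 0.0, "total_energy_ev": -90.5, "metal_like": 1},
    {"e_g_ev": 1.2, "total_energy_ev": None, "metal_like": 0},
]
regression = labels_to_array(toy_labels, ["e_g_ev", "total_energy_ev"])
bool_names = [n for n in BOOLEAN_LABELS if n in toy_labels[0]]
classification = labels_to_array(toy_labels, bool_names)
print(regression)
print(classification)
```

Using NaN for missing entries keeps the array rectangular and lets a masked loss skip them, instead of silently training on a placeholder value.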
```python
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class SimpleDataset(Dataset):
    """Loads one per-sample .npz file per item and returns (image, target)."""

    def __init__(self, data_dir, target='e_g_ev'):
        self.files = sorted(Path(data_dir).glob('*.npz'))
        self.target = target

    def __getitem__(self, idx):
        data = np.load(self.files[idx], allow_pickle=True)
        image = torch.from_numpy(data['image']).float()
        target = data['labels'].item()[self.target]
        # Missing labels (None) fall back to 0.0; booleans are cast to float
        target = 0.0 if target is None else float(target)
        return image, torch.tensor(target, dtype=torch.float32)

    def __len__(self):
        return len(self.files)


# Train on band gap prediction
dataset = SimpleDataset('processed_images', target='e_g_ev')
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```

```python
class HybridDataset(SimpleDataset):
    """Image plus geometric features; reuses SimpleDataset's __init__ and __len__."""

    def __getitem__(self, idx):
        data = np.load(self.files[idx], allow_pickle=True)

        # Image
        image = torch.from_numpy(data['image']).float()

        # Geometric features
        geom = data['geometry']
        bond_length = data['metadata'].item()['bond_length']
        geom_features = torch.tensor([
            bond_length,
            geom[0, 3] / geom[1, 3],  # population ratio
        ], dtype=torch.float32)

        # Target
        target = data['labels'].item()[self.target]
        return {'image': image, 'geom': geom_features}, target
```

```python
import pandas as pd

# Load labels
df = pd.read_csv('analysis/all_labels.csv')

# Load geometry
df_geom = pd.read_csv('analysis/geometry_data.csv')

# Merge
df_full = df.merge(df_geom, on='pair_name')

# Analyze
print(df_full[['pair_name', 'e_g_ev', 'bond_length_ang']].head())
```

```python
# Band gap regression
target = 'e_g_ev'  # Range: [0, 19.4] eV
# Use: Semiconductor applications

# Metal classification
target = 'metal_like'  # Binary: 0 or 1
# Distribution: 74% metal, 26% non-metal

# Total energy regression
target = 'total_energy_ev'  # Range: [-305.6, -2.3] eV
# Use: Thermodynamic stability

# Multi-task learning
targets = ['e_g_ev', 'total_energy_ev', 'dipole_mag_d', 'metal_like']
# Predict multiple properties at once
```

| Property | Count | Mean | Std | Range |
|---|---|---|---|---|
| Samples | 2850 | - | - | - |
| Band Gap (eV) | 2850 | 0.47 | 1.19 | [0.0, 19.4] |
| Total Energy (eV) | 2850 | -90.5 | 48.1 | [-305.6, -2.3] |
| Bond Length (Å) | 2850 | 2.58 | 0.66 | [0.7, 5.6] |
| Metal Systems | 2850 | 74% | - | - |
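
The statistics above suggest two common preprocessing steps: standardizing regression targets, whose scales differ by two orders of magnitude, and reweighting the imbalanced metal/non-metal classes. The sketch below only illustrates the arithmetic. `STATS`, `standardize`, and `unstandardize` are hypothetical names, and mapping the "Bond Length" row to the `distance_ang` label is an assumption based on the label list.

```python
# Summary statistics copied from the table above.
STATS = {
    "e_g_ev": {"mean": 0.47, "std": 1.19},
    "total_energy_ev": {"mean": -90.5, "std": 48.1},
    "distance_ang": {"mean": 2.58, "std": 0.66},
}
METAL_FRACTION = 0.74  # share of samples with metal_like == 1


def standardize(value, name):
    """Map a raw target to zero mean and unit variance."""
    s = STATS[name]
    return (value - s["mean"]) / s["std"]


def unstandardize(z, name):
    """Invert standardize() to report predictions in physical units."""
    s = STATS[name]
    return z * s["std"] + s["mean"]


# Weight for the positive (metal) class: n_negative / n_positive.
pos_weight = (1.0 - METAL_FRACTION) / METAL_FRACTION

z = standardize(19.4, "e_g_ev")  # the largest band gap in the dataset
print(round(z, 2), round(unstandardize(z, "e_g_ev"), 2))
print(round(pos_weight, 3))
```

In practice the mean and standard deviation would be recomputed on the training split only, to avoid leaking test statistics. A weight like `pos_weight` can be passed as a tensor to `torch.nn.BCEWithLogitsLoss(pos_weight=...)`.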
```bash
# Default paths
python build_dataset.py

# Custom input and output paths
python build_dataset.py /path/to/raw_data /path/to/output_dir
```

- ✅ Parses detailed.out → orbital populations
- ✅ Parses geo_end.xyz → 3D coordinates
- ✅ Creates 10-channel images → [10, 32, 32] tensors
- ✅ Integrates CSV labels → 37 quantum properties
- ✅ Saves to dataset_combined.npz → single file (31.9 MB)
- ✅ Creates analysis/ folder → CSVs & summaries

Processing time: ~2 minutes for 2850 samples
Output: One file with everything, easy to distribute!
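
After a build, a quick structural check of the output file can catch truncated or mismatched arrays before training. The sketch below is an illustration only and is not the repository's check_npz.py: `inspect_combined` is a hypothetical helper, the expected keys are taken from the file layout described in this README, and a tiny synthetic file stands in for dataset_combined.npz.

```python
import os
import tempfile

import numpy as np

# Array names as described in this README's file layout.
EXPECTED_KEYS = {"images", "geometries", "elements", "labels", "metadata", "pair_names"}


def inspect_combined(path):
    """Return {key: shape} after checking keys and a consistent sample count."""
    with np.load(path, allow_pickle=True) as data:
        missing = EXPECTED_KEYS - set(data.files)
        if missing:
            raise KeyError(f"missing arrays: {sorted(missing)}")
        shapes = {key: data[key].shape for key in data.files}
    n_samples = shapes["images"][0]
    for key, shape in shapes.items():
        if shape[0] != n_samples:
            raise ValueError(f"{key} has {shape[0]} entries, expected {n_samples}")
    return shapes


# Synthetic two-sample file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_combined.npz")
    np.savez(
        path,
        images=np.zeros((2, 10, 32, 32), dtype=np.float32),
        geometries=np.zeros((2, 2, 4), dtype=np.float32),
        elements=np.array([["Ag", "Al"], ["Ag", "Au"]]),
        labels=np.array([{"e_g_ev": 0.0}, {"e_g_ev": 0.1}], dtype=object),
        metadata=np.array([{}, {}], dtype=object),
        pair_names=np.array(["Ag_Al", "Ag_Au"]),
    )
    shapes = inspect_combined(path)

for key, shape in shapes.items():
    print(key, shape)
```

Pointing `inspect_combined` at the real file should report 2850 as the leading dimension of every array if the build completed.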
```
.
├── README.md                          ← YOU ARE HERE
├── build_dataset.py                   ← Build everything
├── pytorch_dataset.py                 ← PyTorch loader
├── pytorch_geometric_dataset.py       ← PyTorch Geometric loader
├── check_npz.py                       ← Inspect data
│
├── dataset_combined.npz               ← Main dataset (31.9 MB, 2850 samples) ⭐
│
├── raw_data/                          ← Your input data
│   ├── Ag_Al/detailed.out + geo_end.xyz
│   ├── dftb_ptbp_combined.csv
│   └── bond_distances_all.csv
│
└── analysis/                          ← Analysis files
    ├── all_labels.csv
    ├── geometry_data.csv
    └── labels_detailed_summary.txt
```

```bibtex
@dataset{twobody2026,
  title={Two-Body Quantum System Image Dataset},
  year={2026},
  samples={2850},
  image_channels={10},
  labels={37}
}
```

- ✅ All 2850 samples processed successfully
- ✅ All labels integrated and verified
- ✅ Bond lengths validated (CSV vs XYZ match)
- ✅ No missing critical data
- ✅ Ready for training
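
The bond-length validation mentioned in the checklist can be reproduced from the stored geometry. This is a sketch under assumptions rather than the project's validation code: it assumes columns 0-2 of each `[2, 4]` geometry array hold x, y, z in ångström and column 3 holds the electron population, and `bond_length_from_geometry` and `matches_label` are hypothetical helpers.

```python
import numpy as np


def bond_length_from_geometry(geometry):
    """Euclidean distance between the two atoms of a [2, 4] geometry array.

    Assumes columns 0-2 are x, y, z and column 3 is a per-atom quantity
    (such as the electron population) that does not enter the distance.
    """
    geometry = np.asarray(geometry, dtype=np.float64)
    if geometry.shape != (2, 4):
        raise ValueError(f"expected shape (2, 4), got {geometry.shape}")
    return float(np.linalg.norm(geometry[0, :3] - geometry[1, :3]))


def matches_label(geometry, distance_ang, tol=1e-3):
    """True if the geometric distance agrees with the distance_ang label."""
    return abs(bond_length_from_geometry(geometry) - distance_ang) <= tol


# Made-up coordinates for illustration only.
toy_geometry = [[0.0, 0.0, 0.0, 11.07], [0.0, 0.0, 2.5, 2.93]]
print(bond_length_from_geometry(toy_geometry))
print(matches_label(toy_geometry, 2.5))
```

Running `matches_label` over every sample with its `distance_ang` label gives a cheap consistency check between the CSV and XYZ sources.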
Energy labels:
- e_g_ev: HOMO-LUMO gap (band gap) - KEY TARGET
- total_energy_ev: Total system energy - KEY TARGET
- e_homo_ev: Highest occupied molecular orbital energy
- e_lumo_ev: Lowest unoccupied molecular orbital energy

Boolean labels (🔵):
- metal_like 🔵: Binary metal/non-metal (0=non-metal, 1=metal)
- geom_converged 🔵: Geometry convergence flag (always 1)
- scc_converged 🔵: SCC convergence flag (always 1)
- no_virtual_in_basis 🔵: Virtual orbitals flag

All numeric labels can be used as regression targets. See analysis/all_labels.csv for the complete list.
Check the comprehensive summary to verify all labels:

```bash
cat analysis/labels_detailed_summary.txt
```

This file shows:
- ✅ Coverage for all 48 labels
- ✅ Mean, std, min, max, median for each numeric label
- ✅ Distribution for categorical labels
- ✅ Notes on empty labels

All labels are lowercase with underscores (e.g., e_g_ev, total_energy_ev, distance_ang)
Note: Geometry coordinates (x, y, z) are in the data['geometries'] array (data['geometry'] in per-sample files), NOT in labels.
The dataset is fully compatible with PyTorch Geometric.
Each two-body system is a graph with:
- 2 nodes (atoms)
- 1 edge (chemical bond)
- Node features: Element one-hot + electron population
- Edge features: Pooled image channels (10D) + bond vector (4D) = 14D
- 3D positions: Atomic coordinates
- Target: Any of the 37 labels
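
The graph layout listed above can be sketched with plain NumPy arrays. This is not the implementation in pytorch_geometric_dataset.py. `build_two_body_graph` is a hypothetical helper, and the mean pooling, the 4D bond feature order (length, then dx, dy, dz), and the two directed edges for the single bond are all assumptions made for illustration.

```python
import numpy as np


def build_two_body_graph(image, geometry, z_index, n_elements):
    """Assemble node, edge, and position arrays for one two-atom system.

    image:    [10, H, W] array; each channel is mean-pooled to one number.
    geometry: [2, 4] array; columns 0-2 are x, y, z, column 3 the population.
    z_index:  two integer element indices used for the one-hot encoding.
    """
    image = np.asarray(image, dtype=np.float32)
    geometry = np.asarray(geometry, dtype=np.float32)

    # Node features: element one-hot + electron population.
    one_hot = np.zeros((2, n_elements), dtype=np.float32)
    one_hot[np.arange(2), z_index] = 1.0
    x = np.concatenate([one_hot, geometry[:, 3:4]], axis=1)

    # Edge features: 10 pooled channels + bond length + bond vector = 14D.
    pos = geometry[:, :3]
    vec = pos[1] - pos[0]
    length = np.linalg.norm(vec)
    pooled = image.mean(axis=(1, 2))
    forward = np.concatenate([pooled, [length], vec])
    backward = np.concatenate([pooled, [length], -vec])

    # The single bond is stored as two directed edges (0->1 and 1->0).
    edge_index = np.array([[0, 1], [1, 0]], dtype=np.int64)
    edge_attr = np.stack([forward, backward]).astype(np.float32)
    return {"x": x, "edge_index": edge_index, "edge_attr": edge_attr, "pos": pos}


rng = np.random.default_rng(0)
graph = build_two_body_graph(
    image=rng.random((10, 32, 32)),
    geometry=[[0.0, 0.0, 0.0, 11.07], [0.0, 0.0, 2.5, 2.93]],  # made-up coordinates
    z_index=[0, 1],
    n_elements=4,  # toy vocabulary size
)
for key, value in graph.items():
    print(key, value.shape)
```

These arrays map directly onto the `x`, `edge_index`, `edge_attr`, and `pos` fields of a `torch_geometric.data.Data` object after conversion with `torch.from_numpy`.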

- ✅ Compare image vs graph approaches for the same data
- ✅ Hybrid models: GNN + image features
- ✅ Use 3D geometry with SchNet, DimeNet, GemNet
- ✅ Message passing between atoms
- ✅ Benchmark GNNs against CNNs/ViTs

```
Two-Body System (e.g., Ag-Al):
  Node 0 (Ag): [one-hot Ag, population=11.07]
  Node 1 (Al): [one-hot Al, population=2.93]
  Edge 0↔1:    [10 image channels (pooled), bond_length, bond_vector]
```

See pytorch_geometric_dataset.py for full implementation!
Recommended release package:
```
TwoBody-CVPR2026/
├── dataset_combined.npz (32 MB) ⭐
├── pytorch_dataset.py
├── pytorch_geometric_dataset.py
├── README.md
├── LICENSE
└── analysis/
    ├── all_labels.csv
    ├── geometry_data.csv
    └── labels_detailed_summary.txt
```

Total size: ~35 MB
Upload to: Zenodo (get DOI), Hugging Face, or GitHub Release
See RELEASE_GUIDE.md for detailed recommendations.
Questions? Check analysis/labels_detailed_summary.txt for complete label statistics.