phyload

Utilities for loading multi-physics, spatio-temporal trajectory data exported by PHYGEN. phyload discovers tensor fields, exposes them through a flexible PyTorch Dataset, and provides ready-to-use DataLoader factories that respect heterogeneous grids, windowing, and metadata.

Highlights

Selective I/O – read only the requested time window and spatial stride from disk, keeping data transfers minimal even for very long trajectories.
Flexible windowing – generate fixed-length sub-trajectories with optional overlap, temporal stride, and spatial subsampling without touching the source files.
Rich metadata – access collated boundary conditions, scalar parameters, time grids, and spatial meshes alongside the trajectory tensor.
Channel management – channel-last layout (T, *space_dims, C) with stable channel names and helpers for z-score or RMS normalisation / de-normalisation.
Multi-dataset orchestration – combine multiple datasets with aligned channels using MultiDatasetCollection, including padded variants for mixed shapes.

Installation

Clone the repository and install in editable mode:

git clone https://github.com/your-org/PHYLOAD.git
cd PHYLOAD
pip install -e .

Quick start

from pathlib import Path
import yaml

from phyload import init_datasets, init_dataloaders

# Edit configs/main.yaml to point to your dataset root, target dataset,
# and desired loader options, then load it here.
cfg_path = Path("configs/main.yaml")
cfg = yaml.safe_load(cfg_path.read_text())

datasets = init_datasets(cfg["data"])
train_loader, val_loader, test_loader = init_dataloaders(datasets, cfg["loaders"])

batch = next(iter(train_loader))
print(batch["trajectory"].shape)  # (B, T, X[, Y[, Z]], C)
print(batch["time_grid"].shape)   # (B, T)
print(batch["space_grid"].shape)  # (B, T, n_dims, ...)
print(batch["index"])             # provenance info

# Optional metadata
print(batch.get("ic"))            # (B, C, X[, Y[, Z]]) initial condition if available
print(batch.get("params"))        # dict of scalar parameters per sample
print(batch.get("bc"))            # boundary-condition tensors/masks

# De-normalise a channel (when normalisation is enabled)
chan_name = datasets["train"].channel_names[0]
restored = datasets["train"].denormalize_channel(batch["trajectory"][0, :, ..., 0], chan_name)

Configuration essentials

All user-facing options are supplied through YAML configuration files:

configs/main.yaml: single-dataset training/validation/test.
configs/main_multi.yaml: multi-dataset (multi-physics) experiments.

Data section (`data`)

Key hyper-parameters controlling how trajectories are processed:

root, dataset_name: locate the source HDF5 files (data/<dataset_name>/*.hdf5).
time_range: (start, stop) slice applied before any subsampling.
num_steps: length of the temporal window returned by each sample.
window_stride: offset between successive windows (defaults to num_steps for non-overlapping windows).
time_stride / dt: temporal subsampling factors (integers or fractional).
dx: spatial subsampling per dimension (scalar or list/tuple).
return_ic, return_bc, return_params: request initial conditions, boundary conditions, and scalar parameters when available.
use_normalization, normalization_type: apply per-channel statistics (zscore or rms) stored in stats.yaml.
ntrain, nval, ntest: optional cap on the number of raw trajectories per split before windowing/subsampling.

Loader section (`loaders`)

Define how batches are constructed:

batch_size, val_batch_size, test_batch_size
shuffle_train, shuffle_val, shuffle_test
num_workers, pin_memory, prefetch_factor, persistent_workers
drop_last (for incomplete batches)
multi_gpu: toggle DistributedSampler for multi-GPU training.

Pass the full configuration to init_datasets / init_dataloaders as shown in the quick-start snippet.

Multi-physics workflows

configs/main_multi.yaml lets you load several datasets simultaneously. Each entry under datasets: describes a dataset alias with its own data configuration.

Supported combination modes:

Homogeneous: all datasets share the same channel layout and compatible spatial shapes. init_dataloaders returns HomogeneousCombinedLoader, delivering batched data per alias in sync.
Mixed (heterogeneous): datasets differ in shape or channel structure. MultiDatasetCollection keeps them separate; you can iterate per-alias batches or wrap them with PaddedTrajectoryDataset if you need aligned shapes.

Choose the mode via the fusion_mode key in main_multi.yaml, and use the same loader configuration (loaders section) to build samplers for all participating datasets.

Synchronized dataset choice (DDP): when using the homogeneous fusion mode, set loaders.sync_dataset_per_step: true (and optionally loaders.dataset_choice_seed) to force every distributed rank to draw the batch for a given step from the same dataset. This keeps optimizer updates dataset-homogeneous even across GPUs/nodes.

Channel metadata remains accessible through each sub-dataset, so downstream models can map component names consistently across physics domains.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
configs		configs
notebooks		notebooks
src/phyload		src/phyload
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

phyload

Highlights

Installation

Quick start

Configuration essentials

Data section (`data`)

Loader section (`loaders`)

Multi-physics workflows

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

phyload

Highlights

Installation

Quick start

Configuration essentials

Data section (data)

Loader section (loaders)

Multi-physics workflows

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Data section (`data`)

Loader section (`loaders`)

Packages