molsemble is planned as a Python/CLI package for explicit, restartable molecular ensemble workflows that preserve computational chemistry artifacts and export structured computational-result and provenance tables for ML-oriented chemistry work.
molsemble is currently pre-alpha and design-controlled. This repository contains project-control documents, a minimal Python package skeleton, the first backend-neutral core record/vocabulary layer, an initial config-loading layer, an initial filesystem run-storage foundation, a declarative backend-operation contract with fake backend capability, target-scope declarations, and product-output declarations, plus a first workflow-planning, dependency-resolution, target-catalog, product-availability, and input-readiness frontier foundation.
No real workflow execution functionality is implemented yet. On the input and storage side, the package can load workflow YAML and systems CSV/TSV inputs into validated records, create or open basic run directories with manifests and JSONL registries, and validate stage records against declarative backend operation contracts. On the planning side, it can declare logical operation target scopes and product-output scopes/multiplicities, build in-memory workflow plans with resolved stage contracts, resolve stage-level product dependencies with strict or pipeline policies, build traversal-neutral target catalogs over known entries, systems, and conformers, query available product instances for exact entry/system/conformer targets, and compute an input-readiness frontier for stage-target pairs. It does not yet generate conformers, parse SMILES chemically, run ORCA, expand operation target scopes into jobs, create attempt directories, resume calculations, execute fake backend jobs, register backend artifacts, resolve exact product-instance dependencies, decide reuse or output completion, or export result tables.
Use Python 3.11 or newer. Because the project uses a src/ layout, install the
package in editable mode with the dev extra before running validation commands.
python -m pip install -e '.[dev]'
Validation commands for the current implemented layers should run through the same Python environment used for installation:
python -c "import molsemble; import molsemble.core; import molsemble.config; import molsemble.storage; import molsemble.workflow; import molsemble.backends; import molsemble.workflow.planner; import molsemble.workflow.dependencies; import molsemble.workflow.targets; import molsemble.workflow.availability; import molsemble.workflow.readiness"
python -m pytest
python -m ruff check src tests
python -m pip check
The current package includes molsemble.core, which provides backend-neutral
Pydantic record models, status/product/artifact vocabulary, ID helpers, and
deterministic fingerprint helpers. These records are infrastructure for future
workflow/config/storage/backend work; they do not execute chemistry workflows by
themselves.
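To illustrate what a deterministic fingerprint helper can look like, here is a minimal sketch that hashes a canonical JSON encoding. The function name `fingerprint`, the canonical-JSON approach, and the choice of SHA-256 are assumptions for illustration; they are not taken from molsemble.core.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Hypothetical helper: hash a canonical JSON encoding of ``payload``.

    Sorting keys and fixing separators makes the encoding independent of
    dict insertion order, so equal payloads always hash to the same value.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this approach, two payloads that differ only in key order produce the same fingerprint, while any change to a value produces a different one.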
The current package includes molsemble.config, which can load user-authored
workflow YAML and systems CSV/TSV inputs into validated config/core records.
Workflow IDs and stage IDs are local provenance anchors; workflow IDs default to
workflow_001, and omitted stage IDs are generated deterministically as
stage_001, stage_002, and so on.
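The ID defaults described above can be sketched as follows. This is only an illustration: the function name `assign_stage_ids`, the dict-based stage representation, and the choice to number by 1-based position in the stage list are assumptions, not the config layer's actual implementation.

```python
def assign_stage_ids(stages: list[dict]) -> list[dict]:
    """Hypothetical helper: fill omitted stage IDs deterministically.

    Assumption: the generated number is the stage's 1-based position in the
    list, so the same input always yields the same IDs.
    """
    result = []
    for position, stage in enumerate(stages, start=1):
        stage_id = stage.get("id") or f"stage_{position:03d}"
        result.append({**stage, "id": stage_id})
    return result
```

Because the generated ID depends only on the stage's position, reloading the same workflow file yields the same provenance anchors.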
Systems tables currently require explicit entry_id, system_id, smiles,
charge, and multiplicity columns. The config layer validates table shape,
required fields, duplicate IDs, simple scalar types, and known product
vocabulary, but it does not parse SMILES chemically or validate backend
capabilities.
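A minimal standard-library sketch of these checks is shown below. The function name `load_systems`, the error messages, and the assumption that system_id values must be unique across the whole table are illustrative only; the real loader produces validated config/core records rather than plain dicts.

```python
import csv
import io

REQUIRED_COLUMNS = ("entry_id", "system_id", "smiles", "charge", "multiplicity")


def load_systems(text: str, delimiter: str = ",") -> list[dict]:
    """Hypothetical loader: check table shape, required fields, and duplicates."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows: list[dict] = []
    seen_system_ids: set[str] = set()
    for row in reader:
        empty = [n for n in REQUIRED_COLUMNS if not (row[n] or "").strip()]
        if empty:
            raise ValueError(f"line {reader.line_num}: empty fields {empty}")
        if row["system_id"] in seen_system_ids:
            raise ValueError(f"line {reader.line_num}: duplicate system_id")
        seen_system_ids.add(row["system_id"])
        rows.append(
            {
                "entry_id": row["entry_id"],
                "system_id": row["system_id"],
                "smiles": row["smiles"],  # kept as an opaque string, not parsed
                "charge": int(row["charge"]),  # ValueError if not an integer
                "multiplicity": int(row["multiplicity"]),
            }
        )
    return rows
```

Passing `delimiter="\t"` handles TSV input with the same checks.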
The current package includes molsemble.storage, which can create and open a
basic filesystem run directory. The initial storage layer writes top-level
run.json, workflow.json, and status.json manifests plus global JSONL
registries under registry/ for entries, systems, conformers, stages, jobs,
attempts, products, artifacts, and validations.
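The layout described above can be sketched with the standard library. The function names `create_run_dir` and `append_record`, the `<name>.jsonl` registry file names, and the manifest contents are assumptions for illustration; they do not reproduce the RunStore API or its manifest schema.

```python
import json
from pathlib import Path

REGISTRIES = (
    "entries", "systems", "conformers", "stages", "jobs",
    "attempts", "products", "artifacts", "validations",
)


def create_run_dir(root: Path, run_id: str) -> Path:
    """Hypothetical helper: create manifests plus empty JSONL registries."""
    run_dir = root / run_id
    (run_dir / "registry").mkdir(parents=True, exist_ok=False)
    manifests = {
        "run.json": {"run_id": run_id},        # placeholder content
        "workflow.json": {},                   # placeholder content
        "status.json": {"status": "created"},  # placeholder content
    }
    for name, payload in manifests.items():
        text = json.dumps(payload, indent=2) + "\n"
        (run_dir / name).write_text(text, encoding="utf-8")
    for registry in REGISTRIES:
        (run_dir / "registry" / f"{registry}.jsonl").touch()
    return run_dir


def append_record(run_dir: Path, registry: str, record: dict) -> None:
    """Append one JSON object per line to the named registry."""
    path = run_dir / "registry" / f"{registry}.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Append-only JSONL keeps each registry readable line by line and avoids rewriting earlier records when new ones are added.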
A run is one concrete execution/storage instance, normally represented by one
run directory. RunStore is a storage facade for that directory; it is not a
workflow runner or executor. Resume/retry behavior will later reopen the same
run directory and add new attempts for retried jobs, but TASK-004 does not yet
implement job planning, attempt directories, backend execution, artifact
registration, or restart policy.
The current package includes molsemble.workflow.StageKind and
molsemble.backends. The implemented StageKind vocabulary is intentionally
small: conformer_generation, geometry_optimization, and
single_point_calculation.
BackendOperation is a declarative operation contract: it records the backend
name, operation name, StageKind, logical job target scope, consumed product
types, product output specs, setting-name categories, expected artifacts, and
JSON-compatible metadata. Product outputs are declared with ProductOutputSpec
using product type, product scope, and OutputMultiplicity; produces is a
derived product-type view over those output specs. The in-memory
BackendOperationRegistry registers these contracts, supports lookup/filtering,
and can validate that a StageRecord refers to a known backend operation with a
matching StageKind and explicit product contract. Explicit product contracts
are treated as semantic sets, while effective contracts are normalized to the
backend operation's declared order. The required target_scope field uses the
shared ProductScope vocabulary and declares the record/object scope targeted
by one future logical job.
The fake backend currently provides declarations only:
fake.generate_conformers targeting systems, plus fake.optimize and
fake.single_point targeting conformers. These declarations include output
product scopes and multiplicities, including one-or-more conformer-level
geometry products for conformer generation and one job-level raw output bundle
for fake optimization/single-point operations. The fake backend does not yet
execute fake jobs, create products/artifacts, simulate failures, or test restart
behavior.
The current package includes molsemble.workflow.planner, which can resolve a
WorkflowRecord plus StageRecord objects against a
BackendOperationRegistry. build_workflow_plan(...) returns an in-memory
WorkflowPlan with behavior-light ResolvedStage records containing the
original stage, matched backend operation, and effective consumed/produced
ProductTypes.
The planner canonicalizes stage order by StageRecord.index, validates stage
identity and index consistency, checks backend-operation compatibility, fills
omitted stage product contracts from backend operation declarations, and performs
a coarse upstream ProductType availability check.
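As a rough sketch of the ordering and coarse availability steps, the example below sorts stages by index and checks that every consumed product type is produced by some earlier stage. The `Stage` dataclass and the function name `check_upstream_availability` are hypothetical stand-ins, not the planner's ResolvedStage or build_workflow_plan(...).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a resolved stage."""
    stage_id: str
    index: int
    consumes: tuple[str, ...] = ()
    produces: tuple[str, ...] = ()


def check_upstream_availability(stages: list[Stage]) -> list[Stage]:
    """Order stages by index and verify each input has an earlier producer."""
    ordered = sorted(stages, key=lambda stage: stage.index)
    produced_so_far: set[str] = set()
    for stage in ordered:
        missing = [p for p in stage.consumes if p not in produced_so_far]
        if missing:
            raise ValueError(f"{stage.stage_id}: no upstream producer for {missing}")
        produced_so_far.update(stage.produces)
    return ordered
```

The check is coarse in the same sense as the text: it only asks whether a product type appears upstream at all, not which producer or which product instance will be used.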
The current package also includes molsemble.workflow.dependencies, which
resolves stage-level product dependencies from a WorkflowPlan. STRICT mode
requires each consumed ProductType to have exactly one earlier producer.
PIPELINE mode treats the ordered stage list as a linear refinement pipeline and
selects the latest earlier producer while preserving all candidate producer
stages for inspection.
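The difference between the two policies can be sketched for a single consumed product type. The `Policy` enum and `resolve_producer` function are hypothetical names, and stage IDs are plain strings here; the sketch also assumes that a consumed type with no earlier producer is an error under both policies.

```python
from enum import Enum


class Policy(Enum):
    STRICT = "strict"
    PIPELINE = "pipeline"


def resolve_producer(
    earlier_producers: list[str], policy: Policy
) -> tuple[str, list[str]]:
    """Pick the producer stage for one consumed product type.

    ``earlier_producers`` lists, in stage order, the IDs of earlier stages
    that produce the type. Returns (selected, all_candidates).
    """
    if not earlier_producers:
        raise ValueError("no earlier producer")  # assumed to fail in both modes
    if policy is Policy.STRICT and len(earlier_producers) != 1:
        raise ValueError(f"ambiguous producers: {earlier_producers}")
    # STRICT has exactly one candidate; PIPELINE takes the latest one.
    return earlier_producers[-1], list(earlier_producers)
```

Returning the full candidate list alongside the selection mirrors the idea of keeping all candidate producer stages available for inspection.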
The current package also includes molsemble.workflow.targets, which builds a
traversal-neutral TargetCatalog over known Entry/System/Conformer records.
ExecutionTarget records preserve scope, target ID, and resolved parent context;
the catalog supports grouping queries by scope, entry, and system. It is a
validated target view, not a scheduler or storage layer.
The current package also includes molsemble.workflow.availability, which builds
an immutable ProductAvailabilityIndex over known ProductRecord instances and
a TargetCatalog. The index answers exact target-aware availability questions
for entry-, system-, and conformer-scoped products and can filter by
ProductType and producer stage ID. It counts only products marked available as
available and ignores unsupported scopes such as job-scoped raw output bundles.
It is a query layer, not a dependency selector, readiness calculator, scheduler,
storage reader, or executor.
The current package also includes molsemble.workflow.readiness, which builds
an immutable WorkflowReadinessFrontier from a WorkflowDependencyPlan and a
ProductAvailabilityIndex. The frontier contains one StageTargetReadiness
record for each currently known stage-target pair covered by each stage's
operation target scope. Consumed inputs are represented as
InputProductRequirement records that preserve the selected producer stage,
candidate producer stages, and matching available products. This is input
readiness only: no-input stages are ready, consuming stages are ready only when
every required ProductType has at least one matching available product, and
blocked stage-target pairs report missing ProductTypes. The frontier also offers
query helpers for ready/blocked records, stages, and targets.
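The readiness rule can be sketched as a small function over stage-target pairs. The function name `readiness_frontier` and the data shapes are assumptions, and the sketch simplifies by pairing every stage with every target; the real frontier only covers targets within each stage's operation target scope and records producer information as well.

```python
def readiness_frontier(
    stage_requirements: dict[str, list[str]],
    targets: list[str],
    available: set[tuple[str, str]],
) -> dict[tuple[str, str], list[str]]:
    """Map each (stage_id, target_id) pair to its missing product types.

    ``available`` holds (target_id, product_type) pairs that have at least
    one matching available product. An empty list means the pair is ready.
    """
    frontier: dict[tuple[str, str], list[str]] = {}
    for stage_id, required in stage_requirements.items():
        for target_id in targets:
            missing = [t for t in required if (target_id, t) not in available]
            frontier[(stage_id, target_id)] = missing
    return frontier


def ready_pairs(frontier: dict[tuple[str, str], list[str]]) -> list[tuple[str, str]]:
    """Query helper: pairs with no missing inputs."""
    return [pair for pair, missing in frontier.items() if not missing]
```

A stage with no required inputs gets an empty missing list for every target, so it is always ready, which matches the rule for no-input stages.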
The workflow layer does not yet expand jobs, create JobRecord instances,
choose one exact input product among multiple matching products, decide reuse or
output-completion suppression, create attempts, mutate RunStore, execute
backends, or provide explicit YAML input-selector syntax.
The first alpha is intended to cover CSV + SMILES input, YAML workflow configuration, RDKit conformer generation, ORCA geometry optimization, ORCA single-point calculation stages, filesystem provenance, restart/resume, and structured computational-product export.
The first alpha excludes SDF input, general XYZ/OpenXYZ input, xTB, CREST,
Multiwfn, ORCA frequencies, .orcacosmo generation, Gaussian, Psi4, PySCF,
native Slurm execution, database storage, dedicated molsemble-level pruning,
connectivity/stereochemistry validation, and user-facing derived/follow-up runs.
molsemble is licensed under LGPL-3.0-or-later. The repository includes the LGPL
notice in LICENSE.md and the canonical GPL/LGPL license texts in COPYING and
COPYING.LESSER.