molsemble

molsemble is planned as a Python/CLI package for explicit, restartable molecular ensemble workflows that preserve computational chemistry artifacts and export structured computational-result and provenance tables for ML-oriented chemistry work.

Status

molsemble is currently pre-alpha and design-controlled. This repository contains project-control documents, a minimal Python package skeleton, and the following foundation layers: a backend-neutral core record/vocabulary layer; an initial config-loading layer; an initial filesystem run-storage foundation; a declarative backend-operation contract with fake backend declarations, target-scope declarations, and product-output declarations; and a first workflow layer covering planning, dependency resolution, target catalogs, product availability, and an input-readiness frontier.

No real workflow execution functionality is implemented yet. The package can load workflow YAML and systems CSV/TSV inputs into validated records; create and open basic run directories with manifests and JSONL registries; validate stage records against declarative backend operation contracts; declare logical operation target scopes and product-output scopes and multiplicities; build in-memory workflow plans with resolved stage contracts; resolve stage-level product dependencies under strict or pipeline policies; build traversal-neutral target catalogs over known entries, systems, and conformers; query available product instances for exact entry, system, or conformer targets; and compute an input-readiness frontier for stage-target pairs.

It does not yet generate conformers, parse SMILES chemically, run ORCA, expand operation target scopes into jobs, create attempt directories, resume calculations, execute fake backend jobs, register backend artifacts, resolve exact product-instance dependencies, decide reuse or output completion, or export result tables.

Development Setup

Use Python 3.11 or newer. Because the project uses a src/ layout, install the package in editable mode with the dev extra before running validation commands.

python -m pip install -e '.[dev]'

Run the validation commands for the currently implemented layers in the same Python environment used for installation:

python -c "import molsemble; import molsemble.core; import molsemble.config; import molsemble.storage; import molsemble.workflow; import molsemble.backends; import molsemble.workflow.planner; import molsemble.workflow.dependencies; import molsemble.workflow.targets; import molsemble.workflow.availability; import molsemble.workflow.readiness"
python -m pytest
python -m ruff check src tests
python -m pip check

Implemented Core Layer

The current package includes molsemble.core, which provides backend-neutral Pydantic record models, status/product/artifact vocabulary, ID helpers, and deterministic fingerprint helpers. These records are infrastructure for future workflow/config/storage/backend work; they do not execute chemistry workflows by themselves.
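
To illustrate what a deterministic fingerprint helper can look like, here is a minimal sketch that hashes a JSON-compatible mapping in a canonical form. It does not use molsemble.core; the function name `fingerprint`, the choice of SHA-256, and the example field values are assumptions made for illustration only.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Return a hex digest that does not depend on key order (hypothetical helper)."""
    # Canonical form: sorted keys and fixed separators, so equal mappings
    # always serialize to the same byte string.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = fingerprint({"smiles": "CCO", "charge": 0, "multiplicity": 1})
b = fingerprint({"multiplicity": 1, "charge": 0, "smiles": "CCO"})
print(a == b)  # True: the two mappings differ only in key order
```

Serializing with sorted keys before hashing is what makes the result reproducible across runs and across dictionaries built in a different order.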

Implemented Config Layer

The current package includes molsemble.config, which can load user-authored workflow YAML and systems CSV/TSV inputs into validated config/core records. Workflow IDs and stage IDs are local provenance anchors; workflow IDs default to workflow_001, and omitted stage IDs are generated deterministically as stage_001, stage_002, and so on.

Systems tables currently require explicit entry_id, system_id, smiles, charge, and multiplicity columns. The config layer validates table shape, required fields, duplicate IDs, simple scalar types, and known product vocabulary, but it does not parse SMILES chemically or validate backend capabilities.
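
The checks described above can be sketched with the standard library alone. This is not the molsemble.config implementation: the function name `load_systems`, the error messages, and the choice to enforce uniqueness on system_id are assumptions. As in the text, SMILES strings are kept as opaque text and are not parsed.

```python
import csv
import io

REQUIRED = ("entry_id", "system_id", "smiles", "charge", "multiplicity")


def load_systems(text: str, delimiter: str = ",") -> list[dict]:
    """Validate table shape, required fields, duplicate IDs, and integer columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED:
            if not (row[col] or "").strip():
                raise ValueError(f"line {line_no}: empty value for {col}")
        system_id = row["system_id"].strip()
        if system_id in seen:  # assumption: system_id must be unique
            raise ValueError(f"line {line_no}: duplicate system_id {system_id}")
        seen.add(system_id)
        try:
            charge = int(row["charge"])
            multiplicity = int(row["multiplicity"])
        except ValueError as exc:
            raise ValueError(
                f"line {line_no}: charge and multiplicity must be integers"
            ) from exc
        rows.append({
            "entry_id": row["entry_id"].strip(),
            "system_id": system_id,
            "smiles": row["smiles"].strip(),  # kept as text, not parsed
            "charge": charge,
            "multiplicity": multiplicity,
        })
    return rows


table = "entry_id,system_id,smiles,charge,multiplicity\nentry_1,sys_1,CCO,0,1\n"
systems = load_systems(table)
print(systems[0]["system_id"], systems[0]["charge"])
```

For a TSV file, the same sketch would be called with `delimiter="\t"`.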

Implemented Storage Foundation

The current package includes molsemble.storage, which can create and open a basic filesystem run directory. The initial storage layer writes top-level run.json, workflow.json, and status.json manifests plus global JSONL registries under registry/ for entries, systems, conformers, stages, jobs, attempts, products, artifacts, and validations.
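
The layout described above can be sketched as follows. This is an illustration, not the RunStore API: the helper names `create_run_dir` and `append_record`, the `<name>.jsonl` file naming, and the manifest contents are all assumptions.

```python
import json
import tempfile
from pathlib import Path

REGISTRIES = (
    "entries", "systems", "conformers", "stages", "jobs",
    "attempts", "products", "artifacts", "validations",
)


def create_run_dir(root: Path, run_id: str) -> Path:
    """Create a run directory with three manifests and empty JSONL registries."""
    run_dir = root / run_id
    (run_dir / "registry").mkdir(parents=True, exist_ok=False)
    manifests = {
        "run.json": {"run_id": run_id},       # placeholder contents
        "workflow.json": {},                  # placeholder contents
        "status.json": {"status": "created"}, # placeholder contents
    }
    for name, payload in manifests.items():
        (run_dir / name).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    for name in REGISTRIES:
        (run_dir / "registry" / f"{name}.jsonl").touch()
    return run_dir


def append_record(run_dir: Path, registry: str, record: dict) -> None:
    """Append one JSON object as a single line to a registry file."""
    path = run_dir / "registry" / f"{registry}.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    run_dir = create_run_dir(Path(tmp), "run_001")
    append_record(run_dir, "systems", {"system_id": "sys_1"})
    registry_files = sorted(p.name for p in (run_dir / "registry").iterdir())
    print(len(registry_files))  # 9 registry files
```

An append-only JSONL registry lets a later session reopen the same directory and add records without rewriting earlier ones, which fits the resume/retry direction described in this section.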

A run is one concrete execution/storage instance, normally represented by one run directory. RunStore is a storage facade for that directory; it is not a workflow runner or executor. Resume/retry behavior will later reopen the same run directory and add new attempts for retried jobs, but TASK-004 does not yet implement job planning, attempt directories, backend execution, artifact registration, or restart policy.

Implemented Backend Contract Foundation

The current package includes molsemble.workflow.StageKind and molsemble.backends. The implemented StageKind vocabulary is intentionally small: conformer_generation, geometry_optimization, and single_point_calculation.

BackendOperation is a declarative operation contract: it records the backend name, operation name, StageKind, logical job target scope, consumed product types, product output specs, setting-name categories, expected artifacts, and JSON-compatible metadata. Product outputs are declared with ProductOutputSpec using product type, product scope, and OutputMultiplicity; produces is a derived product-type view over those output specs. The in-memory BackendOperationRegistry registers these contracts, supports lookup/filtering, and can validate that a StageRecord refers to a known backend operation with a matching StageKind and explicit product contract. Explicit product contracts are treated as semantic sets, while effective contracts are normalized to the backend operation's declared order. The required target_scope field uses the shared ProductScope vocabulary and declares the record/object scope targeted by one future logical job.
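
The distinction between explicit contracts compared as sets and effective contracts returned in declared order can be sketched like this. `OperationContract` and `validate_stage` are hypothetical stand-ins, not the BackendOperation or BackendOperationRegistry API; the product type names are invented for the example, and requiring an explicit contract to equal the declared set exactly is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationContract:
    """Hypothetical stand-in for a declarative operation contract."""
    stage_kind: str
    target_scope: str
    consumes: tuple[str, ...]
    produces: tuple[str, ...]


def validate_stage(registry, operation, stage_kind, consumes=None, produces=None):
    """Return the effective (consumes, produces) pair for a stage, or raise ValueError."""
    if operation not in registry:
        raise ValueError(f"unknown backend operation: {operation}")
    op = registry[operation]
    if stage_kind != op.stage_kind:
        raise ValueError(f"{operation} expects stage kind {op.stage_kind}")
    checks = (("consumes", consumes, op.consumes), ("produces", produces, op.produces))
    for label, explicit, declared in checks:
        # Explicit contracts are compared as sets, so their order is irrelevant.
        if explicit is not None and set(explicit) != set(declared):
            raise ValueError(f"{operation}: explicit {label} does not match declaration")
    # Effective contracts always follow the operation's declared order.
    return op.consumes, op.produces


registry = {
    "fake.optimize": OperationContract(
        stage_kind="geometry_optimization",
        target_scope="conformer",
        consumes=("geometry",),                              # hypothetical product types
        produces=("geometry", "energy", "raw_output_bundle"),
    )
}
consumed, produced = validate_stage(
    registry,
    "fake.optimize",
    "geometry_optimization",
    produces=["raw_output_bundle", "geometry", "energy"],  # same set, different order
)
print(produced)
```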

The fake backend currently provides declarations only: fake.generate_conformers targeting systems, plus fake.optimize and fake.single_point targeting conformers. These declarations include output product scopes and multiplicities, including one-or-more conformer-level geometry products for conformer generation and one job-level raw output bundle for fake optimization/single-point operations. The fake backend does not yet execute fake jobs, create products/artifacts, simulate failures, or test restart behavior.

Implemented Workflow Planning Foundation

The current package includes molsemble.workflow.planner, which can resolve a WorkflowRecord plus StageRecord objects against a BackendOperationRegistry. build_workflow_plan(...) returns an in-memory WorkflowPlan with behavior-light ResolvedStage records containing the original stage, matched backend operation, and effective consumed/produced ProductTypes.

The planner canonicalizes stage order by StageRecord.index, validates stage identity and index consistency, checks backend-operation compatibility, fills omitted stage product contracts from backend operation declarations, and performs a coarse upstream ProductType availability check.

The current package also includes molsemble.workflow.dependencies, which resolves stage-level product dependencies from a WorkflowPlan. STRICT mode requires each consumed ProductType to have exactly one earlier producer. PIPELINE mode treats the ordered stage list as a linear refinement pipeline and selects the latest earlier producer while preserving all candidate producer stages for inspection.
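
The difference between the two policies can be shown with a small self-contained sketch. It does not use molsemble.workflow.dependencies: the `Stage` class, the `resolve` function, the stage IDs, and the product type name are hypothetical, and treating a missing producer as an error in both modes is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    stage_id: str
    consumes: tuple[str, ...]
    produces: tuple[str, ...]


def resolve(stages: list[Stage], policy: str) -> dict:
    """Map (stage_id, product_type) to the selected and candidate producer stages."""
    if policy not in ("strict", "pipeline"):
        raise ValueError(f"unknown policy: {policy}")
    resolved = {}
    for position, stage in enumerate(stages):
        for product in stage.consumes:
            # Only stages earlier in the ordered list can be producers.
            candidates = [s.stage_id for s in stages[:position] if product in s.produces]
            if not candidates:
                raise ValueError(f"{stage.stage_id}: no earlier producer of {product}")
            if policy == "strict" and len(candidates) != 1:
                raise ValueError(f"{stage.stage_id}: ambiguous producers {candidates}")
            # Pipeline mode picks the latest earlier producer but keeps all candidates.
            resolved[(stage.stage_id, product)] = {
                "selected": candidates[-1],
                "candidates": candidates,
            }
    return resolved


stages = [
    Stage("stage_001", consumes=(), produces=("geometry",)),
    Stage("stage_002", consumes=("geometry",), produces=("geometry",)),
    Stage("stage_003", consumes=("geometry",), produces=()),
]
plan = resolve(stages, "pipeline")
print(plan[("stage_003", "geometry")])
```

With the same three stages, `resolve(stages, "strict")` raises, because stage_003 has two earlier producers of the consumed product type.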

The current package also includes molsemble.workflow.targets, which builds a traversal-neutral TargetCatalog over known Entry/System/Conformer records. ExecutionTarget records preserve scope, target ID, and resolved parent context; the catalog supports grouping queries by scope, entry, and system. It is a validated target view, not a scheduler or storage layer.

The current package also includes molsemble.workflow.availability, which builds an immutable ProductAvailabilityIndex over known ProductRecord instances and a TargetCatalog. The index answers exact target-aware availability questions for entry-, system-, and conformer-scoped products and can filter by ProductType and producer stage ID. It counts a product as available only when the product record itself is marked available, and it ignores unsupported scopes such as job-scoped raw output bundles. It is a query layer, not a dependency selector, readiness calculator, scheduler, storage reader, or executor.

The current package also includes molsemble.workflow.readiness, which builds an immutable WorkflowReadinessFrontier from a WorkflowDependencyPlan and a ProductAvailabilityIndex. The frontier contains one StageTargetReadiness record for each currently known stage-target pair covered by each stage's operation target scope. Consumed inputs are represented as InputProductRequirement records that preserve the selected producer stage, candidate producer stages, and matching available products. This is input readiness only: no-input stages are ready, consuming stages are ready only when every required ProductType has at least one matching available product, and blocked stage-target pairs report missing ProductTypes. The frontier also offers query helpers for ready/blocked records, stages, and targets.
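
The input-readiness rule above can be sketched in a few lines. This is a simplified illustration rather than the WorkflowReadinessFrontier implementation: the `readiness` function and the dictionary shapes are hypothetical, availability is modelled as a plain set of (product type, target ID) pairs, and filtering by the selected producer stage is left out.

```python
def readiness(stages, targets_by_scope, available):
    """Return one record per stage-target pair covered by each stage's target scope."""
    records = []
    for stage in stages:
        for target in targets_by_scope.get(stage["scope"], []):
            # A pair is blocked when any consumed product type has no
            # available product for this exact target.
            missing = [p for p in stage["consumes"] if (p, target) not in available]
            records.append({
                "stage": stage["id"],
                "target": target,
                "ready": not missing,
                "missing": missing,
            })
    return records


stages = [
    {"id": "stage_001", "scope": "system", "consumes": ()},
    {"id": "stage_002", "scope": "conformer", "consumes": ("geometry",)},
]
targets_by_scope = {"system": ["sys_1"], "conformer": ["conf_1", "conf_2"]}
available = {("geometry", "conf_1")}  # hypothetical product type and target IDs

frontier = readiness(stages, targets_by_scope, available)
blocked = [(r["stage"], r["target"], r["missing"]) for r in frontier if not r["ready"]]
print(blocked)
```

Here the no-input stage is ready for its system target, the consuming stage is ready for conf_1, and the pair (stage_002, conf_2) is blocked with geometry reported as missing.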

The workflow layer does not yet expand jobs, create JobRecord instances, choose one exact input product among multiple matching products, decide reuse or output-completion suppression, create attempts, mutate RunStore, execute backends, or provide explicit YAML input-selector syntax.

First Alpha Scope

The first alpha is intended to cover CSV + SMILES input, YAML workflow configuration, RDKit conformer generation, ORCA geometry optimization, ORCA single-point calculation stages, filesystem provenance, restart/resume, and structured computational-product export.

First Alpha Exclusions

The first alpha excludes SDF input, general XYZ/OpenXYZ input, xTB, CREST, Multiwfn, ORCA frequencies, .orcacosmo generation, Gaussian, Psi4, PySCF, native Slurm execution, database storage, dedicated molsemble-level pruning, connectivity/stereochemistry validation, and user-facing derived/follow-up runs.

License

molsemble is licensed under LGPL-3.0-or-later. The repository includes the LGPL notice in LICENSE.md, the canonical GPL license text in COPYING, and the canonical LGPL license text in COPYING.LESSER.
