Hermes is a local orchestration runtime for the ~/.hermes workspace. The
current documented operating surface is the implemented upgrade control plane:
a single-host, persistence-backed coordination system that plans, gates,
executes, observes, and replays repository upgrade cycles.
Hermes should not be described as a generic AI framework. In the current repository state, the control-plane path is infrastructure code with explicit state files, deterministic decision hashing, lease semantics, remote federation, rollback planning, and event-sourced observability.
In short: a single-host deterministic control-plane prototype with runtime intelligence, lease-backed coordination, federation-aware remote selection, integrity-gated execution, event-sourced replay, and real cron-path simulations.
This is not a distributed consensus system. Hermes currently does not implement Raft, cross-host quorum, multi-host lease replication, autonomous remote provisioning, or a self-healing distributed mesh.
The current control-plane behavior is implemented in:
- hermes-agent/cron/scheduler.py
- scripts/auto_safe_upgrade.py
- hermes-agent/hermes_cli/observability_events.py
- hermes-agent/hermes_cli/observability_graph.py
- hermes-agent/hermes_cli/observability_api.py
- scripts/runtime_intelligence_simulation.py
- config/runtime_upgrade_policies.json
Historical docs/evolution/*, older README/* sections, and broad core/*
narratives are useful for archaeology, but they are not the source of truth for
the implemented upgrade control plane.
cron.scheduler.tick()
-> due-job selection and .tick.lock
-> script path preflight and structured argv execution
-> scripts/auto_safe_upgrade.py run_two_phase_cycle()
-> repo probe and remote failure classification
-> GlobalOrchestrationPlanner (plan_upgrade_queue)
-> AgentCoordinator reservations and leases
-> runtime federation topology
-> RuntimeTopologyGraph snapshot
-> control-plane integrity gate
-> execute / defer / block / rollback planning
-> HermesEvent ledger
-> EventLedger -> GraphProjectionEngine -> GraphSnapshot
-> state/*.json current-control-plane views
cron.scheduler.tick() is the runtime entrypoint for scheduled upgrade jobs. It
acquires a local tick lock, resolves due jobs, validates script paths before
execution, preserves script_args as structured argv, records truthful success
or failure status, and appends runtime events when configured.
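The preflight and structured-argv steps can be illustrated with a minimal sketch. This is not the code in hermes-agent/cron/scheduler.py: the function names `preflight_script` and `run_job`, the `allowed_root` parameter, and the shape of the returned status dict are all assumptions made for illustration.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def preflight_script(script_path: str, allowed_root: str) -> Path:
    """Resolve the script path; reject missing files and paths outside allowed_root.

    Hypothetical helper: the real preflight rules may differ.
    """
    root = Path(allowed_root).resolve()
    path = Path(script_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"script not found: {path}")
    if root not in path.parents:
        raise PermissionError(f"script outside allowed root: {path}")
    return path


def run_job(script_path: str, script_args: list[str], allowed_root: str) -> dict:
    """Run one job with argv passed as a list, never joined into a shell string."""
    path = preflight_script(script_path, allowed_root)
    argv = [sys.executable, str(path), *script_args]
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    # Report the status that actually happened instead of assuming success.
    return {
        "status": "success" if proc.returncode == 0 else "failure",
        "returncode": proc.returncode,
        "argv": argv,
    }
```

Passing `script_args` as a list means an argument such as `"a b"` reaches the script as one argv entry, with no shell quoting involved.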
scripts/auto_safe_upgrade.py owns the two-phase cycle:
probe
-> plan
-> build topology
-> validate integrity
-> execute, defer, or block
-> finalize state
It handles repository inspection, remote degradation, risk and strategy selection, lease-backed coordination, canary checks, guarded apply, rollback planning, summary persistence, and control-plane artifact writes.
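As a rough illustration of the execute, defer, or block step, the sketch below maps three boolean inputs to an outcome. The function name `decide_action` and its inputs are hypothetical, and the rule that a missing lease leads to deferral is an assumption; the real decision in scripts/auto_safe_upgrade.py also weighs risk, strategy, canary results, and policy.

```python
def decide_action(integrity_ok: bool, lease_granted: bool, resource_pressure: bool) -> str:
    """Return "block", "defer", or "execute" for one repository.

    Simplified, assumed ordering of checks:
    1. An integrity failure blocks any repository mutation.
    2. A missing lease or resource pressure defers the work to a later cycle.
    3. Otherwise the upgrade may execute.
    """
    if not integrity_ok:
        return "block"
    if not lease_granted or resource_pressure:
        return "defer"
    return "execute"
```

The integrity check comes first so that no combination of the other inputs can lead to mutation after an integrity failure.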
Runtime intelligence is persisted local memory, not an LLM planner. It records repo stability, rollback rate, failure probability, risk hours, confidence, and remote health signals. The planner consumes these signals together with current resource telemetry and policy configuration to stabilize execution decisions.
The AgentCoordinator semantics are implemented by reservation and lease
functions in scripts/auto_safe_upgrade.py. Coordination is distributed in
shape but single-host in implementation:
- one active lease per repo key
- active unexpired foreign leases win conflicts
- dependencies block downstream repos until prerequisites are granted
- shared remote budgets limit concurrent use of the same remote
- resource pressure can reduce parallelism or defer all lanes
Leases are persisted in state/coordination_leases.json.
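The first two lease rules can be sketched as follows. This is a simplified illustration, not the actual reservation code: the function name `acquire_lease`, the `owner` and `expires_at` fields, and the flat JSON layout are assumptions, and the real schema of state/coordination_leases.json may differ.

```python
from __future__ import annotations

import json
import time
from pathlib import Path


def acquire_lease(
    lease_file: Path,
    repo_key: str,
    owner: str,
    ttl_seconds: float,
    now: float | None = None,
) -> bool:
    """Try to take or renew the single lease for repo_key.

    Returns False when another owner holds an unexpired lease.
    """
    now = time.time() if now is None else now
    leases = json.loads(lease_file.read_text()) if lease_file.exists() else {}

    current = leases.get(repo_key)
    if current and current["owner"] != owner and current["expires_at"] > now:
        return False  # an active foreign lease wins the conflict

    # One entry per repo key, so at most one active lease can exist.
    leases[repo_key] = {"owner": owner, "expires_at": now + ttl_seconds}

    # Write to a temporary file and rename, so readers never see a partial file.
    tmp = lease_file.with_suffix(".tmp")
    tmp.write_text(json.dumps(leases, sort_keys=True, indent=2))
    tmp.replace(lease_file)
    return True
```

The sketch does not guard the read-modify-write sequence against concurrent writers; it assumes a single process mutates the file at a time.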
Runtime federation selects among primary, origin, mirror, cache, and fallback remotes. Promotion is logical: Hermes records the healthier selected endpoint in federation state, but it does not rewrite git remote configuration or provision new remotes.
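A minimal sketch of logical promotion, under stated assumptions: health is a single score per remote role, the highest score wins, and ties fall back to the role order listed above. The function name `select_remote` and the returned fields are hypothetical; the actual selection policy and the layout of the federation state files are not shown here.

```python
from __future__ import annotations

# Role order taken from the text; using it as a tie-breaker is an assumption.
ROLE_ORDER = ["primary", "origin", "mirror", "cache", "fallback"]


def select_remote(health: dict[str, float]) -> dict:
    """Pick the healthiest known remote role and return a state record.

    Nothing here touches git configuration: the result is only data
    that a caller could persist as federation state.
    """
    candidates = [role for role in ROLE_ORDER if role in health]
    if not candidates:
        raise ValueError("no known remote roles in health map")
    selected = max(candidates, key=lambda r: (health[r], -ROLE_ORDER.index(r)))
    return {
        "selected": selected,
        "health": health[selected],
        # "promoted" means a role other than the first-listed candidate was chosen.
        "promoted": selected != candidates[0],
    }
```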
Runtime events flow through:
HermesEvent
-> EventLedger(state/runtime_events.jsonl)
-> observability_graph.project()
-> GraphSnapshot / GraphPatch
-> observability_api readers
The graph projection is derived from the event ledger. It is not a second mutable source of historical truth.
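The relationship between ledger and projection can be sketched as an append-only JSONL file plus a pure fold over its lines. The names `append_event` and `project_ledger`, and the `repo` and `type` event fields, are assumptions for illustration; they are not the HermesEvent schema or the observability_graph API.

```python
from __future__ import annotations

import json
from pathlib import Path


def append_event(ledger: Path, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def project_ledger(ledger: Path) -> dict:
    """Rebuild a snapshot by replaying every event in order.

    The snapshot holds no state of its own: replaying the same ledger
    always yields the same result.
    """
    nodes: dict[str, dict] = {}
    count = 0
    if ledger.exists():
        with ledger.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                event = json.loads(line)
                count += 1
                node = nodes.setdefault(event["repo"], {"status": None, "events": 0})
                node["status"] = event["type"]  # the latest event type wins
                node["events"] += 1
    return {"nodes": nodes, "event_count": count}
```

Because the snapshot is recomputed from the ledger, a projection bug is fixed by correcting the fold and replaying, without editing history.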
Hermes provides bounded deterministic guarantees for the local host and the current persisted state set:
- planner ordering is stable by dependency depth, recommendation priority, risk score, and repo name
- decision_hash is derived from normalized payloads with volatile fields removed
- granted reservations must have matching active leases
- active leases are unique by repo key
- integrity failures block repository mutation
- replay graph state is projected from the append-only event stream
These guarantees do not prove correctness across multiple machines.
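The ordering and hashing guarantees can be illustrated with a short sketch. The set of volatile field names, the entry keys (`dependency_depth`, `priority`, `risk_score`, `repo`), the use of SHA-256, and the convention that lower values sort first are all assumptions; the real planner and hash inputs may differ.

```python
import hashlib
import json

# Assumed examples of fields that change between otherwise identical runs.
VOLATILE_FIELDS = {"timestamp", "generated_at", "pid"}


def normalize(value):
    """Recursively drop volatile keys from dicts; leave other values unchanged."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_FIELDS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def decision_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order and volatile fields do not matter."""
    canonical = json.dumps(normalize(payload), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_order(entries: list) -> list:
    """Sort by dependency depth, priority, risk score, then repo name."""
    return sorted(
        entries,
        key=lambda e: (e["dependency_depth"], e["priority"], e["risk_score"], e["repo"]),
    )
```

Ending the sort key with the repo name makes the order total, so two entries that tie on every other field still sort the same way on every run.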
Primary control-plane files live under state/:
- auto_safe_upgrade_summary.json
- auto_safe_upgrade_state.json
- auto_safe_upgrade_state_history.jsonl
- remote_health.json
- runtime_intelligence.json
- runtime_resource_state.json
- runtime_resource_telemetry.json
- agent_coordination_state.json
- coordination_leases.json
- global_orchestration_plan.json
- runtime_consensus_state.json
- runtime_topology_graph.json
- remote_federation_state.json
- federation_topology_state.json
- control_plane_integrity.json
- rollback_intelligence.json
- live_runtime_control_plane.json
- runtime_events.jsonl
See docs/runtime_state_files.md for the complete map and ownership notes.
Start here:
- docs/README.md
- docs/control_plane_architecture.md
- docs/pipeline.md
- docs/distributed_coordination_semantics.md
- docs/runtime_topology_graph.md
- docs/runtime_federation.md
- docs/replay_and_determinism.md
- docs/control_plane_integrity.md
- docs/rollback_intelligence.md
- docs/runtime_simulation_lab.md
Operator entries:
Hermes currently does not provide:
- full Raft or quorum consensus
- cross-host lock replication
- multi-host split-brain recovery
- autonomous mirror or cache provisioning
- automatic distributed worker placement
- a guaranteed always-on dashboard service
- complete telemetry coverage for every runtime subsystem
The implemented architecture is precise but bounded: deterministic local coordination over persisted state.