Spec 002: systemd agent by outergod · Pull Request #2 · outergod/core-ops

outergod · 2026-03-23T07:56:31Z

Feature Specification: Systemd-Managed Host Agent

Feature Branch: [002-systemd-agent]
Created: 2026-03-19
Status: Draft
Input: User description: "Specify the next iteration of the Fedora CoreOS Quadlet GitOps controller as an automated host agent. Goals: - install and operate the controller as a systemd-managed service or timer on a host - validate and rely on journald-based operational observability when running under systemd - extend reconciliation support beyond container Quadlets to include socket and volume Quadlet artifacts - define lifecycle, ordering, and verification behavior for these supported artifact types - preserve the current principles of native system primitives, explicit failure, idempotence, and observability Requirements: - the controller must be able to run unattended on a host through systemd automation - journald should be the default operational audit sink under service execution - socket artifacts must be treated according to their distinct lifecycle semantics - volume artifacts should be supported as first-class reconciled objects - mounting existing host config directories into containers may be supported where it fits naturally within Quadlet definitions - avoid expanding into generic host configuration management Non-goals for this iteration: - host templating or reusable manifest parameterization across multiple hosts - secret distribution - arbitrary environment file management as first-class managed objects - fleet orchestration"

Clarifications

Session 2026-03-19

Q: Which systemd automation mode should the agent ship for unattended runs? → A: Provide both a oneshot service and a timer (timer triggers service)
Q: What ordering should apply across volume/container/socket artifacts? → A: Volume → Container → Socket ordering
Q: What verification approach should be used per artifact type? → A: Verify by systemd unit state (active for container/socket; loaded for volume). Do not enable/disable generated units.
Q: Where are socket artifacts installed? → A: Socket artifacts are native systemd units stored in the system unit directory (e.g., /etc/systemd/system), not in the Quadlet directory.

User Scenarios & Testing (mandatory)

User Story 1 - Automated host agent runs unattended (Priority: P1)

As an operator, I want the controller installed as a systemd-managed service or
timer so that reconciliation runs unattended on a Fedora CoreOS host.

Why this priority: Unattended host automation is the core value of this
iteration and enables reliable day-2 operations.

Independent Test: Install the unit files, enable the timer or service, and
verify the agent executes on schedule without manual invocation.

Acceptance Scenarios:

Given the systemd unit files are installed, When the timer fires (or
the service starts), Then reconciliation runs without user interaction.
Given the agent runs under systemd, When it completes a run, Then
operational logs are visible in journald.

User Story 2 - Reconcile containers, sockets, and volumes (Priority: P2)

As an operator, I want the controller to reconcile container, socket, and volume
Quadlet artifacts so that all supported workload types converge together.

Why this priority: Broadening Quadlet support is the key functional expansion
beyond the current container-only scope.

Independent Test: Place container, socket, and volume Quadlet files in the
repository and verify they are reconciled in a single run.

Acceptance Scenarios:

Given container, socket, and volume Quadlet definitions, When
reconciliation runs, Then each artifact type is created/updated or
removed according to the desired state.
Given a socket definition that depends on a container service, When
reconciliation runs, Then ordering rules ensure a stable, converged state.

User Story 3 - Explicit verification and observability (Priority: P3)

As an operator, I want clear verification behavior and journal-based observability
so that I can validate changes and diagnose failures.

Why this priority: Observability and verification prevent silent failures and
maintain trust in unattended automation.

Independent Test: Introduce a failing Quadlet definition and confirm that the
agent logs the failure and marks the run as failed without partial success.

Acceptance Scenarios:

Given an invalid Quadlet definition, When the agent runs, Then it
reports a failure in journald with a clear reason.
Given a successful run, When I inspect journald logs, Then I can
identify the plan, actions, and outcome for that run.

Edge Cases

Timer overlaps a previous run (must avoid concurrent reconciliation).
Socket artifacts reference services that are missing or invalid.
Volume artifacts exist on disk but are removed from desired state.
Host reboots during a reconcile run.
Journald unavailable or log storage is full.
Git repository temporarily unavailable.
Quadlet files contain unsupported extensions.

Requirements (mandatory)

Functional Requirements

FR-001: The controller MUST run unattended via systemd service or timer
execution on a Fedora CoreOS host. The distribution MUST include both a
oneshot service and a timer that triggers it.
FR-002: The controller MUST emit operational audit events to journald when
running under systemd.
FR-003: The controller MUST reconcile container and volume Quadlet
artifacts (stored in the Quadlet directory) and socket artifacts as native
systemd units (stored in the system unit directory).
FR-004: The controller MUST define and enforce lifecycle ordering for
socket units and volume artifacts relative to containers and services, using a
Volume → Container → Socket ordering model.
FR-005: The controller MUST define explicit verification behavior for each
supported artifact type (container, socket, volume) using systemd unit state
checks. Container and socket units MUST report active; volume artifacts MUST
report a loaded unit state when applicable. The controller MUST NOT run
enable/disable for generated units; enablement remains driven by Quadlet
[Install] semantics.
FR-006: The controller MUST remain limited to Quadlet/systemd/container
scope and MUST NOT expand into generic host configuration management.
FR-007: Mounting existing host config directories into containers MAY be
supported when specified in Quadlet definitions; the controller MUST NOT manage
those directories as independent objects.
FR-008: The controller MUST ensure reconciliation runs are idempotent and
safe to repeat.
FR-009: The controller MUST surface failures explicitly in operator-visible
diagnostics and logs.
FR-010: The controller MUST prevent overlapping reconcile runs (single-run
lock semantics).

Key Entities (include if feature involves data)

Host Agent Run: A single scheduled execution with plan, actions, and outcome.
Quadlet Artifact: A desired-state object (container, socket, volume).
Reconciliation Plan: The ordered actions required to converge.
Verification Result: Outcome of post-apply checks per artifact type.
Audit Event: Structured operational log entry emitted to journald.

Constitution Alignment (mandatory)

Functional core vs. side effects: Planning and verification remain pure;
systemd/Git interactions are isolated to adapters.
Declarative state model: Desired/observed/plan/outcome remain explicit data.
Idempotence & convergence: Repeated runs converge with no unintended change.
Explicit effects/failures: Side effects and failure modes are logged and
returned explicitly.
Observability: Journald provides default operational audit output; rich
reports remain operator-visible.
Safe defaults: Unattended runs still require explicit desired state and
fail safely on invalid inputs.
Compatibility: No fleet or host config expansion; existing semantics stay
stable for container Quadlets.
Test contract: Tests cover artifact ordering, lifecycle, verification, and
unattended execution behavior.
Regenerability: Spec and tests define the contract for safe regeneration.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: 95% of scheduled runs complete within 2 minutes on a host with up
to 50 Quadlet artifacts.
SC-002: 100% of runs emit a journald audit event with plan summary and
outcome status.
SC-003: 100% of failed runs emit a journald audit event containing run_id,
plan summary, failed artifact list, and failure reason.
SC-004: Reapplying the same desired state results in zero unintended
changes across three consecutive scheduled runs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83df2df011

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T08:04:07Z

+        (QuadletType::Volume, Some(_)) => success(unit_name),
+        (QuadletType::Volume, None) => failure(unit_name, "volume unit not found"),


Treat missing volume units as verification failures

This branch marks any discovered volume unit as success, but io::observed::query_unit_state synthesizes an ObservedUnit even when systemctl show reports not-found. In that common missing-unit path, a volume that was never generated is still reported as verified, so apply runs can incorrectly report converged while volume artifacts are absent.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-23T08:04:07Z

+        .release(guard)
+        .map_err(|err| CoreError::new(FailureClass::Apply, err.to_string()));
+
+    let (result, report, plan) = result?;


Emit audit events when agent apply fails early

The result? here returns before build_audit_event and emit_journal_event, so hard failures from apply_with_report (repo fetch/checkout failures, apply errors, etc.) produce no structured audit event. In systemd unattended runs this drops the failure telemetry for exactly the cases operators need to diagnose.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-23T08:04:07Z

+Environment=CORE_OPS_REPO=https://example.com/org/quadlets.git
+Environment=CORE_OPS_REV=main
+Environment=CORE_OPS_QUADLET_DIR=/etc/containers/systemd
+ExecStart=/usr/bin/core-ops apply --repo ${CORE_OPS_REPO} --rev ${CORE_OPS_REV} --quadlet-dir ${CORE_OPS_QUADLET_DIR}


Invoke agent mode in shipped systemd service template

The unit template executes core-ops apply instead of core-ops agent, which bypasses the new agent-only behavior (run lock semantics and env-driven defaults like CORE_OPS_SYSTEMD_UNIT_DIR/CORE_OPS_LOCK_PATH). Users following this template won't actually run the unattended agent flow introduced in this change.

Useful? React with 👍 / 👎.

…nfig/ Two Codex P2 catches on PR #28: P2 #1 — `read_payload_dropins` walked every directory under `services/<svc>/{quadlet,systemd}/` but only processed those ending in `.d`. A typo like `quadlet/foo.container.dropin/` was silently skipped, dropping the operator's drop-ins on the floor without surfacing a load error. Strict-layout contract should fail fast. The non-`.d` arm now returns `RepoError::LegacyArtifact(path)`, matching the existing pattern for "unrecognized directory at this scope" elsewhere in the parser. P2 #2 — `read_config_files` rejected any `config/etc/...` as legacy unconditionally. But spec.md FR-002 says `config/<rel>` is generic — a user who wants their config to deploy to `/etc/<config-root>/ etc/foo/bar` puts the source at `services/<svc>/config/etc/foo/bar`. Real-world example: `services/pihole/config/etc/dnsmasq.d/...` deploying to `/etc/pihole/etc/dnsmasq.d/...`. The legacy `config/etc/<config-root>/<rel>` mirror is detected by the migration script (which flattens it) and at load time by other unambiguous markers (top-level `quadlets/`, `quadlet-overrides/`, `hosts/<h>/overrides/`). The parser no longer special-cases `etc/`; the rejection is removed. Tests: tests/integration/test_source_repo_layout.rs gains `unrecognized_non_dropin_directory_rejected` and `literal_etc_subdir_under_config_is_legitimate`. Each guards one of the two contract points. Verification: 28 lib unit + 422 integration = 450 pass; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ntics PR #32 review feedback (P2 + P2) from chatgpt-codex-connector on commit f5cfc90. P2 #1 — Preserve EX_DATAERR for invalid source-repo layouts: the previous fix wired exit code 64 (`EX_USAGE`) for path-shape errors via `map_source_ref_error`, but layout/parser failures from `load_desired_state_from_path` still flowed through `map_plan_error`, which doesn't set an exit-code override. So a directory that exists but isn't a spec/016-conformant source-repo (missing `services/`, legacy artifacts, malformed `service.yaml`) exited 1 instead of the documented 65 (`EX_DATAERR`), so automation couldn't distinguish malformed input from generic runtime failures. Fix: add `map_stateless_layout_error` in `src/main.rs` that builds `CoreError::with_exit_code(FailureClass::Plan, msg, 65)` and use it in the `load_desired` closures of all three stateless branches (plan, apply, explain). For `apply_with_report_stateless`'s self-built deps in `src/cli/apply.rs`, the same exit-code wrapper is inlined since the stateless apply path can't reach the main.rs helper. Coverage: new `stateless_plan_against_invalid_layout_errors_with_exit_code_65` integration test. An empty tempdir (passes path-existence + is-dir gates, fails the parser at `MissingServicesDir`) exits 65. P2 #2 — Fall back to non-git sentinel on git probe failure: `working_tree_dirty` returned `true` whenever the `git status` subprocess errored (git missing from `$PATH`, invocation failure, non-zero exit), which forced `(stateless+dirty)` provenance for probe failures and conflated "actually dirty" with "couldn't determine". This contradicts `research.md` D1 step 5 ("On any subprocess error: log via miette diagnostic, fall back to `(stateless)`") and broke the audit semantic that `(stateless+dirty)` is supposed to assert. Fix: replace `working_tree_dirty(&Path) -> bool` with `working_tree_clean(&Path) -> Option<bool>`: - `Some(true)` — `git status --porcelain` succeeded, empty output. - `Some(false)` — `git status --porcelain` succeeded, non-empty output. - `None` — subprocess failed (probe non-verifiable). `detect_provenance` matches on the result: clean → SHA (or `(stateless)` if SHA fetch fails), dirty → `(stateless+dirty)`, probe-failed → `(stateless)` per the spec's documented fallback. `cargo test` 470 passed (was 469); `cargo clippy --all-targets -- -D warnings` clean; `core-ops-release validate --base-ref master` passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #32 review feedback (P2 + P2) from chatgpt-codex-connector on commit 342e5aa. P2 #1 — Populate stateless audit provenance with run outcome: the synthetic `PersistedProvenanceState` for stateless apply hardcoded `reconciliation.status = NeverRun` and `generation = 0`. That object is fed to `build_audit_event`, which emits those fields as authoritative audit metadata, so every stateless apply's journal event reported `reconciliation_status = "never_run"` even after a successful or failed apply — misleading for any downstream audit / monitoring consumer that keys on reconciliation provenance. Fix: extend `synthetic_stateless_provenance` (relocated from `src/main.rs` to `src/cli/apply.rs` so it is unit-testable) to accept the resolved `revision_id` plus the `RunStatus` outcome. The audit event now reports: - `reconciliation_status` = `success` | `failed` (from `RunStatus`). - `attempted_revision` = `revision_id` (always — both success and failure attempted the same target). - `applied_revision` = `revision_id` on success, `None` on failure (failure means no revision became authoritative on the host). - `last_observed_revision` = `revision_id` so the audit event can correlate with the source-repo revision the apply ran against. - `generation` = 1 (single ad-hoc run; stateless mode has no prior generation context). Coverage: new `synthetic_stateless_provenance_reflects_run_outcome` unit test asserts both the Success and Failure matrices. P2 #2 — Differentiate source-repo access errors from missing paths: `detect_provenance` used `Path::exists()` / `Path::is_dir()`, which swallow metadata I/O errors (most commonly `PermissionDenied`) into `false`. So an inaccessible directory got classified as `PathMissing` / `PathNotDirectory` (exit 64) instead of as an access failure, with a misleading diagnostic for unprivileged runs against restricted paths. Fix: replace the existence/is-dir gates with a single `std::fs::metadata` call. `ErrorKind::NotFound` maps to the existing `PathMissing` (exit 64); any other I/O error maps to a new `PathInaccessible { path, source }` variant (exit 66, alongside `Canonicalize`). `metadata.is_dir()` continues to drive the `PathNotDirectory` branch. `cargo test` 471 passed (was 470); `cargo clippy --all-targets -- -D warnings` clean; `core-ops-release validate --base-ref master` passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

outergod added 12 commits March 19, 2026 21:37

update agents

c4b426e

add specification

b00d118

implement Phase 1

30ea9e3

implement Phase 2

190458e

implement Phase 3

3bfdbe2

implement Phase 4

8de296b

implement Phase 5

70d1cd5

implement Phase 6 and fix remaining issues

bef61e3

fix: ignore duplicate workload names

f6b9d57

fix: socket file treatment as native systemd entities

cd37221

fix: startup order for services with corresponding socket

1122056

fix: contract / unit names without -agent

83df2df

chatgpt-codex-connector Bot reviewed Mar 23, 2026

View reviewed changes

outergod merged commit 6f7b0a5 into master Mar 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spec 002: systemd agent#2

Spec 002: systemd agent#2
outergod merged 12 commits into
masterfrom
002-systemd-agent

outergod commented Mar 23, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		(QuadletType::Volume, Some(_)) => success(unit_name),
		(QuadletType::Volume, None) => failure(unit_name, "volume unit not found"),

Conversation

outergod commented Mar 23, 2026

Feature Specification: Systemd-Managed Host Agent

Clarifications

Session 2026-03-19

User Scenarios & Testing (mandatory)

User Story 1 - Automated host agent runs unattended (Priority: P1)

User Story 2 - Reconcile containers, sockets, and volumes (Priority: P2)

User Story 3 - Explicit verification and observability (Priority: P3)

Edge Cases

Requirements (mandatory)

Functional Requirements

Key Entities (include if feature involves data)

Constitution Alignment (mandatory)

Success Criteria (mandatory)

Measurable Outcomes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant