Skip to content

Spec 002: systemd agent#2

Merged
outergod merged 12 commits into
masterfrom
002-systemd-agent
Mar 23, 2026
Merged

Spec 002: systemd agent#2
outergod merged 12 commits into
masterfrom
002-systemd-agent

Conversation

@outergod
Copy link
Copy Markdown
Owner

Feature Specification: Systemd-Managed Host Agent

Feature Branch: [002-systemd-agent]
Created: 2026-03-19
Status: Draft
Input: User description: "Specify the next iteration of the Fedora CoreOS Quadlet GitOps controller as an automated host agent. Goals: - install and operate the controller as a systemd-managed service or timer on a host - validate and rely on journald-based operational observability when running under systemd - extend reconciliation support beyond container Quadlets to include socket and volume Quadlet artifacts - define lifecycle, ordering, and verification behavior for these supported artifact types - preserve the current principles of native system primitives, explicit failure, idempotence, and observability Requirements: - the controller must be able to run unattended on a host through systemd automation - journald should be the default operational audit sink under service execution - socket artifacts must be treated according to their distinct lifecycle semantics - volume artifacts should be supported as first-class reconciled objects - mounting existing host config directories into containers may be supported where it fits naturally within Quadlet definitions - avoid expanding into generic host configuration management Non-goals for this iteration: - host templating or reusable manifest parameterization across multiple hosts - secret distribution - arbitrary environment file management as first-class managed objects - fleet orchestration"

Clarifications

Session 2026-03-19

  • Q: Which systemd automation mode should the agent ship for unattended runs? → A: Provide both a oneshot service and a timer (timer triggers service)
  • Q: What ordering should apply across volume/container/socket artifacts? → A: Volume → Container → Socket ordering
  • Q: What verification approach should be used per artifact type? → A: Verify by systemd unit state (active for container/socket; loaded for volume). Do not enable/disable generated units.
  • Q: Where are socket artifacts installed? → A: Socket artifacts are native systemd units stored in the system unit directory (e.g., /etc/systemd/system), not in the Quadlet directory.

User Scenarios & Testing (mandatory)

User Story 1 - Automated host agent runs unattended (Priority: P1)

As an operator, I want the controller installed as a systemd-managed service or
timer so that reconciliation runs unattended on a Fedora CoreOS host.

Why this priority: Unattended host automation is the core value of this
iteration and enables reliable day-2 operations.

Independent Test: Install the unit files, enable the timer or service, and
verify the agent executes on schedule without manual invocation.

Acceptance Scenarios:

  1. Given the systemd unit files are installed, When the timer fires (or
    the service starts), Then reconciliation runs without user interaction.
  2. Given the agent runs under systemd, When it completes a run, Then
    operational logs are visible in journald.

User Story 2 - Reconcile containers, sockets, and volumes (Priority: P2)

As an operator, I want the controller to reconcile container, socket, and volume
Quadlet artifacts so that all supported workload types converge together.

Why this priority: Broadening Quadlet support is the key functional expansion
beyond the current container-only scope.

Independent Test: Place container, socket, and volume Quadlet files in the
repository and verify they are reconciled in a single run.

Acceptance Scenarios:

  1. Given container, socket, and volume Quadlet definitions, When
    reconciliation runs, Then each artifact type is created/updated or
    removed according to the desired state.
  2. Given a socket definition that depends on a container service, When
    reconciliation runs, Then ordering rules ensure a stable, converged state.

User Story 3 - Explicit verification and observability (Priority: P3)

As an operator, I want clear verification behavior and journal-based observability
so that I can validate changes and diagnose failures.

Why this priority: Observability and verification prevent silent failures and
maintain trust in unattended automation.

Independent Test: Introduce a failing Quadlet definition and confirm that the
agent logs the failure and marks the run as failed without partial success.

Acceptance Scenarios:

  1. Given an invalid Quadlet definition, When the agent runs, Then it
    reports a failure in journald with a clear reason.
  2. Given a successful run, When I inspect journald logs, Then I can
    identify the plan, actions, and outcome for that run.

Edge Cases

  • Timer overlaps a previous run (must avoid concurrent reconciliation).
  • Socket artifacts reference services that are missing or invalid.
  • Volume artifacts exist on disk but are removed from desired state.
  • Host reboots during a reconcile run.
  • Journald unavailable or log storage is full.
  • Git repository temporarily unavailable.
  • Quadlet files contain unsupported extensions.

Requirements (mandatory)

Functional Requirements

  • FR-001: The controller MUST run unattended via systemd service or timer
    execution on a Fedora CoreOS host. The distribution MUST include both a
    oneshot service and a timer that triggers it.
  • FR-002: The controller MUST emit operational audit events to journald when
    running under systemd.
  • FR-003: The controller MUST reconcile container and volume Quadlet
    artifacts (stored in the Quadlet directory) and socket artifacts as native
    systemd units (stored in the system unit directory).
  • FR-004: The controller MUST define and enforce lifecycle ordering for
    socket units and volume artifacts relative to containers and services, using a
    Volume → Container → Socket ordering model.
  • FR-005: The controller MUST define explicit verification behavior for each
    supported artifact type (container, socket, volume) using systemd unit state
    checks. Container and socket units MUST report active; volume artifacts MUST
    report a loaded unit state when applicable. The controller MUST NOT run
    enable/disable for generated units; enablement remains driven by Quadlet
    [Install] semantics.
  • FR-006: The controller MUST remain limited to Quadlet/systemd/container
    scope and MUST NOT expand into generic host configuration management.
  • FR-007: Mounting existing host config directories into containers MAY be
    supported when specified in Quadlet definitions; the controller MUST NOT manage
    those directories as independent objects.
  • FR-008: The controller MUST ensure reconciliation runs are idempotent and
    safe to repeat.
  • FR-009: The controller MUST surface failures explicitly in operator-visible
    diagnostics and logs.
  • FR-010: The controller MUST prevent overlapping reconcile runs (single-run
    lock semantics).

Key Entities (include if feature involves data)

  • Host Agent Run: A single scheduled execution with plan, actions, and outcome.
  • Quadlet Artifact: A desired-state object (container, socket, volume).
  • Reconciliation Plan: The ordered actions required to converge.
  • Verification Result: Outcome of post-apply checks per artifact type.
  • Audit Event: Structured operational log entry emitted to journald.

Constitution Alignment (mandatory)

  • Functional core vs. side effects: Planning and verification remain pure;
    systemd/Git interactions are isolated to adapters.
  • Declarative state model: Desired/observed/plan/outcome remain explicit data.
  • Idempotence & convergence: Repeated runs converge with no unintended change.
  • Explicit effects/failures: Side effects and failure modes are logged and
    returned explicitly.
  • Observability: Journald provides default operational audit output; rich
    reports remain operator-visible.
  • Safe defaults: Unattended runs still require explicit desired state and
    fail safely on invalid inputs.
  • Compatibility: No fleet or host config expansion; existing semantics stay
    stable for container Quadlets.
  • Test contract: Tests cover artifact ordering, lifecycle, verification, and
    unattended execution behavior.
  • Regenerability: Spec and tests define the contract for safe regeneration.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: 95% of scheduled runs complete within 2 minutes on a host with up
    to 50 Quadlet artifacts.
  • SC-002: 100% of runs emit a journald audit event with plan summary and
    outcome status.
  • SC-003: 100% of failed runs emit a journald audit event containing run_id,
    plan summary, failed artifact list, and failure reason.
  • SC-004: Reapplying the same desired state results in zero unintended
    changes across three consecutive scheduled runs.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83df2df011

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/core/verify.rs
Comment on lines +33 to +34
(QuadletType::Volume, Some(_)) => success(unit_name),
(QuadletType::Volume, None) => failure(unit_name, "volume unit not found"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat missing volume units as verification failures

This branch marks any discovered volume unit as success, but io::observed::query_unit_state synthesizes an ObservedUnit even when systemctl show reports not-found. In that common missing-unit path, a volume that was never generated is still reported as verified, so apply runs can incorrectly report converged while volume artifacts are absent.

Useful? React with 👍 / 👎.

Comment thread src/cli/agent.rs
.release(guard)
.map_err(|err| CoreError::new(FailureClass::Apply, err.to_string()));

let (result, report, plan) = result?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Emit audit events when agent apply fails early

The result? here returns before build_audit_event and emit_journal_event, so hard failures from apply_with_report (repo fetch/checkout failures, apply errors, etc.) produce no structured audit event. In systemd unattended runs this drops the failure telemetry for exactly the cases operators need to diagnose.

Useful? React with 👍 / 👎.

Environment=CORE_OPS_REPO=https://example.com/org/quadlets.git
Environment=CORE_OPS_REV=main
Environment=CORE_OPS_QUADLET_DIR=/etc/containers/systemd
ExecStart=/usr/bin/core-ops apply --repo ${CORE_OPS_REPO} --rev ${CORE_OPS_REV} --quadlet-dir ${CORE_OPS_QUADLET_DIR}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Invoke agent mode in shipped systemd service template

The unit template executes core-ops apply instead of core-ops agent, which bypasses the new agent-only behavior (run lock semantics and env-driven defaults like CORE_OPS_SYSTEMD_UNIT_DIR/CORE_OPS_LOCK_PATH). Users following this template won't actually run the unattended agent flow introduced in this change.

Useful? React with 👍 / 👎.

@outergod outergod merged commit 6f7b0a5 into master Mar 23, 2026
outergod added a commit that referenced this pull request May 1, 2026
…nfig/

Two Codex P2 catches on PR #28:

P2 #1 — `read_payload_dropins` walked every directory under
`services/<svc>/{quadlet,systemd}/` but only processed those ending
in `.d`. A typo like `quadlet/foo.container.dropin/` was silently
skipped, dropping the operator's drop-ins on the floor without
surfacing a load error. Strict-layout contract should fail fast.
The non-`.d` arm now returns `RepoError::LegacyArtifact(path)`,
matching the existing pattern for "unrecognized directory at this
scope" elsewhere in the parser.

P2 #2 — `read_config_files` rejected any `config/etc/...` as legacy
unconditionally. But spec.md FR-002 says `config/<rel>` is generic
— a user who wants their config to deploy to `/etc/<config-root>/
etc/foo/bar` puts the source at `services/<svc>/config/etc/foo/bar`.
Real-world example: `services/pihole/config/etc/dnsmasq.d/...`
deploying to `/etc/pihole/etc/dnsmasq.d/...`. The legacy
`config/etc/<config-root>/<rel>` mirror is detected by the
migration script (which flattens it) and at load time by other
unambiguous markers (top-level `quadlets/`, `quadlet-overrides/`,
`hosts/<h>/overrides/`). The parser no longer special-cases `etc/`;
the rejection is removed.

Tests: tests/integration/test_source_repo_layout.rs gains
`unrecognized_non_dropin_directory_rejected` and
`literal_etc_subdir_under_config_is_legitimate`. Each guards one
of the two contract points.

Verification: 28 lib unit + 422 integration = 450 pass; clippy
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod added a commit that referenced this pull request May 5, 2026
…ntics

PR #32 review feedback (P2 + P2) from chatgpt-codex-connector on
commit f5cfc90.

P2 #1 — Preserve EX_DATAERR for invalid source-repo layouts:
the previous fix wired exit code 64 (`EX_USAGE`) for path-shape
errors via `map_source_ref_error`, but layout/parser failures from
`load_desired_state_from_path` still flowed through `map_plan_error`,
which doesn't set an exit-code override. So a directory that exists
but isn't a spec/016-conformant source-repo (missing `services/`,
legacy artifacts, malformed `service.yaml`) exited 1 instead of the
documented 65 (`EX_DATAERR`), so automation couldn't distinguish
malformed input from generic runtime failures.

Fix: add `map_stateless_layout_error` in `src/main.rs` that builds
`CoreError::with_exit_code(FailureClass::Plan, msg, 65)` and use it
in the `load_desired` closures of all three stateless branches (plan,
apply, explain). For `apply_with_report_stateless`'s self-built deps
in `src/cli/apply.rs`, the same exit-code wrapper is inlined since
the stateless apply path can't reach the main.rs helper.

Coverage: new `stateless_plan_against_invalid_layout_errors_with_exit_code_65`
integration test. An empty tempdir (passes path-existence + is-dir
gates, fails the parser at `MissingServicesDir`) exits 65.

P2 #2 — Fall back to non-git sentinel on git probe failure:
`working_tree_dirty` returned `true` whenever the `git status`
subprocess errored (git missing from `$PATH`, invocation failure,
non-zero exit), which forced `(stateless+dirty)` provenance for
probe failures and conflated "actually dirty" with "couldn't
determine". This contradicts `research.md` D1 step 5 ("On any
subprocess error: log via miette diagnostic, fall back to
`(stateless)`") and broke the audit semantic that
`(stateless+dirty)` is supposed to assert.

Fix: replace `working_tree_dirty(&Path) -> bool` with
`working_tree_clean(&Path) -> Option<bool>`:
  - `Some(true)`  — `git status --porcelain` succeeded, empty output.
  - `Some(false)` — `git status --porcelain` succeeded, non-empty output.
  - `None`        — subprocess failed (probe non-verifiable).
`detect_provenance` matches on the result: clean → SHA (or
`(stateless)` if SHA fetch fails), dirty → `(stateless+dirty)`,
probe-failed → `(stateless)` per the spec's documented fallback.

`cargo test` 470 passed (was 469); `cargo clippy --all-targets -- -D
warnings` clean; `core-ops-release validate --base-ref master`
passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod added a commit that referenced this pull request May 5, 2026
PR #32 review feedback (P2 + P2) from chatgpt-codex-connector on
commit 342e5aa.

P2 #1 — Populate stateless audit provenance with run outcome:
the synthetic `PersistedProvenanceState` for stateless apply
hardcoded `reconciliation.status = NeverRun` and `generation = 0`.
That object is fed to `build_audit_event`, which emits those fields
as authoritative audit metadata, so every stateless apply's journal
event reported `reconciliation_status = "never_run"` even after a
successful or failed apply — misleading for any downstream audit /
monitoring consumer that keys on reconciliation provenance.

Fix: extend `synthetic_stateless_provenance` (relocated from
`src/main.rs` to `src/cli/apply.rs` so it is unit-testable) to
accept the resolved `revision_id` plus the `RunStatus` outcome.
The audit event now reports:

  - `reconciliation_status` = `success` | `failed` (from `RunStatus`).
  - `attempted_revision` = `revision_id` (always — both success and
    failure attempted the same target).
  - `applied_revision` = `revision_id` on success, `None` on failure
    (failure means no revision became authoritative on the host).
  - `last_observed_revision` = `revision_id` so the audit event can
    correlate with the source-repo revision the apply ran against.
  - `generation` = 1 (single ad-hoc run; stateless mode has no
    prior generation context).

Coverage: new `synthetic_stateless_provenance_reflects_run_outcome`
unit test asserts both the Success and Failure matrices.

P2 #2 — Differentiate source-repo access errors from missing paths:
`detect_provenance` used `Path::exists()` / `Path::is_dir()`, which
swallow metadata I/O errors (most commonly `PermissionDenied`) into
`false`. So an inaccessible directory got classified as
`PathMissing` / `PathNotDirectory` (exit 64) instead of as an access
failure, with a misleading diagnostic for unprivileged runs against
restricted paths.

Fix: replace the existence/is-dir gates with a single
`std::fs::metadata` call. `ErrorKind::NotFound` maps to the
existing `PathMissing` (exit 64); any other I/O error maps to a
new `PathInaccessible { path, source }` variant (exit 66, alongside
`Canonicalize`). `metadata.is_dir()` continues to drive the
`PathNotDirectory` branch.

`cargo test` 471 passed (was 470); `cargo clippy --all-targets -- -D
warnings` clean; `core-ops-release validate --base-ref master`
passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant