Spec 002: systemd agent#2
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83df2df011
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| (QuadletType::Volume, Some(_)) => success(unit_name), | ||
| (QuadletType::Volume, None) => failure(unit_name, "volume unit not found"), |
There was a problem hiding this comment.
Treat missing volume units as verification failures
This branch marks any discovered volume unit as success, but io::observed::query_unit_state synthesizes an ObservedUnit even when systemctl show reports not-found. In that common missing-unit path, a volume that was never generated is still reported as verified, so apply runs can incorrectly report converged while volume artifacts are absent.
Useful? React with 👍 / 👎.
| .release(guard) | ||
| .map_err(|err| CoreError::new(FailureClass::Apply, err.to_string())); | ||
|
|
||
| let (result, report, plan) = result?; |
There was a problem hiding this comment.
Emit audit events when agent apply fails early
The result? here returns before build_audit_event and emit_journal_event, so hard failures from apply_with_report (repo fetch/checkout failures, apply errors, etc.) produce no structured audit event. In systemd unattended runs this drops the failure telemetry for exactly the cases operators need to diagnose.
Useful? React with 👍 / 👎.
| Environment=CORE_OPS_REPO=https://example.com/org/quadlets.git | ||
| Environment=CORE_OPS_REV=main | ||
| Environment=CORE_OPS_QUADLET_DIR=/etc/containers/systemd | ||
| ExecStart=/usr/bin/core-ops apply --repo ${CORE_OPS_REPO} --rev ${CORE_OPS_REV} --quadlet-dir ${CORE_OPS_QUADLET_DIR} |
There was a problem hiding this comment.
Invoke agent mode in shipped systemd service template
The unit template executes core-ops apply instead of core-ops agent, which bypasses the new agent-only behavior (run lock semantics and env-driven defaults like CORE_OPS_SYSTEMD_UNIT_DIR/CORE_OPS_LOCK_PATH). Users following this template won't actually run the unattended agent flow introduced in this change.
Useful? React with 👍 / 👎.
…nfig/ Two Codex P2 catches on PR #28: P2 #1 — `read_payload_dropins` walked every directory under `services/<svc>/{quadlet,systemd}/` but only processed those ending in `.d`. A typo like `quadlet/foo.container.dropin/` was silently skipped, dropping the operator's drop-ins on the floor without surfacing a load error. Strict-layout contract should fail fast. The non-`.d` arm now returns `RepoError::LegacyArtifact(path)`, matching the existing pattern for "unrecognized directory at this scope" elsewhere in the parser. P2 #2 — `read_config_files` rejected any `config/etc/...` as legacy unconditionally. But spec.md FR-002 says `config/<rel>` is generic — a user who wants their config to deploy to `/etc/<config-root>/ etc/foo/bar` puts the source at `services/<svc>/config/etc/foo/bar`. Real-world example: `services/pihole/config/etc/dnsmasq.d/...` deploying to `/etc/pihole/etc/dnsmasq.d/...`. The legacy `config/etc/<config-root>/<rel>` mirror is detected by the migration script (which flattens it) and at load time by other unambiguous markers (top-level `quadlets/`, `quadlet-overrides/`, `hosts/<h>/overrides/`). The parser no longer special-cases `etc/`; the rejection is removed. Tests: tests/integration/test_source_repo_layout.rs gains `unrecognized_non_dropin_directory_rejected` and `literal_etc_subdir_under_config_is_legitimate`. Each guards one of the two contract points. Verification: 28 lib unit + 422 integration = 450 pass; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntics PR #32 review feedback (P2 + P2) from chatgpt-codex-connector on commit f5cfc90. P2 #1 — Preserve EX_DATAERR for invalid source-repo layouts: the previous fix wired exit code 64 (`EX_USAGE`) for path-shape errors via `map_source_ref_error`, but layout/parser failures from `load_desired_state_from_path` still flowed through `map_plan_error`, which doesn't set an exit-code override. So a directory that exists but isn't a spec/016-conformant source-repo (missing `services/`, legacy artifacts, malformed `service.yaml`) exited 1 instead of the documented 65 (`EX_DATAERR`), so automation couldn't distinguish malformed input from generic runtime failures. Fix: add `map_stateless_layout_error` in `src/main.rs` that builds `CoreError::with_exit_code(FailureClass::Plan, msg, 65)` and use it in the `load_desired` closures of all three stateless branches (plan, apply, explain). For `apply_with_report_stateless`'s self-built deps in `src/cli/apply.rs`, the same exit-code wrapper is inlined since the stateless apply path can't reach the main.rs helper. Coverage: new `stateless_plan_against_invalid_layout_errors_with_exit_code_65` integration test. An empty tempdir (passes path-existence + is-dir gates, fails the parser at `MissingServicesDir`) exits 65. P2 #2 — Fall back to non-git sentinel on git probe failure: `working_tree_dirty` returned `true` whenever the `git status` subprocess errored (git missing from `$PATH`, invocation failure, non-zero exit), which forced `(stateless+dirty)` provenance for probe failures and conflated "actually dirty" with "couldn't determine". This contradicts `research.md` D1 step 5 ("On any subprocess error: log via miette diagnostic, fall back to `(stateless)`") and broke the audit semantic that `(stateless+dirty)` is supposed to assert. Fix: replace `working_tree_dirty(&Path) -> bool` with `working_tree_clean(&Path) -> Option<bool>`: - `Some(true)` — `git status --porcelain` succeeded, empty output. - `Some(false)` — `git status --porcelain` succeeded, non-empty output. - `None` — subprocess failed (probe non-verifiable). `detect_provenance` matches on the result: clean → SHA (or `(stateless)` if SHA fetch fails), dirty → `(stateless+dirty)`, probe-failed → `(stateless)` per the spec's documented fallback. `cargo test` 470 passed (was 469); `cargo clippy --all-targets -- -D warnings` clean; `core-ops-release validate --base-ref master` passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #32 review feedback (P2 + P2) from chatgpt-codex-connector on commit 342e5aa. P2 #1 — Populate stateless audit provenance with run outcome: the synthetic `PersistedProvenanceState` for stateless apply hardcoded `reconciliation.status = NeverRun` and `generation = 0`. That object is fed to `build_audit_event`, which emits those fields as authoritative audit metadata, so every stateless apply's journal event reported `reconciliation_status = "never_run"` even after a successful or failed apply — misleading for any downstream audit / monitoring consumer that keys on reconciliation provenance. Fix: extend `synthetic_stateless_provenance` (relocated from `src/main.rs` to `src/cli/apply.rs` so it is unit-testable) to accept the resolved `revision_id` plus the `RunStatus` outcome. The audit event now reports: - `reconciliation_status` = `success` | `failed` (from `RunStatus`). - `attempted_revision` = `revision_id` (always — both success and failure attempted the same target). - `applied_revision` = `revision_id` on success, `None` on failure (failure means no revision became authoritative on the host). - `last_observed_revision` = `revision_id` so the audit event can correlate with the source-repo revision the apply ran against. - `generation` = 1 (single ad-hoc run; stateless mode has no prior generation context). Coverage: new `synthetic_stateless_provenance_reflects_run_outcome` unit test asserts both the Success and Failure matrices. P2 #2 — Differentiate source-repo access errors from missing paths: `detect_provenance` used `Path::exists()` / `Path::is_dir()`, which swallow metadata I/O errors (most commonly `PermissionDenied`) into `false`. So an inaccessible directory got classified as `PathMissing` / `PathNotDirectory` (exit 64) instead of as an access failure, with a misleading diagnostic for unprivileged runs against restricted paths. Fix: replace the existence/is-dir gates with a single `std::fs::metadata` call. `ErrorKind::NotFound` maps to the existing `PathMissing` (exit 64); any other I/O error maps to a new `PathInaccessible { path, source }` variant (exit 66, alongside `Canonicalize`). `metadata.is_dir()` continues to drive the `PathNotDirectory` branch. `cargo test` 471 passed (was 470); `cargo clippy --all-targets -- -D warnings` clean; `core-ops-release validate --base-ref master` passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Feature Specification: Systemd-Managed Host Agent
Feature Branch:
[002-systemd-agent]Created: 2026-03-19
Status: Draft
Input: User description: "Specify the next iteration of the Fedora CoreOS Quadlet GitOps controller as an automated host agent. Goals: - install and operate the controller as a systemd-managed service or timer on a host - validate and rely on journald-based operational observability when running under systemd - extend reconciliation support beyond container Quadlets to include socket and volume Quadlet artifacts - define lifecycle, ordering, and verification behavior for these supported artifact types - preserve the current principles of native system primitives, explicit failure, idempotence, and observability Requirements: - the controller must be able to run unattended on a host through systemd automation - journald should be the default operational audit sink under service execution - socket artifacts must be treated according to their distinct lifecycle semantics - volume artifacts should be supported as first-class reconciled objects - mounting existing host config directories into containers may be supported where it fits naturally within Quadlet definitions - avoid expanding into generic host configuration management Non-goals for this iteration: - host templating or reusable manifest parameterization across multiple hosts - secret distribution - arbitrary environment file management as first-class managed objects - fleet orchestration"
Clarifications
Session 2026-03-19
User Scenarios & Testing (mandatory)
User Story 1 - Automated host agent runs unattended (Priority: P1)
As an operator, I want the controller installed as a systemd-managed service or
timer so that reconciliation runs unattended on a Fedora CoreOS host.
Why this priority: Unattended host automation is the core value of this
iteration and enables reliable day-2 operations.
Independent Test: Install the unit files, enable the timer or service, and
verify the agent executes on schedule without manual invocation.
Acceptance Scenarios:
the service starts), Then reconciliation runs without user interaction.
operational logs are visible in journald.
User Story 2 - Reconcile containers, sockets, and volumes (Priority: P2)
As an operator, I want the controller to reconcile container, socket, and volume
Quadlet artifacts so that all supported workload types converge together.
Why this priority: Broadening Quadlet support is the key functional expansion
beyond the current container-only scope.
Independent Test: Place container, socket, and volume Quadlet files in the
repository and verify they are reconciled in a single run.
Acceptance Scenarios:
reconciliation runs, Then each artifact type is created/updated or
removed according to the desired state.
reconciliation runs, Then ordering rules ensure a stable, converged state.
User Story 3 - Explicit verification and observability (Priority: P3)
As an operator, I want clear verification behavior and journal-based observability
so that I can validate changes and diagnose failures.
Why this priority: Observability and verification prevent silent failures and
maintain trust in unattended automation.
Independent Test: Introduce a failing Quadlet definition and confirm that the
agent logs the failure and marks the run as failed without partial success.
Acceptance Scenarios:
reports a failure in journald with a clear reason.
identify the plan, actions, and outcome for that run.
Edge Cases
Requirements (mandatory)
Functional Requirements
execution on a Fedora CoreOS host. The distribution MUST include both a
oneshot service and a timer that triggers it.
running under systemd.
artifacts (stored in the Quadlet directory) and socket artifacts as native
systemd units (stored in the system unit directory).
socket units and volume artifacts relative to containers and services, using a
Volume → Container → Socket ordering model.
supported artifact type (container, socket, volume) using systemd unit state
checks. Container and socket units MUST report active; volume artifacts MUST
report a loaded unit state when applicable. The controller MUST NOT run
enable/disable for generated units; enablement remains driven by Quadlet
[Install] semantics.
scope and MUST NOT expand into generic host configuration management.
supported when specified in Quadlet definitions; the controller MUST NOT manage
those directories as independent objects.
safe to repeat.
diagnostics and logs.
lock semantics).
Key Entities (include if feature involves data)
Constitution Alignment (mandatory)
systemd/Git interactions are isolated to adapters.
returned explicitly.
reports remain operator-visible.
fail safely on invalid inputs.
stable for container Quadlets.
unattended execution behavior.
Success Criteria (mandatory)
Measurable Outcomes
to 50 Quadlet artifacts.
outcome status.
plan summary, failed artifact list, and failure reason.
changes across three consecutive scheduled runs.