fix(verifier): accept socket-activated services and stop ConfigFile alias misattribution #31

Merged: outergod merged 2 commits into master from fix/socket-activated-verification on May 4, 2026

@outergod (Owner) commented May 4, 2026

Summary

  • verify_workload now accepts a service in Inactive state when at least one of its triggering .socket units is Active — socket-activated services are correctly Inactive until first connection. Failed is still always a verification failure.
  • desired_target_aliases no longer populates runtime_unit -> managed_id entries for ConfigFile and SocketDropIn workloads. Without this skip, the catch-all in systemd_unit_for_quadlet_file synthesised a <stem>.service alias from any unrecognised extension (e.g. /etc/foo/foo.toml → foo.service), which collided with the real container's runtime unit. Because the alias map is last-write-wins, verification failures were re-pointed from the recoverable container onto the non-recoverable config-file path. That misattributed the failure in the apply report and suppressed the recovery StartUnit action.
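
The acceptance rule in the first bullet can be sketched as follows. This is a minimal illustration, not the code in src/core/verify.rs: the `UnitState` enum and the `service_state_is_acceptable` helper are hypothetical names, and only the three states mentioned in this PR are modelled.

```rust
/// Runtime states as named in the PR text; this enum is a hypothetical stand-in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnitState {
    Active,
    Inactive,
    Failed,
}

/// Hypothetical helper: decides whether a service's runtime state passes
/// verification, given the states of the `.socket` units that trigger it.
pub fn service_state_is_acceptable(service: UnitState, trigger_sockets: &[UnitState]) -> bool {
    match service {
        UnitState::Active => true,
        // A socket-activated service stays Inactive until the first connection,
        // so one Active triggering socket is enough.
        UnitState::Inactive => trigger_sockets.iter().any(|s| *s == UnitState::Active),
        // Failed is never accepted, even when a triggering socket is Active.
        UnitState::Failed => false,
    }
}
```

Under these assumptions, an Inactive service with no triggering sockets, or with only Inactive sockets, still fails verification, which matches the unit-test cases listed in the test plan.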

How this surfaced

On a host applying the matrix-homeserver source repo, core-ops apply reported:

[↺] container/synapse-app.container                             recovered
[↺] socket/traefik.socket                                       recovered

[!] config/etc/traefik/traefik.toml                             failed
    unit not active: Inactive

That is, a config file was flagged as failed because of a service unit's runtime state. traefik.service was correctly Inactive (socket-activated, no traffic yet), but the verifier didn't model socket activation, and the alias map then routed the failure to traefik.toml, whose stem coincidentally matched.

Test plan

  • cargo test --all-targets — 447 / 447 pass
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo run --bin core-ops-release -- validate --base-ref master → Outcome: passed, declared bump patch matches required
  • 4 new unit tests in tests/unit/test_verification.rs cover socket-activation acceptance (Inactive-with-Active-socket → success, Failed-state → still fails, all-sockets-Inactive → still fails) and alias-misroute regressions (ConfigFile and SocketDropIn cases)
  • New accepted regression scenario tests/fixtures/verification/scenarios/accepted-socket-activated-trigger.yaml executed against a live VM via core-ops-verify run: all 9 assertions passed, teardown complete. The decisive step — second apply after systemctl stop frontend.service — emits 3 unchanged / Outcome: converged (pre-fix, this would have been [!] config/etc/frontend/frontend.toml failed / Outcome: non-converging).

Notes

  • Cargo.toml bumped 2.1.0 → 2.1.1; tests/fixtures/provenance_state/valid-success.json follows.
  • Release fragment at changes/fix-socket-activated-verification.md (release_intent: patch).

🤖 Generated with Claude Code

…lias misattribution

Two coupled bugs surfaced together when running `core-ops apply` on a host
with traefik that uses systemd socket activation:

1. `verify_workload` required the runtime `.service` unit to be Active.
   Socket-activated services are correctly Inactive until first connection;
   verification now accepts Inactive when at least one of the service's
   triggering `.socket` units is Active. A Failed service is never accepted
   even with Active sockets.

2. `desired_target_aliases` populated `runtime_unit -> managed_id` entries
   for every workload, including ConfigFile and SocketDropIn. The catch-all
   in `systemd_unit_for_quadlet_file` synthesises `<stem>.service` from any
   unrecognised extension, so `/etc/foo/foo.toml` produced `foo.service` and
   collided with the real container's runtime unit name. Last-write-wins
   re-pointed verification failures from the recoverable container to the
   non-recoverable config file path, which both misattributed the failure
   in the apply report and suppressed the recovery `StartUnit` action.
   ConfigFile and SocketDropIn workloads are now skipped when populating
   the alias map.
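
The collision described in item 2 can be reproduced with a small sketch. Everything here is an assumption made for illustration: `WorkloadKind`, `Workload`, `stem_service`, and `build_aliases` are hypothetical names, `stem_service` is a simplified stand-in for the catch-all branch only, and the managed ids merely imitate the format seen in the apply report.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical subset of workload kinds relevant to the alias map.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadKind {
    Container,
    ConfigFile,
    SocketDropIn,
}

/// Hypothetical workload record: a managed id plus the file it comes from.
pub struct Workload {
    pub managed_id: String,
    pub path: String,
    pub kind: WorkloadKind,
}

/// Simplified stand-in for the catch-all: any file maps to `<stem>.service`.
pub fn stem_service(path: &str) -> Option<String> {
    let stem = Path::new(path).file_stem()?.to_str()?;
    Some(format!("{}.service", stem))
}

/// Builds a runtime_unit -> managed_id map. With `skip_non_units` set,
/// ConfigFile and SocketDropIn workloads contribute no alias.
pub fn build_aliases(workloads: &[Workload], skip_non_units: bool) -> HashMap<String, String> {
    let mut aliases = HashMap::new();
    for w in workloads {
        let non_unit = matches!(w.kind, WorkloadKind::ConfigFile | WorkloadKind::SocketDropIn);
        if skip_non_units && non_unit {
            continue;
        }
        if let Some(unit) = stem_service(&w.path) {
            // HashMap::insert overwrites an existing key: last write wins.
            aliases.insert(unit, w.managed_id.clone());
        }
    }
    aliases
}
```

With a container file `foo.container` followed by a config file `/etc/foo/foo.toml`, the unskipped map ends up pointing `foo.service` at the config file; with the skip enabled it keeps pointing at the container.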

Adds an accepted regression scenario at
tests/fixtures/verification/scenarios/accepted-socket-activated-trigger.yaml
backed by a single-revision history fixture: external `systemctl stop` of a
socket-activated service must still verify as converged, and the apply
output must not surface a `[!] config/etc/...` failure marker for the
config file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04e42f51a6

Comment thread src/core/verify.rs Outdated
…write-wins

Codex P2 review on #31: `Service=` is a single-valued systemd directive, and
when a socket has drop-ins on disk at /etc/systemd/system/<sock>.d/*.conf,
later assignments override earlier ones (an empty `Service=` resets to the
default `<stem>.service`). The first-match resolver could pick the base
unit's target even when a drop-in retargeted the socket — causing the
verifier to bless socket-activation acceptance for the wrong service and
fail the actual one.

The resolver now walks the base socket contents followed by every
SocketDropIn workload under `<socket>.d/`, sorted lexicographically by file
name, taking the last non-empty assignment and treating an empty `Service=`
as a reset.
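
A rough sketch of that last-write-wins resolution, under loud assumptions: `resolve_socket_service` is a hypothetical name, drop-ins are passed in as (file name, content) pairs rather than read from disk, and the parser ignores section headers and comments, which a real unit-file parser would not.

```rust
/// Hypothetical resolver for a socket unit's effective `Service=` target.
/// `base` is the socket unit's content; `drop_ins` are (file name, content)
/// pairs from `<socket>.d/`.
pub fn resolve_socket_service(socket_unit: &str, base: &str, drop_ins: &[(&str, &str)]) -> String {
    // Default target when no assignment survives: `<stem>.service`.
    let stem = socket_unit.strip_suffix(".socket").unwrap_or(socket_unit);
    let default = format!("{}.service", stem);

    // Drop-ins apply in lexicographic order of file name, after the base unit.
    let mut sorted: Vec<(&str, &str)> = drop_ins.to_vec();
    sorted.sort_by(|a, b| a.0.cmp(b.0));
    let mut contents: Vec<&str> = vec![base];
    contents.extend(sorted.iter().map(|(_, content)| *content));

    let mut target: Option<String> = None;
    for content in contents {
        for line in content.lines() {
            if let Some((key, value)) = line.split_once('=') {
                if key.trim() == "Service" {
                    let value = value.trim();
                    // An empty assignment resets to the default; otherwise
                    // the later assignment overrides the earlier one.
                    target = if value.is_empty() { None } else { Some(value.to_string()) };
                }
            }
        }
    }
    target.unwrap_or(default)
}
```

A first-match resolver would return the base unit's target in the override case; this version returns the drop-in's target, and falls back to `<stem>.service` when the last assignment is empty.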

Adds two unit tests covering the override case and the empty-reset case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@outergod outergod merged commit 936a4bb into master May 4, 2026
5 of 6 checks passed
@outergod outergod deleted the fix/socket-activated-verification branch May 4, 2026 21:18