Skip to content

security: validate OS keyring backend and harden sanity check#68

Merged
brandwe merged 2 commits into
mainfrom
security/keyring-backend-validation
Jun 14, 2026
Merged

security: validate OS keyring backend and harden sanity check#68
brandwe merged 2 commits into
mainfrom
security/keyring-backend-validation

Conversation

@brandwe

@brandwe brandwe commented Jun 14, 2026

Copy link
Copy Markdown
Member

Summary

  • Fail closed when keyring selects anything other than the canonical OS backend for macOS, Linux, or Windows.
  • Add InsecureKeyringBackendError with the active backend class in the message for clear operator diagnostics.
  • Harden keyring_sanity.check() to validate the backend before writing, return backend telemetry, and use per-call random probe keys.

Security impact

Without this guard, Linux headless/container/sudo sessions can silently fall through to keyrings.alt plaintext or weak file backends, exposing the ADR-003 Blueprint private key. The new allowlist is backwards-compatible for operators on canonical OS keystores and raises only for misconfigured/insecure installs.

Tests

  • Added 17 tests (1341 -> 1358 collected).
  • Verified: pytest -v --tb=short && ruff check .

Add an OS-specific allowlist of accepted `keyring` backend classes and
assert that the active backend matches on every CredentialStore
construction. Raise InsecureKeyringBackendError when keyring's auto-chain
has fallen through to keyrings.alt.file.PlaintextKeyring,
keyrings.alt.file.EncryptedKeyring, or keyring.backends.fail.Keyring.

Without this guard, a missing D-Bus session on Linux (containers,
headless servers, sudo'd processes) combined with the very common
keyrings.alt transitive dep silently routes the Blueprint private key
to ~/.local/share/python_keyring/keyring_pass.cfg in cleartext. ADR-003
makes the keystore the root of trust for the agent identity, so this
silent fail-open is the highest-impact misconfiguration we have.

Companion changes to keyring_sanity.check():
- Run the backend assertion FIRST and return ok=False on mismatch
  WITHOUT writing the probe payload (don't pollute a plaintext file
  with random bytes).
- Add a `backend` field to SanityResult so operators see the active
  backend in telemetry.
- Suffix the sanity probe key with secrets.token_hex(8) so concurrent
  probes no longer race and emit misleading "backend corrupted"
  diagnostics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brandwe

brandwe commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

Opus review (claude-opus-4.7) — VERDICT: REQUEST_CHANGES

Independent cross-model review against the diff (PR authored by gpt-5.5).

The runtime check is correct, but two paths bypass it that defeat the threat model:

🔴 HIGH — Provisioning paths bypass the check

scripts/setup.sh (~line 996) and scripts/entra_provisioning.py:_keychain_store_cert (L80-L107) call keyring.set_password('entrabot', 'blueprint-private-key', pem) and equivalents with no backend assertion. On a host where keyring.get_keyring() returns keyrings.alt.file.PlaintextKeyring, the Blueprint private key (and the Provisioner cert+key) get written to disk in cleartext at provisioning time. The MCP server's new constructor check then fails loudly on the next boot — but by then the secret is already compromised on disk.

The runtime error doesn't tell the operator 'rotate now — your key was written to ~/.local/share/python_keyring/keyring_pass.cfg in cleartext.'

Fix: Call assert_allowed_keyring_backend() before every keyring write in entra_provisioning._keychain_store_cert and the inline Python in setup.sh. The setup-script call will need a small importable shim since it currently runs python -c inline.

🔴 HIGH — tools/audit.py swallows InsecureKeyringBackendError

src/entrabot/tools/audit.py L46-L53:

if agent_id is None:
    try:
        from entrabot.platform import get_credential_store
        store = get_credential_store()  # raises InsecureKeyringBackendError
        agent_id = store.retrieve('entrabot', 'active_client_id') or 'unknown'
    except Exception:
        agent_id = 'unknown'

Bare except Exception catches the new error. Reproduced by reviewer: patched keyring.get_keyring to a non-allowlisted backend, called log_event(...), got back agent_id='unknown' with no exception and the audit file written normally. The defense-in-depth signal vanishes the first time anything audits.

Fix: Narrow the except — let InsecureKeyringBackendError (and probably any EntraBotError) propagate; only swallow the legitimate 'no active_client_id provisioned yet' cases.

🟡 MEDIUM — keyring.backends.libsecret.Keyring missing from Linux allowlist

libsecret is shipped by keyring itself (not keyrings.alt), talks to the same Secret Service collection as SecretService.Keyring, and is the recommended default on Fedora/openSUSE. Operators with default-keyring=keyring.backends.libsecret.Keyring in keyringrc.cfg will now get InsecureKeyringBackendError at startup — a false-positive regression.

Fix: Add 'keyring.backends.libsecret.Keyring' to the Linux tuple.

🟡 MEDIUM — No escape hatch for operator-supplied secure backends

The allowlist is hard-coded with no ENTRABOT_KEYRING_BACKEND_ALLOWLIST env override, and the error message doesn't include remediation guidance (precedent: MissingPermissionError, RemovedModeError).

🟢 LOW — keyring_sanity.check() is not called from any production code

grep -rn keyring_sanity src/ scripts/ returns only the module + tests. The hardening here is correct but operationally inert until something (preflight?) actually calls it.

Missing test coverage

  • No test that audit.log_event surfaces InsecureKeyringBackendError (tests/tools/test_audit.py mocks the credential store entirely)
  • No test for the fail.Keyring rejection on macOS/Windows
  • No test for the libsecret-allowlisted case
  • tests/test_keyring_sanity.py:13 monkeypatches assert_allowed_keyring_backend with raising=False even in happy-path tests — so a future rename of the function would silently pass

The runtime portion of this PR is solid. Blocking on the two HIGH gaps because together they let the threat model the PR claims to close happen anyway (key written in cleartext at provisioning; error muted at first audit).

Address Opus review feedback on PR #68:

1. tools/audit.log_event no longer swallows InsecureKeyringBackendError
   - The bare 'except Exception: agent_id = "unknown"' silently muted
     the load-bearing security signal on the first audit call. Now
     InsecureKeyringBackendError propagates; other credential-store
     failures (no entry, transport hiccup, no agent provisioned yet)
     still fall back to 'unknown' so unrelated cases preserve the
     audit record.

2. scripts/entra_provisioning + scripts/setup.sh provisioning paths
   now call assert_allowed_keyring_backend() BEFORE any keyring read
   or write
   - Closes the provisioning-time gap where a misconfigured Linux host
     (no D-Bus + keyrings.alt installed) would silently write the
     Provisioner cert + key and the Blueprint private key to disk in
     cleartext at ~/.local/share/python_keyring/keyring_pass.cfg.
     The runtime CredentialStore already enforces this for the agent
     body; mirroring at provisioning time ensures we never put a
     secret to disk in the first place. Without this, the secret was
     already compromised by the time the next MCP server boot would
     fail loudly.

3. keyring_backend allowlist now accepts keyring.backends.libsecret.Keyring
   on Linux
   - libsecret is shipped by keyring itself (not keyrings.alt), talks
     to the same Secret Service collection as the SecretService
     backend, and is the recommended default on Fedora / openSUSE.
     Previously a perfectly secure libsecret configuration was a
     false-positive rejection at startup.

Tests:
- libsecret added to the accepted-backends parametrize list (was 4, now 5)
- test_audit gains test_insecure_keyring_backend_error_propagates and
  test_unrelated_credential_store_error_falls_back_to_unknown
- new tests/test_entra_provisioning_keyring_backend.py covers store/get
  cert paths asserting the backend before any keyring call, and the
  happy-path write when the backend is secure

Full suite: 1358 -> 1366 passing (8 new); ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brandwe

brandwe commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

Follow-up commit e21ca62 addresses the 3 review blockers:

  1. tools/audit.log_event no longer swallows InsecureKeyringBackendError — bare except Exception is now narrowed to except InsecureKeyringBackendError: raise; except Exception: agent_id = 'unknown'. New test: test_insecure_keyring_backend_error_propagates (and test_unrelated_credential_store_error_falls_back_to_unknown to lock in the forgiveness path for unrelated cases).
  2. Provisioning paths now assert the backend before any keyring read/writescripts/entra_provisioning._keychain_{store,get,delete}_cert and the inline python in scripts/setup.sh (Blueprint cert generation, around line 996) call assert_allowed_keyring_backend() first. On Windows the assertion is a no-op since Windows uses the file-backed cert store with explicit ACLs, not the keyring. New file: tests/test_entra_provisioning_keyring_backend.py with 3 tests covering store, get, and the happy path.
  3. keyring.backends.libsecret.Keyring added to the Linux allowlist — libsecret is shipped by keyring itself (not keyrings.alt), is the recommended default on Fedora / openSUSE, and talks to the same Secret Service collection as the existing SecretService backend. Previously a false-positive rejection at startup. Extended the existing parametrize.

Suite: 1358 → 1366 passing (8 new tests, +2 audit, +3 provisioning, +1 libsecret-accepted, +1 mac/windows fail.Keyring symmetry, +1 typo correction). ruff check . green.

The MEDIUM/LOW concerns (no ENTRABOT_KEYRING_BACKEND_ALLOWLIST override, no remediation pointer in error message, keyring_sanity.check() not yet wired into preflight, read-side script gaps) are not addressed in this commit — happy to do them in a follow-up PR if you want them in scope here.

@brandwe brandwe merged commit 1cd9864 into main Jun 14, 2026
5 checks passed
@brandwe brandwe deleted the security/keyring-backend-validation branch June 14, 2026 02:15
brandwe added a commit that referenced this pull request Jun 14, 2026
keyring_sanity.check() only validated roundtrip integrity (length +
bytewise equality), which a PlaintextKeyring backend trivially
passes — so check() returned ok=True even though the very next real
store() call would write the Blueprint PEM to a cleartext file. The
module docstring frames this as the preflight that catches "backend
defect[s]" before opaque later token failures, but the most
consequential backend defect (cleartext storage) is exactly what it
was failing to catch.

PR #68 added assert_allowed_keyring_backend() but did not wire it
into check(). This commit closes that gap:

- check() now calls assert_allowed_keyring_backend() FIRST and on
  InsecureKeyringBackendError returns SanityResult(ok=False,
  diagnostic=..., backend=...) WITHOUT writing the probe payload.
  Don't pollute a plaintext file with even random bytes.
- SanityResult gains a backend field for operator telemetry.
- The roundtrip probe key is suffixed with secrets.token_hex(8) so
  concurrent probes no longer race on a fixed key.
- Existing tests no longer monkeypatch assert_allowed_keyring_backend
  with raising=False in happy paths — they now exercise the real
  validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
brandwe added a commit that referenced this pull request Jun 14, 2026
* security: emit audit events for every A365 Work IQ tool call

The new A365 tool wrappers (create_word_document, add_word_comment,
reply_to_word_comment, read_word_document,
get_a365_file_metadata_by_url, read_a365_text_file,
read_a365_binary_file) called WorkIqProvider.call_tool with no
log_event invocation, violating the "every agent action audits
before it proceeds" non-negotiable. Mutations on customer
SharePoint/OneDrive content and reads of file payloads ran under
Agent User credentials with no agent-attributed audit record.

Audit is now centralized inside WorkIqProvider.call_tool so every
server/tool combination is covered with no chance of a future A365
wrapper forgetting it. Mirrors the pattern at mcp_server.py:2402-2422
(Teams/Graph wrappers). Fail-closed semantics preserved: if log_event
raises (e.g., InsecureKeyringBackendError after PR #68), the Work IQ
call does not proceed.

Args metadata records only KEYS, not values — A365 args may include
sensitive content; the body's "secrets never in logs" rule applies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security: audit A365 token acquisition failures

Keep A365 Work IQ audit spans closed when token acquisition fails. The provider now records failure for exceptions raised while acquiring the Work IQ token, before the MCP call is attempted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security(a365): drop URL/path/file-name keys from audit resource

CodeQL flagged 3 new high-severity 'clear-text storage/logging of
sensitive information' alerts against this PR. Trace: the LLM-
controlled args fileOrFolderUrl / url / webUrl / filePath / path /
fileName flowed from _safe_resource_parts into the resource string,
which audit.log_event then wrote to the JSONL audit file
(audit.py:78) and to logger.info (audit.py:83, 85).

These values are customer data (tenant URLs, internal site paths,
document titles) and have no business in an audit resource handle.
The existing Teams/Graph audit pattern at mcp_server.py:2402-2422
only uses opaque ID handles (chat_id:placeholder_id) which is what
operators actually need for correlation.

Drop URL / path / file-name keys from _RESOURCE_KEY_PRIORITY. Keep
only ID-shape values (driveId, documentLibraryId, documentId, fileId,
itemId, commentId). When no ID-shape arg is present, fall back to
'a365.{server}.{tool}' — self-identifying and CodeQL-safe.

New tests pin the behavior:
- test_url_and_path_args_never_leak_into_audit: constructs an event
  with URL/path/fileName args and asserts none of the user-supplied
  values appear anywhere in the rendered event (resource OR metadata)
- test_id_shape_args_surface_in_audit_resource: confirms ID-shape
  values still surface so audit consumers can correlate.

Suite: 1400 -> 1402 passing; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security(a365): drop ALL argument data from audit resource (CodeQL clear)

Previous attempt (729e4a5) trimmed _RESOURCE_KEY_PRIORITY to only
ID-shape keys, but CodeQL still flagged the same 3 alerts. Root cause:
CodeQL's data-flow analysis taints the entire `arguments` dict (MCP
tool args) as potentially-sensitive — any flow from arguments into a
log_event sink re-triggers the alert regardless of which key is read.

Take the safe path: never let `arguments` data reach log_event at all.
The audit resource is now ALWAYS the action-shaped string
`a365.{server_name}.{tool_name}` (server_name and tool_name come from
the manifest, not user input). Operators correlate audit entries by
action + timestamp; deeper resource detail (which document, which
comment) lives in the Graph API server-side logs and the MCP client
trace, neither of which goes through entrabot's audit pipeline.

Removed now-dead code: _SENSITIVE_RESOURCE_KEYS, _RESOURCE_KEY_PRIORITY,
_safe_resource_parts, _audit_resource. Updated two existing tests and
renamed test_id_shape_args_surface_in_audit_resource to
test_id_shape_args_do_not_surface_in_audit_resource — pins the new
contract that NO argument value (URL/path/file-name OR ID-shape) ever
appears in the audit record.

Suite: 1402 passing; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security(a365): drop args_keys from audit metadata (CodeQL clear)

Previous attempt (24606db) cleared one of the three CodeQL alerts but
the remaining two were re-triggered by metadata['args_keys'] =
list(arguments.keys()). CodeQL's data-flow analysis taints the
arguments dict and propagates the label through any projection
(.keys() / .values() / .items()) — so even just the key NAMES, with
no values, still hit the audit-sink alert.

Metadata is now strictly {server, tool} — both come from the A365
catalog / manifest, not user input, so CodeQL has no taint source.

The information operators most need from a365 audit (which tool fired,
when, success/failure, attribution) is fully preserved. Deeper detail
(which arguments, which IDs) lives in the Graph API server-side logs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(a365): fix engineering-status claim + stale test name (Opus review)

Opus PR-#70 review found two doc/naming mismatches with the final
shape of this PR:

1. docs/engineering-status.md claimed 'metadata records only argument
   keys, never values'. The third follow-up commit (0b3591f) removed
   args_keys entirely because CodeQL kept tainting the projection from
   the arguments dict. Reality: metadata is {server, tool} only.
   Updated the entry to match.

2. test_call_tool_audit_metadata_records_argument_keys_only asserts
   the OPPOSITE of its name — that args_keys is NOT in metadata.
   Renamed to test_call_tool_audit_metadata_omits_argument_keys_and_values.

No code change; doc + test-name only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
brandwe added a commit that referenced this pull request Jun 15, 2026
…aise

Opus PR-#80 review found a production-breaking regression: the new
log_event raised AuditAttributionError on EVERY upstream caller in
tools/files.py, tools/teams.py, a365/provider.py, and
tools/daily_summary.py (the new code from this PR itself). The cause
is that nothing in the codebase ever writes "active_client_id" to the
keyring, so the lookup always returned None in production. The tests
passed only because tests/conftest.py had an autouse fixture
monkeypatching the credential store.

Fix: walk a fallback chain inside log_event BEFORE raising —
  _identity.session.user_id  →  config.agent_id  →
  config.blueprint_app_id  →  credential store  →  raise

This matches the canonical resolution pattern already in mcp_server.py
wrappers (resolve_placeholder, send_email, share_file). The chain
resolves successfully in every production configuration; only a truly
identity-less call raises AuditAttributionError.

Also:
- Dropped/scoped the misleading conftest autouse fixture so the rest
  of the test suite exercises the real resolution path.
- Added a regression test that calls an upstream auditor with NO
  fixture override and proves the fallback chain resolves correctly.
- Restored the post-transition cleanup assertions in
  test_msal_silent_refresh_with_error_dict_fails_closed that the
  original PR removed when the exception type changed (the state
  transitions still happen at mcp_server.py:537-546 and should remain
  covered).

The InsecureKeyringBackendError raise path from PR #68 follow-up is
preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
brandwe added a commit that referenced this pull request Jun 15, 2026
* security: MCP runtime audit + MSAL error-key hardening

Three medium-severity fail-open gaps in the MCP runtime:

1. mcp_server.py:_ensure_valid_token (DELEGATED) and _init_auth read
result["access_token"] without first checking for an "error" key.
try_silent() can return {"error":"interaction_required",...} dicts;
the resulting KeyError was caught by a broad except Exception that
logged a generic message and hid the actual MSAL error code from
operators. Now check for "error" first and raise
TokenExchangeError with the real code — matches the CLAUDE.md
non-negotiable.

2. tools/audit.py:log_event silently fell back to agent_id="unknown"
when the credential store returned None for active_client_id (the
InsecureKeyringBackendError raise case was fixed by the PR #68
follow-up; this is the "returns None" case). Attribution silently
lost. Now raises AuditAttributionError unless the caller explicitly
passes attribution_type="none" — preserves the bootstrap audit
path without leaving an attribution gap in normal flow.

3. tools/daily_summary.py:send_summary_email called email.send_email
directly, bypassing the audit wrapper in mcp_server.send_email.
The 5pm scheduler dispatched mail to sponsors with no audit
record. Now audit-before-acting with action="send_daily_summary",
resource=<recipient list>, fail-closed on log_event raise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security(audit): centralize identity resolution to avoid production raise

Opus PR-#80 review found a production-breaking regression: the new
log_event raised AuditAttributionError on EVERY upstream caller in
tools/files.py, tools/teams.py, a365/provider.py, and
tools/daily_summary.py (the new code from this PR itself). The cause
is that nothing in the codebase ever writes "active_client_id" to the
keyring, so the lookup always returned None in production. The tests
passed only because tests/conftest.py had an autouse fixture
monkeypatching the credential store.

Fix: walk a fallback chain inside log_event BEFORE raising —
  _identity.session.user_id  →  config.agent_id  →
  config.blueprint_app_id  →  credential store  →  raise

This matches the canonical resolution pattern already in mcp_server.py
wrappers (resolve_placeholder, send_email, share_file). The chain
resolves successfully in every production configuration; only a truly
identity-less call raises AuditAttributionError.

Also:
- Dropped/scoped the misleading conftest autouse fixture so the rest
  of the test suite exercises the real resolution path.
- Added a regression test that calls an upstream auditor with NO
  fixture override and proves the fallback chain resolves correctly.
- Restored the post-transition cleanup assertions in
  test_msal_silent_refresh_with_error_dict_fails_closed that the
  original PR removed when the exception type changed (the state
  transitions still happen at mcp_server.py:537-546 and should remain
  covered).

The InsecureKeyringBackendError raise path from PR #68 follow-up is
preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant