security: validate OS keyring backend and harden sanity check by brandwe · Pull Request #68 · microsoft/entrabot

brandwe · 2026-06-14T01:38:54Z

Summary

Fail closed when keyring selects anything other than the canonical OS backend for macOS, Linux, or Windows.
Add InsecureKeyringBackendError with the active backend class in the message for clear operator diagnostics.
Harden keyring_sanity.check() to validate the backend before writing, return backend telemetry, and use per-call random probe keys.

Security impact

Without this guard, Linux headless/container/sudo sessions can silently fall through to keyrings.alt plaintext or weak file backends, exposing the ADR-003 Blueprint private key. The new allowlist is backwards-compatible for operators on canonical OS keystores and raises only for misconfigured/insecure installs.

Tests

Added 17 tests (1341 -> 1358 collected).
Verified: pytest -v --tb=short && ruff check .

Add an OS-specific allowlist of accepted `keyring` backend classes and assert that the active backend matches on every CredentialStore construction. Raise InsecureKeyringBackendError when keyring's auto-chain has fallen through to keyrings.alt.file.PlaintextKeyring, keyrings.alt.file.EncryptedKeyring, or keyring.backends.fail.Keyring. Without this guard, a missing D-Bus session on Linux (containers, headless servers, sudo'd processes) combined with the very common keyrings.alt transitive dep silently routes the Blueprint private key to ~/.local/share/python_keyring/keyring_pass.cfg in cleartext. ADR-003 makes the keystore the root of trust for the agent identity, so this silent fail-open is the highest-impact misconfiguration we have. Companion changes to keyring_sanity.check(): - Run the backend assertion FIRST and return ok=False on mismatch WITHOUT writing the probe payload (don't pollute a plaintext file with random bytes). - Add a `backend` field to SanityResult so operators see the active backend in telemetry. - Suffix the sanity probe key with secrets.token_hex(8) so concurrent probes no longer race and emit misleading "backend corrupted" diagnostics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

brandwe · 2026-06-14T02:10:12Z

Opus review (claude-opus-4.7) — VERDICT: REQUEST_CHANGES

Independent cross-model review against the diff (PR authored by gpt-5.5).

The runtime check is correct, but two paths bypass it that defeat the threat model:

🔴 HIGH — Provisioning paths bypass the check

scripts/setup.sh (~line 996) and scripts/entra_provisioning.py:_keychain_store_cert (L80-L107) call keyring.set_password('entrabot', 'blueprint-private-key', pem) and equivalents with no backend assertion. On a host where keyring.get_keyring() returns keyrings.alt.file.PlaintextKeyring, the Blueprint private key (and the Provisioner cert+key) get written to disk in cleartext at provisioning time. The MCP server's new constructor check then fails loudly on the next boot — but by then the secret is already compromised on disk.

The runtime error doesn't tell the operator 'rotate now — your key was written to ~/.local/share/python_keyring/keyring_pass.cfg in cleartext.'

Fix: Call assert_allowed_keyring_backend() before every keyring write in entra_provisioning._keychain_store_cert and the inline Python in setup.sh. The setup-script call will need a small importable shim since it currently runs python -c inline.

🔴 HIGH — `tools/audit.py` swallows `InsecureKeyringBackendError`

src/entrabot/tools/audit.py L46-L53:

if agent_id is None:
    try:
        from entrabot.platform import get_credential_store
        store = get_credential_store()  # raises InsecureKeyringBackendError
        agent_id = store.retrieve('entrabot', 'active_client_id') or 'unknown'
    except Exception:
        agent_id = 'unknown'

Bare except Exception catches the new error. Reproduced by reviewer: patched keyring.get_keyring to a non-allowlisted backend, called log_event(...), got back agent_id='unknown' with no exception and the audit file written normally. The defense-in-depth signal vanishes the first time anything audits.

Fix: Narrow the except — let InsecureKeyringBackendError (and probably any EntraBotError) propagate; only swallow the legitimate 'no active_client_id provisioned yet' cases.

🟡 MEDIUM — `keyring.backends.libsecret.Keyring` missing from Linux allowlist

libsecret is shipped by keyring itself (not keyrings.alt), talks to the same Secret Service collection as SecretService.Keyring, and is the recommended default on Fedora/openSUSE. Operators with default-keyring=keyring.backends.libsecret.Keyring in keyringrc.cfg will now get InsecureKeyringBackendError at startup — a false-positive regression.

Fix: Add 'keyring.backends.libsecret.Keyring' to the Linux tuple.

🟡 MEDIUM — No escape hatch for operator-supplied secure backends

The allowlist is hard-coded with no ENTRABOT_KEYRING_BACKEND_ALLOWLIST env override, and the error message doesn't include remediation guidance (precedent: MissingPermissionError, RemovedModeError).

🟢 LOW — `keyring_sanity.check()` is not called from any production code

grep -rn keyring_sanity src/ scripts/ returns only the module + tests. The hardening here is correct but operationally inert until something (preflight?) actually calls it.

Missing test coverage

No test that audit.log_event surfaces InsecureKeyringBackendError (tests/tools/test_audit.py mocks the credential store entirely)
No test for the fail.Keyring rejection on macOS/Windows
No test for the libsecret-allowlisted case
tests/test_keyring_sanity.py:13 monkeypatches assert_allowed_keyring_backend with raising=False even in happy-path tests — so a future rename of the function would silently pass

The runtime portion of this PR is solid. Blocking on the two HIGH gaps because together they let the threat model the PR claims to close happen anyway (key written in cleartext at provisioning; error muted at first audit).

Address Opus review feedback on PR #68: 1. tools/audit.log_event no longer swallows InsecureKeyringBackendError - The bare 'except Exception: agent_id = "unknown"' silently muted the load-bearing security signal on the first audit call. Now InsecureKeyringBackendError propagates; other credential-store failures (no entry, transport hiccup, no agent provisioned yet) still fall back to 'unknown' so unrelated cases preserve the audit record. 2. scripts/entra_provisioning + scripts/setup.sh provisioning paths now call assert_allowed_keyring_backend() BEFORE any keyring read or write - Closes the provisioning-time gap where a misconfigured Linux host (no D-Bus + keyrings.alt installed) would silently write the Provisioner cert + key and the Blueprint private key to disk in cleartext at ~/.local/share/python_keyring/keyring_pass.cfg. The runtime CredentialStore already enforces this for the agent body; mirroring at provisioning time ensures we never put a secret to disk in the first place. Without this, the secret was already compromised by the time the next MCP server boot would fail loudly. 3. keyring_backend allowlist now accepts keyring.backends.libsecret.Keyring on Linux - libsecret is shipped by keyring itself (not keyrings.alt), talks to the same Secret Service collection as the SecretService backend, and is the recommended default on Fedora / openSUSE. Previously a perfectly secure libsecret configuration was a false-positive rejection at startup. Tests: - libsecret added to the accepted-backends parametrize list (was 4, now 5) - test_audit gains test_insecure_keyring_backend_error_propagates and test_unrelated_credential_store_error_falls_back_to_unknown - new tests/test_entra_provisioning_keyring_backend.py covers store/get cert paths asserting the backend before any keyring call, and the happy-path write when the backend is secure Full suite: 1358 -> 1366 passing (8 new); ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

brandwe · 2026-06-14T02:13:51Z

Follow-up commit e21ca62 addresses the 3 review blockers:

✅ tools/audit.log_event no longer swallows InsecureKeyringBackendError — bare except Exception is now narrowed to except InsecureKeyringBackendError: raise; except Exception: agent_id = 'unknown'. New test: test_insecure_keyring_backend_error_propagates (and test_unrelated_credential_store_error_falls_back_to_unknown to lock in the forgiveness path for unrelated cases).
✅ Provisioning paths now assert the backend before any keyring read/write — scripts/entra_provisioning._keychain_{store,get,delete}_cert and the inline python in scripts/setup.sh (Blueprint cert generation, around line 996) call assert_allowed_keyring_backend() first. On Windows the assertion is a no-op since Windows uses the file-backed cert store with explicit ACLs, not the keyring. New file: tests/test_entra_provisioning_keyring_backend.py with 3 tests covering store, get, and the happy path.
✅ keyring.backends.libsecret.Keyring added to the Linux allowlist — libsecret is shipped by keyring itself (not keyrings.alt), is the recommended default on Fedora / openSUSE, and talks to the same Secret Service collection as the existing SecretService backend. Previously a false-positive rejection at startup. Extended the existing parametrize.

Suite: 1358 → 1366 passing (8 new tests, +2 audit, +3 provisioning, +1 libsecret-accepted, +1 mac/windows fail.Keyring symmetry, +1 typo correction). ruff check . green.

The MEDIUM/LOW concerns (no ENTRABOT_KEYRING_BACKEND_ALLOWLIST override, no remediation pointer in error message, keyring_sanity.check() not yet wired into preflight, read-side script gaps) are not addressed in this commit — happy to do them in a follow-up PR if you want them in scope here.

keyring_sanity.check() only validated roundtrip integrity (length + bytewise equality), which a PlaintextKeyring backend trivially passes — so check() returned ok=True even though the very next real store() call would write the Blueprint PEM to a cleartext file. The module docstring frames this as the preflight that catches "backend defect[s]" before opaque later token failures, but the most consequential backend defect (cleartext storage) is exactly what it was failing to catch. PR #68 added assert_allowed_keyring_backend() but did not wire it into check(). This commit closes that gap: - check() now calls assert_allowed_keyring_backend() FIRST and on InsecureKeyringBackendError returns SanityResult(ok=False, diagnostic=..., backend=...) WITHOUT writing the probe payload. Don't pollute a plaintext file with even random bytes. - SanityResult gains a backend field for operator telemetry. - The roundtrip probe key is suffixed with secrets.token_hex(8) so concurrent probes no longer race on a fixed key. - Existing tests no longer monkeypatch assert_allowed_keyring_backend with raising=False in happy paths — they now exercise the real validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security: emit audit events for every A365 Work IQ tool call The new A365 tool wrappers (create_word_document, add_word_comment, reply_to_word_comment, read_word_document, get_a365_file_metadata_by_url, read_a365_text_file, read_a365_binary_file) called WorkIqProvider.call_tool with no log_event invocation, violating the "every agent action audits before it proceeds" non-negotiable. Mutations on customer SharePoint/OneDrive content and reads of file payloads ran under Agent User credentials with no agent-attributed audit record. Audit is now centralized inside WorkIqProvider.call_tool so every server/tool combination is covered with no chance of a future A365 wrapper forgetting it. Mirrors the pattern at mcp_server.py:2402-2422 (Teams/Graph wrappers). Fail-closed semantics preserved: if log_event raises (e.g., InsecureKeyringBackendError after PR #68), the Work IQ call does not proceed. Args metadata records only KEYS, not values — A365 args may include sensitive content; the body's "secrets never in logs" rule applies. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security: audit A365 token acquisition failures Keep A365 Work IQ audit spans closed when token acquisition fails. The provider now records failure for exceptions raised while acquiring the Work IQ token, before the MCP call is attempted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(a365): drop URL/path/file-name keys from audit resource CodeQL flagged 3 new high-severity 'clear-text storage/logging of sensitive information' alerts against this PR. Trace: the LLM- controlled args fileOrFolderUrl / url / webUrl / filePath / path / fileName flowed from _safe_resource_parts into the resource string, which audit.log_event then wrote to the JSONL audit file (audit.py:78) and to logger.info (audit.py:83, 85). These values are customer data (tenant URLs, internal site paths, document titles) and have no business in an audit resource handle. The existing Teams/Graph audit pattern at mcp_server.py:2402-2422 only uses opaque ID handles (chat_id:placeholder_id) which is what operators actually need for correlation. Drop URL / path / file-name keys from _RESOURCE_KEY_PRIORITY. Keep only ID-shape values (driveId, documentLibraryId, documentId, fileId, itemId, commentId). When no ID-shape arg is present, fall back to 'a365.{server}.{tool}' — self-identifying and CodeQL-safe. New tests pin the behavior: - test_url_and_path_args_never_leak_into_audit: constructs an event with URL/path/fileName args and asserts none of the user-supplied values appear anywhere in the rendered event (resource OR metadata) - test_id_shape_args_surface_in_audit_resource: confirms ID-shape values still surface so audit consumers can correlate. Suite: 1400 -> 1402 passing; ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(a365): drop ALL argument data from audit resource (CodeQL clear) Previous attempt (729e4a5) trimmed _RESOURCE_KEY_PRIORITY to only ID-shape keys, but CodeQL still flagged the same 3 alerts. Root cause: CodeQL's data-flow analysis taints the entire `arguments` dict (MCP tool args) as potentially-sensitive — any flow from arguments into a log_event sink re-triggers the alert regardless of which key is read. Take the safe path: never let `arguments` data reach log_event at all. The audit resource is now ALWAYS the action-shaped string `a365.{server_name}.{tool_name}` (server_name and tool_name come from the manifest, not user input). Operators correlate audit entries by action + timestamp; deeper resource detail (which document, which comment) lives in the Graph API server-side logs and the MCP client trace, neither of which goes through entrabot's audit pipeline. Removed now-dead code: _SENSITIVE_RESOURCE_KEYS, _RESOURCE_KEY_PRIORITY, _safe_resource_parts, _audit_resource. Updated two existing tests and renamed test_id_shape_args_surface_in_audit_resource to test_id_shape_args_do_not_surface_in_audit_resource — pins the new contract that NO argument value (URL/path/file-name OR ID-shape) ever appears in the audit record. Suite: 1402 passing; ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(a365): drop args_keys from audit metadata (CodeQL clear) Previous attempt (24606db) cleared one of the three CodeQL alerts but the remaining two were re-triggered by metadata['args_keys'] = list(arguments.keys()). CodeQL's data-flow analysis taints the arguments dict and propagates the label through any projection (.keys() / .values() / .items()) — so even just the key NAMES, with no values, still hit the audit-sink alert. Metadata is now strictly {server, tool} — both come from the A365 catalog / manifest, not user input, so CodeQL has no taint source. The information operators most need from a365 audit (which tool fired, when, success/failure, attribution) is fully preserved. Deeper detail (which arguments, which IDs) lives in the Graph API server-side logs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * docs(a365): fix engineering-status claim + stale test name (Opus review) Opus PR-#70 review found two doc/naming mismatches with the final shape of this PR: 1. docs/engineering-status.md claimed 'metadata records only argument keys, never values'. The third follow-up commit (0b3591f) removed args_keys entirely because CodeQL kept tainting the projection from the arguments dict. Reality: metadata is {server, tool} only. Updated the entry to match. 2. test_call_tool_audit_metadata_records_argument_keys_only asserts the OPPOSITE of its name — that args_keys is NOT in metadata. Renamed to test_call_tool_audit_metadata_omits_argument_keys_and_values. No code change; doc + test-name only. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…aise Opus PR-#80 review found a production-breaking regression: the new log_event raised AuditAttributionError on EVERY upstream caller in tools/files.py, tools/teams.py, a365/provider.py, and tools/daily_summary.py (the new code from this PR itself). The cause is that nothing in the codebase ever writes "active_client_id" to the keyring, so the lookup always returned None in production. The tests passed only because tests/conftest.py had an autouse fixture monkeypatching the credential store. Fix: walk a fallback chain inside log_event BEFORE raising — _identity.session.user_id → config.agent_id → config.blueprint_app_id → credential store → raise This matches the canonical resolution pattern already in mcp_server.py wrappers (resolve_placeholder, send_email, share_file). The chain resolves successfully in every production configuration; only a truly identity-less call raises AuditAttributionError. Also: - Dropped/scoped the misleading conftest autouse fixture so the rest of the test suite exercises the real resolution path. - Added a regression test that calls an upstream auditor with NO fixture override and proves the fallback chain resolves correctly. - Restored the post-transition cleanup assertions in test_msal_silent_refresh_with_error_dict_fails_closed that the original PR removed when the exception type changed (the state transitions still happen at mcp_server.py:537-546 and should remain covered). The InsecureKeyringBackendError raise path from PR #68 follow-up is preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* security: MCP runtime audit + MSAL error-key hardening Three medium-severity fail-open gaps in the MCP runtime: 1. mcp_server.py:_ensure_valid_token (DELEGATED) and _init_auth read result["access_token"] without first checking for an "error" key. try_silent() can return {"error":"interaction_required",...} dicts; the resulting KeyError was caught by a broad except Exception that logged a generic message and hid the actual MSAL error code from operators. Now check for "error" first and raise TokenExchangeError with the real code — matches the CLAUDE.md non-negotiable. 2. tools/audit.py:log_event silently fell back to agent_id="unknown" when the credential store returned None for active_client_id (the InsecureKeyringBackendError raise case was fixed by the PR #68 follow-up; this is the "returns None" case). Attribution silently lost. Now raises AuditAttributionError unless the caller explicitly passes attribution_type="none" — preserves the bootstrap audit path without leaving an attribution gap in normal flow. 3. tools/daily_summary.py:send_summary_email called email.send_email directly, bypassing the audit wrapper in mcp_server.send_email. The 5pm scheduler dispatched mail to sponsors with no audit record. Now audit-before-acting with action="send_daily_summary", resource=<recipient list>, fail-closed on log_event raise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(audit): centralize identity resolution to avoid production raise Opus PR-#80 review found a production-breaking regression: the new log_event raised AuditAttributionError on EVERY upstream caller in tools/files.py, tools/teams.py, a365/provider.py, and tools/daily_summary.py (the new code from this PR itself). The cause is that nothing in the codebase ever writes "active_client_id" to the keyring, so the lookup always returned None in production. The tests passed only because tests/conftest.py had an autouse fixture monkeypatching the credential store. Fix: walk a fallback chain inside log_event BEFORE raising — _identity.session.user_id → config.agent_id → config.blueprint_app_id → credential store → raise This matches the canonical resolution pattern already in mcp_server.py wrappers (resolve_placeholder, send_email, share_file). The chain resolves successfully in every production configuration; only a truly identity-less call raises AuditAttributionError. Also: - Dropped/scoped the misleading conftest autouse fixture so the rest of the test suite exercises the real resolution path. - Added a regression test that calls an upstream auditor with NO fixture override and proves the fallback chain resolves correctly. - Restored the post-transition cleanup assertions in test_msal_silent_refresh_with_error_dict_fails_closed that the original PR removed when the exception type changed (the state transitions still happen at mcp_server.py:537-546 and should remain covered). The InsecureKeyringBackendError raise path from PR #68 follow-up is preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

brandwe merged commit 1cd9864 into main Jun 14, 2026
5 checks passed

brandwe deleted the security/keyring-backend-validation branch June 14, 2026 02:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

security: validate OS keyring backend and harden sanity check#68

security: validate OS keyring backend and harden sanity check#68
brandwe merged 2 commits into
mainfrom
security/keyring-backend-validation

brandwe commented Jun 14, 2026

Uh oh!

brandwe commented Jun 14, 2026

Uh oh!

brandwe commented Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brandwe commented Jun 14, 2026

Summary

Security impact

Tests

Uh oh!

brandwe commented Jun 14, 2026

🔴 HIGH — Provisioning paths bypass the check

🔴 HIGH — tools/audit.py swallows InsecureKeyringBackendError

🟡 MEDIUM — keyring.backends.libsecret.Keyring missing from Linux allowlist

🟡 MEDIUM — No escape hatch for operator-supplied secure backends

🟢 LOW — keyring_sanity.check() is not called from any production code

Missing test coverage

Uh oh!

brandwe commented Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🔴 HIGH — `tools/audit.py` swallows `InsecureKeyringBackendError`

🟡 MEDIUM — `keyring.backends.libsecret.Keyring` missing from Linux allowlist

🟢 LOW — `keyring_sanity.check()` is not called from any production code