security: validate OS keyring backend and harden sanity check#68
Conversation
Add an OS-specific allowlist of accepted `keyring` backend classes and assert that the active backend matches on every CredentialStore construction. Raise InsecureKeyringBackendError when keyring's auto-chain has fallen through to keyrings.alt.file.PlaintextKeyring, keyrings.alt.file.EncryptedKeyring, or keyring.backends.fail.Keyring. Without this guard, a missing D-Bus session on Linux (containers, headless servers, sudo'd processes) combined with the very common keyrings.alt transitive dep silently routes the Blueprint private key to ~/.local/share/python_keyring/keyring_pass.cfg in cleartext. ADR-003 makes the keystore the root of trust for the agent identity, so this silent fail-open is the highest-impact misconfiguration we have. Companion changes to keyring_sanity.check(): - Run the backend assertion FIRST and return ok=False on mismatch WITHOUT writing the probe payload (don't pollute a plaintext file with random bytes). - Add a `backend` field to SanityResult so operators see the active backend in telemetry. - Suffix the sanity probe key with secrets.token_hex(8) so concurrent probes no longer race and emit misleading "backend corrupted" diagnostics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Opus review (claude-opus-4.7) — VERDICT: REQUEST_CHANGES Independent cross-model review against the diff (PR authored by gpt-5.5). The runtime check is correct, but two paths bypass it that defeat the threat model: 🔴 HIGH — Provisioning paths bypass the check
The runtime error doesn't tell the operator 'rotate now — your key was written to Fix: Call 🔴 HIGH —
|
Address Opus review feedback on PR #68: 1. tools/audit.log_event no longer swallows InsecureKeyringBackendError - The bare 'except Exception: agent_id = "unknown"' silently muted the load-bearing security signal on the first audit call. Now InsecureKeyringBackendError propagates; other credential-store failures (no entry, transport hiccup, no agent provisioned yet) still fall back to 'unknown' so unrelated cases preserve the audit record. 2. scripts/entra_provisioning + scripts/setup.sh provisioning paths now call assert_allowed_keyring_backend() BEFORE any keyring read or write - Closes the provisioning-time gap where a misconfigured Linux host (no D-Bus + keyrings.alt installed) would silently write the Provisioner cert + key and the Blueprint private key to disk in cleartext at ~/.local/share/python_keyring/keyring_pass.cfg. The runtime CredentialStore already enforces this for the agent body; mirroring at provisioning time ensures we never put a secret to disk in the first place. Without this, the secret was already compromised by the time the next MCP server boot would fail loudly. 3. keyring_backend allowlist now accepts keyring.backends.libsecret.Keyring on Linux - libsecret is shipped by keyring itself (not keyrings.alt), talks to the same Secret Service collection as the SecretService backend, and is the recommended default on Fedora / openSUSE. Previously a perfectly secure libsecret configuration was a false-positive rejection at startup. Tests: - libsecret added to the accepted-backends parametrize list (was 4, now 5) - test_audit gains test_insecure_keyring_backend_error_propagates and test_unrelated_credential_store_error_falls_back_to_unknown - new tests/test_entra_provisioning_keyring_backend.py covers store/get cert paths asserting the backend before any keyring call, and the happy-path write when the backend is secure Full suite: 1358 -> 1366 passing (8 new); ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Follow-up commit
Suite: 1358 → 1366 passing (8 new tests, +2 audit, +3 provisioning, +1 libsecret-accepted, +1 mac/windows fail.Keyring symmetry, +1 typo correction). The MEDIUM/LOW concerns (no |
keyring_sanity.check() only validated roundtrip integrity (length + bytewise equality), which a PlaintextKeyring backend trivially passes — so check() returned ok=True even though the very next real store() call would write the Blueprint PEM to a cleartext file. The module docstring frames this as the preflight that catches "backend defect[s]" before opaque later token failures, but the most consequential backend defect (cleartext storage) is exactly what it was failing to catch. PR #68 added assert_allowed_keyring_backend() but did not wire it into check(). This commit closes that gap: - check() now calls assert_allowed_keyring_backend() FIRST and on InsecureKeyringBackendError returns SanityResult(ok=False, diagnostic=..., backend=...) WITHOUT writing the probe payload. Don't pollute a plaintext file with even random bytes. - SanityResult gains a backend field for operator telemetry. - The roundtrip probe key is suffixed with secrets.token_hex(8) so concurrent probes no longer race on a fixed key. - Existing tests no longer monkeypatch assert_allowed_keyring_backend with raising=False in happy paths — they now exercise the real validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* security: emit audit events for every A365 Work IQ tool call The new A365 tool wrappers (create_word_document, add_word_comment, reply_to_word_comment, read_word_document, get_a365_file_metadata_by_url, read_a365_text_file, read_a365_binary_file) called WorkIqProvider.call_tool with no log_event invocation, violating the "every agent action audits before it proceeds" non-negotiable. Mutations on customer SharePoint/OneDrive content and reads of file payloads ran under Agent User credentials with no agent-attributed audit record. Audit is now centralized inside WorkIqProvider.call_tool so every server/tool combination is covered with no chance of a future A365 wrapper forgetting it. Mirrors the pattern at mcp_server.py:2402-2422 (Teams/Graph wrappers). Fail-closed semantics preserved: if log_event raises (e.g., InsecureKeyringBackendError after PR #68), the Work IQ call does not proceed. Args metadata records only KEYS, not values — A365 args may include sensitive content; the body's "secrets never in logs" rule applies. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security: audit A365 token acquisition failures Keep A365 Work IQ audit spans closed when token acquisition fails. The provider now records failure for exceptions raised while acquiring the Work IQ token, before the MCP call is attempted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(a365): drop URL/path/file-name keys from audit resource CodeQL flagged 3 new high-severity 'clear-text storage/logging of sensitive information' alerts against this PR. Trace: the LLM- controlled args fileOrFolderUrl / url / webUrl / filePath / path / fileName flowed from _safe_resource_parts into the resource string, which audit.log_event then wrote to the JSONL audit file (audit.py:78) and to logger.info (audit.py:83, 85). These values are customer data (tenant URLs, internal site paths, document titles) and have no business in an audit resource handle. The existing Teams/Graph audit pattern at mcp_server.py:2402-2422 only uses opaque ID handles (chat_id:placeholder_id) which is what operators actually need for correlation. Drop URL / path / file-name keys from _RESOURCE_KEY_PRIORITY. Keep only ID-shape values (driveId, documentLibraryId, documentId, fileId, itemId, commentId). When no ID-shape arg is present, fall back to 'a365.{server}.{tool}' — self-identifying and CodeQL-safe. New tests pin the behavior: - test_url_and_path_args_never_leak_into_audit: constructs an event with URL/path/fileName args and asserts none of the user-supplied values appear anywhere in the rendered event (resource OR metadata) - test_id_shape_args_surface_in_audit_resource: confirms ID-shape values still surface so audit consumers can correlate. Suite: 1400 -> 1402 passing; ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(a365): drop ALL argument data from audit resource (CodeQL clear) Previous attempt (729e4a5) trimmed _RESOURCE_KEY_PRIORITY to only ID-shape keys, but CodeQL still flagged the same 3 alerts. Root cause: CodeQL's data-flow analysis taints the entire `arguments` dict (MCP tool args) as potentially-sensitive — any flow from arguments into a log_event sink re-triggers the alert regardless of which key is read. Take the safe path: never let `arguments` data reach log_event at all. The audit resource is now ALWAYS the action-shaped string `a365.{server_name}.{tool_name}` (server_name and tool_name come from the manifest, not user input). Operators correlate audit entries by action + timestamp; deeper resource detail (which document, which comment) lives in the Graph API server-side logs and the MCP client trace, neither of which goes through entrabot's audit pipeline. Removed now-dead code: _SENSITIVE_RESOURCE_KEYS, _RESOURCE_KEY_PRIORITY, _safe_resource_parts, _audit_resource. Updated two existing tests and renamed test_id_shape_args_surface_in_audit_resource to test_id_shape_args_do_not_surface_in_audit_resource — pins the new contract that NO argument value (URL/path/file-name OR ID-shape) ever appears in the audit record. Suite: 1402 passing; ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * security(a365): drop args_keys from audit metadata (CodeQL clear) Previous attempt (24606db) cleared one of the three CodeQL alerts but the remaining two were re-triggered by metadata['args_keys'] = list(arguments.keys()). CodeQL's data-flow analysis taints the arguments dict and propagates the label through any projection (.keys() / .values() / .items()) — so even just the key NAMES, with no values, still hit the audit-sink alert. Metadata is now strictly {server, tool} — both come from the A365 catalog / manifest, not user input, so CodeQL has no taint source. The information operators most need from a365 audit (which tool fired, when, success/failure, attribution) is fully preserved. Deeper detail (which arguments, which IDs) lives in the Graph API server-side logs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * docs(a365): fix engineering-status claim + stale test name (Opus review) Opus PR-#70 review found two doc/naming mismatches with the final shape of this PR: 1. docs/engineering-status.md claimed 'metadata records only argument keys, never values'. The third follow-up commit (0b3591f) removed args_keys entirely because CodeQL kept tainting the projection from the arguments dict. Reality: metadata is {server, tool} only. Updated the entry to match. 2. test_call_tool_audit_metadata_records_argument_keys_only asserts the OPPOSITE of its name — that args_keys is NOT in metadata. Renamed to test_call_tool_audit_metadata_omits_argument_keys_and_values. No code change; doc + test-name only. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aise Opus PR-#80 review found a production-breaking regression: the new log_event raised AuditAttributionError on EVERY upstream caller in tools/files.py, tools/teams.py, a365/provider.py, and tools/daily_summary.py (the new code from this PR itself). The cause is that nothing in the codebase ever writes "active_client_id" to the keyring, so the lookup always returned None in production. The tests passed only because tests/conftest.py had an autouse fixture monkeypatching the credential store. Fix: walk a fallback chain inside log_event BEFORE raising — _identity.session.user_id → config.agent_id → config.blueprint_app_id → credential store → raise This matches the canonical resolution pattern already in mcp_server.py wrappers (resolve_placeholder, send_email, share_file). The chain resolves successfully in every production configuration; only a truly identity-less call raises AuditAttributionError. Also: - Dropped/scoped the misleading conftest autouse fixture so the rest of the test suite exercises the real resolution path. - Added a regression test that calls an upstream auditor with NO fixture override and proves the fallback chain resolves correctly. - Restored the post-transition cleanup assertions in test_msal_silent_refresh_with_error_dict_fails_closed that the original PR removed when the exception type changed (the state transitions still happen at mcp_server.py:537-546 and should remain covered). The InsecureKeyringBackendError raise path from PR #68 follow-up is preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* security: MCP runtime audit + MSAL error-key hardening
Three medium-severity fail-open gaps in the MCP runtime:
1. mcp_server.py:_ensure_valid_token (DELEGATED) and _init_auth read
result["access_token"] without first checking for an "error" key.
try_silent() can return {"error":"interaction_required",...} dicts;
the resulting KeyError was caught by a broad except Exception that
logged a generic message and hid the actual MSAL error code from
operators. Now check for "error" first and raise
TokenExchangeError with the real code — matches the CLAUDE.md
non-negotiable.
2. tools/audit.py:log_event silently fell back to agent_id="unknown"
when the credential store returned None for active_client_id (the
InsecureKeyringBackendError raise case was fixed by the PR #68
follow-up; this is the "returns None" case). Attribution silently
lost. Now raises AuditAttributionError unless the caller explicitly
passes attribution_type="none" — preserves the bootstrap audit
path without leaving an attribution gap in normal flow.
3. tools/daily_summary.py:send_summary_email called email.send_email
directly, bypassing the audit wrapper in mcp_server.send_email.
The 5pm scheduler dispatched mail to sponsors with no audit
record. Now audit-before-acting with action="send_daily_summary",
resource=<recipient list>, fail-closed on log_event raise.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* security(audit): centralize identity resolution to avoid production raise
Opus PR-#80 review found a production-breaking regression: the new
log_event raised AuditAttributionError on EVERY upstream caller in
tools/files.py, tools/teams.py, a365/provider.py, and
tools/daily_summary.py (the new code from this PR itself). The cause
is that nothing in the codebase ever writes "active_client_id" to the
keyring, so the lookup always returned None in production. The tests
passed only because tests/conftest.py had an autouse fixture
monkeypatching the credential store.
Fix: walk a fallback chain inside log_event BEFORE raising —
_identity.session.user_id → config.agent_id →
config.blueprint_app_id → credential store → raise
This matches the canonical resolution pattern already in mcp_server.py
wrappers (resolve_placeholder, send_email, share_file). The chain
resolves successfully in every production configuration; only a truly
identity-less call raises AuditAttributionError.
Also:
- Dropped/scoped the misleading conftest autouse fixture so the rest
of the test suite exercises the real resolution path.
- Added a regression test that calls an upstream auditor with NO
fixture override and proves the fallback chain resolves correctly.
- Restored the post-transition cleanup assertions in
test_msal_silent_refresh_with_error_dict_fails_closed that the
original PR removed when the exception type changed (the state
transitions still happen at mcp_server.py:537-546 and should remain
covered).
The InsecureKeyringBackendError raise path from PR #68 follow-up is
preserved.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Security impact
Without this guard, Linux headless/container/sudo sessions can silently fall through to keyrings.alt plaintext or weak file backends, exposing the ADR-003 Blueprint private key. The new allowlist is backwards-compatible for operators on canonical OS keystores and raises only for misconfigured/insecure installs.
Tests