Skip to content

fix: auto-redact tool_calls[*].function.arguments#588

Merged
Evrard-Nil merged 3 commits into
mainfrom
fix/auto-redact-tool-calls
May 13, 2026
Merged

fix: auto-redact tool_calls[*].function.arguments#588
Evrard-Nil merged 3 commits into
mainfrom
fix/auto-redact-tool-calls

Conversation

@Evrard-Nil
Copy link
Copy Markdown
Collaborator

Summary

Caught while testing an agentic round-trip on staging: when a model returns a tool call whose JSON arguments echo redacted PII, the placeholder leaked to the client unchanged.

The non-streaming + streaming un-redact paths walked content + reasoning_content + reasoning but missed tool_calls[*].function.arguments.

Repro on staging before this fix:

```json
// Request: x-auto-redact: on, prompt contains "alice.chen@gmail.com"
// Response (broken):
"tool_calls":[{
"function":{
"name":"send_email",
"arguments":"{\"to\":\"\",\"subject\":\"Thanks!\",\"body\":\"...\"}"
}
}]
```

After this fix the client sees `{"to":"alice.chen@gmail.com",...}`. The provider still only ever saw ``.

Changes

  • `unredact_chat_response_in_place`: walk `message.tool_calls[*].function.arguments` and `map.unredact` each.
  • `unredact_chunk_in_place` + `StreamUnredactStates`: new per-`(choice_index, tool_call_index)` sliding-tail state map, so streamed argument fragments handle split placeholders the same way content does.
  • New e2e test `auto_redact_unredacts_tool_call_arguments` configures the MockProvider to emit a `send_email` tool call with `` in its args and asserts the client receives the substituted original. 9 auto_redact e2e tests now pass.

Known limitations (deferred follow-ups)

  • privacy-filter span splitting: a single PII like `alice.chen@gmail.com` can come back as two adjacent spans ``. Some models (Claude) treat that as two values and issue two tool calls, one per placeholder. That's a privacy-filter side concern — not a redaction bug.
  • End-of-stream flush for tool_call_arguments tails: existing flush covers content fields; tool_call args don't yet. Most flows complete cleanly without mid-token truncation, deferred.

Test plan

  • `cargo check --workspace --all-targets`
  • `cargo clippy --workspace --all-targets -- -D warnings`
  • `cargo fmt --check`
  • `cargo test --test e2e_all auto_redact -- --test-threads 1` (9 pass)
  • Staging re-test once merged: re-run the agentic round-trip with `tool_choice: send_email` against Anthropic, confirm `to` field contains real email, provider still saw placeholder.

Follow-up to #585.

The un-redact path walked message.content / reasoning_content /
reasoning but not tool_calls[*].function.arguments, so when a model
emitted a tool call whose JSON arguments echoed the user's (now-
redacted) PII, the placeholder leaked to the client unchanged.

Caught while testing an agentic round-trip on staging:

  POST /v1/chat/completions  x-auto-redact: on
  messages=[{user, "Send a thank-you to alice.chen@gmail.com ..."}]
  tools=[{send_email(to, subject, body)}]

Provider (claude-sonnet-4-6) correctly received <email1><email2> in
its prompt, but the resulting tool_calls[0].function.arguments came
back as {"to":"<email1>",...} instead of {"to":"alice.chen@gmail",...}.

Fix:
- Non-streaming (unredact_chat_response_in_place): walk
  message.tool_calls[*].function.arguments and apply map.unredact.
- Streaming (unredact_chunk_in_place + StreamUnredactStates): add a
  per-(choice_index, tool_call_index) sliding-tail state map so
  arguments JSON streamed in fragments handles split placeholders
  the same way content does.

New e2e test auto_redact_unredacts_tool_call_arguments configures
the MockProvider to emit a send_email tool call with <email1> in
its arguments and asserts the client receives the substituted
original.

Known limitations (not addressed here; separate follow-ups):
- privacy-filter sometimes splits one PII into multiple adjacent
  spans (alice.chen@gmail.com -> <email1><email2>), and a model that
  treats placeholders as separate values may issue multiple tool
  calls (one per placeholder). This is a privacy-filter span-merging
  concern, not an un-redact bug.
- End-of-stream flush for tool_call_arguments tails is not yet
  emitted as a synthetic chunk (the existing flush covers content
  fields). Most tool-call args flows complete cleanly without mid-
  token truncation, so deferring to a follow-up.
Copilot AI review requested due to automatic review settings May 13, 2026 13:26
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 13, 2026 13:26 — with GitHub Actions Inactive
@claude
Copy link
Copy Markdown

claude Bot commented May 13, 2026

Critical Review

Targeted fix; non-streaming path and the new test look correct. Two concerns worth addressing before merge, plus one nit.

⚠️ 1. JSON-encoded arguments will be corrupted when the original PII contains JSON-special chars

tool_calls[*].function.arguments is a JSON-encoded string (per OpenAI's schema and how clients parse it). map.unredact(args) does a raw substring substitution — it inserts the original PII verbatim into a JSON string context. For private_email / private_phone / account_number this is fine in practice, but private_name and private_address (both real categories in placeholders.rs:29-30) can absolutely contain double-quotes, backslashes, or newlines.

Concrete failure:

  • User prompt contains a name like Patrick O"Brien (quote in name) → minted as <name1>.
  • Model emits {"to":"...","name":"<name1>"} as arguments.
  • After un-redact: {"to":"...","name":"Patrick O"Brien"} — broken JSON, the client's JSON.parse(arguments) throws.

Same issue for any address with a newline or any value with a backslash.

Suggested fix: JSON-escape each replacement when un-redacting inside arguments. Either a dedicated unredact_into_json_string helper, or parse the args, walk strings, un-redact, and re-serialize. Worth a test with a name/address containing a quote.

This affects both unredact_chat_response_in_place (completions.rs:378-384) and the streaming tool_call_arguments path (completions.rs:440-457).

⚠️ 2. End-of-stream flush gap for tool_call_arguments is more than cosmetic

The PR description marks this as a deferred follow-up, but it is a privacy regression, not just a missing tail:

build_flush_chunks (completions.rs:479-541) drains content / reasoning_content / reasoning but never touches states.tool_call_arguments. If a stream ends with a partial placeholder buffered in the sliding tail (e.g. <emai held back waiting for the rest), those bytes are silently dropped when StreamUnredactStates is dropped. Worse — under the right split, the client receives the literal placeholder <email1> because the tail did not have enough bytes to decide. Either way the un-redact contract is broken specifically for streamed tool calls.

If a full flush is genuinely out of scope here, at minimum drain tool_call_arguments and emit a synthetic chunk per (choice_idx, tc_idx) — same shape as the content flush — and add a streaming test that splits <email1> across chunk boundaries inside arguments.

Nits

  • tc.index.unwrap_or(0) (completions.rs:446): if a provider ever emits two tool calls in one delta, both without an index, they collide on (idx, 0) and cross-contaminate. Vanishingly rare, but a per-position fallback (e.g. enumerate) would be safer and the same line count.
  • No streaming test for tool calls — the streaming path is the more complex of the two changes and is currently uncovered. auto_redact_streaming_splits_placeholder_across_chunks is great prior art to copy.

⚠️ Issues found — JSON-corruption (#1) is the blocker; the flush gap (#2) should not be deferred.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an auto-redact round-trip gap in /v1/chat/completions where redacted placeholders inside tool_calls[*].function.arguments were not being un-redacted before returning to the client—particularly important for agentic flows that execute tool calls with PII in arguments.

Changes:

  • Non-streaming: un-redact choices[*].message.tool_calls[*].function.arguments in-place using the request’s RedactionMap.
  • Streaming: add per-(choice_index, tool_call_index) un-redaction state so tool-call argument fragments can be safely un-redacted across chunk boundaries.
  • Tests: add a new e2e test asserting provider sees placeholders while the client receives un-redacted tool-call arguments.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
crates/api/src/routes/completions.rs Extends streaming + non-streaming un-redaction to include tool-call function arguments and maintains per-tool-call stream state.
crates/api/tests/e2e_all/auto_redact.rs Adds an e2e regression test covering un-redaction of tool-call arguments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/api/src/routes/completions.rs
Addresses claude-review on #588:

1. JSON corruption (blocker):
   tool_calls[*].function.arguments is a JSON-encoded string.
   map.unredact() did raw substring replacement, so PII originals
   containing `"`, `\`, control chars, or non-ASCII would break the
   surrounding JSON.

   - Add RedactionMap::unredact_json_string which JSON-escapes each
     replacement via serde_json::to_string + strip outer quotes.
   - Add StreamUnredact::new_for_json_string variant for the streaming
     path; routes substitutions through unredact_json_string.
   - Non-streaming unredact_chat_response_in_place now uses the json
     variant for tool_calls[*].function.arguments.
   - Streaming unredact_chunk_in_place creates per-(choice_idx, tc_idx)
     state via new_for_json_string.

   Tests:
   - placeholders::unredact_json_string_{escapes_quotes,
     escapes_backslash, escapes_newline, safe_for_simple_pii}
   - stream_unredact::json_string_variant_{escapes_quote_in_replacement,
     escapes_across_chunk_split, no_op_for_simple_pii}

2. End-of-stream flush gap (blocker):
   build_flush_chunks drained content/reasoning fields but ignored
   tool_call_arguments — partial placeholders held in the tail were
   silently dropped (or could leak as literal `<email1>` if the tail
   contained an incomplete-but-recognizable shape).

   - Extend build_flush_chunks with a second pass that drains
     states.tool_call_arguments and emits a synthetic SSE chunk per
     (choice_idx, tc_idx) with the held bytes as a tool_calls delta.
   - Stable ordering: sort by (choice_idx, tc_idx) for deterministic
     output.
   - New e2e auto_redact_unredacts_tool_call_arguments_streaming
     covers the streamed-args reassembly path end-to-end.

3. Nit (collision safety):
   Streaming chunk handler no longer falls back to tc.index = 0 for
   indexless tool calls — uses enumerate position instead, so two
   parallel indexless tool calls in one delta don't collide on (idx, 0).

Test counts:
- services unit: 43 pass (was 38; +5 JSON-escape tests)
- e2e auto_redact: 10 pass (was 9; +1 streaming-args test)
- clippy -D warnings: clean
- cargo fmt --check: clean
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 13, 2026 13:38 — with GitHub Actions Inactive
@Evrard-Nil
Copy link
Copy Markdown
Collaborator Author

Pushed `11e8203` addressing both blockers and the nit.

1. JSON corruption (claude-review #1):
Added `RedactionMap::unredact_json_string` that JSON-escapes each replacement via `serde_json::to_string` + strip outer quotes. Used for the `arguments` field in both non-streaming and streaming paths. Streaming side has a new `StreamUnredact::new_for_json_string` variant that routes substitutions through the JSON-safe path.

Tested with names containing `"`, paths with `\`, addresses with newlines — all round-trip parse as valid JSON via `serde_json::from_str`. Simple PII (emails, phones) produces identical output to the non-JSON path so no behavior change for the common case.

2. End-of-stream flush gap (claude-review #2, also Copilot):
`build_flush_chunks` now does a second pass that drains `states.tool_call_arguments` and emits a synthetic tool-call SSE delta per `(choice_idx, tc_idx)` with the held bytes. No silent loss; truncated placeholders still emit literally as designed.

New e2e `auto_redact_unredacts_tool_call_arguments_streaming` exercises the full streamed-args reassembly path end-to-end (mock streams arguments split by spaces, client reassembles, asserts valid JSON with the un-redacted email).

3. Nit — indexless tool-call collision:
Replaced `tc.index.unwrap_or(0)` with `tc.index.unwrap_or(pos as i64)` where `pos` is the iter position, so two parallel indexless tool calls in one delta no longer collide on `(idx, 0)`.

Test count:

  • 43 unit pass (was 38)
  • 10 auto_redact e2e pass (was 9)
  • clippy -D warnings clean, cargo fmt clean

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements un-redaction for tool call arguments in both standard and streaming chat completion responses, ensuring that PII placeholders are correctly restored before reaching the client. It includes updates to the state management for streaming responses and a new end-to-end test to verify the behavior. A review comment identifies a critical issue where the new tool_call_arguments state is not flushed in the final chunks, which could lead to truncated JSON or missing PII if a placeholder is split at the end of a stream.

Comment thread crates/api/src/routes/completions.rs
@Evrard-Nil
Copy link
Copy Markdown
Collaborator Author

@gemini-code-assist — appears to be a stale review (the comment is anchored to commit 11e8203 but the flush IS implemented there). See build_flush_chunks lines 550–612 in crates/api/src/routes/completions.rs — Pass 2 drains states.tool_call_arguments and emits a synthetic tool-call SSE delta per (choice_idx, tc_idx) before [DONE]. Covered by e2e auto_redact_unredacts_tool_call_arguments_streaming. Happy to re-trigger a fresh review if useful.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Thank you for the update and for verifying the implementation. I have reviewed the changes in commit 11e8203 and confirm that the build_flush_chunks logic in crates/api/src/routes/completions.rs (lines 550–612) correctly drains the tool_call_arguments state and emits the necessary synthetic SSE deltas. The addition of unredact_json_string and the corresponding streaming variant successfully addresses the JSON corruption concerns, and the new e2e tests provide excellent coverage for these paths. The PR looks solid and ready for review/merge.

Surfaced by an independent subagent code review of #588. Each is a
distinct privacy regression vs. the design goal of "provider never
sees raw PII".

1. **Input tool_calls.arguments now redacted.**
   In an agent loop the user resubmits the assistant's prior tool_call
   as part of conversation history. `collect_text_fragments` only
   walked `message.content`, so the JSON arguments string (which often
   echoes the original PII verbatim) was forwarded to the provider raw
   on every follow-up turn.

   - New `TextRef::ToolCallArg { msg_idx, tc_idx }` variant in
     `apply.rs`; `collect_text_fragments` + `write_back` extended to
     cover it. The arguments string is redacted as opaque text — our
     placeholders are pure ASCII (`<emailN>`) so they stay valid JSON.
   - Unit tests: collect_walks_assistant_tool_call_arguments,
     write_back_updates_tool_call_arguments.
   - E2E test: auto_redact_redacts_input_tool_call_arguments — sends a
     4-turn history with a tool_call carrying bob@example.com and
     asserts the provider sees <email1>, not the raw email.

2. **Response message.refusal now un-redacted.**
   Safety-tuned models may quote our placeholders back ("I can't email
   <email1> per policy"). Without un-redacting `choice.message.refusal`,
   the placeholder leaked to the client.
   - Walk the field in `unredact_chat_response_in_place`.
   - E2E test: auto_redact_unredacts_refusal_field.
   - `ChatDelta` has no `refusal` field, so streaming has no
     corresponding gap.

3. **Privacy-filter error logs no longer leak upstream response body.**
   `PrivacyClassifyError::HttpError` carries the verbatim response.text()
   from the privacy-filter. A misbehaving filter that echoes its input
   in an error response would have routed customer PII straight to
   application logs via `tracing::warn!(error = %e, …)`.

   - Added `Self::privacy_classify_error_category(&e)` returning a
     bounded `&'static str` (`unauthorized`, `rate_limited`,
     `unavailable`, `server_error`, `client_error`, `http_other`,
     `request_failed`).
   - The "all providers failed" final error now hand-formats the
     status code only, bypassing `sanitize_error_message` which would
     have re-introduced the body via `Display`.
   - Demoted "Privacy classify completed successfully" from info to
     debug — high-volume info log on every redacted request.

4. **15-second wall-clock timeout on the redact step.**
   Detector retries cascade through `pool.privacy_classify` with each
   attempt bounded by `completion_timeout()` (default 600s) × multiple
   providers. Without an outer bound, a hung detector could hold the
   user's request hostage for tens of minutes before the 503 fires.
   Auto-redact is in the critical request path; we cap it tightly.
   - New `REDACT_TIMEOUT = Duration::from_secs(15)` constant.
   - `redact_messages` now wraps the inner work in `tokio::time::timeout`
     and maps `Elapsed` to `DetectorUnavailable`.

Also includes 8 new adversarial e2e tests in
`auto_redact_adversarial.rs` (from a separate subagent's testing
pass): empty messages, empty content, PII in system messages, PII in
multimodal content-parts arrays, repeated-PII dedup, user-supplied
placeholder collision, 512 KB body, 26 MB body rejection.

Test counts:
- services unit: 47 pass (was 43; +4 collect/write/timeout tests)
- e2e auto_redact: 12 pass (was 10; +2 new tests)
- e2e auto_redact_adversarial: 8 pass (new file)
- clippy -D warnings: clean
- cargo fmt --check: clean
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 13, 2026 14:06 — with GitHub Actions Inactive
@Evrard-Nil
Copy link
Copy Markdown
Collaborator Author

Two subagents reviewed the work — one critical code review, one writing adversarial e2e tests. Pushed `bedf8d7` closing all 3 remaining High-priority gaps from the review plus a timeout, and folded in the 8 adversarial tests from the test pass.

Reviewer-flagged High items, all addressed:

  1. Input `tool_calls[*].arguments` weren't redacted. In an agent loop the user resubmits the assistant's prior tool_call as conversation history; the JSON arguments string echoed the original PII raw to the provider on every turn. Now `collect_text_fragments` walks them; new `TextRef::ToolCallArg` variant + write_back support. E2E: `auto_redact_redacts_input_tool_call_arguments` sends a 4-turn history with `{"to":"bob@example.com"}` in the tool_call args and asserts the provider only sees ``.

  2. `message.refusal` wasn't un-redacted. Safety-tuned models can quote our placeholders back ("I can't email `` per policy"); the placeholder leaked to the client. Walked in `unredact_chat_response_in_place`. (No streaming counterpart — `ChatDelta` has no refusal field.)

  3. Privacy-filter error logs leaked upstream response body. `PrivacyClassifyError::HttpError` carries `response.text()` verbatim — a misbehaving filter that echoed input could route customer PII to application logs via `tracing::warn!(error = %e)`. Replaced with a bounded `error_category` (`unauthorized`/`rate_limited`/`server_error`/etc.). Final user-facing error now hand-formats just the status code. Demoted "Privacy classify completed" from info → debug.

Bonus (Medium #4): 15s outer timeout on `redact_messages` via `tokio::time::timeout`. Worst case before this was N providers × 600s per-attempt → up to ~20 min hold on the user request. Now: timeout → `DetectorUnavailable` → 503. Constant `REDACT_TIMEOUT` in `auto_redact/mod.rs`.

Tester subagent: wrote `crates/api/tests/e2e_all/auto_redact_adversarial.rs` with 8 tests covering empty messages, empty content, system-message PII, multimodal parts, dedup, user-supplied placeholder collision, 512 KB body, 26 MB rejection. None surfaced a bug.

Test counts now:

  • services unit: 47 pass (was 43)
  • e2e auto_redact: 12 pass (was 10)
  • e2e auto_redact_adversarial: 8 pass (new)
  • clippy -D warnings: clean
  • fmt: clean

Re: gemini's earlier "flush gap" comment on `11e8203` — that flagged a missing flush on the latest commit, but `build_flush_chunks` Pass 2 was already there. Stale auto-review; replied separately.

Items I left as deferred follow-ups (not blockers for this PR):

  • Annotations / audio / function_call (legacy) un-redact — additional response fields that could theoretically carry placeholders. Lower probability than refusal in practice.
  • Single source of truth for `auto_redact_enabled` (currently fine, just fragile).
  • Dead `accumulated_bytes` cleanup (pre-existing).

@Evrard-Nil Evrard-Nil merged commit d91edc9 into main May 13, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants