feat: add /v1/privacy/classify endpoint #584
Adds a raw passthrough endpoint that forwards requests to a backend token-classification model (e.g. openai/privacy-filter) exposing POST /v1/privacy/classify. Follows the embeddings_raw pattern: the cloud API only deserializes the model field for routing, then forwards the request body and proxies the response bytes back unchanged. Billing follows rerank/embedding semantics: input_tokens summed from data[*].usage.input_tokens, priced via input_cost_per_token. Adds a new InferenceType::PrivacyClassify variant for usage tracking. Unblocks nearai/infra#86. The /v1/redact endpoint (nearai/infra#87) will be added separately.
Review — PR #584:
Code Review
This pull request adds a new /v1/privacy/classify endpoint for PII span detection, integrating it into the API, service layers, and provider pool. It includes updates to OpenAPI documentation, usage tracking for billing, and implementations for vLLM and mock providers. Feedback suggests addressing a potential integer overflow during token summation and consolidating retry logic to ensure consistency with other inference endpoints.
Pull request overview
Adds a new privacy-classification passthrough endpoint to the API (POST /v1/privacy/classify) and threads it through the inference provider abstraction, provider pool, concurrency limiting, OpenAPI registration, and usage billing (billed by input tokens, similar to embeddings/rerank).
Changes:
- Introduces `InferenceType::PrivacyClassify` and bills it using input tokens × `input_cost_per_token`.
- Adds a new raw passthrough method (`privacy_classify_raw`) across `InferenceProvider` implementations (vLLM/external/mock) and exposes it via `InferenceProviderPool` + `CompletionServiceTrait` with concurrent-slot guarding.
- Adds the Axum route + OpenAPI schema/tag for `/v1/privacy/classify`, including token extraction from `data[*].usage.input_tokens` for usage recording.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/services/src/usage/ports.rs | Adds PrivacyClassify inference type string mappings. |
| crates/services/src/usage/mod.rs | Bills PrivacyClassify like embeddings/rerank (input-token based). |
| crates/services/src/inference_provider_pool/mod.rs | Adds pool-level privacy_classify passthrough with provider fallback. |
| crates/services/src/completions/ports.rs | Extends completion service trait with try_privacy_classify. |
| crates/services/src/completions/mod.rs | Implements try_privacy_classify with concurrency slot guarding + error mapping. |
| crates/inference_providers/src/vllm/mod.rs | Implements vLLM privacy_classify_raw passthrough call. |
| crates/inference_providers/src/models.rs | Introduces PrivacyClassifyError. |
| crates/inference_providers/src/mock.rs | Adds mock implementation returning a minimal privacy-filter-like JSON shape. |
| crates/inference_providers/src/lib.rs | Exposes PrivacyClassifyError and adds privacy_classify_raw to the InferenceProvider trait. |
| crates/inference_providers/src/external/mod.rs | Wires passthrough method through ExternalProvider. |
| crates/inference_providers/src/external/backend.rs | Adds default “not supported” implementation for external backends. |
| crates/api/src/routes/completions.rs | Adds /v1/privacy/classify handler with routing-by-model + usage token extraction/recording. |
| crates/api/src/openapi.rs | Registers the new OpenAPI path and adds the “Privacy” tag. |
| crates/api/src/lib.rs | Registers the new Axum route under completion routes. |
Address bot review feedback on #584:
- token sum: fold into i64 with saturating_add and clamp to i32, filtering negative values. Avoids release-build wrap when a malformed or malicious provider response has many data entries or oversized per-entry input_tokens (caught only after the MAX_REASONABLE_TOKENS cap, which checks the already-wrapped value).
- usage/mod.rs: extend the "rerank/embedding" comment on the input-token billing arm to also mention privacy_classify.
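The token-sum fix described in this commit could look roughly like the sketch below. It assumes the per-entry `usage.input_tokens` values have already been extracted from the response into a slice of `i64`; the function name `sum_input_tokens` is hypothetical and not taken from the PR, and the `MAX_REASONABLE_TOKENS` cap is left out because its value is not given here.

```rust
/// Hypothetical helper (not the PR's actual code): sums per-entry
/// `usage.input_tokens` values that were already parsed as i64.
fn sum_input_tokens(per_entry: &[i64]) -> i32 {
    let total: i64 = per_entry
        .iter()
        .copied()
        // Drop negative counts from malformed or malicious responses.
        .filter(|&n| n >= 0)
        // Saturate instead of wrapping when the sum exceeds i64::MAX.
        .fold(0i64, |acc, n| acc.saturating_add(n));
    // Clamp into the i32 range rather than truncating with a bare `as` cast.
    total.min(i32::MAX as i64) as i32
}
```

Because the sum saturates and is then clamped, any later sanity cap sees an honest upper bound rather than a wrapped value.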
Six tests against MockProvider exercise the full route path:
- basic single-input → 200 with valid response shape
- array input → 200
- unknown model → 404
- missing API key → 401
- missing model field → 400
- costs deducted (10 mock tokens × 1_000_000 = 10_000_000 units)

Also adds openai/privacy-filter to init_inference_providers_with_mocks so the MockProvider answers privacy_classify_raw for it.
Re-posting as a comment-only review; not blocking on changes.
lloydmak99 left a comment
Nice PR — shape mirrors embeddings_raw closely and the token-accounting math (saturating i64 sum, clamp to i32, negative-value filter, MAX_REASONABLE_TOKENS cap + Datadog anomaly metric) is genuinely the cleanest version of this pattern in the repo. A few items below before this lands.
Probably blocking
- Body size limit inheritance: `/privacy/classify` is registered before the `DefaultBodyLimit::max(AUDIO_TRANSCRIPTION_MAX_BODY_SIZE)` layer in `crates/api/src/lib.rs:982`, so it inherits the audio-transcription cap (tens of MB). The privacy-filter model has a 512-token context per the test setup, so that limit is wildly oversized for a text-classification endpoint. Suggest a per-route `DefaultBodyLimit::max(...)` with something more proportionate (a few hundred KB) so abuse / accidental flooding is less attractive.
- Verify the upstream endpoint path actually exists: `crates/inference_providers/src/vllm/mod.rs:1463` calls `{base_url}/v1/privacy/classify`. Standard vLLM/SGLang don't expose this path; this assumes the privacy-filter model server (port 8007 on gpu04) has a custom route or a custom server framework. Worth confirming against the small-models compose / model image before merging; if it's actually `/v1/classify` or `/predict` upstream, this 404s silently against the live backend and the E2E box on the test plan won't catch it because the mock answers anything.
Worth addressing
- `_body_hash` extracted then discarded (`crates/api/src/routes/completions.rs:109`). Other passthrough routes consume this for body-hash propagation. If intentional, a one-line `// privacy classify does not need body-hash propagation` comment would prevent confusion; if not, this is silently dropping data downstream consumers may expect.
- Pool's "no providers" path returns `RequestFailed` → 502 (`crates/services/src/inference_provider_pool/mod.rs:1801`). The route handler pre-resolves the model and returns 404, but if a model disappears between resolution and pool lookup the user sees a 502 instead of 404. Minor; consider a distinct error variant or status mapping for consistency with the pre-check.
- 401/403 from upstream remapped to 500 `"The model is currently unavailable"` (`crates/services/src/completions/mod.rs:1610`). Matches existing patterns, but for a brand-new endpoint where upstream auth is still firming up, keep an eye on the `tracing::warn!` in the pool when debugging: the 500 the user sees won't tell you it's actually an auth problem.
- Test coverage gaps: no test exercises the `MAX_REASONABLE_TOKENS` cap firing, the zero-tokens warning, or the 429/503 mappings out of `try_privacy_classify`. Given the token math is the most novel part of the diff, an explicit test that the cap clamps and emits `METRIC_PROVIDER_TOKEN_ANOMALIES` would be valuable.
Minor / nits
- Model name `openai/privacy-filter` is misleading: not from OpenAI. Purely a naming choice for the admin upsert, but worth raising with whoever's deciding the public model ID before it ships.
- OpenAPI doc types declare `input: serde_json::Value` and `data: serde_json::Value`, which generates a generic `object` schema. Acceptable for a passthrough but unhelpful for consumers. Consider `oneOf` for `input: string | string[]` and a typed `data` entry.
- Mock response hardcodes `model: "mock-privacy-filter"` (`crates/inference_providers/src/mock.rs:1101`) regardless of the requested model. `test_privacy_classify_basic` only asserts `body.get("model").is_some()` so it doesn't matter today; flagging as a future foot-gun for any test that asserts model-id round-trip.
- `./svc.sh`-style redundancy doesn't apply here, but the redundant 502 message in `try_privacy_classify`'s `5xx` arm (`"server_error"` → "Privacy classify request failed. Please try again later.") could be unified with the catch-all branch.
- Two `tokio::time::sleep(Duration::from_millis(200))` calls in the tests: fine, matches the rest of the suite, but worth double-checking they're necessary for this endpoint (some of those sleeps were originally tied to the 15-min provider refresh which is now 5min, and may not be related to this code path at all).
Approve once the body-size override and upstream-path verification are sorted. The token-accounting block here is solid enough that it could probably be a shared helper for the other passthrough routes in a follow-up.
Addresses review feedback on #584:
- Per-route DefaultBodyLimit::max(256 KB) on /v1/privacy/classify so it doesn't inherit the 25 MB audio-transcription cap from the shared text_inference_routes layer. Privacy filter is text-in/text-out with a small (e.g. 512-token) context; 25 MB is wildly oversized.
- Mock now echoes the requested model id instead of hardcoding "mock-privacy-filter", so any future test asserting model round-trip won't hit a foot-gun.
- New e2e test test_privacy_classify_body_size_limit confirms the per-route cap kicks in (413 on 300 KB body).
Thanks @lloydmak99, really thorough review. Pushed b446baa addressing the blocking items and a couple of the smaller ones. Walk-through:

Blocking items
- Body size limit ✅ Fixed in b446baa. Added a per-route […]
- Upstream endpoint path verification ✅ Already verified before posting the test commit, but I should have flagged the evidence in the PR description. Two confirmations: […]

Worth-addressing items
- Mock hardcoded model id ✅ Fixed in b446baa: mock now echoes […]
- Pool 404→502 race: Pre-existing across all […]
- 401/403→500 upstream remap: Same pre-existing pattern as […]
- Test for MAX_REASONABLE_TOKENS cap: Skipped here because the current […]

Nits
- Model name […]
- OpenAPI […]
- 502 message redundancy / sleep(200ms): Both pre-existing patterns inherited from the embeddings-style scaffolding. Out of scope here.

Test suite went from 6 → 7 tests, all pass serially. Let me know if the body-size override looks right and I'll squash if needed.
* feat: x-auto-redact for chat completions
Adds an opt-in header (`x-auto-redact: on`) and body field
(`auto_redact: true`) on /v1/chat/completions that:
1. Detects PII in prompt messages by calling the privacy-filter model
via the inference provider pool.
2. Mints stable per-request placeholders (<email1>, <phone2>, …) and
rewrites the messages so the provider only ever sees the redacted
form. Provider scope is intentional: works for vLLM and external
(Anthropic/OpenAI/Gemini) alike.
3. Strips the auto_redact body field before forwarding so providers
with strict JSON schemas don't 422.
4. Un-redacts the response: walks message.content / reasoning fields
for non-streaming, wraps SSE chunks in a sliding-window unredacter
(16-byte tail, holds incomplete <…> tokens across chunks) for
streaming.
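The streaming half of step 4 can be illustrated with a small std-only sketch. It is a simplification under stated assumptions, not the PR's `stream_unredact.rs`: the type name `StreamUnredact` is borrowed from the commit text, but the methods `push` and `finish`, the constant `MAX_TAIL_BYTES`, and the helper `replace_placeholders` are hypothetical, and a plain map lookup stands in for the regex. The idea is to hold back a trailing unterminated `<…` (up to 16 bytes) until the next chunk completes it.

```rust
use std::collections::HashMap;

const MAX_TAIL_BYTES: usize = 16; // assumed to mirror the 16-byte tail

/// Replace every complete `<…>` token that is a known placeholder.
fn replace_placeholders(text: &str, map: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find('<') {
        out.push_str(&rest[..start]);
        let candidate = &rest[start..];
        let end = match candidate.find('>') {
            Some(end) => end,
            None => {
                // No closing bracket: nothing left to replace.
                out.push_str(candidate);
                return out;
            }
        };
        match map.get(&candidate[..=end]) {
            Some(original) => {
                out.push_str(original);
                rest = &candidate[end + 1..];
            }
            None => {
                // Not a placeholder: keep the '<' and keep scanning after it.
                out.push('<');
                rest = &candidate[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

struct StreamUnredact {
    map: HashMap<String, String>, // placeholder -> original text
    tail: String,                 // held-back, possibly incomplete token
}

impl StreamUnredact {
    fn new(map: HashMap<String, String>) -> Self {
        Self { map, tail: String::new() }
    }

    /// Feed one chunk; returns the text that is safe to emit now.
    fn push(&mut self, chunk: &str) -> String {
        let mut buf = std::mem::take(&mut self.tail);
        buf.push_str(chunk);
        // Hold back a short trailing "<..." that has no '>' yet.
        let split = match buf.rfind('<') {
            Some(p) if !buf[p..].contains('>') && buf.len() - p <= MAX_TAIL_BYTES => p,
            _ => buf.len(),
        };
        self.tail = buf[split..].to_string();
        replace_placeholders(&buf[..split], &self.map)
    }

    /// End of stream: flush whatever is still held instead of dropping it.
    fn finish(&mut self) -> String {
        let rest = std::mem::take(&mut self.tail);
        replace_placeholders(&rest, &self.map)
    }
}
```

With a map containing `<email1>`, pushing `"mail <em"` emits only `"mail "`, and the following chunk `"ail1> now"` completes the token so the original value is emitted in its place.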
Fails closed: if the PII detector is unavailable the request is
rejected with 503 `auto_redact_unavailable` rather than degrading
silently to send raw PII to the provider.
New module: services::auto_redact
- placeholders.rs: bidirectional placeholder ↔ original map with
monotonic per-category ordinals, dedup of repeated PII, and
collision-safe minting when the user's own text contains a
<categoryN>-shaped literal.
- detect.rs: pool.privacy_classify invocation + response parse.
- apply.rs: walks CompletionMessage content (string + content-parts
arrays), redacts spans, writes back.
- stream_unredact.rs: sliding 16-byte tail buffer + regex
`<[a-z_]+\d+>` for streaming replacement that never splits a
placeholder across an emitted chunk.
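The placeholder map described for placeholders.rs might be sketched as follows. This is an illustration only: the struct `PlaceholderMap`, its fields, and the methods `mint` and `original` are hypothetical names, and checking for collisions with `str::contains` against the full prompt is an assumed strategy rather than the PR's confirmed one.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct PlaceholderMap {
    next_ordinal: HashMap<String, u32>,      // category -> last ordinal used
    by_original: HashMap<String, String>,    // original PII -> placeholder
    by_placeholder: HashMap<String, String>, // placeholder -> original PII
}

impl PlaceholderMap {
    /// Returns the placeholder for `original`, minting a new one if needed.
    /// `user_text` is the prompt text, consulted only to avoid collisions.
    fn mint(&mut self, category: &str, original: &str, user_text: &str) -> String {
        // Dedup: repeated PII reuses the placeholder minted earlier.
        if let Some(existing) = self.by_original.get(original) {
            return existing.clone();
        }
        let counter = self.next_ordinal.entry(category.to_string()).or_insert(0);
        loop {
            *counter += 1; // ordinals only ever increase within a category
            let candidate = format!("<{category}{counter}>");
            // Skip ordinals whose literal already appears in the user's text.
            if !user_text.contains(&candidate) {
                self.by_original.insert(original.to_string(), candidate.clone());
                self.by_placeholder.insert(candidate.clone(), original.to_string());
                return candidate;
            }
        }
    }

    /// Reverse lookup used when un-redacting the response.
    fn original(&self, placeholder: &str) -> Option<&str> {
        self.by_placeholder.get(placeholder).map(String::as_str)
    }
}
```

Keeping both directions in one structure means the same per-request object can serve the redact pass on the way in and the un-redact pass on the way out.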
Scope decisions:
- /v1/completions left out: handler currently returns 501. The
maybe_redact helper is generic over ServiceCompletionRequest so the
wire-in is a 5-line copy when that handler is enabled.
- /v1/responses out of scope for v1 per design; stored-conversation
semantics need their own decision (PR description).
- No category opt-out filter (all-or-nothing).
Tests: 34 unit + 7 e2e covering header/body activation parity,
streaming chunk splits, fail-closed on missing detector, body-field
stripping, multi-PII (email/phone/SSN) round-trip, and PII passthrough
when off.
Builds on /v1/privacy/classify (#584). The MockProvider's
privacy_classify_raw now does shape-based PII detection (email, SSN,
phone) so e2e tests exercise the full redact→provider→unredact loop
without a live privacy-filter model.
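As a rough idea of what shape-based detection in a mock can mean, the sketch below finds SSN-shaped substrings (`ddd-dd-dddd`) and returns byte-offset spans. The function name `find_ssn_spans` and the span representation are assumptions for illustration; the actual MockProvider logic is not shown in this PR text and may differ.

```rust
/// Hypothetical mock-style detector: returns (start, end) byte offsets of
/// every non-overlapping substring shaped like `ddd-dd-dddd`.
fn find_ssn_spans(text: &str) -> Vec<(usize, usize)> {
    let bytes = text.as_bytes();
    let shape = b"ddd-dd-dddd"; // 'd' stands for any ASCII digit
    let mut spans = Vec::new();
    let mut i = 0;
    while i + shape.len() <= bytes.len() {
        let window = &bytes[i..i + shape.len()];
        let matches = window.iter().zip(shape).all(|(&b, &s)| match s {
            b'd' => b.is_ascii_digit(),
            _ => b == s,
        });
        if matches {
            // Matched bytes are all ASCII, so both offsets are char boundaries.
            spans.push((i, i + shape.len()));
            i += shape.len();
        } else {
            i += 1;
        }
    }
    spans
}
```

This deliberately ignores surrounding context (for example a longer digit run), which is acceptable for a test double but not for a real detector.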
* fix: address review feedback on auto-redact (#585)
Critical / blocking fixes:
- redact_one is now fail-closed on malformed input. Previously a span
whose offsets fell inside a multi-byte UTF-8 sequence or were
out-of-range silently passed the original text through — leaking PII
upstream when redaction was explicitly requested. Now returns
AutoRedactError::Internal, propagated to a 500 in the handler.
(gemini security-critical, Copilot)
- Streaming un-redact state is now keyed by choice index (HashMap<i64,
StreamUnredact>) rather than a single shared instance, with separate
maps for content / reasoning_content / reasoning. For n>1 completions
the provider may interleave chunks across choice indices, which would
have cross-contaminated the sliding 16-byte tail. (gemini high)
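The fail-closed behaviour in the first fix above can be sketched with `str::is_char_boundary`. The names `redact_span` and `RedactError` are hypothetical (the commit only names `redact_one` and `AutoRedactError::Internal`), and the sketch assumes spans are byte offsets into the text.

```rust
#[derive(Debug, PartialEq)]
enum RedactError {
    OutOfRange,
    NotCharBoundary,
}

/// Replace `text[start..end]` with `placeholder`, refusing (rather than
/// passing the original text through) when the span is malformed.
fn redact_span(
    text: &str,
    start: usize,
    end: usize,
    placeholder: &str,
) -> Result<String, RedactError> {
    if start > end || end > text.len() {
        return Err(RedactError::OutOfRange);
    }
    // Offsets inside a multi-byte UTF-8 sequence would panic on slicing,
    // so they are rejected explicitly.
    if !text.is_char_boundary(start) || !text.is_char_boundary(end) {
        return Err(RedactError::NotCharBoundary);
    }
    let mut out = String::with_capacity(text.len() - (end - start) + placeholder.len());
    out.push_str(&text[..start]);
    out.push_str(placeholder);
    out.push_str(&text[end..]);
    Ok(out)
}
```

Returning an error here lets the caller reject the request instead of forwarding unredacted text, which is the point of failing closed.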
High-value fixes:
- End-of-stream flush: any bytes still held in a tail buffer when the
upstream stream ends (e.g. mid-placeholder) are now emitted as a
synthetic SSE chunk before [DONE], not silently dropped. (Copilot)
- The per-chunk `tracing::debug!("Completion stream event: ...")` line
is suppressed when auto_redact_enabled. After un-redact the chunk
holds the user's original PII; routing it to logs would defeat the
privacy guarantee. (Copilot)
Other fixes:
- detect.rs: extend rather than overwrite when the detector returns
multiple `data` entries for the same index. (gemini medium)
- Skip non-streaming response re-serialize when the redaction map is
empty (request opted in but no PII detected). Preserves the
raw_bytes/signing path for clean inputs. (Copilot)
- Doc comment in mod.rs no longer references a non-existent
`docs/auto-redact.md`. (Copilot nit)
- MAX_PLACEHOLDER_LEN comment example matches what's actually minted
(placeholder_prefix("account_number") -> "account", not
"account_number"). (Copilot nit)
- Span.text field removed (it was only carried through from the
detector response and never read; dead-code warning).
- MockProvider's privacy_classify_raw keeps usage.input_tokens=10 for
backward compat with the privacy_classify e2e test from #584; only
the spans field is computed from the input.
Tests:
- New unit tests in apply.rs for fail-closed-on-non-char-boundary and
fail-closed-on-out-of-range-span.
- New e2e test
auto_redact_skips_response_munging_when_no_pii_detected covers the
empty-map short-circuit.
- All 260 unit + 8 auto_redact e2e + 7 privacy_classify e2e pass.
Summary
- `POST /v1/privacy/classify` as a raw passthrough to a backend token-classification model (e.g. `openai/privacy-filter`)
- `embeddings_raw` pattern: only the `model` field is deserialized for routing; the request body and response bytes are forwarded unchanged
- `privacy_classify_raw` trait method through `InferenceProvider` (vLLM impl + external/mock stubs), the inference provider pool, and `CompletionServiceTrait` with concurrent-slot guarding
- `InferenceType::PrivacyClassify` for usage tracking, billed as `input_tokens × input_cost_per_token` (summed from `data[*].usage.input_tokens` in the provider response), same semantics as embedding/rerank

Unblocks nearai/infra#86. The `/v1/redact` endpoint (nearai/infra#87) is a separate concern and will land in a follow-up PR.

The privacy-filter model itself is already deployed at port 8007 on gpu04 and registered at `privacy-filter.completions.near.ai` (see `cvm-conf/small-models.yaml`). The model still needs to be added to the cloud-api models DB via `POST /v1/admin/models` before this endpoint can be exercised end-to-end against the live backend.

Test plan
- `cargo check --workspace --all-targets`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo fmt --all -- --check`
- `cargo test --lib --bins -p inference_providers -p services -p api` (224 + 60 unit tests pass; `test_openapi_spec_generation` covers the new path)
- […] `openai/privacy-filter` into cloud-api DB with `input_modalities=["text"]`, `output_modalities=["text"]`, `provider_type=vllm`, `inference_url=https://privacy-filter.completions.near.ai`, then `curl -X POST $CLOUD_API/v1/privacy/classify -H 'Authorization: Bearer sk-...' -d '{"model":"openai/privacy-filter","input":"my SSN is 123-45-6789","threshold":0.5}'`