US-21.3: Cost Aggregation Oban Workers & Anomaly Detection#13
Merged
Conversation
Add cost_summaries and cost_anomalies tables with RLS policies, CostRollupWorker (daily cron, analytics queue) with config-based DI via RollupBehaviour/DefaultRollup, CostAnomalyWorker chained after rollup for high_cost/suspiciously_low detection, and REST endpoints GET/PATCH /api/v1/cost-anomalies for anomaly management.
1. Add audit logging to resolve_anomaly (was missing unlike all other mutation functions in the TokenUsage context) 2. Include period_end in CostRollupWorker upsert replace list (was silently dropped on re-runs with different period_end) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mkreyman
added a commit
that referenced
this pull request
Jul 4, 2026
…aker, route worker/proposal-gate through it, sanitize provider errors, permanent-error/idempotent embed, keyless degrade Addresses the 15 BYO-introduced regressions from the enhanced review on #295. Circuit-breaker cluster: - The embedding circuit breaker was process-GLOBAL (keyed only by :failures / :circuit_open_until). Now keyed PER-TENANT ({tenant_id, :failures} / :window_start / :open_until) so a failing tenant only degrades ITSELF — one tenant's 401/429/timeout can no longer open the shared breaker and degrade every other tenant's semantic search (CRIT #1). Failures are CLASSIFIED: only 5xx / transport / timeout / crash count; 4xx (401/403/429) and :no_api_key are per-tenant credential/quota problems and are exempt. Increment is now atomic via :ets.update_counter (MED #13). - ArticleEmbeddingWorker and ProposalGate no longer call the client directly; both route through the guarded Knowledge.generate_embedding/3 so circuit-open is respected AND contributed to (HIGH #4). ProposalGate ran synchronously in the HTTP write path — it now fast-fails on outage instead of hammering a dead provider (LB-timeout risk). Provider-error-body key-leak cluster (one shared helper, both vendors): - New Loopctl.Llm.ProviderError.sanitize/1 + log_tag/1 drop any provider response body (which echoes a masked key fragment, e.g. "Incorrect API key provided: sk-...ZXY") — keeping only the status. Applied in EmbeddingClient.post (HIGH #3) and Anthropic.post (MED #11), and ProposalGate now logs a value-free tag instead of inspecting the raw tuple (CRIT #2). Stops the fragment reaching logs and oban_jobs.errors (which is surfaced to tenants via list_extraction_errors). Retry / idempotency / classification: - Loopctl.Llm.permanent_provider_error?/1: a 4xx (except 408/429) is permanent; ArticleEmbeddingWorker {:discard, {:embedding_permanent_error, _}}s a bad/revoked-key embed instead of burning retries (HIGH #5). - New articles.embedding_content_hash column (migration) records the SHA-256 of the embedded text; the worker skips re-calling the paid provider on retry when the stored hash matches the current content (MED #12). Keyless UX + audit + timeout + schema: - mode=semantic degrades to keyword-only (fallback: true) for a keyless tenant instead of 503 (MED #8). - Llm.record_blocked(tenant_id, :embedding) is emitted from the worker skip and ProposalGate's fall-open, matching the Anthropic audit trail (MED #9). - The embedding client's single-attempt budget (4s, no client retries) is kept strictly below Knowledge's 5s Task.yield so a slow-but-valid embed isn't killed-then-miscounted (MED #10). - "embedding" added to the LlmUsageResponse operation enum (MED #7). Tests: per-tenant breaker (:no_api_key + 4xx exempt; 5xx opens; the two-tenant no-cross-degrade repro); permanent-error discard vs transient retry; idempotent skip on re-run; keyless semantic degrade; success-path no-key-leak; a realistic OpenAI 401 body dropped from the error term + logs; skip telemetry asserted. Full precommit green (compile --warnings-as-errors, format, credo --strict, dialyzer 0, 3638 tests / 0 failures) + 73 mcp node tests.
mkreyman
added a commit
that referenced
this pull request
Jul 4, 2026
…r-classed circuit breaker, provider-error sanitizer, permanent-error/idempotent embed, keyless degrade (#296) * fix(embeddings): tenant-scope + error-class the embedding circuit breaker, route worker/proposal-gate through it, sanitize provider errors, permanent-error/idempotent embed, keyless degrade Addresses the 15 BYO-introduced regressions from the enhanced review on #295. Circuit-breaker cluster: - The embedding circuit breaker was process-GLOBAL (keyed only by :failures / :circuit_open_until). Now keyed PER-TENANT ({tenant_id, :failures} / :window_start / :open_until) so a failing tenant only degrades ITSELF — one tenant's 401/429/timeout can no longer open the shared breaker and degrade every other tenant's semantic search (CRIT #1). Failures are CLASSIFIED: only 5xx / transport / timeout / crash count; 4xx (401/403/429) and :no_api_key are per-tenant credential/quota problems and are exempt. Increment is now atomic via :ets.update_counter (MED #13). - ArticleEmbeddingWorker and ProposalGate no longer call the client directly; both route through the guarded Knowledge.generate_embedding/3 so circuit-open is respected AND contributed to (HIGH #4). ProposalGate ran synchronously in the HTTP write path — it now fast-fails on outage instead of hammering a dead provider (LB-timeout risk). Provider-error-body key-leak cluster (one shared helper, both vendors): - New Loopctl.Llm.ProviderError.sanitize/1 + log_tag/1 drop any provider response body (which echoes a masked key fragment, e.g. "Incorrect API key provided: sk-...ZXY") — keeping only the status. Applied in EmbeddingClient.post (HIGH #3) and Anthropic.post (MED #11), and ProposalGate now logs a value-free tag instead of inspecting the raw tuple (CRIT #2). Stops the fragment reaching logs and oban_jobs.errors (which is surfaced to tenants via list_extraction_errors). Retry / idempotency / classification: - Loopctl.Llm.permanent_provider_error?/1: a 4xx (except 408/429) is permanent; ArticleEmbeddingWorker {:discard, {:embedding_permanent_error, _}}s a bad/revoked-key embed instead of burning retries (HIGH #5). - New articles.embedding_content_hash column (migration) records the SHA-256 of the embedded text; the worker skips re-calling the paid provider on retry when the stored hash matches the current content (MED #12). Keyless UX + audit + timeout + schema: - mode=semantic degrades to keyword-only (fallback: true) for a keyless tenant instead of 503 (MED #8). - Llm.record_blocked(tenant_id, :embedding) is emitted from the worker skip and ProposalGate's fall-open, matching the Anthropic audit trail (MED #9). - The embedding client's single-attempt budget (4s, no client retries) is kept strictly below Knowledge's 5s Task.yield so a slow-but-valid embed isn't killed-then-miscounted (MED #10). - "embedding" added to the LlmUsageResponse operation enum (MED #7). Also adds Knowledge.reset_circuit_breaker/1 (tenant-scoped) and switches the breaker tests to it: the tenant-scoped ETS breaker made the old global delete_all_objects reset race concurrent async tests (a wiped counter = a flaky "breaker didn't open"), so per-tenant reset keeps async isolation. Tests: per-tenant breaker (:no_api_key + 4xx exempt; 5xx opens; the two-tenant no-cross-degrade repro); permanent-error discard vs transient retry; idempotent skip on re-run; keyless semantic degrade; success-path no-key-leak; a realistic OpenAI 401 body dropped from the error term + logs; skip telemetry asserted. Follow-up to #295 (squash-merged to master before this review pass landed), so it ships as its own PR rather than an update to that (now-merged) branch. Full precommit green (compile --warnings-as-errors, format, credo --strict, dialyzer 0, full test suite green) + 73 mcp node tests. * fix(embeddings): complete provider-error sanitization + atomic breaker window + supervised breaker table + keyless-embedding log text Addresses the 6 findings from the expedited re-review of #296 (1 Critical, 4 Medium, 1 Low). master ships the round-1 regressions, so this must merge clean. Provider-error sanitization (complete the half-applied #11): - CRIT #1: Anthropic.post's 200-UNEXPECTED-SHAPE branch still returned the raw body via redact_body/1 (a no-op for map bodies). A misconfigured/compromised endpoint returning HTTP 200 with an error-shaped body echoing a masked key fragment leaked it into oban_jobs.errors (surfaced by GET /knowledge/pipeline). Now sanitized like every other branch: {:error, {:api_error, 200, :provider_error}}. redact_body/1 DELETED. - MED #4: Anthropic.post's transport-error branch logged inspect(reason); now uses ProviderError.log_tag/1 (value-free), matching EmbeddingClient. - MED #5: content_ingestion_worker.finalize_errors/4 and review_knowledge_worker.classify_result/1 (and ArticleEmbeddingWorker) now route the error term through ProviderError.sanitize/1 BEFORE it becomes an Oban discard/error reason — defense-in-depth at every worker boundary. - LOW #6: ProviderError.sanitize/1 is now total and value-free for the key-bearing shapes: {:api_error,_,body}, a non-atom {:request_failed,_}, {:embedding_crash,_}, and a BARE body (string or map) all collapse; a bare atom and a STRUCTURED domain-error tuple (e.g. {:insert_failed, step, changeset}) are provably key-free and preserved (collapsing them would replace a precise operator reason with a misleading :provider_error). The API key only ever reaches sanitize inside those provider shapes. Breaker correctness: - MED #2: the per-tenant failure counter is now keyed by the WINDOW PERIOD ({tenant_id, :failures, div(now, window)}) so an expired window is a DISTINCT key — the non-atomic check-then-reset in maybe_reset_window is gone, and a burst of concurrent failures for one tenant counts atomically to the threshold. - The re-review's required concurrent test exposed a latent OWNERSHIP bug: the breaker ETS table was created lazily by whatever transient process (request / Oban job / Task) first touched it and VANISHED when that process died — silently resetting the breaker and crashing concurrent peers mid-update_counter. Added a supervised owner (Loopctl.Knowledge.EmbeddingCircuitBreaker, mirroring ExportConcurrency) so the table has a stable, node-lifetime owner. Correct log text: - MED #3: Llm.record_blocked/2 hard-coded "no Anthropic API key configured" even for :embedding. Now branches: "no embedding API key configured" for :embedding. Tests: ProviderError sanitize/log_tag across every shape (incl. the bare-body #6 cases + structured-tuple preservation); Anthropic 200-shape + non-200 exact sanitized value with a key-echoing body + no-leak; a Task.async_stream concurrent one-tenant failure burst opening the breaker reliably; worker-boundary sanitization in review + ingestion + embedding workers (key-bearing api_error -> value-free Oban reason); record_blocked log naming the right credential for both paths. Full precommit green (compile --warnings-as-errors, format, credo --strict, dialyzer 0, 3652 tests / 0 failures) + 73 mcp node tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
cost_summariesandcost_anomaliestables with PostgreSQL RLS policies and CHECK constraintsCostRollupWorker(daily cron at 02:00 UTC,analyticsqueue) that aggregates token usage reports into cost summaries by agent/epic/project scope, with idempotent UPSERT keyed on(tenant_id, scope_type, scope_id, period_start)CostAnomalyWorkerchained after rollup, detecting stories with cost >3x (high_cost) or <0.1x (suspiciously_low) the epic averageRollupBehaviour+DefaultRollupwith config-based DI (Loopctl.MockCostRollupin test)GET /api/v1/cost-anomalies(filterable by anomaly_type, project_id) andPATCH /api/v1/cost-anomalies/:id(resolve anomaly), both requiringuserrolefixture(:cost_summary),fixture(:cost_anomaly)Test plan
Generated with Claude Code