Skip to content

US-21.3: Cost Aggregation Oban Workers & Anomaly Detection#13

Merged
mkreyman merged 2 commits into
masterfrom
story/21.3-cost-rollup-anomaly-workers
Apr 3, 2026
Merged

US-21.3: Cost Aggregation Oban Workers & Anomaly Detection#13
mkreyman merged 2 commits into
masterfrom
story/21.3-cost-rollup-anomaly-workers

Conversation

@mkreyman

@mkreyman mkreyman commented Apr 3, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add cost_summaries and cost_anomalies tables with PostgreSQL RLS policies and CHECK constraints
  • Implement CostRollupWorker (daily cron at 02:00 UTC, analytics queue) that aggregates token usage reports into cost summaries by agent/epic/project scope, with idempotent UPSERT keyed on (tenant_id, scope_type, scope_id, period_start)
  • Implement CostAnomalyWorker chained after rollup, detecting stories with cost >3x (high_cost) or <0.1x (suspiciously_low) the epic average
  • Add RollupBehaviour + DefaultRollup with config-based DI (Loopctl.MockCostRollup in test)
  • Add REST endpoints: GET /api/v1/cost-anomalies (filterable by anomaly_type, project_id) and PATCH /api/v1/cost-anomalies/:id (resolve anomaly), both requiring user role
  • New fixtures: fixture(:cost_summary), fixture(:cost_anomaly)

Test plan

  • CostSummary schema changeset validation (required fields, enum values, non-negative constraints, unique index)
  • DefaultRollup integration tests (agent/epic/project aggregation, model breakdown, tenant isolation)
  • CostRollupWorker tests (mocked rollup service, idempotent upsert, multi-tenant, error handling, skips suspended tenants)
  • CostAnomalyWorker tests (high_cost detection, suspiciously_low detection, normal range no-flag, updates existing anomalies)
  • TokenUsage context tests (list/get/resolve anomalies, filtering, pagination, tenant isolation)
  • CostAnomalyController tests (GET list with filters, PATCH resolve, role enforcement, tenant isolation)
  • 46 new tests, all passing

Generated with Claude Code

mkreyman and others added 2 commits April 3, 2026 16:20
Add cost_summaries and cost_anomalies tables with RLS policies,
CostRollupWorker (daily cron, analytics queue) with config-based DI
via RollupBehaviour/DefaultRollup, CostAnomalyWorker chained after
rollup for high_cost/suspiciously_low detection, and REST endpoints
GET/PATCH /api/v1/cost-anomalies for anomaly management.
1. Add audit logging to resolve_anomaly (was missing unlike all other
   mutation functions in the TokenUsage context)
2. Include period_end in CostRollupWorker upsert replace list (was
   silently dropped on re-runs with different period_end)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mkreyman mkreyman merged commit e21ae00 into master Apr 3, 2026
6 checks passed
@mkreyman mkreyman deleted the story/21.3-cost-rollup-anomaly-workers branch April 3, 2026 22:32
mkreyman added a commit that referenced this pull request Jul 4, 2026
…aker, route worker/proposal-gate through it, sanitize provider errors, permanent-error/idempotent embed, keyless degrade

Addresses the 15 BYO-introduced regressions from the enhanced review on #295.

Circuit-breaker cluster:
- The embedding circuit breaker was process-GLOBAL (keyed only by :failures /
  :circuit_open_until). Now keyed PER-TENANT ({tenant_id, :failures} /
  :window_start / :open_until) so a failing tenant only degrades ITSELF — one
  tenant's 401/429/timeout can no longer open the shared breaker and degrade every
  other tenant's semantic search (CRIT #1). Failures are CLASSIFIED: only 5xx /
  transport / timeout / crash count; 4xx (401/403/429) and :no_api_key are
  per-tenant credential/quota problems and are exempt. Increment is now atomic via
  :ets.update_counter (MED #13).
- ArticleEmbeddingWorker and ProposalGate no longer call the client directly; both
  route through the guarded Knowledge.generate_embedding/3 so circuit-open is
  respected AND contributed to (HIGH #4). ProposalGate ran synchronously in the
  HTTP write path — it now fast-fails on outage instead of hammering a dead
  provider (LB-timeout risk).

Provider-error-body key-leak cluster (one shared helper, both vendors):
- New Loopctl.Llm.ProviderError.sanitize/1 + log_tag/1 drop any provider response
  body (which echoes a masked key fragment, e.g. "Incorrect API key provided:
  sk-...ZXY") — keeping only the status. Applied in EmbeddingClient.post (HIGH #3)
  and Anthropic.post (MED #11), and ProposalGate now logs a value-free tag instead
  of inspecting the raw tuple (CRIT #2). Stops the fragment reaching logs and
  oban_jobs.errors (which is surfaced to tenants via list_extraction_errors).

Retry / idempotency / classification:
- Loopctl.Llm.permanent_provider_error?/1: a 4xx (except 408/429) is permanent;
  ArticleEmbeddingWorker {:discard, {:embedding_permanent_error, _}}s a
  bad/revoked-key embed instead of burning retries (HIGH #5).
- New articles.embedding_content_hash column (migration) records the SHA-256 of the
  embedded text; the worker skips re-calling the paid provider on retry when the
  stored hash matches the current content (MED #12).

Keyless UX + audit + timeout + schema:
- mode=semantic degrades to keyword-only (fallback: true) for a keyless tenant
  instead of 503 (MED #8).
- Llm.record_blocked(tenant_id, :embedding) is emitted from the worker skip and
  ProposalGate's fall-open, matching the Anthropic audit trail (MED #9).
- The embedding client's single-attempt budget (4s, no client retries) is kept
  strictly below Knowledge's 5s Task.yield so a slow-but-valid embed isn't
  killed-then-miscounted (MED #10).
- "embedding" added to the LlmUsageResponse operation enum (MED #7).

Tests: per-tenant breaker (:no_api_key + 4xx exempt; 5xx opens; the two-tenant
no-cross-degrade repro); permanent-error discard vs transient retry; idempotent
skip on re-run; keyless semantic degrade; success-path no-key-leak; a realistic
OpenAI 401 body dropped from the error term + logs; skip telemetry asserted.

Full precommit green (compile --warnings-as-errors, format, credo --strict,
dialyzer 0, 3638 tests / 0 failures) + 73 mcp node tests.
mkreyman added a commit that referenced this pull request Jul 4, 2026
…r-classed circuit breaker, provider-error sanitizer, permanent-error/idempotent embed, keyless degrade (#296)

* fix(embeddings): tenant-scope + error-class the embedding circuit breaker, route worker/proposal-gate through it, sanitize provider errors, permanent-error/idempotent embed, keyless degrade

Addresses the 15 BYO-introduced regressions from the enhanced review on #295.

Circuit-breaker cluster:
- The embedding circuit breaker was process-GLOBAL (keyed only by :failures /
  :circuit_open_until). Now keyed PER-TENANT ({tenant_id, :failures} /
  :window_start / :open_until) so a failing tenant only degrades ITSELF — one
  tenant's 401/429/timeout can no longer open the shared breaker and degrade every
  other tenant's semantic search (CRIT #1). Failures are CLASSIFIED: only 5xx /
  transport / timeout / crash count; 4xx (401/403/429) and :no_api_key are
  per-tenant credential/quota problems and are exempt. Increment is now atomic via
  :ets.update_counter (MED #13).
- ArticleEmbeddingWorker and ProposalGate no longer call the client directly; both
  route through the guarded Knowledge.generate_embedding/3 so circuit-open is
  respected AND contributed to (HIGH #4). ProposalGate ran synchronously in the
  HTTP write path — it now fast-fails on outage instead of hammering a dead
  provider (LB-timeout risk).

Provider-error-body key-leak cluster (one shared helper, both vendors):
- New Loopctl.Llm.ProviderError.sanitize/1 + log_tag/1 drop any provider response
  body (which echoes a masked key fragment, e.g. "Incorrect API key provided:
  sk-...ZXY") — keeping only the status. Applied in EmbeddingClient.post (HIGH #3)
  and Anthropic.post (MED #11), and ProposalGate now logs a value-free tag instead
  of inspecting the raw tuple (CRIT #2). Stops the fragment reaching logs and
  oban_jobs.errors (which is surfaced to tenants via list_extraction_errors).

Retry / idempotency / classification:
- Loopctl.Llm.permanent_provider_error?/1: a 4xx (except 408/429) is permanent;
  ArticleEmbeddingWorker {:discard, {:embedding_permanent_error, _}}s a
  bad/revoked-key embed instead of burning retries (HIGH #5).
- New articles.embedding_content_hash column (migration) records the SHA-256 of the
  embedded text; the worker skips re-calling the paid provider on retry when the
  stored hash matches the current content (MED #12).

Keyless UX + audit + timeout + schema:
- mode=semantic degrades to keyword-only (fallback: true) for a keyless tenant
  instead of 503 (MED #8).
- Llm.record_blocked(tenant_id, :embedding) is emitted from the worker skip and
  ProposalGate's fall-open, matching the Anthropic audit trail (MED #9).
- The embedding client's single-attempt budget (4s, no client retries) is kept
  strictly below Knowledge's 5s Task.yield so a slow-but-valid embed isn't
  killed-then-miscounted (MED #10).
- "embedding" added to the LlmUsageResponse operation enum (MED #7).

Also adds Knowledge.reset_circuit_breaker/1 (tenant-scoped) and switches the breaker
tests to it: the tenant-scoped ETS breaker made the old global delete_all_objects
reset race concurrent async tests (a wiped counter = a flaky "breaker didn't open"),
so per-tenant reset keeps async isolation.

Tests: per-tenant breaker (:no_api_key + 4xx exempt; 5xx opens; the two-tenant
no-cross-degrade repro); permanent-error discard vs transient retry; idempotent
skip on re-run; keyless semantic degrade; success-path no-key-leak; a realistic
OpenAI 401 body dropped from the error term + logs; skip telemetry asserted.

Follow-up to #295 (squash-merged to master before this review pass landed), so it
ships as its own PR rather than an update to that (now-merged) branch.

Full precommit green (compile --warnings-as-errors, format, credo --strict,
dialyzer 0, full test suite green) + 73 mcp node tests.

* fix(embeddings): complete provider-error sanitization + atomic breaker window + supervised breaker table + keyless-embedding log text

Addresses the 6 findings from the expedited re-review of #296 (1 Critical, 4
Medium, 1 Low). master ships the round-1 regressions, so this must merge clean.

Provider-error sanitization (complete the half-applied #11):
- CRIT #1: Anthropic.post's 200-UNEXPECTED-SHAPE branch still returned the raw body
  via redact_body/1 (a no-op for map bodies). A misconfigured/compromised endpoint
  returning HTTP 200 with an error-shaped body echoing a masked key fragment leaked
  it into oban_jobs.errors (surfaced by GET /knowledge/pipeline). Now sanitized like
  every other branch: {:error, {:api_error, 200, :provider_error}}. redact_body/1
  DELETED.
- MED #4: Anthropic.post's transport-error branch logged inspect(reason); now uses
  ProviderError.log_tag/1 (value-free), matching EmbeddingClient.
- MED #5: content_ingestion_worker.finalize_errors/4 and
  review_knowledge_worker.classify_result/1 (and ArticleEmbeddingWorker) now route
  the error term through ProviderError.sanitize/1 BEFORE it becomes an Oban
  discard/error reason — defense-in-depth at every worker boundary.
- LOW #6: ProviderError.sanitize/1 is now total and value-free for the key-bearing
  shapes: {:api_error,_,body}, a non-atom {:request_failed,_}, {:embedding_crash,_},
  and a BARE body (string or map) all collapse; a bare atom and a STRUCTURED
  domain-error tuple (e.g. {:insert_failed, step, changeset}) are provably key-free
  and preserved (collapsing them would replace a precise operator reason with a
  misleading :provider_error). The API key only ever reaches sanitize inside those
  provider shapes.

Breaker correctness:
- MED #2: the per-tenant failure counter is now keyed by the WINDOW PERIOD
  ({tenant_id, :failures, div(now, window)}) so an expired window is a DISTINCT key
  — the non-atomic check-then-reset in maybe_reset_window is gone, and a burst of
  concurrent failures for one tenant counts atomically to the threshold.
- The re-review's required concurrent test exposed a latent OWNERSHIP bug: the
  breaker ETS table was created lazily by whatever transient process (request / Oban
  job / Task) first touched it and VANISHED when that process died — silently
  resetting the breaker and crashing concurrent peers mid-update_counter. Added a
  supervised owner (Loopctl.Knowledge.EmbeddingCircuitBreaker, mirroring
  ExportConcurrency) so the table has a stable, node-lifetime owner.

Correct log text:
- MED #3: Llm.record_blocked/2 hard-coded "no Anthropic API key configured" even for
  :embedding. Now branches: "no embedding API key configured" for :embedding.

Tests: ProviderError sanitize/log_tag across every shape (incl. the bare-body #6
cases + structured-tuple preservation); Anthropic 200-shape + non-200 exact
sanitized value with a key-echoing body + no-leak; a Task.async_stream concurrent
one-tenant failure burst opening the breaker reliably; worker-boundary sanitization
in review + ingestion + embedding workers (key-bearing api_error -> value-free Oban
reason); record_blocked log naming the right credential for both paths.

Full precommit green (compile --warnings-as-errors, format, credo --strict,
dialyzer 0, 3652 tests / 0 failures) + 73 mcp node tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant