Corroborating signals (ADR 021): strengthen groups without firing mitigations #109
Introduces two new optional fields on each entry in correlation.yaml's
sources map, laying the groundwork for corroborating-only signals (ADR 021):
- `mode`: "primary" (default) or "corroborating". Primary sources can
trigger mitigations on their own; corroborating sources can only
strengthen existing signal groups.
- `match_dimensions`: list drawn from {customer_id, pop, service_id,
interface}. Required and non-empty when mode=corroborating; must be
empty for primary sources.
Zero-config upgrade: omitting both fields defaults to mode=primary with
no match_dimensions, which is exactly the v0.15.0 behavior. The redacted
config API now exposes mode and match_dimensions; the validator rejects
misuse with clear errors on PUT /v1/config/correlation.
No engine or handler changes in this commit — the fields are wired
through config, serialization, validation, and helpers (source_mode,
match_dimensions), with 8 new unit tests. The engine work lands in the
next commit.
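As a rough illustration of the validation rule described above, the following is a minimal sketch rather than the project's actual code. `SourceConfig` here is a simplified, hypothetical stand-in for one entry in the `sources` map, and `validate_source` is an assumed helper name; only the two rules (corroborating needs at least one dimension, primary must have none) come from the commit message.

```rust
use std::collections::BTreeSet;

/// How a source participates in correlation (names mirror the YAML values).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceMode {
    Primary,
    Corroborating,
}

/// The four dimensions a corroborating source may declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MatchDimension {
    CustomerId,
    Pop,
    ServiceId,
    Interface,
}

/// Hypothetical, simplified stand-in for one `sources` entry.
#[derive(Debug, Clone)]
pub struct SourceConfig {
    pub mode: SourceMode,
    pub match_dimensions: BTreeSet<MatchDimension>,
}

impl Default for SourceConfig {
    /// Omitting both fields yields a primary source with no dimensions.
    fn default() -> Self {
        SourceConfig {
            mode: SourceMode::Primary,
            match_dimensions: BTreeSet::new(),
        }
    }
}

/// Rejects the two misuse cases named in the commit message.
pub fn validate_source(name: &str, cfg: &SourceConfig) -> Result<(), String> {
    match (cfg.mode, cfg.match_dimensions.is_empty()) {
        (SourceMode::Corroborating, true) => Err(format!(
            "source '{}': mode=corroborating requires at least one match dimension",
            name
        )),
        (SourceMode::Primary, false) => Err(format!(
            "source '{}': match_dimensions must be empty when mode=primary",
            name
        )),
        _ => Ok(()),
    }
}
```

Matching on the `(mode, is_empty)` pair keeps both rejection cases visible in one place; the default value is what makes the upgrade zero-config.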
Implements the runtime for ADR 021's corroborating signal mode.
New HTTP endpoint:
- POST /v1/signals/corroborator (authenticated)
Accepts a dimension-tagged signal (no victim_ip required) from sources configured as `mode: corroborating`. If an open signal group matches on any declared dimension (OR semantics), the signal is attached and contributes to derived_confidence. Otherwise it's cached for up to window_seconds and drained when a matching primary event arrives.
Data model:
- New migration 009: adds `signal_groups.primary_dimensions` JSONB, `signal_group_events.is_corroborating`/corroborator_source/confidence, and `corroborating_signals` floating cache table.
- New domain types: `CorroboratingSignal`, `EventDimensions`, `PrimaryDimensions`.
- Eight new `RepositoryTrait` methods, implemented for both Postgres and the Mock. Postgres uses JSONB for group dimensions and a dedicated cache table with an expires_at index.
Engine:
- `corroborator_matches` decides attachment using OR over dimensions with an optional vector narrower.
- `check_corroboration_with_primary` enforces the ADR 021 invariant that a group composed entirely of corroborators can never trigger a mitigation; at least one primary event is required.
Handlers:
- /v1/events now rejects posts from sources in mode=corroborating with clear 400 "use /v1/signals/corroborator instead" message.
- /v1/events populates the group's primary_dimensions from settings.pop and the inventory-resolved customer_id / service_id, unioning on each ingest so late primaries contribute too.
- On primary ingest, the handler drains any cached corroborators whose dimensions now match, attaches them, and recomputes group aggregates.
Scheduler:
- Reconcile loop now sweeps expired corroborating signals via `delete_expired_corroborating_signals` and increments CORROBORATOR_EXPIRED_TOTAL per removal.
Observability:
- Three new metrics: prefixd_corroborator_ingested_total, prefixd_corroborator_attached_total, prefixd_corroborator_expired_total (all labelled by source).
Tests:
- 9 new engine unit tests covering matching OR semantics, vector filtering, has_any_dimension, and check_corroboration_with_primary.
- 6 new integration tests: primary source rejected from corroborator endpoint, missing declared dimension rejected, cached when no match, attaches when matching primary exists, corroborator-only never creates a group, rejected when correlation disabled.
All tests green: 221 unit + 116 integration + 16 postgres.
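The attach-or-cache flow in this commit can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: `Dims`, `Engine`, `Outcome`, and the plain `u64` clock are hypothetical, only two dimensions are modelled, and the real implementation persists to Postgres. The sketch only shows the shape of the logic: attach if any open group shares a dimension, otherwise cache with a TTL and drain when a matching primary arrives.

```rust
/// Dimension tags carried by a signal or accumulated on a group (two shown).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Dims {
    pub customer_id: Option<String>,
    pub pop: Option<String>,
}

impl Dims {
    /// OR semantics: one shared, populated dimension is enough.
    pub fn matches_any(&self, other: &Dims) -> bool {
        fn same(a: &Option<String>, b: &Option<String>) -> bool {
            a.is_some() && a == b
        }
        same(&self.customer_id, &other.customer_id) || same(&self.pop, &other.pop)
    }
}

/// A corroborator waiting in the floating cache.
#[derive(Debug, Clone)]
pub struct Cached {
    pub dims: Dims,
    pub expires_at: u64,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Indices of the open groups the signal attached to.
    Attached(Vec<usize>),
    Cached,
}

#[derive(Default)]
pub struct Engine {
    /// primary_dimensions of each open group.
    pub open_groups: Vec<Dims>,
    pub cache: Vec<Cached>,
}

impl Engine {
    /// Attach to every matching open group, or cache for `window_seconds`.
    pub fn ingest_corroborator(&mut self, dims: Dims, now: u64, window_seconds: u64) -> Outcome {
        let hits: Vec<usize> = self
            .open_groups
            .iter()
            .enumerate()
            .filter(|(_, g)| g.matches_any(&dims))
            .map(|(i, _)| i)
            .collect();
        if !hits.is_empty() {
            return Outcome::Attached(hits);
        }
        self.cache.push(Cached { dims, expires_at: now + window_seconds });
        Outcome::Cached
    }

    /// On primary ingest: open a group and drain unexpired cached matches.
    pub fn ingest_primary(&mut self, dims: Dims, now: u64) -> Vec<Cached> {
        let (drained, keep): (Vec<Cached>, Vec<Cached>) = std::mem::take(&mut self.cache)
            .into_iter()
            .partition(|c| c.expires_at > now && c.dims.matches_any(&dims));
        self.cache = keep;
        self.open_groups.push(dims);
        drained
    }
}
```

Expired entries are deliberately left in the cache by `ingest_primary`; in this sketch, as in the commit, removing them is the sweep's job.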
…ontrols
CLI:
- `prefixdctl send-corroborator` builds a /v1/signals/corroborator POST from flags (--source, --vector, --customer-id, --pop, --service-id, --interface, --confidence). Shows signal_id, status and attached group IDs in table or JSON output.
Dashboard, Correlation ▸ Configuration tab:
- Source cards now show a "corroborating" badge when applicable, plus the declared match_dimensions as a secondary metric row.
- The source edit dialog adds a Mode select (primary / corroborating) and a dimension picker (customer_id, pop, service_id, interface) that appears only when mode=corroborating. Validation mirrors the server: primary must have zero dims, corroborating must have ≥1.
- New exported `validateSourceConfig` helper with 9 unit tests.
Dashboard, Signal group detail:
- Contributing events list shows a per-event corroborating badge so operators can see at a glance which rows strengthened the group vs which ones triggered it.
Frontend tests: 87 passing (9 new). Frontend build: clean.
- docs/adr/021-corroborating-signals.md: full design record (context, decision, consequences, rollout, alternatives rejected). Updated ADR index.
- docs/api.md: new 'Corroborator Signal' section with schema, error cases, and the group-must-have-a-primary-event invariant.
- docs/configuration.md: added mode + match_dimensions rows to the source-config table and a worked YAML example.
- configs/correlation.yaml: commented-out example corroborating source.
- docs/cli.md: 'Send a Corroborating Signal' section for prefixdctl.
- docs/detectors/corroborating-signals.md: operator quickstart: declare, emit, verify, CLI, Python pusher, troubleshooting.
- CHANGELOG.md: Unreleased entry.
- ROADMAP.md: marked corroborating-only signals shipped under Correlation Engine.
- FEATURES.md: listed alongside other Signal Ingestion options.
- AGENTS.md: new endpoint, ADR count bumped to 21, feature line updated.
Addresses the findings from the corroborating-signals PR review. No behavior change for primary sources or the existing correlation flow.
Correctness fixes:
- POST /v1/events now rejects events from sources configured as mode=corroborating at handler entry, before any ban/unban branching and before any DB writes. Previously the check lived inside handle_ban only, leaving unban and other paths partially processed before rejection. New test test_primary_event_rejects_corroborating_source_before_write asserts the event is not persisted.
- Corroborator matching now uses declared match_dimensions as the authoritative filter, not just a presence check. A source declared for [pop] can no longer attach to a group via an undeclared customer_id/service_id/interface. Applies on:
  * immediate corroborator attach (ingest_corroborator)
  * cached-signal drain on primary ingest
  Factored into corroborator_matches_declared + unit test corroborator_declared_matching_ignores_undeclared_dimensions.
- primary_dimensions now includes interface, fed from a new Asset.interface field in inventory and carried through IpContext. Interface-only corroborators are now matchable.
- recompute_group_aggregates can no longer flip corroboration_met=true without playbook-override context. Corroborator-only recomputes update aggregates; the primary-ingest path remains the only place that promotes corroboration_met.
- Corroborator rows in signal_group_events now carry their own ingested_at via new corroborator_ingested_at column (migration 010) and the list query is fully CASE-split on is_corroborating. Frontend api.ts now types ingested_at as string|null and the group detail page guards against null.
- Cached corroborator count/list now mean unattached and unexpired, matching trait docstrings. Both mock and Postgres implementations updated.
Validation / boot safety:
- CorrelationConfig::load and Settings::load now run validate() on YAML load. Misconfigured mode=corroborating sources can no longer boot.
Observability:
- CORROBORATOR_EXPIRED_TOTAL dropped its misleading 'source="unknown"' label and is now a plain counter incremented by deleted count per sweep. Per-source attribution deferred to a follow-up (would require selecting rows before delete).
Frontend:
- SourceDialog Mode select now clears match_dimensions when flipping back to primary, preventing stale state from reaching the validator.
Mock repository:
- update_signal_group now copies the full struct (was only copying a subset of fields, hiding primary_dimensions changes from tests).
- find_open_groups_by_dimensions now actually filters via PrimaryDimensions.matches_probe, matching the Postgres behavior.
Bench fix: benches/benchmarks.rs Asset literal now includes the new interface field.
Tests: 222 unit + 119 integration + 16 postgres, all green. Clippy clean. 3 new tests wired for the remediation paths.
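To make the "declared dimensions are authoritative" fix concrete, here is a minimal sketch under stated assumptions. The `Dimensions` struct, its `get` accessor, and the `matches_declared` function are hypothetical simplifications (the real `corroborator_matches_declared` signature is not shown in this PR text, and group dimensions are modelled as single values rather than a union). The point illustrated is that only dimensions the source declared are consulted, with OR semantics across them.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchDimension {
    CustomerId,
    Pop,
    ServiceId,
    Interface,
}

/// Hypothetical flat view of a signal's or a group's dimension values.
#[derive(Debug, Clone, Default)]
pub struct Dimensions {
    pub customer_id: Option<String>,
    pub pop: Option<String>,
    pub service_id: Option<String>,
    pub interface: Option<String>,
}

impl Dimensions {
    /// Looks up the value for one dimension, if populated.
    fn get(&self, dim: MatchDimension) -> Option<&str> {
        match dim {
            MatchDimension::CustomerId => self.customer_id.as_deref(),
            MatchDimension::Pop => self.pop.as_deref(),
            MatchDimension::ServiceId => self.service_id.as_deref(),
            MatchDimension::Interface => self.interface.as_deref(),
        }
    }
}

/// True when the signal and group agree on at least one *declared* dimension.
/// Dimensions the source did not declare are never consulted.
pub fn matches_declared(
    declared: &[MatchDimension],
    signal: &Dimensions,
    group: &Dimensions,
) -> bool {
    declared.iter().any(|&dim| match (signal.get(dim), group.get(dim)) {
        (Some(s), Some(g)) => s == g,
        _ => false,
    })
}
```

With this shape, a source declared for `[pop]` that also happens to send a `customer_id` cannot attach through that undeclared field, which is the behavior the review asked for.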
Complements the remediation code commit with the corresponding doc
updates and explicit follow-up scope.
CHANGELOG Unreleased section now documents:
- Declared dimensions are authoritative (not just a presence check)
- Boot-time validation via CorrelationConfig::load / Settings::load
- Early handler-entry rejection of corroborating sources on /v1/events
- New interface dimension fed from Asset.interface
- check_corroboration_with_primary invariant + recompute guard
- Migration 010 for corroborator_ingested_at
- Expiry metric simplified (unlabelled); per-source deferred to PR B
- Dashboard: mode-switch auto-clears match_dimensions;
null-safe ingested_at on corroborator rows
ADR 021 appendix sections:
- 'Review remediations (merged into the shipping revision)' — lists
each fix with rationale so future readers can see what was in scope
for v1.
- 'Known limits / deferred to PR B' — explicit list of carry-overs:
playbook-override-aware finalization, per-source expiry attribution,
CorroboratorResponse.cached field removal.
- Refreshed References section to include migration 010,
corroborator_matches_declared, settings.rs, inventory.rs, and
metrics.rs.
ROADMAP new 'Corroborating signals v2 (PR B)' bulleted milestone under
Correlation Engine, spelling out 5 concrete work items:
- Playbook-override-aware corroborator finalization
- Per-source attribution on prefixd_corroborator_expired_total
- Drop CorroboratorResponse.cached (API cleanup)
- Dashboard 'cached corroborators' panel + admin listing endpoint
- Gauge metric prefixd_corroborator_cache_size{source}
docs/api.md:
- /v1/signals/corroborator now explicitly documents 'declared dimensions
only' matching semantics.
- Response section flags the redundant 'cached' field for removal and
steers integrators to 'status' / 'attached_group_ids' instead.
docs/configuration.md:
- 'Source Configuration' section clarifies declared-dimensions-are-
authoritative, plus boot-time + API-time validation.
- inventory.yaml 'Services' section adds an 'Asset fields' table
documenting the new optional 'interface' field and linking it to
ADR 021's match dimensions.
docs/detectors/corroborating-signals.md:
- New 'Known limits' section with the late-corroborator-finalization
caveat written in operator-actionable terms.
- 'Invariants to remember' now includes 'Declared dimensions are
authoritative'.
AGENTS.md:
- /v1/signals/corroborator endpoint note expanded with matching
semantics and rejection conditions for future agent sessions.
No code change.
Pull request overview
Implements ADR 021 by introducing corroborating signals: dimension-tagged telemetry that can attach to and strengthen existing open signal groups (via coarse dimensions like customer_id, pop, service_id, interface) but can never trigger mitigations on its own.
Changes:
- Adds `mode` + `match_dimensions` to correlation source config with server/client validation and boot-time validation on YAML load.
- Introduces `POST /v1/signals/corroborator`, plus DB schema/cache support to attach-or-cache corroborators and sweep expired cache entries.
- Extends UI/CLI/docs/metrics/tests to support and observe corroborating signals end-to-end.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/integration.rs | Adds integration tests for corroborator endpoint behavior (reject/attach/cache) and updates test configs for new defaults. |
| tests/common/mod.rs | Updates shared test inventory and correlation source setup for new SourceConfig defaults / asset interface field. |
| src/scheduler/reconcile.rs | Adds periodic sweep of expired corroborating-signal cache and increments expiry metric. |
| src/policy/mod.rs | Updates policy test fixtures for new IpContext.interface field. |
| src/observability/metrics.rs | Adds corroborator ingest/attach/expire Prometheus counters and wires them into init. |
| src/db/traits.rs | Extends repository trait with dimension-based group lookup and corroborator cache/attach APIs. |
| src/db/repository.rs | Implements corroborator persistence + group event denormalization, adds primary_dimensions storage, and new queries. |
| src/db/mod.rs | Registers migrations 009/010 for corroborator tables/columns. |
| src/db/mock.rs | Extends mock repo to support corroborator cache and is_corroborating event links. |
| src/correlation/engine.rs | Adds PrimaryDimensions, CorroboratingSignal, dimension matching helpers, and “must have primary” corroboration check. |
| src/correlation/config.rs | Adds SourceMode/MatchDimension, config defaults, helpers, and validation enforcement in load(). |
| src/config/settings.rs | Enforces correlation config validation at daemon settings load time. |
| src/config/inventory.rs | Adds Asset.interface and carries it through lookup into IpContext for interface dimension matching. |
| src/bin/prefixdctl.rs | Adds send-corroborator CLI subcommand to post corroborating signals. |
| src/api/routes.rs | Registers new POST /v1/signals/corroborator route. |
| src/api/openapi.rs | Exposes corroborator endpoint + schemas in OpenAPI. |
| src/api/handlers.rs | Rejects corroborating sources on /v1/events, populates group primary_dimensions, drains cached corroborators on primary ingest, and implements corroborator ingest endpoint. |
| migrations/009_corroborating_signals.sql | Adds corroborator cache table, is_corroborating, denormalized corroborator columns, and group primary_dimensions. |
| migrations/010_corroborator_ingested_at.sql | Adds corroborator_ingested_at to ensure corroborator rows have timestamps for UI ordering. |
| frontend/lib/api.ts | Adds corroborator API types/call and extends SignalGroupEvent typing for null-safe timestamps and corroborator flag. |
| frontend/components/dashboard/correlation/config-tab.tsx | Adds mode + match-dimension editing UI and client-side validation mirroring backend rules. |
| frontend/app/(dashboard)/correlation/groups/[id]/page.tsx | Renders corroborating badge per contributing event and handles null ingest timestamps. |
| frontend/tests/correlation-source-config.test.ts | Adds frontend validator tests for mode + match_dimensions. |
| docs/detectors/corroborating-signals.md | Adds operator quickstart for configuring and sending corroborators (curl/CLI/Python). |
| docs/configuration.md | Documents mode/match_dimensions and new Asset.interface field behavior. |
| docs/cli.md | Documents prefixdctl send-corroborator. |
| docs/api.md | Documents POST /v1/signals/corroborator request/response semantics and invariant. |
| docs/adr/README.md | Adds ADR 021 to the ADR index. |
| docs/adr/021-corroborating-signals.md | Adds design record describing config, endpoint, cache, invariant, and known limits. |
| configs/correlation.yaml | Adds commented example corroborating source configuration. |
| benches/benchmarks.rs | Updates benchmark inventory fixture for new Asset.interface field. |
| ROADMAP.md | Adds ADR 021 completion and follow-up items for “v2 / PR B”. |
| FEATURES.md | Lists corroborating signals as a supported integration path. |
| CHANGELOG.md | Adds unreleased entry describing corroborating signals feature set. |
| AGENTS.md | Updates product/agent endpoint list and ADR count to include ADR 021. |
  // Sort events chronologically (earliest first)
  const sortedEvents = [...group.events].sort(
    (a, b) =>
-     new Date(a.ingested_at).getTime() - new Date(b.ingested_at).getTime(),
+     new Date(a.ingested_at ?? 0).getTime() - new Date(b.ingested_at ?? 0).getTime(),
  )
Sorting uses new Date(a.ingested_at ?? 0); when ingested_at is null this treats it as epoch (0) and will sort “Unknown ingest time” events to the very beginning. Prefer sorting nulls last (e.g., map null to Number.POSITIVE_INFINITY) so the timeline remains chronological for known timestamps.
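The nulls-last ordering suggested here is easy to get subtly wrong, so a small sketch may help. It is written in Rust purely as an analogue of the frontend logic; the `GroupEvent` struct, the millisecond field, and the `ingest_key` helper are assumptions for illustration and are not the dashboard's actual code.

```rust
/// Hypothetical event row; `None` models an unknown ingest time.
#[derive(Debug, Clone, PartialEq)]
pub struct GroupEvent {
    pub id: u32,
    /// Milliseconds since the Unix epoch.
    pub ingested_at_ms: Option<i64>,
}

/// Maps a missing timestamp to the largest possible key so it sorts last,
/// instead of to 0 (the epoch), which would sort it first.
fn ingest_key(ev: &GroupEvent) -> i64 {
    ev.ingested_at_ms.unwrap_or(i64::MAX)
}

/// Earliest first; events with unknown ingest time go to the end.
pub fn sort_chronologically(events: &mut [GroupEvent]) {
    // `sort_by_key` is stable, so events with equal keys keep their order.
    events.sort_by_key(ingest_key);
}
```

The same idea applies in the TypeScript comparator: substitute a maximal sentinel for a missing timestamp rather than zero.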
  // Sample source labels for the expired metric BEFORE deleting, so we
  // can attribute expiries. For simplicity we just increment the
  // counter by total deleted with an "unknown" label; an operator who
  // wants per-source attribution can scrape the metric and compare
  // with CORROBORATOR_INGESTED_TOTAL.
  let now = chrono::Utc::now();
The comment in sweep_corroborator_cache still refers to sampling per-source labels and incrementing an "unknown"-labelled metric, but CORROBORATOR_EXPIRED_TOTAL is defined with no labels and the code increments it with an empty label set. Please update/remove this comment so it matches the current metric design (or reintroduce per-source attribution if that’s the intent).
- Engine invariant enforced in two places: (a) `check_corroboration_with_primary` requires at least one primary event before `corroboration_met` can flip true, and (b) the corroborator-side aggregate recompute refuses to promote `corroboration_met` from false→true on its own; only the primary-ingest path (which has playbook-override context) can do that.
- `POST /v1/events` rejects corroborating-only sources at handler entry, before any ban/unban branching and before any DB writes. Nothing persists from a rejected event.
- New `interface` field on inventory `Asset` entries feeds into `IpContext.interface`, so interface-only corroborators (a common gNMI / SNMP shape) now have a real matchable dimension.
- Reconciliation loop sweeps expired corroborators. Four new Prometheus metrics: `prefixd_corroborator_ingested_total{source}`, `_attached_total{source}`, `_expired_total` (unlabelled; per-source attribution deferred to PR B; see ROADMAP).
This changelog entry says "Four new Prometheus metrics", but the implementation adds three corroborator metrics (prefixd_corroborator_ingested_total, _attached_total, _expired_total). Please correct the count to avoid confusing operators.
Suggested change:
- - Reconciliation loop sweeps expired corroborators. Four new Prometheus metrics: `prefixd_corroborator_ingested_total{source}`, `_attached_total{source}`, `_expired_total` (unlabelled; per-source attribution deferred to PR B; see ROADMAP).
+ - Reconciliation loop sweeps expired corroborators. Three new Prometheus metrics: `prefixd_corroborator_ingested_total{source}`, `_attached_total{source}`, `_expired_total` (unlabelled; per-source attribution deferred to PR B; see ROADMAP).
Three nits from the Copilot reviewer on the last doc+remediation round:
1. frontend groups page: sort events with null `ingested_at` at the end instead of treating null as epoch 0 (which sent rows with an unknown ingest time to the very beginning of the timeline). Factored out a small `ingestTs` helper that maps null -> Number.POSITIVE_INFINITY.
2. src/scheduler/reconcile.rs: the doc comment on `sweep_corroborator_cache` still described sampling per-source labels and an "unknown" label, neither of which is true anymore after CORROBORATOR_EXPIRED_TOTAL was made unlabelled. Rewrote to describe the current behavior and point at the PR B follow-up for per-source attribution.
3. CHANGELOG: corrected the corroborator metric count from "Four new Prometheus metrics" to "Three": _ingested_total, _attached_total, _expired_total.
No behavior change. cargo fmt + clippy + frontend tests/build clean.
1) Backfill primary_dimensions for pre-upgrade open signal groups
(P2). Migration 009 defaulted primary_dimensions to '{}', so any
signal group already open at upgrade time was unmatchable from the
corroborator path (/v1/signals/corroborator and cache-drain both
match exclusively on primary_dimensions) until another primary event
happened to update the row. New migration 011 derives dimensions
best-effort from each group's related mitigations:
customer_id / pop / service_id are denormalized on mitigations, so a
single aggregation fills those in. Interface is left empty
pre-upgrade (new inventory field). Open groups with no associated
mitigation yet stay empty — honest state.
2) Derive Signals-tab source health from corroborator traffic too
(P2). getSignalSources() combined correlation config with /v1/events
only, so any source configured mode=corroborating always rendered
last_seen=null / unhealthy even while actively posting to
/v1/signals/corroborator. New backend endpoint
GET /v1/signals/corroborator/activity?minutes=N aggregates per-
source (last_seen, count) across the live cache
(corroborating_signals) and attached rows (signal_group_events
WHERE is_corroborating) in a single SQL. Repository trait gains
corroborator_source_activity() with Postgres + Mock impls. Handler
returns {since, sources:[{source,last_seen,count}]}. Frontend
getSignalSources() fetches this in parallel with events and merges
into the source map before the 10-minute healthy cutoff runs.
3) Count only unattached corroborators toward expired_total (P3).
/v1/signals/corroborator always persists a cache row (even on
attach, for late fan-out), so the old DELETE ... WHERE expires_at
<= $1 removed attached rows when their TTL passed and fed that
count into CORROBORATOR_EXPIRED_TOTAL — which the ADR documents as
'expired without attaching'. Repository now splits into
CorroboratorSweepStats { unattached_expired, attached_expired },
scheduler only increments the counter by unattached_expired.
Attached rows still get GC'd (their audit copy lives on
signal_group_events independently).
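Item 3 above can be sketched as follows. This is an in-memory illustration under assumptions, not the repository code: `CacheRow`, `sweep_expired`, and `record_sweep` are hypothetical names, time is a plain `u64`, and the counter is a bare integer standing in for CORROBORATOR_EXPIRED_TOTAL. Only the `CorroboratorSweepStats` field names come from the commit message.

```rust
/// Hypothetical in-memory stand-in for a `corroborating_signals` row.
#[derive(Debug, Clone)]
pub struct CacheRow {
    pub expires_at: u64,
    /// True once the signal has been attached to at least one group.
    pub attached: bool,
}

/// Split counts returned by one sweep (field names from the commit message).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct CorroboratorSweepStats {
    pub unattached_expired: u64,
    pub attached_expired: u64,
}

/// Removes every row with `expires_at <= now`, counting the two kinds apart.
pub fn sweep_expired(cache: &mut Vec<CacheRow>, now: u64) -> CorroboratorSweepStats {
    let mut stats = CorroboratorSweepStats::default();
    cache.retain(|row| {
        if row.expires_at > now {
            return true; // still fresh, keep it
        }
        if row.attached {
            stats.attached_expired += 1;
        } else {
            stats.unattached_expired += 1;
        }
        false // expired rows are garbage-collected either way
    });
    stats
}

/// Scheduler side: only signals that expired without attaching are counted.
pub fn record_sweep(expired_total: &mut u64, stats: &CorroboratorSweepStats) {
    *expired_total += stats.unattached_expired;
}
```

Both kinds of expired row are still deleted; the split only changes what the metric reports.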
Tests added:
- test_expired_sweep_splits_attached_vs_unattached (integration) —
seeds three cache rows (unattached+expired, attached+expired,
unattached+fresh), asserts the sweep returns (1,1) and the unexpired
row survives.
- test_corroborator_source_activity_merges_cache_rows (integration) —
seeds three rows across two sources, asserts the aggregation groups
correctly and respects the 'since' cutoff.
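The per-source aggregation exercised by the second test might look roughly like the sketch below. It is an assumption-laden illustration: the real query runs in SQL across two tables, whereas this version folds a single slice of hypothetical `Seen` rows, uses `u64` timestamps, and treats the `since` cutoff as inclusive (the PR text does not say which way the boundary goes).

```rust
use std::collections::BTreeMap;

/// Hypothetical row: one corroborator sighting, cached or attached.
#[derive(Debug, Clone)]
pub struct Seen {
    pub source: String,
    pub at: u64,
}

/// Per-source summary, mirroring the `{source, last_seen, count}` response shape.
#[derive(Debug, PartialEq, Eq)]
pub struct SourceActivity {
    pub source: String,
    pub last_seen: u64,
    pub count: u64,
}

/// Groups sightings at or after `since` by source, sorted by source name.
pub fn source_activity(rows: &[Seen], since: u64) -> Vec<SourceActivity> {
    let mut by_source: BTreeMap<&str, (u64, u64)> = BTreeMap::new();
    for row in rows.iter().filter(|r| r.at >= since) {
        let entry = by_source.entry(row.source.as_str()).or_insert((0, 0));
        entry.0 = entry.0.max(row.at); // latest sighting
        entry.1 += 1; // number of sightings in the window
    }
    by_source
        .into_iter()
        .map(|(source, (last_seen, count))| SourceActivity {
            source: source.to_string(),
            last_seen,
            count,
        })
        .collect()
}
```

A frontend health check would then compare each `last_seen` against its own cutoff, as the commit describes for the 10-minute window.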
ADR 021 appendix extended with items 9 and 10 (backfill migration
reasoning + Signals dashboard health fix). CHANGELOG Unreleased entry
updated to mention migration 011, the new activity endpoint, and the
narrower expired_total semantics. api.md documents the new endpoint.
AGENTS.md catalog updated.
Tests: 222 unit + 121 integration + 16 postgres + 87 frontend, all
green. cargo fmt + cargo clippy --all-targets clean. bun run build
clean.
Implements ADR 021. Adds a second class of correlation signals — corroborating signals — that strengthen open signal groups on coarse dimensions (customer_id, pop, service_id, interface) without ever triggering a mitigation on their own.
Motivation
The correlation engine (ADR 018) treats every event as a full primary: any ingested event can create a signal group and, if thresholds are met, trigger a /32 mitigation. That works for detectors that can identify a victim IP (FastNetMon, Alertmanager) but breaks down for valuable coarse telemetry:
Today operators drop those signals or write brittle shims that guess a victim IP. This PR lets them opt those sources into a corroborating mode where they meaningfully contribute to derived_confidence but never mitigate alone.
What ships
Config schema (`correlation.yaml`)
```yaml
sources:
fastnetmon:
mode: primary # default; unchanged behavior
weight: 1.0
router-cpu:
mode: corroborating # new
weight: 0.4
match_dimensions: [pop, customer_id]
```
Validator rejects `mode=corroborating` with empty dims, `mode=primary` with non-empty dims, and unknown dimensions.
New endpoint
`POST /v1/signals/corroborator`: dimension-tagged signal, no victim_ip required. Matches open groups with OR semantics across the source's declared `match_dimensions` (undeclared dimensions are ignored), with an optional vector narrower. Unmatched signals cache for up to `window_seconds` and drain when a matching primary arrives.
`prefixdctl send-corroborator --source router-cpu --pop iad1 --customer-id cust_42 ...`
Dashboard
Engine invariant
A group composed entirely of corroborators can never reach `corroboration_met=true`, regardless of how high derived_confidence climbs. Enforced in `check_corroboration_with_primary` — at least one `is_corroborating=false` event is required.
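A minimal sketch of this invariant, under loud assumptions: the `GroupEvent` struct, the summed-and-capped `derived_confidence`, the `threshold` parameter, and the function name `corroboration_met` are all hypothetical, since the real `check_corroboration_with_primary` signature and confidence formula are not shown here. Only the rule itself (no primary event, no `corroboration_met`) is taken from the description.

```rust
/// Hypothetical contributing event on a signal group.
#[derive(Debug, Clone)]
pub struct GroupEvent {
    pub is_corroborating: bool,
    pub confidence: f64,
}

/// Assumed aggregate for illustration: sum of confidences, capped at 1.0.
pub fn derived_confidence(events: &[GroupEvent]) -> f64 {
    events.iter().map(|e| e.confidence).sum::<f64>().min(1.0)
}

/// True only when at least one primary event exists AND the threshold is met.
/// A group made entirely of corroborators fails the first condition no
/// matter how high its confidence climbs.
pub fn corroboration_met(events: &[GroupEvent], threshold: f64) -> bool {
    let has_primary = events.iter().any(|e| !e.is_corroborating);
    has_primary && derived_confidence(events) >= threshold
}
```

Checking `has_primary` first makes the guarantee independent of whatever weighting the confidence calculation uses.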
Data model (migration 009)
Observability
Three new metrics: `prefixd_corroborator_ingested_total` and `_attached_total` (labelled by source), plus `_expired_total` (unlabelled; per-source attribution is deferred to PR B). Reconciliation loop sweeps expired cache entries.
Commits
Verification
Rollout
Zero-config upgrade: every existing source defaults to `mode=primary` and behaves identically to v0.15.0. Operators opt in by editing `correlation.yaml` and pointing new detectors at `/v1/signals/corroborator`.
See ADR 021 for the design record and docs/detectors/corroborating-signals.md for the operator quickstart.