Metrics
The orchestrator exposes its Prometheus metrics at /metrics on the main
listen address (-addr, 127.0.0.1:8443 by default), behind the same TLS/mTLS
envelope as the rest of the API.
Source: internal/orchestrator/metrics/metrics.go.
fangs_orchestrator_info: Gauge, always 1. Labeled with the build version (set via -ldflags -X main.version=..., otherwise dev). Useful for stamping the version on
dashboards.
fangs_orchestrator_info
fangs_runners_registered: Gauge. Number of runners currently registered and heartbeat-fresh. The pruner evicts a runner after 90s without a heartbeat; this gauge reflects the post-prune state.
# alert when no runners
fangs_runners_registered == 0
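The prune-then-count behaviour described above can be sketched in a few lines. This is a minimal illustration, not the orchestrator's code: the RunnerRegistry class, its method names, and the runner IDs are hypothetical, and the exact boundary behaviour at 90s is an assumption.

```python
from dataclasses import dataclass, field

HEARTBEAT_TTL_SECONDS = 90.0  # eviction threshold stated in the text


@dataclass
class RunnerRegistry:
    """Hypothetical in-memory registry; all names here are illustrative."""
    last_seen: dict = field(default_factory=dict)  # runner_id -> last heartbeat time

    def heartbeat(self, runner_id: str, now: float) -> None:
        # Registration and heartbeats both refresh the timestamp.
        self.last_seen[runner_id] = now

    def prune(self, now: float) -> None:
        # Keep only runners heard from within the TTL (boundary is an assumption).
        self.last_seen = {
            rid: ts for rid, ts in self.last_seen.items()
            if now - ts < HEARTBEAT_TTL_SECONDS
        }

    def registered(self, now: float) -> int:
        # The gauge is read after pruning, so stale runners never count.
        self.prune(now)
        return len(self.last_seen)


reg = RunnerRegistry()
reg.heartbeat("runner-a", now=0.0)
reg.heartbeat("runner-b", now=60.0)
print(reg.registered(now=100.0))  # runner-a is 100s stale, runner-b is 40s fresh
print(reg.registered(now=200.0))  # both are stale by now
```

Under these assumptions the gauge drops to 0 roughly 90s after the last runner stops heartbeating, so an alert on == 0 fires only after that eviction delay.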
fangs_scans_queued_total: Counter. Incremented every time SubmitScan enqueues a job.
# scan rate
rate(fangs_scans_queued_total[5m])
fangs_events_received_total: Counter. Incremented once per received event and labeled by event type
(file_access, exec, net_connect, dns_query, tls_sni).
# events per second by type
sum by (type) (rate(fangs_events_received_total[1m]))
# total event throughput
sum(rate(fangs_events_received_total[1m]))
Cardinality: bounded — 5 types.
fangs_events_dropped_total: Counter. Lifetime sum of events_dropped across every ScanResult,
incremented by ObserveEventsDropped on each result POST.
# alert on ringbuf overflow
rate(fangs_events_dropped_total[5m]) > 0
A non-zero rate means at least one recent run lost events to ringbuf overflow. See Sensor-Probes#drop-counter for tuning advice.
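Because the counter is a lifetime sum, its raw value stays above zero forever after the first drop, which is why the alert tests the rate instead. The sketch below illustrates that with made-up numbers. The function names are hypothetical, and window_increase is a deliberately simplified stand-in for PromQL's increase(): it ignores counter resets and extrapolation.

```python
def observe_events_dropped(lifetime_total: int, events_dropped: int) -> int:
    """Add one ScanResult's events_dropped to the lifetime counter."""
    if events_dropped < 0:
        raise ValueError("a counter can only increase")
    return lifetime_total + events_dropped


def window_increase(samples: list) -> int:
    """Simplified increase(): last sample minus first sample in the window."""
    return samples[-1] - samples[0] if len(samples) >= 2 else 0


# Hypothetical sequence: one run drops 12 events, the following runs drop none.
total = 0
scraped = []
for dropped in [0, 12, 0, 0, 0]:
    total = observe_events_dropped(total, dropped)
    scraped.append(total)  # pretend each result POST is followed by a scrape

early = window_increase(scraped[:3])  # window that contains the drop
late = window_increase(scraped[2:])   # window after the drop

print(scraped)      # the lifetime value never returns to zero
print(early, late)  # only the window containing the drop shows an increase
```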
fangs_deviations_written_total: Counter. Incremented once per deviation emitted by the Differ, labeled by severity.
# deviation rate by severity
sum by (severity) (rate(fangs_deviations_written_total[1h]))
# critical deviations
fangs_deviations_written_total{severity="critical"}
Cardinality: bounded — 4 standard severities + unknown.
fangs_baseline_promoted_total: Counter. Tracks baseline promotions, labeled by trigger:
- auto: Differ promoted a zero-deviation run
- manual: operator ran fangs baseline promote
# ratio of human-triggered promotes (a high ratio means most
# releases need operator review — possible tuning opportunity)
sum(fangs_baseline_promoted_total{trigger="manual"})
/ sum(fangs_baseline_promoted_total)
Note: today this counter only fires from the Differ's auto-promote
path. The CLI's fangs baseline promote doesn't update the metric
since it writes to the DB directly without going through the
orchestrator. That's a v2 fix.
fangs_notifications_total: Counter. Incremented per delivery attempt by the Notifier.
Labels:
- notifier: the target's name (bounded by your configured targets, typically <10)
- status: sent|failed|permanent
# delivery success rate per notifier
sum by (notifier) (rate(fangs_notifications_total{status="sent"}[1h]))
/ sum by (notifier) (rate(fangs_notifications_total[1h]))
# alert on a notifier producing permanent failures
rate(fangs_notifications_total{status="permanent"}[5m]) > 0
Cardinality: bounded — # notifiers × 3.
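As a worked example of the success-rate query, the sketch below computes the same ratio from per-status attempt counts over a window. The notifier name and the counts are invented for illustration, and the helper name is hypothetical.

```python
from typing import Optional

STATUSES = ("sent", "failed", "permanent")  # the three status label values


def success_rate(attempts_by_status: dict) -> Optional[float]:
    """Fraction of delivery attempts with status=sent, or None if there were none."""
    total = sum(attempts_by_status.get(s, 0) for s in STATUSES)
    if total == 0:
        return None  # no attempts in the window, so there is no meaningful ratio
    return attempts_by_status.get("sent", 0) / total


# Hypothetical increases over the last hour for one notifier.
window = {"slack-sec": {"sent": 47, "failed": 2, "permanent": 1}}

for notifier, counts in window.items():
    print(notifier, success_rate(counts))

# Series count for this metric: one per (notifier, status) pair.
print(len(window) * len(STATUSES))
```

A None result corresponds to a window with no delivery attempts at all, which is worth distinguishing from a genuine 0% success rate when building dashboards.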
The metrics package also registers prometheus.NewGoCollector() and
prometheus.NewProcessCollector(...), so you get for free:
go_goroutines
go_gc_duration_seconds
go_memstats_*
process_cpu_seconds_total
process_resident_memory_bytes
process_open_fds
...
Useful for orchestrator host monitoring without an extra agent.
Total series count for a healthy deployment is small:
| Source | Series |
|---|---|
| fangs_* counters/gauges | ~10 base + (5 event types) + (4 severities) + (3 statuses × N notifiers) ≈ 25-50 |
| Go runtime | ~30 |
| Process runtime | ~10 |
A deployment should easily stay under 100 series, a load Prometheus handles without trouble.
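The arithmetic behind the table can be checked directly. The constants below are the approximate figures quoted above, not values measured from a running orchestrator, and the function names are illustrative.

```python
BASE_SERIES = 10        # "~10 base" from the table (approximate)
EVENT_TYPES = 5         # file_access, exec, net_connect, dns_query, tls_sni
SEVERITIES = 4          # as counted in the table
STATUSES = 3            # sent, failed, permanent
GO_RUNTIME_SERIES = 30  # approximate
PROCESS_SERIES = 10     # approximate


def fangs_series(n_notifiers: int) -> int:
    """Estimated fangs_* series for a given number of configured notifiers."""
    return BASE_SERIES + EVENT_TYPES + SEVERITIES + STATUSES * n_notifiers


def total_series(n_notifiers: int) -> int:
    """Estimated total, including Go runtime and process collectors."""
    return fangs_series(n_notifiers) + GO_RUNTIME_SERIES + PROCESS_SERIES


for n in (2, 10):
    print(n, fangs_series(n), total_series(n))
```

With 2 to 10 notifiers the fangs_* estimate spans 25 to 49, matching the range in the table, and the total stays between 65 and 89.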
prometheus.yml:
scrape_configs:
  - job_name: fangs
    static_configs:
      - targets: ['fangs.internal:8443']
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/fangs-ca.crt
      cert_file: /etc/prometheus/fangs-client.crt
      key_file: /etc/prometheus/fangs-client.key

For mTLS deployments, Prometheus needs a client cert signed by the
orchestrator's -tls-client-ca to scrape. Issue one via
docs/scripts/gen-tls.sh with RUNNER_ID=prom-scraper.
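Before pointing Prometheus at the endpoint, it can help to confirm that the client cert is accepted. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the sample exposition text is made up, and the parser is simplified: it assumes sample lines carry no trailing timestamp.

```python
import ssl
import urllib.request


def fetch_metrics(url: str, ca_file: str, cert_file: str, key_file: str) -> str:
    """Fetch the exposition text over mTLS."""
    ctx = ssl.create_default_context(cafile=ca_file)           # trust the orchestrator's CA
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # present the client cert
    with urllib.request.urlopen(url, context=ctx, timeout=10) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text: str) -> dict:
    """Map 'name{labels}' to its value, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # HELP/TYPE lines and blanks
            continue
        series, value = line.rsplit(" ", 1)   # the value is the last field
        samples[series] = float(value)
    return samples


# Made-up exposition text, only to show the parser's output shape.
example = """# TYPE fangs_runners_registered gauge
fangs_runners_registered 2
fangs_events_received_total{type="exec"} 17
"""
print(parse_samples(example))
```

Against a real deployment, fetch_metrics would be called with https://fangs.internal:8443/metrics and the three file paths from the scrape config. A TLS handshake error at that point usually means the client cert was not signed by the CA passed to -tls-client-ca.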
For plain-HTTP development:
scrape_configs:
  - job_name: fangs
    static_configs:
      - targets: ['127.0.0.1:8443']

Alerting rules:
groups:
  - name: fangs
    rules:
      - alert: FangsNoRunners
        expr: fangs_runners_registered == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: No FANGS runners registered for 2 minutes
      - alert: FangsRingbufOverflow
        expr: rate(fangs_events_dropped_total[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: FANGS sensor ringbuf overflowing
          description: Events being dropped at probe time; tune ringbuf or reduce concurrent scans
      - alert: FangsNotifierFailing
        expr: rate(fangs_notifications_total{status="permanent"}[10m]) > 0
        labels: {severity: warning}
        annotations:
          summary: Notifier delivery permanently failing
          description: A configured webhook target is returning 4xx persistently
      - alert: FangsHighSeverityDeviation
        expr: increase(fangs_deviations_written_total{severity=~"high|critical"}[5m]) > 0
        labels: {severity: critical}
        annotations:
          summary: High/critical FANGS deviation
          description: Investigate the pending review queue

Some signals would be useful but aren't surfaced today:
- Per-run differ duration (fangs_differ_duration_seconds): histogram showing how long deltas take.
- Per-package run frequency: high-cardinality (one label per watched package) so it's intentionally absent; the DB has the data.
- Sandbox lifecycle latency: image pull duration, container exit code distribution. Useful for operators tuning sandbox limits.
Each is a small addition. PRs welcome.
OpenTelemetry is not wired today. Hooks would be valuable in the per-run state machine and the per-target Notifier retry loops. This is a v2 item.