Metrics

cyb3rjerry edited this page May 23, 2026 · 1 revision

The orchestrator exposes Prometheus metrics at /metrics on its main listen address (-addr, 127.0.0.1:8443 by default), behind the same TLS/mTLS envelope as the rest of the API.

Source: internal/orchestrator/metrics/metrics.go.

Series

fangs_orchestrator_info{version}

Gauge, always 1. Labeled with the build version (-ldflags -X main.version=... or dev). Useful for stamping the version on dashboards.

fangs_orchestrator_info

fangs_runners_registered

Gauge. Number of runners that are currently registered and heartbeat-fresh. The pruner evicts a runner after 90s without a heartbeat; this gauge reflects the post-prune state.

# alert when no runners
fangs_runners_registered == 0

fangs_scans_queued_total

Counter. Incremented every time SubmitScan enqueues a job.

# scan rate
rate(fangs_scans_queued_total[5m])
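
To make the rate() call above concrete, here is a rough Go sketch of what a per-second rate over a counter amounts to. It is a simplification under stated assumptions: the real PromQL rate() works over every sample in the window and extrapolates to the window edges, while this version compares just two samples. The `sample` type and `perSecondRate` function are hypothetical names used only for illustration, and the numbers in main are made up.

```go
package main

import "fmt"

// sample is one scraped counter value together with its scrape time.
type sample struct {
	t float64 // scrape time, in seconds
	v float64 // counter value at that time
}

// perSecondRate returns the average per-second increase between two samples.
// If the counter went backwards, it assumes a process restart reset it to
// zero and counts the whole current value as the increase.
func perSecondRate(prev, cur sample) float64 {
	dt := cur.t - prev.t
	if dt <= 0 {
		return 0 // samples out of order or identical timestamps
	}
	delta := cur.v - prev.v
	if delta < 0 {
		delta = cur.v // counter reset
	}
	return delta / dt
}

func main() {
	// 30 scans queued over 60 seconds.
	fmt.Println(perSecondRate(sample{0, 100}, sample{60, 130})) // 0.5
	// Counter reset between scrapes: 15 scans since the restart.
	fmt.Println(perSecondRate(sample{0, 100}, sample{60, 15})) // 0.25
}
```

The reset branch is why a bare counter value is rarely graphed directly: wrapping it in rate() or increase() keeps restarts from showing up as negative spikes.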

fangs_events_received_total{type}

Counter. Incremented once for each event received, labeled by event type (file_access, exec, net_connect, dns_query, tls_sni).

# events per second by type
sum by (type) (rate(fangs_events_received_total[1m]))

# total event throughput
sum(rate(fangs_events_received_total[1m]))

Cardinality: bounded — 5 types.
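
For a quick check without a Prometheus server, the per-type totals can also be read straight from the /metrics response body. The sketch below is a minimal, stdlib-only parser for the Prometheus text format, not the orchestrator's own code. `sumByLabel` is a hypothetical helper, the sample values in main are invented, and the label splitting is deliberately naive (it assumes label values contain no commas, which holds for the five event types listed above).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumByLabel scans Prometheus text-format output and totals the samples of
// one metric, grouped by the value of one label.
func sumByLabel(body, metric, label string) map[string]float64 {
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, metric+"{") {
			continue // a different metric, or one without labels
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(metric)+1 : end]
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		key := "" // stays empty if the label is absent
		for _, pair := range strings.Split(labels, ",") {
			name, val, ok := strings.Cut(pair, "=")
			if ok && strings.TrimSpace(name) == label {
				key = strings.Trim(strings.TrimSpace(val), `"`)
			}
		}
		totals[key] += v
	}
	return totals
}

func main() {
	// Invented scrape output, for illustration only.
	body := `# TYPE fangs_events_received_total counter
fangs_events_received_total{type="exec"} 42
fangs_events_received_total{type="dns_query"} 7
fangs_scans_queued_total 3
`
	totals := sumByLabel(body, "fangs_events_received_total", "type")
	fmt.Println(totals["exec"], totals["dns_query"]) // 42 7
}
```

These are raw counter totals, not rates; for throughput, the PromQL queries above remain the better tool.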

fangs_events_dropped_total

Counter. Lifetime sum of events_dropped from every ScanResult — incremented by ObserveEventsDropped on each result POST.

# alert on ringbuf overflow
rate(fangs_events_dropped_total[5m]) > 0

A non-zero rate means at least one recent run lost events to ringbuf overflow. See Sensor-Probes#drop-counter for tuning advice.

fangs_deviations_written_total{severity}

Counter. Incremented per Differ-emitted deviation, labeled by severity.

# deviation rate by severity
sum by (severity) (rate(fangs_deviations_written_total[1h]))

# raw counter value: critical deviations written since the orchestrator started
fangs_deviations_written_total{severity="critical"}

Cardinality: bounded — 4 standard severities + unknown.

fangs_baseline_promoted_total{trigger}

Counter. Tracks baseline promotions:

  • auto — Differ promoted a zero-deviation run
  • manual — operator ran fangs baseline promote

# ratio of human-triggered promotes (a high ratio means most
# releases need operator review — possible tuning opportunity)
sum(fangs_baseline_promoted_total{trigger="manual"})
  / sum(fangs_baseline_promoted_total)

Note: today this counter only fires from the Differ's auto-promote path, so the manual ratio above currently returns 0 or no data. The CLI's fangs baseline promote doesn't update the metric because it writes to the DB directly without going through the orchestrator. That's a v2 fix.

fangs_notifications_total{notifier, status}

Counter. Incremented per delivery attempt by the Notifier.

Labels:

  • notifier — the target's name (bounded by your configured targets — typically <10)
  • status: sent | failed | permanent

# delivery success rate per notifier
sum by (notifier) (rate(fangs_notifications_total{status="sent"}[1h]))
  / sum by (notifier) (rate(fangs_notifications_total[1h]))

# alert on a notifier producing permanent failures
rate(fangs_notifications_total{status="permanent"}[5m]) > 0

Cardinality: bounded — # notifiers × 3.

Go runtime collectors

The metrics package also registers prometheus.NewGoCollector() and prometheus.NewProcessCollector(...), so you get for free:

go_goroutines
go_gc_duration_seconds
go_memstats_*
process_cpu_seconds_total
process_resident_memory_bytes
process_open_fds
...

Useful for orchestrator host monitoring without an extra agent.

Cardinality

Total series count for a healthy deployment is small:

Source                     Series
fangs_* counters/gauges    ~10 base + (5 event types) + (5 severities, incl. unknown) + (3 statuses × N notifiers) ≈ 25-50
Go runtime                 ~30
Process runtime            ~10

Should easily stay under 100 series — Prometheus has no trouble.
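
As a back-of-the-envelope check on that budget, the sketch below counts one series per label value documented on this page. It is an upper-bound estimate derived from this page rather than from metrics.go, so the constants are assumptions. `fangsSeries` is a hypothetical helper, and the runtime figure reuses the rough ~30 + ~10 from the table.

```go
package main

import "fmt"

// fangsSeries estimates the number of fangs_* series for a deployment with
// the given number of configured notifiers.
func fangsSeries(notifiers int) int {
	const (
		unlabeled  = 4 // info, runners_registered, scans_queued, events_dropped
		eventTypes = 5 // file_access, exec, net_connect, dns_query, tls_sni
		severities = 5 // 4 standard severities + unknown
		triggers   = 2 // auto, manual
		statuses   = 3 // sent, failed, permanent
	)
	return unlabeled + eventTypes + severities + triggers + statuses*notifiers
}

func main() {
	const runtimeSeries = 30 + 10 // rough Go + process collector estimate
	for _, n := range []int{3, 10} {
		fangs := fangsSeries(n)
		fmt.Printf("notifiers=%d fangs=%d total=%d\n", n, fangs, fangs+runtimeSeries)
	}
	// notifiers=3 fangs=25 total=65
	// notifiers=10 fangs=46 total=86
}
```

Under these assumptions the notifier count is the only term that grows, at three series per configured target.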

Sample scrape config

prometheus.yml:

scrape_configs:
  - job_name: fangs
    static_configs:
      - targets: ['fangs.internal:8443']
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/fangs-ca.crt
      cert_file: /etc/prometheus/fangs-client.crt
      key_file:  /etc/prometheus/fangs-client.key

For mTLS deployments, Prometheus needs a client cert signed by the orchestrator's -tls-client-ca to scrape. Issue one via docs/scripts/gen-tls.sh with RUNNER_ID=prom-scraper.
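
When a scrape fails, it can help to reproduce it outside Prometheus. The following is a minimal sketch of an mTLS client using only Go's standard library. It assumes the same certificate paths and target as the scrape config above; `newScrapeClient` is a hypothetical helper, not part of FANGS. The server certificate must be valid for the hostname in the URL, or the handshake will fail.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// newScrapeClient builds an HTTP client that trusts the given CA and
// presents the given client certificate during the TLS handshake.
func newScrapeClient(caPEM, certPEM, keyPEM []byte) (*http.Client, error) {
	pair, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("load client key pair: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in PEM input")
	}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,
				Certificates: []tls.Certificate{pair},
				MinVersion:   tls.VersionTLS12,
			},
		},
	}, nil
}

func main() {
	// Paths and target mirror the sample scrape config; adjust as needed.
	paths := []string{
		"/etc/prometheus/fangs-ca.crt",
		"/etc/prometheus/fangs-client.crt",
		"/etc/prometheus/fangs-client.key",
	}
	pems := make([][]byte, len(paths))
	for i, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, "read TLS material:", err)
			return
		}
		pems[i] = b
	}
	client, err := newScrapeClient(pems[0], pems[1], pems[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	resp, err := client.Get("https://fangs.internal:8443/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
	io.Copy(os.Stdout, resp.Body) // dump the raw exposition text
}
```

A handshake error here usually points at the certificate chain or hostname rather than at the Prometheus configuration.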

For plain-HTTP development:

scrape_configs:
  - job_name: fangs
    static_configs:
      - targets: ['127.0.0.1:8443']

Recommended alerts

groups:
  - name: fangs
    rules:
      - alert: FangsNoRunners
        expr: fangs_runners_registered == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: No FANGS runners registered for 2 minutes

      - alert: FangsRingbufOverflow
        expr: rate(fangs_events_dropped_total[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: FANGS sensor ringbuf overflowing
          description: Events being dropped at probe time; tune ringbuf or reduce concurrent scans

      - alert: FangsNotifierFailing
        expr: rate(fangs_notifications_total{status="permanent"}[10m]) > 0
        labels: {severity: warning}
        annotations:
          summary: Notifier delivery permanently failing
          description: A configured webhook target is returning 4xx persistently

      - alert: FangsHighSeverityDeviation
        expr: increase(fangs_deviations_written_total{severity=~"high|critical"}[5m]) > 0
        labels: {severity: critical}
        annotations:
          summary: High/critical FANGS deviation
          description: Investigate the pending review queue

What's NOT instrumented

Some signals would be useful but aren't surfaced today:

  • Per-run differ duration (fangs_differ_duration_seconds) — histogram showing how long deltas take.
  • Per-package run frequency — high-cardinality (one label per watched package) so it's intentionally absent; the DB has the data.
  • Sandbox lifecycle latency — image pull duration, container exit code distribution. Useful for operators tuning sandbox limits.

Each is a small addition. PRs welcome.

Tracing

Not wired today. OpenTelemetry hooks would be valuable at the per-run state machine + per-target Notifier retry loops. v2 item.