Operating

cyb3rjerry edited this page May 23, 2026 · 1 revision

Day-2 reference. Workflows, triage, troubleshooting.

Mental model

FANGS is a delta detector. It compares each run against the package's rolling baseline and surfaces what's new. Three states a run can be in:

| State | Meaning |
| --- | --- |
| First run for a package | Baseline gets seeded; zero deviations regardless |
| Subsequent zero-deviation run | Auto-promoted to baseline (occurrence_count bump only) |
| Subsequent any-deviation run | Lands in fangs pending for operator decision |
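
The three states can be read as a small decision function. The sketch below is illustrative only: `RunState`, `classify_run`, and its parameters are hypothetical names chosen for this page, not FANGS internals.

```python
from enum import Enum


class RunState(Enum):
    SEEDS_BASELINE = "first run: seeds the baseline"
    AUTO_PROMOTED = "zero deviations: auto-promoted to baseline"
    PENDING = "deviations found: waits in fangs pending"


def classify_run(package_has_baseline: bool, deviation_count: int) -> RunState:
    """Map a finished run onto one of the three states described above."""
    if not package_has_baseline:
        # First run for a package: there is nothing to diff against yet,
        # so the run seeds the baseline regardless of what it observed.
        return RunState.SEEDS_BASELINE
    if deviation_count == 0:
        return RunState.AUTO_PROMOTED
    return RunState.PENDING
```

For example, `classify_run(True, 3)` returns `RunState.PENDING`, the only state that needs an operator decision.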

Adding a package to the watch list

fangs package add lodash

What happens:

  1. Validates lodash exists on registry.npmjs.org. Bogus names get rejected before the DB row lands.
  2. Inserts into the packages table.
  3. Records the current dist-tags.latest as last_seen_version so the next watcher poll doesn't re-flag it as "new."
  4. Queues an immediate kickoff scan of that version. First run → seeds baseline.

Subsequent releases trigger scans automatically via the watcher (default poll interval: 5 minutes).
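
To check by hand what steps 1 and 3 would see, the public npm registry serves one JSON document per package (a "packument") whose `dist-tags.latest` field names the current release. The sketch below assumes the public registry endpoint; `fetch_packument` and `latest_version` are hypothetical helpers, not part of FANGS.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

REGISTRY = "https://registry.npmjs.org"


def latest_version(packument: dict) -> str:
    """Return dist-tags.latest from an npm packument."""
    return packument["dist-tags"]["latest"]


def fetch_packument(name: str, registry: str = REGISTRY) -> Optional[dict]:
    """Fetch the packument for `name`; return None if the registry answers 404."""
    # Scoped names such as @scope/pkg need the slash percent-encoded.
    url = f"{registry}/{urllib.parse.quote(name, safe='@')}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # package name does not exist on this registry
        raise
```

Calling `latest_version(fetch_packument("lodash"))` needs network access; a `None` from `fetch_packument` corresponds to the "bogus name" case.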

Useful subcommands:

fangs package watched       # current watch list
fangs package list          # packages with runs (post-scan summary view)
fangs package remove lodash # stop watching

Skip the kickoff scan if you want — useful for offline / batch flows:

fangs package add lodash -skip-initial-scan

One-off scans

fangs scan submit -package axios -version 1.7.7

Validates axios@1.7.7 exists on the registry, then POSTs /v1/scans to the orchestrator (default http://127.0.0.1:8443). Returns the assigned run_id + the watch URLs.

Useful flags:

| Flag | Purpose |
| --- | --- |
| -orchestrator URL | when the orchestrator isn't on localhost |
| -runner ID | target a specific runner (default: the operator's hostname) |
| -duration 90s | longer sandbox time for heavy installs |
| -skip-registry-validate | useful for offline tests against private fake-registries |

Reviewing pending findings

The triage queue lives in two places — one for browsing, one for scripting:

# Browser
http://127.0.0.1:8443/ui/pending

# CLI
fangs pending
fangs pending -package axios
fangs pending -min-severity high
fangs pending --json | jq .   # for piping into other tools

Each row shows the run, package, version, deviation count, max severity, last detected, and the literal fangs baseline promote … command you can copy-paste.

Triaging a deviation

Four steps:

  1. Look at the package + version. Is it a known dependency? Was there a recent advisory? Sometimes the answer comes from context alone.
  2. Look at the deviation list for the run. fangs deviation list -run <run_id> or click into the run on the UI.
  3. Look at the evidence event. fangs deviation show <id> prints the full kernel-event JSON that triggered the finding. Or click "→ lineage" in the UI to see the process tree.
  4. Decide. Three real options:
    • Promote — fangs baseline promote <run_id>. The whole run's fingerprints merge into baseline, deviation rows clear.
    • Allowlist the noise — fangs allow add with the appropriate kind. See below.
    • Investigate — pull the package tarball, do offline analysis. Nothing in FANGS changes until you act.

Allowlisting

Suppress known-benign noise before it becomes a deviation:

# Global — applies to every package
fangs allow add -kind cidr -value 10.0.0.0/8 -note "internal"

# Per-package — only applies to runs of `axios`
fangs allow add -kind sni -value telemetry.example -package axios

# Path exclusion — silence noisy file_access events
fangs allow add -kind path -value /opt/vendor/ -note "trusted vendored dir"

Three kinds map to three Differ categories:

| kind | suppresses | example |
| --- | --- | --- |
| cidr | net_new_destination | 10.0.0.0/8 |
| path | fs_new_path_* | /opt/vendor/ |
| sni | net_new_https_host | telemetry.example |

The hardcoded CDN allowlist (Cloudflare/GitHub/Google/Fastly/CloudFront) applies underneath — entries here are additive.
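
To reason about whether an entry would cover an observed value, the sketch below models one plausible matching rule per kind. The rules are assumptions for illustration (network containment for cidr, string prefix for path, case-insensitive exact match for sni); the Differ's real matching logic may differ, and `is_allowlisted` is a hypothetical name.

```python
import ipaddress


def is_allowlisted(kind: str, value: str, observed: str) -> bool:
    """Return True if `observed` is covered by a single allowlist entry.

    Assumed rules: cidr = network containment, path = string prefix,
    sni = case-insensitive exact hostname match.
    """
    if kind == "cidr":
        return ipaddress.ip_address(observed) in ipaddress.ip_network(value)
    if kind == "path":
        return observed.startswith(value)
    if kind == "sni":
        return observed.lower() == value.lower()
    raise ValueError(f"unknown allowlist kind: {kind}")
```

Under these assumptions, `is_allowlisted("cidr", "10.0.0.0/8", "10.1.2.3")` is true, while a destination of `192.168.1.1` would still surface as a deviation.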

fangs allow list — show all entries. Config-managed ones (from config/orchestrator.yaml) have cfg…-prefixed IDs.

fangs allow remove <id-prefix> — git-style short ID match.

UI: /ui/allowlist.

Promoting a baseline manually

When a clean release should join baseline but the run had deviations you've accepted as legitimate:

fangs baseline list -package lodash    # see what's in baseline
fangs baseline promote <run-id-prefix>

Promote re-extracts the run's fingerprints (with allowlists applied), merges them into baseline_fingerprints, marks the run is_baseline=true, and clears any deviation rows for it.
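
The effect of a promote can be pictured with an in-memory model. This is a sketch under stated assumptions: the `Run` class is hypothetical, the baseline is modelled as a fingerprint-to-occurrence_count mapping, and bumping the count on manual promote is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Run:
    run_id: str
    fingerprints: Set[str]  # assumed to be extracted with allowlists already applied
    deviations: List[str] = field(default_factory=list)
    is_baseline: bool = False


def promote(run: Run, baseline: Dict[str, int]) -> None:
    """Merge a run's fingerprints into the baseline and clear its deviations."""
    for fp in run.fingerprints:
        # New fingerprints start at 1; known ones get their count bumped.
        baseline[fp] = baseline.get(fp, 0) + 1
    run.is_baseline = True
    run.deviations.clear()
```

After `promote`, a later run that shows the same fingerprints has nothing new to report against the baseline.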

Configuring notifiers

Without a notifier, deviations sit in the DB waiting for someone to refresh the UI. With one, every run with ≥1 deviation fires one webhook per configured + enabled target.

# Slack
fangs notifier add -name soc-slack \
  -url 'https://hooks.slack.com/services/T.../B.../...' \
  -template slack

# Discord
fangs notifier add -name soc-discord \
  -url 'https://discord.com/api/webhooks/.../...' \
  -template discord

# Generic — for SIEM / event bus / Lambda
fangs notifier add -name siem \
  -url 'https://intake.internal/fangs' \
  -template generic \
  -secret-env FANGS_HMAC \
  -min-severity high

Knobs:

| Flag | Purpose |
| --- | --- |
| -template | one of slack, discord, generic |
| -secret-env ENV_VAR | env-var name holding HMAC secret (generic targets only) |
| -min-severity | only fire when ≥1 deviation has severity ≥ threshold |
| -headers JSON | extra HTTP headers as JSON object |
| -enabled=false | disable without removing |
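
The firing rule behind -min-severity and -enabled can be sketched as a predicate. The severity names and their ordering below are assumptions (only "high" appears on this page), and `should_fire` is a hypothetical name.

```python
from typing import Iterable

# Assumed ordering, lowest to highest; only "high" is confirmed by this page.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_fire(
    deviation_severities: Iterable[str],
    min_severity: str = "low",
    enabled: bool = True,
) -> bool:
    """Return True if a target should receive a webhook for this run."""
    severities = list(deviation_severities)
    if not enabled or not severities:
        # Disabled targets never fire, and neither do zero-deviation runs.
        return False
    threshold = SEVERITY_ORDER.index(min_severity)
    return any(SEVERITY_ORDER.index(s) >= threshold for s in severities)
```

With `-min-severity high`, a run whose deviations are all low or medium stays silent for that target but still shows up in the pending queue.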

Verify wiring:

fangs notifier test soc-slack            # fires a synthetic message
fangs notifier list
fangs notifier history -run <run_id>     # delivery attempts for one run

UI: /ui/notifiers.

See Notifier for retry policy, template internals, HMAC details.
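
On the receiving side of a generic target, signature checking typically looks like the sketch below. It assumes an HMAC-SHA256 over the raw request body, hex-encoded; the algorithm, encoding, and header used by FANGS are documented on the Notifier page and may differ. `sign` and `verify` are hypothetical names.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 over the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check a received signature against the one computed locally."""
    expected = sign(secret, body)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected, received_signature)
```

Verify against the exact bytes received, before any JSON parsing or re-serialization, since even whitespace changes alter the digest.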

Inspecting the runtime

UI overview (/ui/) shows:

  • Packages watched / packages ever tracked
  • Runs total / runs on baseline
  • Open deviations / packages affected
  • Lifetime events dropped (sensor ringbuf overflow indicator)
  • Runner pool with heartbeat freshness + active-run links
  • Recent runs + recent deviations

Prometheus at /metrics — see Metrics for every series.

CLI:

fangs run list -package lodash -limit 20
fangs run show <run-id-prefix>
fangs release list -package lodash

Common failures

"No runners registered"

The orchestrator received a scan request but has no runner. Cause: no fangs-runner is running, or its heartbeat went stale (>90s) and it got pruned. Restart the runner.

Stale runners on /ui/

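Runner crashed without deregistering. The pruner ticks every 30s and evicts any runner whose heartbeat is older than 90s, so the stale entry disappears on its own roughly 90 to 120 seconds after the last heartbeat.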
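
The interaction between the 90s staleness threshold and the 30s pruner tick can be worked through with a small model. It assumes ticks fall on a fixed 30s grid and that "stale" means strictly older than 90s; the function names are hypothetical.

```python
STALE_AFTER_S = 90  # heartbeat age beyond which a runner counts as stale
PRUNE_TICK_S = 30   # how often the pruner wakes up


def is_stale(last_heartbeat_s: float, now_s: float) -> bool:
    """A runner is stale once its last heartbeat is more than 90s old."""
    return now_s - last_heartbeat_s > STALE_AFTER_S


def eviction_time(last_heartbeat_s: float, first_tick_s: float = 0.0) -> float:
    """Return the time of the first pruner tick that sees the runner as stale."""
    tick = first_tick_s
    while not is_stale(last_heartbeat_s, tick):
        tick += PRUNE_TICK_S
    return tick
```

A runner whose last heartbeat lands exactly on a tick is evicted four ticks later, which is why a dead runner can linger on /ui/ for up to about two minutes.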

events_dropped > 0 on the dashboard

The eBPF ringbuf overflowed during one or more runs. Causes:

  1. Very-high-throughput sandboxes (e.g. ESM-only packages that open thousands of files at install).
  2. Slow consumer: the runner's event-stream HTTP POST is backed up. Check runner.log for a high batch_send_errors value on the "event streamer closed" lines.
  3. Orchestrator backpressure under heavy concurrent scans.

The ringbuf is 64 MB per probe (compile-time constant), so it cannot be resized at runtime. Most cases are resolved by tuning sandbox concurrency instead.

"container exit_code=137"

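OOM kill. Exit code 137 is 128 + 9, meaning the container's process received SIGKILL; here that means the sandbox hit its memory limit (Memory: 512 MB by default). For packages with heavy install steps (TypeScript, native bindings), raise the memory value in your scan submission's sandbox spec. Per-package memory override is a v2 item.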

"container exit_code=125"

Docker daemon couldn't start the container. Usually a permission problem on /var/run/docker.sock or a Docker daemon misconfiguration. Check journalctl -u docker.
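
Both exit codes follow general conventions: Docker reserves 125 for failures of `docker run` itself, and codes above 128 encode 128 plus the number of the signal that killed the process. A minimal decoder, with `describe_exit_code` as a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128+N signal convention."""
    if code == 0:
        return "exited cleanly"
    if code == 125:
        return "docker run itself failed (the container never started)"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # not a named signal on this platform
        return f"killed by {name}"
    return f"process exited with status {code}"
```

The decoder only names the signal; deciding that a SIGKILL was an OOM kill still needs the memory limit context described above.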

"AddCgroup: cgroup id already registered"

Two scans tried to share the same cgroup_id — shouldn't happen in practice. Restart the runner.

Sensor probes failing to attach

Check the runner log for attach tracepoint lines. Common causes:

  • tracefs not mounted — runner auto-mounts at startup; manual fallback: sudo mount -t tracefs nodev /sys/kernel/tracing
  • libssl not loadable — Debian/Ubuntu/Kali libssl is mode 644; fix: sudo chmod +x /usr/lib/x86_64-linux-gnu/libssl.so.3
  • kprobe tcp_v4_connect attach failed — symbol absent on very old kernels; sensor logs warn-and-continue, io_uring TCP connects on this path go unobserved

Differ produces too many false positives

In order of preference:

  1. Allowlist the recurring noise.
  2. Promote a known-good run as baseline.
  3. Tighten path normalization in internal/orchestrator/differ/normalize.go — source edit, rebuild.
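
For option 3, the general idea of path normalization is to replace volatile path segments with placeholders so that otherwise identical paths produce the same fingerprint. The sketch below shows that idea in Python for illustration only; the rules and names are hypothetical and are not the contents of normalize.go.

```python
import re

# Hypothetical rules: each pattern replaces a volatile segment with a placeholder.
_RULES = [
    (re.compile(r"/proc/\d+/"), "/proc/<pid>/"),
    (re.compile(r"/tmp/npm-\d+-[0-9a-f]+"), "/tmp/npm-<tmp>"),
    (re.compile(r"\b[0-9a-f]{32,64}\b"), "<hash>"),
]


def normalize_path(path: str) -> str:
    """Apply every rule in order and return the normalized path."""
    for pattern, replacement in _RULES:
        path = pattern.sub(replacement, path)
    return path
```

Tighter rules mean fewer spurious fs_new_path_* deviations, at the cost of collapsing paths that might have been worth distinguishing.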

Backup + restore

SQLite: copy var/lib/fangs/fangs.db while the orchestrator isn't running, or use the sqlite3 .backup command for hot backups.

Postgres: standard pg_dump / pg_restore.

The DB is the only persistent state. Nothing else needs preserving between hosts.
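
For scripted SQLite hot backups, Python's standard library exposes the same online backup mechanism as the sqlite3 shell's .backup command. A minimal sketch, with `hot_backup` as a hypothetical helper and both paths supplied by the caller:

```python
import sqlite3


def hot_backup(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup takes a consistent snapshot even if other
        # connections keep writing to the source database.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Unlike a plain file copy, this does not require stopping the orchestrator first.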

Stopping cleanly

# Orchestrator: SIGINT/SIGTERM cleanly stops the HTTP listener.
sudo systemctl stop fangs-orchestrator
# Runner: SIGTERM cleans up the per-run cgroup parent before exit.
sudo systemctl stop fangs-runner

A hard kill (SIGKILL) leaves orphan cgroups at /sys/fs/cgroup/.../fangs/<run_id>/. Remove them with sudo rmdir once the container has exited.
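
Before removing anything, it helps to list which per-run cgroups are actually empty. The sketch below reads each directory's cgroup.procs file, which lists the PIDs in a cgroup, and reports directories with none. `orphan_cgroups` is a hypothetical helper; the parent path is passed in by the caller because it depends on the host, and the check is only meaningful while the runner is stopped.

```python
import os
from typing import List


def orphan_cgroups(fangs_parent: str) -> List[str]:
    """List per-run cgroup directories under `fangs_parent` holding no processes."""
    orphans = []
    for entry in sorted(os.scandir(fangs_parent), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        procs_file = os.path.join(entry.path, "cgroup.procs")
        try:
            with open(procs_file) as fh:
                if fh.read().strip() == "":
                    orphans.append(entry.path)  # no PIDs left in this cgroup
        except FileNotFoundError:
            continue  # not a cgroup directory
    return orphans
```

The function only reports candidates; the actual removal stays a deliberate `sudo rmdir` per directory.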

Retention

Raw events are auto-pruned after -retention-days (default 90). Pruner runs daily; first prune fires 30s after orchestrator startup so operators can verify it's wired.

Deviation-evidence events are PINNED forever — the "click for evidence" link on historical findings keeps working past the retention horizon.

Baselines never expire. They're the load-bearing state.

Disable pruning entirely: -retention-days=0.
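
Taken together, the retention rules reduce to one predicate per raw event. The sketch below is a model of the rules stated in this section, not the pruner's code; `should_prune` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def should_prune(
    event_time: datetime,
    now: datetime,
    retention_days: int = 90,
    pinned_as_evidence: bool = False,
) -> bool:
    """Return True if a raw event is eligible for pruning."""
    if retention_days == 0:
        return False  # -retention-days=0 disables pruning entirely
    if pinned_as_evidence:
        return False  # deviation-evidence events are kept forever
    return now - event_time > timedelta(days=retention_days)
```

Baseline fingerprints are outside this predicate altogether, since baselines never expire.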
