Architecture

cyb3rjerry edited this page May 23, 2026 · 1 revision

The system design behind FANGS. Read this when you want to understand WHY the pieces fit together the way they do. For operator-facing material see Installation, Configuration, Operating.

Mental model

FANGS is a delta detector, not a malware detector. We never classify "is this code malicious." We only answer "did this run behave differently from the last N versions of this same package."

Every component is shaped by that goal. The hard work that goes into a malware classifier (AV signatures, sandbox-evasion games, ML classification arms races) gets skipped; the easy work of comparing sequences of syscalls becomes the entire product.

Top-level architecture

flowchart TB
    subgraph EXT["External"]
        NPM["npm registry"]
        WH["Webhook receivers<br/>Slack · Discord · generic"]
        PROM["Prometheus scraper"]
    end

    subgraph CFG["Config / state"]
        OCONF["config/orchestrator.yaml<br/>watched_paths · allow"]
        DB[("storage<br/>sqlite · postgres")]
    end

    subgraph ORCH["fangs-orchestrator"]
        direction TB
        API["HTTP API<br/>scans · events · result · heartbeat"]
        W["Watcher<br/>poll 5m"]
        D["Differ<br/>fingerprint · diff · auto-promote"]
        N["Notifier<br/>per-target retry · HMAC opt-in"]
        P["Pruner<br/>retention · stale runners"]
        UI["/ui/<br/>read-only dashboard"]
        MET["/metrics<br/>Prometheus"]
    end

    subgraph RNNR["fangs-runner (Linux + Docker)"]
        direction TB
        AG["agent<br/>register · long-poll · heartbeat · result"]
        SB["sandbox driver<br/>hardened Docker container"]
        SEN["eBPF sensor<br/>persistent attach · cgroup ancestor walk"]
    end

    subgraph CLI["fangs (operator CLI)"]
        FCLI["package · run · deviation · baseline<br/>notifier · allow · scan submit · pending"]
    end

    NPM -.poll.-> W
    W -- submit job --> API
    API -- long-poll --> AG
    AG -- POST events / result --> API
    SEN -- ringbuf --> AG
    SB -- spawn container --> SEN
    API --> D
    D -- novel finding --> N
    N --> WH
    OCONF -.startup.-> ORCH
    DB <--> ORCH
    DB <--> FCLI
    MET --> PROM
    UI -.browser.-> DB

Two long-running processes (orchestrator + runner), one CLI, one optional config file, one storage backend.

Process boundaries

| Component | Process | Trust |
| --- | --- | --- |
| Orchestrator | fangs-orchestrator (long-running) | Trusted; operators run it |
| Runner | fangs-runner (long-running, one per Linux host) | Trusted but disposable; runs attacker code in sandboxes |
| Sandbox container | child of runner via Docker | Hostile; assume every byte is attacker-controlled |
| Operator CLI | fangs (short-lived) | Trusted; talks directly to storage backend |
| eBPF sensor | runs in runner process; probes attached in kernel | Trusted; cannot be subverted by sandbox code |

The sensor lives in the runner process but its probes run in kernel space attached to syscall tracepoints. The sandboxed code cannot subvert what bpf_get_current_* reports about its own syscalls — the kernel is the source of truth.

Components

internal/orchestrator/api/ — HTTP front-end

Endpoints (plain HTTP by default; HTTPS/mTLS when TLS flags set):

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /v1/health | liveness probe |
| POST | /v1/runners/register | runner introduces itself |
| POST | /v1/runners/{id}/heartbeat | LastSeen refresh + active-run reporting |
| GET | /v1/runners/{id}/jobs | long-poll for the next job (25s server cap) |
| POST | /v1/scans | operator (or watcher) queues a scan |
| POST | /v1/runs/{run_id}/events | runner streams NDJSON event batches |
| POST | /v1/runs/{run_id}/result | runner posts final ScanResult |

The handler for /v1/scans (SubmitScan) is shared by the HTTP path and the in-process watcher. It injects the orchestrator's configured default watched_paths onto jobs that arrived without their own list — single source of truth for what every scan watches.
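
The defaulting step can be pictured with a minimal Go sketch. The `Job` struct here is a hypothetical, trimmed-down stand-in for the real wire type, and `applyDefaultWatchedPaths` and the example paths are illustrative names, not taken from the FANGS source.

```go
package main

import "fmt"

// Job is a hypothetical, trimmed-down stand-in for the real wire type.
type Job struct {
	Package      string
	Version      string
	WatchedPaths []string
}

// applyDefaultWatchedPaths returns a copy of job whose WatchedPaths falls
// back to the configured defaults when the submitter supplied none.
func applyDefaultWatchedPaths(job Job, defaults []string) Job {
	if len(job.WatchedPaths) == 0 {
		// Copy so later mutation of the job cannot alter the shared defaults.
		job.WatchedPaths = append([]string(nil), defaults...)
	}
	return job
}

func main() {
	// Example paths only; the real defaults come from the orchestrator config.
	defaults := []string{"/etc", "/root/.ssh"}
	job := applyDefaultWatchedPaths(Job{Package: "axios", Version: "1.7.7"}, defaults)
	fmt.Println(job.WatchedPaths)
}
```

Because both the HTTP path and the watcher would call the same function, a job with an explicit list keeps it and every other job inherits the configured one.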

internal/orchestrator/watcher/ — autonomous trigger

Polls registry.npmjs.org every 5 minutes (configurable) for each row in the packages table. On a new dist-tags.latest, records a release row and invokes SubmitScan to dispatch a job. Detects "new" via comparison against packages.last_seen_version, so a fresh package add followed by an immediate scan doesn't double-queue.
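
A rough sketch of the "new release" check, assuming the comparison is a plain string inequality against the stored version. The `Package` struct and `checkRelease` are hypothetical names; only the `last_seen_version` column comes from the text above.

```go
package main

import "fmt"

// Package mirrors only the column the text mentions (packages.last_seen_version).
type Package struct {
	Name            string
	LastSeenVersion string
}

// checkRelease compares the registry's dist-tags.latest against the stored
// version. It reports true (and updates the row) only when the version changed,
// so polling the same version twice never queues a second scan.
func checkRelease(pkg *Package, latest string) bool {
	if latest == "" || latest == pkg.LastSeenVersion {
		return false
	}
	pkg.LastSeenVersion = latest
	return true
}

func main() {
	pkg := &Package{Name: "lodash", LastSeenVersion: "4.18.1"}
	fmt.Println(checkRelease(pkg, "4.18.2")) // first poll after the release
	fmt.Println(checkRelease(pkg, "4.18.2")) // next poll, same version
}
```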

Per D31, v1 supports the official npm registry only; alternate registries (Verdaccio, Artifactory) deferred to v2 via a Registry interface.

internal/orchestrator/differ/ — delta analysis

Runs once per run, 2 seconds after the most recent event batch for that run arrives; each new batch resets the debounce. Steps:

  1. Load all events for the run + the package's baseline fingerprints.

  2. Build a filter from the merged (CDN allowlist + operator allowlist) entries.

  3. Extract fingerprints from events. Six categories — see Differ-Rules for details.

  4. Apply normalization (PID/temp-file/cacache rewriting) so per-run noise doesn't masquerade as a deviation.

  5. Apply the operator + hardcoded allowlists — matching fingerprints get suppressed.

  6. Diff against the baseline:

    • First run for the package: seed baseline + mark is_baseline=true. Zero deviations regardless of behavior.
    • Subsequent zero-deviation run: auto-promote to baseline.
    • Subsequent any-deviation run: not promoted; is_baseline stays false. Each novel fingerprint becomes a row in the deviations table.

    Idempotent — DeleteDeviationsForRun clears prior findings before writing the fresh set.
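
The steps above can be condensed into a small Go sketch. It assumes fingerprints are plain strings and uses a single illustrative normalization rule; the fingerprint format, `normalize`, `diff`, and `decide` are all hypothetical and stand in for the real rules described in Differ-Rules.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// pidPattern is an illustrative normalization rule: it rewrites numeric
// /proc/<pid> segments so per-run PIDs do not look like new behavior.
var pidPattern = regexp.MustCompile(`/proc/\d+`)

func normalize(fp string) string {
	return pidPattern.ReplaceAllString(fp, "/proc/<pid>")
}

// diff returns the normalized fingerprints that are neither allowlisted nor
// already in the baseline, deduplicated and sorted for stable output.
func diff(observed []string, baseline, allow map[string]bool) []string {
	seen := map[string]bool{}
	var novel []string
	for _, raw := range observed {
		fp := normalize(raw)
		if allow[fp] || baseline[fp] || seen[fp] {
			continue
		}
		seen[fp] = true
		novel = append(novel, fp)
	}
	sort.Strings(novel)
	return novel
}

// decide applies the three baseline rules: a first run always seeds the
// baseline, a later clean run is auto-promoted, a later deviating run is not.
func decide(firstRun bool, novel []string) (isBaseline bool, deviations []string) {
	if firstRun {
		return true, nil
	}
	return len(novel) == 0, novel
}

func main() {
	baseline := map[string]bool{"open:/proc/<pid>/status": true}
	allow := map[string]bool{"dns:registry.npmjs.org": true}
	observed := []string{
		"open:/proc/4242/status",
		"dns:registry.npmjs.org",
		"dns:evil.example",
	}
	isBaseline, deviations := decide(false, diff(observed, baseline, allow))
	fmt.Println(isBaseline, deviations)
}
```

Normalizing before the set lookups is what keeps a changed PID from registering as a deviation; the order mirrors steps 4 through 6.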

internal/orchestrator/notifier/ — webhook fan-out

Per-run-summary semantics: when the Differ writes ≥1 deviation, one webhook fires per configured + enabled target. Three templates:

  • slack — Block Kit + mobile-friendly fallback text
  • discord — embed with color-by-severity
  • generic — FANGS-native envelope, the SIEM/intake target

Each delivery is logged in the notifications table for audit. Retry policy: 5 attempts max, exponential backoff (1s × 2^(n-1) ± 25% jitter), 4xx (non-408/429) is permanent. HMAC opt-in via per-target secret_env; Slack/Discord skip HMAC (URL is the secret).
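
As a sketch of the retry arithmetic, assuming n counts retries starting at 1 and that jitter is drawn uniformly from ±25%. `backoff` and `permanentFailure` are hypothetical helpers, not the notifier's actual functions.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

const maxAttempts = 5

// backoff returns the delay before retry n (1-based): 1s * 2^(n-1), scaled
// by (1 + jitter), where the caller picks jitter in [-0.25, 0.25].
func backoff(n int, jitter float64) time.Duration {
	base := time.Second << (n - 1)
	return time.Duration(float64(base) * (1 + jitter))
}

// permanentFailure reports whether a status code should stop retries:
// any 4xx except 408 (Request Timeout) and 429 (Too Many Requests).
func permanentFailure(status int) bool {
	if status == http.StatusRequestTimeout || status == http.StatusTooManyRequests {
		return false
	}
	return status >= 400 && status < 500
}

func main() {
	// Five attempts means at most four waits between them.
	for n := 1; n < maxAttempts; n++ {
		jitter := (rand.Float64() - 0.5) / 2 // uniform in [-0.25, 0.25)
		fmt.Printf("wait before attempt %d: %v\n", n+1, backoff(n, jitter))
	}
	fmt.Println(permanentFailure(404), permanentFailure(429))
}
```

Without jitter the waits are 1s, 2s, 4s, and 8s, so a target that stays down exhausts its attempts after roughly 15 seconds of waiting.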

See Notifier for template internals + delivery semantics.

internal/orchestrator/core/ — job dispatcher

In-memory per-runner FIFO queue with long-poll wake channels. The queue lasts only for the orchestrator's process lifetime and is not persisted; runs queued while a runner is offline are lost (acceptable for v1).

Heartbeat pruner sits beside it: ticks every 30s, evicts runners whose LastSeen is older than 90s.
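
The eviction rule can be sketched as follows. `Registry`, `Heartbeat`, `Prune`, and the `at` helper are hypothetical names; only the 90s threshold comes from the text, and the 30s ticker that would call `Prune` is left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const staleAfter = 90 * time.Second

// Registry tracks the last heartbeat time per runner ID.
type Registry struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func NewRegistry() *Registry {
	return &Registry{lastSeen: map[string]time.Time{}}
}

// Heartbeat refreshes a runner's LastSeen timestamp.
func (r *Registry) Heartbeat(id string, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[id] = now
}

// Prune evicts every runner whose last heartbeat is older than staleAfter
// and returns the evicted IDs.
func (r *Registry) Prune(now time.Time) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var evicted []string
	for id, seen := range r.lastSeen {
		if now.Sub(seen) > staleAfter {
			delete(r.lastSeen, id) // deleting during range is safe in Go
			evicted = append(evicted, id)
		}
	}
	return evicted
}

// at builds a fixed timestamp so the demo does not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	reg := NewRegistry()
	reg.Heartbeat("r1", at(0))
	reg.Heartbeat("r2", at(60))
	fmt.Println(reg.Prune(at(100))) // r1 is 100s stale, r2 only 40s
}
```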

internal/orchestrator/storage/ — persistence

Dual backend (sqlite/ + postgres/) behind one Backend interface. Embedded migrations under migrations/{sqlite,postgres}/. A shared contract suite (storagetest/) runs against both backends in CI.

See Storage-Schema for every table + every column.

internal/orchestrator/metrics/ — Prometheus surface

Counters + gauges registered at startup; mounted at /metrics on the same listener as the API. Cardinality stays bounded — labels are enumerations, never operator-supplied strings.

See Metrics for every series.

internal/orchestrator/ui/ — read-only dashboard

Server-rendered Go templates. No JS framework, no build step. A tiny vanilla JS file (static/refresh.js, ~80 lines) gives the dashboard two niceties:

  • Auto-refresh for time-changing pages — any element with data-refresh-url re-fetches on an interval, optional CSS-selector extract for partial swaps. Overview refreshes every 5s; pending queue every 10s.
  • In-place chip-filter navigation — clicks on a.chip inside a data-dynamic-nav container fetch the URL, extract the same container from the response, swap innerHTML, pushState the URL. Expanded <details> are snapshotted by data-pid pre-swap and re-opened post-swap so filters don't collapse the rows you were looking at.

Pages: /ui/ overview, /ui/packages/{name}, /ui/runs/{id}, /ui/runs/{id}/lineage (indented process tree), /ui/deviations[/{id}], /ui/events/{id}, /ui/pending (triage queue), /ui/allowlist, /ui/notifiers, /ui/config.

Read-only by design — every state change goes through the fangs CLI.

internal/runner/agent/ — orchestrator HTTP client

Owns the registration handshake, the long-poll job loop, the event streamer, the heartbeat ticker, and the final ScanResult POST. The event streamer batches at 250 ms or 64 events (whichever first) into NDJSON; sequence numbers attached for future per-run dedup.
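
A minimal sketch of the size-or-time batching rule, assuming events arrive on a channel and the flush callback does the NDJSON encoding and POST. `batch` and its signature are hypothetical; the 64-event and 250 ms limits come from the text.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxBatch = 64
	maxWait  = 250 * time.Millisecond
)

// batch reads events from in and calls flush when 64 events are buffered or
// 250 ms have passed since the first buffered event, whichever comes first.
// It flushes any remainder and returns when in is closed.
func batch(in <-chan string, flush func([]string)) {
	var buf []string
	var deadline <-chan time.Time // nil while the buffer is empty, so it never fires

	emit := func() {
		if len(buf) > 0 {
			flush(buf)
			buf = nil
		}
		deadline = nil
	}

	for {
		select {
		case ev, ok := <-in:
			if !ok {
				emit() // flush the tail before exiting
				return
			}
			if len(buf) == 0 {
				deadline = time.After(maxWait) // clock starts at the first event
			}
			buf = append(buf, ev)
			if len(buf) == maxBatch {
				emit()
			}
		case <-deadline:
			emit()
		}
	}
}

func main() {
	in := make(chan string, 130)
	for i := 0; i < 130; i++ {
		in <- fmt.Sprintf("event-%d", i)
	}
	close(in)
	batch(in, func(b []string) { fmt.Println("flushed", len(b), "events") })
}
```

Starting the timer at the first buffered event, rather than on a fixed tick, bounds how long any single event waits before it is sent.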

internal/runner/sandbox/ — Docker driver

Spawns one container per scan via stdlib net/http against /var/run/docker.sock (no Docker SDK). Hardened HostConfig — see Configuration for the full list.

The runner pre-creates /sys/fs/cgroup/.../fangs/<run_id>/, registers its inode in the BPF CGMAP, then launches the container with CgroupParent set to that path. The container's processes nest under the registered cgroup, and the BPF lookup_cgroup walks ancestors so events fire from the very first syscall in the container — no race window between docker-start and sensor-attach.
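
The ancestor walk can be modeled in user space. This is a Go simulation of the idea, not the BPF program itself: cgroup IDs, the parent map, and `lookupCgroup`'s signature are all assumptions, and the depth cap reflects the general requirement that loops in BPF programs be bounded.

```go
package main

import "fmt"

// lookupCgroup walks from a process's cgroup up through its ancestors and
// returns the run ID of the first registered cgroup it meets. maxDepth bounds
// the walk, as an in-kernel loop would have to be bounded.
func lookupCgroup(id uint64, parent map[uint64]uint64, registered map[uint64]string, maxDepth int) (string, bool) {
	for depth := 0; depth < maxDepth; depth++ {
		if run, ok := registered[id]; ok {
			return run, true
		}
		next, ok := parent[id]
		if !ok {
			return "", false // reached the root without a match
		}
		id = next
	}
	return "", false
}

func main() {
	// Hypothetical hierarchy: 10 is the pre-created run cgroup, 20 the
	// container's cgroup under it, 30 a nested cgroup inside the container.
	parent := map[uint64]uint64{30: 20, 20: 10}
	registered := map[uint64]string{10: "run-a"}

	fmt.Println(lookupCgroup(30, parent, registered, 8))
	fmt.Println(lookupCgroup(99, parent, registered, 8))
}
```

Because only the parent cgroup is registered, and it is registered before the container starts, any process created beneath it matches without a further map update.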

internal/runner/sensor/ — eBPF capture library

Probes attach ONCE at runner startup (Sensor.New). Per-job: AddCgroup populates the in-kernel filter; the container runs nested under that cgroup; events stream out via a 64 MB ringbuf; RemoveCgroup on teardown.

See Sensor-Probes for the full probe table + per-probe semantics.

internal/shared/proto/ — wire types

Binary event types matching the BPF C structs (OpenatEvent, ExecEvent, etc.) + JSON-encoded protocol types (Job, ScanResult, RunnerRegistration, Heartbeat, EventEnvelope). The runner does the parsing pass that adds human-readable string fields (PathName, CommStr, DestIP, SNI, QName, BinaryPathStr) before streaming events upstream.
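
To make the NDJSON framing concrete, here is a sketch that encodes a batch of envelopes one JSON object per line. The `EventEnvelope` fields and JSON keys shown are assumptions for illustration; the real type's fields are not given on this page. The sketch relies on `json.Encoder.Encode` appending a newline after each value.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// EventEnvelope here carries hypothetical fields; only the type name and the
// PathName field name appear in the text above.
type EventEnvelope struct {
	Seq      uint64 `json:"seq"`
	Kind     string `json:"kind"`
	PathName string `json:"path_name,omitempty"`
}

// encodeNDJSON writes one JSON object per line. Encoder.Encode terminates
// each value with a newline, which is exactly the NDJSON framing.
func encodeNDJSON(events []EventEnvelope) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, ev := range events {
		if err := enc.Encode(ev); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	out, err := encodeNDJSON([]EventEnvelope{
		{Seq: 1, Kind: "openat", PathName: "/etc/passwd"},
		{Seq: 2, Kind: "exec"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```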

Sequence diagrams

Autonomous: new release → deviation → operator decision

sequenceDiagram
    participant W as Watcher
    participant API as Orchestrator API
    participant DISP as Dispatcher
    participant R as Runner
    participant SEN as eBPF Sensor
    participant DB as Storage
    participant DIF as Differ
    participant NOT as Notifier
    participant OP as Operator

    W->>API: GET registry.npmjs.org/lodash<br/>(every 5m)
    API->>DB: UpdatePackageCheck, RecordRelease
    W->>API: SubmitScan(lodash@4.18.2)
    API->>DISP: QueueJob(target_runner, job)
    R->>API: GET /v1/runners/r1/jobs (long-poll)
    DISP-->>R: Job
    R->>R: CreateParentCgroup, AddCgroup(sensor)
    R->>R: docker run node:20-slim<br/>npm install lodash@4.18.2
    SEN-->>R: events (file/exec/net/dns/tls)
    R->>API: POST /v1/runs/<id>/events (NDJSON)
    API->>DB: AppendEvents
    R->>API: POST /v1/runs/<id>/result<br/>(ok, 312 emitted, 0 dropped)
    API->>DB: RecordScanResult
    Note over API,DIF: 2s debounce after last batch
    API->>DIF: AnalyzeRun(run_id)
    DIF->>DB: ListEventsByRun, LoadBaseline
    DIF->>DIF: Extract fingerprints, apply allowlist, diff
    DIF->>DB: WriteDeviations [3 rows]
    DIF->>NOT: Trigger(run_id)
    NOT->>NOT: Render per-target template
    NOT->>OP: webhook (Slack) — "lodash@4.18.2: 3 findings"
    OP->>API: GET /ui/pending (or fangs pending)
    API->>OP: Run row + 3 findings + promote command
    OP->>DB: fangs baseline promote <run>  OR  investigate

Manual scan via CLI

sequenceDiagram
    participant OP as Operator
    participant CLI as fangs CLI
    participant NPM as npm registry
    participant API as Orchestrator API
    participant DB as Storage

    OP->>CLI: fangs scan submit -package axios -version 1.7.7
    CLI->>NPM: GET /axios (validate version exists)
    NPM-->>CLI: 200 + metadata
    CLI->>API: POST /v1/scans (job WITHOUT watched_paths)
    API->>API: SubmitScan — stamps default watched_paths from config
    API->>DB: CreateRun (state=pending)
    API-->>CLI: {queued: true, run_id: 18b2...}
    CLI-->>OP: queued run_id=18b2089cca... watch: /ui/runs/18b2089cca...

Run lifecycle

   pending      ← CreateRun on /v1/scans
      ↓
   (runner picks up via long-poll, runs the sandbox)
      ↓
   (events stream into /v1/runs/<id>/events; Differ runs 2s after last batch)
      ↓
   ─── if any deviation: deviation rows written, runs.is_baseline=false ───
   ─── if zero:         is_baseline=true (auto-promote)                ───
      ↓
   done | failed | timeout   ← RecordScanResult finalizes the row

done, failed, and timeout are the terminal states. failed carries a failure_reason. The runner POSTs the final state via /v1/runs/<id>/result so transitions are runner-driven, not inferred from event-stream EOF.
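
The lifecycle reads naturally as a tiny state machine. In this sketch the `RunState` type and `finalize` function are hypothetical, and requiring a non-empty reason for failed runs is an assumption drawn from "failed carries a failure_reason", not a documented validation rule.

```go
package main

import (
	"errors"
	"fmt"
)

type RunState string

const (
	Pending RunState = "pending"
	Done    RunState = "done"
	Failed  RunState = "failed"
	Timeout RunState = "timeout"
)

// Terminal reports whether a run in this state can no longer change.
func (s RunState) Terminal() bool {
	return s == Done || s == Failed || s == Timeout
}

// finalize applies the runner-reported outcome to a run. It rejects a second
// finalization, a non-terminal outcome, and a failed run with no reason.
func finalize(current, reported RunState, failureReason string) (RunState, error) {
	if current.Terminal() {
		return current, errors.New("run already finalized")
	}
	if !reported.Terminal() {
		return current, fmt.Errorf("%q is not a terminal state", reported)
	}
	if reported == Failed && failureReason == "" {
		return current, errors.New("failed runs need a failure_reason")
	}
	return reported, nil
}

func main() {
	state, err := finalize(Pending, Failed, "npm install exited 1")
	fmt.Println(state, err)

	_, err = finalize(Done, Timeout, "")
	fmt.Println(err)
}
```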

What this design explicitly is NOT

  • Not an in-product malware classifier. No signatures, no heuristics, no ML on event streams. The whole product is delta-vs-baseline. If you need malware classification, feed deviations into your existing AV pipeline as the "look-at-this" signal.
  • Not a runtime-protection agent. FANGS observes sandboxed scans; it doesn't gate npm install on production hosts or block CI.
  • Not a multi-tenant system. One DB, one watch list. Add tenant columns when team adoption forces the question.
  • Not magic at first-watch. The first run for a package becomes baseline regardless of what it observes. Bringing your own trust at watch-add time is a load-bearing assumption. v2 may require K consecutive zero-deviation runs before auto-promotion.
