full-stack network + host monitoring for the edge: every signal on one timeline, in stdlib python and ~30 MB of ram. no cloud and no agent to pay for. collection/ship/hub are pure stdlib; you add only the probe tools (fping/mtr/iperf3/iw) and a plotting lib for graphs.
smokemon watches the network and the box it runs on (ping loss & latency spread, bandwidth, http breakdown, per-hop routes, wifi, throughput, cpu/mem/temp/psi/power) and lays it all on a single timeline, so you can see what else was happening the moment things went bad.
the core is pure-stdlib python: a raspberry pi or jetson runs it for ~30 MB of ram and well under 1% of one core, and it graphs its own footprint to prove it. point many nodes at one hub and watch the whole fleet from a terminal or a browser.
new here? copy-paste install & use in QUICKSTART.md. full reference below.
the code? the package is documented file-by-file in smokemon/README.md; each probe (what it measures, what it refuses to do, the footprint rules) in smokemon/probes/README.md.
view:   smoke
        smoke live 24h
        smoke kiosk 24h
        smoke png    (or python -m smokemon.cli …)
run:    python -m smokemon.collect {fast|slow}
        (launchd/systemd do this; see deploy/)
macOS:  cp deploy/launchd/*.plist ~/Library/LaunchAgents/ && bootstrap each
Linux:  curl -fsSL https://raw.githubusercontent.com/oovets/smokemon/main/install.sh \
          | sudo bash -s -- --node NAME [--hub-url URL --secret S]
== analysis engine + dashboard + alerting ==
- smokemon/analyze.py (hub-side, read-only, stdlib): incident detection (isp-outage /
link-down / packet-loss / latency-spike / dns-slow), multi-signal blame (what deviated
during the window + new processes), time-of-day anomaly baseline, change-point detection,
mtr path intelligence, bandwidth attribution.
- text surfaces (run on a node too): `smoke status` (sparkline health line), `smoke
incidents` (incidents + blame), `smoke digest` (plain-english summary). `smoke replay`
scrubs any past window. `--bell` rings on degraded health; `--notify` pushes incidents.
- hub now serves a live fleet dashboard at GET /, a prometheus /metrics endpoint, and a
family of read-only /api/* json endpoints (nodes/latest/fleet/fleet-status/heatmap/risks/
cost/services/logs/ports/network/inventory/ingest-rate/spark, plus plot/png to render a
node's panels). push alerts via smokemon/notify.py (ntfy/slack/discord/webhook).
- node-side: a `self` panel graphs smokemon's own RSS/CPU; opt-in synthetic transactions
(captive-portal + DoH) via probes/synthetic.py; opt-in lightweight external HTTP
scrapes via probes/ext.py (bounded timeout/body/metric caps, no log streams);
opt-in Redis stream health via probes/redisq.py, using stdlib socket reads only. Jetson
GPU util/frequency is read from sysfs, not tegrastats.
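
to make "time-of-day anomaly baseline" concrete, here is a minimal stdlib sketch of one common way to do it: bucket history by hour of day, keep a median and MAD per bucket, and flag a sample that sits far outside its own hour's band. this is not taken from analyze.py; the function names, the `k` multiplier and the `floor` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median


def hourly_baseline(samples):
    """Group (unix_ts, value) samples by UTC hour of day -> {hour: (median, MAD)}."""
    by_hour = defaultdict(list)
    for ts, value in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        by_hour[hour].append(value)
    baseline = {}
    for hour, values in by_hour.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])  # median absolute deviation
        baseline[hour] = (med, mad)
    return baseline


def is_anomalous(baseline, ts, value, k=5.0, floor=1.0):
    """True when `value` is more than k*MAD away from the median for its hour.

    `floor` (hypothetical) stops a perfectly flat history from flagging noise.
    """
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
    if hour not in baseline:
        return False  # no history for this hour yet
    med, mad = baseline[hour]
    return abs(value - med) > k * max(mad, floor)


# a week of made-up RTT history: ~21 ms at 03:00 UTC, ~46 ms at 20:00 UTC
DAY = 86400
history = []
for d in range(7):
    history.append((d * DAY + 3 * 3600, 20 + d % 3))
    history.append((d * DAY + 20 * 3600, 45 + d % 3))
baseline = hourly_baseline(history)
```

with that history, a 45 ms reading is normal at 20:00 and anomalous at 03:00, which is the point of keeping one baseline per hour instead of a single global threshold.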
== v0.11 rich host metrics + grid layout ==
- new tables: thermal_zones (all sensors, not just max), power_samples (jetson INA3221
per-rail watts), tcp_samples (retransmits / RSTs / udp errors / conntrack fill),
disk_health (SD wear-level, hourly). host_samples adds PSI cpu/mem/io, swap/cache,
oom_kill_count, cpu_freq_mhz, cpu_throttle_count, pi_throttle_bits. wifi_samples adds
bssid + retry/discard/beacon counters; render shows roam count across BSSIDs.
- renderer: 5 new panels (thermal, power, tcp, psi, freq). 2-col grid by default
(PNG when >=3 panels, TUI when terminal >=140 cols). --cols N to force.
- perf: ping_rtt percentiles (p25/p75) pre-aggregated at insert -> load_ping_smoke skips
the ping_rtts scan for new rows. hub ingest uses executemany. load_net uses SQL LAG()
(sqlite >=3.25). SQLite stays on WAL + synchronous=NORMAL only; cache/mmap PRAGMAs were
tried and reverted to keep node RSS low (smokemon reports its own RSS, so they'd skew it).
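
a small sketch of the two SQLite points above: the WAL + synchronous=NORMAL setup, and turning cumulative byte counters into rates with LAG(). the table and column names (`net_samples`, `rx_bytes`) and the helper `rx_rates` are hypothetical, not smokemon's real schema or loader; window functions need sqlite >= 3.25.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The two PRAGMAs named above. WAL only takes effect on a file-backed DB;
# on :memory: the statement is accepted and the journal mode stays "memory".
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")

# hypothetical table: one cumulative rx byte counter per interface
conn.execute("CREATE TABLE net_samples (ts INTEGER, iface TEXT, rx_bytes INTEGER)")
conn.executemany(
    "INSERT INTO net_samples VALUES (?, ?, ?)",
    [(0, "eth0", 1000), (10, "eth0", 6000), (20, "eth0", 6500), (30, "eth0", 100)],
)

RATE_SQL = """
SELECT ts,
       CASE WHEN d_bytes >= 0 AND d_ts > 0 THEN d_bytes * 1.0 / d_ts END
FROM (
    SELECT ts, iface,
           rx_bytes - LAG(rx_bytes) OVER (PARTITION BY iface ORDER BY ts) AS d_bytes,
           ts - LAG(ts) OVER (PARTITION BY iface ORDER BY ts) AS d_ts
    FROM net_samples
    WHERE iface = ?
)
ORDER BY ts
"""


def rx_rates(conn, iface="eth0"):
    """Return [(ts, bytes_per_second or None)] computed inside SQLite."""
    return conn.execute(RATE_SQL, (iface,)).fetchall()


for ts, rate in rx_rates(conn):
    print(ts, rate)
```

the first row has no predecessor and the last row is a counter reset (the counter went backwards), so both come out as NULL rather than a bogus rate. doing the delta in SQL avoids pulling every raw counter row into python just to subtract neighbours.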
== v0.10 package refactor ==
- flat scripts -> smokemon/ package: config (env/NODE/paths), core (log/connect/
signals/run_scheduler), schema (single-source DDL -> node+hub + STD_TABLES + generic
insert), adapters/{darwin,linux}, probes/{ping,net,http,mtr,wifi,iperf,host}, collect
(one daemon, group fast|slow|all), ship, hub, query (shared loaders + --node),
render/{tui,png}, cli (`smoke` subcommands).
- 3 collector daemons -> 2 (fast=ping/net; slow=http/mtr/wifi/host). live.sh/daily_graph.sh
-> `smoke live|kiosk|daily`. dedup: schema, daemon loop, plot loaders, the duplicate
wifi_probe (all gone). net caches the TS iface (5 min). hub: ThreadingHTTPServer + write
lock. entrypoints: python -m smokemon.* (PYTHONPATH=repo, no install needed).
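
the "ThreadingHTTPServer + write lock" pattern can be sketched roughly as below: each request gets its own thread, and one lock serialises the SQLite writes. this is an illustration under assumptions, not hub.py itself: the table, the JSON payload shape and the names `IngestHandler` / `serve` are made up, and the real hub's auth (the shared secret) is left out.

```python
import json
import sqlite3
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DB_LOCK = threading.Lock()
# check_same_thread=False lets handler threads share one connection;
# the lock is what actually serialises the writes.
DB = sqlite3.connect(":memory:", check_same_thread=False)
DB.execute("CREATE TABLE ping_samples (node TEXT, ts INTEGER, rtt_ms REAL)")


class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        rows = json.loads(self.rfile.read(length) or b"[]")
        with DB_LOCK:  # one writer at a time, however many threads are serving
            DB.executemany(
                "INSERT INTO ping_samples VALUES (?, ?, ?)",
                [(r["node"], r["ts"], r["rtt_ms"]) for r in rows],
            )
            DB.commit()
        body = json.dumps({"inserted": len(rows)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def serve(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), IngestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

`serve()` returns the server object, so the chosen port is `server.server_address[1]` and `server.shutdown()` stops it.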
earlier versions (v0.1 - v0.9) → CHANGELOG.md
smoke [tui]             static TUI; 18 panel types: ping,net,http,mtr,wifi,iperf,host,
                        gpu,redis,docker,pipeline,disk,thermal,power,tcp,psi,freq,self|all
                        --cols N|0(auto). a panel only draws if the node has that data.
                        psi+freq are Linux-only; thermal/power/tcp also work on macOS
                        (cpu_speed_limit, battery rail, netstat -s parsing)
smoke live 24h | smoke kiosk 24h [--refresh N] [--bell]
                        live / clean wall display
smoke replay [DATE|Nh] [--frame MIN]
                        DVR scrubber (←/→ scrub, ↑/↓ step, q)
smoke fleet [live]      aggregated terminal view of every node reporting to the hub
                        (worst-first, colour-coded; the TUI twin of GET /). --ranked
                        for uptime/downtime over --hours; --heatmap [--metric loss|rtt]
                        for a node×hour sparkline grid; --hub-url URL reads the hub's
                        read-only /api over HTTP (no hub DB access needed); --bell.
smoke footprint         collector rows/day + SQLite and shipper gzip byte estimates
                        for a node DB (or --node on a hub DB). --ship-rtts includes
                        raw ping RTTs in the wire estimate.
smoke png [--width N --dpi N --cols N] | smoke daily
                        PNG -> Preview / dated 24h PNG
smoke status | smoke incidents | smoke digest [--notify]
                        text analysis (stdlib, node-ok)
common: --minutes N|--hours N|--since|--until --targets --panels --node (req. on hub DB)
analysis: smokemon/analyze.py (incident detection + multi-signal blame + anomaly/change-
point/path/attribution stats, hub-side read-only). hub also serves a live fleet
dashboard at GET / (grid/table/ranking/heatmap/risks/services/logs/cost tabs),
plus GET /metrics (prometheus) and read-only GET /api/* (nodes, latest, fleet,
fleet-status, heatmap, risks, cost, services, logs, ports, network, inventory,
ingest-rate, spark; plot/png render a node's panels). see INSTALL.md for the list.
alerting: set SMOKEMON_NOTIFY_URL (ntfy/slack/discord/webhook) + `smoke digest --notify`
or the smokemon-notify timer. synthetic checks: SMOKEMON_SYNTHETIC=1.
external checks: SMOKEMON_EXT_HTTP='app=http://127.0.0.1:8080/health'.
redis/docker/pipeline: auto-detected (default on, silent no-op when absent);
name redis streams with SMOKEMON_REDIS_STREAMS=a,b,c; SMOKEMON_{REDIS,DOCKER,PIPELINE}=0
disables a probe, =1 forces it on, so a down-row is written when the service is unreachable.
daemons: python -m smokemon.collect {fast|slow} | .probes.iperf | .probes.synthetic
| .ship | .hub | .notify (PYTHONPATH=repo)
multi-node: nodes run collect + iperf + ship (push delta -> hub); hub runs
python -m smokemon.hub (-> smokemon-hub.db). plot a single node with
--node NAME; see the whole fleet at once with `smoke fleet` (or GET /).
repoint a node with `smoke hub NEW-HUB` (writes SMOKEMON_HUB_URL).
deploy: macOS deploy/launchd/*.plist (collect-fast/slow, iperf, daily, shipper, hub);
Linux sudo ./install.sh --node NAME --hub-url URL --secret S
(hub: --hub --secret S). secret must match node<->hub.
deps: node: fping,mtr,iperf3,iw + python3 stdlib + plotext(TUI);
hub: +matplotlib/numpy(PNG) + iperf3 (runs iperf3 -s as the nodes' bandwidth target).
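
for the alerting line above, a hedged sketch of how one notify URL can cover several services: slack incoming webhooks take json with a `text` field, discord webhooks take json with a `content` field, and ntfy takes the raw body as the message with an optional `Title` header. picking the format from the URL's hostname is an assumption for illustration, not necessarily how smokemon/notify.py decides; `build_notification` and `send` are hypothetical names.

```python
import json
import urllib.request
from urllib.parse import urlparse


def build_notification(url, title, message):
    """Return a urllib Request shaped for the service behind `url`."""
    host = urlparse(url).hostname or ""
    headers = {"Content-Type": "application/json"}
    if host.endswith("slack.com"):
        data = json.dumps({"text": f"*{title}*\n{message}"})
    elif host.endswith(("discord.com", "discordapp.com")):
        data = json.dumps({"content": f"**{title}**\n{message}"})
    else:
        # assumption: anything else is treated as ntfy-style plain text
        headers = {"Content-Type": "text/plain; charset=utf-8", "Title": title}
        data = message
    return urllib.request.Request(
        url, data=data.encode("utf-8"), headers=headers, method="POST"
    )


def send(url, title, message, timeout=10):
    """POST the notification and return the HTTP status code."""
    req = build_notification(url, title, message)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

usage would look like `send(os.environ["SMOKEMON_NOTIFY_URL"], "packet-loss", "pi1: 12% loss")` after `import os`; building the request separately from sending it keeps the payload logic testable without network access.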
== what the metrics mean (the non-obvious ones) ==
rtt spread      the p25-p75 / p0-p100 band around median ping, not a single number - a
                wide band = jitter even when the average looks fine.
bufferbloat     grade A+..F from idle ping vs ping-under-load (iperf). F = the link buffers
                badly under load (calls/games stutter while something downloads).
psi             linux pressure-stall info (/proc/pressure): % of time tasks stalled on
                cpu/mem/io. rises *before* utilisation hits 100% - an early warning.
conntrack fill  how full the kernel's connection-tracking table is. near 100% = new
                connections get dropped (looks like packet loss, isn't the link).
death clocks    linear extrapolation to a limit: disk-full eta, sd/emmc wear-out eta, and
                headroom (degC) before the cpu thermally throttles.
roam count      how many times wifi jumped between bssids (access points) in the window;
                frequent roams correlate with throughput dips.
throttle bits   raspberry pi vcgencmd flags (under-voltage / freq-capped / throttled),
                past and currently-active - the usual cause of silent pi slowdowns.
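
three of the definitions above (psi, death clocks, throttle bits) sketched in stdlib python. these are illustrations, not smokemon's probe code: the function names are made up, the death clock here is an ordinary least-squares line (an assumption about what "linear extrapolation" means in practice), and the bit meanings follow the documented `vcgencmd get_throttled` layout.

```python
def parse_psi(text):
    """Parse a /proc/pressure/{cpu,memory,io} file into {kind: {field: float}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # "some" or "full", then key=value pairs
        out[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return out


def eta_seconds(samples, limit):
    """Fit a line through (ts, value) samples; seconds from the last sample
    until the line reaches `limit`, or None if the trend is flat or falling."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    return max(0.0, (limit - intercept) / slope - samples[-1][0])


THROTTLE_BITS = {
    0: "under-voltage now",
    1: "arm frequency capped now",
    2: "throttled now",
    3: "soft temperature limit now",
    16: "under-voltage has occurred",
    17: "arm frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Decode `vcgencmd get_throttled` output such as 'throttled=0x50005'."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in THROTTLE_BITS.items() if value & (1 << bit)]


psi = parse_psi("some avg10=1.50 avg60=0.40 avg300=0.10 total=123456\n"
                "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n")
print(psi["some"]["avg10"])                  # share of the last 10 s with stalled tasks
print(decode_throttled("throttled=0x50005"))
```

for the death clock, a disk at 50%, 51%, 52% used over three hourly samples extrapolates to about 48 hours until 100%: `eta_seconds([(0, 50), (3600, 51), (7200, 52)], 100) / 3600`.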