What we see
Both the KPI tile and the CPU-over-time chart on the server-detail page tend to report values clustered near 0% or near 100%, with very little in between, even on idle hosts (e.g. my LXC). The chart looks spiky, the tile flips between extremes, and the two often disagree because they sample at slightly different times (KPI from ConnectionCheck, chart from StatsCollect).
Why
Both metrics come from `Mast.Hosts.Metrics.parse_cpu/1` (lib/mast/hosts/metrics.ex:18), which reads the idle field out of the first `%Cpu(s):` line of `top -bn1`. That line is an instant snapshot — over the ~1ms window `top` measures, a core is essentially always fully idle or fully busy. Averaged usage isn't actually being measured.
Options
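- `top -bn2 -d 1`, take the second iteration. Top's second pass is a delta over the elapsed second, so we get a real 1s-averaged number. Small SSH command change (`-bn1` becomes `-bn2 -d 1`) plus picking the second `%Cpu(s):` line in the regex. ConnectionCheck currently pipes through `head -3`, which would cut the output off before the second iteration, so that filter has to change as well. Each probe also takes about 1s longer. Biggest realism win for the smallest diff.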
- Read `/proc/stat` twice ~250ms apart and compute the delta ourselves. Most accurate, no top-output fragility, but new code.
- Surface `load_1` as the KPI instead of CPU%. Already kernel-smoothed. Misleading on multi-core unless normalized by core count (load is a task count, not a percentage, so load=2 on 2 cores ≠ 100% CPU), but honest about what it is.
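For reference, the `/proc/stat` option is mostly arithmetic on two samples of the aggregate `cpu` line. A rough sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal) and counting iowait as idle. `ProcStatSketch` and its function names are hypothetical, not existing code in Mast:

```elixir
defmodule ProcStatSketch do
  # Hypothetical helper, not part of Mast.

  # Parses the aggregate "cpu " line of /proc/stat into {idle_ticks, total_ticks}.
  # Only the first 8 fields are summed: guest time is already included in user.
  def parse(stat) when is_binary(stat) do
    ["cpu" | fields] =
      stat
      |> String.split("\n")
      |> Enum.find(&String.starts_with?(&1, "cpu "))
      |> String.split()

    ticks = fields |> Enum.take(8) |> Enum.map(&String.to_integer/1)
    [_user, _nice, _system, idle, iowait | _rest] = ticks
    {idle + iowait, Enum.sum(ticks)}
  end

  # Busy percentage over the interval between two /proc/stat reads.
  def busy_percent(before, later) do
    {idle_a, total_a} = parse(before)
    {idle_b, total_b} = parse(later)

    case total_b - total_a do
      # No ticks elapsed between samples; report idle rather than divide by zero.
      0 -> 0.0
      total -> Float.round(100.0 * (1 - (idle_b - idle_a) / total), 1)
    end
  end
end
```

Both samples would come from the remote host about 250ms apart. How they get fetched over SSH (one command with a sleep in between, or two round trips) is left open here.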
Recommend the first option (`top -bn2 -d 1`) when we pick this up.
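A minimal sketch of what picking the second `%Cpu(s):` line could look like. This is not the current `parse_cpu/1`: the module name, the return shape (`{:ok, busy}` or `:error`) and the regex are assumptions for illustration, and it expects procps-style output with `.` as the decimal separator:

```elixir
defmodule CpuSketch do
  # Hypothetical stand-in for Mast.Hosts.Metrics; the real return shape may differ.

  # Returns {:ok, busy_percent} from the SECOND "%Cpu(s):" line of
  # `top -bn2 -d 1` output, or :error if fewer than two such lines exist.
  def parse_cpu(output) when is_binary(output) do
    # Capture the number directly before "id" on each %Cpu(s): line.
    # The optional whitespace handles the "ni,100.0 id" layout top uses at 100% idle.
    idle_regex = ~r/^%Cpu\(s\):.*?(\d+(?:\.\d+)?)\s*id/m

    case Regex.scan(idle_regex, output, capture: :all_but_first) do
      # Skip the first iteration (instant snapshot), use the second (1s delta).
      [_first, [idle] | _rest] ->
        {idle_pct, _} = Float.parse(idle)
        {:ok, Float.round(100.0 - idle_pct, 1)}

      _ ->
        :error
    end
  end
end
```

Returning `:error` when only one `%Cpu(s):` line is present avoids silently falling back to the snapshot value, for example if the output is still truncated by `head -3`.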
Out of scope here
Not changing the chart renderer or the schema. `server_stats.stats["cpu"]` keeps the same shape; only the value gets more meaningful.
Where
- `lib/mast/hosts/metrics.ex` — `parse_cpu/1`
- `lib/mast/workers/connection_check.ex` — `top -bn1 | head -3`
- `lib/mast/workers/stats_collect.ex` — the same `top` call lives in here too; both probes need the new command