memory-harness is a small Linux-only Rust CLI that runs a child command and
reports its kernel-accounted resource usage, especially peak RSS.
The entire purpose of this repo is to build a memory harness with low enough
wrapper noise that perf can run concurrently and still produce a useful CPU
profile for the wrapped program.
That goal drives the design:
- keep the Rust binary small
- use kernel accounting like `getrusage` for peak RSS
- make optional sampling lightweight
- keep the default profiling workflow synchronized in one run
- minimize wrapper-side work that would drown out the target process
It is intentionally narrow:
- one command per invocation
- no internal run matrix
- no statistical aggregation
- no allocator-specific profiling
The wrapped command's result includes:
- wall clock time
- peak RSS in KiB and MiB
- user CPU time
- system CPU time
- minor and major page faults
- voluntary and involuntary context switches
- exit code or terminating signal
- optional throughput metadata (derived as sketched after this list):
  - `objects_processed` (caller-supplied count)
  - `objects_per_second` (derived from elapsed time)
  - `seconds_per_object` (derived from elapsed time)
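
The derived fields are simple ratios over the measured wall-clock time. A minimal sketch of the arithmetic (the function name and signature are illustrative, not the harness's internal API):

```rust
// Illustrative arithmetic for the derived throughput fields; the real
// harness presumably guards against a zero count or zero elapsed time.
fn throughput(objects_processed: u64, wall_seconds: f64) -> (f64, f64) {
    let objects_per_second = objects_processed as f64 / wall_seconds;
    let seconds_per_object = wall_seconds / objects_processed as f64;
    (objects_per_second, seconds_per_object)
}
```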
The harness also reports its own peak RSS separately so you can see whether the wrapper overhead is small relative to the child process.
Where cgroup v2 memory accounting is available and writable, the harness also collects an aggregate memory view from a short-lived run cgroup:
- current memory (`memory.current`)
- peak memory (`memory.peak`)
- `anon`, `file`, and `shmem` fields from `memory.stat`
This gives coverage for descendant/worker processes and cache-backed memory pressure
that is not visible from process-only `/proc/<pid>` accounting.
The report JSON includes this as `child_cgroup_memory` when available, including
`memory.current`, `memory.peak`, `anon`, `file`, `shmem`, `cache`, `rss`, and
`rss_huge` totals for the run cgroup.
When cgroup v2 is not available or setup fails, the report still includes
`child_cgroup_memory_status` with a short reason.
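
As a rough sketch of that read path (the file names are the standard cgroup v2 interface; the cgroup path, KiB conversion, and error handling are illustrative assumptions, not the harness's actual code):

```rust
use std::fs;

// Read one single-value cgroup v2 file (a byte count) and convert to KiB.
fn read_kib(cgroup: &str, file: &str) -> Option<u64> {
    let raw = fs::read_to_string(format!("{cgroup}/{file}")).ok()?;
    raw.trim().parse::<u64>().ok().map(|bytes| bytes / 1024)
}

fn main() {
    let cg = "/sys/fs/cgroup/mh-run"; // hypothetical short-lived run cgroup
    println!("memory.current: {:?} KiB", read_kib(cg, "memory.current"));
    // memory.peak is absent on older kernels, hence the best-effort status.
    println!("memory.peak:    {:?} KiB", read_kib(cg, "memory.peak"));

    // memory.stat is one "key value" pair per line; pull out the fields
    // the report cares about.
    if let Ok(stat) = fs::read_to_string(format!("{cg}/memory.stat")) {
        for line in stat.lines() {
            if let Some((key, bytes)) = line.split_once(' ') {
                if matches!(key, "anon" | "file" | "shmem") {
                    println!("{key}: {bytes} bytes");
                }
            }
        }
    }
}
```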
The harness can also sample `/proc/<pid>/status` while the child is running.
This is complementary to `ru_maxrss`, not a replacement for it.
Use it when you want a lightweight time series in addition to the final kernel high-water mark:

```sh
cargo run --release -- \
  --format json \
  --sample-proc-status-ms 10 \
  --sample-proc-status-out tmp/proc-status.tsv \
  -- my-program arg1 arg2
```

What this adds:
- a TSV file with elapsed time and RSS-related fields from `/proc/<pid>/status`
- summary fields in the final report for sampled `VmRSS`, `VmHWM`, anon/file/shmem peaks, and thread peak
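
For reference, one sample amounts to a single read and parse of `/proc/<pid>/status`. A self-contained sketch of that parse (the real harness's sampling loop and TSV columns may differ):

```rust
use std::fs;

// Pull VmRSS and VmHWM (both reported by the kernel in kB) from one read
// of /proc/<pid>/status.
fn sample(pid: u32) -> Option<(u64, u64)> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    let field = |name: &str| -> Option<u64> {
        status
            .lines()
            .find(|l| l.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse()
            .ok()
    };
    Some((field("VmRSS:")?, field("VmHWM:")?))
}

fn main() {
    // Sample our own status as a stand-in for the wrapped child's PID.
    if let Some((rss, hwm)) = sample(std::process::id()) {
        println!("VmRSS\t{rss} kB\tVmHWM\t{hwm} kB");
    }
}
```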
Why it exists:
- `ru_maxrss` is the authoritative lifetime peak
- `/proc/<pid>/status` helps explain the shape of the run over time
The point of the harness is peak-memory measurement, not point-in-time sampling.
That distinction matters:
- `/proc/<pid>/status` tells you what RSS looks like when you read the file.
- `getrusage` gives you the kernel-maintained high-water mark in `ru_maxrss`.
If the target spikes memory for a short interval and then frees it before you
sample `/proc`, a polling design can miss the peak. `ru_maxrss` exists so the
kernel can report the maximum resident set size observed over the process
lifetime.
This repo uses:
- `wait4(2)` to collect the child process `struct rusage` on exit
- `getrusage(RUSAGE_SELF)` to show the harness process overhead
The child result is the primary measurement. The harness self result is there to make the wrapper cost explicit.
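
A minimal sketch of that pattern, assuming the `libc` crate as a dependency (the binary's actual structure may differ):

```rust
use std::process::Command;

fn main() {
    let child = Command::new("/usr/bin/env").arg("true").spawn().expect("spawn");
    let pid = child.id() as libc::pid_t;

    // wait4(2) reaps the child and fills in its kernel-accounted rusage.
    let mut status: libc::c_int = 0;
    let mut child_usage: libc::rusage = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::wait4(pid, &mut status, 0, &mut child_usage) };
    assert!(rc == pid, "wait4 failed");

    // getrusage(RUSAGE_SELF) makes the wrapper's own cost explicit.
    let mut self_usage: libc::rusage = unsafe { std::mem::zeroed() };
    unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut self_usage) };

    // On Linux, ru_maxrss is reported in KiB.
    println!("child peak RSS:   {} KiB", child_usage.ru_maxrss);
    println!("harness peak RSS: {} KiB", self_usage.ru_maxrss);
}
```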
The repo carries scaffold-derived local build config in `.cargo/config.toml`.
That file assumes these tools are available:
- `clang`
- `mold`
- `sccache`
If your machine does not have them, adjust `.cargo/config.toml`.
The release profile intentionally keeps full DWARF debug sections in the ELF:

```toml
[profile.release]
opt-level = 1
lto = false
codegen-units = 16
incremental = true
debug = 2
split-debuginfo = "off"
strip = "none"
```

What these settings do:
- `debug = 2` enables full debug info.
- `split-debuginfo = "off"` keeps the debug sections in the binary instead of moving them into side files.
- `strip = "none"` keeps symbols and debug data intact.
Those three settings are the key ones for readable `perf report` output.
The repo also defines a ship profile:

```toml
[profile.ship]
inherits = "release"
opt-level = 3
lto = true
codegen-units = 1
incremental = false
```

Use `ship` when you want a more fully optimized build while still inheriting the
debug-section settings from `release`.
Standard release build:

```sh
cargo build --release
```

More optimized build:

```sh
cargo build --profile ship
```

Run a program with inherited stdout and stderr:

```sh
cargo run --release -- -- /usr/bin/env true
```

Wrap a command in a different working directory:

```sh
cargo run --release -- --cwd /tmp -- git status
```

Emit JSON instead of human-readable output:

```sh
cargo run --release -- --format json -- /usr/bin/env true
```

Pass a processed-object count to get throughput fields:

```sh
cargo run --release -- --format json --objects-processed 1950 -- /usr/bin/env true
```

Pass environment variables through the harness:

```sh
cargo run --release -- --env RUST_LOG=debug --env FOO=bar -- my-program --flag
```

Try the included demo target:

```sh
cargo run --example spiky-alloc -- 8 512 20
```

Run a fixed-length sample where the harness owns the timeout boundary:

```sh
cargo run --release -- \
  --format json \
  --child-timeout-seconds 3 \
  -- my-program arg1 arg2
```

Use the harness and perf together when you want one timestamped artifact bundle
from a single wrapped run.
The preferred repo workflow is the wrapper script, which starts
`memory-harness`, waits for it to announce the child PID, attaches perf
directly to that child process, and writes a timestamped log bundle under
`~/.memory-harness/logs/` by default. The harness stops the child immediately
after spawn, lets the wrapper attach perf, and then continues it.
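
One way to implement that pause is to raise `SIGSTOP` in the child just before `exec`, so there is no race with the target's startup. A sketch assuming the `libc` crate; whether the harness uses exactly this mechanism is not specified here:

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

fn main() -> std::io::Result<()> {
    let mut cmd = Command::new("my-program"); // hypothetical target
    unsafe {
        // Stop the child just before exec, so the process is paused the
        // moment it exists and a profiler can attach first.
        cmd.pre_exec(|| {
            libc::raise(libc::SIGSTOP);
            Ok(())
        });
    }
    let child = cmd.spawn()?;
    println!("child pid: {}", child.id()); // announce the PID to the wrapper

    // ... wrapper attaches `perf record -p <pid>` here ...

    // Resume the child once the profiler is ready.
    unsafe { libc::kill(child.id() as libc::pid_t, libc::SIGCONT) };
    Ok(())
}
```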
When the local perf supports `--control`, the wrapper now uses control FIFOs
instead of blind settle sleeps and signal-based shutdown (sketched after this list):
- `perf record` starts disabled with `--delay=-1`
- the wrapper waits for `perf` to become ready
- the wrapper sends `enable` and waits for `ack`
- only then does it release the child
- when the run ends, the wrapper sends `stop` and waits for `ack` before collecting `perf report`
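
Sketched in Rust for concreteness (the actual wrapper is a shell script; the FIFO paths and PID below are placeholders, and this assumes a perf new enough to support `--control=fifo:...`):

```rust
use std::ffi::CString;
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};
use std::process::Command;

// Create a named pipe; assumes the `libc` crate.
fn mkfifo(path: &str) {
    let c = CString::new(path).unwrap();
    unsafe { libc::mkfifo(c.as_ptr(), 0o600) };
}

fn main() -> std::io::Result<()> {
    mkfifo("ctl.fifo");
    mkfifo("ack.fifo");

    // perf starts with events disabled (--delay=-1) and serves the FIFOs.
    let mut perf = Command::new("perf")
        .args(["record", "--delay=-1", "--control", "fifo:ctl.fifo,ack.fifo",
               "-p", "12345"]) // placeholder: the paused child PID
        .spawn()?;

    // Opening the write end blocks until perf opens the read end, which
    // doubles as the "wait for perf to become ready" step.
    let mut ctl = OpenOptions::new().write(true).open("ctl.fifo")?;
    let mut ack = BufReader::new(File::open("ack.fifo")?);
    let mut line = String::new();

    ctl.write_all(b"enable\n")?; // start counting
    ack.read_line(&mut line)?;   // wait for "ack\n", then release the child

    // ... measured run happens here ...

    ctl.write_all(b"stop\n")?;   // end the session cleanly
    line.clear();
    ack.read_line(&mut line)?;
    perf.wait()?;
    Ok(())
}
```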
For direct harness-only memory and resource accounting:

```sh
./target/release/memory-harness -- my-program arg1 arg2
```

For raw target-only CPU hotspot profiling:

```sh
perf record --call-graph dwarf ./target/release/my-program arg1 arg2
perf report
```

The design target is that the harness stays quiet enough that this concurrent
workflow remains useful. The wrapper now improves that by attaching perf
directly to the paused child PID instead of profiling the harness parent path.
Direct perf on the target remains the absolute lowest-noise option, but this
repo exists to push the concurrent path as far toward low-noise usefulness as
practical.
For direct harness-only runs, `--child-timeout-seconds N` is still the right
way to bound the child.
For the concurrent wrapper workflow, `--timeout-seconds N` is owned by the
wrapper instead. That is deliberate: the wrapper needs a chance to ask perf
to stop before it terminates the child. The harness receives a timeout marker
file so the JSON result still reports `timed_out=true` for the same run.
If you want the synchronized one-run bundle, use the wrapper. If you want the
cleanest target-only CPU profile, run perf on the target command directly.
The repo includes a convenience wrapper:

```sh
scripts/run-with-memory-and-perf.sh -- my-program arg1 arg2
```

By default it now runs perf at the same time as memory-harness, but it does
so by attaching perf to the announced child PID instead of wrapping the
harness process itself.
It can also:
- repeat the whole workflow multiple times with `--repeat N`
- capture `perf stat` counters with `--perf-stat`
- bound every measured pass with wrapper-owned `--timeout-seconds N`
- store proc-status samples from the memory-harness pass
- write timestamped output under `~/.memory-harness/logs/`
The wrapper writes side-by-side artifacts into a timestamped directory:
- `combined.log`
- `memory-harness.json`
- `proc-status.tsv`
- `perf.data`
- `perf.report.txt`
- `perf.stat.txt` when `--perf-stat` is enabled
- `summary.txt`
- `command.txt`
Optional output directory override:

```sh
scripts/run-with-memory-and-perf.sh --out-dir tmp/run-1 -- my-program arg1 arg2
```

Optional environment override for the harness binary:

```sh
MEMORY_HARNESS_BIN=target/ship/memory-harness \
  scripts/run-with-memory-and-perf.sh -- my-program arg1 arg2
```

Fixed-length concurrent sample:

```sh
scripts/run-with-memory-and-perf.sh \
  --timeout-seconds 15 \
  -- my-program arg1 arg2
```

Optional perf call-graph mode override:

```sh
MH_PERF_CALL_GRAPH=dwarf \
  scripts/run-with-memory-and-perf.sh -- my-program arg1 arg2
```

Note on call-graph modes: The wrapper default for `MH_PERF_CALL_GRAPH` is
`dwarf` (set by the script). `dwarf` usually gives better stack quality, but
`fp` can be more robust when processes terminate abruptly (especially under
timeout), leaving `perf.data` complete and readable more often.
Repeat the concurrent workflow twice and collect perf stat too:

```sh
scripts/run-with-memory-and-perf.sh \
  --repeat 2 \
  --perf-stat \
  --out-dir ~/.memory-harness/logs/demo \
  -- cargo run --example spiky-alloc -- 8 512 20
```

After a release build, inspect the ELF sections:

```sh
readelf -S target/release/memory-harness | rg '\.debug_|\.symtab|\.eh_frame'
```

You should see sections such as:
- `.debug_info`
- `.debug_line`
- `.debug_abbrev`
- `.debug_str`
Known limitations:
- Linux only
- best-effort cgroup v2 aggregate memory (child + descendants)
- no child-tree sampling daemon
- no flamegraph generation
- no direct Valgrind or Heaptrack integration
- wrapper repeat support is intentionally shell-level, not built into the Rust binary
That is deliberate. The current design optimizes for clarity, accurate peak RSS reporting, low implementation overhead, and clean separation between resource measurement and CPU profiling orchestration.