memory-harness

memory-harness is a small Linux-only Rust CLI that runs a child command and reports its kernel-accounted resource usage, especially peak RSS.

The entire purpose of this repo is to build a memory harness with low enough wrapper noise that perf can run concurrently and still produce a useful CPU profile for the wrapped program.

That goal drives the design:

  • keep the Rust binary small
  • use kernel accounting like getrusage for peak RSS
  • make optional sampling lightweight
  • keep the default profiling workflow synchronized in one run
  • minimize wrapper-side work that would drown out the target process

It is intentionally narrow:

  • one command per invocation
  • no internal run matrix
  • no statistical aggregation
  • no allocator-specific profiling

What it measures

The wrapped command's result includes:

  • wall clock time
  • peak RSS in KiB and MiB
  • user CPU time
  • system CPU time
  • minor and major page faults
  • voluntary and involuntary context switches
  • exit code or terminating signal
  • optional throughput metadata:
    • objects_processed (caller-supplied count)
    • objects_per_second (derived from elapsed time)
    • seconds_per_object (derived from elapsed time)

The harness also reports its own peak RSS separately so you can see whether the wrapper overhead is small relative to the child process.
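
The two derived throughput fields above are simple ratios over the measured wall-clock time. A minimal sketch of the arithmetic (a hypothetical helper, not the harness's code), assuming a caller-supplied count and the elapsed run duration:

use std::time::Duration;

// objects_per_second and seconds_per_object as reported alongside
// objects_processed; callers are expected to pass a non-zero count.
fn throughput(objects_processed: u64, elapsed: Duration) -> (f64, f64) {
    let secs = elapsed.as_secs_f64();
    let objects_per_second = objects_processed as f64 / secs;
    let seconds_per_object = secs / objects_processed as f64;
    (objects_per_second, seconds_per_object)
}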

Cgroup v2 aggregate memory (best effort)

Where cgroup v2 memory accounting is available and writable, the harness also collects an aggregate memory view from a short-lived run cgroup:

  • current memory (memory.current)
  • peak memory (memory.peak)
  • anon, file, and shmem fields from memory.stat

This gives coverage for descendant/worker processes and cache-backed memory pressure that is not visible from process-only /proc/<pid> accounting.

The report JSON includes this as child_cgroup_memory when available, including memory.current, memory.peak, anon, file, shmem, cache, rss, and rss_huge totals for the run cgroup. When cgroup v2 is not available or setup fails, the report still includes child_cgroup_memory_status with a short reason.
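
As a rough illustration of where those numbers come from, here is a minimal sketch (assumed paths and names, not the repo's actual collection code) that reads the single-value cgroup v2 memory files for a run cgroup, assuming cgroup v2 is mounted at /sys/fs/cgroup:

use std::fs;
use std::path::Path;

// Read one integer-valued cgroup v2 file such as memory.current or memory.peak.
fn read_cgroup_bytes(cgroup_dir: &Path, file: &str) -> Option<u64> {
    let raw = fs::read_to_string(cgroup_dir.join(file)).ok()?;
    raw.trim().parse().ok()
}

fn main() {
    // Hypothetical run-cgroup path; the real harness creates and removes its own.
    let dir = Path::new("/sys/fs/cgroup/memory-harness-demo");
    for file in ["memory.current", "memory.peak"] {
        match read_cgroup_bytes(dir, file) {
            Some(bytes) => println!("{file} = {bytes} bytes"),
            None => println!("{file} unavailable"),
        }
    }
}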

Optional proc-status sampling

The harness can also sample /proc/<pid>/status while the child is running. This is complementary to ru_maxrss, not a replacement for it.

Use it when you want a lightweight time series in addition to the final kernel high-water mark:

cargo run --release -- \
  --format json \
  --sample-proc-status-ms 10 \
  --sample-proc-status-out tmp/proc-status.tsv \
  -- my-program arg1 arg2

What this adds:

  • a TSV file with elapsed time and RSS-related fields from /proc/<pid>/status
  • summary fields in the final report for sampled VmRSS, VmHWM, anon/file/shmem peaks, and thread peak

Why it exists:

  • ru_maxrss is the authoritative lifetime peak
  • /proc/<pid>/status helps explain the shape of the run over time
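
The sampled fields themselves are plain "Key: value kB" lines in /proc/<pid>/status. A minimal parsing sketch (a hypothetical helper, not the harness's sampler):

use std::fs;

// Pull one kB-valued field (e.g. "VmRSS" or "VmHWM") out of /proc/<pid>/status.
// Lines look like "VmRSS:      1234 kB".
fn proc_status_kib(pid: u32, field: &str) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    status
        .lines()
        .find(|line| line.starts_with(field))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    let pid = std::process::id();
    println!("VmRSS = {:?} KiB", proc_status_kib(pid, "VmRSS"));
    println!("VmHWM = {:?} KiB", proc_status_kib(pid, "VmHWM"));
}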

Why the code uses getrusage

The point of the harness is peak-memory measurement, not point-in-time sampling.

That distinction matters:

  • /proc/<pid>/status tells you what RSS looks like when you read the file.
  • getrusage gives you the kernel-maintained high-water mark in ru_maxrss.

If the target spikes memory for a short interval and then frees it before you sample /proc, a polling design can miss the peak. ru_maxrss exists so the kernel can report the maximum resident set size observed over the process lifetime.

This repo uses:

  • wait4(2) to collect the child's struct rusage on exit
  • getrusage(RUSAGE_SELF) to show the harness process overhead

The child result is the primary measurement. The harness self result is there to make the wrapper cost explicit.
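
A minimal sketch of that pattern using the libc crate (simplified; the real harness does more around spawning, timeouts, and error handling):

use std::mem::MaybeUninit;
use std::process::Command;

fn main() {
    // Spawn the child, then reap it with wait4(2) so the kernel hands back
    // its struct rusage, including ru_maxrss (reported in KiB on Linux).
    let child = Command::new("/usr/bin/env").arg("true").spawn().unwrap();
    let pid = child.id() as libc::pid_t;

    let mut status = 0;
    let mut child_ru = MaybeUninit::<libc::rusage>::uninit();
    let rc = unsafe { libc::wait4(pid, &mut status, 0, child_ru.as_mut_ptr()) };
    assert_eq!(rc, pid);
    let child_ru = unsafe { child_ru.assume_init() };
    println!("child ru_maxrss = {} KiB", child_ru.ru_maxrss);

    // Exit code or terminating signal, as the result reports it.
    if libc::WIFEXITED(status) {
        println!("exit code = {}", libc::WEXITSTATUS(status));
    } else if libc::WIFSIGNALED(status) {
        println!("terminated by signal {}", libc::WTERMSIG(status));
    }

    // Separately, the wrapper's own high-water mark.
    let mut self_ru = MaybeUninit::<libc::rusage>::uninit();
    unsafe { libc::getrusage(libc::RUSAGE_SELF, self_ru.as_mut_ptr()) };
    let self_ru = unsafe { self_ru.assume_init() };
    println!("harness ru_maxrss = {} KiB", self_ru.ru_maxrss);
}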

Build setup

The repo carries scaffold-derived local build config in .cargo/config.toml. That file assumes these tools are available:

  • clang
  • mold
  • sccache

If your machine does not have them, adjust .cargo/config.toml.

Cargo.toml settings that matter for perf

The release profile intentionally keeps full DWARF debug sections in the ELF:

[profile.release]
opt-level = 1
lto = false
codegen-units = 16
incremental = true
debug = 2
split-debuginfo = "off"
strip = "none"

What these settings do:

  • debug = 2 enables full debug info.
  • split-debuginfo = "off" keeps the debug sections in the binary instead of moving them into side files.
  • strip = "none" keeps symbols and debug data intact.

Those three fields are the key ones for readable perf report output.

The repo also defines a ship profile:

[profile.ship]
inherits = "release"
opt-level = 3
lto = true
codegen-units = 1
incremental = false

Use ship when you want a more fully optimized build while still inheriting the debug-section settings from release.

Build

Standard release build:

cargo build --release

More optimized build:

cargo build --profile ship

Use the harness

Run a program with inherited stdout and stderr:

cargo run --release -- -- /usr/bin/env true

Wrap a command in a different working directory:

cargo run --release -- --cwd /tmp -- git status

Emit JSON instead of human-readable output:

cargo run --release -- --format json -- /usr/bin/env true

Pass a processed-object count to get throughput fields:

cargo run --release -- --format json --objects-processed 1950 -- /usr/bin/env true

Pass environment variables through the harness:

cargo run --release -- --env RUST_LOG=debug --env FOO=bar -- my-program --flag

Try the included demo target:

cargo run --example spiky-alloc -- 8 512 20

Run a fixed-length sample where the harness owns the timeout boundary:

cargo run --release -- \
  --format json \
  --child-timeout-seconds 3 \
  -- my-program arg1 arg2

How to use it with perf

Use the harness and perf together when you want one timestamped artifact bundle from a single wrapped run.

The preferred repo workflow is the wrapper script, which starts memory-harness, waits for it to announce the child PID, attaches perf directly to that child process, and writes a timestamped log bundle under ~/.memory-harness/logs/ by default. The harness stops the child immediately after spawn, lets the wrapper attach perf, and then continues it.
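
The stop/attach/continue handshake is ordinary signal plumbing. A rough sketch of the idea using the libc crate (illustrative only; the real harness and wrapper coordinate this more carefully):

use std::process::Command;

fn main() {
    // Spawn the target, pause it right away, and announce its PID so an
    // external profiler such as perf can attach before real work starts.
    // (A sketch: in practice the pause has to win the race against the
    // child's first instructions, which the real harness must handle.)
    let mut child = Command::new("/usr/bin/env").arg("true").spawn().unwrap();
    let pid = child.id() as libc::pid_t;
    unsafe { libc::kill(pid, libc::SIGSTOP) };
    println!("child pid: {pid}");

    // ... wrapper attaches `perf record -p <pid>` here ...

    // Resume the child and wait for it to finish.
    unsafe { libc::kill(pid, libc::SIGCONT) };
    let status = child.wait().unwrap();
    println!("exit: {:?}", status.code());
}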

When the local perf supports --control, the wrapper now uses control FIFOs instead of blind settle sleeps and signal-based shutdown:

  • perf record starts disabled with --delay=-1
  • the wrapper waits for perf to become ready
  • the wrapper sends enable and waits for ack
  • only then does it release the child
  • when the run ends, the wrapper sends stop and waits for ack before collecting perf report

For direct harness-only memory and resource accounting:

./target/release/memory-harness -- my-program arg1 arg2

For raw target-only CPU hotspot profiling:

perf record --call-graph dwarf ./target/release/my-program arg1 arg2
perf report

The design target is that the harness stays quiet enough that this concurrent workflow remains useful. The wrapper now improves that by attaching perf directly to the paused child PID instead of profiling the harness parent path. Direct perf on the target remains the absolute lowest-noise option, but this repo exists to push the concurrent path as far toward low-noise usefulness as practical.

For direct harness-only runs, --child-timeout-seconds N is still the right way to bound the child.

For the concurrent wrapper workflow, --timeout-seconds N is owned by the wrapper instead. That is deliberate: the wrapper needs a chance to ask perf to stop before it terminates the child. The harness receives a timeout marker file so the JSON result still reports timed_out=true for the same run.

If you want the synchronized one-run bundle, use the wrapper. If you want the cleanest target-only CPU profile, run perf on the target command directly.

Wrapper for both passes

The repo includes a convenience wrapper:

scripts/run-with-memory-and-perf.sh -- my-program arg1 arg2

By default it now runs perf at the same time as memory-harness, but it does so by attaching perf to the announced child PID instead of wrapping the harness process itself.

It can also:

  • repeat the whole workflow multiple times with --repeat N
  • capture perf stat counters with --perf-stat
  • bound every measured pass with wrapper-owned --timeout-seconds N
  • store proc-status samples from the memory-harness pass
  • write timestamped output under ~/.memory-harness/logs/

The wrapper writes side-by-side artifacts into a timestamped directory:

  • combined.log
  • memory-harness.json
  • proc-status.tsv
  • perf.data
  • perf.report.txt
  • perf.stat.txt when --perf-stat is enabled
  • summary.txt
  • command.txt

Optional output directory override:

scripts/run-with-memory-and-perf.sh --out-dir tmp/run-1 -- my-program arg1 arg2

Optional environment override for the harness binary:

MEMORY_HARNESS_BIN=target/ship/memory-harness \
scripts/run-with-memory-and-perf.sh -- my-program arg1 arg2

Fixed-length concurrent sample:

scripts/run-with-memory-and-perf.sh \
  --timeout-seconds 15 \
  -- my-program arg1 arg2

Optional perf call-graph mode override:

MH_PERF_CALL_GRAPH=dwarf \
scripts/run-with-memory-and-perf.sh -- my-program arg1 arg2

Note on call-graph modes: the wrapper's default for MH_PERF_CALL_GRAPH is dwarf (set by the script). dwarf usually gives better stack quality, but fp can be more robust when the process terminates abruptly (especially under a timeout), since it more often leaves perf.data complete and readable.

Repeat the concurrent workflow twice and collect perf stat too:

scripts/run-with-memory-and-perf.sh \
  --repeat 2 \
  --perf-stat \
  --out-dir ~/.memory-harness/logs/demo \
  -- cargo run --example spiky-alloc -- 8 512 20

Verify the binary carries debug sections

After a release build, inspect the ELF sections:

readelf -S target/release/memory-harness | rg '\.debug_|\.symtab|\.eh_frame'

You should see sections such as:

  • .debug_info
  • .debug_line
  • .debug_abbrev
  • .debug_str

Current boundaries

  • Linux only
  • best-effort cgroup v2 aggregate memory (child + descendants)
  • no child-tree sampling daemon
  • no flamegraph generation
  • no direct Valgrind or Heaptrack integration
  • wrapper repeat support is intentionally shell-level, not built into the Rust binary

That is deliberate. The current design optimizes for clarity, accurate peak RSS reporting, low implementation overhead, and clean separation between resource measurement and CPU profiling orchestration.
