Add Sampler and SampledList for heap profiling infrastructure by jayakasadev · Pull Request #852 · microsoft/snmalloc

jayakasadev · 2026-05-28T13:12:51Z

Summary

Two header-only components behind #ifdef SNMALLOC_PROFILE, with zero impact on non-profiling builds and no changes to existing allocation paths.

src/snmalloc/mem/sampler.h — per-thread Poisson sampler.

Models the allocation stream as a byte sequence with each byte independently marked with probability 1/interval. An allocation is sampled iff it spans at least one marked byte:

P(sample) = 1 - e^(-size/interval)
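The formula follows directly from the per-byte marking model. A short derivation, writing $s$ for the allocation size and $T$ for the interval (both symbols are introduced here for brevity):

```latex
\begin{aligned}
P(\text{no marked byte among } s \text{ bytes}) &= \left(1 - \tfrac{1}{T}\right)^{s} \approx e^{-s/T} \qquad (T \gg 1), \\
P(\text{sample}) &= 1 - P(\text{no marked byte among } s \text{ bytes}) \approx 1 - e^{-s/T}.
\end{aligned}
```

Assuming the default interval means $T = 524288$ bytes, a 64-byte allocation is sampled with probability of about $1.22 \times 10^{-4}$, an allocation of exactly $T$ bytes with probability $1 - e^{-1} \approx 0.632$, and an allocation of $4T$ bytes with probability $1 - e^{-4} \approx 0.982$.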

Fast path: one subtraction and branch per allocation. Slow path fires ~once per g_sample_interval bytes (default 512KB) and draws the next interval from a geometric distribution via xorshift64 + inverse-CDF. Same statistical model as tcmalloc's Sampler.
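A minimal sketch of how such a sampler could look, under stated assumptions: only `g_sample_interval` is a name taken from the PR, while `SamplerSketch`, its members, the seed handling, and the use of a signed countdown are hypothetical choices made for illustration. Rounding an exponential draw up to a whole byte count yields a geometrically distributed gap.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Mean number of bytes between samples; 0 disables sampling.
static std::size_t g_sample_interval = 512 * 1024;

class SamplerSketch {
  std::uint64_t state_;      // xorshift64 state, must stay non-zero
  std::int64_t remaining_;   // bytes left until the next marked byte

  std::uint64_t next_random() {
    state_ ^= state_ << 13;
    state_ ^= state_ >> 7;
    state_ ^= state_ << 17;
    return state_;
  }

  // Inverse-CDF draw: ceil(-ln(u) * interval) with u uniform in (0, 1].
  std::int64_t draw_interval() {
    assert(g_sample_interval != 0);
    double u = std::ldexp(static_cast<double>(next_random() >> 11) + 1.0, -53);
    double gap = std::ceil(-std::log(u) * static_cast<double>(g_sample_interval));
    return gap < 1.0 ? 1 : static_cast<std::int64_t>(gap);
  }

  bool slow_path() {
    if (g_sample_interval == 0) {
      // Disabled: push the next slow-path visit far into the future.
      remaining_ = std::numeric_limits<std::int64_t>::max();
      return false;
    }
    remaining_ = draw_interval();
    return true;
  }

public:
  explicit SamplerSketch(std::uint64_t seed)
    : state_(seed != 0 ? seed : 0x9E3779B97F4A7C15ull), remaining_(0) {
    remaining_ = g_sample_interval == 0
      ? std::numeric_limits<std::int64_t>::max()
      : draw_interval();
  }

  // Fast path: one subtraction and one branch.
  bool should_sample(std::size_t size) {
    remaining_ -= static_cast<std::int64_t>(size);
    if (remaining_ > 0)
      return false;
    return slow_path();
  }
};
```

In this sketch an allocation is sampled exactly when its size reaches the drawn gap, which gives the sampling probability stated above. A change to `g_sample_interval` only takes effect at the next slow-path visit, which a real implementation may want to handle differently.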

src/snmalloc/mem/sampled_list.h — SampledAlloc struct and global SampledList.

SampledAlloc holds raw program-counter addresses (symbolication is deferred to profile-dump time), allocation size, and sample weight. SampledList is a doubly-linked list with lock-free push() (CAS on head) and mutex-guarded remove()/iterate().
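The push/remove split can be sketched as follows. This is a simplified illustration rather than the PR's code: it uses a singly-linked list (so `remove()` is a linear scan instead of the constant-time unlink a doubly-linked list allows), and the type names, field names, and frame limit are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical record layout: raw PCs, size, and weight.
struct SampledAllocSketch {
  static constexpr std::size_t kMaxFrames = 32;  // assumed limit
  void* pcs[kMaxFrames] = {};  // raw program counters, not yet symbolicated
  std::size_t depth = 0;
  std::size_t size = 0;
  double weight = 0.0;
  SampledAllocSketch* next = nullptr;
};

class SampledListSketch {
  std::atomic<SampledAllocSketch*> head_{nullptr};
  std::mutex mu_;  // serialises remove() and iterate()

public:
  // Lock-free push: retry the CAS on head until it succeeds.
  void push(SampledAllocSketch* n) {
    assert(n != nullptr);
    SampledAllocSketch* old_head = head_.load(std::memory_order_relaxed);
    do {
      n->next = old_head;
    } while (!head_.compare_exchange_weak(
      old_head, n, std::memory_order_release, std::memory_order_relaxed));
  }

  // Mutex-guarded remove. Concurrent pushes only prepend, so interior
  // links are stable while the mutex is held.
  void remove(SampledAllocSketch* n) {
    std::lock_guard<std::mutex> guard(mu_);
    SampledAllocSketch* expected = n;
    if (head_.compare_exchange_strong(
          expected, n->next, std::memory_order_acq_rel,
          std::memory_order_acquire))
      return;  // n was the head
    // expected now holds the current head; find n's predecessor.
    for (SampledAllocSketch* p = expected; p != nullptr; p = p->next) {
      if (p->next == n) {
        p->next = n->next;
        return;
      }
    }
  }

  template<class F>
  void iterate(F&& f) {
    std::lock_guard<std::mutex> guard(mu_);
    for (SampledAllocSketch* p = head_.load(std::memory_order_acquire);
         p != nullptr; p = p->next)
      f(*p);
  }
};
```

Removing the head still needs a CAS in this sketch, because a concurrent `push()` may replace the head between the check and the update even while the mutex is held.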

Also defines g_in_sample_recording — a thread-local re-entrancy guard that suppresses recursive sampling when backtrace() or operator new is called from within the sample-recording path.
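One way such a guard might be used is through a small RAII helper. In the sketch below only `g_in_sample_recording` is a name from the PR; `RecordingScope` and `record_sample_sketch` are hypothetical, and the recursive call merely stands in for backtrace() or operator new re-entering the allocator.

```cpp
#include <cassert>

thread_local bool g_in_sample_recording = false;

// Hypothetical RAII helper: sets the flag on entry unless already set.
class RecordingScope {
  bool entered_;

public:
  RecordingScope() : entered_(!g_in_sample_recording) {
    if (entered_)
      g_in_sample_recording = true;
  }

  ~RecordingScope() {
    if (entered_) {
      assert(g_in_sample_recording);  // nobody else should have cleared it
      g_in_sample_recording = false;
    }
  }

  RecordingScope(const RecordingScope&) = delete;
  RecordingScope& operator=(const RecordingScope&) = delete;

  // False when this scope was opened inside another recording.
  bool entered() const { return entered_; }
};

// Stand-in for the recording path. Returns how many samples were recorded,
// counting any nested attempts that were not suppressed.
int record_sample_sketch(int nested_allocations) {
  RecordingScope scope;
  if (!scope.entered())
    return 0;  // re-entrant call: skip sampling entirely
  int recorded = 1;
  // Simulate allocations made while recording that hit the sampler again.
  if (nested_allocations > 0)
    recorded += record_sample_sketch(nested_allocations - 1);
  return recorded;
}
```

Because the flag is thread-local, a recording in progress on one thread does not suppress sampling on another.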

Neither component is wired into the allocator yet — connecting them to alloc()/dealloc() is a follow-on PR.

Test plan

New src/test/func/sampler/sampler.cc (auto-discovered by CMake, no CMakeLists changes required):

test_sampler_rate — sampled fraction converges to 1 - e^(-size/interval) within 5% over 100K allocations
test_sampler_disabled — g_sample_interval = 0 produces zero samples over 100K allocations
test_sampler_large_alloc — allocations much larger than the interval are sampled ≥95% of the time
test_list_push_remove — push/remove/iterate correctness on a single thread
test_list_remove_head — removing the head node leaves remaining nodes intact
test_list_concurrent_push — 8 threads × 128 nodes, all 1024 nodes present after join

All tests pass under both fast and check build variants.

Introduces two header-only components behind SNMALLOC_PROFILE, with no changes to existing allocation paths:

- sampler.h: per-thread Poisson sampler. Fast path is one subtraction and branch per allocation. Slow path fires ~once per g_sample_interval bytes (default 512KB) and draws the next interval from a geometric distribution via xorshift64 + inverse-CDF.
- sampled_list.h: SampledAlloc struct (raw PCs, size, weight) and a global doubly-linked SampledList. push() is lock-free (CAS on head); remove() and iterate() hold a std::mutex. Also defines the thread-local re-entrancy guard (g_in_sample_recording) that prevents recursive sampling when backtrace() or operator new is called from record_sample().

Tests cover sampling rate convergence, disabled-interval behaviour, large-allocation sampling probability, and concurrent push correctness.

mjp41 · 2026-05-28T13:50:59Z

Thanks for looking into this. I think it might be good to raise an issue where the design could be discussed before moving to a full implementation.

I think there are a few things that need defining:

Sample criteria

I can see a few different criteria:

By bytes - sample after a certain number of bytes (amount selected from an RNG in some way).
By number of allocations - sample after a certain number of objects.
By number of allocations per sizeclass - sample after a certain number of objects in a particular sizeclass.

Sampling by bytes represents large allocations much more heavily in the samples than small ones, while sampling by number of allocations favours small ones. If the sampling is for working out which are the most active allocation sites, then by number is correct. If you are trying to sample where the bytes of memory are allocated, then by bytes is the right approach. If you are trying to get good coverage for GWP-ASan-like sampling, then I think by number of allocations per sizeclass is probably right.
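The difference between the first two criteria can be illustrated with a toy workload. The sketch below is an assumption-laden simplification: it uses fixed thresholds instead of RNG-drawn ones so the outcome is deterministic, and every name, size, and period in it is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kSmallSize = 16;
constexpr std::size_t kLargeSize = 16384;

struct SampleCounts {
  std::size_t small_samples = 0;
  std::size_t large_samples = 0;
};

// 1000 small allocations with a large one after every 100 (10 large total).
std::vector<std::size_t> make_workload() {
  std::vector<std::size_t> sizes;
  for (std::size_t i = 0; i < 1010; ++i)
    sizes.push_back(i % 101 == 100 ? kLargeSize : kSmallSize);
  return sizes;
}

void record(SampleCounts& counts, std::size_t size) {
  if (size >= kLargeSize)
    ++counts.large_samples;
  else
    ++counts.small_samples;
}

// Sample every `period` allocations, regardless of size.
SampleCounts sample_by_count(const std::vector<std::size_t>& sizes,
                             std::size_t period) {
  assert(period > 0);
  SampleCounts counts;
  std::size_t remaining = period;
  for (std::size_t size : sizes) {
    if (--remaining == 0) {
      record(counts, size);
      remaining = period;
    }
  }
  return counts;
}

// Sample the allocation that crosses each `period_bytes` boundary.
SampleCounts sample_by_bytes(const std::vector<std::size_t>& sizes,
                             std::size_t period_bytes) {
  assert(period_bytes > 0);
  SampleCounts counts;
  std::size_t remaining = period_bytes;
  for (std::size_t size : sizes) {
    if (size >= remaining) {
      record(counts, size);
      remaining = period_bytes;
    } else {
      remaining -= size;
    }
  }
  return counts;
}
```

On this workload, `sample_by_count(sizes, 100)` records only small allocations, while `sample_by_bytes(sizes, 16384)` records only the large ones. Randomised thresholds would blur the split, but the bias in each direction remains.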

Global or thread-local sampling

Are thresholds a global property or thread-local? I would prefer thread-local as it doesn't introduce synchronisation.

Generally I dislike anything global as it is slow.

How to branch

By counter on each allocation
Using existing free list slow path

I think the simplest approach is to have a counter that counts whatever we decide on based on the previous points. This, however, involves adding an ALU instruction and a branch on the fast path. We could instead perform sampling only on the slow path, but without changes this would be very coarse-grained by default.

I think we could adapt the freelist algorithm to see if a particular freelist being allocated would take us beyond the sampling threshold, and if it would, then we have to split the freelist up, and use a part of the list, and leave the rest behind. This would put more work into setup, but in the common case where the whole freelist wouldn't take us beyond the threshold, then we could just use it. The fast path would have no additional instructions.
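The splitting idea can be sketched roughly as below. This is not snmalloc's freelist representation: `FreeObject`, `SplitResult`, and `split_before_threshold` are hypothetical names, and the list is a plain null-terminated singly-linked list. The sketch assumes an allocation is sampled when the cumulative bytes reach the threshold.

```cpp
#include <cassert>
#include <cstddef>

struct FreeObject {
  FreeObject* next;
};

struct SplitResult {
  FreeObject* usable;        // prefix that can be served on the fast path
  FreeObject* deferred;      // remainder, left behind for the slow path
  std::size_t usable_count;  // number of objects in `usable`
};

// Keep as many objects as can be allocated without reaching the threshold.
SplitResult split_before_threshold(FreeObject* head,
                                   std::size_t object_size,
                                   std::size_t bytes_until_sample) {
  assert(object_size > 0 && bytes_until_sample > 0);
  // Largest k such that k * object_size < bytes_until_sample.
  std::size_t budget = (bytes_until_sample - 1) / object_size;
  if (budget == 0)
    return {nullptr, head, 0};

  std::size_t count = 0;
  FreeObject* last = nullptr;
  for (FreeObject* p = head; p != nullptr && count < budget; p = p->next) {
    last = p;
    ++count;
  }
  if (last == nullptr)
    return {nullptr, nullptr, 0};  // empty freelist

  // Common case: the whole list fits and `rest` is null, so nothing is cut.
  FreeObject* rest = last->next;
  last->next = nullptr;
  return {head, rest, count};
}
```

The caller would subtract `usable_count * object_size` from its threshold and hand `usable` to the unmodified fast path. The walk shown here costs time proportional to the list length during setup, which is the extra setup work the comment anticipates.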

Where do sampled allocations live

We have the concept of a secondary allocator that @SchrodingerZhu added. Whether allocations that are being sampled live in the main snmalloc heap or in a secondary heap is something we might want to consider. The GWP-ASan integration passes through to it on occasion, which means the main heap doesn't need to maintain anything extra: for deallocation it just fails to find the object in snmalloc's heap and passes it to the next heap. This gives us great fast-path performance for most allocations, and the sampled ones are slower.
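The pass-through on deallocation might look roughly like the toy below. None of this reflects snmalloc's actual secondary-allocator interface: all three classes are hypothetical, the primary heap is a fixed bump arena, and the secondary heap simply wraps malloc/free with a set of live pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <unordered_set>

// Toy primary heap: a bump arena that can answer "is this pointer mine?".
class PrimaryHeapSketch {
  alignas(16) unsigned char arena_[4096];
  std::size_t used_ = 0;

public:
  void* alloc(std::size_t size) {
    size = (size + 15) & ~static_cast<std::size_t>(15);
    if (used_ + size > sizeof(arena_))
      return nullptr;
    void* p = arena_ + used_;
    used_ += size;
    return p;
  }

  bool owns(const void* p) const {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    auto base = reinterpret_cast<std::uintptr_t>(arena_);
    return addr >= base && addr < base + sizeof(arena_);
  }

  void dealloc(void*) {}  // bump arena: nothing to release in this toy
};

// Toy secondary heap holding the sampled allocations and their bookkeeping.
class SecondaryHeapSketch {
  std::unordered_set<void*> live_;

public:
  void* alloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr)
      live_.insert(p);
    return p;
  }

  bool try_dealloc(void* p) {
    if (live_.erase(p) == 0)
      return false;
    std::free(p);
    return true;
  }

  std::size_t live() const { return live_.size(); }
};

struct TwoTierSketch {
  PrimaryHeapSketch primary;
  SecondaryHeapSketch secondary;

  void* alloc(std::size_t size, bool sampled) {
    return sampled ? secondary.alloc(size) : primary.alloc(size);
  }

  // Returns true when the pointer fell through to the secondary heap.
  bool dealloc(void* p) {
    if (primary.owns(p)) {
      primary.dealloc(p);
      return false;
    }
    bool found = secondary.try_dealloc(p);
    assert(found);
    (void)found;
    return true;
  }
};
```

The primary heap keeps no per-sample state in this arrangement. Only pointers it does not recognise pay for the secondary lookup.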

Apologies

This is a bit of a brain dump, but I wanted to get this down before you spent too long on it.

jayakasadev · 2026-05-28T17:07:50Z

happy to raise an issue before taking this further

jayakasadev mentioned this pull request May 28, 2026

Design: heap profiling for snmalloc #853

Open

jayakasadev wants to merge 1 commit into microsoft:main from jayakasadev:rust-heap-profiling-infra
