Add Sampler and SampledList for heap profiling infrastructure #852

Open
jayakasadev wants to merge 1 commit into microsoft:main from jayakasadev:rust-heap-profiling-infra

@jayakasadev
Contributor

Summary

Two header-only components behind #ifdef SNMALLOC_PROFILE, with zero impact on non-profiling builds and no changes to existing allocation paths.

src/snmalloc/mem/sampler.h — per-thread Poisson sampler.

Models the allocation stream as a byte sequence with each byte independently marked with probability 1/interval. An allocation is sampled iff it spans at least one marked byte:

P(sample) = 1 - e^(-size/interval)
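For reference, a short derivation of this expression under the stated model, writing $s$ for the allocation size and $I$ for the interval (both symbols are introduced here for illustration). The exponential form is the large-$I$ limit of the per-byte marking model:

```latex
\begin{align*}
P(\text{no marked byte in } s \text{ bytes}) &= \left(1 - \tfrac{1}{I}\right)^{s} \\
P(\text{sample}) &= 1 - \left(1 - \tfrac{1}{I}\right)^{s} \\
&= 1 - \exp\!\left(s \ln\!\left(1 - \tfrac{1}{I}\right)\right) \\
&\approx 1 - e^{-s/I} \qquad \text{for } I \gg 1,\ \text{since } \ln\!\left(1 - \tfrac{1}{I}\right) = -\tfrac{1}{I} - \tfrac{1}{2I^{2}} - \cdots
\end{align*}
```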

Fast path: one subtraction and branch per allocation. Slow path fires ~once per g_sample_interval bytes (default 512KB) and draws the next interval from a geometric distribution via xorshift64 + inverse-CDF. Same statistical model as tcmalloc's Sampler.
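A minimal sketch of the countdown scheme described above, not the code in this PR. The names `SketchSampler`, `should_sample`, and `next_interval` are hypothetical, and the xorshift64 shift triple (13, 7, 17) and the seed handling are assumptions. Taking the floor of an exponential draw yields a geometric variate, so each call samples with probability 1 - e^(-size/interval) as long as the countdown is refreshed without carrying the overshoot forward.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical illustration only; not the Sampler from sampler.h.
struct SketchSampler
{
  uint64_t rng_state; // xorshift64 state, must never be zero
  double mean_interval; // mean bytes between samples; 0 disables sampling
  int64_t bytes_until_sample; // countdown consumed by the fast path

  SketchSampler(uint64_t seed, double interval)
  : rng_state(seed != 0 ? seed : 0x9E3779B97F4A7C15ull),
    mean_interval(interval),
    bytes_until_sample(std::numeric_limits<int64_t>::max())
  {
    if (mean_interval > 0)
      bytes_until_sample = next_interval();
  }

  // Marsaglia xorshift64 step (shift triple assumed, see lead-in).
  uint64_t xorshift64()
  {
    assert(rng_state != 0);
    uint64_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    rng_state = x;
    return x;
  }

  // Inverse-CDF draw: floor(-ln(u) * mean) + 1 is geometric on {1, 2, ...}.
  int64_t next_interval()
  {
    // Top 53 bits mapped to u in (0, 1], so log(u) is always finite.
    double u =
      (static_cast<double>(xorshift64() >> 11) + 1.0) / 9007199254740992.0;
    return static_cast<int64_t>(-std::log(u) * mean_interval) + 1;
  }

  // Fast path: one subtraction and one branch.
  bool should_sample(size_t size)
  {
    bytes_until_sample -= static_cast<int64_t>(size);
    if (bytes_until_sample > 0)
      return false;
    return slow_path();
  }

  // Slow path: refresh the countdown; a disabled sampler never samples.
  bool slow_path()
  {
    if (mean_interval <= 0)
    {
      bytes_until_sample = std::numeric_limits<int64_t>::max();
      return false;
    }
    bytes_until_sample = next_interval();
    return true;
  }
};
```

In this sketch one instance would live in thread-local storage, and the allocator would call `should_sample(size)` on each allocation. The disabled check sits on the slow path so that the fast path stays at a single subtraction and branch.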

src/snmalloc/mem/sampled_list.h: SampledAlloc struct and global SampledList.

SampledAlloc holds raw program-counter addresses (symbolication is deferred to profile-dump time), allocation size, and sample weight. SampledList is a doubly-linked list with lock-free push() (CAS on head) and mutex-guarded remove()/iterate().
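The CAS-on-head push can be sketched as follows. This is a simplified illustration rather than the SampledList in this PR: `SketchNode` and `SketchList` are hypothetical names, the list is singly linked, and `remove()` is omitted because unlinking nodes while lock-free pushes are in flight is exactly the part that needs the real design's care.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for SampledAlloc: only the fields the sketch needs.
struct SketchNode
{
  size_t size = 0;
  SketchNode* next = nullptr;
};

class SketchList
{
  std::atomic<SketchNode*> head{nullptr};
  std::mutex lock; // serialises iterators against each other

public:
  // Lock-free push: retry the CAS until this node becomes the new head.
  void push(SketchNode* n)
  {
    assert(n != nullptr);
    SketchNode* old_head = head.load(std::memory_order_relaxed);
    do
    {
      // On CAS failure old_head is reloaded, so relink before retrying.
      n->next = old_head;
    } while (!head.compare_exchange_weak(
      old_head, n, std::memory_order_release, std::memory_order_relaxed));
  }

  // Mutex-guarded traversal. Concurrent pushes only prepend, so the chain
  // reachable from the head value loaded here is never modified.
  template<typename F>
  void iterate(F&& f)
  {
    std::lock_guard<std::mutex> guard(lock);
    for (SketchNode* p = head.load(std::memory_order_acquire); p != nullptr;
         p = p->next)
    {
      f(*p);
    }
  }
};
```

The release ordering on a successful CAS pairs with the acquire load in `iterate()`, so a reader that observes a node also observes the fields written before it was pushed.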

Also defines g_in_sample_recording — a thread-local re-entrancy guard that suppresses recursive sampling when backtrace() or operator new is called from within the sample-recording path.
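One way such a guard can be used is as an RAII wrapper around a thread-local flag. The sketch below is an assumption about the shape, not the PR's code: `sketch_in_sample_recording`, `SketchRecordingGuard`, and `sketch_record_sample` are hypothetical names, and the recursive call stands in for `backtrace()` or `operator new` re-entering the allocator.

```cpp
#include <cassert>

// Hypothetical analogue of g_in_sample_recording: one flag per thread.
thread_local bool sketch_in_sample_recording = false;

// RAII guard: claims the flag if it is free, releases it on scope exit.
class SketchRecordingGuard
{
  bool acquired;

public:
  SketchRecordingGuard() : acquired(!sketch_in_sample_recording)
  {
    if (acquired)
      sketch_in_sample_recording = true;
  }

  ~SketchRecordingGuard()
  {
    if (acquired)
    {
      assert(sketch_in_sample_recording);
      sketch_in_sample_recording = false;
    }
  }

  SketchRecordingGuard(const SketchRecordingGuard&) = delete;
  SketchRecordingGuard& operator=(const SketchRecordingGuard&) = delete;

  // True only for the outermost guard on this thread.
  explicit operator bool() const
  {
    return acquired;
  }
};

// Returns how many samples were actually recorded across nested attempts.
int sketch_record_sample(int nested_attempts)
{
  SketchRecordingGuard guard;
  if (!guard)
    return 0; // already recording on this thread: suppress the sample

  // Stand-in for work that allocates and so re-enters the sampling path.
  int nested =
    nested_attempts > 0 ? sketch_record_sample(nested_attempts - 1) : 0;
  return 1 + nested;
}
```

With this shape, only the outermost call on a thread records a sample; any attempt made while the flag is set returns immediately, and the flag is cleared again once the outermost guard goes out of scope.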

Neither component is wired into the allocator yet — connecting them to alloc()/dealloc() is a follow-on PR.

Test plan

New src/test/func/sampler/sampler.cc (auto-discovered by CMake, no CMakeLists changes required):

  • test_sampler_rate — sampled fraction converges to 1 - e^(-size/interval) within 5% over 100K allocations
  • test_sampler_disabled: g_sample_interval = 0 produces zero samples over 100K allocations
  • test_sampler_large_alloc — allocations much larger than the interval are sampled ≥95% of the time
  • test_list_push_remove — push/remove/iterate correctness on a single thread
  • test_list_remove_head — removing the head node leaves remaining nodes intact
  • test_list_concurrent_push — 8 threads × 128 nodes, all 1024 nodes present after join

All tests pass under both fast and check build variants.

Introduces two header-only components behind SNMALLOC_PROFILE, with no
changes to existing allocation paths:

- sampler.h: per-thread Poisson sampler. Fast path is one subtraction
  and branch per allocation. Slow path fires ~once per g_sample_interval
  bytes (default 512KB) and draws the next interval from a geometric
  distribution via xorshift64 + inverse-CDF.

- sampled_list.h: SampledAlloc struct (raw PCs, size, weight) and a
  global doubly-linked SampledList. push() is lock-free (CAS on head);
  remove() and iterate() hold a std::mutex. Also defines the thread-local
  re-entrancy guard (g_in_sample_recording) that prevents recursive
  sampling when backtrace() or operator new is called from record_sample().

Tests cover sampling rate convergence, disabled-interval behaviour,
large-allocation sampling probability, and concurrent push correctness.
@mjp41
Member

mjp41 commented May 28, 2026

Thanks for looking into this. I think it might be good to raise an issue where the design could be discussed before moving to a full implementation.

I think there are a few things that need defining:

Sample criteria

I can see a few different criteria:

  • By bytes: sample after a certain number of bytes (the amount selected from an RNG in some way).
  • By number of allocations: sample after a certain number of objects.
  • By number of allocations per sizeclass: sample after a certain number of objects in a particular sizeclass.

Sampling by bytes will represent large allocations much more heavily in the samples than small ones. Sampling by number of allocations will sample small allocations more. If the sampling is for working out which allocation sites are the most active, then by number is correct. If you are trying to sample where the bytes of memory are allocated, then by bytes is the right approach. If you are trying to get good coverage for GWP-Asan-like sampling, then I think by number of allocations per sizeclass is probably right.

Global or thread-local sampling

Are thresholds a global property or thread-local? I would prefer thread-local as it doesn't introduce synchronisation.

Generally I dislike anything global as it is slow.

How to branch

  • By counter on each allocation
  • Using existing free list slow path

I think the simplest approach is to have a counter that counts whatever we decide on based on the previous points. This, however, involves adding an ALU instruction and a branch on the fast path. Alternatively, we could perform sampling only on the slow path, but without changes this would be very coarse-grained by default.

I think we could adapt the freelist algorithm to see if a particular freelist being allocated would take us beyond the sampling threshold, and if it would, then we have to split the freelist up, and use a part of the list, and leave the rest behind. This would put more work into setup, but in the common case where the whole freelist wouldn't take us beyond the threshold, then we could just use it. The fast path would have no additional instructions.
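The split decision described above comes down to a small piece of arithmetic. The sketch below assumes a by-bytes threshold where an allocation is sampled once the remaining byte budget reaches zero or below; `SketchSplitPlan` and `sketch_plan_freelist_split` are hypothetical names, and nothing here reflects snmalloc's actual freelist types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result: how much of a freelist can be handed to the fast path.
struct SketchSplitPlan
{
  size_t use_now; // objects that can be allocated without reaching the threshold
  size_t leave_behind; // objects held back; the first of these would be sampled
};

// Assumes the i-th allocation is sampled once i * object_size >=
// bytes_until_sample, matching a countdown that samples at zero or below.
SketchSplitPlan sketch_plan_freelist_split(
  size_t list_len, size_t object_size, size_t bytes_until_sample)
{
  assert(object_size > 0);

  // Largest i with i * object_size < bytes_until_sample.
  size_t safe =
    bytes_until_sample == 0 ? 0 : (bytes_until_sample - 1) / object_size;

  if (safe >= list_len)
    return {list_len, 0}; // common case: the whole list stays under budget

  return {safe, list_len - safe};
}
```

For example, with 10 objects of 64 bytes and 200 bytes left before the next sample, the plan uses 3 objects on the fast path and leaves 7 behind, so the setup code would take the sampling path when it next refills from the remainder.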

Where do sampled allocations live

We have the concept of a secondary allocator that @SchrodingerZhu added. Whether allocations that are being sampled live in the main snmalloc heap or in a secondary heap is something we might want to consider. The GWP-Asan support passes through to the secondary allocator on occasion, which means the main heap doesn't need to maintain anything extra: on deallocation it just fails to find the object in snmalloc's heap and passes it to the next heap. This gives us great fast-path performance for most allocations, while the sampled ones are slower.

Apologies

This is a bit of a brain dump, but I wanted to get this down before you spent too long on it.

@jayakasadev
Contributor Author

Happy to raise an issue before taking this further.
