Add Sampler and SampledList for heap profiling infrastructure #852
jayakasadev wants to merge 1 commit into
Introduces two header-only components behind SNMALLOC_PROFILE, with no changes to existing allocation paths:

- sampler.h: per-thread Poisson sampler. The fast path is one subtraction and one branch per allocation. The slow path fires about once per g_sample_interval bytes (default 512KB) and draws the next interval from a geometric distribution via xorshift64 + inverse-CDF.
- sampled_list.h: SampledAlloc struct (raw PCs, size, weight) and a global doubly-linked SampledList. push() is lock-free (CAS on head); remove() and iterate() hold a std::mutex. Also defines the thread-local re-entrancy guard (g_in_sample_recording) that prevents recursive sampling when backtrace() or operator new is called from record_sample().

Tests cover sampling rate convergence, disabled-interval behaviour, large-allocation sampling probability, and concurrent push correctness.
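As a rough illustration of the sampler described above, here is a minimal sketch of a byte-countdown sampler. It is not the code in this PR: the class name `SketchSampler`, its members, and the seed handling are hypothetical, and the interval draw uses an exponential inverse-CDF rounded up to a whole byte as a stand-in for the geometric draw.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch; names and layout are assumptions, not the PR's code.
class SketchSampler
{
  size_t interval_;            // mean bytes between samples; 0 disables sampling
  uint64_t rng_state_;         // xorshift64 state, must be non-zero
  int64_t bytes_until_sample_; // countdown to the next marked byte

  // xorshift64 step (shift triple 13, 7, 17).
  uint64_t next_random()
  {
    uint64_t x = rng_state_;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    rng_state_ = x;
    return x;
  }

  // Inverse-CDF draw: u is uniform in (0, 1], so -log(u) * interval is
  // exponential with mean `interval`; adding 1 after truncation rounds up
  // to a whole byte, which yields a geometric countdown of at least 1.
  int64_t next_interval()
  {
    if (interval_ == 0)
      return INT64_MAX; // disabled: the countdown is effectively never reached
    double u = (static_cast<double>(next_random() >> 11) + 1.0) *
      (1.0 / 9007199254740992.0); // 2^-53
    return static_cast<int64_t>(-std::log(u) * static_cast<double>(interval_)) + 1;
  }

public:
  SketchSampler(size_t interval, uint64_t seed)
  : interval_(interval), rng_state_(seed != 0 ? seed : 1), bytes_until_sample_(0)
  {
    bytes_until_sample_ = next_interval();
  }

  // Returns true if this allocation should be sampled.
  bool record_allocation(size_t size)
  {
    // Fast path: one subtraction and one branch.
    bytes_until_sample_ -= static_cast<int64_t>(size);
    if (bytes_until_sample_ > 0)
      return false;

    // Slow path: the allocation spans the marked byte; draw the next gap.
    bytes_until_sample_ = next_interval();
    return interval_ != 0;
  }
};
```

Under these assumptions, an allocation of `size` bytes is sampled with probability `1 - e^(-size/interval)`, because the countdown is at most `size` exactly when the exponential draw is below `size`.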
|
Thanks for looking into this. I think it might be good to raise an issue where the design could be discussed before moving to a full implementation. I think there are a few things that need defining.

Sample criteria

I can see a few different criteria:
Sampling by bytes will represent large allocations much more heavily in the samples than small ones. Sampling by number of allocations will sample small allocations more. If the sampling is for working out which are the most active allocation sites, then by number is correct. If you are trying to sample where the bytes of memory are allocated, then by bytes is the right approach. If you are trying to get good coverage for GWP-ASan-like sampling, then I think by number of allocations per sizeclass is probably right.

Global or thread-local sampling

Are thresholds a global property or thread-local? I would prefer thread-local as it doesn't introduce synchronisation. Generally I dislike anything global as it is slow.

How to branch
I think the simplest approach is to have a counter that counts (whatever we decide based on the previous). This, however, involves adding an ALU instruction and a branch on the fast path. We could perform sampling only on the slow path, but without other changes this would be very coarse-grained by default. I think we could adapt the freelist algorithm to see if a particular freelist being allocated would take us beyond the sampling threshold, and if it would, then we have to split the freelist up, use a part of the list, and leave the rest behind. This would put more work into setup, but in the common case where the whole freelist wouldn't take us beyond the threshold, we could just use it. The fast path would have no additional instructions.

Where do sampled allocations live

We have the concept of a secondary allocator that @SchrodingerZhu added. Whether allocations that are being sampled live in the main snmalloc heap or in a secondary heap is something we might want to consider. GWP-ASan passes through to it on occasion, which means the main heap doesn't need to maintain anything extra; for deallocation it just fails to find the allocation in snmalloc's heap and passes it to the next heap. This gives us great fast-path performance for most allocations, and the sampled ones are slower.

Apologies

This is a bit of a brain dump, but I wanted to get this down before you spent too long on it.
|
Happy to raise an issue before taking this further.
Summary
Two header-only components behind `#ifdef SNMALLOC_PROFILE`, with zero impact on non-profiling builds and no changes to existing allocation paths.

`src/snmalloc/mem/sampler.h`: per-thread Poisson sampler.

Models the allocation stream as a byte sequence with each byte independently marked with probability `1/interval`. An allocation is sampled iff it spans at least one marked byte, so the sampled fraction for a given size approaches `1 - e^(-size/interval)`, the value the rate test in the test plan checks.

Fast path: one subtraction and one branch per allocation. Slow path: fires about once per `g_sample_interval` bytes (default 512KB) and draws the next interval from a geometric distribution via xorshift64 + inverse-CDF. Same statistical model as tcmalloc's `Sampler`.

`src/snmalloc/mem/sampled_list.h`: `SampledAlloc` struct and global `SampledList`.

`SampledAlloc` holds raw program-counter addresses (symbolication is deferred to profile-dump time), allocation size, and sample weight. `SampledList` is a doubly-linked list with lock-free `push()` (CAS on head) and mutex-guarded `remove()`/`iterate()`.

Also defines `g_in_sample_recording`, a thread-local re-entrancy guard that suppresses recursive sampling when `backtrace()` or `operator new` is called from within the sample-recording path.

Neither component is wired into the allocator yet; connecting them to `alloc()`/`dealloc()` is a follow-on PR.

Test plan
New `src/test/func/sampler/sampler.cc` (auto-discovered by CMake, no CMakeLists changes required):

- `test_sampler_rate`: sampled fraction converges to `1 - e^(-size/interval)` within 5% over 100K allocations
- `test_sampler_disabled`: `g_sample_interval = 0` produces zero samples over 100K allocations
- `test_sampler_large_alloc`: allocations much larger than the interval are sampled ≥95% of the time
- `test_list_push_remove`: push/remove/iterate correctness on a single thread
- `test_list_remove_head`: removing the head node leaves remaining nodes intact
- `test_list_concurrent_push`: 8 threads × 128 nodes, all 1024 nodes present after join

All tests pass under both `fast` and `check` build variants.
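For readers unfamiliar with the pattern behind `test_list_concurrent_push`, the following is a minimal sketch of a lock-free push via CAS on the head pointer, plus a helper that mirrors the shape of the concurrent test. It is deliberately simplified and is not the PR's `SampledList`: the list is singly linked, `remove()` and the mutex are omitted, and `SketchNode`, `SketchList`, and `concurrent_push_count` are hypothetical names.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical node type; the real SampledAlloc also carries PCs and weight.
struct SketchNode
{
  size_t size = 0;
  SketchNode* next = nullptr;
};

class SketchList
{
  std::atomic<SketchNode*> head_{nullptr};

public:
  // Lock-free push: link the node to the current head, then try to swing
  // the head to the node. On failure, compare_exchange_weak reloads
  // old_head with the current head and the loop retries.
  void push(SketchNode* n)
  {
    SketchNode* old_head = head_.load(std::memory_order_relaxed);
    do
    {
      n->next = old_head;
    } while (!head_.compare_exchange_weak(
      old_head, n, std::memory_order_release, std::memory_order_relaxed));
  }

  // Walks the list; only safe here because no node is ever removed.
  size_t count() const
  {
    size_t c = 0;
    for (SketchNode* p = head_.load(std::memory_order_acquire); p != nullptr;
         p = p->next)
      c++;
    return c;
  }
};

// Pushes `per_thread` nodes from each of `threads` threads and returns how
// many nodes are reachable from the head after all threads have joined.
size_t concurrent_push_count(size_t threads, size_t per_thread)
{
  SketchList list;
  std::vector<std::vector<SketchNode>> storage(
    threads, std::vector<SketchNode>(per_thread));
  std::vector<std::thread> workers;
  for (size_t t = 0; t < threads; t++)
  {
    workers.emplace_back([&list, &storage, t] {
      for (SketchNode& n : storage[t])
        list.push(&n);
    });
  }
  for (std::thread& w : workers)
    w.join();
  return list.count();
}
```

With 8 threads and 128 nodes per thread, `concurrent_push_count(8, 128)` should report all 1024 nodes. A doubly-linked variant with removal needs extra care around the `prev` pointers, which is presumably why `remove()` and `iterate()` take a mutex in the PR.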