POC: Memory profiler allocation labels by rudolf · Pull Request #62649 · nodejs/node

rudolf · 2026-04-09T15:28:52Z

This is a POC for initial feedback. If we can get alignment within Node.js I could try to contribute the v8 changes upstream.

Summary

Adds the ability to tag sampling heap profiler allocations with string labels that propagate through async context (via CPED). This enables attributing memory usage to specific HTTP routes, tenants, or operations — something no JS runtime currently supports.

V8 changes

HeapProfileSampleLabelsCallback — embedder callback invoked on sampled allocations to retrieve labels from the current async context
AllocationProfile::Sample::labels — key-value pairs on each sample (behind V8_HEAP_PROFILER_SAMPLE_LABELS compile flag)

Datadog's Attila Szegedi proposed a similar label mechanism for CPU profiling on v8-dev (July 2025). V8 team (Leszek Swirski) indicated they would review non-invasive patches behind #ifdefs. This PR applies the same approach to heap profiling, which is simpler. Everything runs on the allocation thread with no signal-safety concerns.

Node.js changes:

v8.withHeapProfileLabels(labels, fn) — runs fn with labels that propagate across await
v8.setHeapProfileLabels(labels) — sets labels for current async scope (for framework middleware patterns)
v8.getAllocationProfile() returns samples[].labels and per-label externalBytes (Buffer/ArrayBuffer)
ProfilingArrayBufferAllocator tracks external allocations per label (single atomic load overhead when disabled)

#62273 landed the SyncHeapProfileHandle API with Symbol.dispose support. The labels API proposed here is complementary, it adds context (which route/tenant) to the samples that SyncHeapProfileHandle already collects. A follow-up could integrate withHeapProfileLabels as a method on the handle.

Motivation

In multi-tenant or multi-route Node.js servers, a memory spike today tells you how much memory grew but not what caused it. Operators resort to code inspection or heap snapshots but these don't scale to collecting data over long timespans for large deployments. With labeled heap/external memory profiling, you can answer "route /api/search accounts for 400MB of the 1.2GB heap" directly from production telemetry (e.g. via OTel).

This mirrors Go's pprof.Labels capability

Overhead

20-run benchmark (two-server realistic HTTP workload):

Sampling profiler alone: 0.6% (not statistically significant)
Sampling + labels: 2.2% total (p<0.01)
When disabled: zero overhead (no code path changes)

Test plan

V8 cctests for label callback
JS tests: label propagation across await, concurrent contexts, setHeapProfileLabels, external memory tracking, GC cleanup
Micro and macro benchmarks in benchmark/v8/ and benchmark/http/

nodejs-github-bot · 2026-04-09T15:28:58Z

Review requested:

@nodejs/gyp
@nodejs/performance
@nodejs/security-wg
@nodejs/v8-update

Qard · 2026-04-09T17:13:11Z

+  // This happens when --experimental-async-context-frame is not set on
+  // Node.js 22, causing all contexts to map to Smi::zero() (address 0).
+  if (cped.IsEmpty() || cped->IsUndefined()) return;
+  uintptr_t addr = node::GetLocalAddress(cped);


Storing in binding data by the CPED address won't work at all. Because all AsyncLocalStorage contexts are combined into a single AsyncContextFrame map, any changes to any contexts will change what this value is, even if the particular store you are interested in has not changed at all within that map frame.

You would need to have V8 capture the CPED value at the time of the sample and store that on the heap profile itself alongside the samples, then use that actual AsyncContextFrame instance to look up what the corresponding data was in that frame for the label store.

szegedi

Very interesting, I had to take a look since I was mentioned in the PR description itself 😄. I generally like the direction this is going in, long term we can probably replace Datadog's heap profiler that directly wraps V8 heap profiler with this and reduce our maintenance surface. This solution indeed looks like it can only be implemented in Node.js itself, and not as an add-on since the v8::ArrayBuffer::Allocator instance is a global Isolate::CreateParams setting so it's controlled by the embedder, that is, Node.js.

szegedi · 2026-04-14T12:29:33Z

+      v8::Local<v8::Value> context =
+          v8_isolate->GetContinuationPreservedEmbedderData();
+      if (!context.IsEmpty() && !context->IsUndefined()) {
+        sample->cped.Reset(v8_isolate, context);


Can't you get the value associated with the AsyncLocalStorage from the AsyncContextFrame here and only store that? Essentially the additional step from BuildSamples. You'll be retaining in memory all ALSes this ACF references as keys, some might have large retained set themselves. If you can safely call GetContinuationPreservedEmbedderData (which creates a v8::Local) I'd think you can also safely get a local to the ALS key from its global, and call v8::Map::Get on context too? (Or direct V8 hashmap reading like I suggested in that other comment in ProfilingArrayBufferAllocator::FindCurrentLabels)

Great point about memory retention. You're right that storing the entire CPED keeps all ALS stores alive as long as the sample exists, not just the labels store 💣 💥

OrderedHashMap::FindEntry is a great suggestion!

szegedi · 2026-04-14T13:26:30Z

+  // BackingStore::Allocate inside the ArrayBuffer constructor).
+  // Use AsArray() which reads the internal backing store directly without
+  // calling JS builtins, then iterate entries by identity comparison.
+  v8::Local<v8::Array> entries = frame->AsArray();


What's the memory requirement of this AsArray call? It sounds like it'd have to construct a whole new array?
You're lucky that you have a whole embedded copy of V8, so you can use existing internals, something like this roughly sketched might work:

#include "src/objects/js-collection.h" #include "src/objects/ordered-hash-table.h" // Given a v8::Local<v8::Map>, get to the internal table: i::Tagged<i::JSMap> js_map = *Utils::OpenDirectHandle(*frame); i::Tagged<i::OrderedHashMap> table = i::Cast<i::OrderedHashMap>(js_map->table()); // no-JS lookup in the table: i::InternalIndex entry = table->FindEntry(isolate, *Utils::OpenDirectHandle(*als_key)); if (entry.is_found()) { i::Tagged<i::Object> value = table->ValueAt(entry); // go back from Tagged to Local: v8::Local<v8::Value> val = Utils::ToLocal(i::direct_handle(value, i_isolate)); // use val as before... }

yeah AsArray() allocates a full JS Array and copies the Map backing store. I moved away from AsArray() to Map::Get() in an unpushed revision, but your OrderedHashMap::FindEntry approach would be even better.

Since we're already modifying V8 source in deps/v8/src/profiler/ and have access to internals from src/ as well, I can adopt this pattern in both locations:

SampleObject (allocation time) — extract ALS value, store as Global on Sample

ProfilingArrayBufferAllocator::TrackAllocate — same pattern for ArrayBuffer tracking

The read-time callback in GetAllocationProfile then receives the already-extracted flat array, making it trivial (just string conversion).

szegedi · 2026-04-14T13:34:23Z

@@ -11,6 +11,11 @@
 #include <unordered_set>
 #include <vector>

+#ifdef V8_HEAP_PROFILER_SAMPLE_LABELS


So I presume this'll need upstreaming, right?

yeah, I'm hoping if we can show that Nodejs would definitely use this feature they'd be more open to accepting it

rudolf

Very interesting, I had to take a look since I was mentioned in the PR description itself 😄. I generally like the direction this is going in, long term we can probably replace Datadog's heap profiler that directly wraps V8 heap profiler with this and reduce our maintenance surface.

@szegedi Thanks for popping by! Your thread sparked this idea and made me think maybe it's not all that hard (at least for memory profiling, sounds like CPU profiles might be a different beast).

rudolf · 2026-04-14T14:56:51Z

+      v8::Local<v8::Value> context =
+          v8_isolate->GetContinuationPreservedEmbedderData();
+      if (!context.IsEmpty() && !context->IsUndefined()) {
+        sample->cped.Reset(v8_isolate, context);


Great point about memory retention. You're right that storing the entire CPED keeps all ALS stores alive as long as the sample exists, not just the labels store 💣 💥

OrderedHashMap::FindEntry is a great suggestion!

rudolf · 2026-04-14T15:01:17Z

+  // BackingStore::Allocate inside the ArrayBuffer constructor).
+  // Use AsArray() which reads the internal backing store directly without
+  // calling JS builtins, then iterate entries by identity comparison.
+  v8::Local<v8::Array> entries = frame->AsArray();


yeah AsArray() allocates a full JS Array and copies the Map backing store. I moved away from AsArray() to Map::Get() in an unpushed revision, but your OrderedHashMap::FindEntry approach would be even better.

Since we're already modifying V8 source in deps/v8/src/profiler/ and have access to internals from src/ as well, I can adopt this pattern in both locations:

SampleObject (allocation time) — extract ALS value, store as Global on Sample

ProfilingArrayBufferAllocator::TrackAllocate — same pattern for ArrayBuffer tracking

The read-time callback in GetAllocationProfile then receives the already-extracted flat array, making it trivial (just string conversion).

rudolf · 2026-04-14T15:02:55Z

@@ -11,6 +11,11 @@
 #include <unordered_set>
 #include <vector>

+#ifdef V8_HEAP_PROFILER_SAMPLE_LABELS


yeah, I'm hoping if we can show that Nodejs would definitely use this feature they'd be more open to accepting it

Add V8 API for attaching embedder-defined labels to sampling heap profiler samples. Labels propagate through async context via CPED and are resolved at profile-read time. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Add Node.js C++ bindings that wire up the V8 sample labels API. Handles callback registration, ALS key setup, and cleanup on worker termination. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Add JS API for labeling heap profiler samples via AsyncLocalStorage. Labels are pre-flattened at set time to avoid V8 property access during resolution. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Cover basic labeling, multi-key labels, async propagation, worker cleanup, and C++ callback tests. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Document the new labels API on startSamplingHeapProfiler, getAllocationProfile, withHeapProfileLabels, and setHeapProfileLabels. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Measure per-allocation overhead and HTTP throughput impact across profiling modes. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Expose a public API to look up ALS values from the CPED map, so the allocator can resolve labels without duplicating V8 internal logic. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Track per-label Buffer/ArrayBuffer backing store allocations. The allocator is installed as a delegate when profiling is active, with zero overhead otherwise. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Cover per-label attribution, GC cleanup, multi-key labels, and JSON serialization for externalBytes. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Document the externalBytes field in getAllocationProfile output. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

IlyasShabi · 2026-04-20T12:35:14Z

+    // label array) instead of the full CPED avoids retaining all ALS
+    // stores for the lifetime of the sample.  Labels are resolved from
+    // this at read time (in GetAllocationProfile) via the callback.
+    Global<Value> label_value;


This means we are storing the label value per sample, even when many samples share the same label set. I think this is likely expensive in memory and in GetAllocationProfile()

IMO it's better to store a small uint32_t label_id = 0 on each sample instead, and keep a shared map for labels. I believe that with this GetAllocationProfile() could resolve labels once per unique context rather than once per sample.

I agree in principle but wonder if the memory overhead is really an issue. We only retain samples and memory that have survived GC. Starting with per sample overhead: ~120 B base + ~24 B label handle slot = ~144 B. And assuming Poisson sampling with 512kb intervals:

200 MB live heap would have ~400 retained samples with 58 KB / 0.029% sample overhead (48 KB base, 10 KB labels)

1GB live heap would have ~2000 samples with 288kb sample overhead (240KB base, 48KB labels)

There's additional overhead like the shared ALS array but that's a function of number of unique lables not a per sample overhead so would be present in both designs. Feels worth profiling/benchmarking to make sure my math adds up, but the label_id solution comes at the cost of additional complexity for a small memory gain.

It makes a lot of sense to add a local cache during GetAllocationProfile to avoid the expensive callbacks for labels we have already resolved. I think that can make a noticeable difference to CPU overhead.

We only retain samples and memory that have survived GC

Probably not if you call it like:

hp->StartSamplingHeapProfiler( 64, 16, static_cast<v8::HeapProfiler::SamplingFlags>( v8::HeapProfiler::kSamplingIncludeObjectsCollectedByMinorGC | v8::HeapProfiler::kSamplingIncludeObjectsCollectedByMajorGC ) );

@IlyasShabi I'm stuck trying to measure this. The labels work adds bookkeeping that lives in C++ heap so process.memoryUsage().rss and v8.getHeapStatistics().used_heap_size don't directly capture it.

What I've tried so far:

v8.getHeapStatistics().used_global_handles_size surprisingly doesn't track Sample::label_value Globals

RSS deltas across 3-iteration matrix runs at 30 s each has a noise band of +/-60 MB

Is there a methodology you'd use to measure this kind of work?

Did some measurement work on this. With a realistic HTTP workload (~2,500 rps, ~750 MB/s V8 alloc churn), 30 s with includeCollectedObjects: true at64 KB sampling interval, ~300K retained samples: aggregate libc-malloc shows the labels feature adds 6-9 MB on top of ~29 MB of profiler-only bookkeeping. Below 5% of the underlying sampler cost at this workload, scales linearly with sample count.

However, while running the load test I hit a segfault in ProfilingArrayBufferAllocator::TrackFree: V8's ArrayBufferSweeper calls it on a background worker thread, and the per-allocation Global label_value runs ~Global off-thread which trips the node->IsInUse() CHECK in GlobalHandles::NodeSpace::Release.

The fix it I think we need the same machinery as what you suggested for Sample::label_value so will take a stab at that

IlyasShabi · 2026-04-20T12:39:35Z

+  if (sampleInterval !== undefined) validateUint32(sampleInterval, 'sampleInterval', true);
+  if (stackDepth !== undefined) validateUint32(stackDepth, 'stackDepth');
+  if (options !== undefined) validateObject(options, 'options');
+  return _startSamplingHeapProfiler(sampleInterval, stackDepth, options);


We should call ensureHeapProfileLabelsALS() before this line to make sure the ALS store exists and its key is registered before sampling begins.

szegedi · 2026-04-21T11:46:55Z

@szegedi Thanks for popping by! Your thread sparked this idea and made me think maybe it's not all that hard (at least for memory profiling, sounds like CPU profiles might be a different beast).

Oh yeah, I could probably talk for hours about CPU sample labeling :-)

Reading labels to associate with samples is the tricky part with CPU profiles. Since it happens in a signal handler, we we can't use basically any V8 APIs, as even getting a Local<Value> from GetContinuationPreservedEmbedderData violates signal safety. So in our profiler we end up with code like this as well as a reimplementation of V8 logic for reading hashmaps.

There's also some other fun details, like needing to have a ringbuffer for samples around as you can't allocate memory in the signal handler. And we're also using shared_ptr<Global<Value>> to store the reference to the current label set as we can safely copy a shared_ptr but couldn't be allocating a different global (globals can't be copied either.) FWIW, in our implementation, we aren't converting labels up front, we just take a regular JS object representing them and store a reference to it on sampling, and then only generate labels from its entries when we build the profiles.

Associating labels in that ringbuffer with samples is also somewhat tricky, we do it by correlating monotonic clock values – we invoke v8::base::TimeTicks::Now in the signal handler and store the value in the ring buffer struct and then later correlate them with samples using v8::CpuProfiler::GetSampleTimestamp

As I said, I can probably go on for at least an hour. I might need to write a talk :-)

Measures v8.getAllocationProfile() throughput while varying the number of samples sharing each ALS label set, so the impact of resolution caching can be measured in isolation. Configurations: samples_per_unique_label_set in {1, 10, 100, 1000}, ~5000 retained samples per run. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

SetHeapProfileLabelsStore() stored the ALS instance in BindingData but did not forward it to V8's HeapProfiler. With the canonical JS pattern (startSamplingHeapProfiler then withHeapProfileLabels) V8 saw no ALS key and SampleObject() never captured a label. Also hoist HeapProfilingCleanup above ~BindingData() so the destructor can call DoCleanup() and delete on a complete type. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

GetAllocationProfile() invoked the labels callback once per sample, even when many samples shared the same ALS value (typical for any withHeapProfileLabels scope holding multiple allocations). Add a per-call cache keyed by ALS value Address. Empty resolutions are also cached to avoid re-invoking the callback for values that legitimately resolve to nothing. Cache misses on GC-moved objects are correctness safe; the callback runs again. Benchmark numbers in benchmark/v8/heap-profiler-labels-resolution.js. Refs: nodejs#62649 (comment) Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

Qard · 2026-04-25T02:11:58Z

Reading labels to associate with samples is the tricky part with CPU profiles. Since it happens in a signal handler, we we can't use basically any V8 APIs, as even getting a Local<Value> from GetContinuationPreservedEmbedderData violates signal safety. So in our profiler we end up with code like this as well as a reimplementation of V8 logic for reading hashmaps.

Yeah, this is why I wanted to have V8 itself capture CPED pointer state in each sample internally so that could be read back as actual local values later when you process the profile tree. Never got around to that before I changed teams though.

nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Apr 9, 2026

Qard reviewed Apr 9, 2026

View reviewed changes

rudolf force-pushed the poc-allocation-profiler-tags-v2 branch from 8743634 to 302bebe Compare April 10, 2026 09:22

szegedi reviewed Apr 14, 2026

View reviewed changes

rudolf commented Apr 14, 2026

View reviewed changes

rudolf added 10 commits April 15, 2026 16:15

deps: add heap profile sample labels to V8 profiler

2febdb4

Add V8 API for attaching embedder-defined labels to sampling heap profiler samples. Labels propagate through async context via CPED and are resolved at profile-read time. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

src: add heap profile labels bindings

a393276

Add Node.js C++ bindings that wire up the V8 sample labels API. Handles callback registration, ALS key setup, and cleanup on worker termination. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

lib: add heap profile labels API to v8 module

a8ac88e

Add JS API for labeling heap profiler samples via AsyncLocalStorage. Labels are pre-flattened at set time to avoid V8 property access during resolution. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

test: add heap profile labels tests

c1934a5

Cover basic labeling, multi-key labels, async propagation, worker cleanup, and C++ callback tests. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

doc: add heap profiler labels API documentation

0e7004b

Document the new labels API on startSamplingHeapProfiler, getAllocationProfile, withHeapProfileLabels, and setHeapProfileLabels. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

benchmark: add heap profiler labels overhead benchmarks

18a6b90

Measure per-allocation overhead and HTTP throughput impact across profiling modes. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

deps: add LookupAlsValue to V8 heap profiler

37180c7

Expose a public API to look up ALS values from the CPED map, so the allocator can resolve labels without duplicating V8 internal logic. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

src: add ProfilingArrayBufferAllocator for external memory tracking

3e44751

Track per-label Buffer/ArrayBuffer backing store allocations. The allocator is installed as a delegate when profiling is active, with zero overhead otherwise. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

test: add external memory tracking tests

d5fa37d

Cover per-label attribution, GC cleanup, multi-key labels, and JSON serialization for externalBytes. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

doc: add externalBytes to heap profiler documentation

da6af28

Document the externalBytes field in getAllocationProfile output. Signed-off-by: Rudolf Meijering <skaapgif@gmail.com>

rudolf force-pushed the poc-allocation-profiler-tags-v2 branch from 302bebe to da6af28 Compare April 17, 2026 20:12

IlyasShabi reviewed Apr 20, 2026

View reviewed changes

rudolf added 3 commits April 23, 2026 22:15

Uh oh!

Conversation

rudolf commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

V8 changes

Node.js changes:

Motivation

Overhead

Test plan

Uh oh!

nodejs-github-bot commented Apr 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

szegedi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rudolf left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rudolf Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

IlyasShabi Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

szegedi commented Apr 21, 2026

Uh oh!

Qard commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

rudolf commented Apr 9, 2026 •

edited

Loading

rudolf Apr 21, 2026 •

edited

Loading

IlyasShabi Apr 21, 2026 •

edited

Loading