Supporting multiple trace providers: Trace Events, DTrace, Lttng, ETW #163

jasnell · 2018-02-15T00:06:19Z

Node.js core supports multiple trace providers... we currently have to instrument for each individually, which increases maintenance burden. Before we consider making any changes here, however, we need to understand the critical use cases and requirements for the existing probes. Let's use this thread to discuss.

jasnell · 2018-02-15T00:10:27Z

Some feedback received so far:

The ability to switch on tracing probe points on a running process is critical.
The data collected at each probe point is also critical. We have existing probe points with existing data collections that should be maintained. Whether or not there are more that should be included needs to be determined.
The semantics of the probe points must be maintained (e.g. gc_start must emit at the start of gc) but the specific function that triggers those is not as critical.

(will update this list as we go)

jasnell · 2018-02-15T00:26:24Z

One key question that needs to be answered: does a single Node.js binary need to support multiple trace mechanisms? For instance, current builds on OSX support both DTrace and v8 Trace Events independently. If we include a single instrumentation mechanism, does that need to support multiple trace backends at runtime, or is it ok to pick one backend at compile time?

jclulow · 2018-02-15T01:08:44Z

As long as DTrace probes are included in the OS X and SmartOS (and FreeBSD?) binaries provided by the project, I would definitely be OK with leaving out the v8 Trace Events probes.

jasnell · 2018-02-15T02:52:41Z

I have a distinct feeling that current users of the v8 trace event probes will say the exact same thing. I think we need to be resigned to the fact that the default node.js binary is simply going to be required to support multiple trace targets for the forseeable future. The trick will be finding a way of efficiently instrumenting once for those multiple targets.

bnoordhuis · 2018-02-15T08:31:28Z

Trace events are a lot more flexible, maintainable and consumable than static probes so if we had to pick one or the other, it's trace events all the way. I don't think it has to be either/or, though.

That said, the static probes do suffer from technical debt; see e.g. nodejs/node#18074. Get involved if you want to ensure their bright future.

danielkhan · 2018-02-19T13:55:01Z

I also thought that we agreed to think about deprecating DTrace in favor of trace events. If there is no feature gap to be expected, maybe we can initiate that process and think of EOLing DTrace with ~ v12?

jasnell · 2018-02-19T14:00:56Z

Given feedback I've received I don't believe we're going to be able to deprecate any time soon

bnoordhuis · 2018-02-19T21:21:40Z

I'm curious, feedback from whom? I can't imagine many people caring about our static probes; it might be a literal fingers-on-one-hand deal.

Data point: I quietly removed them from libuv a few years back and no one complained.

AndreasMadsen · 2018-02-20T20:16:15Z

Instead of supporting LTTng, could we look into using the Linux native trace events: https://lwn.net/Articles/379903/

If I remember correctly, they are also consumed by LTTng, so hopefully it will make little difference for the LTTng users. (edit: I remember correctly.)

AndreasMadsen · 2018-02-22T23:43:58Z

new LTTng PR: nodejs/node#18945 - it suggests there is some use.

bnoordhuis · 2018-02-23T08:51:13Z

Probably very little though. The build was broken for close to two years without anyone noticing.

simark · 2018-02-23T16:24:25Z

I opened the PR mentioned above, and was asked to provide some feedback here.

My problem is that we have a node process with an apparent memory leak (we let it run, the memory usage goes up endlessly). Inspecting using the devtools didn't show anything obvious, there wasn't a big difference before/after in the number or total size of allocated objects. But there was a difference of tens of gigabytes of mapped virtual memory (original issue: eclipse-theia/theia#1284).

I happen to know LTTng and saw that node had some LTTng instrumentation, so I wanted to try it, that's when I hit the issues addressed by my PR. I don't know if it will actually help, but I thought I'd give it a shot. I didn't look yet, but I assume there are some tracepoints in the memory allocation/deallocation functions. If there isn't, I thought I could add some. I am not familiar with the node code, so I'm in exploration mode :)

I think that properly instrumenting the binary with tracepoints placed at strategic places is useful. You won't use it often, but the day you need it you're happy it's there.

In the end I don't really mind if it's LTTng or another tracer. But one thing to keep in mind is that some tricky performance problems or rare crashes can only be reproduced in production, so it's sometimes necessary to enable tracing in production. LTTng is very efficient and designed for such use cases.

Tracing needs to have the smallest possible overhead. If it's possible to support multiple trace backends dynamically without any performance cost, then why not. But otherwise, if having to choose at compile time results in a more efficient solution, I would prefer that.

ofrobots · 2018-02-24T05:53:24Z

@simark lttng based tracing would probably not help find the memory leak if it happens to be in the JS heap. V8 JIT generates allocation code inline at each allocation site – something that would be hard to intercept using a tracing mechanism. You might want to check out the experimental sampling heap profiler in V8 that is designed for in-production memory leak debugging.

simark · 2018-02-26T00:28:59Z

@simark lttng based tracing would probably not help find the memory leak if it happens to be in the JS heap. V8 JIT generates allocation code inline at each allocation site – something that would be hard to intercept using a tracing mechanism.

And that JITed code doesn't call a common allocation function?

You might want to check out the experimental sampling heap profiler in V8 that is designed for in-production memory leak debugging.

Thanks, I'll give it a try.

simark · 2018-02-26T02:42:51Z

@simark in the interest of keeping the thread on topic of tracing, let's move the memory leak discussion to a separate thread, perhaps in the https://github.com/nodejs/help repo. If the leak is suspected in Node core itself, the issue can be in https://github.com/nodejs/node.

Sorry, I had both issues opened, and posted in the wrong tab. You can delete all history that this ever happened :)

mike-kaufman · 2018-02-26T17:45:03Z

I feel it would be helpful (at least for me) if we had a matrix kinda' like the following, so we could look at different trace libraries we're discussing, and understand differences in capabilities. The rows aren't complete yet, so don't get too hung up on that.

Do others find this useful? If so, should we move into top of issue & iterate on filling out in more detail? Or move into a markdown doc somewhere so ppl can PR it?

	chromium trace_events	Linux Native TRACE_EVENTS	DTrace	LTTNG	ETW
supported OS?	all	linux	SmartOS, Mac OS X (partially),freeBSD	linux	windows
requires re-compile?	no	yes (no, assuming kernel >= v4.2 [2015])	no	yes	no
dynamic categories?	yes	no	no (yes, with libusdt)	no	no
dynamic listeners?	no (yes, with effort)	yes	yes	yes	yes
interrupt (e.g. stack trace)	no	yes	yes	yes	yes
documented in Node.js	Yes (poorly)	No	No	No	No
developer experience	TDB	TBD	TBD	TBD	TBD
others?

edit: filled out by @AndreasMadsen
edit: minor clarification on ETW traces by @mike-kaufman
edit: added easiness of usage row by @mcollina
edit: added FreeBSD to Dtrace column by @mike-kaufman
edit: renamed "easiness of usage" to "developer experience" and tagged all columns as TBD by @mcollina

mcollina · 2018-02-28T16:55:52Z

I would add another line that assess how easy is to use each tool.

I would say that using ETW is extremely hard (and requires a lot of knowledge of Windows internals). dtrace needs Solaris (but maybe not in the future). LTTNG is being removed.

mike-kaufman · 2018-02-28T17:14:44Z

@mcollina - feel free to edit table above to add rows you think are useful.

"Easy to use" risks being a subjective thing, but it's valid, and would be great to see if there's consensus around different tools being "easy"/"hard".

The other angle here (not captured in table above), is what are the expectations for a runtime on the target OSs? E.g., for windows, expectation is probably ETW. Not saying that's a strong argument to support one library over another, but it warrants consideration IMO.

mcollina · 2018-02-28T19:59:58Z

@mike-kaufman I've added the row

jclulow · 2018-02-28T22:19:16Z

Please remove the "easiness of usage" row. It's completely subjective, where at least the other rows seem like they could be definitively measured one way or another. Given that you've put "hard" against everything that isn't Chromium Trace Events, it feels fairly politically motivated as well.

You should also add FreeBSD to the list of operating systems providing DTrace support.

mike-kaufman · 2018-02-28T22:26:54Z

@jclulow - I agree the "easy"/"hard" classification subjective. However, usability is a key criteria. I'd like to get to a place where we have more detailed specifics about what's contributing to the easy/hard perception. Matteo's classification at least starts that conversation.

If you have specifics or details that you think contribute to usability of any these libraries, it would be really helpful if you could chime in with them.

jclulow · 2018-02-28T22:30:55Z

I can't stress enough how frustrating it is to have an entirely fictional criteria added, in which the features that I use every day are immediately on the back foot with no explanation. If you want to add some concrete criteria in the future, that's fine -- we can talk about that.

This is not that, though, and it needs to go immediately. Alternatively, you can change the entry for DTrace from "hard" to "easy", at which point obviously I agree! (See the problem?)

mike-kaufman · 2018-02-28T22:33:02Z

@mcollina - Can you help the thread understand where your "easy"/"hard" perceptions come from?
@jclulow - can you provide more color around the end-user will consume DTrace events?

mcollina · 2018-02-28T22:39:48Z

@jclulow I used both dtrace and perf in the past. Setting up and using those tools is not something a typical Node.js user is able to do. I think developer experience is a key criteria for these tools, as most of those options are either a) poorly documented, b) not documented at all, c) not working at all (lttng).

dtrace, lttng, Linux TRACE_EVENT are completely not documented within our organization. We document Chromium trace events: https://nodejs.org/api/tracing.html. Assuming documentation as one of the key metrics for usability, all of them minus Chromium trace events are hard to use.
I'm not talking about the tool itself, but rather their usability with the Node.js project itself.
In most cases, adding some documentation could definitely improve usability.

@jclulow I kindly ask to use more calm language. No one is offending you here.

jclulow · 2018-02-28T22:44:46Z

I am completely calm. Like I said, if you want to talk about something more concrete where we can measure alternatives by some objective criteria, that's fine. But, again, that's not what's up there right now!

By way of example, if you wanted to put a row for "described in the Node documentation", that would be fine -- it's an objectively measurable fact. It also has the benefit of being something we could take action on to fix, if needed.

The criterion "easiness of usage" is, by contrast, simply playing favourites with no justification. It should be taken down, first, then (if it makes sense) replaced with objective criteria afterwards.

As an aside, I would note that you can hardly tell somebody else whether they are offended or not.

mike-kaufman · 2018-02-28T22:45:57Z

@jclulow - the point here (at least from my pov) is to have a conversation about the different solutions & try to come to some objective, shared understanding. The easy/hard thing is simply a starting point for that conversation, which I think we're having. I don't foresee that column lasting too long.

mike-kaufman · 2018-02-28T22:48:00Z

RE developer experience, there are two different facets here:

experience of someone adding traces
experience of someone consuming traces

Complexity with the consumption can generally be addressed with additional tools if that is a concern.

mcollina · 2018-02-28T22:51:29Z

I've put the "easiness of usage" as TBD, and added a documentation line.

I strongly argue that we need a measure for this, not with the tool themselves but with their usage with Node.js. perf as a tool can be simple or hard to use, but are we making things harder to use Node.js with perf? how can we improve the developer experience on the various platform?

jasnell · 2018-02-28T22:52:35Z

One thing that I'd like to clarify... the "hard" / "easy" bit is not a criterion for removing or favoring one mechanism over another. We will be continuing to support Dtrace/eBPF/ETW style tracing along with v8 style tracing. Each has it's own complexities, limitations, strengths, and relative difficulty of use. For instance, v8 trace events are theoretically "easier" but suffer from a lack of comprehensive tooling.

geek · 2018-02-28T22:59:18Z

@jasnell please expand with specifics on your claim

we currently have to instrument for each individually, which increases maintenance burden

Is this really true, are tracing providers getting in our way? What problems aren't we able to solve because of our various trace providers? Are developers getting stuck or slowed in some way because of our trace providers?

misterdjules · 2018-02-28T23:04:03Z

@jasnell The premise of this discussion seems to be:

Node.js core supports multiple trace providers... we currently have to instrument for each individually, which increases maintenance burden.

Can we be a bit more specific (maybe quantify somehow?) the "maintenance burden" of specific tracing frameworks that are currently implemented in node core? Otherwise I'm not sure why we need to have this discussion.

I think I'm familiar with what is and has been needed to maintain the DTrace support in node core, but I may have missed some things:

Maintaining the ustack helper: this can be challenging and can block V8 upgrades, but has usually been fixed in a timely manner by either @davepacheco, @cjihrig, myself or others (I'm sure I'm forgetting people here, my apologies)
Maintaining the build side of that support, which IIRC has not been a major challenge, or hasn't been a problem too often

Is there anything else? If so, it doesn't seem to be such a challenge to me, but again, I may be missing things.

@bnoordhuis mentioned nodejs/node#18074 as an example of maintenance burden for static probes in general (not just for DTrace), but unless I'm missing something those probes are not supported by DTrace, so I don't believe this example applies to it (not saying Ben was suggesting that).

Now, I can also provide my feedback as a user of the DTrace support in node: in general, I haven't had the need to have static probes in libuv, which is maybe a sign of why removing them a while ago hasn't generated any significant pushback.

On the question of:

understand[ing] the critical use cases and requirements for the existing probes

I would say that I use the existing DTrace static probes very often to understand the behavior of node HTTP clients and servers (latency, distribution of requests methods/paths/etc., etc.) while:

having the guarantee that my observations do not have a significant impact on performance/stability of the observed system
having access to the whole system's view when those probes fire (e.g native stack frames)

@davepacheco mentioned specific use cases in a previous Twitter thread at https://twitter.com/dapsays/status/963921954309730304 that are similar to mine. I can definitely help anyone understand those use cases better. If people are interested, just let me know.

However, just to make sure this was not lost in the length of this comment: I would really like us to be a bit more precise about what the "maintenance" burden is. If there's anyway users of the DTrace support can help with that, let's talk about that too.

jasnell · 2018-02-28T23:04:18Z

@geek ... look at the lltng probes, for instance... apparently using --with-lltng has been broken in build for a while and no one noticed. We're now just going ahead and pulling those out because apparently no one is using them.

For the dtrace probes, I'm evaluating now the best way to instrument the new http2 code paths and I'm trying to avoid having to instrument for dtrace and v8 trace events separately. I also want to instrument more for the performance API. Part of this entire discussion is to see if it's possible to develop a single instrumentation approach that allows us to support multiple providers and ensures that we have appropriate test coverage across those.

jasnell · 2018-02-28T23:08:07Z

To put folks at ease a bit: no one is considering removing the DTrace support :-) ... I floated the idea as a hypothetical test balloon to solicit feedback. As soon as I did, enough folks came back with panicked screams that it the very notion could be quickly discarded (as I suspected it would). The key point is: v8 trace events are not an alternative to dtrace probes and it would be fantastic to have an approach that supports both ... preferably an approach that makes things more useful for everyone.

misterdjules · 2018-02-28T23:08:39Z

@mcollina

dtrace, lttng, Linux TRACE_EVENT are completely not documented within our organization.

FWIW, @davepacheco had submitted a PR to document tracing at https://github.com/nodejs/node-v0.x-archive/pull/9206/files. That never landed, but after taking another brief look now, it looks that it would require at most minor changes to port that to the current documentation.

mcollina · 2018-02-28T23:14:26Z

@misterdjules that would be fantastic to add. Possibly it best sits in https://github.com/nodejs/nodejs.org/tree/master/locale/en/docs/guides (https://nodejs.org/en/docs/guides/) considering the current structure of the project and documentation. I think we should open a separate discussion on this. Might be better to have in in nodejs/node as it can change between versions. I think this is a feature that we should document.

misterdjules · 2018-02-28T23:19:10Z

@mcollina I'll see if I can move that forward.

bnoordhuis · 2018-02-28T23:56:27Z

Given that you've put "hard" against everything that isn't Chromium Trace Events, it feels fairly politically motivated as well.

Trace events integrate seamlessly with DevTools. That's the default debugging experience these days so I think it's a fair comparison, nothing political about it.

jkrems · 2018-03-01T00:04:17Z

I think at the summit we had similar disagreements about "easy/hard" and it was brought up that "complex/simple to use" might be more to the point (how many moving pieces are there to setting it up and evaluating the data).

Worth noting that this is always a reflection of the current state. A tool that requires a lot of setup today might become one-click tomorrow. But with that restriction it seems relatively objective (and useful).

mike-kaufman · 2018-03-01T15:30:49Z

RE motivation here, in addition to reducing maintenance, we want a consistent set of trace data across supported OS platforms. Bonus points if this can be done in a way that is idiomatic with the tools/practices of each platform.

github-actions · 2020-07-18T00:37:06Z

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

mmarchini · 2020-07-21T01:04:59Z

Keeping the stale label for the same reason as #10 (this likely falls under the new tracing strategic initiative)

mike-kaufman mentioned this issue Feb 20, 2018

make trace-events write to configurable sink #152

Closed

AndreasMadsen mentioned this issue Feb 22, 2018

src: fix build with --with-lttng nodejs/node#18945

Closed

GlenTiki mentioned this issue Feb 24, 2018

Deprecating lttng tracepoints. (and add alternative for linux if wanted) nodejs/node#18971

Closed

mike-kaufman mentioned this issue Feb 28, 2018

Performance profiles for typical Node.js applications #161

Closed

github-actions bot added the stale label Jul 18, 2020

github-actions bot closed this as completed Aug 5, 2020

kvakil mentioned this issue Jul 14, 2022

eBPF event loop blockage finder #569

Closed

Supporting multiple trace providers: Trace Events, DTrace, Lttng, ETW #163

Supporting multiple trace providers: Trace Events, DTrace, Lttng, ETW #163

Comments

jasnell commented Feb 15, 2018

jasnell commented Feb 15, 2018

jasnell commented Feb 15, 2018

jclulow commented Feb 15, 2018 • edited Loading

jasnell commented Feb 15, 2018

bnoordhuis commented Feb 15, 2018

danielkhan commented Feb 19, 2018

jasnell commented Feb 19, 2018

bnoordhuis commented Feb 19, 2018

AndreasMadsen commented Feb 20, 2018 • edited Loading

AndreasMadsen commented Feb 22, 2018

bnoordhuis commented Feb 23, 2018

simark commented Feb 23, 2018

ofrobots commented Feb 24, 2018

simark commented Feb 26, 2018 • edited Loading

simark commented Feb 26, 2018

mike-kaufman commented Feb 26, 2018 • edited by mcollina Loading

mcollina commented Feb 28, 2018

mike-kaufman commented Feb 28, 2018

mcollina commented Feb 28, 2018

jclulow commented Feb 28, 2018

mike-kaufman commented Feb 28, 2018 • edited Loading

jclulow commented Feb 28, 2018

mike-kaufman commented Feb 28, 2018 • edited Loading

mcollina commented Feb 28, 2018

jclulow commented Feb 28, 2018

mike-kaufman commented Feb 28, 2018 • edited Loading

mike-kaufman commented Feb 28, 2018

mcollina commented Feb 28, 2018

jasnell commented Feb 28, 2018

geek commented Feb 28, 2018

misterdjules commented Feb 28, 2018

jasnell commented Feb 28, 2018

jasnell commented Feb 28, 2018

misterdjules commented Feb 28, 2018

mcollina commented Feb 28, 2018 • edited Loading

misterdjules commented Feb 28, 2018

bnoordhuis commented Feb 28, 2018

jkrems commented Mar 1, 2018

mike-kaufman commented Mar 1, 2018

github-actions bot commented Jul 18, 2020

mmarchini commented Jul 21, 2020

jclulow commented Feb 15, 2018 •

edited

Loading

AndreasMadsen commented Feb 20, 2018 •

edited

Loading

simark commented Feb 26, 2018 •

edited

Loading

mike-kaufman commented Feb 26, 2018 •

edited by mcollina

Loading

mike-kaufman commented Feb 28, 2018 •

edited

Loading

mike-kaufman commented Feb 28, 2018 •

edited

Loading

mike-kaufman commented Feb 28, 2018 •

edited

Loading

mcollina commented Feb 28, 2018 •

edited

Loading