-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diagnostics Channel - Next steps #180
Comments
Overhead and performance should be a first level concern about this API. It should result in no-ops if it is not enabled, and get smart if it is enabled. I think we should trade usability for performance whenever possible, as long as it still satisfy our use case. |
@mcollina |
@jkrems 👍 ! |
Added to the top post |
I brought it up in the diag meeting just now, but I worked a bit on a proof-of-concept universal diagnostics channel type thing recently, after I got back from the Ottawa summit and then Elastic{ON}. It's a very rough demo and definitely needs more perf focus before it could be actually usable, but you can have a look at it here: https://github.com/Qard/universal-monitoring-bus It uses subscription queries to express interest in certain messages in the channel, similar to trace_events, but in a currently more flexible way (sacrificing performance for flexibility while trying to identify interesting filter data). It has no direct graph-forming, instead opting to direct the message bus into aggregators, which can live in-process or out-of-process, and makes no assumptions about transport mechanism. It is simply a callback that receives JSON-serialization-ready objects that you can do whatever you like with. It also expresses both spans and metrics updates uniformly in a single message format, with purpose-specific extension, allowing for an easy shared communication channel or aggregator re-use and even correlation between the spans and metrics data. |
I'm picking this back up now and working on some ideas of how to reduce/eliminate the runtime overhead. My main concern is that the prototype is basically just an event emitter so it has to lookup the subscribers on every publish. It does have a I also don't think the module patches and context management aspects should be included. For core adoption, we should stick strictly to the data firehose component and strip that down to the bare essentials for something speedy and highly extendable in userland. |
Here's my prototyping progress this week: https://gist.github.com/Qard/b21bb8cb1be03274f4a8c585c659adf2 I tried to follow the original design, for the most part, but added regex subscribers and the concept of channel and subscriber objects to take the lookup cost out-of-band from the publish so it doesn't have to do the lookup every time. The result is publish overhead on-par with a single variable increment, which is pretty good! What do people think? Is the simple text name a flexible enough way to search for events in the firehose? Another possibility is object matching, which I did in this prototype several years back. I think it's a bit friendlier and more flexible, but I couldn't figure out a good way to pull the object-style search out-of-band from the messaging like I did for the prototype above. 🤔 |
Some questions from today's meeting:
Overall I think it's a great start for the API and it would be good to get input from potential consumers (APMs) and publishers (frameworks) before attempting to land it in core. |
There's a lot to consider in how the ecosystem would want to engage with such an API to be able to deliver something that is satisfying, especially given our increasing focus on long-term vision of the project. We need to make sure this is not just relevant now, but will remain relevant and useful for the foreseeable future. I have a few months to hack on this so there's plenty of room for iteration. We'll see how successful I am in getting this to a place where people are happy with it and it can deliver on the promise of a high-performance unified data firehose. 😅 |
Sorry for not participating the meetings but the meeting times we have are in general hard to catch for me. Regarding (7): I think it's vital that this is sync to allow subscribers to correlate with their current state/context. Just getting the info that e.g. a DB query XXX is sent is not that helpful on its own. APMs correlate events to show e.g. that query X happened during processing request Y. As X and Y in general originate from different modules the events itself can't hold this information. I'm not sure if we need the regex subscribers at all. Published data is only of use if subscriber know the structure of the content for each event. Therefore I think subscribers are quite specific and using them for a possibly unbounded amount of publishers sounds not that reasonable. Maybe for something like a log/batching subscriber which just pushes the data as is or stringified further to the next consumer. |
Let me share my opinions too:
And I have a question too:
|
@Flarna good point on the regex subscriber. Can we turn the Diagnostics Channel into an EventEmitter and emit and event every time a publisher or subscriber is added or removed? This way we can replace the regex with a more flexible approach, where the user fetches all publishers, filters out to get what they want, and listens to the added/removed events so they can update the list of publishers they're subscribing to. This could even be extended (possibly in userland) as a RegExSubscriber, which keeps a cache of publishers that match a regex using the list + events. |
Yes, that's basically what I have locally, though it's not done yet. Replacing the regex-specific subscribers concept with a generalized object that has a function like As for the need for sync: yes, I mostly agree with that assessment. There's performance implications to that though, moving the subscriber execution into the direct execution path. It also might be impossible to do sync if we consider possibilities like subscribing from a worker thread, so context management may need to get baked in, if we go that route. This is all very early in the experimentation process--I've only been working on this a week and a bit, so still just experimenting with ideas and getting for a feel for what the possibilities are in what the ultimate implementation could look like. The code as it is now is intended only as a proof-of-concept of API ergonomics and as a way to test approximate performance impact of injecting such a thing into core code paths. The code is likely to evolve considerably before actually considering it for core. |
I'm curious why async is seen as more performant then sync in this case. |
I'm a bit uncertain also, but I've got comments suggesting moving APM agents into worker threads to take their overhead out of the main application, which seems possibly reasonable in theory anyway. I'm personally a bit skeptical that would be better though. 🤔 |
Collecting the data must be in main thread. What can be moved is serialization/compression/correlation of the collected data and transmission to the APM server. But this is something APM tools have to work on and not the diagnostics channel. |
Well, ideally if diagnostics channel does what it's intended to do, the need for APM code to run on the main thread should mostly disappear. Context management is the one challenge there though which leads me to think that might not be entirely feasible. If at all possible, I'd love to be able to fully isolate all APM code to a separate thread, but I'm not convinced that's doable without some fairly major changes to Node.js. |
ok so the main thread publishes data; either via some sensors placed by the APM tool or by modules having this built in. This data is collected in a worker, processed and transmitted. |
fwiw |
Yeah, it's my intuition that it's better to just keep it in the main thread and encourage APMs themselves to dispatch that to their own threaded thing. Timestamps are an easy fix with a wrapper object, but context management gets harder, especially with multi-tenancy. |
Pull request opened! 🎉 nodejs/node#34895 |
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
Removing stale as @quad is working on it. |
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
The concept of a diagnostic channel was discussed in the Diagnostics summit as well as here: #134
In a nutshell the concept is to provide an API that modules can use to publish events, and an API that consumers (such as APM vendors) can consume. The key differences between the channel provided by these APIs and trace events are:
This approach requires both implementation of the diagnostic channel as well as updates to modules in order to use the API. An interim step where monkey patching is used to add calls to the API is seen as a good way to bootstrap.
Microsoft already has an experimental implementation which we should evaluate to see what parts (if any) that we can re-use.
An initial cut at next steps include:
The text was updated successfully, but these errors were encountered: