Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add diagnostic channel for server creation #4278

Closed
wants to merge 1 commit into from

Conversation

AdriVanHoudt
Copy link
Contributor

@AdriVanHoudt AdriVanHoudt commented Aug 26, 2021

This is very basic implementation to add diagnostic channels to hapi.
This came out of Twitter thread https://twitter.com/AdriVanHoudt_/status/1430068665102241795.

The pros for this would be that it would be way easier and handier for apm vendors to instrument hapi, which in turn makes the experience for hapi users way better.
The current solution is to monkey patch all over the place.
See https://github.com/DataDog/dd-trace-js/tree/master/packages/datadog-plugin-hapi/src and https://github.com/googleapis/cloud-trace-nodejs/blob/main/src/plugins/plugin-hapi.ts for examples.

The downside is that this is not a thing on node 12 and the node api is experimental. https://nodejs.org/docs/latest-v16.x/api/diagnostics_channel.html

Feel free to give feedback on usefulness, naming and code structure.

cc @Qard for feedback on whether this is what they had in mind and if they have suggestions on other events/channels that would help their apm use case.

Slight side note that I might not respond to this soon due to vacation but I will pick this back up as soon as I'm back ^^

@AdriVanHoudt
Copy link
Contributor Author

CI seems to fail on missing coverage, I can make a test on node 14 that removes all channels that makes sures everything keeps working but not sure how useful/realistic that is 🤔

@Qard
Copy link

Qard commented Aug 27, 2021

There's also a polyfill you could use, which would be an easy way to reclaim that coverage. https://github.com/simon-id/diagnostics_channel-polyfill

Copy link

@Qard Qard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@AdriVanHoudt
Copy link
Contributor Author

That is definitely an option although hapi generally doesn't use external modules, so this would be up to the maintainers if this would be worth it.

Copy link
Contributor

@kanongil kanongil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the experimental nature of diagnostic channels, and the lack of v12 support, this is probably better exposed as a more generic hook that implementors or maybe plugins can access.

@@ -85,6 +86,10 @@ internals.Server = class {
}

core.registerServer(this);

if (Channels.serverChannel && Channels.serverChannel.hasSubscribers) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasSubscribers should only be checked when it is expensive to prepare the object that is published.

@Qard
Copy link

Qard commented Aug 27, 2021

There's a polyfill for v12, and it's not going to be experimental for long. I just need some people to start using it before I can request it gets marked stable.

Also, it's not experimental in the sense that the API is likely to change. The API was decided over years of effort in userland before I moved it into core, so I see the experimental status as more a formality than anything.

The change just to have the init is very minimal, and behind a try/catch, so I don't feel like there's a strong argument against just having it. 🤷‍♂️

@kanongil
Copy link
Contributor

One way to support this, without adding a new dependency or relying on experimental APIs, is to use the current API, but access it through an object provided by the user.

Ie. it could be enabled something like this:

const DiagnosticsChannel = require('diagnostics_channel');
const Hapi = require('@hapi/hapi');

const server = Hapi.server({
    diagnostics: DiagnosticsChannel
});

Then hapi could have a simple default null implementation, so the API always works and would not need a conditional check before calling (which could cause code coverage issues). Eg:

const NullChannel = class {

    hasSubscribers = false;

    publish(message) {}

    static channel(name) {

        return new NullChannel();
    }

    static hasSubscribers(name) {

        return false;
    }
};

@Qard
Copy link

Qard commented Aug 28, 2021

That kind of defeats the purpose which is to enable automatic instrumentation. If we could get the user to manually supply our channel to hapi we could just as easily get them to pass their hapi server into our API. That's not what users want or are willing to do. They want to be able to attach instrumentation with zero code changes, which requires modules we are interested in to publish diagnostics_channel events that we can listen to in our products, otherwise we need to continue with the not great practice of monkey-patching everything.

@devinivy
Copy link
Member

devinivy commented Aug 29, 2021

That kind of defeats the purpose which is to enable automatic instrumentation.... If we could get the user to manually supply our channel to hapi we could just as easily get them to pass their hapi server into our API.

This makes it sounds like user opt-in is at odds with instrumenters having to monkey patch, which I do not really see/understand. Even if users opt-in, hapi would still provide diagnostics over the proper channel for instrumenters, which will allow us to eventually get rid of the need for monkey-patching. I see these as independent questions, but please let me know if that is misguided! I also see all sorts of benefits to using diagnostics_channel outside of the automaticness of implicit opt-in.

I see a couple different things going on here, and I wonder if it might help to talk about them separately. Sorry this got sort of long!

  • Do we want to support diagnostics in hapi, and contribute to diagnostics_channel becoming stable?
  • Do we want to rely on an experimental API?
  • Do we want users to opt-in to hapi publishing diagnostics about their application, or should this be the default?
  • If we start by publishing hapi:server, do we intend to continue publishing additional diagnostics in future work?

Here are my personal views on it:

Do we want to support diagnostics in hapi, and contribute to diagnostics_channel becoming stable?

Yes! I think that it would be nice to participate in the process leading to diagnostics_channel becoming stable, by showing that we can offer support for it. Allowing hapi users to take advantage of a common means to consume diagnostic data from the framework could drive all sorts of useful tooling, for both commercial APMs and open source developer tools. I see it as being good for our ecosystem, and for the broader node ecosystem, to support this.

Do we want to rely on an experimental API?

In my view, if we can mitigate relying on an experimental API that would make me most comfortable. We talk about one of hapi's priorities being stability, and as part of that I definitely count mitigating risk of potential breakage or breaking changes that could come from use of experimental node modules. As Stephen mentioned, even though the API is likely not to change, it still needs to show adoption before being marked as stable, and this process is not complete yet so we don't know for sure how it will land. It doesn't seem difficult to mitigate this while also getting all the most important benefits for hapi users by supporting the adoption of this emerging standard around diagnostics_channel.

Do we want users to opt-in to hapi publishing diagnostics about their application, or should this be the default?

It seems friendly to me to allow users to opt-in to publishing diagnostic information about their application, especially as we dip our toes into support for this new and experimental functionality. I don't see a major benefit for users to implicitly publish diagnostics to their process, when the tradeoff we're talking about is having the user add a single line of configuration to opt-in. I wasn't able to find much conversation about it on the web, but I am interested if folks see automatic use of diagnostics_channel as a sanctioned way for modules to siphon information about node applications. I wonder if @nlf or @vdeturckheim have some perspective on any security implications. My impression is that it makes nefarious activity that is already possible easier to implement and harder to identify as being nefarious, or even falling into grey area (e.g. if an APM gathers more information than it purports).

If we start by publishing hapi:server, do we intend to continue publishing additional diagnostics in future work?

Yes, and I think that this is critical to the purpose of diagnostics_channel. We want the framework to offer-up diagnostics rather than have libraries working to dig into hapi in unsupported ways, e.g. using monkey-patching. I would be comfortable starting by just pubishing hapi:server, then collaborating with users, APMs, and other frameworks to see what kind of diagnostics we should be providing.

That's where I'm at! I really like Gil's recommendation, as I think it alleviates my concerns while also allowing us to support the adoption of diagnostics_channel and helping it become stable. As diagnostics_channel becomes stable and we better understand hapi users' perspectives, as well as the security/privacy implications of automatic publishing of diagnostics, then we certainly could revisit. It feels like a good compromise of several concerns to me. If hapi users show-up here and say they don't want to add a line to their server config to opt-in, that may also be instructive.

@Qard
Copy link

Qard commented Aug 30, 2021

So the main point of diagnostics_channel is that it provides a decoupled communication system between things that could report diagnostics information and things that would be interested in consuming that information. There's no requirement of passing or otherwise sharing objects of any sort. It functions simply by two places acquiring a channel of the same name. One place acquires a named channel and publishes, and another acquires the same named channel and subscribes. The two ends of the communication don't need to care if there's anything on the other end.

It also is specifically optimized such that publishing without any listeners is zero-cost. The publish method of the channel is literally a no-op until a subscriber starts listening, so it's perfectly fine to publish liberally and not worry about performance degrading.

I designed diagnostics_channel to be very intentionally structurally different from a plain event emitter to eliminate all cost of publishing. Basically, making each channel a unique object moves all the lookup costs out of the repetitive publish time and into the one-time channel creation, leaving only a scope lookup, which is generally very low-cost as the JIT is usually smart enough to keep the reference on-stack when it sees a function is going to use it.

Anyway, I'd recommend using the polyfill. It's basically a copy/paste of what is in core, so it will work basically identically. Apart from that, I'm confident it should be easy to support and should be quite stable. Like I said, the design came out of years of prior experimentation, so it's unlikely to change much, if at all. But also, the surface area currently is quite minimal so easy to code defensively to avoid future issues if you want to be extra paranoid about it. 😅

Copy link

@vdeturckheim vdeturckheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
As @devinivy there are already hacks to get to the same results and I'd say it's cleaner this way. I don't see any major security implication imho.
Also, as mentioned, this could be a hook at package level and emit on DC when available?
DC will really really make things better as APM/RASP vendors will stop doing scary things with libraries' internals ^^

@YoannMa
Copy link

YoannMa commented Aug 30, 2021

Just giving my 2 cents here,

I'm in favor for this feature but not a fan of using a polyfill, especially for Node 12 as there is a discussion for dropping it in #4279

@kanongil
Copy link
Contributor

The diagnostics_channel design does seem to make sense. I'm just trying to provide ways to incorporate it without compromising the promises that hapi have made.

@Qard
Copy link

Qard commented Aug 30, 2021

If you plan on dropping 12.x then you don't need the polyfill. I just suggested that for dealing with long-term stability concerns, but if you're just going to drop the version that diagnostics_channel can't get backported to then you have nothing to worry about there. 🤷‍♂️

@nlf
Copy link
Member

nlf commented Aug 30, 2021

how about instead of a polyfill we use feature detection, and wrap diagnostics_channel in a shim that's always no-ops in unsupported versions of node?

that removes the necessity of a polyfill, but does more for the typical out of the box user than only providing hooks for a user to implement their own diagnostics. maybe that's a reasonable compromise?

@Qard
Copy link

Qard commented Aug 31, 2021

If you plan on wrapping, or especially rebuilding the channel object, be sure to understand the implications of that. It internally does a prototype swap when a subscriber is attached to switch from the default zero-cost publish method to one that will dispatch to the subscribers. This means if you do anything weird in your custom object it may behave strangely. Due to the simplicity of the interface, I'd recommend just making a channel factory that returns the channel as-is or a completely separate object. Also, the hasSubscribers getter would be helpful here--you can just construct an object with the Boolean field on it. As I recall, this is what undici is doing.

@Nargonath
Copy link
Member

I'm definitely in favor of having DC baked into hapi but I also agree on some of the concerns expressed in this issue.

Regarding the experimental API, even though @Qard is pretty confident that the API is very unlikely to change, as long as it's flagged as such it's still possible that it could occur without a Node major version change. If ever that happens it would put hapi in a peculiar situation where errors could arise in userland just because they updated their Node version within their project. I don't think that's a situation we want to put hapi in.

I like Gil's suggestion also since I do not agree on the fact that you don't need to make changes to your codebase when using APM or other tools alike. IIRC you usually need to install a package in your project and require() it somewhere once at least. If the user needs to add 2 more lines of code in only one file which would probably be the same, it doesn't seem like a big deal. Providers would have to update their documentation, which can be a pain point but I'm just guessing. That way we'll still be able to bring adoption for it to be marked as stable. Once that happens we can revisit and make it more transparent for the user.

Giving the choice to the user to either pass the DC object - in the first version - or disable a flag once we baked DC in hapi's core would enable use cases where you could disable the reporting in your staging/review env for example while keeping it enabled for production without much of added boilerplate.

@Qard
Copy link

Qard commented Sep 4, 2021

Many APMs can be loaded automatically, without code changes. At the very least, basically all of them can be used with node -r flag to preload before the app and use environment variables to configure so ops people can add it without needing to do anything to app code.

But even putting that aside, the vast majority of users will actually complain about needing any sort of manual intervention for instrumentation. I've been in the industry for a decade now and worked with several different APM vendors and all have lost users to expectations that things just work automatically.

A manual API doesn't help us at all because, as I said before, we can just as easily make an API to have the user pass the servers into our API to interact with it that way. It's more stable than monkey-patching that way, but then users refuse to use it because they expect it to be automatic.

@Nargonath
Copy link
Member

Thanks for the explanation @Qard. 👍 I wasn't aware of that way to use such tools.

I'm not against a fully transparent/automatic usage of DC but I'm not sure we're ready to set it in the project considering its experimental flagging and hapi's core values.

@kanongil
Copy link
Contributor

kanongil commented Sep 6, 2021

FYI, I'm dead set against using an experimental API in core logic that doesn't enable anything new for many users.

Also I'm not convinced that this will end up being a hands-off APM integration. Eg. we might not publish the full set of channels that the APM vendors require. How would we ensure this?

Will the channels be public API? If so we need to document it and avoid breaking changes. This is currently internal stuff, and could limit / complicate any changes that we make. If not public API, then how can APM vendors rely on it?

Regarding the DC API itself, I find it annoying that namespacing and versioning (to avoid conflicts), are left to users to decide. If DC support is added to wreck, it is quite likely that some user will eventually end up will multiple major versions in a project. How is an APM supposed to know which version published on a channel? It can only infer from the passed object.

@Qard
Copy link

Qard commented Sep 6, 2021

For fastify we only needed an "app created" event and just used the plugin API to create the rest. The same can probably be done for hapi too.

As for the note about multiple version: that shouldn't actually matter because APMs don't generally care about module versions, we're just looking for high-level concepts like a request started or ended, a route changed, a database library made a query, etc.

And the channel names thing I don't think is an issue. Most people would never have a reason to use diagnostics_channel directly. It's meant pretty much just for framework authors and APMs, of which there is a limited number of either. And it's a way for us to make a pull request saying we want X while making it in a generic way that can be consumed by everyone and not just some vendor-specific thing that is likely to get rejected.

Also, as APMs are the consumer and we're already used to paranoia-level defensive coding due to the high-risk nature of monkey-patching everything, I don't foresee much issue differences between version or inconsistency of how concepts are expressed across modules. We'll just tailor each subscriber to the specific case. 🤷‍♂️

@devinivy
Copy link
Member

That's all great context and info, thanks! I think pretty much all the information is out on the table now and we've heard from various interested parties.

My interpretation of where we're at is that:

  • there's agreement DC is a great idea, and we want to support the effort of seeing it become a stable API.
  • there is a desire from APM vendors to ensure zero code changes are required by their users.
  • there is concern over relying on an experimental API, which is counter to principles of the framework.
  • there's some awkwardness that node's support for DC and hapi's support for node aren't quite in sync.
  • there's a question related to both items above around how best to ensure 100% code coverage.
  • it's sufficient to publish a hapi:server event, especially for now. But to mitigate monkey patching, going forward we'd like to remain open to instrumenting more of the framework.

If we implement Gil's suggested API in hapi v20, we can speak to all items except for the desire of APM vendors to ensure zero code changes for their users. To account for that and ensure this is not a long-term issue for APMs, in hapi v21 when we drop node v12 support we will revisit the status of the DC API and if it has become stable (as it sounds it is likely to be) then we could have diagnostics automatically enabled.

@AdriVanHoudt I know your work ended-up at the center of quite a conversation 😄 Thanks for bringing this all up, it's appreciated. Are you open to continuing this work?

@kanongil
Copy link
Contributor

This does not look like it is happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants