Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for "zero-touch telemetry" requirements #5

Merged
merged 3 commits into from
Jul 13, 2019
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 58 additions & 0 deletions 0002-telemetry-without-manual-instrumentation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# (Open) Telemetry Without Manual Instrumentation

_Cross-language requirements for automated approaches to extracting portable telemetry data with zero source code modification._

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add injection of tracers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AloisReitbauer can you be more specific about the requirement here? I can interpret "injection of tracers" in various ways and don't want to misunderstand/misrepresent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below on specifying the trace at program startup and OpenTel being able to use this parameter to set - and load - the trace at runtime.


## Motivation

The purpose of OpenTelemetry is to make robust, portable telemetry a built-in feature of cloud-native software. For some software and some situations, that instrumentation can literally be part of the source code. In other situations, it’s not so simple: for example, we can’t necessarily edit or even recompile some of our software, the OpenTelemetry instrumentation only exists as a plugin, or instrumentation just never rises to the top of the priority list for a service-owner. Furthermore, there is occasionally a desire to disable instrumentation for a single plugin or module at runtime, again without requiring developers to make changes to source code.

One way to navigate situations like these is with a software layer that adds OpenTelemetry instrumentation to a service without modifying the source code for that service. (In the conventional APM world, these software layers are often called “agents”, though that term is overloaded and ambiguous so we try avoid it in this document.)

### Why “cross-language”?

Many people have correctly observed that “agent” design is highly language-dependent. This is certainly true, but there are still higher-level “product” objectives for OpenTelemetry that can guide the design choices we make across languages and help users form a consistent impression of what OpenTelemetry provides (and what it does not).

### Suggested reading

* This GitHub issue: [Propose an "Auto-Instrumentation SIG"](https://github.com/open-telemetry/community/pull/87)
* [Rough notes from the June 11, 2019 meeting](https://docs.google.com/document/d/1ix0WtzB5j-DRj1VQQxraoqeUuvgvfhA6Sd8mF5WLNeY/edit) following this ^^ issue
* The [rough draft for this RFC](https://docs.google.com/document/d/1sovSQIGdxXtsauxUNp4qUMEIJZzObdukzPT52eyPCHM/edit#), including the comments

## Proposed guidelines

### Requirements

Without further ado, here are a set of requirements for “official” OpenTelemetry efforts to accomplish zero-source-code-modification instrumentation (i.e., “OpenTelemetry agents”) in any given language:
* No _manual_ source code modifications allowed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be relaxed a bit. In some languages it may not be possible to attach instrumentation without some minimal O(1) initialization code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll loosen it a bit. For languages (e.g., Java) where there is a lot of prior art for truly zero-touch instrumentation, I think that should still be our goal, though.

... There are certainly a lot of legacy JARs out there that can only be instrumented from the outside in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(PTAL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge this, but will happily debate this point further in subsequent PRs to the RFC itself.

* Licensing must be permissive (e.g., ASL / BSD)
* Packaging must allow vendors to “wrap” or repackage the portable (OpenTelemetry) library into a single asset that’s delivered to customers
* That is, vendors do not want to require users to comprehend both an OpenTelemetry package and a vendor-specific package
* Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the far biggest issue right for automatic instrumentation. Unfortunately, it is a bit more complicated than just managing spans. The biggest problem we have is double instrumentation i.e. the same action is instrumented with blackbox and whitebox instrumentation - as you call it.

Detection and disabling is one thing, but maybe one instrumentation augments the information of another. How do you want to handle these situations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not planning to make augmentation a requirement for our first take on this problem. I'll add it as a nice-to-have, though, as it certainly would be... well, nice to have.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me. I think we need to get started on this. As we see more real-world use cases of double instrumentation we will also better understand how to solve the issue. Currently, there is quite some research going on at Dynatrace to figure out how to make this work best.

Components in OpenTelemetry might already solve a big part of the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that all makes perfect sense and I agree wholeheartedly with

Components in OpenTelemetry might already solve a big part of the problem.

"Components" (or "plugins" or whatever we call them) are a crucial abstraction, especially when mixing and matching various forms of instrumentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding interop, it'll be helpful to clarify the versioning/compatibility story open-telemetry/opentelemetry-specification#115.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyang no argument on this at all... open-telemetry/opentelemetry-specification#115 is unresolved at the moment, so difficult to incorporate here, but certainly the versioning story (and "component"-ization story as @AloisReitbauer has brought up here) is/are a big part of the challenge for interoperable whitebox+blackbox instrumentation.

* If the blackbox instrumentation starts a Span, whitebox instrumentation must be able to discover it as the active Span (and vice versa)
Copy link
Member

@reyang reyang Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the whitebox instrumentation has multiple tracers, how to define the "active span"?

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/tracing-api.md#getcurrentspan

Probably we need a way for the blackbox instrumentation to discover the tracers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyang I don't know what you mean by "multiple traces" – do you mean "multiple tracers"? If so, I was of the understanding (feel free to correct me 😉) that the "multiple tracer" scenario is more of a "tracer multiplexer" where there would still be only a single active span.

Probably we need a way for the blackbox instrumentation to discover the tracers?

Almost certainly, yes, though IMO that's "an implementation detail," albeit an important one. I.e., I wouldn't call it a "requirement", just something that we'll need to do in order to satisfy the requirements (if that distinction makes sense).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks. (and you're right, I've made a typo, it is "multiple tracers", I've updated the comment.

* Relatedly, there also must be a way to discover and avoid potential conflicts/overlap/redundancy between explicit whitebox instrumentation and blackbox instrumentation of the same libraries/packages
* That is, if a developer has already added the “official” OpenTelemetry plugin for, say, gRPC, then when the blackbox instrumentation effort adds gRPC support, it should *not* “double-instrument” it and create a mess of extra spans/etc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the developer took whitebox approach and configured a specific exporter/propagator? Should the blackbox instrumentation modify/tweak it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what is most likely going to happen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is naive of me, but my hope (??) was that both the blackbox and whitebox OpenTelemetry code would be portable and non-vendor-specific. I.e., that the choice of exporter/propagator should not be relevant here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally - yes. Our initial research of double instrumentation and potential issues indicates that this is anyways a problem that needs to be solved in backend analytics.


* The code in the OpenTelemetry package must not take a hard dependency on any particular vendor/vendors (that sort of functionality should work via a plugin or registry mechanism)
* Further, the code in the OpenTelemetry package must be isolated to avoid possible conflicts with the host application (e.g., shading in Java, etc)


### Nice-to-have properties

* Run-time integration (vs compile-time integration)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For us this is a must have

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this here for two reasons:

  1. Some other folks on the original GitHub issue (or maybe it was the google doc?) were suggesting we drop the requirement to make v1 easier to build, and
  2. More fundamentally, in languages like Go it's unclear how this would work without forking the compiler. So it felt odd listing this as a requirement

Certainly runtime integration is preferable, all other things being equal.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are discussing this with the Go team right now. Happy to bring them in.

I also believe it, there would be an easy way for dynamic tracer loading using an environment variable for the tracer to load and the OpenTel library to load this tracer. We need to define how situations are handled when a tracer is compiled in.

Generally, I would propose to only compile the code with a noop tracer and provider the actual tracer to use as a parameter.

Going a bit deep into the weeds here, but I want to clarify my point and what I would like this group to look at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helpful, thanks for clarifying. Speaking only for myself, I feel uneasy mandating that Tracers are loaded via environment variables only, though I certainly acknowledge the benefits of structuring things that way in certain scenarios.

Re this point:

We are discussing this with the Go team right now. Happy to bring them in.

I think that might be premature, at least for this high-level multi-language RFC. For Go specifically, it would be fantastic to get them involved (and I imagine the various Googlers involved with the project can help with that, too).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not mean only, but there should be a well-defined way to load via environment variables.

* Automated and modular testing of individual library/package plugins
* Note that this also makes it easy to test against multiple different versions of any given library
* A fully pluggable architecture, where plugins can be registered at runtime without requiring changes to the central repo at github.com/open-telemetry
* E.g., for ops teams that want to write a plugin for a proprietary piece of legacy software they are unable to recompile
* Augemntation of whitebox instrumentation by blackbox instrumentation (or, perhaps, vice versa). That is, not only can the trace context be shared by these different flavors of instrumentation, but even things like in-flight Span objects can be shared and co-modified (e.g., to use runtime interposition to grab local variables and attach them to a manually-instrumented span).


## Trade-offs and mitigations

Approaching a problem this language-specific at the cross-language altitude is intrinsically challenging since "different languages are different" – e.g., in Go there is no way to perform the kind of runtime interpositioning that's possible in Python, Ruby, or even Java.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely true. Dynatrace has runtime instrumentation for Go :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a dig on Dynatrace, but I'm sure it doesn't work as well or as flexibly as it does in Java... so I would stand by the "that's possible in" bit here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No offense was taken :-). I believe we agree on the problem. We - meaning all tracing tool providers - need to invest a lot of magic to make this work. We often use undocumented features or even (security) loopholes of platforms to make this work.


There is also a school of thought that we should only be focusing on the bits and bytes that actually escape the running process and ignore how that's actually accomplished. This has a certain elegance to it, but it also runs afoul of the need to have manual instrumentation interoperate with the zero-touch instrumentation, especially when it comes to the (shared) distributed context itself.

## Prior art and alternatives

There are many proprietary APM language agents – no need to list them all here. The Datadog APM "language agents" are notable in that they were conceived and written post-OpenTracing and thus have been built to interoperate with same. There are a number of mature JVM language agents that are pure OSS (e.g., [Glowroot](https://glowroot.org/)).