Issues with current design #19

cstockton · 2019-06-23T17:31:21Z

First will there be an official process to make appeals against the projects design decisions? I am worried that collectively open-telemetry is implementing the former projects designs under what I believe to be a false premise it can "always change things later". I feel such changes will be impossible once the library is in wide spread use.

I have a number of concerns with the actual implementation here but it will be a large effort for me to enumerate them. For now I would like to start with some higher level issues that would resolve many of my code related concerns.

Exporters should not exist in this project (fundamental issue with OpenTelemetry as a whole)

If I could choose a single issue to focus on this would be it, as I feel it creates most of the design issues I have to begin with. This is an issue with the greater OpenTelemetry project as a hole. As far as "openness" OpenTelemetry is a step backwards from OpenTracing. It's placing OpenTelemetry as a gatekeeper of implementations, in fact it defines no formal specifications at all which I reported an issue about open-telemetry/opentelemetry-specification#20 and discuss in this gist. Instead of regurgitating these points in detail here I would ask anyone interested to please read this conversation, to summarize:

OpenTracing - Each SDK provides an interface in opentracing- which trace providers may implement in their own repositories to ship traces, for example

Python API defined in https://github.com/opentracing/opentracing-python and each trace provider must implement this:
- https://github.com/lightstep/lightstep-tracer-python
- https://github.com/jaegertracing/jaeger-client-python
- ... and on it goes for anyone who wants to provide tracing
Go API defined in https://github.com/opentracing/opentracing-go and each trace provider must implement this:
- https://github.com/lightstep/lightstep-tracer-go
- https://github.com/jaegertracing/jaeger
- ... and on it goes for anyone who wants to provide tracing

For tracing to be useful you must have participation from many teams across your organization. It's not uncommon for a request to span many languages (java, C#, Go, Python, Nodejs, ...) at a medium sized company. Thus if you want to provide a new tracer and integrate well with an existing ecosystem you must commit the dev cycles to implement all of these languages. There really is no reason to continue on this path with the "next generation" convergence efforts of "OpenTelemetry" that will leave a footprint on the ecosystem for years.

OpenTelemetry - Each SDK provides no interface as a "standard" design principle (not that I've seen anything resembling a standard design principle as I've echo'd here, in gitter, etc). This project seems to provide an ABI only, which is a step backwards from the API of OpenTracing. In addition it seems there is a drive to have the default OpenTelemetry SDK's all talk to a central "OpenTelemetry" agent which will be bundled with exporters further diminishing "openness". Push for an agent after your SDK's reduces the per-language burden we see above with OpenTracing but inserts OpenTelemetry as the gatekeeper of tracing implementations. At the very worst the "maintainers / governing body / process" could be captured by commerical interests to impede competition. At best it slowly grows to be a more arduous and painful process as layers of "exporters" increase bloat, complexity and reduce maintainability until it's no longer worth the efforts.

MyProposal - Stop writing code. Collectively define a structural specification that will slowly produce the initial version for a wire format for carrying trace events. Separate trace backends from the clients at the protocol level. Now you can resume writing code.

Trace vendors implement 1 server for every language

Open telemetry can provide reference implementations of clients that ship to these trace servers, which is what this repository would become. But this leaves room for those who want to implement their own clients because the chosen balance of ergonomics, performance, was not right for them. As it stands now you have a monumental effort beyond writing a client for a well defined wire-format. If the maintainers of this project disagree with my stance on plugins, great! I don't care. I can write my own client. As it stands I won't have that option without also implementing the glue for the tracer which defeats the purpose of investing resources in these convergence efforts. I really would like to urge the project owners to pump the brakes a little and consider the amount of effort that could spared by taking a different approach. Or at least provide some strong technical reasons to march on with this paradigm that expand beyond "this is what we had".

Testing Frameworks

Unit tests are not just for testing results, they also lock in your public API. Using a large unit tests abstraction that accepts interfaces everywhere means that type changes are possible without breaking unit tests. It also imposes additional cognitive load for potential contributors who are unfamiliar with the chosen libraries. In short they only provide a tool to circumvent engineering rigor, relieving authors of both the responsibility to think about struct composition and the benefits of "testing" the said structures to begin with. I strongly believe the projects quality would suffer substantially by using them.

Plugins

I am not sure the design considerations behind the scenes that led to requiring a plugin for tracing via "stdout" but I very strongly object. Plugins should be entirely out of the question as an API requirement.

Plugins may not be unloaded
Plugins provide a potential source for API compatibility mismatches, as plugins:
- can't be locked / versioned by a go.mod file
- can't have large interfaces or impose compile time errors
- insert any argument against ABI as an API here
Plugins can negatively impact the API from evolving as suddenly ABI compatibility is a consideration
Plugins add additional deployment complexity and opportunity for runtime failures
Plugins do not provide nor enforce any code signing by default, adding potential attack vectors

I could continue on, but I feel this is a decent start for a technical debate. The most important argument against plugins is that they may still be provided for proprietary tracing solutions with careful design, you define the Tracer interface and write a Tracer which is backed by a plugin. Knowing this, I think it's easy to prove there is no real reason to circumvent an API to provide an ABI.

Dependencies

First, I feel this project should depend on only the standard library, maybe a couple packages from golang.org/x if well justified. I strongly believe this should be a non-negotiable requirement because it will force a healthier design than what the current tracing libraries of today have. As it stands now this library depends on a massive (257) amount of packages go.sum and this project is just getting started. I understand that part of this is because there is a fundamental problem with the design of OpenTelemetry as a whole providing implementations rather than specifications, but there is still plenty of opportunity to improve here.

The text was updated successfully, but these errors were encountered:

jmacd · 2019-06-24T06:18:59Z

@cstockton I don't know if this helps your concern, but the prototype code that we merged in Go was very incomplete, and I've filed issue 20 about separating the API from the SDK properly, which is a requirement from OpenTracing. I am working on this separation.

In the terminology that's developing, an "exporter" is different than a "plugin". I think I agree with your reasoning. We can have an SDK that has practically no dependencies except standard libraries. An SDK could write to stdout or it could send spans to a server.

The OpenCensus style of library put a lot more logic into the client, which allows various kinds of exporter to be implemented in the process. I'm more interested in a client that quickly encodes data for another process to process and export data, which is why we have to separate the SDK and API here and in the other repos. Still, the goal is to end up like OpenTracing, where there's an API that you can use without knowing what the SDK provides. If you're objecting to the current state of the code, please wait for issue 20 to be resolved, after that and once the trace data format begins to crystalize, we can have an SDK that does very little other than encode bytes and write them somewhere.

tigrannajaryan · 2019-06-24T17:23:17Z

@cstockton I am not an OpenTelemetry Admin or a member of TSC, the following is just my personal opinion.

My understanding is that the primary current goal for OpenTelemetry is merging of OpenCensus and OpenTracing in backwards compatible manner. The motivation is to have a clear support and migration path for existing users of OpenCensus and OpenTracing. This is already a pretty difficult goal, anything on top of it should likely be avoided at this stage.

My opinion is that any improvements should be left for later. Attempting to merge these 2 projects and at the same time come up with significant new improvements and conceptual changes will make the project a lot more difficult and possibly result in a failure. I think that would be wrong and I support gradual change and improvement approach.

I understand your suggestion to design a better wire protocol and can see the value in the particular approach suggested (shipping span events instead of complete spans). I can also see downsides to the suggested approach of shipping events, so it is not so clear-cut (I can expand on this if interested).

I applaud your effort to come up with better solutions than what we already have, however I do not think process-wise what you propose helps with the project goals. I think a better process would be for you and other interested parties to work on your own proposal in parallel to current efforts by the rest of OpenTelemetry community, come up with design documents and sample implementation demonstrating the benefits, present it to OpenTelemetry community and aim to make it part of future OpenTelemetry API (V2).

I am personally of the opinion that existing wire protocols and memory representation of span data is not well-suited to certain high-performance use cases and I plan on following my own advice and building my case for a new high-perf proto that I will present to the community if my experiments are successful.

This brings me to the other related topic on which you commented - API compatibility - so I'll answer it here.

What I wrote above is precisely why I think it is important to have a well-defined API compatibility story and guarantees. I do not think that we will be able to come up with the "best" OpenTelemetry in the first version and that we will need to have a V2. If this is true then in my opinion it is very important that today we send a clear message to all OpenTelemetry adopters (especially to third-party library authors) that when V2 is introduced it will provide zero or almost zero-effort backward compatibility story.

Strong compatibility story de-risks OpenTelemetry adoption for third-party library authors and for end users and at the same time provides a path for radical improvements in the future (so that "can always change things later" is a technical possibility).

iredelmeier · 2019-06-24T18:45:46Z

@cstockton: without getting too the in weeds here - although I will add that I agree with many of your points! - any chance you'd mind filing an RFC (or two!) here?

cstockton · 2019-06-24T20:21:05Z

What I wrote above is precisely why I think it is important to have a well-defined API compatibility story and guarantees. I do not think that we will be able to come up with the "best" OpenTelemetry in the first version and that we will need to have a V2. If this is true then in my opinion it is very important that today we send a clear message to all OpenTelemetry adopters (especially to third-party library authors) that when V2 is introduced it will provide zero or almost zero-effort backward compatibility story.

This is where my fundamental disagreement is. Merging OpenTracing and OpenCensus is going to cause massive churn for the ecosystem already- more so on the OpenTracing end as it seems that project is essentially being deleted in favor of the OpenCensus design. I can not find any benefits to go through these efforts with plans to "change everything to v2 later" - we are forcing churn in 2 communities (OpenTracing OpenCensus -> OpenTelemetry) and again for OpenTelemetry V1 -> V2. Why not do our best to come out with a V1 that has a foundation we can build upon for years? Maybe it really is impossible, but I don't understand why we can't slow down a bit and try. Both ecosystems work today, what is with the sense of urgency here? :/

@iredelmeier - I've started an issue here open-telemetry/opentelemetry-specification#20 would changing this into a RFC in your repo yield a different outcome? Currently my issues have been dismissed under the premise "we can't change anything because of compatibility" while assuring me the same argument won't be valid in the future after N months of user adoption and thousands of lines of code are written. You'll have to forgive me if I'm skeptical about investing further time to voice my concerns to OpenTelemetry.

I'll leave it to the OpenTelemetry team to close my issues if they would like. I'll review code at some point for any security related concerns / major bugs, but refrain from commenting on the design decisions I fervently disagree with to keep progress moving forward for everyone. Thanks.

bhs · 2019-06-24T23:36:32Z

@cstockton thank you for raising this.

This issue/thread seems almost by-definition unresolvable given the number of meaty potential tangents above and the constraints of GitHub issues as a medium (maybe I'm arguing for the "threading" of an RFC PR as @iredelmeier has been advocating for). As such, I apologize in advance for only addressing a few of the points above... for the many aspects I'm not addressing, it's not because I think they're trivial (they are not).

All of that said, without further ado:

The sense of urgency: the current state (of OpenTracing, OpenCensus, and OpenTelemetry concurrently co-developing) is an unstable equilibrium for the larger ecosystem. This won't be resolved until there's a viable migration path from OpenTracing and OpenCensus code to OpenTelemetry. One blocker there could be backwards-incompatibility (which has been raised already in this thread). It's worth mentioning that an inadequate OpenTelemetry API could also be a blocker and should be taken seriously. So, the urgency is about getting to a stable equilibrium where OpenTelemetry is the clear best-choice of the three, as the unstable situation today will cause friction and anxiety; but the actual API (as opposed to the SDK) does require careful thought and scrutiny since poor execution there will be the ultimate counterweight to adoption.

Separation of API and SDK: I think @jmacd addressed this well already, but it's important to reiterate that this has been a high-level goal for the OpenTelemetry project since the beginning... e.g., the Loose coupling of components section of the blog post announcing the intention to merge these projects.

tedsuo · 2019-07-02T04:01:53Z

@cstockton on vacation currently, sorry for the rambling reply. To address your core concerns:

The APIs are separated from SDKs, just like in OpenTracing. I really hope that is clear. I'm concerned that it isn't.
There is a "default" SDK, based on the OpenCensus model, which is available to get people started. OpenTracing lacked a default, and this was a big issue for new users. But the SDK is not a required component. Nor is the default "agent/sidecar" currently being proposed. Nor is any particular wire protocol required (though, again, there is a default). All of these components must be mix and match, or it's in violation of the design of the project. If you see a hard dependency arise between these components, please raise an alarm.
To prove there is complete API/SDK separation, I propose a litmus test: binding an event-based C++ implementation via foreign function calls. This is an excellent alternative implementation for those who require performance, works in almost all languages, and proves that the interfaces are separated from the SDK.
The spec absolutely needs to be cleaned up. I have shared concerns here. We cannot deviate in design to the point that OpenTelemetry instrumentation cannot coexist with OpenTracing/OpenCensus instrumentation, and allow for a progressive migration of instrumentation. But we can make something cleaner than what we had in either projects. I will not allow (to the degree I am able) an OpenTelemetry v1.0 to ship while it is messy. There will be pressure to do so, given the deadlines we have set ourselves. But we will push the deadlines back, rather than ship something which is not finished baking. And we will reimplement the APIS in all the languages if we have to, several times, until we get it right. But I want those deadlines to push towards being really engaged in the spec writing this summer. There's a difference between spending time and letting time pass. Fwiw, the details of the current spec timeline is here. We must defer any additional functionality (such as a logs interface) while we focus on cleaning up tracing, metrics, and context propagation.
There is a standard data format being proposed, along with semantic specifications for events. I agree that is an area where we need to front-load our attention. In fact, that's what I have listed on the timeline here. But obviously... it is not going to be done on july 6th as planned. I interpret this to mean our schedule is slipping, not that we are going to skip this step. Either an additional SIG focused on this, or additional meetings of the spec SIG, is needed in this area.
But, also... there are a number of API concerns which are fairly independent of the data format. Obviously the API must be able to efficiently produce the data. But a good portion of the API design is driven by concerns which are not obvious from the perspective of a wire protocol - the data format alone will not save us. So I believe we still need to be getting our feet wet in the various languages, and getting a feedback cycle going between the language SIGs and the Spec SIG. I would love help making this process less messy!

I really hope this feels like "being heard" and not like everyone is telling you no... you raise good points, and also are showing places where the messaging is clearly not landing - we should not be having some of these convos in github issues; the answers should be clear and obvious. The main criticism that lands (for me, at least) is there are not nearly enough high level design requirements being written down, and the data format needs to be put front and center.

Finally, I'm happy to jump on a vidchat when I come back from vacation, and discuss some of these topics live. I would also suggest coming to the OpenTelemetry Spec and community meetings if you can make it to them, and voicing your concerns there. I plan on bringing up the data-format-first and other issues you raise here in tomorrows meeting, on your behalf if you cannot attend.

cstockton · 2019-07-02T05:56:07Z

@tedsuo Thanks a lot for this response, most of your points deserve taking some time to digest and think about. One thing that is unclear to me is the API/SDK separation, perhaps this is a terminology thing and I've failed to convey one of my biggest concerns or I'm failing to understand your roadmap. I have two choices in the current design:

Import a vendors binary blob into my code as a plugin and have that ship my traces
Use the "builtin" binary blob that ships to the OpenTelemetry agent via an API not suitable for widespread adoption as a neutral wire format which then imports all of the vendor binary blobs.

But I strongly feel that I should be able to import opentelemetry-go and ship traces anywhere using one vendor neutral wire format. Without that the community isn't truly open, the biggest players are just bundling their transports together. To be honest, I would have been happy with any first revision of a wire format. Comma separated streams of JSON- sounds good. Binary stream of gzipped excel spread sheets, sure, why not. At least then the SME's for kafka, rmq, insert cool tech here can begin writing agents, aggregators, queuing mechanisms and more cool tech here to fit all kinds of use cases and begin a strong feedback loop for this newly emerging wire format.

Most importantly our (non-contributors) hands are untied and we have something to build upon.

I respect and accept that my vision may not reflect yours. I also am in no place to continue to contest it (I'm not doing the work here!) - this was just a final attempt to convey it. I'm happy to agree to disagree and let everyone get back to work. Overall I'm thrilled and look forward to the projects progress.

jmacd · 2019-07-16T22:25:26Z

I think we can close this based on the discussion here, since these two issues appear to be very similar:
#23 (comment)

* Add an updated copy of otel-go mock tracer * Add functions for getting attributes from HTTP requests * Add gorilla mux instrumentation * Add example for gorilla mux instrumentation Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>

cstockton mentioned this issue Jun 23, 2019

Define API compatibility story open-telemetry/opentelemetry-specification#115

Closed

bhs mentioned this issue Jun 24, 2019

Migrate RFC repo open-telemetry/community#130

Closed

iredelmeier mentioned this issue Jun 26, 2019

Separate example modules #26

Merged

cstockton mentioned this issue Jul 12, 2019

Why so many interfaces? #23

Closed

jmacd closed this as completed Jul 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with current design #19

Issues with current design #19

cstockton commented Jun 23, 2019

jmacd commented Jun 24, 2019

tigrannajaryan commented Jun 24, 2019 •

edited

Loading

iredelmeier commented Jun 24, 2019

cstockton commented Jun 24, 2019 •

edited

Loading

bhs commented Jun 24, 2019

tedsuo commented Jul 2, 2019 •

edited

Loading

cstockton commented Jul 2, 2019

jmacd commented Jul 16, 2019

Issues with current design #19

Issues with current design #19

Comments

cstockton commented Jun 23, 2019

Exporters should not exist in this project (fundamental issue with OpenTelemetry as a whole)

Testing Frameworks

Plugins

Dependencies

jmacd commented Jun 24, 2019

tigrannajaryan commented Jun 24, 2019 • edited Loading

iredelmeier commented Jun 24, 2019

cstockton commented Jun 24, 2019 • edited Loading

bhs commented Jun 24, 2019

tedsuo commented Jul 2, 2019 • edited Loading

cstockton commented Jul 2, 2019

jmacd commented Jul 16, 2019

tigrannajaryan commented Jun 24, 2019 •

edited

Loading

cstockton commented Jun 24, 2019 •

edited

Loading

tedsuo commented Jul 2, 2019 •

edited

Loading