
Running multiple tracing implementations simultaneously #9

Open
bhs opened this issue Nov 16, 2016 · 22 comments
@bhs
Contributor

bhs commented Nov 16, 2016

Continuing the discussion from opentracing/opentracing.io#66

cc: @yurishkuro @bensigelman

@lookfwd

lookfwd commented Jan 3, 2017

Higher Order Tracers (HOT) & Migrating Between Vendors

I think there's a case for Higher Order Tracers (HOTs), i.e. tracers that combine other tracers (at construction time), e.g.:

import opentracing
from ot import hot  # hypothetical higher-order-tracer module
...

zipkinTracer = ZipkinTracer()
logTracer = LogTracer()
proprietaryTracer = ProprietaryTracer()

# Fan every tracing operation out to all three delegate tracers.
opentracing.tracer = hot.All(zipkinTracer, logTracer, proprietaryTracer)

Any in-band keys should be namespaced in order to enable more than one tracer working in parallel - I think this might be the only extra constraint that has to hold true to enable HOTs. There would also be a need for some "magic" to handle error cases (i.e. vectorized errors), but that shouldn't be a huge deal. Obviously there could be other combinators, like hot.Any, or router-like ones that dispatch depending on implementation-specific run-time flags, span attributes, etc.
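To make the constraint concrete, here is a minimal sketch of how a higher-order tracer might namespace in-band keys at inject() time. NamespacingCarrier, inject_all, and the "t0."-style prefixes are all hypothetical illustrations, not part of any OpenTracing API:

# Hypothetical sketch: prefix each delegate tracer's in-band keys so that
# several tracers can share one carrier (e.g. a dict of HTTP headers)
# without clobbering each other.
class NamespacingCarrier:
    def __init__(self, inner, prefix):
        self.inner = inner      # the real carrier, e.g. a dict of headers
        self.prefix = prefix    # e.g. "t0." for the first delegate

    def __setitem__(self, key, value):
        self.inner[self.prefix + key] = value

    def __getitem__(self, key):
        return self.inner[self.prefix + key]

def inject_all(tracers, span_contexts, format, carrier):
    # Each (tracer, span context) pair writes only under its own namespace.
    for i, (tracer, ctx) in enumerate(zip(tracers, span_contexts)):
        tracer.inject(ctx, format, NamespacingCarrier(carrier, "t%d." % i))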

Something like this would allow one to switch vendors without downtime, since it allows a transition period (which could be days) while both the old and the new tracing systems are running. As far as I can see, switching with zero downtime is close to impossible unless you have such a feature.

@yurishkuro
Member

Hopefully different tracers would be using different header keys anyway, so I would punt on the namespacing, especially since introducing namespacing would also be incompatible with existing deployments.

@lookfwd

lookfwd commented Jan 3, 2017

@yurishkuro - I agree... no "formal" namespacing. Just that people should be aware of aliases when they design their protocols. One interesting corner case is baggage, where I can imagine implementations that just normalize the key and push it directly to the carrier. If all implementations happen to naively leave the value unchanged, this might be OK. Generally speaking, though, bad things could happen - especially in the context of C++, where it could cause a crash.

@lookfwd

lookfwd commented Jan 3, 2017

From the previous discussion...

yes, the attributes is the only read operation in the whole API, so I think the composite tracer will have to pick one of the delegates, i.e. always the first, and the end users can configure delegate tracers in the proper order.

I think the only mathematically correct (i.e. functional) way to do this is to not discard any information: something like span.get_baggage_item(key) should return {'value1': ['ZipkinTracer', 'LogTracer'], 'value2': ['ProprietaryTracer']} and let the app do whatever seems fit. Personally I would be more than happy with that, but I'm sure most people would freak out. Could it return the first non-null majority and call an error callback mentioning "discarded value XYZ from that tracer"? Of course it could just return the first non-null value, but "first" doesn't sound very well defined :) Another option is to pass an (optional) lambda iterable: ... to get_baggage_item() that lets the app hand-craft its policy (this makes more sense with functors in C++).

The same thinking applies to the return/error codes.
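A minimal sketch of that multi-value idea, assuming a composite span that holds one delegate span per underlying tracer (all names here are hypothetical, not from the OT API):

# Hypothetical composite span whose get_baggage_item keeps every delegate's
# answer instead of silently discarding information.
class MultiplexingSpan:
    def __init__(self, spans):
        self.spans = spans  # one delegate span per underlying tracer

    def get_baggage_item(self, key, resolve=None):
        # Build {value: [names of the tracers that returned it]}.
        values = {}
        for span in self.spans:
            value = span.get_baggage_item(key)
            if value is not None:
                values.setdefault(value, []).append(type(span.tracer).__name__)
        if resolve is not None:
            return resolve(values)  # app-supplied conflict policy
        # Default: the first non-null value, in configured tracer order.
        return next(iter(values), None)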

@yurishkuro
Member

if you assume that the tracers already use prefixed keys, then encoding of the baggage is no different from encoding of the span context.

Errors are harder. I would go with a configuration option for the multiplexing tracer to either fail fast or to do best effort and aggregate all errors into one.

@lookfwd

lookfwd commented Jan 3, 2017

Let's assume that I upgrade from a tracer that doesn't support baggage propagation to one that does (or the reverse). I think this use case could easily hint at a good solution.

@yurishkuro
Member

yurishkuro commented Jan 3, 2017

I may have misread your point about baggage - so no issues for encoding/decoding, just what to return from MultiplexingSpan.getBaggage. I would go with a simple solution and return whatever the first tracer returns; a more complicated solution is, again, an internally configurable strategy in the tracer on what to do in case of conflicts (I wouldn't worry until there's a real need).

@lookfwd

lookfwd commented Jan 3, 2017

No - you are correct. There were two issues with baggage. One is the one you mention above, about the prefixed keys: if someone sets "hello": 12 as baggage, then e.g. X-B3-hello: 12 should go to the Carrier. The second issue is what MultiplexingSpan.getBaggage() returns (which should be the same as Span.getBaggage() in terms of type, otherwise we break the O(1) replace-implementation rule).

There are also subtleties like what happens if we have 3 mux'ed Spans and the second throws KeyError. I think the Multiplexer should mark it but continue and try to set on the third mux'ed Span. Similar interesting cases exist for Tracer.extract() and Tracer.inject(): we might be able to restore a MultiplexingSpan with mux'ed Spans for tracers 1 and 3 but not for 2 (e.g. due to UnsupportedFormatException). The MultiplexingSpan should optimistically survive, but the partial error shouldn't go completely unnoticed.

I think this discussion is quite crucial to the essence of OT, and no matter how much pain, there should be a good answer. If OT is about switching between implementations, it should be possible to do so without downtime... otherwise only relatively small companies can afford it.

@yurishkuro
Member

If the mux tracer can be configured with error handling strategies like FAIL_FAST and ACCUMULATE, I think it mostly solves the issue with extract(). I don't think there's a way to define the "correct" behavior of the mux, so give users a choice to pick the behavior and do best effort.

There's probably a need for a 3rd error strategy - to succeed if at least one tracer succeeds, because the user code might be checking the error and not using the returned span context as a parent in case of an error.
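A rough sketch of those three strategies applied to extract(), under a hypothetical MultiplexingTracer; the strategy names, MultiplexError, and the list-of-contexts return type are all illustrative, not from any spec:

# Hypothetical error-handling strategies for a multiplexing tracer.
FAIL_FAST, ACCUMULATE, ANY_SUCCESS = "fail_fast", "accumulate", "any_success"

class MultiplexError(Exception):
    """Aggregates the errors raised by individual delegate tracers."""
    def __init__(self, errors):
        super().__init__(errors)
        self.errors = errors

class MultiplexingTracer:
    def __init__(self, tracers, strategy=ACCUMULATE):
        self.tracers = tracers
        self.strategy = strategy

    def extract(self, format, carrier):
        contexts, errors = [], []
        for tracer in self.tracers:
            try:
                contexts.append(tracer.extract(format, carrier))
            except Exception as e:  # e.g. UnsupportedFormatException
                if self.strategy == FAIL_FAST:
                    raise
                contexts.append(None)  # keep slot i aligned with tracer i
                errors.append(e)
        if errors and (self.strategy == ACCUMULATE
                       or not any(c is not None for c in contexts)):
            raise MultiplexError(errors)  # all-or-nothing vs. any-success
        return contexts  # one (possibly None) span context per tracer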

@lookfwd

lookfwd commented Jan 3, 2017

What are your thoughts on ACCUMULATE?

@lookfwd

lookfwd commented Jan 3, 2017

There's probably a need for a 3rd error strategy - to succeed if at least one tracer succeeds, because the user code might be checking the error and not using the returned span context as parent in case of an error.

I agree with this a lot. Overall, you want to be able to upgrade/downgrade gracefully in terms of Tracer features. If baggage or a specific type of carrier wasn't supported before and is supported by the new tracing system, I guess you want to print a DEBUG/INFO log and continue using as many Tracer features as possible. If no tracer supports a feature, then an ERROR should be logged and someone should look at it ASAP... because, e.g., you are using a baggage value that is now null.

@lookfwd

lookfwd commented Jan 4, 2017

Bringing #28 (comment) here...

I think this approach will help upgrade or switch systems, though in practice it might require complete deployment across the entire architecture before you switch off the former.

Yes, exactly. There should be some kind of metric, and you can't switch off the old system until you see no hits.

Key concerns will be around aggregate (in and out of process) propagation and trace context provisioning I think.

I think the first one mostly concerns what we discussed above. I'm not quite clear on the second one...

@wu-sheng
Member

wu-sheng commented Jan 4, 2017

Error-handling strategies are the most important thing in multi-tracer implementations. We should choose them carefully: with the wrong strategy, one tracer implementation will fail because of another implementation.

For example, with extraction chained as tracerA.extract() -> tracerB.extract() -> tracerC.extract(): if tracerB.extract() fails, the strategy should make sure tracerC.extract() is still invoked. Otherwise tracerC's context will be missing from that point on, and trace C will be broken.

This kind of bug will be hard to find from either the tracer-user side or the tracer-developer side.

@wu-sheng
Member

wu-sheng commented Jan 4, 2017

"enable >1 tracers working in parallel" - I think parallel may not be the accurate word. Parallel implies that execution can be truly simultaneous, which almost always means running in multiple threads, and that is a dangerous choice.

Such as: extract in tracerA.extract() -> tracerB.extract() -> tracerC.extract()

Most likely, multiple tracers execute like a chain, as I said above - of course, with some error-handling strategies.

@wu-sheng
Member

wu-sheng commented Jan 4, 2017

@yurishkuro, I would prefer strategies that isolate the tracer implementations; we could provide a listener to notify the user when there are errors or exceptions in some of the multiplexed tracers.
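A sketch of that listener idea (the class and callback signature are illustrative): each delegate is isolated, and its failures are reported out-of-band instead of breaking its peers.

# Hypothetical: isolate each delegate tracer and report failures to an
# application-supplied listener rather than letting one tracer break another.
class ListeningMultiplexingTracer:
    def __init__(self, tracers, on_error=lambda tracer, op, exc: None):
        self.tracers = tracers
        self.on_error = on_error  # called as on_error(tracer, "extract", exc)

    def extract(self, format, carrier):
        contexts = []
        for tracer in self.tracers:
            try:
                contexts.append(tracer.extract(format, carrier))
            except Exception as e:
                contexts.append(None)  # this tracer is skipped; the chain continues
                self.on_error(tracer, "extract", e)
        return contexts

An application could then pass e.g. on_error=lambda t, op, e: log.warning("%s failed in %s: %s", t, op, e) and keep every tracer running regardless of its peers.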

@bhs
Contributor Author

bhs commented Jan 4, 2017

@lookfwd sorry for the delay, having a crazy week.

The primary value prop around interoperation of Tracers is not actually so that an org can necessarily switch from one tracing vendor to another with zero hiccups, although that's another nice possible benefit :)

The main advantage I see is actually for 3rd-party OSS/etc projects that are used by many organizations and need to provide tracing instrumentation without making assumptions about the environments they're integrated into. E.g., take DropWizard: there are companies using DropWizard and Jaeger; companies using DropWizard and Zipkin; and companies using DropWizard and LightStep. By instrumenting DropWizard with OpenTracing, vendor neutrality is preserved at the DW level and the vendors don't all need to independently instrument DW.

As for the idea of informally namespacing headers: sure. Though IMO this is redundant to discuss at the OT level since that's true for any possible HTTP header (given that HTTP headers are a shared namespace to begin with). I don't strongly object to making the docs longer / more complex to accommodate this advice, but I am skeptical that it's needed to achieve your desired result :)

@codefromthecrypt

codefromthecrypt commented Jan 5, 2017 via email

@lookfwd

lookfwd commented Jan 6, 2017

The main advantage I see is actually for 3rd-party OSS/etc projects...

@bensigelman Awesome - that wasn't clear to me.

without making assumptions about the environments they're integrated into. E.g., take DropWizard: there are companies using DropWizard and Jaeger

Aren't they making the assumption there's only a single tracing framework? How can one use DropWizard with both Jaeger and Zipkin?

Frankly, adding instrumentation means adding quite a lot of non-trivial code to a potentially large (e.g. N >> 1000) number of applications, and hopefully that instrumentation code can stay in place (without the need for any changes) for 5+ years. I believe an application developer wants to make the most out of the added instrumentation code, which means it's quite likely one might want to run solutions from 2-3 different vendors plus a few other potentially unrelated, smaller tracing-like applications that implement a limited part of the OT API.

@yurishkuro
Member

Aren't they making the assumption there's only a single tracing framework?

It's a pretty reasonable assumption, but it's not a concern of Dropwizard instrumentation. If someone manages to make a Multiplexing tracer, as discussed on this thread, then a Dropwizard user can use two tracers simultaneously, and DW doesn't need to change anything in its instrumentation.

How can one use DropWizard with both Jaeger and Zipkin?

I don't think Ben was suggesting that. Unless OpenTracing 2.0 comes up with a common in-band format for context propagation, even if you use multiple tracers each one of them will still see only the slice of architecture that directly runs with that tracer.

@lookfwd

lookfwd commented Jan 6, 2017

If someone manages to make a Multiplexing tracer

Exactly - the question is: does OT impose constraints that would prevent such a tracer from being possible? My guess is that we wouldn't want that.

For example, as we mentioned previously, span.get_baggage_item(key) is assumed to either succeed or fail, but not partially fail. I agree that in general this is mainly the Multiplexing tracer's concern, but it's possible that e.g. DW would be interested in the distinction between span.get_baggage_item(key, mode=strict), which fails if two tracers disagree on an "important" baggage item, and the default span.get_baggage_item(key), which returns the first non-null baggage value from the child tracers. Similarly for partial success/failure of inject(), etc.
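Continuing the hypothetical MultiplexingSpan sketch from earlier in this thread, that strict mode could be expressed as just another resolver policy:

# Hypothetical strict/lenient policies for the MultiplexingSpan sketch above,
# where `values` maps each baggage value to the tracers that returned it.
def strict(values):
    if len(values) > 1:
        raise ValueError("tracers disagree on baggage: %r" % values)
    return next(iter(values), None)

def first_non_null(values):
    return next(iter(values), None)

# span.get_baggage_item("user-id", resolve=strict)          # fail on conflict
# span.get_baggage_item("user-id", resolve=first_non_null)  # default behavior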

@lookfwd

lookfwd commented Jan 6, 2017

P.S. I realize that it's mostly talk and thinking here. I think I will try to write/plug in a mixing tracer next time I have time to play with Zipkin-Python-OT.

@lookfwd

lookfwd commented Jan 25, 2017

There's a first ugly throw-away C++ draft of some concepts here.
