-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
non-RPC spans and mapping to multiple parents #28
Comments
@adriancole I'm glad you brought this up – an important topic. I am going to dump out some ideas I have about this – no concrete proposals below, just food for thought.
I was careful to make sure that a dapper- or zipkin-like In the current model, multiple parents could be represented as Span tags or – I suppose – as log records, though that latter idea smells wrong. Trace Attributes do not seem like the right fit since parentage relationships are a per-Span rather than per-Trace concern. (On that note: IMO the Let me also throw out this other related use-case that I think about often: delays in "big" executor queues, e.g. the main Node.js event loop. If each such executor has a globally unique ID and spans make note of those unique IDs as they pass through the respective queue, a sufficiently dynamic tracing system can explain the root cause of queuing delays (which is an important problem that is usually inscrutable). To be more concrete, suppose the following diagram illustrates the contents of a FIFO executor queue:
Let's say that the Span that enqueued So, the big question: is Anyway, this example is just meant to provide a practical / common example of tricky causality and data modeling. There are analogous examples for coalesced writes in storage systems, or any time batching happens, really.
|
I think this should be another page on the main website - recipes for handing the scenarios mentioned above, and others we discussed on various issues, like marking a trace as "debug". The goal of OpenTracing is to give instrumenters a standard language to describe the computation graph shape, regardless of the underlying tracing implementation, so we cannot give answers like "this is implementation specific", or "this could be done like this" - the answer needs to be "this is done this way", otherwise instrumenters can walk away none the wiser. Of course, it is also helpful to know the exact use case the user is asking about. For example, it's not clear to me that the queueing/batching @bensigelman describes is a use case for multiple parents. The main answer the users want is why it took my span so long to finish. So the investigation could be done in two steps, first the span logs the time when it was enqueued and when it was dequeued and executed. If the gap is large, it already indicates a delay on the event loop. To investigate the delay, user can run another query in the system asking for spans that too long to actually execute once dequeued, and thus delayed everybody else. A very intelligent tracing system may be clever enough to auto capture the items ahead in the queue based on the global ID of the executor, but we still need to have a very precise recipe to the instrumentors of what exactly they need to capture regardless of the underlying tracing implementation. Going back to the multi-parent question, do we understand which scenarios actually require it? As for capturing multiple parents, I would suggest using span tags for that, i.e. we declare a special for parent in parent_spans:
span.set_tag(ext.tags.PARENT, parent.trace_context) (which is another reason I was lobbying for string->[]any, and it's still possible to do the above based on the current API and semantics). |
@yurishkuro yes, my point was not that the queue is a good fit for multiple-parentage (it's not) but more to redirect the conversation around motivating scenarios rather than a particular DAG structure. This is very much in line with your suggestion (which I heartily endorse) that we provide sensible, opinionated guidance for best-practice instrumentation of certain archetypical scenarios: RPC boundaries, high-throughput executor queues, coalescing write buffers, whatever. As for relying on the "intentionally undefined" semantics of multiple calls to parent_span_ids = map(lambda s: str(s.trace_context.span_id), parent_spans)
span.set_tag(ext.tags.MULTIPLE_PARENTS, strings.join(parent_span_ids, ",")) |
Why not just pass the array of parent trace contexts? span.set_tag(ext.tags.PARENTS, [parent.trace_context for parent in parent_spans]) Passing string IDs is not portable, we don't even have |
I was respecting the |
Not to beat a dead horse, but I agree that queue depth is not a good use-case for multiple parents. (It's tempting, for instance, to take that on a slippery slope all the way up to OS-level scheduling!) IMO distributed tracing is about understanding a single flow of control across a distributed system--concurrent work may factor into how that request was handled, but tracing should keep the unit of work being traced as the center of reported data, with info in a trace being "relative to" that trace. The use-cases I see for multiple parents are around asynchronous work done in a blocking context--in the service of handling a single request or unit of work (to distinguish from the event loop case above). It is tempting to say that the join is optional, because someone reading the trace can probably infer a blocking/nonblocking relationship from the trace structure. However, the join is valuable information for those unfamiliar with the system who are reading individual traces, or for a tracer which wants to do more sophisticated analysis on the corpus of traces, because it signals where asynchronous work is actually blocking critical path in a manner which is less open to interpretation. Some examples we see commonly in web apps are libcurl's In AppNeta's X-Trace implementation, multi-parent is also used to record the ID of a remote (server-side) span event when it replies to the client agent. This is largely because the methodology is based on events instead of spans. For instance, a remote call made with httplib in python would involve 2 events if the remote side is not instrumented ( I like the idea of supporting this type of behavior, but it seems less pressing in a span-oriented world. The main argument I can see is an understanding of blocking critical path vs not in analysis of traces. I'm curious: are there other arguments for multi-parent out there? What is this used for in HTrace world? (Also @bensigelman can you clarify your comment about not serializing parent_id? If the TraceContext is what is serialized going across the wire, shouldn't it hold a previous ID? I am probably missing something obvious here..) |
@dankosaur per your question about |
Thanks, that makes sense--the |
As Adrian mentioned, in HTrace, we allow trace spans to have multiple One example of where this was important is the case of writing data to an Another example is in HBase. HBase has a write-ahead log, where it does What both of these examples have in common is that they involve two or more We had a few different choices here: In a world where we're using less than 1% sampling, solution #1 would mean Solution #1 is simple to implement. As far as I can tell, most distributed #2. Denormalize. If two traced writes came in, we could create "separate #3. A more complex data model that had "extra edges" beyond the #4. Support multiple parents. This wasn't difficult at the model layer. I'm curious what you guys would suggest for solving cases like this one. best, On Mon, Jan 18, 2016 at 1:38 PM, Dan Kuebrich notifications@github.com
|
@cmccabe thanks for the write-up! The group commit is a really interesting use-case, and because I also have not seen much discussion around this, I'd love to hear broader thoughts. Particularly about solution 3 you present above, because that's the one AppNeta takes with regard to such work. The reasoning behind picking option 3, which results in what we call "meta-traces" which have parent and child trace relationships, is based in the desire to be able to do smart aggregate analysis on traces. If a request trace is always a blocking unit of work at the top level, then you can start to mine it for critical path, goal is optimizing for end-user performance (whether the end-user is a human or a machine doesn't matter). So we wanted a definition of a trace which had a blocking top-level span. However, there's plenty of workloads that exhibit chained work patterns like queue insertion with a quick ack followed by downstream processing. These are also very important to trace, but can't be modeled using the above definition of a trace. (This type of behavior sounds parallel to the group commit case: something is written to a log, then later processed.) For that reason, we decided a "meta-trace" is the path which holds the most semantic value: each "stage" of the pipeline/processing-graph can be analyzed as a separate application based on its traces, with its own dependencies, hot spots, etc. But also the entire meta-trace can be reconstructed for end-to-end tracing. This might include a many-to-one join in the case of things that batch processing (eg. writes), or a more simple waterfall and branching pattern for a lot of data pipelines. |
@dankosaur we are also considering using a model that sounds very much like your meta-trace, for capturing relationship between some real-time trace and work it enqueues for later execution. At minimum it requires a small extension of capturing a "parent trace ID". Does AppNeta expose a higher level API for users to instrument their apps to capture these relationships? |
Thanks for the insight, Dan. I agree that for the "async work queue" case, you probably want to create However, it would certainly be possible to model the HBase group commit as We've been trying to figure out the right model to represent things like best, On Mon, Jan 18, 2016 at 3:42 PM, Dan Kuebrich notifications@github.com
|
@yurishkuro yes, though this is something we're actively working on and it's only being used internally so far, so it's not documented externally. Our API is very "flat" and based almost entirely on semantics such that each event (~span) is a bag of key/value pairs. So the way to note one or more parents is simply to add one or more @cmccabe yeah, at risk of complicating this notion, but actually hoping to clarify it, I think there are two classes of use-case for multiple-parent we've seen in discussion so far:
In 1 we'd be looking at spans with multiple parents; in 2 we'd be looking at traces with multiple parent traces. I do think 1 becomes quite esoteric for span-based architectures, but is worth capturing if it's not too onerous to support API-wise (don't have a strong feeling on this--it is more important for event-based architectures than span-based ones). 2 is potentially dependent on a discussion about the scope of work to be included in a single trace, which I'm not sure has been discussed yet. |
Sorry if these are dumb questions. But there are still a lot of things The best argument I have heard for meta-spans is that regular spans don't Does it make sense to use terminology like "phase" or "job" rather than Rather than adding meta-spans, we could also add point events, and have the On the other hand, if we had something like meta-spans, perhaps we could Colin On Mon, Jan 18, 2016 at 4:38 PM, Dan Kuebrich notifications@github.com
|
@dankosaur Indeed, a higher level API may not be necessary if spans from another trace can be added as parents. I know people have concerns with String->Any tags, but I would be ok with relaxing String->BasicType restriction (since it won't be enforced at method signature level anyway) for tags in the parent_contexts = [span.trace_context for span in parent_spans]
span.set_tag(ext.tags.MULTIPLE_PARENTS, parent_contexts) @bensigelman ^^^ ??? |
There's a running assumption that each contribution of an rpc is a different span. While popular, this isn't the case in zipkin. Zipkin puts all sides sides in the same span, similar to how in http/2 there's a stream identifier used for all request and response frames in that activity. [ operation A ] <-- all contributions share a span ID If zipkin split these into separate spans, it would look like... [ operation A.client ], [ operation A.server ] <-- each contribution has a different a span ID Visually, someone could probably just intuitively see they are related. With a "kind" field (like kind.server, kind.client), you could probably guess with more accuracy that they are indeed the same op. Am I understanding the "meta-trace" aspect as a resolution to the problem where contributors to the same operation do not share an id (and could, if there's was a distinct parent)? ex. |
I don't think we need to conflate support of multiple parents with widening
the data type of the tag api, particularly this early in the game. For
example, what if no api that supports multiple parents actually implements
OT? We're stuck with the wide interface. I'd suggest folks encode into a
single tag and leave complaints around that as an issue to work on later.
|
My understanding of two spans per RPC approach is that the server-side span is a child of the client-side span. The main difference is in the implementation of the That is somewhat orthogonal to multi-parents issue. Any span can declare an additional parent span to indicate its casual dependency (a "join"). However, in case of two-spans per RPC it would be unexpected for a server-side span to declare more than one parent. |
Isn't it what this issue is about, how to record multiple parents? I don't mind if it's done via set_tag or with |
I'm more comfortable with set_parents or the like than changing the tag api
directly.
|
and to be clear, my original intent was to determine if and how this impacts trace attributes (propagated tags) vs tags (ones sent out of band). ex in both zipkin and htrace, the parent is a propagated field. In zipkin One binding concern was if "adding a parent" is a user function? Ex. in HTrace the first parent is set always. Since parents are complicated, it is api affecting how they are used in practice. |
Arguably, in-band propagated parent-span-id in Zipkin is not necessary, it could've been sent out of band. It sounds like in AppNeta the multiple parent IDs are "tags", not propagated attributes. Does anyone know why exactly Zipkin decided to propagate parent ID? |
|
Getting back to the original subject (which is something I've been interest in since forever ago): I'm personally most excited about use cases that – at some level – boil down to a shared queue. That certainly encompasses the buffered/flushed writes case as well as the event loop pushback I mentioned further up in the thread. In those cases, how much mileage can we get by giving the queue (or "queue": it may be a mysql server or anything that can go into pushback) a GUID and attaching those guids to spans that interact with them? It's different than marking a parent_id but seems (to me) to make the instrumentation easier to write and the tooling easier to build. Thoughts? (As for MapReduces, etc: I have always had a hard time getting monitoring systems that are built for online, interactive-latency applications to actually work well for offline, non-interactive-latency applications (like MR). The data models can seem so similar, yet the tuning parameters are often totally divergent. Maybe I just didn't try hard enough (or wasn't smart enough, etc, etc)! I do think it's academically interesting and am happy to keep hearing ideas.) |
I don't think the buffered writes case in HDFS is similar to a queue. A Here are examples of things in HDFS that actually are queues:
We haven't seen a reason to trace these things yet (of course we might in Consider the case of an HBase PUT. This is clearly going to require a
These things are all logically part of the same PUT request, so why would best, On Tue, Jan 19, 2016 at 9:46 PM, bhs notifications@github.com wrote:
|
@cmccabe buffered writes can have, well, queuing problems... the buffer is an "intermediary" between the operations trying to write and the final resting place of the data. I agree that it's not a simple push/pop sort of producer-consumer queue, and I think that's what you're saying. I'm interested by your comment that "queues have not been that interesting to us." Do you mean that HBase doesn't have queuing problems? And/or that users don't want to understand what's in the queue/intermediary when HBase is in pushback? Bigtable is admittedly a different system than HBase, but that was of great interest to me as a Bigtable user when the tabletserver my process was talking to became unresponsive. Were there tools that reliably helped in such scenarios? Not really. Would I have liked to use one? Absolutely. Back to your question of why we would "split" the PUT, group commit, and stream flush: logically, I would prefer not to split them... that's what this thread is about, of course. The DAG model in the abstract is sound. It is less clear in the presence of sampling, though... For instance, if sampling decisions are made at the root of a trace (i.e., when there's no inbound edge, regardless of whether it's a DAG or a tree), how do we expect to understand the history of the other PUTs/etc in our HBase group commit request if they weren't sampled? So, the other spans involved in the group commit are either all sampled or not-all-sampled. If they're all sampled, the tracing system needs to be able to handle high throughput. If they're not all sampled, the tracing system will not be able to tell a complete story about queuing problems or other slowness involving the group commit. For a tracing system that can afford to sample all requests, in my mind the presence of unique ids for specific queues opens the door to various useful UI features. If it would be helpful, I could try to describe such features... but IMO just assembling one gigantic DAG trace that includes everything in a batch as well as all of its downstream and upstream (transitive) edges is problematic from both a systems standpoint and a visualization standpoint without additional meta-information about the structure of the system and the various queues/intermediaries. |
On Wed, Jan 20, 2016 at 9:58 PM, bhs notifications@github.com wrote:
In particular, I think your solution of foreign keys is the right thing to
Conceptually, having an hdfs flush span that has "foreign keys" to write Colin —
|
Hey Colin, One final thing about "queue", the word: I don't much care what we call it, I'm just trying to find a word we can use to describe the concept. I guess I've often heard people talk about "queueing problems" in datastore workloads, but whatever term you want to use is fine by me. Anyway, re your last paragraph:
Yeah, so, I don't really have strong opinions about the "DAG of trees of spans" vs "DAG of spans" question per se. Both could work... I was more interested in avoiding what otherwise seems (?) like an
If we say that The unfortunate thing about what I'm proposing is that tracing systems need to be aware of a new sort of construct. But I was hoping it would offer a more "declarative" (for lack of a better word) way to describe what's going on. |
On Thu, Jan 21, 2016 at 4:04 PM, bhs notifications@github.com wrote:
With multiple parents, a parent span can end at a point in time before a With foreign keys, we can draw "a dotted line" of some sort between The other question is what the "foreign key" field should actually be. If Whether you choose multiple parents or foreign keys, you still have to Colin
|
To draw a conclusion on the impact on the API, can we agree on the following API? span.add_parents(span1, ...) The method takes parent spans via varargs, and the spans do not have to belong to the same trace (solving meta-trace issue). |
I am sorry to be a stick in the mud, but this still seems suspect to me... For one thing, we probably shouldn't assume we have actual span instances to add as parents. Also, the model described at http://opentracing.io/spec/ describes traces as trees of spans: we can talk about making traces into DAGs of spans, but I would rather we bite off something smaller for now. One idea would be to aim for something like
... or we could do something similar with Also happy to schedule a VC about this topic since it's so complex in terms of implications. Or wait for Wednesday, whatever. |
fair point, I am not married to the word "parent", we can pick more abstract causality reference, like "starts_after" I do prefer to provide "causal ancestors" a list of Spans. This has to be tracer-agnostic syntax, and the end user's code doesn't know what "span id" is, they can only know some serialized format of the span, and if the multi-parent span creation happens in a different process (like a background job caused by an earlier http request), then presumably the parent trace managed to serialize its tracer context before scheduling the job, so that the job may de-serialize it into a Span. Finally, for the method signature, I prefer a dedicated method, since this functionality is actually in the public APIs of some existing tracing systems (HTrace, and possibly TraceView). Delegating this particular feature to a simple key/value of log/setTag methods doesn't feel right, especially due to lack of type enforcement. btw, set_tag is not really an option since people were adamantly opposed to non-primitive tag values, as a result in Java we can't even pass an Object to setTag. |
Bogdan mentioned soft-links at the workshop, which would be a way to The tie-in to multiple parents are two-fold. Firstly, some are raising concern that we are linking trees, not spans in Secondly, there's concern about encoding. For example, in certain systems, I'm not sure this is a blocking concern. For example, in htrace, the For those who aren't following OpenTracing, it may be more important there. Long story short.. I think the OpenTracing encoding question is relevant to On Sat, Feb 20, 2016 at 6:44 AM, Yuri Shkuro notifications@github.com
|
@adriancole I don't follow this point. An end user does not have access to string representation of a span, the best they can do is use the injector to convert span to a map[string, string] and then do some concatenation of the result. If we propose this as the official way of recording casual ancestors, we're locking every implementation into that clunky format (which may not even be reversible without additional escaping, depending on the encoding). If we propose no official way, people would have to resort to vendor-specific APIs. Ben's suggestion of using log with standard msg string at least resolves the encoding problem, since log() accepts any payload, including the Span. But casual ancestors aren't logs, they have no time dimension. Plus, for tracers that do not capture multi-parents properly, |
@yurishkuro whoops.. sorry.. I was catching up on email and I mistook this thread for one in the distributed-tracing google group (which is why I said "For those who aren't following OpenTracing.."). |
in other words, my last paragraph wasn't targeted towards the OT stewards, rather towards those who author tracers in general. ex. they may just simply support this feature or not (ex htrace does already), as they can control their api, model, and everything else. That paragraph isn't relevant to OT discussion.. which makes me think maybe I should just delete the comments as they weren't written for OT debates. |
Like it
The sooner this gets done, the less likely there will be the need of opentracing 2.0. Tracing is a DAG problem. The algorithms on top of trees and DAGs should be of similar complexity. The visualizations should change significantly but what is out there right now is suitable mostly for web-flows where there's a request etc. which is a minor part of what people need to monitor. Here's a little example of an engine that matches streams of tweets. A tweet arrived, triggered 1000 subscriptions with their own spans that trigger events on several servers (sharded by subscription id). Notice that what you see above is just a single slice. We have 4 such slices (sets of servers) for resilience on an active/passive setup. What is worthy under those conditions is to take the DAG for each tweet from each of the slices and compare it with all the others in terms of latency and correctness. [Disclaimer: The example is artificial but similar to the one I'm working on] |
There may be a case for DAG in some fork/join concurrency models: it gives the tracer more information about blocking events by clarifying joins. If we can live without that, or find a way to back it in later (some sort of "barrier event"), then we don't need multi-parent/multi-precedes IMO. @lookfwd I think your example is very valid for tracing, but I wouldn't model it in the way you propose. A tweet is a triggering event which kicks off some processing. I'm not sure the same should be said of the subscription at that point, however. I'd argue the subscription is state: it's a user-defined configuration that exists in the system before tweet event. If you consider the tweet as the triggering event and the subscriptions as values read and acted on at tweet time, you're back to a tree: the number of subscriptions becomes a high fan-out of the trace, but it's still a tree. (The subscription create/modify may be its own trace-triggering event when it is first created and populated to whatever subsystems store the state.) It's tempting to want to associate the influence of a user's subscription on later processing of tweets--however, I don't think a single trace is a good way to do so. If you want to model the subscription as an ongoing event, your "traces" will never end. |
@dkuebric said:
Very well said. DAGs are great and everything, but if we truly want to consider the general causal-dependency DAG as "a single trace", that single trace quickly becomes absolutely enormous and consequently intractable from a modeling standpoint. Another very important consideration is sampling: if the sampling coin-flip (in Dapper-like systems) happens at the "DAG roots" (i.e., spans that have no incoming edge), then the @lookfwd I don't want to sound unsympathetic, and perhaps we will even proceed with this... but what you're advocating has huge implications for sampling (either that or it will be semantically broken in most production tracing systems I'm aware of) and I worry about adding something to OT1.0 that contemporary tracing systems can only support partially, and for fundamental reasons. Does that make sense? Or would we consider this an "aspirational feature" that's a bit caveat emptor? |
@dkuebric There are two type of subscriptions. Some are long-lived where indeed what you say applies. The second case is "live" subscriptions that last for however long the user is connected (from seconds to minutes). We have 10's of thousands of those per day. I don't see a problem with never ending traces by the way. I think it's valid to query "give me anything you know that happened between Frankly - for me there are two separate things:
I think that no one would argue that the reality is DAG and trees are a subset of use cases. These are the facts. This is what I want to have on my distributed trace logs even if I don't have the tools to visualize or create alarms on those facts right now. Now on the visualization part... I think that the common view is this: which is very valid and indeed it's the present. But I believe that equally well someone could extract timed sequence diagrams from traces: and in those DAGs can very well be expressed as well as rich casual relationships. On the contrary state is somewhat complex to express. No matter what the tools look like right now... no matter what we do with data... we should collect the facts with a mindset of what does really happen. @bensigelman I think I get what you mean with sampling. Yes, you need to trace everything :) I will have a much better look on the spec. The truth is that I've spent just a few hours trying to understand if it is suitable for our use case so I might well be missing many aspects / constraints! A little update. If you want to sample per shard e.g. anything where |
@lookfwd this is interesting stuff and I'm happy to hop on a call or VC to discuss in detail: probably more useful than just reading the spec, but up to you. As for the update at the end of your last message: there are plenty traces that span logical shards so I'm not sure what "sampling by shard" means in that context. Also, sampling is done both to reduce the load on the tracing system and on the host process: if the host process is in |
I believe that the |
@bensigelman I prefer to keep this open until we define more exotic |
Thinking more about this, the other thing we're missing (in addition to more exotic reference types) is the capacity to add Since both of these are backwards-compatible API additions, I'm going to remove the "1.0 Spec" milestone for this issue. |
a) so we're going with 1.1, not 2.0? |
@yurishkuro For me, it would be a group commit operation that does not know the ancestor commits before creating the span. It would need to capture the span contexts from the individual commit operations, put them in a set, then attach each one to the group commit span with a Another example (one that I'm currently working on) is integrating with an in-process instrumentation library. It's a lot more convenient to be able to attach references after creating a span; I don't have to worry about having all references in order from ancestor spans in the instrumentation library at the point where I need to create the descendant span. |
See opentracing/specification#5 to continue this discussion. |
One of my goals of working in OpenTracing is to do more with the same amount of work. For example, when issues are solved in OpenTracing, and adopted by existing tracers, there's a chance for less Zipkin interop work, integrations and maintenance. Zipkin's had a persistent interoperability issue around non-RPC spans. This usually expresses itself as multiple parents, though often also as "don't assume RPC".
In concrete terms, Zipkin V2 has a goal to support multiple parents. This would stop the rather severe signal loss from HTrace and Sleuth to Zipkin, and of course address a more fundamental concern: the inability to express joins and flushes.
In OpenTracing, we call certain things out explicitly, and leave other things implicit. For example, the existence of a span id at all is implicit, except the side-effect where we split the encoded form of context into two parts. We certainly call out features explicitly, like "finish", and of course these depend on implicit functionality, such as harvesting duration from a timer.
Even if we decide to relegate this to an FAQ, I think we should discuss multiple parents, and api impact. For example, are multiple parents tags.. or attributes? Does adding parents impact attributes or identity? Can an HTrace tracer be built from an OpenTracing one without signal loss? Are there any understood "hacks" which allow one to encode a multi-parent span effectively into a single-parent one? Even if we say "it should work", I'd like to get some sort of nod from a widely-used tracer who supports multiple parents.
The practical impact of this is that we can better understand in Zipkin whether this feature remains a zipkin-specific interop story with, for example HTrace, or something we leverage from OpenTracing.
For example, it appears that in AppNeta, adding a parent, or edge is a user-level task, and doesn't seem to be tied with in-band aka propagated fields? @dankosaur is that right?
In HTrace, you add events and tags via TraceScope, which manages a single span, which encodes into its id a single primary parent. You can access the "raw" span, and assign multiple parents, but this doesn't change the identity of the span, and so I assume doesn't impact propagation. @cmccabe is that right?
I'm sure there are other multiple-parent tracers out there.. I'd love to hear who's planning to support OpenTracing and how that fits in with multiple parents.
The text was updated successfully, but these errors were encountered: