non-RPC spans and mapping to multiple parents #28

Closed
codefromthecrypt opened this issue Jan 17, 2016 · 47 comments

@codefromthecrypt
Contributor

One of my goals in working on OpenTracing is to do more with the same amount of work. For example, when issues are solved in OpenTracing and adopted by existing tracers, there's a chance for less Zipkin interop work, fewer integrations, and less maintenance. Zipkin has had a persistent interoperability issue around non-RPC spans. This usually expresses itself as multiple parents, though often also as "don't assume RPC".

In concrete terms, Zipkin V2 has a goal to support multiple parents. This would stop the rather severe signal loss from HTrace and Sleuth to Zipkin, and of course address a more fundamental concern: the inability to express joins and flushes.

In OpenTracing, we call certain things out explicitly, and leave other things implicit. For example, the existence of a span id at all is implicit, except the side-effect where we split the encoded form of context into two parts. We certainly call out features explicitly, like "finish", and of course these depend on implicit functionality, such as harvesting duration from a timer.

Even if we decide to relegate this to an FAQ, I think we should discuss multiple parents, and API impact. For example, are multiple parents tags.. or attributes? Does adding parents impact attributes or identity? Can an HTrace tracer be built from an OpenTracing one without signal loss? Are there any understood "hacks" which allow one to encode a multi-parent span effectively into a single-parent one? Even if we say "it should work", I'd like to get some sort of nod from a widely-used tracer that supports multiple parents.

The practical impact of this is that we can better understand in Zipkin whether this feature remains a Zipkin-specific interop story with, for example, HTrace, or something we leverage from OpenTracing.

For example, it appears that in AppNeta, adding a parent, or edge, is a user-level task, and doesn't seem to be tied to in-band (aka propagated) fields? @dankosaur is that right?

In HTrace, you add events and tags via TraceScope, which manages a single span, which encodes into its id a single primary parent. You can access the "raw" span, and assign multiple parents, but this doesn't change the identity of the span, and so I assume doesn't impact propagation. @cmccabe is that right?

I'm sure there are other multiple-parent tracers out there.. I'd love to hear who's planning to support OpenTracing and how that fits in with multiple parents.

@bhs
Contributor

bhs commented Jan 17, 2016

@adriancole I'm glad you brought this up – an important topic. I am going to dump out some ideas I have about this – no concrete proposals below, just food for thought.

<ramble>

I was careful to make sure that a Dapper- or Zipkin-like parent_id is not reified at the OpenTracing level... that is a Tracer implementation concern. That said, the Span and TraceContext APIs show a bias for single-parentage traces (if this isn't obvious I can elaborate). OpenTracing docs even describe traces as "trees" rather than "DAGs".

In the current model, multiple parents could be represented as Span tags or – I suppose – as log records, though that latter idea smells wrong. Trace Attributes do not seem like the right fit since parentage relationships are a per-Span rather than per-Trace concern. (On that note: IMO the parent_id should never be a part of the TraceContext as there's no need to send it in-band over the wire... it can just be a Span tag.)

Let me also throw out this other related use-case that I think about often: delays in "big" executor queues, e.g. the main Node.js event loop. If each such executor has a globally unique ID and spans make note of those unique IDs as they pass through the respective queue, a sufficiently dynamic tracing system can explain the root cause of queuing delays (which is an important problem that is usually inscrutable). To be more concrete, suppose the following diagram illustrates the contents of a FIFO executor queue:

    [  C  D  E  F  G  H  I  J  K  L  ]
                                  ^-- next to dequeue and execute

Let's say that the Span that enqueued C ends up being slow because the items ahead of it in this queue were too expensive. In order to truly debug the root cause of that slowness (for C), a tracing system should be talking about items D-L... at least one of them took so long that the wait to get to the front of the executor queue was too long.

So, the big question: is C a parent for D-L? After all, it is blocked on them, right? And if C is a parent, what do we say about the more direct/obvious parents of D-L, whatever they are?

Anyway, this example is just meant to provide a practical / common example of tricky causality and data modeling. There are analogous examples for coalesced writes in storage systems, or any time batching happens, really.

</ramble>

@yurishkuro
Member

I think this should be another page on the main website - recipes for handling the scenarios mentioned above, and others we discussed on various issues, like marking a trace as "debug". The goal of OpenTracing is to give instrumenters a standard language to describe the computation graph shape, regardless of the underlying tracing implementation, so we cannot give answers like "this is implementation specific", or "this could be done like this" - the answer needs to be "this is done this way", otherwise instrumenters can walk away none the wiser.

Of course, it is also helpful to know the exact use case the user is asking about. For example, it's not clear to me that the queueing/batching @bensigelman describes is a use case for multiple parents. The main question users want answered is why their span took so long to finish. So the investigation could be done in two steps: first, the span logs the time when it was enqueued and when it was dequeued and executed. If the gap is large, it already indicates a delay on the event loop. To investigate the delay, the user can run another query in the system asking for spans that took too long to actually execute once dequeued, and thus delayed everybody else. A very intelligent tracing system may be clever enough to auto-capture the items ahead in the queue based on the global ID of the executor, but we still need a very precise recipe for instrumenters describing what exactly they need to capture, regardless of the underlying tracing implementation.
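
A minimal sketch of that first step, assuming a span object with the log_event API discussed later in this thread (the event names and queue wiring are illustrative only):

from queue import Queue

work_queue = Queue()

def enqueue_work(span, work):
    span.log_event("enqueued")    # timestamped by the tracer at enqueue time
    work_queue.put((span, work))

def run_next():
    span, work = work_queue.get()
    span.log_event("dequeued")    # a large enqueued->dequeued gap indicates event-loop delay
    work()
    span.log_event("executed")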

Going back to the multi-parent question, do we understand which scenarios actually require it?

As for capturing multiple parents, I would suggest using span tags for that, i.e. we declare a special ext tag and do

for parent in parent_spans:
    span.set_tag(ext.tags.PARENT, parent.trace_context)

(which is another reason I was lobbying for string->[]any, and it's still possible to do the above based on the current API and semantics).

@bhs
Contributor

bhs commented Jan 17, 2016

@yurishkuro yes, my point was not that the queue is a good fit for multiple-parentage (it's not) but more to redirect the conversation around motivating scenarios rather than a particular DAG structure. This is very much in line with your suggestion (which I heartily endorse) that we provide sensible, opinionated guidance for best-practice instrumentation of certain archetypical scenarios: RPC boundaries, high-throughput executor queues, coalescing write buffers, whatever.

As for relying on the "intentionally undefined" semantics of multiple calls to set_tag, I would perhaps prefer forcing the multiple parents to be set all at once and just string-joining them into a single tag value. This would be the "making the hard things possible" half of the old "make the easy things easy and the hard things possible" adage (i.e., it's admittedly clumsy):

parent_span_ids = map(lambda s: str(s.trace_context.span_id), parent_spans)
span.set_tag(ext.tags.MULTIPLE_PARENTS, ",".join(parent_span_ids))

@yurishkuro
Member

Why not just pass the array of parent trace contexts?

span.set_tag(ext.tags.PARENTS, [parent.trace_context for parent in parent_spans])

Passing string IDs is not portable; we don't even have trace_context.span_id as a formal requirement in the API.

@bhs
Contributor

bhs commented Jan 17, 2016

I was respecting the BasicType-ness of set_tag's second parameter... I was mainly just illustrating the concatenation/joining. (Coercing a TraceContext into a BasicType or string is a problem regardless of set_tag/add_tag)

@dkuebric
Contributor

Not to beat a dead horse, but I agree that queue depth is not a good use-case for multiple parents. (It's tempting, for instance, to follow that slippery slope all the way up to OS-level scheduling!) IMO distributed tracing is about understanding a single flow of control across a distributed system--concurrent work may factor into how that request was handled, but tracing should keep the unit of work being traced at the center of reported data, with info in a trace being "relative to" that trace.

The use-cases I see for multiple parents are around asynchronous work done in a blocking context--in the service of handling a single request or unit of work (to distinguish from the event loop case above). It is tempting to say that the join is optional, because someone reading the trace can probably infer a blocking/nonblocking relationship from the trace structure. However, the join is valuable information for those unfamiliar with the system who are reading individual traces, or for a tracer which wants to do more sophisticated analysis on the corpus of traces, because it signals where asynchronous work is actually blocking the critical path in a manner which is less open to interpretation.

Some examples we see commonly in web apps are libcurl's curl_multi_exec (and the many libraries that wrap it), or libraries which are async at the underlying implementation level but actually end up being used synchronously a lot of the time (spymemcached). Instrumenting these to capture both use-cases benefits from being able to distinguish between the two execution patterns.

In AppNeta's X-Trace implementation, multi-parent is also used to record the ID of a remote (server-side) span event when it replies to the client agent. This is largely because the methodology is based on events instead of spans. For instance, a remote call made with httplib in python would involve 2 events if the remote side is not instrumented (httplib entry, httplib exit), or 4+ if the remote side is instrumented. The httplib exit event would have edges to both the httplib entry and remoteserver exit in that case.
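
Illustratively, the four-event case might carry edges like this (the event labels and field names are invented for illustration, not AppNeta's actual format):

events = [
    {"id": "e1", "label": "httplib entry",      "edges": []},
    {"id": "e2", "label": "remoteserver entry", "edges": ["e1"]},
    {"id": "e3", "label": "remoteserver exit",  "edges": ["e2"]},
    {"id": "e4", "label": "httplib exit",       "edges": ["e1", "e3"]},  # two parents: the join
]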


I like the idea of supporting this type of behavior, but it seems less pressing in a span-oriented world. The main argument I can see is an understanding of blocking critical path vs not in analysis of traces. I'm curious: are there other arguments for multi-parent out there? What is this used for in HTrace world?

(Also @bensigelman can you clarify your comment about not serializing parent_id? If the TraceContext is what is serialized going across the wire, shouldn't it hold a previous ID? I am probably missing something obvious here..)

@bhs
Contributor

bhs commented Jan 18, 2016

@dankosaur per your question about parent_id: If we're using a span-based model, IMO an RPC is two spans, one on the client and one on the server. The client span's TraceContext is sent over the wire as a trace_id and span_id, and that client span_id becomes the parent_id of the server span. Even if a single span is used to model the RPC, as long as the client logs the span's parent_id there should be no need for the server to log it as well (so, again, no need to include it in-band with the RPC payload). Hope that makes sense... if not I can make a diagram or something.
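
A rough sketch of that flow; the helper names, header keys, and accessors below are assumptions for illustration, not the actual OpenTracing or Zipkin API:

def outbound_headers(client_span):
    # client side: only trace_id and span_id travel in-band with the RPC
    ctx = client_span.trace_context
    return {"trace-id": ctx.trace_id, "span-id": ctx.span_id}    # accessor names assumed

def start_server_span(tracer, headers):
    # server side: the client's span_id becomes the server span's parent_id,
    # recorded out of band (e.g. as a tag) rather than re-sent over the wire
    server_span = tracer.join_trace("server-op", headers)        # signature assumed
    server_span.set_tag("parent_id", headers["span-id"])
    return server_span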

@dkuebric
Contributor

Thanks, that makes sense--the span_id becomes the parent id of the receiving span. It's the same way in X-Trace.

@cmccabe

cmccabe commented Jan 18, 2016

As Adrian mentioned, in HTrace, we allow trace spans to have multiple
parents. They form a directed acyclic graph, not necessarily a tree.

One example of where this was important is the case of writing data to an
HDFS DFSOutputStream. The Java stream object contains a buffer. This
buffer will be flushed periodically when it gets too big, or when one of
the flush calls is made. The call to write() will return quickly if it is
just storing something to the buffer.

Another example is in HBase. HBase has a write-ahead log, where it does
"group commit." In other words, if HBase gets requests A, B, and C, it
does a single write-ahead log write for all of them. The WAL writes can be
time-consuming since they involve writing to an HDFS stream, which could be
slow for any number of reasons (network, error handling, GC, etc. etc.).

What both of these examples have in common is that they involve two or more
requests "feeding into" a single time-consuming operation. I think some
people in this thread are referring to this as a "join" since it is an
operation that joins several streams of execution (sometimes quite
literally, by using Executors or a fork/join threading model).

We had a few different choices here:
#1. Arbitrarily assign the "blame" for the flush to a single HTrace
request. In the DFSOutputStream, this would mean that we would ignore
DFSOutputStream buffer flushes unless the HTrace request had to wait for
them. In HBase, what we would do is rather less clear-- the requests that
are being coalesced into a "group WAL commit" don't necessarily have any
user-visible ordering, so the choice of which one to "blame" for the group
commit would be completely arbitrary from the user's point of view.

In a world where we're using less than 1% sampling, solution #1 would mean
that relatively few HDFS flushes would ever be traced. It also means that
if two traced writes both contributed to a flush, only one would take the
"blame." For HBase, solution #1 would mean that there would be a fair
number of requests that would be waiting for the group commit, but have no
trace spans to reflect that fact.

Solution #1 is simple to implement. As far as I can tell, most distributed
tracing systems took this solution. You can build a reasonable latency
outlier analysis system this way, but you lose a lot of information about
what actually happened in the system.

#2. Denormalize. If two traced writes came in, we could create "separate
trees" for the same flush. This solution is superficially attractive, but
there are a lot of practical difficulties. Clearly, it increases the
number of spans exponentially for each branching point. Since we had this
problem at multiple layers of the system, this was not an attractive
solution.

#3. A more complex data model that had "extra edges" beyond the
parent/child relationships we traditionally used. For example, HDFS
flushes could become top-level HTrace requests that were somehow associated
with other requests (perhaps by some kind of "extra ID"). The problem with
this is that your tooling becomes much more complex and project-specific.
It's already hard enough to explain the current simple data model to people
without making it even more complex and domain-specific. We also have
multiple layers at which this problem happens, so it would become harder
for even experts to follow a single request all the way through the system.

#4. Support multiple parents. This wasn't difficult at the model layer.
It made some things more challenging at the GUI layer, but not by much.
Our programmatic interface for adding multiple parents is still a bit
awkward-- this is something we might want to work on in the future.

I'm curious what you guys would suggest for solving cases like this one.
We have tried to come up with something that was useful for Hadoop and
HBase, and hopefully the wider ecosystem as well. I didn't see a lot of
discussion about this in any of the tracing publications and discussions I
read-- perhaps I missed it.

best,
Colin


@dkuebric
Contributor

@cmccabe thanks for the write-up! The group commit is a really interesting use-case, and because I also have not seen much discussion around this, I'd love to hear broader thoughts. Particularly about solution 3 you present above, because that's the one AppNeta takes with regard to such work.

The reasoning behind picking option 3, which results in what we call "meta-traces" which have parent and child trace relationships, is based on the desire to be able to do smart aggregate analysis on traces. If a request trace is always a blocking unit of work at the top level, then you can start to mine it for critical path; the goal is optimizing for end-user performance (whether the end-user is a human or a machine doesn't matter). So we wanted a definition of a trace which had a blocking top-level span.

However, there's plenty of workloads that exhibit chained work patterns like queue insertion with a quick ack followed by downstream processing. These are also very important to trace, but can't be modeled using the above definition of a trace. (This type of behavior sounds parallel to the group commit case: something is written to a log, then later processed.)

For that reason, we decided a "meta-trace" is the path which holds the most semantic value: each "stage" of the pipeline/processing-graph can be analyzed as a separate application based on its traces, with its own dependencies, hot spots, etc. But the entire meta-trace can also be reconstructed for end-to-end tracing. This might include a many-to-one join in the case of things that do batch processing (e.g. writes), or a simpler waterfall and branching pattern for a lot of data pipelines.

@yurishkuro
Member

@dankosaur we are also considering using a model that sounds very much like your meta-trace, for capturing the relationship between a real-time trace and the work it enqueues for later execution. At a minimum it requires a small extension: capturing a "parent trace ID". Does AppNeta expose a higher-level API for users to instrument their apps to capture these relationships?

@cmccabe

cmccabe commented Jan 19, 2016

Thanks for the insight, Dan.

I agree that for the "async work queue" case, you probably want to create
multiple HTrace requests which you can then associate back together later.
However, this case seems a little different than the "synchronous join"
case that motivated us to use multiple parents. After all, in the async
case, you are probably going to be focused more on things like queue
processing throughput. In the "synchronous join" case, you need to focus on
the latency of the work done in the joined part. In the specific example
of HBase, if group commit has high latency, all the HBase requests that
depend on that particular group commit will also have high latency.

However, it would certainly be possible to model the HBase group commit as
a separate top-level request, and associate it back with whatever PUT or
etc. HBase request triggered it. I guess we have to think about the
advantages and disadvantages of that more, compared to using multiple
parents.

We've been trying to figure out the right model to represent things like
Hive jobs, where a SQL query is broken down into MapReduce or Spark jobs,
which then break down further into executors, and so forth. It does seem
like we will end up splitting spans quite a lot, and potentially using
foreign keys to knit them back together. In that case, it definitely makes
sense. The most basic level of support would be tagging HDFS / HBase spans
with the ID of the current MapReduce or Spark job.

best,
Colin


@dkuebric
Contributor

@yurishkuro yes, though this is something we're actively working on and it's only being used internally so far, so it's not documented externally. Our API is very "flat" and based almost entirely on semantics such that each event (~span) is a bag of key/value pairs. So the way to note one or more parents is simply to add one or more ParentID values to the root of a new trace.

@cmccabe yeah, at risk of complicating this notion, but actually hoping to clarify it, I think there are two classes of use-case for multiple-parent we've seen in discussion so far:

  1. Tracking join of parallel work in a blocking top-level request, which I argue above is a single-trace use-case vs
  2. Tracking join of multiple work-streams which may not be blocking top-level requests, which I argue is a meta-trace use-case.

In 1 we'd be looking at spans with multiple parents; in 2 we'd be looking at traces with multiple parent traces.

I do think 1 becomes quite esoteric for span-based architectures, but is worth capturing if it's not too onerous to support API-wise (don't have a strong feeling on this--it is more important for event-based architectures than span-based ones). 2 is potentially dependent on a discussion about the scope of work to be included in a single trace, which I'm not sure has been discussed yet.

@cmccabe

cmccabe commented Jan 19, 2016

Sorry if these are dumb questions. But there are still a lot of things
about the "meta-trace" or "meta-span" concept I don't understand. spans
have a natural and elegant nesting property; do meta-spans nest, or do I
need a meta-meta-span? Also, if meta-spans are forking and joining, then
it seems like we have the multiple parent discussion we had with spans all
over again, with the same set of possible solutions.

The best argument I have heard for meta-spans is that regular spans don't
get sent to the server until the span ends (at least in HTrace), which is
impractical if the duration of the span is minutes or hours.

Does it make sense to use terminology like "phase" or "job" rather than
"meta-span"? "meta-span" or "meta-trace" seems to define it terms of what
it is not (it's not a span) rather than what it is.

Rather than adding meta-spans, we could also add point events, and have the
kicking off of some big job or phase generate one of these point events.
And similarly, the end of a big job or phase could be another point event.
At least for systems like MapReduce, Spark, etc. we can use job ID to
relate spans with system phases.

On the other hand, if we had something like meta-spans, perhaps we could
draw high-level diagrams of the system's execution plan. These would look
a lot like the execution plan diagrams generated by something like Apache
Drill or Apache Spark. It would be informative to put these on the same
graph as some spans (although the GUI challenges are formidable.)

Colin


@yurishkuro
Member

@dankosaur Indeed, a higher level API may not be necessary if spans from another trace can be added as parents. I know people have concerns with String->Any tags, but I would be ok with relaxing the String->BasicType restriction (since it won't be enforced at the method signature level anyway) for tags in the ext.tags namespace (in lieu of a special-purpose API as in #18), so that we could register multiple parents with:

parent_contexts = [span.trace_context for span in parent_spans]
span.set_tag(ext.tags.MULTIPLE_PARENTS, parent_contexts)

@bensigelman ^^^ ???

@codefromthecrypt
Contributor Author

There's a running assumption that each contribution of an RPC is a different span. While popular, this isn't the case in Zipkin. Zipkin puts all sides in the same span, similar to how in HTTP/2 there's a stream identifier used for all request and response frames in that activity.

[ operation A ] <-- all contributions share a span ID
[ [cs] [sr] [ss] [cr] ]

If Zipkin split these into separate spans, it would look like...

[ operation A.client ], [ operation A.server ] <-- each contribution has a different span ID
[ [cs] [cr] ], [ [sr] [ss] ]

Visually, someone could probably just intuitively see they are related. With a "kind" field (like kind.server, kind.client), you could probably guess with more accuracy that they are indeed the same op.

Am I understanding the "meta-trace" aspect as a resolution to the problem where contributors to the same operation do not share an id (and could, if there were a distinct parent)?

ex.
[ operation A ]
[ operation A.client ], [ operation A.server ] <-- both add a parent ID of the above operation?
[ [cs] [cr] ], [ [sr] [ss] ]

@codefromthecrypt
Contributor Author

codefromthecrypt commented Jan 19, 2016 via email

@yurishkuro
Member

My understanding of the two-spans-per-RPC approach is that the server-side span is a child of the client-side span. The main difference is in the implementation of the join_trace function - a Zipkin v.1 implementation would implement join_trace by creating a span with the same trace_context it reads off the wire, while a "two-spans" tracer would implement join_trace by creating a child trace_context.

That is somewhat orthogonal to the multi-parents issue. Any span can declare an additional parent span to indicate its causal dependency (a "join"). However, in the case of two spans per RPC it would be unexpected for a server-side span to declare more than one parent.
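
A rough sketch of that difference (the helper names and signatures are assumptions, not the literal API):

def join_trace_shared_span(tracer, operation, incoming_context):
    # Zipkin-v1 style: reuse the trace_context read off the wire, so the
    # client and server report against the same span ID
    return tracer.start_span(operation, incoming_context)     # signature assumed

def join_trace_two_spans(tracer, operation, incoming_context):
    # two-spans-per-RPC style: create a child context, so the server span
    # gets its own span ID with the client span as its single parent
    child_context = incoming_context.new_child()               # helper assumed
    return tracer.start_span(operation, child_context)         # signature assumed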

@yurishkuro
Member

> I don't think we need to conflate support of multiple parents with widening the data type of the tag api, particularly this early in the game.

Isn't that what this issue is about - how to record multiple parents? I don't mind if it's done via set_tag or with set_parents(trace_contexts_list), but if we don't offer an API to do it, those existing systems with multi-parent support will have nothing to implement. FWIW, at Uber we're starting work right now to trace relationships from realtime requests to enqueued jobs, which is a multi-parent (meta-trace) use case, and it can be done with Zipkin v.1, mostly with some UI enhancements.

@codefromthecrypt
Contributor Author

codefromthecrypt commented Jan 19, 2016 via email

@codefromthecrypt
Contributor Author

and to be clear, my original intent was to determine if and how this impacts trace attributes (propagated tags) vs tags (ones sent out of band).

E.g. in both Zipkin and HTrace, the parent is a propagated field: in Zipkin, X-B3-ParentSpanId; in HTrace, half of the span id's bytes.

One binding concern is whether "adding a parent" is a user-level function. E.g. in HTrace the first parent is always set. Since parents are complicated, how they are used in practice affects the API.
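
For concreteness, the in-band fields Zipkin propagates look roughly like this (the header names are the real B3 ones; the values are fabricated examples):

b3_headers = {
    "X-B3-TraceId": "463ac35c9f6413ad",        # fabricated example values
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-ParentSpanId": "0020000000000001",   # the propagated parent in question
    "X-B3-Sampled": "1",
}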

@yurishkuro
Member

Arguably, the in-band propagated parent span id in Zipkin is not necessary; it could've been sent out of band. It sounds like in AppNeta the multiple parent IDs are "tags", not propagated attributes. Does anyone know why exactly Zipkin decided to propagate parent ID?

@codefromthecrypt
Contributor Author

> Does anyone know why exactly Zipkin decided to propagate parent ID?

Only a guess, but perhaps it is to ensure out-of-band spans don't need to
read-back to figure out their parent id. I'm sure the answer can be
discovered.

@bhs
Contributor

bhs commented Jan 20, 2016

Getting back to the original subject (which is something I've been interested in since forever ago):

I'm personally most excited about use cases that – at some level – boil down to a shared queue. That certainly encompasses the buffered/flushed writes case as well as the event loop pushback I mentioned further up in the thread. In those cases, how much mileage can we get by giving the queue (or "queue": it may be a MySQL server or anything that can go into pushback) a GUID and attaching those GUIDs to spans that interact with them? It's different than marking a parent_id but seems (to me) to make the instrumentation easier to write and the tooling easier to build.
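
A minimal sketch of that idea, using nothing beyond the existing set_tag API (the tag name and GUID scheme here are purely illustrative):

import uuid

# one stable GUID per shared queue / buffer / intermediary, created once
FLUSH_BUFFER_GUID = str(uuid.uuid4())

def on_buffered_write(span):
    # every span that touches the intermediary records its GUID; a tracing
    # system can later group spans by this value to explain shared delays
    span.set_tag("intermediary.guid", FLUSH_BUFFER_GUID)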

Thoughts?

(As for MapReduces, etc: I have always had a hard time getting monitoring systems that are built for online, interactive-latency applications to actually work well for offline, non-interactive-latency applications (like MR). The data models can seem so similar, yet the tuning parameters are often totally divergent. Maybe I just didn't try hard enough (or wasn't smart enough, etc, etc)! I do think it's academically interesting and am happy to keep hearing ideas.)

@cmccabe

cmccabe commented Jan 20, 2016

I don't think the buffered writes case in HDFS is similar to a queue. A
queue typically has events going in and events coming out. The buffered
writes case just has a buffer which fills and then gets emptied all at
once, which is not the way a queue typically works. The HBase case doesn't
even necessarily have ordering between the elements that are being
processed in the join, which makes it even less like a queue.

Here are examples of things in HDFS that actually are queues:

  • the queue of sockets that we have accept()ed but not read the message
    from yet
  • the queue of requests (threads) waiting to take the FSN lock

We haven't seen a reason to trace these things yet (of course we might in
the future). It is fair to say that so far, queues have not been that
interesting to us.

Consider the case of an HBase PUT. This is clearly going to require a
group commit, and that group commit is going to require an HDFS write and
flush. If you create a new request on every "join," you would have to look
at three different "HTrace requests" to see why this had high latency.

  • The PUT request
  • The HBase group commit request
  • The HDFS stream flush request

These things are all logically part of the same PUT request, so why would
we split them? And if we did, how would the users get from one request to
the next? The GUI tooling understands how to follow parents to children,
but not how to look up arbitrary foreign keys. The DAG model of execution
is closer to reality than the tree model, so why should we force a tree on
things that aren't tree-like?

best,
Colin


@bhs
Contributor

bhs commented Jan 21, 2016

@cmccabe buffered writes can have, well, queuing problems... the buffer is an "intermediary" between the operations trying to write and the final resting place of the data. I agree that it's not a simple push/pop sort of producer-consumer queue, and I think that's what you're saying.

I'm interested by your comment that "queues have not been that interesting to us." Do you mean that HBase doesn't have queuing problems? And/or that users don't want to understand what's in the queue/intermediary when HBase is in pushback? Bigtable is admittedly a different system than HBase, but that was of great interest to me as a Bigtable user when the tabletserver my process was talking to became unresponsive. Were there tools that reliably helped in such scenarios? Not really. Would I have liked to use one? Absolutely.

Back to your question of why we would "split" the PUT, group commit, and stream flush: logically, I would prefer not to split them... that's what this thread is about, of course.

The DAG model in the abstract is sound. It is less clear in the presence of sampling, though... For instance, if sampling decisions are made at the root of a trace (i.e., when there's no inbound edge, regardless of whether it's a DAG or a tree), how do we expect to understand the history of the other PUTs/etc in our HBase group commit request if they weren't sampled?

So, the other spans involved in the group commit are either all sampled or not-all-sampled. If they're all sampled, the tracing system needs to be able to handle high throughput. If they're not all sampled, the tracing system will not be able to tell a complete story about queuing problems or other slowness involving the group commit.

For a tracing system that can afford to sample all requests, in my mind the presence of unique ids for specific queues opens the door to various useful UI features. If it would be helpful, I could try to describe such features... but IMO just assembling one gigantic DAG trace that includes everything in a batch as well as all of its downstream and upstream (transitive) edges is problematic from both a systems standpoint and a visualization standpoint without additional meta-information about the structure of the system and the various queues/intermediaries.

@cmccabe

cmccabe commented Jan 21, 2016

> @cmccabe https://github.com/cmccabe buffered writes can have, well,
> queuing problems... the buffer is an "intermediary" between the operations
> trying to write and the final resting place of the data. I agree that it's
> not a simple push/pop sort of producer-consumer queue, and I think that's
> what you're saying.

Maybe my view of queues is too narrow. But when I think of a queue, I
think of a data structure with a well-defined ordering, where I take out
exactly the same elements that I put in, not some combination. Queuing
also has a strong suggestion that something is going to be processed in an
asynchronous fashion (although strictly speaking that isn't always true).
None of those things always hold true for the examples we've been
discussing, which makes me a little reluctant to use this nomenclature. Do
you think "shared work" is a better term than "queuing"?

In particular, I think your solution of foreign keys is the right thing to
do for asynchronous deferred work (which is the first thing that pops into
my mind when I think of a queue) but I'm not so sure about shared work that
is done synchronously.

> I'm interested by your comment that "queues have not been that interesting
> to us." Do you mean that HBase doesn't have queuing problems? And/or that
> users don't want to understand what's in the queue/intermediary when HBase
> is in pushback? Bigtable is admittedly a different system than HBase, but
> that was of great interest to me as a Bigtable user when the
> tabletserver my process was talking to became unresponsive. Were there
> tools that reliably helped in such scenarios? Not really. Would I have
> liked to use one? Absolutely.

I agree that when things get busy, it is interesting to know what else is
going on in the system. I (maybe naively?) assumed that we'd do that by
looking at the HTrace spans that were going on in the same region or tablet
server around the time the "busy-ness" set in. I suppose we could attempt
to establish a this-is-blocked-by-that relationship between various
requests... perhaps someone could think of cases where this would be useful
for HBase? I wonder what advantages this would have over a time-based
search?

> Back to your question of why we would "split" the PUT, group commit, and
> stream flush: logically, I would prefer not to split them... that's what
> this thread is about, of course.

> The DAG model in the abstract is sound. It is less clear in the presence
> of sampling, though... For instance, if sampling decisions are made at the
> root of a trace (i.e., when there's no inbound edge, regardless of
> whether it's a DAG or a tree), how do we expect to understand the history
> of the other PUTs/etc in our HBase group commit request if they weren't
> sampled?

> So, the other spans involved in the group commit are either all sampled or
> not-all-sampled. If they're all sampled, the tracing system needs to be
> able to handle high throughput. If they're not all sampled, the tracing
> system will not be able to tell a complete story about queuing problems or
> other slowness involving the group commit.

Certainly the group commit, by its very nature, combines together work done
by multiple top-level requests. You can make the argument that it is
misleading to attach that work to anything less than the full set of
requests. But I think in practice, we can agree that it is much more
useful to be able to associate the group commit with what triggered it,
than to skip that ability. Also, this criticism applies equally to foreign
key systems-- if the user can somehow click through from the PUT to the
hdfs flush, doesn't that suggest a 1:1 relationship to the user even if one
doesn't exist?

> For a tracing system that can afford to sample all requests, in my mind
> the presence of unique ids for specific queues opens the door to various
> useful UI features. If it would be helpful, I could try to describe such
> features... but IMO just assembling one gigantic DAG trace that includes
> everything in a batch as well as all of its downstream and upstream
> (transitive) edges is problematic from both a systems standpoint and a
> visualization standpoint without additional meta-information about the
> structure of the system and the various queues/intermediaries.

If the shared work is "gigantic" that will be a problem in both the
multi-parent and foreign key scenarios. Because I assume that you want the
shared work to be traced either way (I assume you are not proposing just
leaving it out). In that case we need to explore other approaches such as
intra-trace sampling or somehow minimizing the number of spans used to
describe what's going on.

Conceptually, having an hdfs flush span that has "foreign keys" to write
requests A, B, and C seems very similar to having an hdfs flush span that
has parents of A, B, and C. I don't understand why having a DAG of trees
of spans is acceptable but just having a DAG of spans is not. Does this
simplify the GUI?

Colin


@bhs
Contributor

bhs commented Jan 22, 2016

Hey Colin,

One final thing about "queue", the word: I don't much care what we call it, I'm just trying to find a word we can use to describe the concept. I guess I've often heard people talk about "queueing problems" in datastore workloads, but whatever term you want to use is fine by me.

Anyway, re your last paragraph:

> Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

Yeah, so, I don't really have strong opinions about the "DAG of trees of spans" vs "DAG of spans" question per se. Both could work... I was more interested in avoiding what otherwise seems (?) like an O(N^2) edge proliferation... Again, looking at a fictitious queue:

tail--> [C D E F G H I J K] <--head

If we say that C depends on D, E, ..., K, doesn't D depend on E, F, ..., K? I liked the idea of creating a guid for the flush buffer / queue / whatever-we-want-to-call it because each span would have the single reference to that guid and a tracing system could infer the dependency relationships between the various buffered items.

The unfortunate thing about what I'm proposing is that tracing systems need to be aware of a new sort of construct. But I was hoping it would offer a more "declarative" (for lack of a better word) way to describe what's going on.

@cmccabe

cmccabe commented Jan 26, 2016

> Hey Colin,

> One final thing about "queue", the word: I don't much care what we call
> it, I'm just trying to find a word we can use to describe the concept. I
> guess I've often heard people talk about "queueing problems" in datastore
> workloads, but whatever term you want to use is fine by me.

I'm still unsure whether "a queue" is the right term for the generic
concept of shared work we are talking about here. Wikipedia defines a
queue as a "a particular kind of abstract data type or collection in which
the entities in the collection are kept in order and the principal (or
only) operations on the collection are the addition of entities to the rear
terminal position, known as enqueue, and removal of entities from the front
terminal position, known as dequeue." This doesn't seem like a very good
description of something like a group commit, where you add a bunch of
elements in no particular order and flush them all at once. It's not
really a good description of something like an HDFS flush either, where you
accumulate N bytes in a buffer and then do a write of all N. It's not like
there are processes on both ends pulling individual items from a queue.
It's just a buffer that fills, and then the HDFS client empties it all at
once, synchronously.

> Anyway, re your last paragraph:

> Conceptually, having an hdfs flush span that has "foreign keys" to write
> requests A, B, and C seems very similar to having an hdfs flush span that
> has parents of A, B, and C. I don't understand why having a DAG of trees of
> spans is acceptable but just having a DAG of spans is not. Does this
> simplify the GUI?

> Yeah, so, I don't really have strong opinions about the "DAG of trees of
> spans" vs "DAG of spans" question per se. Both could work... I was more
> interested in avoiding what otherwise seems (?) like an O(N^2) edge
> proliferation... Again, looking at a fictitious queue:

> tail--> [C D E F G H I J K] <--head

> If we say that C depends on D, E, ..., K, doesn't D depend on E, F, ..., K?

We use multiple parents in HTrace today, in the version that we ship in
CDH5.5. It does not cause an O(N^2) edge proliferation. Flush spans have
a set of parents which includes every write which had a hand in triggering
the flush. I don't see any conceptual reason why the individual writes
should depend on one another. One write is clearly not the parent of any
other write, since the one didn't initiate the other.

> I liked the idea of creating a guid for the flush buffer / queue /
> whatever-we-want-to-call it because each span would have the single
> reference to that guid and a tracing system could infer the dependency
> relationships between the various buffered items.

> The unfortunate thing about what I'm proposing is that tracing systems
> need to be aware of a new sort of construct. But I was hoping it would
> offer a more "declarative" (for lack of a better word) way to describe
> what's going on.

Hmm. Maybe we need to get more concrete about the advantages and
disadvantages of multiple parents vs. foreign keys.

With multiple parents, a parent span can end at a point in time before a
child span. For example, in the case of doing a write which later triggers
a flush, the write might finish long before the flush even starts. This
makes it impossible to treat spans as a flame graph or traditional stack
trace, like you can in a single-parent world. This may make writing a GUI
harder since you can't do certain flame-graph-like visualizations.

With foreign keys, we can draw "a dotted line" of some sort between
requests. For example, if the write is one request and the flush is
another, there might be some sort of dotted line between them in GUI
terms. It's a bit unclear how to make this connection, though.

The other question is what the "foreign key" field should actually be. If
it is a span ID, then it is easy for a GUI to follow it to the relevant
"related request." It also makes more sense to use span IDs for things
like HDFS flushes, that have no actual system-level ID. To keep things
concrete, let's consider the HDFS flush case. In the "foreign key as span
ID" scheme, the flush would switch from having write A, write B, and write
C as parents to having all those spans as "foreign keys" (or maybe "related
request spans"?) Aside from that, nothing would change.

Whether you choose multiple parents or foreign keys, you still have to
somehow deal with the "sampling amplification" issue. That is, if 20
writes on average go into each flush, each flush will be 20x as likely to
be traced as any individual write operation. That is, assuming that you
really make a strong commitment to ensuring that writes can be traced all
the way through the system, which we want to do in HTrace.
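
As a back-of-the-envelope illustration of that amplification, assuming each write is sampled independently:

p = 1.0 / 1024                  # per-write sampling probability (Dapper-style)
writes_per_flush = 20
# the flush must be traced if any contributing write was sampled
p_flush = 1 - (1 - p) ** writes_per_flush
print(p_flush / p)              # ~19.8, i.e. roughly 20x the per-write rate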

Colin



@bhs bhs added the not urgent label Feb 10, 2016
@yurishkuro
Member

To draw a conclusion on the API impact, can we agree on the following?

span.add_parents(span1,  ...)

The method takes parent spans via varargs, and the spans do not have to belong to the same trace (solving the meta-trace issue).
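
Usage might look like the following; add_parents is the proposal above, not an existing OpenTracing call, and the span names are hypothetical:

# synchronous join within one trace: a flush span joins several writes
flush_span.add_parents(write_span_a, write_span_b, write_span_c)

# meta-trace case: the parent belongs to a different trace entirely
background_job_span.add_parents(enqueuing_request_span)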

@bhs
Contributor

bhs commented Feb 19, 2016

I am sorry to be a stick in the mud, but this still seems suspect to me... For one thing, we probably shouldn't assume we have actual span instances to add as parents. Also, the model described at http://opentracing.io/spec/ describes traces as trees of spans: we can talk about making traces into DAGs of spans, but I would rather we bite off something smaller for now.

One idea would be to aim for something like

span.log_event(ext.CAUSED_BY, payload=span_instance_or_span_id)

... or we could do something similar with set_tag and avert our eyes about the multimap issue (Yuri, I know that you lobbied for the "undefined" semantics so that multiple calls to set_tag may yield a multimap in some impls).

Also happy to schedule a VC about this topic since it's so complex in terms of implications. Or wait for Wednesday, whatever.

@yurishkuro
Member

fair point, I am not married to the word "parent", we can pick a more abstract causality reference, like "starts_after"

I do prefer to provide "causal ancestors" as a list of Spans. This has to be tracer-agnostic syntax, and the end user's code doesn't know what a "span id" is; it can only know some serialized format of the span. If the multi-parent span creation happens in a different process (like a background job caused by an earlier HTTP request), then presumably the parent trace managed to serialize its trace context before scheduling the job, so that the job may de-serialize it into a Span.
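
A rough sketch of that hand-off; every helper and method name here is an assumption standing in for whatever a given tracer provides:

import json

def schedule_job(queue, request_span, job_args):
    # in the request process: serialize the caller's context before scheduling
    carrier = {}
    request_span.trace_context.serialize(carrier)                 # method assumed
    queue.put(json.dumps({"args": job_args, "causal_ancestor": carrier}))

def run_job(tracer, raw_message):
    # in the worker process: rebuild a Span and register it as a causal ancestor
    message = json.loads(raw_message)
    ancestor_span = tracer.deserialize_span(message["causal_ancestor"])  # method assumed
    job_span = tracer.start_span("background-job")                # signature assumed
    job_span.add_parents(ancestor_span)                           # per the proposal above
    return job_span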

Finally, for the method signature, I prefer a dedicated method, since this functionality is actually in the public APIs of some existing tracing systems (HTrace, and possibly TraceView). Delegating this particular feature to a simple key/value of log/setTag methods doesn't feel right, especially due to lack of type enforcement.

btw, set_tag is not really an option since people were adamantly opposed to non-primitive tag values; as a result, in Java we can't even pass an Object to setTag.

@codefromthecrypt
Contributor Author

Bogdan mentioned soft-links at the workshop, which would be a way to
establish causality between traces, and calculate the time that a message
was in-flight.

The tie-in to multiple parents is two-fold.

Firstly, some are raising concern that we are linking trees, not spans in
the same trace. Perhaps there's relevance in that.

Secondly, there's concern about encoding. For example, in certain systems,
propagated tags have constraints, like being a basic type (sampled flag) or
a binary (trace id struct). Also, there are folks who have clearly wanted
to keep "tags" contained.. for example, quite a few tracing systems
reference these as simple string->string dicts.

I'm not sure this is a blocking concern. For example, in htrace, the
multiple parents are actually a separate field in the span.. i.e. they aren't
stored in the tags dict. In other words, tracing systems are not required
to store everything as tags, and today they certainly don't (ex
annotations/logs are not tags).

For those who aren't following OpenTracing, it may be more important there.
There's currently no intent to extend the data structure to support fields
besides tags or logs (annotations). In this case, if someone was using
OpenTracing only, they'd need to stuff multiple parents (or soft-links)
into something until such a feature was formally supported. This would lead
to comma or otherwise joining (if a tag), or stuffing them into logs (which
can repeat).

Long story short.. I think the OpenTracing encoding question is relevant to
OT as of February 20, but not a blocking concern for tracing systems who
have extended their model, are ok to extend their model, or do not see
encoding with commas or otherwise an immediate concern.


@yurishkuro
Member

> I think the OpenTracing encoding question is relevant to OT as of February 20, but not a blocking concern for tracing systems who have extended their model, are ok to extend their model, or do not see encoding with commas or otherwise an immediate concern.

@adriancole I don't follow this point. An end user does not have access to a string representation of a span; the best they can do is use the injector to convert the span to a map[string, string] and then do some concatenation of the result. If we propose this as the official way of recording causal ancestors, we're locking every implementation into that clunky format (which may not even be reversible without additional escaping, depending on the encoding). If we propose no official way, people would have to resort to vendor-specific APIs.

Ben's suggestion of using log with a standard msg string at least resolves the encoding problem, since log() accepts any payload, including the Span. But causal ancestors aren't logs; they have no time dimension. Plus, for tracers that do not capture multi-parents properly, span.log('caused_by', other_span) might lead to peculiar side effects.

@codefromthecrypt
Contributor Author

@yurishkuro whoops.. sorry.. I was catching up on email and I mistook this thread for one in the distributed-tracing google group (which is why I said "For those who aren't following OpenTracing..").

@codefromthecrypt
Contributor Author

in other words, my last paragraph wasn't targeted towards the OT stewards, rather towards those who author tracers in general. E.g. they may simply support this feature or not (e.g. HTrace does already), as they control their API, model, and everything else. That paragraph isn't relevant to the OT discussion.. which makes me think maybe I should just delete the comments, as they weren't written for OT debates.

@lookfwd
Contributor

lookfwd commented Feb 20, 2016

> span.add_parents(span1, ...)

Like it

> we can talk about making traces into DAGs of spans

The sooner this gets done, the less likely there will be a need for OpenTracing 2.0.

Tracing is a DAG problem. The algorithms on top of trees and DAGs should be of similar complexity. The visualizations would change significantly, but what is out there right now is suitable mostly for web flows where there's a request, etc., which is a minor part of what people need to monitor.

[image: trace of the tweet-matching engine described below]

Here's a little example of an engine that matches streams of tweets. A tweet arrives and triggers 1000 subscriptions, each with their own spans, which in turn trigger events on several servers (sharded by subscription id).

Notice that what you see above is just a single slice. We have 4 such slices (sets of servers) for resilience in an active/passive setup.

What is valuable under those conditions is to take the DAG for each tweet from each of the slices and compare it with all the others in terms of latency and correctness.

[Disclaimer: The example is artificial but similar to the one I'm working on]

@dkuebric
Copy link
Contributor

There may be a case for DAGs in some fork/join concurrency models: they give the tracer more information about blocking events by clarifying joins. If we can live without that, or find a way to add it in later (some sort of "barrier event"), then we don't need multi-parent/multi-precedes IMO.

@lookfwd I think your example is very valid for tracing, but I wouldn't model it the way you propose. A tweet is a triggering event which kicks off some processing. I'm not sure the same should be said of the subscription at that point, however. I'd argue the subscription is state: it's a user-defined configuration that exists in the system before the tweet event.

If you consider the tweet as the triggering event and the subscriptions as values read and acted on at tweet time, you're back to a tree: the number of subscriptions becomes a high fan-out of the trace, but it's still a tree.
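
In code, that tree-shaped modeling might look roughly like the sketch below, assuming an OpenTracing-style Python tracer; match_subscriptions and notify stand in for the application's own logic:

    import opentracing

    tracer = opentracing.tracer  # assume a concrete tracer is configured

    def handle_tweet(tweet):
        # The tweet is the triggering event and becomes the root of the trace.
        with tracer.start_span('process-tweet') as tweet_span:
            # Subscriptions are pre-existing state, read at tweet time.
            for sub in match_subscriptions(tweet):
                # Each matched subscription is a child span: high fan-out,
                # but every span still has exactly one parent.
                with tracer.start_span('notify-subscription',
                                       child_of=tweet_span) as sub_span:
                    sub_span.set_tag('subscription.id', sub.id)
                    notify(sub, tweet)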

(The subscription create/modify may be its own trace-triggering event when it is first created and populated to whatever subsystems store the state.)

It's tempting to want to associate the influence of a user's subscription with the later processing of tweets; however, I don't think a single trace is a good way to do so. If you want to model the subscription as an ongoing event, your "traces" will never end.

@bhs
Copy link
Contributor

bhs commented Feb 20, 2016

@dkuebric said:

If you want to model the subscription as an ongoing event, your "traces" will never end.

Very well said. DAGs are great and everything, but if we truly want to consider the general causal-dependency DAG as "a single trace", that single trace quickly becomes absolutely enormous and consequently intractable from a modeling standpoint.

Another very important consideration is sampling: if the sampling coin-flip (in Dapper-like systems) happens at the "DAG roots" (i.e., spans that have no incoming edge), then the add_parent call is likely ("almost always" with Dapper-style 1/1024 sampling, in fact) to point to a sub-DAG that's not even being recorded up the stack.
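
Back-of-the-envelope, with independent head-based sampling the referenced sub-DAG is almost never recorded:

    # With Dapper-style head-based sampling at 1/1024, decided independently at
    # each DAG root, the ancestor a multi-parent edge points at was recorded with
    # probability ~0.1%; both sides of a join are recorded roughly once in a million.
    p = 1.0 / 1024
    p_ancestor_recorded = p          # ~0.00098
    p_both_roots_recorded = p * p    # ~9.5e-07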

@lookfwd I don't want to sound unsympathetic, and perhaps we will even proceed with this... but what you're advocating has huge implications for sampling (either that or it will be semantically broken in most production tracing systems I'm aware of) and I worry about adding something to OT1.0 that contemporary tracing systems can only support partially, and for fundamental reasons. Does that make sense?

Or would we consider this an "aspirational feature" that's a bit caveat emptor?

@lookfwd
Copy link
Contributor

lookfwd commented Feb 20, 2016

@dkuebric There are two types of subscriptions. Some are long-lived, where indeed what you say applies. The second case is "live" subscriptions that last for however long the user is connected (from seconds to minutes). We have tens of thousands of those per day.

I don't see a problem with never-ending traces, by the way. I think it's valid to query "give me anything you know that happened between @tstart and @tend for that event". I think this is what always happens anyway, with @tend=now.

Frankly, for me there are two separate things:

  • What happened (go to the crime scene and collect facts)
  • What can I do with it (we will see if we use them in the court)

I think no one would argue with the fact that reality is a DAG and trees are a subset of the use cases. These are the facts. This is what I want to have in my distributed trace logs, even if I don't have the tools to visualize or create alarms on those facts right now.

Now on the visualization part... I think that the common view is this:

[screenshot: Zipkin-style trace waterfall view]

which is very valid and indeed is the present state of the art. But I believe someone could equally well extract timed sequence diagrams from traces:

[image: timed sequence diagram extracted from a trace]

and in those, DAGs can be expressed very well, as can rich causal relationships. State, on the contrary, is somewhat complex to express.

No matter what the tools look like right now, and no matter what we do with the data, we should collect the facts with a mindset of what really happens.

@bensigelman I think I get what you mean about sampling. Yes, you need to trace everything :) I will take a much closer look at the spec. The truth is that I've spent just a few hours trying to understand whether it is suitable for our use case, so I might well be missing many aspects / constraints!

A little update: if you want to sample per shard, e.g. anything where *ID % 1000 == 0, there shouldn't be a problem with DAGs and sampling.
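
A sketch of that per-shard sampling idea: the decision is deterministic on the entity id, so every server touching that tweet makes the same choice (the 1/1000 rate is just the figure from the comment):

    def should_sample(entity_id, denominator=1000):
        # Deterministic, so all shards/slices that process this entity agree,
        # and a sampled entity's entire DAG is captured end to end.
        return entity_id % denominator == 0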

@bhs
Copy link
Contributor

bhs commented Feb 20, 2016

@lookfwd this is interesting stuff and I'm happy to hop on a call or VC to discuss in detail: probably more useful than just reading the spec, but up to you.

As for the update at the end of your last message: there are plenty of traces that span logical shards, so I'm not sure what "sampling by shard" means in that context. Also, sampling is done both to reduce the load on the tracing system and on the host process: if the host process falls in shard_id % 1000 == 0, then everything therein will be sampled, and there can be an observer effect / perf degradation.

@bhs
Copy link
Contributor

bhs commented Jul 8, 2016

I believe that the SpanContext and Reference concepts deal with this issue cleanly. I'll close in a few days if there are no objections.
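
Concretely, in the Python API multiple predecessors can be expressed as a list of references at start time (a sketch; upstream_ctx_a and upstream_ctx_b are SpanContexts extracted elsewhere, and a concrete tracer is assumed to be installed as opentracing.tracer):

    import opentracing

    tracer = opentracing.tracer

    # A span may carry several references; this is how a join with more than
    # one predecessor is modeled. Note that references are fixed at start time.
    join_span = tracer.start_span(
        'merge-results',
        references=[opentracing.follows_from(upstream_ctx_a),
                    opentracing.follows_from(upstream_ctx_b)])
    join_span.finish()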

@bhs bhs added this to the 1.0GA milestone Jul 8, 2016
@yurishkuro
Copy link
Member

@bensigelman I prefer to keep this open until we define more exotic SpanReferenceTypes that can actually represent the scenarios mentioned here.

@bhs
Copy link
Contributor

bhs commented Jul 8, 2016

Thinking more about this, the other thing we're missing (in addition to more exotic reference types) is the capacity to add References during the lifetime of a Span, and not just at start time.
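
For illustration only, such a late-binding addition might look like a hypothetical mutator on Span, which the spec does not define today:

    # Hypothetical: attach an additional reference after the span has started.
    span.add_reference(opentracing.follows_from(other_span_context))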

Since both of these are backwards-compatible API additions, I'm going to remove the "1.0 Spec" milestone for this issue.

@yurishkuro
Copy link
Member

a) so we're going with 1.1, not 2.0?
b) can you give an example where it's necessary to attach span reference after the span has started?

@tinkerware
Copy link

@yurishkuro For me, it would be a group commit operation that does not know the ancestor commits before creating the span. It would need to capture the span contexts from the individual commit operations, put them in a set, then attach each one to the group commit span with a Groups reference type. Without the ability to attach span references later on, the instrumentation gets unwieldy; you have to capture the starting timestamp of the group commit, then start the span with the explicit timestamp. This breaks the typical code flow where you can use a try-with-resources clause or equivalent to surround the instrumented block, more so if you are also trying to capture faults/exceptions.
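
A sketch of that workaround in the Python API, with flush_to_storage and the per-commit span_context attribute as stand-ins for application code, and follows_from standing in for the not-yet-defined "Groups" reference type:

    import time
    import opentracing

    tracer = opentracing.tracer  # assume a concrete tracer implementation

    def group_commit(pending_commits):
        start = time.time()  # must be captured manually, since the span starts late
        # Collect the span contexts captured earlier by the individual commits.
        member_refs = [opentracing.follows_from(c.span_context)
                       for c in pending_commits]
        # All references must be known up front, so the span is created after the
        # fact with an explicit timestamp, losing the try-with-resources shape.
        span = tracer.start_span('group-commit',
                                 references=member_refs,
                                 start_time=start)
        try:
            flush_to_storage(pending_commits)
        except Exception:
            span.set_tag('error', True)
            raise
        finally:
            span.finish()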

Another example (one that I'm currently working on) is integrating with an in-process instrumentation library. It's a lot more convenient to be able to attach references after creating a span; I don't have to worry about having all the references from ancestor spans lined up in the instrumentation library at the point where I need to create the descendant span.

@bhs
Copy link
Contributor

bhs commented Nov 16, 2016

See opentracing/specification#5 to continue this discussion.

@bhs bhs closed this as completed Nov 16, 2016