Adds advice about bounding spans #35

codefromthecrypt · 2016-06-01T01:37:05Z

Over the last year I've noticed a recurring problem: instrumentors accidentally send too much data.

It tends to surface as transport errors, such as a frame size being exceeded (e.g. 64MiB). While a lot
of systems can handle heavy loads, an error like this usually means something is off. For example,
someone might be replicating the entire HTTP message in a trace, or storing all headers.

At the moment, only Finagle's tracer bounds the size of a span before placing it on the transport. For
example, it makes sure a scribe message flushes at 5MiB. Others have mentioned that they alert
when their instrumentation reports very large spans.

Transport errors are the worst place to start helping folks. I figured adding some docs here might
head off this sort of problem, or at least give folks a good chance to start with better practice.
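
To make the idea of bounding concrete, here is a minimal sketch of a reporter-side buffer that flushes before a message budget is exceeded and refuses any single oversized span. It is not Finagle's implementation: `BoundedSpanBuffer`, the JSON encoding, and the 64KiB per-span cap are assumptions for illustration; only the 5MiB flush threshold comes from the description above.

```python
import json
from typing import Callable, Dict, List


class BoundedSpanBuffer:
    """Hypothetical buffer: collects encoded spans, flushing before a byte budget is exceeded."""

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_message_bytes: int = 5 * 1024 * 1024,  # flush threshold mentioned above
                 max_span_bytes: int = 64 * 1024) -> None:  # assumed per-span cap
        self._send = send
        self._max_message_bytes = max_message_bytes
        self._max_span_bytes = max_span_bytes
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self.dropped = 0  # count of spans rejected for being too large

    def report(self, span: Dict) -> bool:
        """Queue a span. Returns False if the span alone exceeds the per-span cap."""
        encoded = json.dumps(span, separators=(",", ":")).encode("utf-8")
        if len(encoded) > self._max_span_bytes:
            # Drop (and count) here rather than fail later at the transport.
            self.dropped += 1
            return False
        if self._pending and self._pending_bytes + len(encoded) > self._max_message_bytes:
            self.flush()
        self._pending.append(encoded)
        self._pending_bytes += len(encoded)
        return True

    def flush(self) -> None:
        """Hand the queued spans to the transport and start a new batch."""
        if self._pending:
            self._send(self._pending)
            self._pending = []
            self._pending_bytes = 0
```

The `dropped` counter is the kind of signal one could alert on, so oversized spans show up in metrics instead of as transport errors.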


codefromthecrypt · 2016-06-01T01:40:00Z

cc'ing a bunch of tracer authors (noting I'm sure I'm missing many):
@Horusiath @dawallin @marcingrzejszczak @kristofa @yurishkuro @sveinnfannar @jplock @jcarres-mdsol @prat0318

virtuald · 2016-06-01T02:00:58Z

pages/instrumenting.md

 Spans contain identifying information such as traceId, spandId, parentId, and
 RPC name.

+Spans are usually small. For example, the serialized form is often measured in


How about "spans are expected to be typically small" instead?

or maybe "Zipkin's design assumes spans are small (orders of kilobytes or less)"

PS @mosesn, is this a fair statement?

The main idea I'm trying to relay is that space efficiency was a constant in much of Zipkin's design, e.g. how IP addresses are serialized into numbers, and how Finagle's tracer only picks a couple of fields so that it ends up with quite small spans (hundreds of bytes). If higher orders of magnitude (like MiB) were intended, people wouldn't bother with optimizations like these.

Yes, I think that's right. Even with all of these optimizations, the write throughput is often enormous.
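
As an illustration of the IP address point, a small sketch of the size difference between an IPv4 address kept as text and the same address packed into a 32-bit number. This is not Zipkin's encoder; the helper names are hypothetical and only the Python standard library is used.

```python
import ipaddress
import struct


def ipv4_to_int(address: str) -> int:
    """Convert dotted-quad text to its unsigned 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def int_to_ipv4(value: int) -> str:
    """Convert an unsigned 32-bit integer back to dotted-quad text."""
    return str(ipaddress.IPv4Address(value))


address = "192.168.99.100"
as_text = address.encode("ascii")                     # one byte per character
as_number = struct.pack(">I", ipv4_to_int(address))   # always 4 bytes, big-endian

print(len(as_text), len(as_number))
```

Saving ten bytes per endpoint looks trivial in isolation, but it is a deliberate choice that only makes sense when spans are expected to be small and write volume is high.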

codefromthecrypt · 2016-06-01T03:21:55Z

Thinking maybe I could also add a counter-advice here..

"Zipkin integrates with general purpose logging, as opposed to replacing it."

Below details need work (cc @abesto @anuraaga @clehene @klingerf particularly for help tidying this part), I suppose I could pull it into a separate PR.

Zipkin instrumentation routinely offers integration with logging systems, usually by adding trace identifiers to the logging context. This allows free-form logging to be correlated with a trace, and lets users leverage the more sophisticated queries that log systems provide.

Zipkin (v1)'s query and indexing system was designed to help pinpoint traces based on known values and categories. For example, you can search by duration, or tags like http path, but not fuzzy queries like regular expressions.
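
A rough sketch of what adding trace identifiers to the logging context can look like, using Python's standard `logging` module. `TraceContextFilter`, `configure_logger`, and the `get_context` callback are hypothetical names; a real tracer would supply the current trace and span IDs through its own API.

```python
import io
import logging
from typing import Callable, TextIO, Tuple


class TraceContextFilter(logging.Filter):
    """Hypothetical filter that stamps each log record with the current trace IDs."""

    def __init__(self, get_context: Callable[[], Tuple[str, str]]) -> None:
        super().__init__()
        self._get_context = get_context  # assumed to return (trace_id, span_id)

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = self._get_context()
        return True  # never suppress the record, only annotate it


def configure_logger(name: str, stream: TextIO,
                     get_context: Callable[[], Tuple[str, str]]) -> logging.Logger:
    """Build a logger whose lines carry traceId/spanId for correlation."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter(get_context))
    handler.setFormatter(logging.Formatter(
        "%(levelname)s traceId=%(trace_id)s spanId=%(span_id)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example's output confined to this handler
    logger.addHandler(handler)
    return logger


# Example: the large payload goes to the log, while the span stays small.
out = io.StringIO()
log = configure_logger("checkout", out, lambda: ("463ac35c9f6413ad", "a2fb4a1d1a96d312"))
log.info("request headers: %s", {"accept": "*/*"})
print(out.getvalue(), end="")
```

With the IDs in every log line, bulky details such as full headers can live in the logging system and still be found from a trace.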

codefromthecrypt · 2016-06-01T03:43:09Z

sorry if this overloads this issue, so yay|nay on inclusion, or tell me to punt it.

The below is a "day 2" problem that relates to this topic. cc'ing one of our ops advocates @gena01

One less obvious thing about span size is that consistent span size makes operations easier. For example, I've often seen spans/minute or similar used as the health signal of the tracing system. While some instrumentation reports span size metrics (histograms etc.), a count over time is simpler. If span sizes range over several orders of magnitude, the system will perform differently and span/time metrics become less effective. Moreover, anything layered on top of this, such as scaling or capacity planning, becomes more complex.

jcarres-mdsol · 2016-06-01T07:13:54Z

I think it is a good idea. Small spans are a benefit all around.

codefromthecrypt · 2016-09-09T07:38:15Z

thx for the feedback, folks

abesto · 2016-09-11T09:38:01Z

Never considered span sizes as a potential problem before, this is an important addition. Thank you.

yurishkuro · 2016-09-11T16:40:45Z

pages/instrumenting.md

 the URI of the call will help with later analysis of requests coming into the
 service.

+The primary use case of binary annotations is exact match search. That said, it is


I don't think this is an accurate statement. I would even say the opposite: the majority of binary annotations are never used for search; they are used to add contextual data to the span.

I'll reword this, though I wouldn't go so far as to say it's the opposite, even
if search isn't used in your practice.

codefromthecrypt · 2016-09-12T00:23:17Z

NP!

@yurishkuro

Thanks, @yurishkuro

virtuald reviewed Jun 1, 2016

codefromthecrypt merged commit bb60298 into master Sep 9, 2016

codefromthecrypt deleted the bounded-spans branch September 9, 2016 07:38

yurishkuro reviewed Sep 11, 2016

codefromthecrypt pushed a commit that referenced this pull request Sep 12, 2016

Post-merge feedback on #35

86745af

Thanks, @yurishkuro

dancer1325 mentioned this pull request Jul 25, 2024

doc(pages.instrumenting.md): Which do two endpoints exception refer? #181

Open


8 participants
