@codefromthecrypt
Member

Over the last year I've noticed a problem where instrumentors accidentally send too much data.

For example, the problem surfaces as errors such as frame size exceeded (e.g. 64MiB). While a lot of
systems can handle the load, there's often an error when something like this is going on. For example,
someone might be replicating the entire HTTP message in a trace, or storing all headers.

At the moment, only Finagle's tracer bounds the size of a span before placing it on the transport. For
example, it makes sure a Scribe message flushes at 5MiB. I've noticed others mention they alert
when their instrumentation reports very large spans.
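
To make the bounding idea concrete, here is a minimal sketch of a buffer that flushes before a pending message grows past a byte limit. It is not Finagle's implementation: the `BoundedSpanBuffer` class, its `flush` callback, and the choice to drop (and count) a single span larger than the limit are all assumptions made for illustration.

```python
from typing import Callable, List


class BoundedSpanBuffer:
    """Hypothetical buffer that keeps each outgoing message under max_bytes."""

    def __init__(self, flush: Callable[[List[bytes]], None],
                 max_bytes: int = 5 * 1024 * 1024):
        self._flush = flush            # assumed transport callback
        self._max_bytes = max_bytes    # 5MiB by default, as in the example above
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self.dropped = 0               # oversized spans, worth alerting on

    def add(self, encoded_span: bytes) -> bool:
        size = len(encoded_span)
        if size > self._max_bytes:
            # A single span that can never fit is dropped and counted.
            self.dropped += 1
            return False
        if self._pending_bytes + size > self._max_bytes:
            # Flush first so the pending message never exceeds the bound.
            self.flush()
        self._pending.append(encoded_span)
        self._pending_bytes += size
        return True

    def flush(self) -> None:
        if self._pending:
            self._flush(self._pending)
            self._pending = []
            self._pending_bytes = 0
```

Counting dropped spans gives the alerting signal mentioned above without ever reaching a transport error.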

Getting to transport errors is the worst place to help folks. I figured adding some docs here might
obviate this sort of problem, or at least give folks a good chance to start with better practice.

@codefromthecrypt
Member Author

cc'ing a bunch of tracer authors (noting I'm sure I'm missing many):
@Horusiath @dawallin @marcingrzejszczak @kristofa @yurishkuro @sveinnfannar @jplock @jcarres-mdsol @prat0318

Spans contain identifying information such as traceId, spanId, parentId, and
RPC name.

Spans are usually small. For example, the serialized form is often measured in

How about "spans are expected to be typically small" instead?

Member Author

@codefromthecrypt codefromthecrypt Jun 1, 2016

or maybe "Zipkin's design assumes spans are small (orders of kilobytes or less)"

ps @mosesn this is a fair statement?

The main idea I'm trying to relay is that being space efficient was a constant in a lot of Zipkin's design, e.g. how IP addresses are serialized into numbers, and how Finagle's tracer only picks a couple of fields so that it ends up with quite small (hundreds of bytes) spans. If higher orders (like MiB) were intended, people wouldn't bother with things like this.
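
As a rough illustration of the space saving from serializing IP addresses into numbers, the sketch below packs an IPv4 address into 4 bytes instead of its dotted text form. The helper names are hypothetical and this is not Zipkin's encoder; it only uses Python's standard `ipaddress` and `struct` modules.

```python
import ipaddress
import struct


def ipv4_to_int(address: str) -> int:
    """Convert dotted-quad text into a 32-bit unsigned integer."""
    return int(ipaddress.IPv4Address(address))


def int_to_ipv4(value: int) -> str:
    """Convert a 32-bit unsigned integer back into dotted-quad text."""
    return str(ipaddress.IPv4Address(value))


def pack_ipv4(address: str) -> bytes:
    """Encode the address as 4 big-endian bytes."""
    return struct.pack(">I", ipv4_to_int(address))


text_form = "192.168.0.1".encode("ascii")  # 11 bytes as text
packed_form = pack_ipv4("192.168.0.1")     # 4 bytes as a number
```

The saving per address is a handful of bytes, which only matters because spans are written at very high volume.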

Yes, I think that's right. Even with all of these optimizations, the write throughput is often enormous.

@codefromthecrypt
Member Author

Thinking maybe I could also add a counter-advice here..

"Zipkin integrates with general purpose logging, as opposed to replacing it."

Below details need work (cc @abesto @anuraaga @clehene @klingerf particularly for help tidying this part), I suppose I could pull it into a separate PR.

Zipkin instrumentation routinely offers integration with logging systems, usually by adding trace identifiers to the logging context. This allows free-form logging to be correlated with a trace, and lets users leverage the greater sophistication of log queries.
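
A minimal sketch of what adding trace identifiers to the logging context can look like, using only Python's standard `logging` and `contextvars` modules. The `current_trace_id` variable, the `TraceIdFilter` class, and the log format are assumptions for illustration, not the API of any particular Zipkin tracer.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a real tracer would set this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never suppress the record, only annotate it


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("traced-app")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("[trace_id=%(trace_id)s] %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.handlers = [handler]
    return logger
```

With this in place, large free-form payloads can go to the log system and be looked up by trace ID, rather than being copied into the span itself.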

Zipkin (v1)'s query and indexing system was designed to help pinpoint traces based on known values and categories. For example, you can search by duration, or by tags like HTTP path, but not with fuzzy queries such as regular expressions.

@codefromthecrypt
Member Author

sorry if this overloads this issue, so yay|nay on inclusion, or tell me to punt it.

The below is a "day 2" problem that relates to this topic. cc'ing one of our ops advocates @gena01

One less obvious thing about span size is that consistent span sizes make operations easier. For example, I've often seen spans/minute or similar used as the health signal of the tracing system. While some instrumentation reports span size metrics (histograms etc.), count over time is simpler. In other words, the system will perform differently if span sizes range over several orders of magnitude, and span/time metrics would certainly be less effective. Moreover, anything layered on top of this, for example scaling or capacity planning, becomes more complex.
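
To show why a plain span count is only a good health signal when sizes are consistent, here is a small sketch that tracks the count alongside the spread of span sizes. The `SpanStats` class and its method names are hypothetical; real instrumentation would more likely report a histogram.

```python
import math


class SpanStats:
    """Hypothetical per-interval statistics for reported spans."""

    def __init__(self) -> None:
        self.count = 0
        self.total_bytes = 0
        self.min_bytes = None
        self.max_bytes = 0

    def record(self, size_bytes: int) -> None:
        size = max(size_bytes, 1)  # guard against zero-length input
        self.count += 1
        self.total_bytes += size
        self.max_bytes = max(self.max_bytes, size)
        if self.min_bytes is None or size < self.min_bytes:
            self.min_bytes = size

    def size_spread_orders(self) -> float:
        """Orders of magnitude between the smallest and largest span."""
        if self.count == 0:
            return 0.0
        return math.log10(self.max_bytes / self.min_bytes)
```

If the spread stays near zero, spans per minute is a fair proxy for load; if it reaches several orders of magnitude, the same count can hide very different byte volumes.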

@jcarres-mdsol

I think it is a good idea. Small spans bring benefits all over.

@codefromthecrypt codefromthecrypt merged commit bb60298 into master Sep 9, 2016
@codefromthecrypt codefromthecrypt deleted the bounded-spans branch September 9, 2016 07:38
@codefromthecrypt
Member Author

thx for the feedback, folks

@abesto
Member

abesto commented Sep 11, 2016

Never considered span sizes as a potential problem before, this is an important addition. Thank you.

the URI of the call will help with later analysis of requests coming into the
service.

The primary use case of binary annotations is exact match search. That said, it is

I don't think this is an accurate statement. I would even say the opposite: the majority of binary annotations are never used for search; they are used to add contextual data to the span.

Member Author

Will reword this, though I wouldn't go so far as to say it is the opposite, even
if in your practice search isn't used.

On 12 Sep 2016 00:40, "Yuri Shkuro" notifications@github.com wrote:

In pages/instrumenting.md
#35 (comment)
:

@@ -58,12 +58,31 @@ information about the RPC. For instance when calling an HTTP service, providing
the URI of the call will help with later analysis of requests coming into the
service.

+The primary use case of binary annotations is exact match search. That said, it is

@codefromthecrypt
Member Author

codefromthecrypt commented Sep 12, 2016 via email
