Make it easy to code failure cases when using core annotations #52

codefromthecrypt · 2016-09-27T01:44:02Z

Right now, most instrumentation do not model spans properly on error. For example, they log "sr" and "ss", but don't know what to do, or worse don't handle the error case. While we have a new "error" annotation, I think we should help people understand what to do with this, ideally learning from the past, or even adapting towards it.

Right now, Finagle addresses failure annotations consistently. They have a simple design that looks like this:

before: Annotation <- added at the beginning of a task
after: Annotation <- added unconditionally after a task
afterFailure: Annotation + ": $error.class: $error.message" <- additionally added when the task failed

These work for client and server spans, as well tasks within a span, such as wire send or wire fragment send. Notably, the failure annotation is based on the context. ex there's a different annotation prefix for a failed server span vs a failed wire send attempt within a span. In finagle, all error annotations include the word "Error".

For example, for a server span, the variables apply like this:

before: "sr"
after: "ss"
afterFailure: "Server Send Error: $error.class: $error.message"

Using a simplified format for annotations as {timestamp, annotation}
Say a client-server span succeeded, you have the following in finagle:

[
  {1, "cs"},
  {2, "sr"},
  {3, "ss"},
  {4, "cr"}
]

Say the server-side failed, but was able to send a failed response to the client. I'm making up exceptions here..

[
  {1, "cs"},
  {2, "sr"},
  {3, "Server Send Error: TimeoutException: socket timed out"}
  {4, "ss"},
  {5, "Client Receive Error: SocketException: socket closed"}
  {6, "cr"}
]

Notice! the bookend annotations "ss" "cr" are always added, this allows the spans to be marked complete in error case.

This is even more helpful when annotation bookends that are not modeled as a span are used, for example when a wire fragment failed to send.

I believe that we should encourage Finagle's approach. Logging a useful formatted string as opposed to the word "error" as an annotation, and always close core spans with the bookends.

This mostly requires no change to zipkin, except backfilling some constants that are in finagle's version of the thrift, and changing the ui to stain spans yellow when an annotation contains "error" vs being exactly "error". We can leave the binary annotation key of "error" alone.

Backfilling error constants could help in @rogeralsing's work, where he's trying to color spans according to annotations.

cc @mosesn @sveinnfannar @eirslett @kristofa @virtuald @openzipkin/cross-implementation-team-zipkin-pinpoint-htrace-etc

See openzipkin/zipkin#1140

Longer note on why not make this json!

Zipkin's model was designed for simple human-readable string annotations. While tempting to try and structure the error into its components (type, class, etc) in a json ball, doing so would literally be json inside json.

Even if we changed the ui to accommodate json embedded in a string, the raw json logged into storage (like elasticsearch or cassandra) would be garbled, making other tooling (like kibana, cql queries etc) worse.

Please resist the temptation to structure errors as escaped json and instead format them in some readable way.

Example impact of reporting an formatted string as opposed to a json encoded value

Here's an example real span as most zipkin libraries would report it:

{
  "traceId": "f66529c8cc356aa0",
--snip--
    {
      "timestamp": 1474934400200000,
      "value": "Client Receive Error: SocketException: socket closed",
      "endpoint": {
        "serviceName": "app",
        "ipv4": "172.17.0.2",
        "port": 8080
      }
    }
  ]
}

And here's how terrible it looks when encoding json in json string, something that some OpenTracing libraries are starting to do.

{
  "traceId": "f66529c8cc356aa0",
--snip--
    {
      "timestamp": 1474934400200000,
      "value": "{\n  \"type\": \"error\",\n  \"kind\": \"client\",\n  \"class\": \"SocketException\",\n  \"message\": \"socket closed\"\n}",
      "endpoint": {
        "serviceName": "app",
        "ipv4": "172.17.0.2",
        "port": 8080
      }
    }
  ]
}

What to do when a tracer doesn't accept strings, and instead accepts a raw map

Using normal text formatting for annotation values allow existing deployed zipkin installations to have the best experience, as they allow users to easily read json directly via api or the UI, and allow storage queries and tooling to work in the simplest manner. Tracers should always format annotations in a storage and user-friendly String before reporting to zipkin.

Dapper-based tracers only log strings, so there's no concern if using one of those. For example, finagle, brave, htrace, zipkin-csharp, pyzipkin etc don't accept structured logs (or require users or libraries to format them before sending to zipkin!)

Notably, OpenTracing recently changed to prefer logging raw maps while luckily still retaining the string log function. Unfortunately, users often don't act on tracers directly, so it isn't in their direct control which api to use.

OpenTracing zipkin bindings should include a formatting hook so that Map annotations can be stringified before reporting to zipkin. If they don't, they should mark clearly in their README the practices used to avoid json wrapped in a string. Either approach puts tradeoffs in the hands of the caller, before they accidentally start filling zipkin with data like this.

openzipkin-attic/zipkin-ui#11

The text was updated successfully, but these errors were encountered:

codefromthecrypt · 2016-09-27T02:01:02Z

made up a new distribution for those who own zipkin instrumentation. feel free to ignore invite if I mis-tagged you, or ask for an invite if I missed you @openzipkin/instrumentation-owners

codefromthecrypt · 2016-09-27T02:18:08Z

some things that might not be immediately obvious to folks who don't use normal zipkin spans

In a shared span, you can attribute errors to one side or the other via the endpoint of an annotation (but not by timestamp).
In a shared span, annotation values encode which role the host is taking, ex a client or server role. (ex "cs" is client and so could be "^Client.*Error" pattern)

Particularly, this flags existing finagle formatted errors, like so: "Server Send Error: TimeoutException: socket timed out" Fixes #1472 See openzipkin/openzipkin.github.io#52

codefromthecrypt · 2017-06-24T07:44:22Z

FYI, summary of right now is most use normal core annotations, and just add an "error" tag/binary annotation when there's an error. Use of timeline annotations to add transient errors isn't common, but not discouraged either.

This was referenced Oct 7, 2016

Start adding error annotation spring-cloud/spring-cloud-sleuth#384

Closed

Document "error" annotation/binary annotation #36

Closed

how do we idiomatically send cs, cr annotations Yelp/py_zipkin#13

Closed

codefromthecrypt mentioned this issue Dec 7, 2016

Allow run a trace in a closure openzipkin/brave#273

Closed

codefromthecrypt mentioned this issue Jan 5, 2017

UI: flag yellow for annotations that match case insensitive error openzipkin/zipkin#1472

Closed

codefromthecrypt mentioned this issue Jan 7, 2017

Flags yellow for annotations that match case insensitive error openzipkin/zipkin#1475

Merged

praveenbarli mentioned this issue Jun 27, 2017

data model changes to add name, id and dependency for sr annotaiton praveenbarli/zipkin-azure#5

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make it easy to code failure cases when using core annotations #52

Make it easy to code failure cases when using core annotations #52

codefromthecrypt commented Sep 27, 2016 •

edited

Loading

codefromthecrypt commented Sep 27, 2016

codefromthecrypt commented Sep 27, 2016 •

edited

Loading

codefromthecrypt commented Jun 24, 2017

Make it easy to code failure cases when using core annotations #52

Make it easy to code failure cases when using core annotations #52

Comments

codefromthecrypt commented Sep 27, 2016 • edited Loading

Longer note on why not make this json!

Example impact of reporting an formatted string as opposed to a json encoded value

What to do when a tracer doesn't accept strings, and instead accepts a raw map

codefromthecrypt commented Sep 27, 2016

codefromthecrypt commented Sep 27, 2016 • edited Loading

codefromthecrypt commented Jun 24, 2017

codefromthecrypt commented Sep 27, 2016 •

edited

Loading

codefromthecrypt commented Sep 27, 2016 •

edited

Loading