Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it easy to code failure cases when using core annotations #52

Open
codefromthecrypt opened this issue Sep 27, 2016 · 3 comments
Open

Comments

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Sep 27, 2016

Right now, most instrumentation do not model spans properly on error. For example, they log "sr" and "ss", but don't know what to do, or worse don't handle the error case. While we have a new "error" annotation, I think we should help people understand what to do with this, ideally learning from the past, or even adapting towards it.

Right now, Finagle addresses failure annotations consistently. They have a simple design that looks like this:

before: Annotation <- added at the beginning of a task
after: Annotation <- added unconditionally after a task
afterFailure: Annotation + ": $error.class: $error.message" <- additionally added when the task failed

These work for client and server spans, as well tasks within a span, such as wire send or wire fragment send. Notably, the failure annotation is based on the context. ex there's a different annotation prefix for a failed server span vs a failed wire send attempt within a span. In finagle, all error annotations include the word "Error".

For example, for a server span, the variables apply like this:

before: "sr"
after: "ss"
afterFailure: "Server Send Error: $error.class: $error.message"

Using a simplified format for annotations as {timestamp, annotation}
Say a client-server span succeeded, you have the following in finagle:

[
  {1, "cs"},
  {2, "sr"},
  {3, "ss"},
  {4, "cr"}
]

Say the server-side failed, but was able to send a failed response to the client. I'm making up exceptions here..

[
  {1, "cs"},
  {2, "sr"},
  {3, "Server Send Error: TimeoutException: socket timed out"}
  {4, "ss"},
  {5, "Client Receive Error: SocketException: socket closed"}
  {6, "cr"}
]
  • Notice! the bookend annotations "ss" "cr" are always added, this allows the spans to be marked complete in error case.

This is even more helpful when annotation bookends that are not modeled as a span are used, for example when a wire fragment failed to send.

I believe that we should encourage Finagle's approach. Logging a useful formatted string as opposed to the word "error" as an annotation, and always close core spans with the bookends.

This mostly requires no change to zipkin, except backfilling some constants that are in finagle's version of the thrift, and changing the ui to stain spans yellow when an annotation contains "error" vs being exactly "error". We can leave the binary annotation key of "error" alone.

Backfilling error constants could help in @rogeralsing's work, where he's trying to color spans according to annotations.

cc @mosesn @sveinnfannar @eirslett @kristofa @virtuald @openzipkin/cross-implementation-team-zipkin-pinpoint-htrace-etc

See openzipkin/zipkin#1140

Longer note on why not make this json!

Zipkin's model was designed for simple human-readable string annotations. While tempting to try and structure the error into its components (type, class, etc) in a json ball, doing so would literally be json inside json.

Even if we changed the ui to accommodate json embedded in a string, the raw json logged into storage (like elasticsearch or cassandra) would be garbled, making other tooling (like kibana, cql queries etc) worse.

Please resist the temptation to structure errors as escaped json and instead format them in some readable way.

Example impact of reporting an formatted string as opposed to a json encoded value

Here's an example real span as most zipkin libraries would report it:

{
  "traceId": "f66529c8cc356aa0",
--snip--
    {
      "timestamp": 1474934400200000,
      "value": "Client Receive Error: SocketException: socket closed",
      "endpoint": {
        "serviceName": "app",
        "ipv4": "172.17.0.2",
        "port": 8080
      }
    }
  ]
}

And here's how terrible it looks when encoding json in json string, something that some OpenTracing libraries are starting to do.

{
  "traceId": "f66529c8cc356aa0",
--snip--
    {
      "timestamp": 1474934400200000,
      "value": "{\n  \"type\": \"error\",\n  \"kind\": \"client\",\n  \"class\": \"SocketException\",\n  \"message\": \"socket closed\"\n}",
      "endpoint": {
        "serviceName": "app",
        "ipv4": "172.17.0.2",
        "port": 8080
      }
    }
  ]
}

What to do when a tracer doesn't accept strings, and instead accepts a raw map

Using normal text formatting for annotation values allow existing deployed zipkin installations to have the best experience, as they allow users to easily read json directly via api or the UI, and allow storage queries and tooling to work in the simplest manner. Tracers should always format annotations in a storage and user-friendly String before reporting to zipkin.

Dapper-based tracers only log strings, so there's no concern if using one of those. For example, finagle, brave, htrace, zipkin-csharp, pyzipkin etc don't accept structured logs (or require users or libraries to format them before sending to zipkin!)

Notably, OpenTracing recently changed to prefer logging raw maps while luckily still retaining the string log function. Unfortunately, users often don't act on tracers directly, so it isn't in their direct control which api to use.

OpenTracing zipkin bindings should include a formatting hook so that Map annotations can be stringified before reporting to zipkin. If they don't, they should mark clearly in their README the practices used to avoid json wrapped in a string. Either approach puts tradeoffs in the hands of the caller, before they accidentally start filling zipkin with data like this.

openzipkin-attic/zipkin-ui#11

@codefromthecrypt
Copy link
Member Author

made up a new distribution for those who own zipkin instrumentation. feel free to ignore invite if I mis-tagged you, or ask for an invite if I missed you @openzipkin/instrumentation-owners

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Sep 27, 2016

some things that might not be immediately obvious to folks who don't use normal zipkin spans

  • In a shared span, you can attribute errors to one side or the other via the endpoint of an annotation (but not by timestamp).
  • In a shared span, annotation values encode which role the host is taking, ex a client or server role. (ex "cs" is client and so could be "^Client.*Error" pattern)

codefromthecrypt pushed a commit to openzipkin/zipkin that referenced this issue Jan 7, 2017
Particularly, this flags existing finagle formatted errors, like so:
"Server Send Error: TimeoutException: socket timed out"

Fixes #1472
See openzipkin/openzipkin.github.io#52
codefromthecrypt pushed a commit to openzipkin/zipkin that referenced this issue Jan 8, 2017
Particularly, this flags existing finagle formatted errors, like so:
"Server Send Error: TimeoutException: socket timed out"

Fixes #1472
See openzipkin/openzipkin.github.io#52
@codefromthecrypt
Copy link
Member Author

FYI, summary of right now is most use normal core annotations, and just add an "error" tag/binary annotation when there's an error. Use of timeline annotations to add transient errors isn't common, but not discouraged either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant