
2018 08 06 Zipkin and Armeria at LINE Fukuoka

Adrian Cole edited this page Sep 28, 2019 · 3 revisions

This is a capture of a few days' workshop held 6-8 August 2018 at LINE HQ Fukuoka and thereabouts.

Introduction

LINE uses Armeria server-side with RxJava2 and sometimes Kotlin (coroutines). They are looking at other tech such as Reactor's subscriberContext, but it seems a bit hard to use (because it is an explicit API).

When working with Armeria and RxJava2, every observable uses observeOn to ensure it runs on the correct scheduler (in this case, the Armeria event loop). This is because code primarily needs Armeria as well as Zipkin. They don't need to use the Zipkin context directly, as Armeria's executor is already instrumented with Brave: https://github.com/line/armeria/pull/1322
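The observeOn pattern can be mirrored with plain CompletableFuture: continuations are pinned to one designated executor. This is a minimal sketch, assuming a single-thread executor standing in for Armeria's event loop; the executor and thread name here are illustrative, not Armeria API.

```java
import java.util.concurrent.*;

public class EventLoopAffinity {
    /** Returns the thread name the continuation ran on. */
    static String runOnEventLoop() throws Exception {
        // Stand-in for the Armeria event loop: a single named thread.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "event-loop"));
        try {
            // Work may complete on an arbitrary pool thread, but
            // thenApplyAsync(fn, executor) hops back to the designated
            // executor, much like RxJava's observeOn(scheduler).
            return CompletableFuture
                    .supplyAsync(() -> "payload")
                    .thenApplyAsync(v -> Thread.currentThread().getName(), eventLoop)
                    .get();
        } finally {
            eventLoop.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runOnEventLoop()); // prints "event-loop"
    }
}
```

Pinning continuations this way is why the app doesn't need to touch the Zipkin context directly: the executor itself can carry the instrumentation.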

One part of the app has an L2 Redis cache implemented with Caffeine. Currently there is some trouble with brave-context-rxjava2 and Caffeine: Caffeine caches the completable future, and multiple requests re-use the same future. This could make some nesting occur, particularly in Armeria, which has a feature, RequestContext.onChild, that extends a variable from the original context: https://github.com/line/armeria/pull/1262

We get to hack and explore the crazy world of hard tracing. Please contribute topics or things you'd like to focus on here. We can make it prettier later.

Hard things we might work on

RxJava propagation

So, we have this new RxJava propagation thing in Brave... does it work? If it does, where does it not work? What's an idiomatic way to layer span lifecycle over this, for example starting a span in one callback and finishing it in another? What are the tracing use cases for Rx?

  • Reactor's Flux has subscriberContext.

Blocking queue with consumers on another thread!

We were asked in Beijing about something like a blocking queue, where a normal POJO enters a queue and another thread blocks on the other side. How do you trace such a thing? Approaches include wrapping/proxying, a secondary channel for the trace context, or an identity map to the trace context. Below we can try to flesh this out.
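One of the approaches above (a secondary channel for the trace context) can be sketched by pairing each POJO with the context captured at enqueue time. `TraceContext` and `Traced` here are plain value classes invented for illustration, not Brave's real API.

```java
import java.util.concurrent.*;

public class TracedQueue {
    // Stand-in for Brave's TraceContext: only carries an ID, for illustration.
    record TraceContext(long traceId) {}

    // The "secondary channel": the POJO travels together with the context
    // that was current when it was enqueued.
    record Traced<T>(TraceContext context, T message) {}

    /** Enqueues on this thread, consumes on another, returns what the consumer saw. */
    static String sendAndReceive(long traceId, String message) throws Exception {
        BlockingQueue<Traced<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService consumer = Executors.newSingleThreadExecutor();
        try {
            Future<String> seen = consumer.submit(() -> {
                Traced<String> taken = queue.take(); // blocks like the real consumer
                // Real code would open a scope with taken.context() here
                // before processing the message.
                return taken.context().traceId() + ":" + taken.message();
            });
            queue.put(new Traced<>(new TraceContext(traceId), message));
            return seen.get();
        } finally {
            consumer.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sendAndReceive(42L, "hello")); // prints "42:hello"
    }
}
```

The wrapping/proxying approach would hide the `Traced` pair behind the queue's own interface instead of exposing it to the consumer.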

ListenableFuture

Do we love or hate ListenableFuture? Do completion phases mess us up? If not why?

Caffeine

https://github.com/ben-manes/caffeine Caffeine caches the future, not just the value, so the first request to load a key effectively owns the computation.
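The sharing problem can be shown without Caffeine itself: a plain map standing in for an async cache hands every caller for the same key the same future, so any trace context attached to the first load is observed by later callers too. A minimal sketch, assuming a ConcurrentHashMap in place of Caffeine's cache.

```java
import java.util.Map;
import java.util.concurrent.*;

public class SharedFutureCache {
    /** Two lookups of the same key observe one shared future. */
    static boolean sharesFuture() {
        // Stand-in for a future-caching cache like Caffeine's: the *future*
        // is the cached value, so concurrent callers get the same instance.
        Map<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();
        CompletableFuture<String> first =
                cache.computeIfAbsent("key", k -> new CompletableFuture<>());
        CompletableFuture<String> second =
                cache.computeIfAbsent("key", k -> new CompletableFuture<>());
        // Because first == second, callbacks the second "request" attaches
        // run under whatever context completes the first load.
        return first == second;
    }

    public static void main(String[] args) {
        System.out.println(sharesFuture()); // prints "true"
    }
}
```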

Kafka Streams

Now that Kafka 2.0 is released, headers are accessible in KStreams. All streaming operations occur on the same thread. We are starting to think through a tracing implementation for such a use case.

Current questions:

  • Should we use a wrapper around the StreamsBuilder to retrieve the header context when the stream is built, and close spans / inject headers when to() is called?

    builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
        .mapValues(val -> val + "_foo")
        .to(OUTPUT_TOPIC);

    TracingBuilder tracingBuilder = TracingBuilder.wrap(builder);
    KafkaStreams streams = new KafkaStreams(tracingBuilder.build(), props);
  • Or should we create custom Transformers (input / output) responsible for span creation / completion? Not sure it's possible to inject headers like that, but it could be useful for operations without a Kafka send.

    builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
        .transform(() -> new TracingInputTransformer<>(tracing))
        .mapValues(val -> val + "_foo")
        .transform(() -> new TracingOutputTransformer<>(tracing))
        .to(OUTPUT_TOPIC);

It also seems we can't instrument KTables, as they are stateful and don't accept .transform() operations (we need to verify this).

Day 1 notes

Armeria is an RPC framework, like gRPC. When writing an app, you can safely assume that there will always be a request context. This context has attributes and can also be accessed implicitly (RequestContext.current()). So in a system like this, you can skip using a normal thread local entirely, instead using the already existing context provided by Armeria.

Re-jigging to use that wasn't much effort. What was interesting was trying to make the app fail fast if the Brave context hook wasn't installed. To achieve this we tried a few things, but they were all horrible: the problem is that scoping can be wrapped (e.g. with MDC), so you can't just check "is instance of RequestContextScope".

@anuraaga had a pretty interesting idea of a ping-pong, where we add a ping to a trace context and see if something pong'ed it (in this case, our Armeria implementation). Since Brave has this TraceContext.extra thing anyway, we could borrow it to do the state check, and by doing so we can verify well before the setup would be used. Otherwise we could still check, but if things were incorrectly set up, it would fail at a high rate (like on every request).
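The ping-pong idea can be sketched with a marker object: the tracer "pings" by attaching the marker, the integration "pongs" it when correctly installed, and a fail-fast check only inspects the flag. The `Ping` class and method names are invented for illustration, standing in for an entry in Brave's TraceContext.extra; this is not Brave's real API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class PingPongCheck {
    // Illustrative stand-in for a marker carried in TraceContext.extra():
    // the tracer "pings" by adding it; the integration "pongs" by marking it.
    static final class Ping {
        private final AtomicBoolean ponged = new AtomicBoolean(false);
        void pong() { ponged.set(true); }
        boolean wasPonged() { return ponged.get(); }
    }

    /** Fail fast at setup time if the context hook never answered. */
    static void requirePonged(Ping ping) {
        if (!ping.wasPonged()) {
            throw new IllegalStateException(
                "RequestContext hook not installed: ping was never pong'ed");
        }
    }

    public static void main(String[] args) {
        Ping ping = new Ping();
        ping.pong();         // a correctly installed integration answers
        requirePonged(ping); // passes; without pong() it would throw
        System.out.println("hook verified");
    }
}
```

The point of checking at setup time is that a misconfiguration surfaces once, at startup, instead of failing on every request.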

Outcomes

ScopeDecorator

Brave 5.2 also introduces ScopeDecorator, which is a way to add things like log4j2 trace ID correlation without affecting which thread locals are used by Brave.

Formerly we did log4j2 integration by wrapping our thread local with something to synchronize log4j2's thread locals:

    currentTraceContext = ThreadContextCurrentTraceContext.create(CurrentTraceContext.Default.create());

Now, we can do the same via decorating plugins, which is more elegant:

    currentTraceContext = ThreadLocalCurrentTraceContext.newBuilder() // use thread local trace context
          .addScopeDecorator(ThreadContextScopeDecorator.create()) // with log4j2 trace ID correlation
          .build();

More importantly, you can swap out the thread-local backend without affecting your decorators:

    currentTraceContext = RequestContextCurrentTraceContext.newBuilder() // use Armeria's request context
          .addScopeDecorator(ThreadContextScopeDecorator.create()) // with log4j2 trace ID correlation
          .build();

And flexibly, you can now weave in multiple aspects, such as our strict checker:

    currentTraceContext = RequestContextCurrentTraceContext.newBuilder() // use Armeria's request context
          .addScopeDecorator(ThreadContextScopeDecorator.create()) // with log4j2 trace ID correlation
          .addScopeDecorator(StrictScopeDecorator.create()) // complain if closed on the wrong thread
          .build();

The major design contributor on this was @anuraaga with lots of review help by @kojilin. Thanks both!

Sidebar on Zipkin Fukuoka workshop

The below discusses advanced trace context code developed at a Zipkin workshop at LINE's office in Fukuoka, Japan. The main contributions were from folks who work on both Brave and Armeria, an asynchronous RPC library. This all led to what you can do above, done by volunteers, some on holiday! As a treat, we've also put in some diagrams.

Armeria's RequestContext

First, let's review request and response processing. In the case of Armeria, a RequestContext is set up to scope data about a request/response exchange. We use >>> and <<< to indicate the direction of the network, and to note that the same context is used for both directions.

┌──────────────────────────────────────────┐┌────────────────────────────────────────────┐ 
│          >>> Request Processing          ││          <<< Response Processing           │ 
└──────────────────────────────────────────┘└────────────────────────────────────────────┘ 
┌>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<┐
│  Armeria Server Request Context                                                         │
└──────────────────────────────────────╦──────────────────────────────────────────────────┘
                                       ║                                                   
                                       ║                                                   
      .───────────────.                ║                                                   
  _.─'                 `──.      ┌────▶║                                                   
 ╱    A client request     ╲     │     ║┌───────────────────────┐┌───────────────────────┐ 
(   context is a copy of    )────┘     ║│>>> Request Processing ││<<< Response Processing│ 
 `.  the server request   ,'           ║└───────────────────────┘└───────────────────────┘ 
   `──.               _.─'             ▽>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<┐
       `─────────────'                 │  Armeria Client Request Context                  │
                                       └──────────────────────────────────────────────────┘

As noted above, in the relevant setup, a client context can fork certain values from the server's context. Regardless of whether any values are copied, it is important to note that client contexts are separate from server contexts: changes made to the client context do not affect the server.

Teaching Brave to use Armeria's RequestContext

By default, Brave uses thread local storage to hold the current span's trace context. Formerly, Armeria used lifecycle hooks to coordinate these two thread local stores. Starting in Armeria v0.69, Brave uses the above RequestContext model to store its TraceContext. Here's how it works:

When Brave scopes a trace context, for example a server span, it now writes an entry in Armeria's current RequestContext. When a client request context is created, it forks the last value seen on the server context. This allows asynchronous commands later to be associated with the correct position in the trace.

  ┌>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<┐
  │  Armeria Server Request Context                                                         │
  └──△111111111111111111△222222222222222222△111△11111111111111111111111111111111111111111111┘
     │                  │                ║ │   │◀─────┐             .───────────────.        
    ┌○──────────────────┼────────────────╬─┼───●┐     │        _.──'                 `───.   
    │newScope(server 1) │                ║ │    │     │       ╱    The server's trace     ╲  
    └───────────────────┼────────────────╬─┼────┘     └──────(  context is retained even   ) 
              ▲        ┌◇────────────────╬─◆┐                 `.  after it is descoped   ,'  
              │        │newScope(local 2)║  │                   `───.               _.──'    
              │        └─────────────────╬──┘                        `─────────────'         
              │               ▲          ▽>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<┐
              │               │          │  Armeria Client Request Context                  │
      .───────┴───────.    ┌──┘          └222△3333333333333333△22222222222222222222222222222┘
  _.─'                 `──.│                 │                │                              
 ╱  Each trace scope sets  ╲                ┌□────────────────■┐                             
(   trace identifiers and   ) ────────────▶ │newScope(client 3)│                             
 `. reverts them on close ,'                └──────────────────┘                             
   `──.               _.─'                                                                   
       `─────────────'                                                                       

You'll see numbers above like 22222211111. This is just showing which span ID is present, highlighting values in the context that change over time. For example, 222 says it remained span ID 2, and 22211 says it changed from 2 to 1. In other words, this is a scrappy timeline diagram.

The more interesting part is that while scoping restores the prior ID on close, this is not the case for the initial server span. Since an Armeria RequestContext is provisioned per server request, there's no need to revert the trace IDs associated with that request. By not reverting, it also allows any response callbacks to be associated with the correct server span!
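The revert-on-close behavior can be sketched with a plain ThreadLocal: a scope saves the previous value and restores it on close, and the server-span special case is simply a scope that skips the restore. A minimal sketch, not Brave's implementation.

```java
public class ScopeSketch {
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Sets spanId and returns a closer that restores whatever was there before. */
    static AutoCloseable newScope(String spanId) {
        String previous = CURRENT.get();
        CURRENT.set(spanId);
        return () -> CURRENT.set(previous); // revert on close
    }

    public static void main(String[] args) throws Exception {
        try (AutoCloseable server = newScope("1")) {
            try (AutoCloseable client = newScope("2")) {
                System.out.println(CURRENT.get()); // prints "2"
            }
            System.out.println(CURRENT.get());     // prints "1" (restored)
        }
        // A per-request context (as in Armeria) could skip the outermost
        // revert, so response callbacks would still see span "1".
        System.out.println(CURRENT.get());         // prints "null"
    }
}
```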

Closing notes

Most people won't need to understand the mechanics described above, but they are helpful to those trying to understand a library in its own semantics. By having Brave use Armeria's context, the integration is more natural to the maintainers, and so less error-prone. Please look forward to Armeria 0.70, which will have even more integration!

This integration was the result of significant brainstorming, design and review by @anuraaga, @kojilin and @trustin. Please thank them directly if you use any of this, or even if you just like reading about it. If you have any questions or feedback, feel free to contact us on https://gitter.im/openzipkin/zipkin
