
2018 10 29 Zipkin UI at LINE Tokyo


LINE has roughly 1.5 engineers working on Zipkin and, through that experience, has created its own UI internally.


Our goal is to understand how this work can be open sourced, or merged with our existing UI in a way that leverages LINE's work and keeps it maintained.

Date

29 Oct - 1 Nov 2018 during working hours in JST (UTC +9)

Location

LINE HQ JR Shinjuku Miraina Tower, 23rd Floor

Folks attending will need to coordinate offline

Output

We will add notes at the bottom of this document including links to things discussed and any takeaways.

Attendees

The scope of this workshop assumes attendees are intimately familiar with Zipkin, as we have to discuss details of UI design.

Attending in person is welcome, but space on location is constrained. Remote folks can join via Gitter and attend a TBD video call.

Attending on-site

  1. Huy, LINE
  2. Igarashi, LINE
  3. Adrian, Pivotal

Homework

Please review the following before the meeting

Major issues:

Alternative Zipkin UIs:

  • React UI - active, but a single-person effort and incomplete
  • Angular UI - a cautionary tale, as it was abandoned

UIs made for other systems that work against the Zipkin API:

  • Haystack UI - some functionality works against a stock Zipkin server

Agenda

If any segment is bold, it will be firmly coordinated for remote folks. Other segments may be lax or completely open ended.

Monday, Oct 29

2:00pm Introductions
2:30pm Overview of Zipkin's current UI
4:00pm Overview of the internal UI and its motivation
5:30pm Overview of the dependency work

Tuesday, Oct 30

1:00pm Vinay on haystack-ui
3:00pm Discussion on shared spans and impact to UI

Wednesday, Oct 31

12:00pm LINE Team lunch and social
2:30pm UI hacking

Thursday, Nov 1

10:00am UI hacking
5:30pm round up
6:00pm Beers at The Griffon Shinjuku

Outcomes

Planning

Came away with a plan to improve the UI and integrate LINE functionality:

  • Adrian will refactor the existing UI to the v2 model so that migration is easy, reviewing with Igarashi and Raja for knowledge transfer
  • Huy will start the process to open source the internal work; this is paperwork related.
  • Igarashi will lead the design of a revamp or replacement of the Zipkin dependencies screen
  • We will integrate the two UIs once both are on the v2 model, and use normal GitHub issues for new features like auto-completing site-specific tags.

We agreed that some important features will be led by LINE once the basic UI replacement is OSS:

  • contextual search like Haystack UI and AWS X-Ray (e.g. completion, and holding context constant across screens like trace search and dependencies)
  • site-specific auto-complete tags (this implies an indexing API, but that is not too bad)

Social

  • Good connection between Huy, Igarashi and Adrian for following up on UI work
  • Met the Expedia Haystack UI team and learned how it works and their motivations
  • Learned about Zipkin usage from users at LINE

Code improvements

Documentation improvements

Notes

10-29

Monday, we met and reviewed the various connection points between the Zipkin data model and the UI, most importantly the "service" concept. For example, at LINE they had users who would put "junk" into the service name field. Their new UI would dodge this problem by using a similarly purposed "InstanceID" tag (which was indexed separately in Elasticsearch). After some discussion, we (Huy, Igarashi and Adrian) came to the conclusion that sites should use the most appropriate value for "service", especially one that has clear guidance. As such, if "InstanceID" is a site-specific service identifier, it may be the best value to use for the Zipkin service name, as this causes less work and confusion.

We also discussed the knock-on effects of bad service names, namely that the service graph can become useless, as can any future tooling that builds on aggregations. LINE was already aware of some of the efforts to mitigate bad service names, such as the blog post by SoundCloud.

LINE desires service graph functionality, but does not necessarily buy into a view that shows the entire architecture. We discussed visualization problems and that perhaps 15-25 elements of interest is the maximum a user can understand at a time. They've thought about a profile aspect to the UI, where you can tell which services the user owns and so only show, say, one service up and three down from that. Extending the network could be implicit; for example, favoriting traces could hint at services of interest.
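
To make the "show only what I own" idea concrete, here is a minimal sketch in TypeScript (the types and function names are assumptions, not existing Zipkin code) of trimming the links returned by /api/v2/dependencies to one hop upstream and three hops downstream of a user's services:

```typescript
// Illustrative sketch: trim a dependency graph to the neighborhood of the
// services a user owns, e.g. 1 hop upstream and 3 hops downstream.
// The link shape matches Zipkin's /api/v2/dependencies response.
interface DependencyLink {
  parent: string;
  child: string;
  callCount: number;
  errorCount?: number;
}

function neighborhood(
  links: DependencyLink[],
  owned: Set<string>,
  upHops = 1,
  downHops = 3
): DependencyLink[] {
  const keep = new Set(owned);
  // walk upstream (callers) a limited number of hops
  let frontier = new Set(owned);
  for (let i = 0; i < upHops; i++) {
    frontier = new Set(links.filter(l => frontier.has(l.child)).map(l => l.parent));
    frontier.forEach(s => keep.add(s));
  }
  // walk downstream (callees) a limited number of hops
  frontier = new Set(owned);
  for (let i = 0; i < downHops; i++) {
    frontier = new Set(links.filter(l => frontier.has(l.parent)).map(l => l.child));
    frontier.forEach(s => keep.add(s));
  }
  // only keep links where both ends are in the neighborhood
  return links.filter(l => keep.has(l.parent) && keep.has(l.child));
}
```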

We also discussed that the trace view can be large, and it can be hard to tell if one trace is the same as another due to cardinality concerns, such as the count of Redis calls. Extending the trace view to include a trace-specific aggregation could help compare traces visually, as their service aggregates should appear the same.
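
A rough sketch of that trace-specific aggregation, assuming the v2 span shape (this is not existing UI code): two traces whose per-service aggregates match are likely "the same shape" even if the raw span list looks noisy.

```typescript
// Sketch: aggregate a single trace's spans by service name.
interface SpanLike {
  localEndpoint?: { serviceName?: string };
  duration?: number; // microseconds
}

interface ServiceAggregate {
  serviceName: string;
  spanCount: number;
  totalDurationMicros: number;
}

function aggregateByService(trace: SpanLike[]): ServiceAggregate[] {
  const byService = new Map<string, ServiceAggregate>();
  for (const span of trace) {
    const name = span.localEndpoint?.serviceName ?? 'unknown';
    const agg = byService.get(name) ?? {
      serviceName: name,
      spanCount: 0,
      totalDurationMicros: 0,
    };
    agg.spanCount++;
    agg.totalDurationMicros += span.duration ?? 0;
    byService.set(name, agg);
  }
  return [...byService.values()].sort((a, b) => a.serviceName.localeCompare(b.serviceName));
}
```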

Another feature discussed is more indexing flexibility: for example, a configuration to choose which tags become built-in search controls, possibly affecting indexing directly or reflecting custom templates. The key here is to declare a mapping such as "Instance Id" to "instance.id", plus a way to populate the values for auto-completion, which is similar in nature to how service/span names work.
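
As an illustration only (the configuration shape and the autocomplete endpoint below are assumptions for this sketch, not an existing Zipkin API), the idea is roughly:

```typescript
// Hypothetical configuration for site-specific search controls.
interface SearchTagConfig {
  displayName: string; // what the search form shows, e.g. "Instance Id"
  tagKey: string;      // the span tag that is indexed, e.g. "instance.id"
}

const siteSearchTags: SearchTagConfig[] = [
  { displayName: 'Instance Id', tagKey: 'instance.id' },
  { displayName: 'Country', tagKey: 'user.country' },
];

// Populating values works like service/span name auto-completion: ask the
// backend for the distinct values indexed for a given tag key. The endpoint
// name here is illustrative.
async function autocompleteValues(tagKey: string): Promise<string[]> {
  const res = await fetch(`/api/v2/autocompleteValues?key=${encodeURIComponent(tagKey)}`);
  return res.json();
}
```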

10-30

haystack UI

Attendees:

  • Magesh (Haystack lead), Vinay (UI/UX lead), Abhishek (UI eng), Jason (UI eng)
  • Adrian, Huy, Igarashi

Authentication isn't in the OSS Haystack UI by default, but it is integrated internally at Expedia (Passport). The OSS version can plug in anything, such as SAML or Passport. The app revolves around the search bar, whose inputs are effectively tags. The service graph is based on Netflix Vizceral; the colors of the nodes reflect their state, and its data is fed independently from the trace data. The UI has a config/base.js, and "connectors" are reflective of modules such as serviceGraph. They can be deployed individually.
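
A sketch of the connector-per-module idea (the field names below are invented for illustration, not haystack-ui's actual config schema):

```typescript
// Invented field names, illustrating that each UI module ("connector") points
// at its own backend and can be enabled or deployed independently.
const uiConfig = {
  connectors: {
    traces: { type: 'zipkin', baseUrl: 'http://zipkin:9411/api/v2' },
    serviceGraph: { type: 'vizceral-feed', baseUrl: 'http://graph-feeder:8080' },
    trends: { enabled: false }, // feature-switched off until it matures
  },
};
```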

Haystack has ES (search attributes, tags), Cassandra (spans) and Kafka (the backbone). Because we have ES, we can use the same cluster for things like configuration (convenience).

If deploying the whole Haystack via Terraform, you decide what is there or not. Right now the UI config is decoupled, but there could be a future where the config is materialized at the same time.

Inside a connector, there are feature switches, used until a feature matures.

The context of universal search is the "service"; for example, this can help you look across Traces, Trends, Alerts and the Service Graph. Tags are whitelisted for search: you can predefine which tags are available for search.

For trace research, we are thinking about a "spans view". This flattens the spans into events, which can make filtering spans easier.

The trace also has a service aggregation for (average) latency purposes. It is also color-coded for quick understanding of whether something crosses datacenters. There is another view which shows the system-wide aggregate numbers instead. It was originally plotted side by side, but that didn't work in practice because it was too busy.

A third stage is to show the service performance for each service in the trace (Trends). This is used much more by operations folks. Haystack UI intends to let you transition from pure tracing into an APM. Related traces could be an option for finding similar ones from the same user.

Trends: only interested in count, duration and success ratio. You can easily transition into a trace search from this; for example, this can reduce the time spent finding anomalies such as a price-scraping bot. The trends are about the downstream calls, except the first row, which is the incoming requests; the other rows are for each operation called (any calls, not just remote ones).

Data only comes from three places: Haystack, Graphite, or business metrics; anomaly detection runs over all of this.

The service graph is one level up and one level down.

For customer support (one tier up, who are tech savvy), all request and response payloads are captured, with a linked tag to a blob service; this is the most frequently used feature in troubleshooting. The customer-support index is retained longer and only contains a few calls, and the names are sanitized to be easier to understand.

  • Q: how did all of this happen?
    • A: traditionally (2011), we had 4 systems: Splunk, Graphite (now Gorilla/Metrictank), telemetry (Graphite). Now, core data still exists in individual sources (like Metrictank), and Haystack presents the same data. We got comments on the Trends and Alerts views: the operations team loves having everything in one place. Our intent is to reduce the time people spend in different Splunk indexes.

Q4 will include service graph snapshots every few minutes to help understand graph anomalies.

  • Q: how does whitelisting work?
    • A: Terraform has an initial list of tags, and you can also update it at runtime (via a k8s job) to change the whitelist (and index accordingly).
    • (Magesh) In Q1 there will be control plane functionality for this.
  • Q: how often does the whitelist change?
    • (Magesh) It used to change frequently, but less so now. These are targeted at core attributes, like well-known site IDs. Some teams ask for different fields, such as PNR for flights. There is one large deployment for multiple brands (Expedia, Orbitz, Travelocity, etc.) and another separate one for HomeAway.

Haystack is also used in support use cases, such as duplicate charges: seeing whether something happened or not, for customer support. There is some temporary indexing.

  • Q: what about the provenance use case?
    • A: that's one reason we have a separate index for customer service, but we are still figuring out how to deal with the delineation between authoritative transaction info vs trace info.
  • Q: how do you deal with bad-quality service names?
    • A: there's automation required before launching a service: the service name must match everywhere, down to DNS. This doesn't solve the business name though, as they can identify the downstream as something not authoritative.

We discussed a user-based context in the past, but are currently going with contextualized views. It wouldn't be hard; one simple way is to track the history of searches.

  • 30TB in protobuf is 4 days of spans in Cassandra; S3 holds JSON format (28 days); 300-400K spans/second

  • one system is at 3.6B spans/day, so it is sampled down to 1%

  • the largest costs are from storage nodes (Cassandra), about 1/3rd of the cost; experimenting with hosted Aurora to reduce that; also interested in the query side

  • Expedia: tenancy is something to help with in the future, for example showing only Expedia, only Travelocity, etc. (branding); in that case the root span could tag and clarify the primary brand of a trace.

  • LINE: we have an "unrelated services" problem, and being able to see only data tagged with a specific project can help filter out noise.

How to proceed with the Zipkin UI

To converge the internal work with the external, it seems the first step is to make both use the v2 model internally. There's a lot of hard-coded logic that only exists to satisfy v1 (like constant conventions, etc.). We can also introduce a stub API endpoint to defer some of the harder code, like clock skew correction.

  1. raise small pull requests to make the normal UI use v2 internally
  2. add the ability to use a mock API, with 5 example traces (RPC, messaging, etc.); see the sketch below
  3. as soon as both versions use v2, start integration discussions
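
For item 2, a minimal sketch of what such a mock API module could look like (the module and variable names are assumptions, not existing UI code); the canned data uses the Zipkin v2 span format, so UI work doesn't need a live backend:

```typescript
// Sketch of a mock API serving canned traces in Zipkin v2 span format.
type SpanV2 = {
  traceId: string;
  id: string;
  parentId?: string;
  name?: string;
  kind?: 'CLIENT' | 'SERVER' | 'PRODUCER' | 'CONSUMER';
  timestamp?: number; // epoch microseconds
  duration?: number;  // microseconds
  localEndpoint?: { serviceName?: string };
  tags?: Record<string, string>;
};

const rpcExample: SpanV2[] = [
  {
    traceId: '86154a4ba6e91385', id: '86154a4ba6e91385', name: 'get /api',
    kind: 'SERVER', timestamp: 1541041069000000, duration: 120000,
    localEndpoint: { serviceName: 'frontend' },
  },
  {
    traceId: '86154a4ba6e91385', id: '4d1e00c0db9010db', parentId: '86154a4ba6e91385',
    name: 'get', kind: 'CLIENT', timestamp: 1541041069010000, duration: 80000,
    localEndpoint: { serviceName: 'frontend' }, tags: { 'http.path': '/api' },
  },
];

// Mirrors the shape of GET /api/v2/traces: a list of traces, each a list of spans.
export function mockTraces(): SpanV2[][] {
  return [rpcExample /*, messagingExample, ... the other example shapes */];
}
```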

Feedback on UI

  • Want the search form to be site-specific. For example, whitelisted tags that can have auto-complete data pre-populated. This could include fields like the user's country.
  • Want a different UI control for date calculation, such as Grafana's relative expressions (e.g. now-6h) or an absolute date; see the sketch below
    • (should watch that this control can't offer an unsolvable query, like data from before the TTL)
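
A sketch of the Grafana-style relative time expression idea (an assumed helper, not existing UI code); a real control should also clamp results to the retention window (TTL) so the query stays solvable:

```typescript
// Parse expressions like "now", "now-6h", "now-30m", "now-2d"; anything else
// is treated as an absolute date. Returns epoch milliseconds.
function parseTimeExpression(expr: string, nowMillis = Date.now()): number {
  const match = /^now(?:-(\d+)([smhd]))?$/.exec(expr.trim());
  if (!match) return new Date(expr).getTime(); // absolute date fallback
  if (!match[1]) return nowMillis;             // plain "now"
  const unitMillis: Record<string, number> = {
    s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000,
  };
  return nowMillis - parseInt(match[1], 10) * unitMillis[match[2]];
}

// e.g. parseTimeExpression('now-6h') -> a timestamp six hours ago
```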

10-31

Several members of LINE met for lunch and had a chat. All of them have used Zipkin, Stewart at his previous company. Xu JunJian asked how beginners can understand the difference between trace propagation and trace lifecycle. Adrian tried to walk through this, noting that propagation is similar in difficulty to making something like a user ID flow through a system and be accessible to arbitrary code. Stewart noted that in Scala this is typically addressed through implicits. Since the question was about Brave, we highlighted that ScopedSpan is simpler but only works for synchronous activity; a normal Span allows you to split the work so that async interceptors can be invoked under the right context. Admittedly, none of this is really a beginner topic, except that to begin async tracing it must be understood.

After lunch, Xu JunJian came and chatted for a while in follow-up with Adrian. He noted that Kotlin in particular may become a problem for asynchronous trace context propagation due to the way coroutines work. This is not something we handle at the moment, so we opened an issue. Also discussed was how to do sampling, notably how an "important user", such as one with 500 followers, can get more traces than others. This is a form of "parameterized sampling", which is possible by inspecting the request. We went over existing tech such as the HttpSampler, noting that Armeria might not support the HttpSampler yet.
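
Brave's HttpSampler is a Java API, but the decision logic discussed is simple enough to sketch generically (the request shape and thresholds below are assumptions for illustration, not Brave's actual interface):

```typescript
// Generic illustration of parameterized sampling: inspect the request and
// boost the sample rate for "important" users.
interface IncomingRequest {
  path: string;
  headers: Record<string, string | undefined>;
}

function shouldSample(req: IncomingRequest, followerCount: number): boolean {
  // always trace explicitly flagged debug requests
  if (req.headers['x-debug'] === 'true') return true;
  // important users (e.g. > 500 followers) sampled at 10%, everyone else at 1%
  const rate = followerCount > 500 ? 0.1 : 0.01;
  return Math.random() < rate;
}
```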

Following that, Igarashi-san and Adrian debugged whether or not it was necessary to do clock skew adjustment for the index page. This was more difficult than expected due to a lack of test coverage, so we fixed that before saying "yes" for sure. However, we are not confident in the algorithm used for the service percentages; this will have to be revisited.

11-1

Options for service graph revamp

Igarashi, Adrian, Huy and Raja brainstormed some options for refreshing the dependency graph. Thinking includes replacing the current code with a newer version of dagre-d3, using a different d3 library like Cytoscape, or something different like Vizceral. We decided to play with Vizceral via a Haystack integration experiment.

Experiment with Haystack for service graph

Igarashi and Adrian spent time integrating Zipkin with the Haystack UI.

We found something that may not be obvious: the Haystack UI includes a separate proxy server which mediates between APIs like our dependency graph and the UI code's API. Incidentally, Zipkin used to have this approach (zipkin-web), but we switched to pure JavaScript in the client. The split between code in the browser and code in the server made some things a little harder to debug. At any rate, it is interesting to know there is a proxy element.
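
To illustrate the proxy idea only (assumed route names, shapes and an Express server; this is not haystack-ui's actual code), the server-side piece fetches Zipkin's dependency links and reshapes them for whatever the UI component expects:

```typescript
// Sketch of a UI proxy: the browser talks to this server, and the server
// talks to the Zipkin API, reshaping responses along the way.
import express from 'express';

const app = express();
const ZIPKIN = process.env.ZIPKIN_BASE_URL ?? 'http://localhost:9411';

app.get('/api/graph', async (_req, res) => {
  const endTs = Date.now();
  const lookback = 24 * 60 * 60 * 1000; // one day, in milliseconds
  const url = `${ZIPKIN}/api/v2/dependencies?endTs=${endTs}&lookback=${lookback}`;
  const links: Array<{ parent: string; child: string; callCount: number; errorCount?: number }> =
    await (await fetch(url)).json();
  // reshape Zipkin links into whatever node/edge format the graph component wants
  res.json({
    edges: links.map(l => ({ source: l.parent, target: l.child, count: l.callCount })),
  });
});

app.listen(8080);
```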

Through this proof of concept we learned that the haystack graph splits each connected network onto a new pane. It also has some animations related to "requests per second". 

Consider the following Zipkin test data, which includes messaging data from Ascend and RPC data from Netflix:

[Screenshot: Zipkin dependency graph of the combined test data]

Haystack would split these into two tabs like this:

[Screenshots: the two resulting Haystack service graph tabs]

This seems to be useful when you have one Zipkin environment used by many applications which operate in silos. However, Huy noted that many Zipkin sites may not have cleanly named services, so the graphs may be so numerous that a tab approach could be bad. In that case, a drop-down could be more effective (if splitting at all).
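
For reference, the splitting itself is just connected components over the dependency links; a minimal sketch (assumed types, not Haystack's actual implementation):

```typescript
// Group dependency links into connected components, one tab (or drop-down
// entry) per component, treating links as undirected edges.
interface Link { parent: string; child: string; callCount: number }

function connectedComponents(links: Link[]): Link[][] {
  // union-find over service names
  const parent = new Map<string, string>();
  const find = (s: string): string => {
    if (!parent.has(s)) parent.set(s, s);
    const p = parent.get(s)!;
    if (p === s) return s;
    const root = find(p);
    parent.set(s, root); // path compression
    return root;
  };
  const union = (a: string, b: string) => parent.set(find(a), find(b));

  links.forEach(l => union(l.parent, l.child));

  const groups = new Map<string, Link[]>();
  for (const l of links) {
    const root = find(l.parent);
    groups.set(root, [...(groups.get(root) ?? []), l]);
  }
  return [...groups.values()];
}
```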

The other interesting thing is that the links are associated with a rate of requests per second, something like this:

[Screenshot: Haystack service graph showing request rates on the links]

The layout is clean and neat, though an average rate is likely to be misleading, especially as Zipkin data is often sampled. In fact, we used to have rate-like info on our links, but we removed it because it was misleading. At Expedia it can make sense, as they emit 100% of the data. Even if some sites might have "firehose" data feeding the dependency graph aggregator, we are still a bit suspicious about the rate. OTOH, error ratio is something we already support. The tension is mainly about daily buckets and how unlikely it is that rates would be helpful in that context.

As Expedia focuses on operation → operation links, we also can't really get a feel for how that would play out in Zipkin, where sites often have even worse cardinality problems in span names than they do in service names. For now, we just use a span name of "unknown" to make it work.

An overall impression is that it does look clean and can feel more modern than our dependency graph. However, we have also heard someone at Netflix call Vizceral "conference only" code, which is concerning. That said, it is maintained, and it seems at least Expedia likes it.

Closing thoughts at LINE are that we are grateful to be able to experiment; it is nice to be able to accomplish something in a day and have it likely to work in Haystack. This may let people experiment with a Vizceral-type approach while the design is thought through on our side: what to build into the current UI, or whether to just refresh it. From a design POV, we will lean on Igarashi to come up with a proposal.

notes will run below here as the day progresses. Please don't write anything that shouldn't be publicly visible!
