2019-03-11 Tapping into Zipkin data

Created by Adrian Cole, last modified on Mar 12, 2019

Goal

Recent tooling surfaced to put into practice patterns discussed at the last SF workshop, notably around re-use of data emitted by Zipkin instrumentation. For example, in January we proved the concept that Expedia Haystack can decorate adaptive alerting capabilities with minimal intrusion (via "firehose handlers" discussed at the last SF workshop).

The goal of this workshop is to round up practice and introduce tooling aimed at bolting telemetry or alerting capabilities onto existing Zipkin installations. The first day will be more knowledge-sharing in nature, and the second a spike towards practical outcomes.

Date

11-12 March 2019 in working hours PST (UTC -8).

Location

At Salesforce Tower:  415 Mission Street, San Francisco, CA 94105

Folks signed up to attend will sign in at the reception desk and meet in the lobby at 9:00am. A lightweight NDA will need to be signed in order to be admitted into the offices. Chris Wensel is the contact person to use when signing in.

Output

We will add notes at the bottom of this document including links to things discussed and any takeaways.

Attendees

The scope of this workshop assumes attendees are designers or implementors of Zipkin architecture and invited guests. Attending in person is welcome, but on-site space is constrained. Remote folks can join via Gitter and attend via the Hangouts Meet link.

Attending on-site

  1. Chris Wensel, Salesforce (host)

  2. Adrian Cole, Pivotal

  3. Magesh Chandramouli, Expedia

  4. Suman Karumuri

  5. Jordi Polo, Medidata

  6. Guilherme Hermeto, Netflix

Total space is 10! Let’s see how this goes!

Homework

Please review the following before the meeting

2019-01-15 Expedia Haystack and Adaptive Alerting

2018-09-20 Filtering, Aggregations and Proxies at Salesforce

Agenda

If any segment is highlighted in bold, it will be firmly coordinated for remote folks. Other segments may be lax or completely open ended.

Monday, March 11

9:30am introductions
10:00-11:00am Overview of community work in the last 6 months (Adrian)
11:00am-12:00pm zipkin-voltdb experiment for late sampling (Adrian)
12:00-1:00pm lunch
1:00-2:00pm Haystack (integration with zipkin data) (Magesh)
2:00-3:00pm Async modeling and non-traditional tracing (Deepa)
3:00-4:00pm Zipkin JS: discuss 2.x design (Guilherme)
5:00pm Zipkin project work: what's buy/hold/sell (what's investing, what's at risk) (Adrian)
5:30pm round up


Tuesday, March 12

9:30am overview of technical goals
11:30am-12:00pm Travel/Lunch
12:00-1:00pm Adrian to give a presentation on Zipkin to the SFDC performance community https://meet.google.com/gux-spot-vrf
1:00-2:00pm Tangents arising from the site presentation
5:30pm round up

Notes

notes will run below here as the day progresses. Please don't write anything that shouldn't be publicly visible!

Intros

Tommy

Involved in zipkin for a couple of years; previously at Rakuten doing tracing and metrics. Now a committer on Zipkin, working at Pivotal with responsibilities also on Micrometer.

  • check on what updates are around things like late sampling

Deepa introduces Argus

https://github.com/salesforce/Argus

The frontend is a client-side application that analyzes zipkin data to produce statistical information. It shows summary information in a row-based view with stats. It also shows the default zipkin UI below, as well as other data such as results from splunk. The argus backend provides metrics aggregation data, but other information comes directly from the zipkin api. A transaction is defined as the root service+span of the trace; this is used to do dependency linking.
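As a concrete illustration of the transaction concept above, here is a minimal sketch (not Argus code) of deriving a grouping key from the root service+span of a trace. The zipkin2.Span accessors are real; the class name and key format are made up for illustration.

```java
import java.util.List;
import zipkin2.Span;

public class TransactionKey {
  /** Returns "service:spanName" of the root span, or null if the trace is headless. */
  static String of(List<Span> trace) {
    for (Span s : trace) {
      if (s.parentId() == null) { // root span has no parent
        String service = s.localServiceName() != null ? s.localServiceName() : "unknown";
        String name = s.name() != null ? s.name() : "unknown";
        return service + ":" + name; // key used to partition dependency links
      }
    }
    return null; // no root reported; see the discussion of headless traces below
  }
}
```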

Magesh: who is the user of this data?

Chris: fewer teams are using this, as Zipkin adoption is in its early stages

Suman: we also had a separate UI, which used data from zipkin query as an overlay. It looks the same as the scatter plot in argus.

Jordi: we once tried to do dependency link aggregation partitioned by root. However, our background jobs start in the middle of the graph, which caused too many potential roots. So even though it looks nice to partition this way, we couldn't use it.

Suman: the dependency graph wasn't too useful as there were too many services. It would be nicer to search in the scatter plot, then aggregate based on that service. In other words, for this path, what's upstream and what's downstream?

Magesh talks about Haystack

https://github.com/ExpediaDotCom/haystack-docker https://expediadotcom.github.io/haystack/docs/about/architecture.html

Haystack is a tracing system which can emit trends and do anomaly detection, but the architecture is modular such that you can turn things off. The pieces are connected via Kafka and much of the system is KStreams apps. It can run natively, or use zipkin as the trace depot and add trends processing on top. Tables is similar to what was shown in argus, and blobs is request/response logging; these will be available soon.

Even though there are 4 boxes in the architecture diagram, there are many processes behind them. This allows certain features to be swapped or scaled independently.
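To make the "modules connected via Kafka topics" idea concrete, here is a minimal Kafka Streams sketch, not taken from Haystack: one module consumes spans from a topic and hands results to the next module via another topic. The application id and topic names are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SpanForwarder {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trend-builder"); // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, byte[]> spans = builder.stream("proto-spans"); // topic produced by an upstream module
    spans.filter((traceId, span) -> span != null)
         .to("trend-input"); // downstream module (e.g. trends) consumes this topic independently
    new KafkaStreams(builder.build(), props).start();
  }
}
```

Because each stage only shares a topic, a stage can be turned off, swapped, or scaled without touching the others, which is the property described above.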

In the UI, there's a universal search bar. This was added because people were used to splunk. When zipkin is used, its logo will show in the UI. The context carries between tabs, for example, the error query will show the trends around that error.

The data coming in and out of pieces is via Kafka. The anomaly detection learns over minutes then creates anomalies which can have subscriptions like alerts added to them. Internally there are some learning mechanisms that are being developed to strengthen the anomaly detection.

The computed trend is stored in KTables. There are two stages: adding to an anomaly (the AA subsystem) and creating an alert (with the history stored in elasticsearch).

Magesh: We were experimenting with MySQL for management, but recently the tools around cassandra management have improved. We may stick with cassandra.

Suman: Columnar data stores like Pinot and ClickHouse are ideally suited for trace data and partition naturally.

Magesh: there's some buffering that eventually writes to cassandra after a timeout. So indexing is done only in elasticsearch

Suman: We standardized on elasticsearch because it is good for the problems you need it for. But because ES manages the replication of its own dataset as well as the dataset of the storage system, the replication overhead starts breaking down. For example, if your I/O is saturated for indexing it is also saturated for search. In Pinot, you have an indexing layer, but storage and indexing are separated explicitly. If you have too many tags, Pinot could be better than Lucene. In my opinion, it's better to be slow for recent data.

Magesh: We address this today with whitelisting. This prevents junk from clogging the index. We also sync timing between indexing in elasticsearch and indexing in cassandra. For older data we use "pipes", which uses technology such as parquet, searched via athena mounted on S3.

Magesh talks Blobs

https://github.com/mchandramouli/haystack-blobs/wiki/Logging-Request--&-Response-objects

Blobs allows request/response storage, added into the trace context as tags in a span. It is the most widely used feature. The tag value is an encrypted s3 url. This is also synchronized with the logging context. The sampling policy of blobs is different from the sampling policy of traces (though in haystack at expedia, traces are always sampled). The value proposition: after the fact, the time to troubleshoot prod issues went from 40+ days to 6. There is an agent that accepts both span and blob data, which isn't quite ready for OSS yet.
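A minimal sketch of the pattern described here: the payload is stored out-of-band and the span only carries a pointer to it as a tag. The Brave tag API is real; the tag key, helper method, and URL format are illustrative, not Haystack's actual agent code.

```java
import brave.Span;
import brave.Tracing;

public class BlobTagging {
  /** Stores the payload out-of-band and links it to the current span via a tag. */
  static void linkRequestBlob(byte[] requestBytes) {
    String blobUrl = storeEncryptedBlob(requestBytes); // upload happens outside the trace pipeline
    Span span = Tracing.currentTracer() != null ? Tracing.currentTracer().currentSpan() : null;
    if (span != null) {
      span.tag("request.blob", blobUrl); // tag key is illustrative, not Haystack's real key
    }
  }

  // Hypothetical stand-in for an encrypted S3 upload; a real implementation would
  // return an encrypted or presigned URL rather than a random key.
  static String storeEncryptedBlob(byte[] payload) {
    return "s3://blobs/" + java.util.UUID.randomUUID();
  }
}
```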

There is a debug mode in the haystack UI which is available when on VPN, verified by a token server. This allows troubleshooting query screens (only booking screens are blobbed by default). We also have a service availability tool which shows a heatmap over one-second increments (built from trends data).

Jordi: what about sensitive data?

Magesh: we scrub in the agent and after the fact, and also encrypt the blobs. In haystack-pipes, there are scrubbers built-in, but we have more sophisticated scrubbing internally.

Sidebar on the UI.. when you select a trace or other context, the other views like trends are only relevant for that context

Guilherme: does the UI work the same if there's a zipkin backend?

Magesh: some things like trends may not work exactly the same.

Suman: When did this start?

Magesh: Current development started in 2017. We created a KPI, mean time to know (the root cause), which is decoupled from MTTR.

Future work:

haystack tables: allows you to create a view over spans (a smaller parquet file, simpler than athena) and supports query-time aggregation.

 

Guilherme: tracing is pluggable, but is the metrics/trends pluggable?

Magesh: the reading/writing is from metrictank (with metrics 2.0, an improved graphite format). We'd need to test this with an adapter.

Suman: it would be nice to plug in to pull from prometheus

Adrian: even if tech is there, you still need to match dimensions, like for example url normalization may be different between different tools

Magesh: we have a filtering step in front of metrictank now to adjust for some of this.

Deepa talks Async

We have problems with async work where we don't know when it completes and we don't want to have flushed data between nodes.

Magesh: we have retrospans, which adds a synthetic span representing the workflow.

Chris: the challenge is that we are working under an assumption of hierarchy. Flows don't always work like that, which makes it more challenging. There are probably 4 core models of understanding, so we need to know what doesn't fit. For example, the batch model has no callback.

Magesh: if you instrument Kafka with Jorge's instrumentation (https://github.com/sysco-middleware/kafka-interceptors), you can backfill a fake span given you know the name.

Deepa: currently we handle the workflow time in aggregation only.

Suman: how much time do I spend in a queue is an interesting question to ask.

Magesh: how much duration in a service, and how much is network (unallocated time) is reported in haystack ui.

Deepa: How do we know when a trace is finished?

Magesh: How often do you aggregate? Do you re-write?

Deepa: Every 5 minutes. Currently we don't re-write.

Adrian: we might be able to enroll a learning loop to figure out if something is done. Meanwhile, it might be a stage that requires site-specific hard coding which implies some code beyond what producer/consumer relationships can tell you.

Deepa: If some tags are added in child spans how do we ensure we can aggregate? We propagate upwards in post-processing

Suman: how many child requests are going to a different AZ? We wrote a spark job for post-processing (Jupyter notebook).

Magesh: haystack agent adds some tags based on where you are running. then in every root span we look at that data, and we compare parent and child metadata.

Suman: doesn't this imply a linear join which is super slow?

Magesh: it is applied to a smaller partition of data

Suman: maybe you can leverage partitioning by trace ID to do this work, to ensure processing is local

Magesh: yeah we can consider doing this as a part of the service graph
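A sketch of what partitioning by trace ID could look like in Kafka Streams, so that per-trace comparisons (e.g. parent vs child AZ tags) stay on one instance. The topic name, serdes, and the extractTraceId helper are hypothetical.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class TraceLocalProcessing {
  // Hypothetical helper: pull the traceId field out of a JSON-encoded span.
  static String extractTraceId(String spanJson) {
    return spanJson.replaceAll(".*\"traceId\"\\s*:\\s*\"([0-9a-f]+)\".*", "$1");
  }

  static KTable<String, Long> build(StreamsBuilder builder) {
    KStream<String, String> spans = builder.stream("spans-json"); // topic name is made up
    return spans
        .selectKey((key, spanJson) -> extractTraceId(spanJson)) // re-key by trace ID...
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String())) // ...so each trace lands on one partition
        .count(); // stand-in for the real per-trace work, e.g. comparing parent/child metadata
  }
}
```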

Suman: it would be nice to be able to solve certain problems in a turn-key fashion, for example, the modeling problems.

Adrian: Site tags are one area of work towards that. We try to push work that moves in that direction, in hopes that sites will steer us in priority order.

Suman: Vendors sell shiny experiences, which can make it hard for execs to parse what's good marketing vs what's a good choice.

Adrian: One way is to look at the Sites document, where we can share what sites are doing.

Magesh: we have decks that we used to present all the way up to the CEO level. These show savings in terms of cost and time to know the problem. We are now trying to work on the first-start experience, and are also building a control plane to help measure haystack and scale things individually.

Guilherme talks zipkin-js

What worked and what didn't in v2, and what should we do in zipkin-js? For example, in javascript we have pieces that evolved in a functional style, like monads, but some users will completely ignore that, and this can cause problems.

We have about 10 instrumentations, written by different people with different styles, some for the frontend and others for node.js. There are patterns such that you can plug in composable things, like plugging in kafka vs grpc, etc. There is also pain in creating those integrations, for example the logging style of annotation, which accumulates. You need to know which things are compatible with what, and the assembly details can be difficult as the defaults are not great. How can I hit the ground running without having to know so much? One thing we do at netflix is sampling at the proxy level: how can I make a late decision on sampling? Some things have been challenging, such as migrating header types. Hopefully this work is re-usable for later types.

One thing that happens in javascript is that instrumentation first creates an ID, then reads the headers, then throws the first ID away and has to rescope to the ID it now knows is correct.

What to do about zipkin4net

Jordi: we have a scenario where people are asked to drop their private library for a common one, which is then later dropped. What to do about it? Also, competing libraries appear, e.g. https://github.com/petabridge/Petabridge.Tracing.Zipkin

Suman: there could be a way to create a base-level project that only focuses on encoding details. Most people will want to use the stock one.

Jordi: yes, but there will also be people who write their own

What do we do?

  • reassess later? (when)
  • attic repo
  • attic repo and recommend something else

Let's re-assess later, as MdSol are working on this and may be able to afford low-impact changes. That completely removes the burden of taking over >10K-line pull requests.

Takeaways

Chris Wensel

  • glad to see lens as it helps reduce the friction around adoption
  • not necessarily able to use voltdb, but will look at some of the code for our internal utilities
  • tomorrow: is there a neutral way to provide open source components that enrich spans in flight?

Suman

  • a lot of the haystack stuff was a good takeaway, e.g. if we had known kstreams we wouldn't have needed a spark job
  • good to know what's happening in Zipkin
  • tomorrow: present our thinking about traces in general, for example show causally linked events and how it could be leveraged by mobile and web contexts

Guilherme

  • v2 design discussion on zipkin-js
  • interested in the haystack UI; will propose deploying it alongside, possibly using atlas for the trends backend
  • tomorrow: see how different sites are using zipkin, get more context on the ecosystem

Adrian

  • liked stories about messaging and workflow models
  • zipkin4net forward plan
  • good communication between the team
  • tomorrow: feedback on sites

Jordi

  • mesmerized by haystack
  • interested to hear that everyone has the same marketing problems as we have
  • haystack blobs seems like a good idea to restart interest in Zipkin

Automated data quality approaches

About zipkin data in graph database by Chris:

Using other tools like Neo4J can reveal relationship-structure glitches more easily than the UI, as the UI compensates for hierarchical errors in a way that makes them harder to detect.

When you have a graph DB, you can do a lot more structural analysis, which can help when the data is good. For example, detecting requests that go into the wrong connection pool.

Problem is that it takes a while to load up a graph DB, so you want a tail filter instead of sending everything into it.

The general idea is to use structural language to detect structural anomalies not just latency ones.

Adrian: how difficult was it to integrate zipkin data into Neo4J?

Chris: you have to integrate with their query language and write an adapter to do this.

Overview of cascading neo4j loader

It uses neo4j to load nodes while also creating edges (to the parent). This may not work well if the parent data is set later; it all has to do with how parent nodes are materialized on demand. The tool we use internally expresses the mapping of fields such as parentId to the edge PARENT as a text config file. In other words, no code modifications are needed.
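A minimal sketch of the kind of load described, using the Neo4j Java driver: Cypher MERGE materializes the parent node on demand even if its span arrives later. The labels, property names, and example values are illustrative, not the internal tool's config.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import static org.neo4j.driver.Values.parameters;

public class SpanLoader {
  public static void main(String[] args) {
    try (Driver driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"));
         Session session = driver.session()) {
      // MERGE creates the child, creates a placeholder parent if it hasn't arrived yet,
      // and links them with a PARENT edge (the parentId -> PARENT mapping described above).
      session.run(
          "MERGE (s:Span {id: $id}) SET s.service = $service, s.name = $name "
              + "MERGE (p:Span {id: $parentId}) "
              + "MERGE (s)-[:PARENT]->(p)",
          parameters("id", "b26412d1ac16767d", "parentId", "86154a4ba6e91385",
              "service", "frontend", "name", "get /api"));
    }
  }
}
```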

Overview of data cleanup

Chris: we have standards for tags and we'd like a method to conform to them. Data is dirty, and elasticsearch hates value type changes. The following allows you to bulk-change data, to do things like turn all values of a tag into a number: https://github.com/Heretical/pointer-path. Other problems are minor variations of valid values; https://github.com/Heretical/mini-parsers can adjust for this.

Adrian: this could be helpful for things like how to increase quality of site-specific tags

Chris: we can't control the libraries and the data format. We can produce some transformations for common trouble spots, like dealing with "error=false" tags defined by tracing libraries, or url normalization. URL normalization could also help port forward data whose shape has changed over time (e.g. newly required request parameters). We need tools like this to be agnostic to tool sets like KStreams vs other options.
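A toy sketch of the kind of inline transformation being discussed (not the pointer-path or mini-parsers APIs): coercing a numeric tag and unifying variants of the error tag before indexing. The tag names follow common zipkin conventions, but the rules here are only examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TagNormalizer {
  /** Returns a copy of the tags with known numeric fields coerced and common variants unified. */
  static Map<String, Object> normalize(Map<String, String> tags) {
    Map<String, Object> out = new LinkedHashMap<>();
    tags.forEach((key, value) -> {
      if (key.equals("http.status_code")) {
        out.put(key, Integer.parseInt(value.trim())); // "200 " and "200" both index as the number 200
      } else if (key.equals("error")) {
        out.put(key, value.isEmpty() ? "true" : value.toLowerCase()); // unify "", "TRUE", "false" casing
      } else {
        out.put(key, value); // pass everything else through unchanged
      }
    });
    return out;
  }
}
```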

We don't want analytics to have to deal with things like data formatting differences. So we'd want a stage to do normalization prior to that. We need a stage that can be used for things like GDPR, normalization, deployment drift, and have a way to test them (ex run through kafka), but not be opinionated such that it only works with spark (or another similar technology not likely used at all sites).

Nara: What are the use cases of tracing? When there is sampling, only a fraction of use cases can be served.

Ryan: today tracing is at 0.1% in two of our datacenters

Nara: at 0.1% you can draw the service topology and that's about it, which leads the streaming path to rely more on metrics.

Magesh: we were able to add on features where 100% tracing was present (e.g. booking-originated requests). If you combine metrics anomaly detection and tracing, you get more value. We are moving towards graph-based anomaly detection. We don't store raw traces long term; less than 0.1% of traces are ever looked at. We ship collected traces to long-term storage so that they can be churned more cheaply with athena. Expedia group has many brands: some do 100%, some 20%, some otherwise. Hotels.com and homeaway use brave/sleuth; expedia, orbitz and travelocity are a mix of brave and haystack instrumentation.

Nara: what about gc pauses?

Magesh: will have to ask team about it

Nara: there are two types of services: mid-tier and edge. Edgar is similar to zipkin but it keeps 100%, so that you can diagnose questions like why a 4K film is not showing.

Magesh: we have use cases where certain carriers are not booking but the carriers are low volume, so don't show up on alerts. 100% sampling helps here

Ryan: seems like this 100% is a forensic tool. For our low sample rate, it can be the entrypoint, not the end point of debugging. We also have tools to visualize a dependency graph for one request. We still heavily rely on logging and metrics.

Magesh: tracing is used as an entrypoint, and there's a fallback to logs, but haystack helps you get to that point. Request/response logging can be a differentiator for operational concerns around booking. More teams are turning from logging to tracing as the first place to look.

Nara: some people get excited about tracing but then go away when they realize it is technically difficult.

Magesh: is it because of heterogeneity?

Nara: we have instrumentation for runtime libraries, but there's old stuff and new stuff and it has to interop. If it can't integrate quickly, it is harder. Metrics is an easier roll-out as there's no context dependency. For example, I want to get a subset of traces, but that is not easy to roll out because it requires updates to applications to carry the context needed to do this. This makes it seem convenient to do 100% and handle everything in the backend.

Magesh: we have the same conflict

Ryan: what are your experiences with the volume of data in 100% sampling?

Magesh: it is there, but we educate people. We have old and new applications; some are small, with just identifiers and a couple of tags. Edge services emit more data, like device identifiers and request parameters, so certain layers have different volume. We use a control plane to attribute costs, which lets applications know what percentage of haystack bandwidth they use. It is not true billing, but showback.

Ke: how does the identity mapping work?

Magesh: we have a service that enrolls apps into cost centers during onboarding.

Nara: do you use opentracing, how does propagation work? we have some request scoped variables we use.

Magesh: we have applications that use zipkin and some applications that use openzipkin. For propagation we use a different set of headers, with an extractor for this. Some of our old libraries are a little clunky.

Nara: how are you doing tracing in reactive libraries? E.g. it is hard to get a handle on the tracer from an akka actor.

Magesh: tracing in akka is not simple. We have a scalatra + akka stack; the requests are traced but the actors are tricky. Request context is mandatory to propagate in some contexts, e.g. language, currency, etc., and the old trace propagation was tied into this system. So if they work to propagate currency, they should be able to propagate the trace.

Ryan: we have a team that tried. tracing is a community effort and it takes one team to make it break apart. 

Magesh: there's a movement towards templated applications, which can roll out patterns that work with tracing.

Ryan: for those doing 100% collection, do you shard by datacenter?

Magesh: we ingest from 5 datacenters in different streams; the backend, cassandra, is in a ring. It is deployed multi-AZ, but not multi-region.

Adrian: yelp have AZ-local deployments

Nick: Can you explain how users work with zipkin for local development? How do you accommodate developers needing to run it?

Adrian: we have a junit rule, and the server can start from a single file. This can be used on a workstation but also in other types of test environments. Things like immediate consistency and the span count limiter mean people have fewer questions when using it in dev.
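A minimal sketch of the junit rule mentioned, assuming the zipkin-junit module's ZipkinRule (class and method names may differ across versions); the test body is only a placeholder.

```java
import static org.junit.Assert.assertTrue;

import org.junit.Rule;
import org.junit.Test;
import zipkin.junit.ZipkinRule;

public class TracedClientTest {
  // Starts an in-process zipkin server on a random port, scoped to this test
  @Rule public ZipkinRule zipkin = new ZipkinRule();

  @Test public void reportsSpansToLocalZipkin() {
    // Point the code under test at zipkin.httpUrl() instead of a shared install,
    // then exercise it (elided in this skeleton).

    // Storage is immediately consistent, so assertions don't need to poll or sleep.
    assertTrue(zipkin.getTraces().isEmpty()); // nothing was reported in this skeleton
  }
}
```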

Nick: we do have the ability to spin things up on the workstation, and some varied architectures like mini-pods. We strive to use similar tools across the stack.

Jordi: how do you handle the dilemma of the showback? Does this lead to reverse adoption?

Magesh: it is a critical mass problem, and peer pressure also helps with this (breaking downstream). My original goal was to get the booking path 100% instrumented.

Adrian: how do you get the peer pressure to work? seems people would blame platform instead of their upstream

Magesh: It was hard work to get the first set of teams, then easier later. The self-solving has to do with internal chats like slack. One team has 600 RPS and wants more than we can accept, so we have the opposite problem of more capacity requests than we can serve.

Ryan: Did you have any trouble selling executives on the origin story?

Magesh: we started back in 2011 with a SQL server, then moved to a splunk app per index, which broke what we had in SQL. Then we hacked zipkin with a chrome plugin. Without the chrome plugin, it was really hard to show the value.

From the last workshop: Magesh has been in distributed systems a long time and streaming for 6-7 years. In 2010 there was tracing in Expedia, but not distributed; that header propagation is what the UUID is integrated with now. Initially SQL Server in 2011-12, then splunk, which was better but teams had different indexes. That led to a desire to reduce time to know. In 2013 they tried "black box" tracing (zipkin); in 2014-15 they made doppler, a hacked zipkin made to work with the IDs etc. already in use, which allowed trend analysis and then led to the funding of Haystack.

Ke: do you have benchmarking numbers?

Magesh: I can give a better answer in June. We are looking to answer how many resources are needed per million spans, and to pre-scale based on that. We currently know the size and scaling factor, but not a formula. We also want to do some initial rate limiting.

Chris: can you explain what blobs are?

Magesh: an expedia person can troubleshoot booking with a token authenticated debug request. When in this mode, the trace will include tags which link to the request and response. booking requests are already 100%. teams pay directly for this service so we don't limit usage arbitrarily.

Ke: how long is retention?

Magesh: we see close to 14-15 TB of data and 17-20B spans per day

Ke: storage?

Magesh: span storage is cassandra (short term) and s3 (long term); the index is elasticsearch. We don't store full span data in elasticsearch, only a few whitelisted fields.

Ke: in general, when OSS zipkin consumes spans from kafka, there can be problems on ingest and data loss

Magesh: because we know we are running at a specific volume and with specific opinions, we can tune for that. I would assume that in kafka…

Ke: We have cases where the producer and consumer counts don't match.

Magesh: we have another learning: we have an agent, which is optional for us. You can send data to the local agent, which ships data into the infrastructure. The advantage (in 2017) was that you get out of the misconfiguration problem, as you can control the agent. The tracing team manages it, to take that off applications.

Takeaways

Adrian

  • liked the neo4j and data hygiene things

Nathan

  • was neat to hear options around sampling, sparked ideas around a class of requests

Jordi

  • it was nice that Magesh had so many answers for the problems people face
  • I feel I have ideas and answers because I have a better sense of how the end state would work
  • good to have a sense of confidence around 100% sampling

Chris Wensel

  • looking into how to get enough confidence without 100% data, including graph models
  • next 6 months would like to circle back about pipeline arch and what a project would look like for inline transformations
    • under the guise of GDPR, impedance between libraries and drift, and some guidance about what example architectures could be used (like KStreams)
  • there's an opportunity for standardising more formally around the zipkin format for a hub-spoke model.

Comments:

This is awesome!! Thanks for sharing!

Posted by jeqo at Mar 14, 2019 14:43
