
Introduction and Scope

Yelp has been using Zipkin in production since 2014 and has integrated it into most of our systems and services over time. We run around 250 services, mostly written in Python, plus a few in Java and NodeJS.

All of our Python services are instrumented using the py_zipkin and pyramid_zipkin client libraries. Yelp developed and open-sourced both of these libraries and still maintains them.
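
For context, here is a minimal py_zipkin sketch of what the instrumentation boils down to (pyramid_zipkin wires this up automatically around each Pyramid request); the service name, span name and transport handler below are placeholders, not our production code:

```python
from py_zipkin.zipkin import zipkin_span


def print_transport(encoded_span):
    # Trivial transport handler; in production the encoded spans are shipped
    # to Kafka instead of being printed.
    print(encoded_span)


def handle_request():
    # pyramid_zipkin opens a root span like this around every incoming request.
    # sample_rate is a percentage, so 0.1 keeps roughly 1 request in 1000.
    with zipkin_span(
        service_name='example_service',   # placeholder service name
        span_name='GET /status',          # placeholder endpoint name
        transport_handler=print_transport,
        sample_rate=0.1,
    ):
        return 'ok'                       # application code runs inside the span
```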

System Overview

Yelp has a different Zipkin setup than most other companies: we actually run two Zipkin pipelines in parallel.

The first is a standard Zipkin pipeline. Services log sampled Zipkin spans to Kafka, and we then use OpenZipkin's Kafka collector to ingest them into a Cassandra cluster with a one-week TTL. Because these traces are stored for a relatively long time and flow through Kafka, we sample them heavily (at about 0.1% probability).
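
As a rough illustration of the transport side, a py_zipkin transport handler that ships encoded spans to Kafka can look something like the sketch below (the topic name and bootstrap servers are placeholders, and kafka-python is used here purely for illustration):

```python
from kafka import KafkaProducer
from py_zipkin.transport import BaseTransportHandler


class KafkaTransport(BaseTransportHandler):
    """Sends encoded span payloads to a Kafka topic for the collector to ingest."""

    def __init__(self, bootstrap_servers, topic='zipkin'):
        self.producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
        self.topic = topic

    def get_max_payload_bytes(self):
        # Keep payloads under Kafka's default maximum message size.
        return 500 * 1024

    def send(self, payload):
        # payload is a batch of spans already encoded by py_zipkin.
        self.producer.send(self.topic, payload)
```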

Zipkin Firehose

The second pipeline is what we call the "zipkin firehose". It is not sampled, so it contains traces for 100% of the requests we receive in production. This obviously generates a lot of traffic, so we've had to heavily customize our architecture to keep costs down.

  • Instead of writing to Kafka, services send 100% of their spans to a local Go daemon called "zipkin-proxy". This daemon buffers spans locally and then writes them directly to Cassandra asynchronously.
  • The secondary indexes and tables in the default Zipkin Cassandra schema cause a big write amplification: each span gets written both to the span table and to each of the indexes used to search span names by service and traces by span name. To avoid this, we've made these extra tables optional in Zipkin. If you start your Zipkin server with SEARCH_ENABLED=false, it serves a UI with search disabled, and the only way to access a trace is by traceId. That's enough for us, since we log the traceId in every access and error log, so we have an easy way to find the traceIds of requests we care about.
  • Amazon charges for cross-AZ traffic, so we've launched a completely separate Cassandra cluster in each availability zone. This saves a lot of money on network costs, but it means the UI now needs to query multiple clusters to reconstruct a full trace.
  • Rather than changing the Zipkin server to support multiple clusters (which is hard to do if you have to support every possible datastore), we added a small proxy called zipkin_mux. It intercepts all API calls made by the Zipkin UI, forwards them to every cluster we run in production, and then merges the results (a rough sketch of this fan-out-and-merge appears after this list). This is completely invisible to the Zipkin UI and backend, so it didn't require any changes to the Java code.
  • Lastly, since this pipeline handles a huge amount of data, we've limited the TTL to 1 hour. That's more than enough in most cases, as the zipkin firehose is mostly used during outages, when developers are trying to debug what's happening at that exact moment.
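
To make the multi-cluster lookup concrete, here is a rough Python sketch of the fan-out-and-merge idea behind zipkin_mux (the real proxy is a separate service, and the cluster URLs below are placeholders). It calls the standard Zipkin API endpoint GET /api/v2/trace/{traceId} on each per-AZ cluster and concatenates the span lists:

```python
import requests

# Placeholder URLs: one Zipkin API endpoint per availability-zone cluster.
CLUSTERS = [
    'http://zipkin-az-a.example.com:9411',
    'http://zipkin-az-b.example.com:9411',
    'http://zipkin-az-c.example.com:9411',
]


def get_merged_trace(trace_id):
    """Fetch a trace from every cluster and merge the span lists."""
    spans = []
    for base_url in CLUSTERS:
        resp = requests.get('{}/api/v2/trace/{}'.format(base_url, trace_id))
        if resp.status_code == 200:
            spans.extend(resp.json())
        # A 404 just means this cluster stored no spans for that trace.
    return spans
```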

Splunk

We also ingest all our Zipkin spans from the regular pipeline into Splunk. That gives us the ability to run complex queries on our Zipkin data across multiple requests. It also lets us build smarter dashboards and alerts, because we can use Zipkin's dependency tree to understand which service is calling which, and so show only the relevant errors or metrics.
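
The queries themselves live in Splunk, but the following standalone sketch shows the shape of the join: given Zipkin v2 JSON spans, the caller/callee relationship falls out of matching each span's parentId against its parent's localEndpoint service name (field names are from the Zipkin v2 span format):

```python
from collections import Counter


def service_call_pairs(spans):
    """Count (caller service, callee service) pairs from Zipkin v2 JSON spans."""
    service_by_span_id = {
        span['id']: span.get('localEndpoint', {}).get('serviceName')
        for span in spans
    }
    pairs = Counter()
    for span in spans:
        caller = service_by_span_id.get(span.get('parentId'))
        callee = span.get('localEndpoint', {}).get('serviceName')
        if caller and callee and caller != callee:
            pairs[(caller, callee)] += 1
    return pairs
```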

Data Conventions

Service name

We use the Zipkin service name to represent a Kubernetes deployment. It is consistent with logging (Splunk) and metrics (Prometheus and statsd), as they all source the same host ENV variable. Zipkin data is also ingested into Splunk, so the service name is a primary correlation field in logs.

Span name

For local root spans (kind=SERVER), the span name is the same as the Pyramid endpoint name. This cardinality is not guaranteed to be fixed; there are over a thousand HTTP endpoints.

Site-specific tags

Some processes add quite a lot of tags; the monolith, for example, adds two dozen. In practice, a dozen or fewer have practical value. The following are span tags we frequently use in search, aggregation or support.

"local root" means it is added once per entry point span (ex span.kind=SERVER) instead of copying to all children of that. This permits request + process constants to be queryable, but with less overhead than copying to all resulting local and CLIENT spans.

| Tag | Description | Details |
|---|---|---|
| host | hostname of the localEndpoint | primary aggregation field, local root, sourced from ENV |
| region | AWS region of the pod | primary aggregation field, local root, sourced from ENV |
| version_SHA | version of the application | primary aggregation field, local root, sourced from ENV |
| request_count | count of served web requests | local root; the first few requests after startup are often slow |
| query | MySQL query fingerprint | helps the database team find the cause of problematic queries in logs |
| calling_method | source file and line number | helps with MySQL when an ORM is in use and hides the query |
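
As a rough sketch (not our exact code) of how tags like these end up on the local root span only, py_zipkin lets you attach them to the current span via update_binary_annotations; the environment variable names below are placeholders:

```python
import os

from py_zipkin.zipkin import zipkin_span


def traced_request(transport_handler):
    with zipkin_span(
        service_name='example_service',   # placeholder service name
        span_name='GET /status',          # placeholder endpoint name
        transport_handler=transport_handler,
        sample_rate=0.1,
    ) as span:
        # Added once on the local root (SERVER) span, not copied to children.
        span.update_binary_annotations({
            'host': os.environ.get('HOSTNAME', 'unknown'),
            'region': os.environ.get('AWS_REGION', 'unknown'),
            'version_SHA': os.environ.get('VERSION_SHA', 'unknown'),
        })
        return 'ok'  # application code runs here
```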

We don't use Zipkin's autocomplete tag functionality, as the indexes would be far too expensive and redundant with the indexing we already have in Splunk.

Current status

Our zipkin firehose pipeline was ingesting more than 1 million spans/second on average as of January 2020. However, we still haven't finished instrumenting all of our services, as we don't yet have firehose support for our PHP and NodeJS ones.
