Releases: openzipkin/zipkin
Zipkin 1.11
Zipkin 1.11 allows you to see instrumented clients in the dependency view. It also fixes a search collision problem.
Before, the dependency view (e.g. http://your_host:9411/dependency) presented a server-centric diagram. This worked well enough because traces usually start at the first server. With new projects like zipkin-js, though, client-originated traces are becoming more common: for example, the trace could start in your web browser instead of on a server. Zipkin's dependency linker now looks for client send annotations in the root span and, if present, adds the client to the far left of the dependency graph. Thanks to @rogeralsing for reporting.
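As a rough illustration of what the linker now looks for (this is not its actual code, and the class and method names are made up), picking the calling service out of a client-originated root span with the v1 Java model could look like:

import zipkin.Annotation;
import zipkin.Constants;
import zipkin.Span;

class RootSpanClient {
  // Returns the service that recorded "cs" on the root span,
  // or null if the trace wasn't client-originated.
  static String clientServiceName(Span root) {
    for (Annotation a : root.annotations) {
      if (Constants.CLIENT_SEND.equals(a.value) && a.endpoint != null) {
        return a.endpoint.serviceName; // e.g. "web-browser" for a zipkin-js trace
      }
    }
    return null;
  }
}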
We also fixed a search bug where a query like http.method=GET matched against any service in a trace, rather than only the service specified in the UI. This affected all storage types except cassandra and is now fixed.
Note: While seemingly simple, this smoked out a latent problem in our Elasticsearch indexing template. Please re-index at your earliest convenience, or drop the index and let Zipkin recreate it.
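To make the corrected search behavior concrete, here is a simplified sketch of the matching rule using the v1 Java model; the class and method names are made up, and only string-typed binary annotations are considered:

import java.nio.charset.StandardCharsets;
import zipkin.BinaryAnnotation;
import zipkin.Span;

class AnnotationQuery {
  // A "key=value" query only matches spans recorded by the service selected in
  // the UI, not any service that happens to be in the same trace.
  static boolean matches(Span span, String serviceName, String key, String value) {
    for (BinaryAnnotation b : span.binaryAnnotations) {
      boolean sameService = b.endpoint != null && serviceName.equals(b.endpoint.serviceName);
      boolean sameKeyValue = key.equals(b.key)
          && value.equals(new String(b.value, StandardCharsets.UTF_8));
      if (sameService && sameKeyValue) return true;
    }
    return false;
  }
}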
Zipkin 1.10
Zipkin 1.10 addresses a couple of long-term problems relating to span timestamp and duration.
Firstly, we no longer attempt to support duration queries on the "cassandra" storage type. Cassandra 2.2 doesn't support SASI indexing, and trying to work around that resulted in a feature most couldn't use. @michaelsembwever from The Last Pickle has a more sustainable solution in mind that uses Cassandra 3.8+. Please look for announcements on the experimental cassandra3 storage type.
Next is something that applies to all storage types. When trace instrumentation doesn't record Span.timestamp and duration, the Zipkin server tries to guess them by looking at annotations. Previously, when we guessed wrong, the trace would render strangely. We now guess much more conservatively to avoid this; a sketch of the fallback follows the list below.
Here's the impact:
- Span duration is no longer derived by collectors, as it is often wrong. Duration queries won't work unless traces reported to zipkin include duration.
- Span timestamp is derived only when needed, usually to support indexing.
- Span timestamp and duration are still backfilled at query time, as otherwise the UI wouldn't work.
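To make the conservative fallback concrete, here is a minimal sketch using the v1 Java model; it is not the server's exact heuristic, and the class and method names are made up:

import zipkin.Annotation;
import zipkin.Span;

class TimestampGuess {
  // Only derive a timestamp when the reported one is missing: fall back to the
  // earliest annotation timestamp so the span can still be indexed.
  static Long guessTimestamp(Span span) {
    if (span.timestamp != null) return span.timestamp; // trust what was reported
    Long earliest = null;
    for (Annotation a : span.annotations) {
      if (earliest == null || a.timestamp < earliest) earliest = a.timestamp;
    }
    return earliest; // null when there are no annotations to guess from
  }
}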
Note: The Span.timestamp and duration fields were added a year ago, but many tracers still don't record them. We hope our documentation on how to record timestamp and duration will help ease the task of updating them. If you use a tracer that doesn't yet record Span.timestamp and duration, please raise an issue or PR to the corresponding repository so that it is eventually fixed.
Zipkin 1.8
Zipkin 1.8 is a library change focused on encoding performance. If you are instrumenting apps and use Zipkin's Codec, you'll want to upgrade.
Span encoding has been completely rewritten to bring common-case overhead down to a microsecond or less.
Zipkin 1.7 Codec.writeSpan() vs libthrift (pace car)
CodecBenchmarks.writeClientSpan_json_zipkin avgt 15 17.131 ± 0.446 us/op
CodecBenchmarks.writeClientSpan_thrift_libthrift avgt 15 1.952 ± 0.043 us/op
CodecBenchmarks.writeClientSpan_thrift_zipkin avgt 15 0.996 ± 0.021 us/op
CodecBenchmarks.writeLocalSpan_json_zipkin avgt 15 10.124 ± 0.177 us/op
CodecBenchmarks.writeLocalSpan_thrift_libthrift avgt 15 1.168 ± 0.016 us/op
CodecBenchmarks.writeLocalSpan_thrift_zipkin avgt 15 0.593 ± 0.010 us/op
CodecBenchmarks.writeRpcSpan_json_zipkin avgt 15 43.495 ± 1.086 us/op
CodecBenchmarks.writeRpcSpan_thrift_libthrift avgt 15 4.878 ± 0.046 us/op
CodecBenchmarks.writeRpcSpan_thrift_zipkin avgt 15 2.666 ± 0.018 us/op
CodecBenchmarks.writeRpcV6Span_json_zipkin avgt 15 49.759 ± 0.867 us/op
CodecBenchmarks.writeRpcV6Span_thrift_libthrift avgt 15 5.390 ± 0.073 us/op
CodecBenchmarks.writeRpcV6Span_thrift_zipkin avgt 15 3.147 ± 0.026 us/op
Zipkin 1.8 Codec.writeSpan() vs libthrift (pace car)
CodecBenchmarks.writeClientSpan_json_zipkin avgt 15 1.445 ± 0.036 us/op
CodecBenchmarks.writeClientSpan_thrift_libthrift avgt 15 1.951 ± 0.014 us/op
CodecBenchmarks.writeClientSpan_thrift_zipkin avgt 15 0.433 ± 0.011 us/op
CodecBenchmarks.writeLocalSpan_json_zipkin avgt 15 0.813 ± 0.010 us/op
CodecBenchmarks.writeLocalSpan_thrift_libthrift avgt 15 1.191 ± 0.016 us/op
CodecBenchmarks.writeLocalSpan_thrift_zipkin avgt 15 0.268 ± 0.004 us/op
CodecBenchmarks.writeRpcSpan_json_zipkin avgt 15 3.606 ± 0.068 us/op
CodecBenchmarks.writeRpcSpan_thrift_libthrift avgt 15 5.134 ± 0.081 us/op
CodecBenchmarks.writeRpcSpan_thrift_zipkin avgt 15 1.384 ± 0.078 us/op
CodecBenchmarks.writeRpcV6Span_json_zipkin avgt 15 3.912 ± 0.115 us/op
CodecBenchmarks.writeRpcV6Span_thrift_libthrift avgt 15 5.488 ± 0.098 us/op
CodecBenchmarks.writeRpcV6Span_thrift_zipkin avgt 15 1.323 ± 0.014 us/op
Why encoding speed matters
Applications that report to Zipkin typically record timing information and metadata on the calling thread. After the operation completes, this is encoded into a Span and scheduled to go out of process, usually via HTTP or Kafka. When the encoding overhead is measurable, it can confuse timing information, particularly when operations complete in single-digit milliseconds or less.
For example, if a local operation takes 400us and your encoding overhead is 40us, there will be a 10% gap between the end of one span and the start of the next. This notably skews the duration of the parent, particularly when there are a lot of spans like this. When encoding overhead is in the single-digit microseconds or less, the problem is far less noticeable.
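For context on what the benchmarks above measure, here is a minimal usage sketch of the codec, assuming the zipkin Java library's Codec and Span builder; the class name and values are illustrative:

import zipkin.Codec;
import zipkin.Span;

public class EncodeExample {
  public static void main(String[] args) {
    Span span = Span.builder()
        .traceId(1L)
        .id(1L)
        .name("get")
        .timestamp(1469451886000000L) // epoch microseconds
        .duration(400L)               // microseconds
        .build();

    // Encoding happens after the timed operation completes; in 1.8 the
    // common case costs roughly a microsecond or less.
    byte[] json = Codec.JSON.writeSpan(span);
    byte[] thrift = Codec.THRIFT.writeSpan(span);
    System.out.println(json.length + " json bytes, " + thrift.length + " thrift bytes");
  }
}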
Zipkin 1.7
Zipkin 1.7 has a lot to offer, thanks to users telling us what they'd like.
@dragontree101 wanted to be able to know which version of zipkin his server was running. @shakuzen landed the /info endpoint, which prints out something like this:
{
  "zipkin": {
    "version": "1.7.0"
  }
}
@mikewrighton wants to run zipkin-ui from a different host than zipkin-server. @hyleung spiked a new variable you can use to control cross-origin policy. For example, you can export ZIPKIN_QUERY_ALLOWED_ORIGINS=http://foo.bar.com, if you are the lucky owner of foo.bar.com!
@dan-tr uses Zipkin with Elasticsearch, but found our microsecond timestamps didn't work out of the box with Kibana. He suggested we add a timestamp_millis field, and we did, because it was a smart idea!
@ivansenic works on an APM called inspectIT. He rightly noted there's still a ton of Java 6 VMs out there that need to be traceable by Java agents. Now, zipkin.jar is an agent-friendly, 152k jar full of Java 6 bytecode (still with no dependencies!).
We're occasionally asked where javadocs are published. Thanks to @abesto's automation expertise, historical javadocs can now be found at http://zipkin.io/zipkin/
Finally, we're looking for incremental and compatible ways to improve zipkin's model, particularly for asynchronous activity (like tracing Kafka). If you are interested in steering us, please comment on..
Thanks for keeping with us,
OpenZipkin
Zipkin 1.6
Zipkin 1.6 server has been updated to use Spring Boot 1.4.
We've also corrected default values around the UI, which should lead to better search performance. Most notably, startTs defaults to 1 hour back instead of 7 days back. #1212
- Note: You can reset the lookback to whatever you like. For example, you might set JAVA_OPTS="-Dzipkin.ui.default-lookback=86400000" for 1 day. Settings like this are documented in the README.
Zipkin 1.5
Zipkin 1.5 is all about the dependency view in the UI.
Many of you may have seen the dependency tab, but never any data in it. This would be the case if you were running Cassandra or Elasticsearch.
What you should have seen is a diagram showing the relative volume of calls between services (with your services in it, of course!).
Zipkin 1.5 includes support for populating the data behind this screen for all storage options (mysql, cassandra and elasticsearch).
The job that produces this data is called zipkin-dependencies. Zipkin Dependencies aggregates links between services into a daily bucket. This means you should run it daily, like a batch job (even though it's Spark underneath). In fact, our docker image includes a cron setup to do that for you!
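Conceptually, the aggregation boils down to counting caller to callee pairs for one UTC day and saving the counts into that day's bucket. Here is a simplified sketch, not the Spark job itself, with a made-up class name:

import java.util.HashMap;
import java.util.Map;

class DependencyLinkCounter {
  // Count calls per (parent, child) service pair for one day's worth of spans.
  static Map<String, Long> countLinks(Iterable<String[]> parentChildPairs) {
    Map<String, Long> callCounts = new HashMap<>();
    for (String[] pair : parentChildPairs) {
      String key = pair[0] + "->" + pair[1]; // e.g. "zipkin-server->cassandra"
      callCounts.merge(key, 1L, Long::sum);
    }
    return callCounts; // persisted into the day's bucket, e.g. the zipkin_dependencies table
  }
}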
For example, here's a run against a small cassandra DB using spark standalone (default):
$ STORAGE_TYPE=cassandra CASSANDRA_CONTACT_POINTS=192.168.99.100 java -jar zipkin-dependencies.jar
Running Dependencies job for 2016-07-23: 1469232000000000 ≤ Span.timestamp ≤ 1469318399999999
11:05:09.653 [main] WARN o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11:05:09.706 [main] WARN org.apache.spark.util.Utils - Your hostname, acole resolves to a loopback address: 127.0.0.1; using 192.168.1.10 instead (on interface en0)
11:05:09.706 [main] WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
11:05:11.078 [main] WARN com.datastax.driver.core.NettyUtil - Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead.
Saved with day=2016-07-23
Dependencies: [{"parent":"brave-resteasy-example","child":"brave-resteasy-example","callCount":1}, {"parent":"zipkin-server","child":"cassandra","callCount":14}]
Upgrading
If you are using cassandra or elasticsearch, you should upgrade to zipkin 1.5, but there's no schema-related change required.
If you are using mysql, you'll need to add a new table for this to work. Here's a copy/paste of the DDL for your convenience.
CREATE TABLE IF NOT EXISTS zipkin_dependencies (
  `day` DATE NOT NULL,
  `parent` VARCHAR(255) NOT NULL,
  `child` VARCHAR(255) NOT NULL,
  `call_count` BIGINT
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
ALTER TABLE zipkin_dependencies ADD UNIQUE KEY(`day`, `parent`, `child`);
Credits
The spark job was originally written by @yurishkuro, based on a hadoop job written by @eirslett years ago. In other words, the job itself isn't new; what's new is its accessibility. Before, it only worked with cassandra and wasn't published to maven central or integrated with docker. Now, it should be easy for anyone to include this functionality in their deployment.
Zipkin 1.4
Zipkin 1.4 most notably includes the ability to store and show IPv6 addresses associated with services.
Endpoint.ipv6
Zipkin span data can now include an IPv6 address on an Endpoint, binary-encoded in thrift or text-encoded in json. If using MySQL, you need to add a column to store this. No action is needed in Cassandra or Elasticsearch. See #1178
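If you report spans from the Java library, here is a hedged sketch of attaching an IPv6 address, assuming Endpoint.builder() exposes an ipv6(byte[]) setter per #1178; the class name and address are illustrative:

import java.net.InetAddress;
import zipkin.Endpoint;

public class Ipv6EndpointExample {
  public static void main(String[] args) throws Exception {
    // For an IPv6 literal, getAddress() returns the 16-byte form the field expects.
    byte[] ipv6 = InetAddress.getByName("2001:db8::c001").getAddress();
    Endpoint endpoint = Endpoint.builder()
        .serviceName("my-service")
        .ipv6(ipv6)
        .build();
    System.out.println(endpoint);
  }
}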
Operational Improvements
- Adds SCRIBE_ENABLED: set to false to disable scribe
- Adds SELF_TRACING_SAMPLE_RATE: set to a low value like 0.001 to safely self-trace production
Zipkin 1.3
Zipkin 1.3 includes highlighting of spans in error state and improvements to the Cassandra storage component.
Error annotations
Inspired by recent work in OpenTracing, we've added a new annotation, "error". When used as an annotation value, it indicates a potentially transient error occurred at that point in time. When used as a binary annotation key, the value is a human-readable message associated with an error that resulted in a failed span. See #1140 for details.
Thanks to @virtuald, the UI acts on these rules, highlighting degraded spans in yellow and failed ones in red.
Instrumentation (like Brave, zipkin-tracer, etc.) needs to change to support this. Please help if you have time!
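For instrumentation authors, here is a rough sketch of both forms using the v1 Java model; the helper class and method names are made up:

import zipkin.Annotation;
import zipkin.BinaryAnnotation;
import zipkin.Endpoint;

class ErrorAnnotations {
  // An "error" annotation marks a potentially transient error at a point in time.
  static Annotation transientError(long timestampMicros, Endpoint local) {
    return Annotation.create(timestampMicros, "error", local);
  }

  // An "error" binary annotation carries the human-readable message of a failure.
  static BinaryAnnotation failureMessage(String message, Endpoint local) {
    return BinaryAnnotation.create("error", message, local);
  }
}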
Span.timestamp, duration 0 coerce to null
We've noticed some instrumentation logs an invalid timestamp or duration of 0 when it meant to log null. A timestamp or duration of 0 microseconds is invalid and doesn't explain latency, so we now coerce these 0s to null. Where a sub-microsecond span duration occurred, you should round up to 1. See #1155 and #1176
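If your instrumentation measures with a monotonic clock, one way to follow the round-up advice (a sketch, not a prescribed API; the class and method names are made up) is:

import java.util.concurrent.TimeUnit;

class DurationRounding {
  // Never report 0: a sub-microsecond measurement rounds up to 1 microsecond.
  static long durationMicros(long elapsedNanos) {
    return Math.max(1L, TimeUnit.NANOSECONDS.toMicros(elapsedNanos));
  }
}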
Elasticsearch daily bucket fix
We found and fixed a concurrency bug that could put spans into the wrong daily buckets. See #1175
Cassandra
Schema bug fix
We found a bug where traces against the same service in the same millisecond weren't indexed. This affects indexes only (trace data itself wasn't lost). For example, you might find a trace that exists in cassandra, but you can't query it using the api.
Specifically, the following indexes now have trace_id added to their PRIMARY KEY definitions.
- service_span_name_index
- service_name_index
- annotations_index
There's no automatic data migration available. The most straightforward way to address this in an existing cluster is to drop these indexes and restart a zipkin server (which will recreate them as long as CASSANDRA_ENSURE_SCHEMA=true). You can also update the indexes manually based on the schema.
Tuning
We've done a lot of work tuning the amount of data written to indexes on a per-span basis. Those using Cassandra should see a significant drop in index size due to reasons documented in the tuning section of the README.
Query logging
Those supporting zipkin may need to debug query latency. We now use the QueryLogger, which is enabled when the log category "com.datastax.driver.core.QueryLogger" is set to debug or trace level. Trace level includes bound values. See #1156
Zipkin 1.2
Zipkin 1.2.1 includes Prometheus metrics and Elasticsearch bug fixes.
Prometheus metrics are enabled by default, under the /prometheus endpoint.
Many thanks to Kristian from Iterate for developing this feature!