
Collector surge and error handling


A number of GitHub issues have been raised, usually about Elasticsearch being over capacity and what to do about it. This page explores various options and what they can or cannot solve.

Call Semaphore

As of October 2017, the Elasticsearch storage implementation includes a Semaphore to limit in-flight requests.  When Zipkin collects span data from a polling source such as Kafka or RabbitMQ, this Semaphore should be relatively transparent and not affect collection rates.

However, when using an RPC collector (such as HTTP), Zipkin can't natively control the ingestion rate.  As a result, requests to store span information are eagerly dropped in order to avoid excessive queueing and, eventually, an OutOfMemoryError.  A different problem can come up with this approach: if 200 servers send span information at the same time and the Semaphore is set to 64 requests, 136 of them will be dropped even if the number of spans is small.  That much data wouldn't cause Zipkin itself to die, and, probably worse, the StorageComponent being written to may also have been able to handle that load without issue.  Increasing the Semaphore count seems like an obvious solution, but if Zipkin is running in a cluster it is hard to know how to tune the Semaphore so that requests aren't excessively dropped and storage isn't overburdened.
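The semaphore idea can be illustrated with a minimal sketch. This is not Zipkin's actual code: the class and method names (`BoundedWrites`, `storeSpans`) and the limit are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.Semaphore;

/** Illustrative only: eagerly drop writes when too many are in flight. */
class BoundedWrites {
  final Semaphore inFlight;

  BoundedWrites(int maxRequests) { // e.g. 64, analogous to the limit described above
    this.inFlight = new Semaphore(maxRequests);
  }

  /** Returns false when the request was dropped instead of queued. */
  boolean storeSpans(List<byte[]> encodedSpans, Runnable actualWrite) {
    if (!inFlight.tryAcquire()) return false; // eager drop: no queueing, no OOM risk
    try {
      actualWrite.run();
      return true;
    } finally {
      inFlight.release();
    }
  }
}
```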

SpanConsumer queue

In application instrumentation, if each application thread reported span information directly, you would have a surge in traffic to Zipkin. To avoid this, we have an AsyncReporter with a fixed-length buffer and only a single clearing thread to send data to Zipkin. Usually, the backlog can be processed by a blocking flush loop, even if only one thread is doing that.
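For reference, a typical AsyncReporter setup looks roughly like the following, using the zipkin-reporter library. The endpoint and the sizing values are placeholders, not recommendations.

```java
import java.util.concurrent.TimeUnit;
import zipkin2.Span;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.urlconnection.URLConnectionSender;

public class ReporterSetup {
  public static AsyncReporter<Span> create() {
    // Placeholder endpoint; adjust to your Zipkin deployment.
    URLConnectionSender sender =
        URLConnectionSender.create("http://localhost:9411/api/v2/spans");
    return AsyncReporter.builder(sender)
        .queuedMaxSpans(10_000)              // fixed-length buffer: overflow is dropped
        .messageTimeout(1, TimeUnit.SECONDS) // the single flush thread drains on this cadence
        .build();
  }
}
```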

The AsyncReporter helps ensure that load created by several clients is applied to Zipkin evenly, so as to minimize DDoS-like issues.  However, two situations can arise that would cause issues:

  1. The AsyncReporters somehow report in lock-step, meaning all of them get on the same cycle and report at the same instant. This is a highly unlikely scenario, but it can intermittently happen as buffers are flushed due to exceeding their size bound instead of a time bound.
  2. In situations where there are many short-lived JVM processes (instead of long-lived web services), the AsyncReporter doesn’t have time to really manage the sending of span data. This entirely nullifies its usefulness.

When reporting to Zipkin via a buffered protocol such as a message queue (e.g. Kafka), the above scenarios shouldn’t be problems, as Zipkin can happily poll at its own rate.  The issue, again, arises when reporting over RPC (e.g. HTTP), where there is no buffer in the middle.  In these situations, wrapping the SpanConsumer with a small, bounded queue (possibly the same one AsyncReporter uses) can help avoid the tip of a request spike being dropped due to aggressive measures to keep Zipkin alive.  This queue wouldn’t alleviate all issues, as it could still fill up.  But, assuming the storage is operating in optimal conditions, there is a distinct possibility it would be able to ingest all the spans without issue.  To a certain extent, a queue like this turns the HTTP problem into something similar to how we address Kafka, except that the queue is not durable or checkpointed, etc.
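A sketch of such a wrapper might look like the following. It is illustrative only: `QueueingSpanConsumer`, the queue bound, and the batch size are made-up, and real code would need shutdown handling, metrics, and error reporting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import zipkin2.Call;
import zipkin2.Span;
import zipkin2.storage.SpanConsumer;

/** Illustrative sketch: absorb spikes in a small bounded queue, drain with one thread. */
class QueueingSpanConsumer implements SpanConsumer {
  final SpanConsumer delegate;
  final BlockingQueue<Span> queue = new ArrayBlockingQueue<>(10_000); // bound is arbitrary

  QueueingSpanConsumer(SpanConsumer delegate) {
    this.delegate = delegate;
    Thread drain = new Thread(this::drainLoop, "span-queue-drain");
    drain.setDaemon(true);
    drain.start();
  }

  @Override public Call<Void> accept(List<Span> spans) {
    for (Span span : spans) {
      queue.offer(span); // drop on overflow rather than block the caller
    }
    return Call.create(null); // acknowledge immediately; storage happens asynchronously
  }

  void drainLoop() {
    while (true) {
      try {
        List<Span> batch = new ArrayList<>();
        batch.add(queue.take());          // wait for at least one span
        queue.drainTo(batch, 999);        // then collect a small batch
        delegate.accept(batch).execute(); // synchronous write keeps storage load steady
      } catch (Exception e) {
        // Illustrative only: real code would track drops and failures via metrics.
      }
    }
  }
}
```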

Overload Errors

Certain storage implementations, such as Cassandra and Elasticsearch, can return exceptions indicating an overload condition. If we normalized this type of exception, buffered collectors could decide to push failed spans back onto their queue.  Care needs to be taken with Elasticsearch specifically, as bulk requests can partially succeed!
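A normalization check could be sketched roughly like this. It is illustrative: matching on class names avoids compile-time dependencies on storage drivers, and the signals shown (Cassandra's OverloadedException, rejected-execution errors, HTTP 429 in a message) are examples, not an exhaustive or authoritative list.

```java
/** Illustrative sketch: classify storage failures that signal overload rather than bad data. */
final class OverloadErrors {
  static boolean isOverload(Throwable error) {
    Throwable cause = error;
    for (int depth = 0; cause != null && depth < 10; cause = cause.getCause(), depth++) {
      String type = cause.getClass().getSimpleName();
      // Example signals only; real code would use the storage driver's own exception types.
      if (type.contains("Overloaded")
          || type.contains("RejectedExecution")
          || String.valueOf(cause.getMessage()).contains("429")) {
        return true;
      }
    }
    return false;
  }

  private OverloadErrors() {}
}
```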

Design Concerns

Do not block push clients

The most common push transport is HTTP. It may seem handy to block a client until storage capacity is available, but this can recreate problems we avoided with a lot of work in the past. That prior art included more efficient decoders, async storage options, and returning 202 accordingly. Here are some concerns to keep in mind and why blocking clients can be problematic:

  • Not all span reporters are written well. In some cases they block production threads. Blocking clients all the way to storage would cause more damage here.
  • Even async reporters are not designed to be blocked for long periods. A large client-side backlog can in itself cause side effects, even though these should be minimal.
  • Blocking can cause concerns such as thundering herd once capacity is restored, depending on how things are implemented.

It is possible to design things in such a way that these and other concerns not mentioned are minimized, but anything that suggests blocking clients should be considered very carefully. One goal of observability is to do no harm; that's why spans are allowed to drop. Buffering differently is likely a better optimization than blocking.
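To make the principle concrete, a toy endpoint following this guidance might look like the sketch below. It uses the plain JDK HttpServer and is not Zipkin's actual collector: it always answers 202 and silently drops payloads when its buffer is full, rather than blocking the reporting client.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy example: accept spans without ever blocking the reporting client. */
public class NonBlockingCollector {
  public static void main(String[] args) throws Exception {
    BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(1_000); // arbitrary bound

    HttpServer server = HttpServer.create(new InetSocketAddress(9411), 0);
    server.createContext("/api/v2/spans", exchange -> {
      byte[] body;
      try (InputStream in = exchange.getRequestBody()) {
        body = in.readAllBytes();
      }
      buffer.offer(body); // non-blocking: if the buffer is full, the payload is dropped
      exchange.sendResponseHeaders(202, -1); // acknowledge immediately either way
      exchange.close();
    });
    server.start();
    // A separate consumer would drain `buffer` into storage at its own pace (omitted).
  }
}
```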

On Disk Queue

At some point the collector may not be able to keep up. One example noted was RocksDB (or, more rarely, LevelDB) used as an on-disk buffer when memory buffers overflow. It may give the collector a bit of breathing room before it starts dropping spans.
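A minimal illustration of the idea, assuming the RocksDB Java bindings (org.rocksdb:rocksdbjni). The key scheme and the spill trigger are arbitrary, and a real implementation would need replay, checkpointing, and cleanup logic.

```java
import java.nio.ByteBuffer;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

/** Illustrative sketch: spill encoded spans to disk when the in-memory buffer is full. */
class DiskSpillover implements AutoCloseable {
  static { RocksDB.loadLibrary(); }

  final RocksDB db;
  long sequence; // monotonically increasing key preserves insertion order

  DiskSpillover(String path) throws RocksDBException {
    this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
  }

  /** Called when the memory buffer rejects a batch of encoded spans. */
  void spill(byte[] encodedSpans) throws RocksDBException {
    byte[] key = ByteBuffer.allocate(Long.BYTES).putLong(sequence++).array();
    db.put(key, encodedSpans);
  }

  @Override public void close() { db.close(); }
}
```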
