Streaming SDK proposal #2
I work on the OpenTracing C++ specification and the instrumentations for NGINX and Envoy. Recently, I've been working with LightStep to improve the performance of their C++ tracer. Our efforts to reduce instrumentation cost and improve collection throughput led us to a more efficient and streamlined design that I propose we adopt as one of the SDKs for OpenTelemetry. The key components of the design are 1) we remove intermediate storage on span objects and instead serialize eagerly as methods are called and 2) we use a domain-specific load balancing algorithm built upon non-blocking sockets, vectored-io, and io multiplexing.
Note: This only discusses the design as it relates to tracing. I plan to update it to include metrics as the proposal progresses.
At a high level, the design looks like this
Here's a diagram of the architecture.
LightStep's C++ tracer provides an example implementation of this design. These are the main components
Serializing eagerly in (1), instead of storing data in intermediary structures that get serialized later when the span is finished, eliminates unnecessary copying and allows us to avoid small heap allocations. This leads to much lower instrumentation cost. As part of my work on the LightStep tracer, I developed microbenchmarks that compare the eager serialization approach to a more traditional approach that stores data in protobuf-generated classes and serializes later when traces are uploaded. For a span with 10 small key-values attached, I measured the cost of starting and then finishing a span; the results show much better performance for the eager serialization approach:
Using a lock-free circular buffer in (2) allows the tracer to remain performant in high-concurrency scenarios. We've found that mutex-protected span buffering causes significant contention when multiple threads finish spans concurrently. I benchmarked the creation of 4000 spans partitioned evenly across a varying number of threads and got these results for a mutex-protected buffer:
The mutex-protected buffer shows slower performance when we use more than a single thread. By comparison, lock-free buffering doesn't have any such degradation.
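As a rough illustration of buffering spans without a mutex, here is a minimal sketch of a bounded multi-producer, multi-consumer ring buffer in the style of Dmitry Vyukov's well-known bounded queue. It is not the LightStep implementation; the class name `BoundedRing` and its methods are hypothetical. Note that this design avoids mutexes but is not lock-free in the strictest formal sense, since a producer preempted between claiming a slot and publishing it delays the consumer of that slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical sketch: a fixed-capacity ring where each slot carries a
// sequence number that tells producers and consumers whose turn it is.
template <typename T>
class BoundedRing {
 public:
  // capacity must be a power of two so that (pos & mask_) wraps correctly.
  explicit BoundedRing(size_t capacity)
      : cells_(new Cell[capacity]), mask_(capacity - 1) {
    assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
    for (size_t i = 0; i < capacity; ++i)
      cells_[i].seq.store(i, std::memory_order_relaxed);
  }

  // Returns false when the ring is full; the caller may drop the span.
  bool TryPush(T value) {
    size_t pos = tail_.load(std::memory_order_relaxed);
    for (;;) {
      Cell& cell = cells_[pos & mask_];
      const size_t seq = cell.seq.load(std::memory_order_acquire);
      const intptr_t diff =
          static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos);
      if (diff == 0) {
        // Slot is free for this position; try to claim it.
        if (tail_.compare_exchange_weak(pos, pos + 1,
                                        std::memory_order_relaxed)) {
          cell.value = std::move(value);
          cell.seq.store(pos + 1, std::memory_order_release);  // publish
          return true;
        }
      } else if (diff < 0) {
        return false;  // slot still holds an unconsumed item: full
      } else {
        pos = tail_.load(std::memory_order_relaxed);  // lost a race; retry
      }
    }
  }

  // Returns false when the ring is empty.
  bool TryPop(T& out) {
    size_t pos = head_.load(std::memory_order_relaxed);
    for (;;) {
      Cell& cell = cells_[pos & mask_];
      const size_t seq = cell.seq.load(std::memory_order_acquire);
      const intptr_t diff =
          static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos + 1);
      if (diff == 0) {
        if (head_.compare_exchange_weak(pos, pos + 1,
                                        std::memory_order_relaxed)) {
          out = std::move(cell.value);
          // Mark the slot as free for the producer one lap ahead.
          cell.seq.store(pos + mask_ + 1, std::memory_order_release);
          return true;
        }
      } else if (diff < 0) {
        return false;  // nothing published at this position yet: empty
      } else {
        pos = head_.load(std::memory_order_relaxed);
      }
    }
  }

 private:
  struct Cell {
    std::atomic<size_t> seq;
    T value;
  };
  std::unique_ptr<Cell[]> cells_;
  const size_t mask_;
  std::atomic<size_t> tail_{0};
  std::atomic<size_t> head_{0};
};
```

In a tracer, `T` would presumably be an already-serialized span, so threads finishing spans only contend on a single atomic counter rather than on a lock held for the duration of a copy.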
By using multiple load-balanced endpoint connections, the transport in (3) allows spans to be uploaded at a high rate without dropping data. The domain-specific load balancer takes advantage of the property that spans can be routed to any collection endpoint, and it naturally adapts to back-pressure from the endpoints to route data to where it's most capable of being received. Consider what happens as a collection endpoint starts to reach its capacity to process spans:
A goal of the SDK is to have as minimal a set of dependencies as possible. Because the eager serialization approach uses manual serialization code instead of the protobuf-generated classes, we can avoid requiring protobuf as a dependency. LightStep's current implementation uses libevent and c-ares for portable asynchronous networking and DNS resolution -- but those parts of the code could be hidden behind an interface in a way that would allow alternative libraries or platform-specific implementations to be used.
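To make the manual serialization idea concrete, here is a minimal sketch of a span that writes protobuf wire-format bytes straight into its buffer as methods are called, with no protobuf library involved. The class `EagerSpan`, its methods, and the field numbers are all hypothetical and chosen only for illustration; they do not match the opentelemetry-proto schema or the LightStep code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v as a protobuf base-128 varint.
inline void WriteVarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Number of bytes WriteVarint emits for v.
inline size_t VarintSize(uint64_t v) {
  size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// Appends the key of a length-delimited field (wire type 2).
inline void WriteLengthDelimitedKey(std::string& out, uint32_t field) {
  assert(field >= 1 && field <= 15);  // sketch assumes a one-byte key
  out.push_back(static_cast<char>((field << 3) | 2));
}

inline void WriteStringField(std::string& out, uint32_t field,
                             const std::string& s) {
  WriteLengthDelimitedKey(out, field);
  WriteVarint(out, s.size());
  out.append(s);
}

// Hypothetical span: every call appends finished bytes to one buffer, so
// nothing is stored in intermediate key-value structures.
class EagerSpan {
 public:
  explicit EagerSpan(const std::string& name) {
    WriteStringField(buffer_, kNameField, name);
  }

  void SetTag(const std::string& key, const std::string& value) {
    // A nested message needs its byte length up front, so compute it
    // from the parts: one key byte + length varint + payload, twice.
    const size_t body = 1 + VarintSize(key.size()) + key.size() +
                        1 + VarintSize(value.size()) + value.size();
    WriteLengthDelimitedKey(buffer_, kTagField);
    WriteVarint(buffer_, body);
    WriteStringField(buffer_, kKeyField, key);
    WriteStringField(buffer_, kValueField, value);
  }

  const std::string& serialized() const { return buffer_; }

 private:
  // Made-up field numbers for this sketch only.
  enum : uint32_t { kNameField = 1, kTagField = 2, kKeyField = 1, kValueField = 2 };
  std::string buffer_;
};
```

When the span finishes, `serialized()` is already a valid message fragment that can be handed to the transport as-is, which is where the savings in copying and small allocations would come from.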
Because this approach never generates a SpanData-like structure with accessors, the customization points are different from those of the traditional exporter approach. The main customization point would be the serialization functions, where a vendor could provide alternative implementations to write to a different wire format.
The LightStep implementation could be adopted into a default OpenTelemetry tracer and SDK that uses the opentelemetry-proto format but provides a customization point for the serialization. This SDK would prioritize efficiency and high throughput. While it might not be the right choice for all use cases, I would expect the OpenTelemetry C++ API to be flexible enough to support a variety of different implementations, so other use cases could be serviced by either an alternative implementation of the OpenTelemetry API or a different SDK (if we decide to support more than one).
This is an interesting proposal. A few questions:
The protobuf encoding allows a lot of flexibility in the ordering of the fields, so for the most part you can write the constituent parts in any order and have them parsed out to correctly form the full message.
For an overwritten field, it serializes the field twice. Protobuf will take the last field encoded:
I'm not sure I understand why this distinction is important. If the collection point is overloaded, it should stop reading data from client connections. From a client's perspective, why does it need to know whether the throughput for a socket is limited by a bad network or by endpoint overloading? Either way, it needs to redirect output to other sockets or start dropping spans if the total capacity is exceeded.
That would be interesting, but I didn't explore it because it would have required us to expose a new protocol from the backend at LightStep.
I had to look up the Protobuf docs for this. There is a nuance for embedded fields:
Since Span attributes are scalar values this should be fine. I don't know if this merging logic may cause issues with other fields of the Span, hopefully you have taken care of this.
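The last-value-wins behavior for scalar fields can be sketched with a tiny hand-written encoder and decoder. This is only an illustration: the function names are hypothetical, the decoder handles varint fields only, and a real protobuf parser additionally merges embedded messages rather than replacing them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v as a protobuf base-128 varint.
inline void WriteVarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Appends a varint-typed field (wire type 0).
inline void WriteUintField(std::string& out, uint32_t field, uint64_t value) {
  assert(field >= 1);  // field number 0 is not valid in protobuf
  WriteVarint(out, static_cast<uint64_t>(field) << 3);
  WriteVarint(out, value);
}

// Reads one varint starting at pos; returns false on truncated input.
inline bool ReadVarint(const std::string& in, size_t& pos, uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const uint8_t byte = static_cast<uint8_t>(in[pos++]);
    v |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;
}

// Scans the whole message and keeps the last occurrence of `field`.
// Returns false if the field is absent or the input is malformed.
inline bool LastValue(const std::string& in, uint32_t field, uint64_t& out) {
  bool found = false;
  size_t pos = 0;
  while (pos < in.size()) {
    uint64_t key = 0;
    uint64_t value = 0;
    if (!ReadVarint(in, pos, key)) return false;
    if ((key & 7) != 0) return false;  // sketch supports wire type 0 only
    if (!ReadVarint(in, pos, value)) return false;
    if ((key >> 3) == field) {
      out = value;  // a later occurrence overwrites an earlier one
      found = true;
    }
  }
  return found;
}
```

With this behavior, an eagerly serializing span can handle an overwritten scalar attribute by appending the new value instead of rewriting bytes it has already emitted.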
Depends on what the client is. For example OpenTelemetry Service itself is a client. It can send to an upstream Service or to a backend. In production we usually monitor OpenTelemetry Service for drops and need to be able to tell why the drops happen. "Server is overloaded" requires a different action from "Network latency is high" or "Network loses packets".
You are right that the client itself does not need to react differently to these conditions, but people who monitor the client sometimes do need to react differently.
Any thoughts regarding reliability of delivery / acks / etc?
In any case I like the proposal and the initiative. I haven't looked at the implementation but conceptually your description looks good to me.
Once the http/1.1 streaming session is finished, the server has the opportunity to send a response -- that could be used to indicate overload.
Being over tcp, you'll get a write error on the socket if the packets aren't acked by the destination. Would that be sufficient or are you looking for something more like application level acking of each individual span?
How about language-native structs? I realize this means the LightStep exporter will need to convert that data into protobuf fragments, but not all exporters produce protobufs.
Also, doesn't this add a dependency on protobuf from OpenTelemetry core? As opposed to e.g. just exporters that use protobuf.
Sounds good. I'd return HTTP 429 or 503 with optional Retry-After header.
How would the client interpret a socket error? One thing the client can reliably infer from this is that "possibly some data was not delivered". That is not strong enough for my needs. As a person who operates a monitoring system, I usually need stronger guarantees. I typically want a guarantee that no data is lost when there is a reasonable way to prevent the losses, or else I want to see counters/metrics about how much data was lost and why. This can be better achieved by tracking what is not delivered and re-sending it. For this to be feasible, the client needs more granular knowledge about what was and what was not delivered. If all we have is one error per stream, we will have to re-send the entire stream from the very beginning to guarantee that nothing is lost. This obviously is not practical; we cannot keep a duplicate of the entire stream around just in case the client needs to re-send it.
I am working on a general-purpose OpenTelemetry protocol, and my proposal uses acks and a limited pool of unacknowledged data that is re-sent in certain cases, providing stronger delivery guarantees (at the cost of possible duplicates).
The protocol you propose is more specialized, so this may be a non-issue for you (although it would be a no-go for any production environment I am involved with). I'd want to see this specifically called out (particularly, whether reliability is a goal or not). You may also be interested in reading the goals and requirements that I suggested earlier for the general-purpose OpenTelemetry protocol: open-telemetry/opentelemetry-specification#193
It does manual protobuf serialization so it can be done without requiring protobuf as a dependency.
And the plan is to have the serialization functions as a customization point so the SDK could be used with any format that supports the eager serialization approach.
@tigrannajaryan I think it would be possible to have a mode that supports stronger acking while still keeping many of the benefits by using bidirectional streaming where the collection point can send regular acks back and then the client only removes spans from the buffer once they've been acked.
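A minimal sketch of the client-side bookkeeping this would need, assuming the collection point sends cumulative acks by sequence number. The class `AckedSpanBuffer`, its methods, and the ack format are hypothetical; neither the LightStep tracer nor any OpenTelemetry protocol is being quoted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Hypothetical sketch: serialized spans stay in a bounded pool until the
// collection point acknowledges them, so they can be re-sent if needed.
class AckedSpanBuffer {
 public:
  explicit AckedSpanBuffer(size_t max_unacked) : max_unacked_(max_unacked) {
    assert(max_unacked > 0);
  }

  // Stores a span and returns its sequence number (starting at 1), or 0
  // if the pool is full and the span was dropped and counted instead.
  uint64_t Add(std::string serialized_span) {
    if (pending_.size() >= max_unacked_) {
      ++dropped_;
      return 0;
    }
    pending_.emplace_back(next_seq_, std::move(serialized_span));
    return next_seq_++;
  }

  // Cumulative ack: everything with sequence <= up_to has been received,
  // so only now is it safe to release those spans.
  void Ack(uint64_t up_to) {
    while (!pending_.empty() && pending_.front().first <= up_to)
      pending_.pop_front();
  }

  // Spans that would have to be re-sent after a connection failure.
  const std::deque<std::pair<uint64_t, std::string> >& Unacked() const {
    return pending_;
  }

  // Exposed so operators can monitor how much data was lost.
  size_t dropped() const { return dropped_; }

 private:
  size_t max_unacked_;
  uint64_t next_seq_ = 1;
  size_t dropped_ = 0;
  std::deque<std::pair<uint64_t, std::string> > pending_;
};
```

Bounding the pool keeps memory use predictable, and the drop counter gives the kind of loss metric asked for earlier in the thread. Re-sending unacked spans after a reconnect can produce duplicates, so the receiver would have to tolerate them.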
Chunked HTTP doesn't look like it supports bidirectional streaming (https://stackoverflow.com/a/28309215/4447365), but perhaps a WebSocket would work.