Introduces Profiling Data Model #237
Conversation
This looks like a great first step in profiling -> Stacktraces + correlation of them to other telemetry types.
Added a few non-blocking comments.
### Semantic Conventions

TODO: describe things like profile types and units
Since this is an OTEP - the only section you need here is describing that you plan to leverage Semantic Conventions to provide "open enums", i.e. a set of known types + enums that can expand over time.
✅ Updated the OTEP with some words about how we're planning to leverage Semantic Conventions, and provided a few examples (for profile type and units).
## Open questions

Client implementations are out of scope for this OTEP. At the time of writing this we do have a reference implementation in Go, as well as a working backend and collector, but they are not yet ready for production use. We are also working on a reference implementation in Java. We are looking for contributors to help us with other languages.
This looks like you have enough prototypes for us to evaluate for actual Specification inclusion.
I'd recommend that as soon as we start seeing approvals land on this OTEP, we open the data model specification markdown with just the data model components.
There are two main ways relationships between messages are represented (see the sketch below):
* by embedding a message into another message (standard protobuf way)
* by referencing a message by index (similar to how it's done in pprof)
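For illustration only, here is a minimal Go sketch (hypothetical type names, not the actual OTEP proto) of the same relationship expressed in both styles:

```go
// Hypothetical types illustrating the two relationship styles.
package relationshipsketch

// Location stands in for a message that stacktraces need to reference.
type Location struct {
	Address uint64
}

// Embedding: each stacktrace carries its own copies of the Location
// messages (the standard protobuf way).
type StacktraceEmbedded struct {
	Locations []Location
}

// Referencing by index: Locations live once in a profile-level lookup
// table, and stacktraces only store indices into it (pprof style).
type Profile struct {
	LocationTable []Location
	Stacktraces   []StacktraceIndexed
}

type StacktraceIndexed struct {
	LocationIndices []uint32
}
```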
I actually suspect that table-lookup is a technique we MAY want to apply to other signals for significant reduction in size. However, given the state of OTLP-Arrow, it could be a moot point.
IF we see successful implementations of profiling "processors" in the otel collector, I'd suggest we think about adding this capability to other Signals.
We need to be careful with this. In my experiments I have seen cases where, for compressed payloads, string lookup tables result in a regression (bigger compressed payloads).
Great otep! Very well reasoned.
Promising work so far. I am curious what input the go …
- function_index: 1
- line:
- function_index: 2
profile_types:
I understand this is a simplified example focused on showing how data is linked; however, adding at least `type_index` and `unit_index` (which I think aren't optional) would make it clearer to me. Without those, it's not clear what is measured.
✅ Updated the OTEP
Thank you for putting this together @petethepig
I did a quick pass, but would like to take a more thorough look one more time, so let's keep this open for a few days (also to make sure we have more eyes on this).
repeated Link links = 11;

// A lookup table of AttributeSets. Other messages refer to AttributeSets in this table by index. The first message must be an empty AttributeSet — this represents a null AttributeSet.
repeated AttributeSet attribute_sets = 12;
It is a bit confusing that we have both `attributes` and `attribute_set`. Maybe use a different name, or explain the difference explicitly?
// List of indices referring to AttributeSets in the Profile's attribute set table. Each attribute set corresponds to a Stacktrace in stacktrace_indices list. Length must match stacktrace_indices length. [Optional]
repeated uint32 attribute_set_indices = 12;

// List of values. Each value corresponds to a Stacktrace in stacktrace_indices list. Length must match stacktrace_indices length.
It would be useful to clarify what a value is.
}

// Represents a single profile type. It implicitly creates a connection between Stacktraces, Links, AttributeSets, values and timestamps. The connection is based on the order of the elements in the corresponding lists. This implicit connection creates an ephemeral structure called Sample. The length of reference lists must be the same. It is acceptable however for timestamps, links and attribute set lists to be empty. It is not acceptable for stacktrace or values lists to be empty.
message ProfileType {
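As a rough sketch of the implicit connection described above (hypothetical Go types and field names mirroring the message, not the reference implementation): element `i` of each parallel list belongs to the same ephemeral Sample, so samples are reconstructed by iterating the lists in lockstep, with empty optional lists treated as absent:

```go
package main

import "fmt"

// sampleList mirrors the parallel lists of the ProfileType message.
type sampleList struct {
	StacktraceIndices   []uint32
	AttributeSetIndices []uint32 // optional: may be empty
	Values              []int64
	Timestamps          []uint64 // optional: may be empty
}

// sample is the ephemeral structure formed by taking element i of each list.
type sample struct {
	StacktraceIndex   uint32
	AttributeSetIndex uint32
	Value             int64
	Timestamp         uint64
}

func reconstruct(sl sampleList) []sample {
	out := make([]sample, len(sl.StacktraceIndices))
	for i := range sl.StacktraceIndices {
		s := sample{StacktraceIndex: sl.StacktraceIndices[i], Value: sl.Values[i]}
		if len(sl.AttributeSetIndices) > 0 {
			s.AttributeSetIndex = sl.AttributeSetIndices[i]
		}
		if len(sl.Timestamps) > 0 {
			s.Timestamp = sl.Timestamps[i]
		}
		out[i] = s
	}
	return out
}

func main() {
	sl := sampleList{
		StacktraceIndices: []uint32{0, 1, 1},
		Values:            []int64{10, 3, 7},
	}
	fmt.Println(reconstruct(sl)) // [{0 0 10 0} {1 0 3 0} {1 0 7 0}]
}
```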
This essentially describes a list of samples, right? Would `SampleList` be a better name?
Yeah, I agree, ProfileType is somewhat confusing. I like `SampleList`. I'll update the proto and OTEP.
👍 Another possible name is just `Samples`.
Ended up going with `SampleList`. I like `Samples` more, but `Samples` is plural, less specific, and more ambiguous than `SampleList`. So I'm worried calling it `Samples` would be more confusing.
// The object this entry is loaded from. This can be a filename on
// disk for the main binary and shared libraries, or virtual
// abstractions like "[vdso]". Index into string table
uint32 filename_index = 4;
It would be great to show a benchmark that shows how much these and other string lookup tables save. If the savings are small it may be useful to just use the string values directly. We typically compress the OTLP payloads anyway which will very likely result in negligible savings. In my experiments in some cases I have seen GZIP/ZSTD compressed payloads increase because of using lookup tables since numeric varints compress worse than duplicate strings.
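As a sketch of how such a comparison could be set up (toy data and hypothetical names, not the OTEP benchmark; real results depend heavily on the payload), comparing gzip-compressed sizes of inline duplicate strings vs. a lookup table plus varint indices:

```go
// Illustrative only: measure compressed size of two encodings of the same data.
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
)

func gzipSize(data []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	names := []string{"runtime.mallocgc", "runtime.newobject", "main.handleRequest"}

	// (a) Inline strings: every sample repeats the function name.
	var inline bytes.Buffer
	for i := 0; i < 10000; i++ {
		inline.WriteString(names[i%len(names)])
	}

	// (b) Lookup table: names stored once, samples store varint indices.
	var table bytes.Buffer
	for _, n := range names {
		table.WriteString(n)
	}
	idx := make([]byte, binary.MaxVarintLen64)
	for i := 0; i < 10000; i++ {
		n := binary.PutUvarint(idx, uint64(i%len(names)))
		table.Write(idx[:n])
	}

	fmt.Println("inline strings, gzipped:", gzipSize(inline.Bytes()))
	fmt.Println("lookup table, gzipped:  ", gzipSize(table.Bytes()))
}
```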
SYMBOL_FIDELITY_UNSPECIFIED = 0;
SYMBOL_FIDELITY_FULL = 1;
Please add comments to explain these.
uint32 mapping_index = 1;
// The instruction address for this location, if available. It
// should be within [Mapping.memory_start...Mapping.memory_limit]
// for the corresponding mapping. A non-leaf address may be in the
// middle of a call instruction. It is up to display tools to find
// the beginning of the instruction if necessary.
uint64 address = 2;
At least in Go this ordering typically results in unused bytes in memory due to struct field alignment (on a 64bit platform we will have 4 unused bytes here). It helps to sort fields in the decreasing field size order.
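To illustrate the point, here is a hypothetical Go struct loosely mirroring the generated Location type (the `OtherIndex` field is made up): on 64-bit platforms the uint32-before-uint64 ordering pays for padding that a size-ordered layout avoids.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Field order as proposed: 4 bytes of padding after MappingIndex so that
// Address can be 8-byte aligned, plus 4 bytes of trailing padding.
type locationAsProposed struct {
	MappingIndex uint32
	Address      uint64
	OtherIndex   uint32 // hypothetical extra field
}

// Fields sorted in decreasing size order: no padding needed.
type locationReordered struct {
	Address      uint64
	MappingIndex uint32
	OtherIndex   uint32
}

func main() {
	fmt.Println(unsafe.Sizeof(locationAsProposed{})) // 24 on 64-bit platforms
	fmt.Println(unsafe.Sizeof(locationReordered{}))  // 16 on 64-bit platforms
}
```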
### Benchmarking

[Benchmarking results](https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit#gid=0) showed that `arrays` representation is the most efficient in terms of CPU utilization, memory consumption and size of the resulting protobuf payload. Some notable benchmark results are showcased below:
Could you elaborate on what the benchmark actually did and what CPU utilization and memory consumption are referring to? Did you measure serialization (pprof to protobuf), or were there also some deserialization benchmarks done?
// Source file containing the function. Index into string table
uint32 filename_index = 3;
// Line number in source file.
uint32 start_line = 4;
It would be helpful to clarify whether this points to the function definition, the call site, or is something that profilers are free to choose.
// The id of the corresponding profile.Function for this line.
uint32 function_index = 1;
// Line number in source code.
uint32 line = 2;
I understand shrinking many numbers from 64 bit to 32 bit, but I wanted to double-check this case. I've definitely seen auto-generated code get quite large, for example, but even 32-bit is large for "lines in a file" so it may be okay. Was there any discussion or analysis around this specifically? Maybe some compilers publish limits we can look at?
Are you concerned about 32 bits not being enough or being too large?
My concern is about potentially not being big enough for all languages. I've heard of programs in the millions of lines of code (always autogenerated code), and so I was curious if there was any discussion or analysis about this. Still, "millions" is not 4 billion, so to be clear I'm not saying a 32-bit line is a problem, just that we should be cautious about it. It's a nice win if it's defensible, and I think it likely is.
uint32 name_index = 1;
// Name of the function, as identified by the system.
// For instance, it can be a C++ mangled name. Index into string table
uint32 system_name_index = 2;
// Source file containing the function. Index into string table
uint32 filename_index = 3;
// Line number in source file.
uint32 start_line = 4;
Are all these fields mandatory? Can I omit, for example, `system_name_index`? Are 0 values valid indices? Should 0 values be invalid so that the fields can be optional? This is important if using a ProtoBuf implementation that does not support the notion of field "presence", where it is impossible to distinguish between the 0-value of a field and the absence of the field. Or do you use the fact that the string at index 0 of `Profile.string_table` is an empty string as a way to deal with this?
The zero-th element in the string table is always the empty string ("") to be able to encode nil strings.
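A minimal sketch of that convention (hypothetical helper, not the reference implementation): reserve index 0 for the empty string when building the table, so an index of 0 can double as "no string" without relying on protobuf field presence.

```go
package main

import "fmt"

type stringTable struct {
	indices map[string]uint32 // string -> index
	table   []string          // index -> string
}

func newStringTable() *stringTable {
	st := &stringTable{indices: map[string]uint32{}}
	st.add("") // reserve index 0 for the empty/nil string
	return st
}

func (st *stringTable) add(s string) uint32 {
	if idx, ok := st.indices[s]; ok {
		return idx
	}
	idx := uint32(len(st.table))
	st.indices[s] = idx
	st.table = append(st.table, s)
	return idx
}

func main() {
	st := newStringTable()
	fmt.Println(st.add("main.run")) // 1
	fmt.Println(st.add(""))         // 0: an absent system_name_index encodes as 0
}
```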
repeated int64 values = 13;

// List of timestamps. Each timestamp corresponds to a Stacktrace in stacktrace_indices list. Length must match stacktrace_indices length.
repeated uint64 timestamps = 14;
Are these nanoseconds since UNIX epoch? If so, a `fixed64` may be a better choice, since I think for typical current values the varint encoding is larger than the fixed 8 bytes and is more expensive to encode/decode.
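A quick sketch of the size argument: current Unix-epoch timestamps in nanoseconds are around 1.7e18 (above 2^60), so the base-128 varint encoding takes 9 bytes, while `fixed64` is always 8.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

func main() {
	ts := uint64(time.Now().UnixNano()) // ~1.7e18 at the time of writing
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, ts) // same base-128 varint scheme protobuf uses for uint64
	fmt.Printf("varint: %d bytes, fixed64: 8 bytes\n", n)
}
```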
fixed64 start_time_unix_nano = 2;

// end_time_unix_nano is the end time of the profile.
// Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
//
// This field is semantically required and it is expected that end_time >= start_time.
fixed64 end_time_unix_nano = 3;
What is the purpose of these times? Are sample timestamps expected to fall in this range?
Quick warning: this post is written in a semi-brainfried state. I may need to correct myself later. I'll also follow up with a separate post that outlines some concerns I have about the general design we've arrived at.
Oh, shiny, that's a super valuable PR and discussion. Thanks!! Anyhow; I'll read the thread you posted in more detail, and mull this over a bit. More in a future reply.
True, and I think that this will be key.
I don't know -
Fair, but this approach honestly gives me the creeps. I'll discuss this in the follow-on post.
We should definitely document that assumption. My reasoning for strongly preferring a "stream many messages continuously" over "accumulate large memory buffers to compress" is partially driven by trying to push work from the production machines on which the profiler runs to the backend, and by avoiding having to keep large buffers of events...
So this is a FaaS vendor that allows their customers to spin up specialized FaaS (essentially tiny ELFs, I presume) on their servers on the Edge. So if I understand correctly, when the FaaS is invoked, they create the mapping, link it to other areas in the address space, run the FaaS, and (I don't know based on what criteria) unmap again. The net result was thousands of mappings, and lots and lots of executable sections. So this isn't a common setup, but I don't think we should design a protocol that unnecessarily restricts the design space of people writing their code.
Well, in the current pprof design, you need the mapping information if you want to do any form of ex-post symbolization, too, as the pprof design does not place an executable identifier into the …
Hey all, ok, so the following post will be a bit longer, and I hope I don't upset people, but I think I have a number of concerns with the design that we've so far converged onto. The fact that we have converged on something that is sufficiently similar to So I'll outline my concerns here; forgive me if I am re-hashing things that were discussed previously, and also if this isn't the ideal place to document the concerns. This will be both lengthy and highly subjective; apologies for that.

### Philosophy section

#### Design philosophy for profiling tools

Profiling tools (if timestamps are included) allow the synthetic generation of fine-grained spans (see @felixge's great description in UC1 here). The upshot of this is that the ability to perform pretty high-frequency sampling on demand has high value; 10000Hz is better than 100Hz in such a scenario. While you won't always need this, you don't want to carelessly limit yourself unnecessarily. So this means that profiling tools should strive to do their jobs with the minimum number of instructions, data movement, etc. - the more lightweight the data collection is, the more profiling can be afforded given a certain budget, and the increased value does not taper off quickly. This is very different from e.g. metrics; you do not gain much benefit from collecting those at a higher frequency. If you make the collection of a stack trace 3x as expensive, you limit your max frequency to 1/3rd of what it could be, depriving users of value.

#### Design philosophy for a profiling wire format

In my view, a profiling wire format should be designed so as to enable the profiling code on the machines-to-be-profiled to be lightweight in the sense of "working in a slim CPU and memory budget". Whenever sensible, work should be pushed to the backend. The protocol should also be designed in a way that it can be easily implemented by both userspace runtime-specific profilers, and by eBPF-based kernelspace ones, without forcing one of the two into compromising their advantages.

### Concrete section

#### What's good about using eBPF vs. the old perf way of copying large chunks of the stack?

The big advantage of eBPF is that it allows cheap in-kernel aggregation of measurements (including profiling data). The old perf way of doing stack collections when no frame pointers were present was just dumping a large chunk of stack to disk, and then trying to recover the stack later on (this is a bad idea for obvious reasons). If your profiling can be done by the OS's timer tick just handing control to some code in kernel space that unwinds the stack in kernel space and aggregates data there without having to round-trip to userspace or copy a lot of data into userspace, you have a pretty lean operation. You walk the stack, you hash the relevant bits, you increment a kernel-level counter or alternatively concatenate a timestamp with the hash into a data structure that userspace can later read (ring buffer, eBPF map, whatnot). Such a construct is always preferable to having to keep the entire stacktrace around each time it happens, and having to copy it into userspace in its entirety to be processed there. Let's say we have 40 layers of stack; the difference between copying a 16-byte identifier or 40 stack frames isn't trivial.

#### Implications of stateless vs. stateful protocol

Early on in the discussion, we proposed a stateful protocol (e.g. a protocol where the individual machine is allowed to send just the ID of a stack trace vs. the entire stack trace, provided it has sent the entire stack trace previously).

The reasons for this design in our infrastructure were:

This means that a kernel-based / eBPF-based profiler can simply increment a counter, and be done with it; or send out an identifier and a timestamp. Both are very cheap, but also imply that you don't have to copy lots of data from kernel space to user space (where you also have to manage it, and then potentially gzip it etc.). If you move to a stateless protocol where you always have to send the entire stack trace, the profiler itself has to move the entire stack trace around, and also pay for the memory of the stack trace until it has been sent. This is fine if you're sampling at 20Hz - let's say the average trace has 40 frames, and representing a frame takes 32 bytes, a trace is ~1k; that is still 20-40 times more than you'd handle if you were just handling an ID. This will have immediate repercussions for the frequency at which you'll be able to sample. You will not only need to copy more data around, you'll also need to run some LZ over the data thereafter; you'll pay when you send the data out etc.

#### Implications of "file format" vs "streaming protocol"

File formats assume that the recipient is a disk, something that cannot or will not do any processing, and just stores the data for further processing in the future. Designing a file format means you expect the creator of the file to keep all the data around to write it in a coherent and self-contained manner. A streaming protocol can assume some amount of processing and memory capability on the receiving side, and can hence move some work from the writer of the data to the recipient of the data. If your recipient is smarter, you can be more forgetful; if your recipient is smart and aggregates data from many writers, each individual writer can be less reliable. Another aspect is that file formats tend to be not designed for a scenario where a machine tries to continuously send out the profiling events without accumulating many of them in memory. If we look at the existing OTLP Spec, my cursory reading is that the protobufs do not penalize you much for sending data in a too-fine-grained manner (it's not free, obviously, but it seems to be less problematic than using pprof for a handful of samples). A spec that is designed for "larger packages of data" forces the profiler to accumulate more events in memory, send them out more rarely, and perform de-duplication by LZ'ing over the data once more. This also has repercussions for the frequency at which you'll be able to sample.

#### Implications for kernel-based eBPF profilers

The current trajectory of the protocol design is highly disadvantageous to kernel-based eBPF profilers. Instead of being able to accumulate data in the kernel where it is cheap, it forces data to be

This may not matter if you're already in userspace, already in a separate process, and you're already copying a lot of data back and forth between different address spaces; but if you have successfully avoided wasteful copies like that, it's not great. It forces implementers into objectively worse implementations. At almost every decision, the protocol opts for "let's push any work that the backend may normally be expected to do into the profiler onto the production machine, because we are fearful of assuming the backend is more sophisticated than a dumb disk drive", leading to more computing cycles being spent on the production machines than necessary. In some sense, we're optimizing the entire protocol for backend simplicity, and we're doing it by pushing the cost of this onto the future user of the protocol. This feels backward to me; in my opinion, we should optimize for minimum cost for the future user of the protocol (as there will be many more users of profiling than backend implementors).

### Summary

I would clearly prefer a protocol design that doesn't force profilers to copy a lot of data around unnecessarily, and that prioritizes "can be implemented efficiently on the machine-to-be-profiled" and "uses cycles sparingly as to maximize the theoretical sampling frequency that can be supported by the protocol". I think by always optimizing for "dumb backend", we converged on pprof-with-some-modifications, as we were subjecting ourselves to the same design constraint ("recipient is a disk, not a server"). I am convinced it's the wrong design constraint.
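To make the "send an ID instead of the whole trace" idea concrete, here is a simplified userspace sketch (illustrative only; a real implementation would do this inside an eBPF program with a kernel-side map): hash the frame addresses into a fixed-size ID and count per ID, rather than copying every frame for every sample.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

type traceID uint64

// idOf folds the frame addresses of one stack trace into a fixed-size identifier.
func idOf(frames []uint64) traceID {
	h := fnv.New64a()
	var buf [8]byte
	for _, addr := range frames {
		binary.LittleEndian.PutUint64(buf[:], addr)
		h.Write(buf[:])
	}
	return traceID(h.Sum64())
}

func main() {
	counts := map[traceID]uint64{} // stands in for an eBPF map updated in kernel space
	trace := []uint64{0x401000, 0x402abc, 0x7f33d0}
	for i := 0; i < 3; i++ {
		counts[idOf(trace)]++ // per sample: hash + increment, no copy of all frames
	}
	fmt.Println(counts)
}
```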
Thanks for your writeup @thomasdullien, that was a very interesting read. One question I had while reading was whether or not it's possible for eBPF profilers to aggregate ~60s worth of profiling data (stack traces and their frames) in kernel memory? If yes, I don't understand why you'd have to copy this data to user space all the time? You could just deduplicate stack traces and frames in kernel and only send the unique data into user space every once in a while (~60s)?

Another thing I'm curious about is your perception on the cost of unwinding ~40 frames relative to copying ~1KiB between user space and kernel (assuming it's needed)? Unless you're unwinding with frame pointers, I'd expect the unwinding to be perhaps up to 90% of the combined costs. I'm asking because while I love the idea of building a very low overhead profiling protocol, I think we also need to keep the bigger overhead picture in mind during these discussions. Related: How much memory do eBPF profilers use for keeping unwinding tables in memory?

Edit: Unrelated to overhead, but also very important: Using a stateful protocol would make it very hard for the collector to support exporting profiling data in other formats, which is at odds with the design goals of OpenTelemetry, see below.
@thomasdullien you may want to add one more factor to your analysis: the Collector. It is often an intermediary between the producer of the telemetry and the backend. By design the Collector operates on self-contained messages. If a stateful protocol is used then the Collector receiver has to restore that state and turn the received data into self-contained pdata entries so that internally the processors can operate on them. When sending out in the exporter the Collector has to perform the opposite and encode the self-contained messages into the stateful protocol again to send to the next leg (typically backend, but can be another Collector). This is possible but adds to the processing cost.

We went through this with OTLP and Otel Arrow. OTLP operates with self-contained messages. Otel Arrow was proposed later and uses gRPC streaming and is stateful. I think a similar stateful, streaming protocol is possible to add in the future for profiles. If you want to go this route I would advise you to make a proposal for a streaming, stateful protocol for profiling data in the form of an OTEP with benchmarks showing what its benefits are and what the penalty (if any) for extra processing in the Collector is.
Good point, I'll read myself into it and then revert :)
#### Message `Stacktrace`

A stacktrace is a sequence of locations. Order of locations goes from callers to callees. Many stacktraces will point to the same locations. The link between stacktraces, attribute sets, links, values and timestamps is implicit and is based on the order of the elements in the corresponding tables in the ProfileType message.
I'm wondering if it's possible to link unique timestamps and attributes to a shared stacktrace? Some of the projects I work on at Microsoft have huge amounts of stacktraces, and we've found it very useful to represent a set of common stacktraces and reference them. Is this possible in the current design?
We do both stack aggregation without time (IE: flamegraphs) and we also show timelines (IE: Gantts) with threads and other processes on the system using this approach, which is the reason I'm asking whether both time and attributes could point to a common stacktrace (IE: Thread 123 in Process 23 had Stacktrace X at Time T).
It seems you can do this with stacktrace_indices, but the text here seems to imply the ordering is implicit between the stacktraces, attributes, links, values and timestamps. Is this a typo, and should stacktrace_indices, attribute_set_indices, etc. be called out here instead?
#### Message `Mapping`

Describes the mapping from a binary to its original source code. These are stored in a lookup table in a Profile. These are referenced by index from other messages.
Operating systems, like Windows, allow mappings to come and go. Typically, we collect mappings with a "load/unload" event with a timestamp. It seems like in this format, the linking of load/unload times against sample times to determine the right mapping for a given process must be done ahead of time, while writing the data.
#### `Sample` structure

Sample is an ephemeral structure. It is not explicitly represented as a protobuf message; instead, it is represented by the stacktraces, links, attribute sets, values and timestamps tables in the `ProfileType` message. The connection is based on the order of the elements in the corresponding tables. For example, the AttributeSet with index 1 corresponds to the Stacktrace located at index 1 in the stacktraces table, and the Value located at index 1 in the values table. Together they form a Sample.
We have cases where we profile events that do not have stacks, instead they have raw data to help support what is going on, such as a context switch, exception, critical error, etc. Is there a place in this format to put a sample with 1) raw binary data and 2) a way to link to the format of this raw binary data (IE: thread_id is at offset 8 in the data)?
Hey all, a longer discussion between @jhalliday, @brancz, @athre0z, me, and some others had ensued on Slack, and the consensus was that it'd be useful to keep this conversation/discussion archived:
This is the current state of the discussion. The really important thing I agree on is: We need to really define what computation the collector needs to perform on the data as it comes in. I also think we should replace the word "stateful" in our discussion with "self-contained messages", because it is more precise. As next steps, for the meeting on Thursday, I will provide some data on the implications of both the gRPC message size limit and the desire to be self-contained, because I see a lot of problems arising from precisely this interaction.
Joint work with @christos68k and @athre0z. As promised, a bit more data/revisiting of the discussion around self-contained messages vs. non-self-contained messages. The goal of this post is to document the exact trade-offs involved in a self-contained vs. a non-self-contained protocol design.

#### Bottom line up front (BLUF) - details below

Before we get into the meat of things, I noticed a small design issue with the current proposal:

#### How to represent two versions of the same Java (or other HLL code) in the current proposal?

The current proposal does not have an obvious way of distinguishing between two versions of the same HLL code. For native code, the mappings contain a build ID, and according to our discussions, this build ID should be a hash of some parts of the

#### The test machine and workload

I am worried about us designing a protocol that can't deal with a real-world server running a whole-system profiler, so for this test setup, we are running tests on an Altera ARM server with 80 cores. This generates a fair bit of data, but there are 224-core machines, so this isn't even near the top of the line, and we can expect a top-end server to generate 2.5x+ more data. Workload-wise, we tested running an ElasticSearch benchmark (esrally with

#### What we did for testing

We did three things:

#### Exceeding the default gRPC message size limit (possibly by 32x)

It's almost certain that the current design will exceed the gRPC message size limit. When considering uncompressed self-contained messages, our preliminary testing showed ~3 megs of uncompressed data for a 5-second interval on our 80-core machine running at approximately 60% load. This implies that a 1-minute interval may reach 60 megs, and if you factor in load and higher core counts, 128 megs may definitely be within reach. Caveat: This is our internal self-contained protocol, not the current proposal. I am still in the process of converting our own profiling data into the proposed format to get a better estimate for the average size of a stack trace for our workload in the proposed format, and expect to have that later today or tomorrow, but before even adding the traces, the size-per-trace is ~100 bytes, so there's about 100 bytes of overhead per trace without even the trace data.

#### Network traffic, compressed and uncompressed, empirical results

In our testing, the difference in network traffic after compression is a factor of approximately 2.3-2.4x -- for 100 megs of profiling traffic in the non-self-contained protocol, the self-contained protocol generates 230-240 megs after compression does its magic.

#### How much magic is the compressor doing here?

The compressor does a 25x reduction here: we're performing compression on 1.9 gigs of data to end up with 77 megs of traffic. FWIW, this will also be reflected on the backend: you'll need to process vastly more data; the ratio on both the backend and the individual hosts is about 7.3x.

#### Further reductions from separating out leaf frames

The above non-self-contained design hashes the entire stack trace. Most of the variation of the stack traces is in the leaf frame, so the above can be further boosted by separating the leaf frame out of the trace, and then hashing "everything but the leaf frame". This reduces the number of traces to be sent by a factor of ~0.67 in this workload. This means that for ~67 megs of profiling traffic, the self-contained protocol will send about 230-240 megs, meaning the likely network traffic overhead is about 3.5x post compression. The ratio of uncompressed data to process grows to about 10x (it's 7.3x before, divided by the 0.68, yielding about 10.7x).

Summary of the data:
#### Next step
Ok, I think I converted data from our example workload to pprof. I took a few shortcuts (e.g. skipped inlined frames, which accounts for about 10% of all frames), and I end up with an average cost-per-trace/cost-per-sample of 180-200 bytes. When sampling at 20Hz per core, we see somewhere between a 1:2 to a 1:4 reduction in traces vs. events, e.g. for 100 sampling events, you get between 25 and 50 different stack traces (often differing only near the leaves).

Note: This data does not attach any labels, metadata, etc. to any samples, so it may be underestimating the real size by a bit. I'd wager that 200 bytes will be near the norm for Java workloads, with things like heterogeneity of the workload and some parts of the code adding variance. I suspect it'll be rare to end up with more than 400 bytes per trace, unless someone begins to attach stuff like transaction IDs to samples.

This means our message size will be approximately as follows:
This means - as discussed in the meeting - it will be possible (though very unlikely) for a single message to reach 100+ megs on a large machine; it will be likely that we exceed 8-10 megabytes routinely on reasonably common machine configurations.

I think a really crucial thing to understand as next step is: What is the memory overhead from converting a gRPC proto into an in-memory representation? My big worry is that big messages of this type can easily lead to a lot of memory pressure for collectors, and my expectation of a good protocol design is that a single collector can easily service hundreds of machines-to-be-profiled (the protocol shown by Christos above easily dealt with hundreds to thousand+ nodes and tens of thousands of cores on IIRC 2-3 collector processes -- I'd have to dig up the details though).
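As a rough sanity check on this ballpark, using only the figures quoted above (~200 bytes per trace/sample, 20 Hz per core, 80 cores, 60-second messages, and very roughly applying the 1:2 to 1:4 trace-to-event reduction to the payload size):

$$
200\,\tfrac{\text{bytes}}{\text{sample}} \times 20\,\text{Hz} \times 80\,\text{cores} \times 60\,\text{s} \approx 19.2\,\text{MB uncompressed},\qquad \tfrac{19.2\,\text{MB}}{2\ldots4} \approx 5\text{–}10\,\text{MB per message}
$$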
@thomasdullien these numbers show the performance difference between stateless and stateful formats described in this doc, right?
Correct (we did some small changes to the benchmarked protocols, namely removing some fields we no longer use such as SourceIDs; I updated the document to reflect these changes).
@thomasdullien thanks for the analysis in your two comments above. I spent some time reviewing this today, and here are my summarized conclusions (I hope to discuss this in the SIG meeting starting now).
To be clear: I would take a 3.5x increase in efficiency on the wire very seriously when it comes to the stateful vs stateless discussions. So I wouldn't mind being convinced of it :).
The data used for the 2nd comment (the pprof stuff) are 1-minute intervals of profiling data in JSON format that I converted to the pprof format by hacking around in the existing benchmarking code. I have attached an example JSON file here (about 400k zstd-compressed for 1 minute of data). I have 1 hour worth of these files for the testing. I'll have to dig to find the go code I hacked to convert it, give me a day or two - it's been a few weeks.
Thanks @thomasdullien, this is great. About the code for the pprof stuff: Don't worry too much. You clarified in the SIG meeting that the pprof experiment was mostly for discussing the gRPC message size issue, rather than comparing stateful against stateless in general, and I'm fully convinced of that argument – we need to investigate this problem further. What I'm really looking forward to is seeing the results of terminating the stateful protocol every 60s (flushing the client caches) and comparing the data size this produces against the normal operations of the stateful protocol.
This is the second version of the Profiling Data Model OTEP. After [we've gotten feedback from the greater OTel community](#237) we went back to the drawing board and came up with a new version of the data model. The main difference between the two versions is that the new version is more similar to the original pprof format, which makes it easier to understand and implement. It also has better performance characteristics. We've also incorporated a lot of the feedback we've gotten on the first PR into this OTEP.

Some minor details about the data model are still being discussed and will be fleshed out in future OTEPs. We intend to finalize these details after doing experiments with early versions of working client + collector + backend implementations and getting feedback from the community. The goal of this OTEP is to provide a solid foundation for these experiments. So far we've done a number of things to validate it:

* we've written a new profiles proto described in this OTEP
* we've documented decisions made along the way in a [decision log](https://github.com/open-telemetry/opentelemetry-proto-profile/blob/main/opentelemetry/proto/profiles/v1/decision-log.md)
* we've done benchmarking to refine the data representation (see Benchmarking section in a [collector PR](petethepig/opentelemetry-collector#1))
* diff between original pprof and the new proto: [link](open-telemetry/opentelemetry-proto-profile@2cf711b...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2)

We're seeking feedback and hoping to get this approved.

---

For (a lot) more details, see:

* [OTel Profiling SIG Meeting Notes](https://docs.google.com/document/d/19UqPPPlGE83N37MhS93uRlxsP1_wGxQ33Qv6CDHaEp0/edit)

---------

Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Felix Geisendörfer <felix@felixge.de>
Co-authored-by: Reiley Yang <reyang@microsoft.com>
Closing since #239 got merged :)
OTel Profiling SIG has been working on this for a while and we're ready to present our Data Model. So far we've done a number of things to validate it:
We're seeking feedback and hoping to get this approved.
For (a lot) more details, see: