proposal: Support Timestamps and Labels for Individual Events #728
Some initial thoughts:
I need to look into this more though. |
I'm not a pprof maintainer, just someone interested in profiler data formats, hoping my 2c are useful:
|
If the changes are to be backward incompatible then it's not really pprof anymore per se and should probably be done as part of the OpenTelemetry profiling format work instead. |
@aalexand thanks for the feedback so far. Sorry for the slow response, I was traveling. @mhansen I'll reply to your feedback in my next comment.
Yeah. I could see this proposal become a lot simpler if we decide that the uncompressed size doesn't matter. In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?
I'll expand the example to include two sample types.
It'd be nice to be able to specify the time resolution. In my testing I saw ~20% reduction going from ns to us or us to ms. That being said, I'm not very attached to using the term "ticks" for this. Let me know if you have another idea here.
That'd be awesome, but I don't see a good way of doing this without breaking backwards compatibility. Based on your reply to @mhansen it seems like you wouldn't consider that to be an option for pprof? |
@mhansen thanks for your feedback.
+1
I think I'm with @aalexand. If we decide to completely break backwards compatibility, it should probably be done as a new OTel format.
That's definitely a direction worth exploring.
As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex. That being said, having a good representation that works well on the wire as well as in-memory would be nice. |
I would prefer not to special-case specific tag names, but rather have a tag unit type that denotes a timestamp and add support for that, similar to how memory and duration units are handled. The magnitude could be part of the unit to support different resolutions of the value.
On migrating this proto to call stacks that can be reused between different samples to minimize duplication: yeah, breaking backward compatibility is expensive and would force a lot of changes upon us that are difficult to justify. One idea I considered in the past that is roughly backward compatible is having a new
But this is not forward compatible (old pprof won't be able to read new profiles and even recognize them as new and refuse) and it's unclear (aka needs to be measured) how much actual profile size savings this would produce for compressed profiles. |
This is a fair point, I hadn't considered streaming protobuf decoders. And you could stream in and map to an alternative normalized/tree in-memory structure to save memory. Streaming protobuf decoders exist, but I think they're fairly niche? e.g. the default protocol buffer libraries don't have streaming support AFAIK. I expect most tool users would use these default libraries and thus be stuck inflating the entire proto into memory, and possibly OOM'ing. Probably a lot of existing pprof tools (including I'm not saying this should prevent us from moving forward (perhaps particularly for small profiles); just noting the challenges I see with big profiles without compression. |
Sounds good. I'll try to do some experiments with that when I get a chance.
Nice idea 👍. If we're going to explore this space of forwards incompatible solutions, we should definitely take a look at this approach, along with the tree stuff proposed by @mhansen, and perhaps also stack trace hashing as pioneered by prodfiler/elastic. |
I have a lot of thoughts on protobufs 😅. I agree that the standard tooling doesn't make it easy to implement efficient/streaming decoders. That being said, I think it's totally feasible to build them. In fact, we recently added a pproflite package to our Go profiler that provides streaming pprof decoding with zero allocations (only the creation of a new decoder allocates, but they are reusable). We use it to implement efficient profile delta computations. Of course an ideal uncompressed wire format would be efficient, simple and suitable for in-memory representation. But ... I think it's feasible to make some tradeoffs here. Especially for compatibility purposes.
Yeah, the increase in profile size is a challenge when adding timestamps. But given enough optimization work, I think it will be feasible. |
One note is that approaches like using a tree are incompatible in both directions while the "incremental stack encoding" approach I mentioned at least allows all current profiles to be transparently opened by the new code since num_keep_locations is unset and so is effectively zero which means reuse the whole stack. |
I think that just depends on how the decoder works? E.g. if we added a new |
That means that on the read path there is an if statement to take the old path or the new path. With the |
For now I'm not planning to continue pushing this proposal forward in its current form, so I'm closing this. More discussions are still happening as part of the OTel Profiling Meetings. |
Motivation
My team at Datadog is using pprof as our data collection format for profiling data (cpu, wall clock, lock contention, allocations, etc.) coming from non-JVM languages (JVM languages use JFR). This works very well for us, and we'd like to continue building on pprof. However, there are two use cases we are interested in that are currently difficult to implement efficiently using pprof.
This proposal outlines these use cases as well as potential changes to the profile.proto format that would support them. We believe that these use cases would also be useful to the OSS community. Additionally, the proposal should meet the signal correlation requirements of the Profiling Vision OTEP, which might allow pprof to be adopted by OpenTelemetry. Last, but not least, we hope that this proposal would ease the adoption of these features by the Go runtime, e.g. by adding timestamps to CPU profiles. I already implemented a prototype for this.
Also see: #457
UC 1: Thread Timelines (FlameChart)
We would like to break down the self-time of distributed tracing spans using per-thread (or goroutine) flame chart timelines. The data for this would be collected using wall clock sampling, tracing techniques, or a combination of both. This is similar to what fgtrace does, but would only show the threads relevant to the span:
In order to support this use case, pprof would need the ability to efficiently store timestamps, thread ids and span ids for the individual profiling events that are being recorded. Additionally the pprof UI might be extended to show this data, or the command line could be extended to target perfetto or other flame chart visualization tools.
UC 2: CPU Heatmaps (FlameScope)
Another use case we are interested in is the display of CPU Heatmaps, pioneered by FlameScope, which provide a more powerful way to understand application behavior.
This use case is simpler than the first use case and only requires timestamp information for individual CPU samples. Additionally the UI could add this visualization or the command line could output the data in a format understood by flamescope.
Example
Let's say we have a wallclock profile that contains 3 events:
Note: For the purpose of this example the stack trace is the same for all 3 events.
How could this data be encoded into pprof?
Using Labels (Today)
With the existing profile.proto, one may use labels to store the data like this (shown in pseudo pprof format):
Unfortunately this requires encoding the location_id list three times because the Sample message has no concept of referencing a stack trace by id, and each unique combination of labels requires its own Sample. In practice this leads to very large pprof files.
Using Breakdown (Proposed Change)
Using the proposed profile.proto change in this PR, the same data could be encoded as shown below. In practice, this should lead to much smaller pprof sizes.
Efficiency
This proposal can lead to a significant size reduction of uncompressed pprofs compared to using the existing label mechanism when adding timestamps and trace ids. Workloads with fewer unique stack traces and longer profiling periods benefit the most from this proposal. E.g. in the examples below the gains are from 1.4x to 3.2x, but it's possible to construct examples that see bigger or smaller improvements.
That being said, the benefits mostly disappear after gzip compression. E.g. the same examples see only 1.01x to 1.22x compression gains. This is perhaps not surprising given that the duplicated stack traces are easy targets for compression.
For more details see felixge/pprof-breakdown and raw results (spreadsheet).
Compatibility
The proposed change is fully backwards compatible, but not fully forwards compatible when it comes to labels. The new format puts event specific labels into the new Breakdown message. This means that profiles taking advantage of the new profile.proto Breakdown labels won't be completely accessible to old versions of the pprof tool or other tools consuming pprof information. Those tools could still render basic flame graphs, but the label information would be invisible to them.
Next Steps
We're looking for feedback from the pprof maintainers to understand if "Support Timestamps and Labels for Individual Events" is a compelling feature or if the pprof project would rather stay focused on storing pre-aggregated data. Our suggested changes to profile.proto should mostly be seen as a demonstration of the technical feasibility of implementing such a proposal, but we're not too attached to the implementation details and happy to adjust them as needed.
Perhaps the small gains on compressed pprofs are not sufficient justification for the added complexity, and a simpler proposal to standardize the usage of the existing label mechanism for the use cases outlined above would offer better tradeoffs. Alternatively more optimizations to this proposal such as delta encoding and other tweaks could be explored.
Additionally we'd like to get feedback from the OSS community and potentially OTel group to make sure the outlined use cases are clear and useful to others.