proposal: Support Timestamps and Labels for Individual Events #728

Closed
wants to merge 13 commits

Conversation


@felixge commented Oct 21, 2022

Motivation

My team at Datadog is using pprof as our data collection format for profiling data (cpu, wall clock, lock contention, allocations, etc.) coming from non-JVM languages (JVM languages use JFR). This works very well for us, and we'd like to continue building on pprof. However, there are two use cases we are interested in that are currently difficult to implement efficiently using pprof.

This proposal outlines these use cases as well as potential changes to the profile.proto format that would support them. We believe that these use cases would also be useful to the OSS community. Additionally, the proposal should meet the signal correlation requirements of the Profiling Vision OTEP, which might allow pprof to be adopted by OpenTelemetry.

Last, but not least, we hope that this proposal would ease the adoption of these features by the Go runtime, e.g. by adding timestamps to CPU profiles. I already implemented a prototype for this.

Also see: #457

UC 1: Thread Timelines (FlameChart)

We would like to break down the self-time of distributed tracing spans using per-thread (or goroutine) flame chart timelines. The data for this would be collected using wall clock sampling, tracing techniques, or a combination of both. This is similar to what fgtrace does, but would only show the threads relevant to the span:

[Figure: per-thread flame chart timeline, showing only the threads relevant to a span]

In order to support this use case, pprof would need the ability to efficiently store timestamps, thread ids and span ids for the individual profiling events that are being recorded. Additionally the pprof UI might be extended to show this data, or the command line could be extended to target perfetto or other flame chart visualization tools.

UC 2: CPU Heatmaps (FlameScope)

Another use case we are interested in is the display of CPU heatmaps, pioneered by FlameScope, which provide a more powerful way to understand application behavior.

[Figure: FlameScope-style CPU heatmap]

This use case is simpler than the first and only requires timestamp information for individual CPU samples. Additionally, the pprof UI could add this visualization, or the command line could output the data in a format understood by FlameScope.

Example

Let's say we have a wallclock profile that contains 3 events:

| time since start | duration | stack trace | trace id | span id | thread id |
| --- | --- | --- | --- | --- | --- |
| 10ms | 15ms | a;b;c;d;e;f;g;h | 1 | 1 | 1 |
| 25ms | 5ms | a;b;c;d;e;f;g;h | 1 | 1 | 1 |
| 40ms | 20ms | a;b;c;d;e;f;g;h | 2 | 2 | 2 |

Note: For the purpose of this example the stack trace is the same for all 3 events.

How could this data be encoded into pprof?

Using Labels (Today)

With the existing profile.proto, one may use labels to store the data like this (shown in pseudo pprof format):

sample_type: [walltime/nanoseconds]

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 1500000000
  label: {time_ms=10}
  label: {trace_id=1}
  label: {span_id=1}
  label: {thread_id=1}
}

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 500000000
  label: {time_ms=25}
  label: {trace_id=1}
  label: {span_id=1}
  label: {thread_id=1}
}

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 2000000000
  label: {time_ms=40}
  label: {trace_id=2}
  label: {span_id=2}
  label: {thread_id=2}
}

Unfortunately, this requires encoding the location_id list three times because the Sample message has no concept of referencing a stack trace by id, and each unique combination of labels requires its own Sample. In practice this leads to very large pprof files.
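
For reference, building such a sample with today's format might look roughly like this in Go, using the github.com/google/pprof/profile package. This is a minimal sketch: a single synthetic frame stands in for the full a..h stack, and the label keys simply mirror the pseudo format above rather than any standardized names.

package main

import (
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	fn := &profile.Function{ID: 1, Name: "a"}
	loc := &profile.Location{ID: 1, Line: []profile.Line{{Function: fn, Line: 1}}}

	p := &profile.Profile{
		SampleType: []*profile.ValueType{{Type: "walltime", Unit: "nanoseconds"}},
		Function:   []*profile.Function{fn},
		Location:   []*profile.Location{loc},
		Sample: []*profile.Sample{{
			Location: []*profile.Location{loc},
			Value:    []int64{1500000000},
			// One numeric label per event attribute, as in the first sample above.
			NumLabel: map[string][]int64{
				"time_ms":   {10},
				"trace_id":  {1},
				"span_id":   {1},
				"thread_id": {1},
			},
		}},
	}
	// Write emits the gzip-compressed pprof wire format.
	if err := p.Write(os.Stdout); err != nil {
		panic(err)
	}
}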

Using Breakdown (Proposed Change)

Using the proposed profile.proto change in this PR, the same data could be encoded like shown below. In practice, this should lead to much smaller pprof sizes.

sample_type: [walltime/nanoseconds]
tick_unit: ms

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 3000000000
  breakdown: {
    tick: 10
    tick: 25
    tick: 40
    value: 15
    value: 5
    value: 20
    label_set_id: 1
    label_set_id: 1
    label_set_id: 2
  }
}

labelSet {
  id: 1
  label: {trace_id=1}
  label: {span_id=1}
  label: {thread_id=1}
}

labelSet {
  id: 2
  label: {trace_id=2}
  label: {span_id=2}
  label: {thread_id=2}
}
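
In this encoding, tick, value and label_set_id act as parallel arrays, one entry per event. A consumer could expand them back into per-event records roughly like the sketch below (the Go types are hypothetical mirrors of the pseudo format above, not the actual code in this PR):

package main

import "fmt"

// Breakdown mirrors the proposed per-event arrays: one entry per event.
type Breakdown struct {
	Tick       []int64  // event start times, in Profile.tick_unit
	Value      []int64  // per-event values
	LabelSetID []uint64 // references into the labelSet table
}

// Sample is reduced to the fields relevant for this sketch.
type Sample struct {
	Value     int64
	Breakdown Breakdown
}

// Event is one expanded profiling event.
type Event struct {
	Tick, Value int64
	LabelSetID  uint64
}

// expand turns the parallel arrays of a Breakdown into individual events.
func expand(s Sample) []Event {
	events := make([]Event, 0, len(s.Breakdown.Tick))
	for i, tick := range s.Breakdown.Tick {
		events = append(events, Event{
			Tick:       tick,
			Value:      s.Breakdown.Value[i],
			LabelSetID: s.Breakdown.LabelSetID[i],
		})
	}
	return events
}

func main() {
	s := Sample{
		Value: 3000000000,
		Breakdown: Breakdown{
			Tick:       []int64{10, 25, 40},
			Value:      []int64{15, 5, 20},
			LabelSetID: []uint64{1, 1, 2},
		},
	}
	for _, e := range expand(s) {
		fmt.Printf("t=%d dur=%d labelSet=%d\n", e.Tick, e.Value, e.LabelSetID)
	}
}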

Efficiency

This proposal can lead to a significant size reduction of uncompressed pprofs compared to using the existing label mechanism when adding timestamps and trace ids. Workloads with fewer unique stack traces and longer profiling periods gain the most from this proposal. E.g. in the examples below the gains range from 1.4x to 3.2x, but it's possible to construct examples that see bigger or smaller improvements.

That being said, the benefits mostly disappear after gzip compression. E.g. the same examples see only 1.01x to 1.22x gains after compression. This is perhaps not surprising, given that the duplicated stack traces are easy targets for compression.
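
(For anyone who wants to reproduce rough numbers on their own profiles: the profile package can emit both encodings, so the comparison boils down to something like the sketch below.)

package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

// sizes reports the uncompressed and gzip-compressed encoded sizes of a profile.
func sizes(p *profile.Profile) (uncompressed, compressed int, err error) {
	var raw, gz bytes.Buffer
	if err := p.WriteUncompressed(&raw); err != nil {
		return 0, 0, err
	}
	if err := p.Write(&gz); err != nil { // Write gzips its output
		return 0, 0, err
	}
	return raw.Len(), gz.Len(), nil
}

func main() {
	p, err := profile.Parse(os.Stdin) // pipe a pprof file in on stdin
	if err != nil {
		panic(err)
	}
	u, c, err := sizes(p)
	if err != nil {
		panic(err)
	}
	fmt.Printf("uncompressed=%d bytes gzip=%d bytes\n", u, c)
}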

For more details see felixge/pprof-breakdown and raw results (spreadsheet).

[Figure: uncompressed vs. gzip-compressed size comparison across the example workloads]

Compatibility

The proposed change is fully backwards compatible, but not fully forwards compatible when it comes to labels. The new format puts event-specific labels into the new Breakdown message. This means that profiles taking advantage of the new Breakdown labels won't be completely accessible to old versions of the pprof tool or to other tools consuming pprof data: those tools could still render basic flame graphs, but the label information would be invisible to them.

Next Steps

We're looking for feedback from the pprof maintainers to understand whether "Support Timestamps and Labels for Individual Events" is a compelling feature, or whether the pprof project would rather stay focused on storing pre-aggregated data. Our suggested changes to profile.proto should mostly be seen as a demonstration of the technical feasibility of such a proposal; we're not too attached to the implementation details and are happy to adjust them as needed.

Perhaps the small gains on compressed pprofs are not sufficient justification for the added complexity, and a simpler proposal that standardizes the usage of the existing label mechanism for the use cases outlined above would offer better tradeoffs. Alternatively, further optimizations to this proposal, such as delta encoding and other tweaks, could be explored.

Additionally, we'd like to get feedback from the OSS community and potentially the OTel group to make sure the outlined use cases are clear and useful to others.

@aalexand
Collaborator

aalexand commented Nov 3, 2022

Some initial thoughts:

  • The change is fairly large. Changes to profile.proto always immediately make me think about what it would take to make sure the existing tooling (pprof et cetera) supports it and this change is touching some of the fundamentals of the format.
  • The example uses a single sample type. It would help if the example used 2 sample types to illustrate how the breakdown is represented in that case.
  • The change introduces the notion of "ticks" which is an entirely new concept. I wonder if we could avoid that somehow. I am not opposed to discussing how to make timestamps better represented in profile.proto but going as far as adding a new concept into the schema fields gives me a bit of a pause.
  • Instead of reusing label sets I wonder if normalizing out the stack would make things simpler, similar to this proto.

I need to look into this more though.

@mhansen
Contributor

mhansen commented Nov 11, 2022

I'm not a pprof maintainer, just someone interested in profiler data formats, hoping my 2c are useful:

  • I think there's a big opportunity for pprof to expand into the 'timestamped profiles' space.
  • Pprof is probably the best-positioned player to do so. There aren't many other popular formats doing timestamped profiles: perfetto, chrome profiler, firefox profiler, linux perf.data, speedscope, a few other vendor-specific ones. pprof has the biggest cross-lang ecosystem.
  • Given this existing ecosystem, backwards compatibility vs evolving the protocol is the key tradeoff, and my natural reaction is to try really hard to avoid backwards incompatibility. Or at the very least, offer backwards adapters.
  • Timestamped profiles are a new feature, the timestamps not visible in old viewers. Perhaps this is a natural time to evolve the profile representation in a backwards-incompatible way -- a "v2"? A "big carrot" feature like timestamps would probably encourage uptake of the new version across the ecosystem.
  • If we decide to evolve the profile in a backwards-incompatible way, I think it'd be a big win for in-memory size to represent the stacks as a tree, with stacks pointing to their parent stack, rather than repeated frame-lists (I think this is the normalization Alexey is referring to above). Firefox Profiler also represents stacks as trees, enabling them to fit longer profiles within a browser window's RAM. The "compression" benefits are better with longer stacks, which I see a lot in Java code! (See the sketch below this list.)
  • If we think gzip is good enough that we can rely on it, then we can be more backwards-compatible and a bit more redundant. However: I think gzip for transferring and storage is fine, but the profile is always eventually decompressed for reading, and holding profiles in RAM is limiting.
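
A rough sketch of the tree representation mentioned above (the Go types are hypothetical; neither pprof nor Firefox Profiler actually defines these): each unique stack stores only its leaf frame plus a reference to its caller's stack, so shared prefixes are never repeated.

package stacktree

// stackNode represents one unique stack: its leaf frame plus the stack of its
// caller. Storing stacks this way deduplicates shared prefixes across samples.
type stackNode struct {
	LocationID uint64 // leaf frame of this stack
	Parent     int32  // index of the caller's stackNode, or -1 for a root
}

// sample then references a stack by index instead of repeating a frame list.
type sample struct {
	StackIndex int32
	Value      int64
}

// unwind reconstructs the full frame list (leaf first) for a given stack index.
func unwind(nodes []stackNode, idx int32) []uint64 {
	var frames []uint64
	for i := idx; i >= 0; i = nodes[i].Parent {
		frames = append(frames, nodes[i].LocationID)
	}
	return frames
}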

@aalexand
Collaborator

If the changes are to be backward incompatible then it's not really pprof anymore per se and should probably be done as part of the OpenTelemetry profiling format work instead.

@felixge
Author

felixge commented Nov 17, 2022

@aalexand thanks for the feedback so far. Sorry for the slow response, I was traveling. @mhansen I'll reply to your feedback in my next comment.

The change is fairly large. Changes to profile.proto always immediately make me think about what it would take to make sure the existing tooling (pprof et cetera) supports it and this change is touching some of the fundamentals of the format.

Yeah. I could see this proposal becoming a lot simpler if we decide that the uncompressed size doesn't matter. In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?

The example uses a single sample type. It would help if the example would use 2 sample types to illustrate how the breakdown is represented in that case.

I'll expand the example to include two sample types.

The change introduces the notion of "ticks" which is an entirely new concept. I wonder if we could avoid that somehow. I am not opposed to discussing how to make timestamps better represented in profile.proto but going as far as adding a new concept into the schema fields gives me a bit of a pause.

It'd be nice to be able to specify the time resolution. In my testing I saw ~20% reduction going from ns to us or us to ms. That being said, I'm not very attached to using the term "ticks" for this. Let me know if you have another idea here.

Instead of reusing label sets I wonder if normalizing out the stack would make things simpler, similar to this proto.

That'd be awesome, but I don't see a good way of doing this without breaking backwards compatibility. Based on your reply to @mhansen it seems like you wouldn't consider that to be an option for pprof?

@felixge
Author

felixge commented Nov 17, 2022

@mhansen thanks for your feedback.

  • Given this existing ecosystem, backwards compatibility vs evolving the protocol is the key tradeoff, and my natural reaction is to try really hard to avoid backwards incompatibility. Or at the very least, offer backwards adapters.

+1

  • Timestamped profiles are a new feature, the timestamps not visible in old viewers. Perhaps this is a natural time to evolve the profile representation in a backwards-incompatible way -- a "v2"? A "big carrot" feature like timestamps would probably encourage uptake of the new version across the ecosystem.

I think I'm with @aalexand. If we decide to completely break backwards compatibility, it should probably be done as a new OTel format.

  • If we decide to evolve the profile in a backwards-incompatible way, I think it'd be a big win for in-memory size to represent the stacks as a tree, with stacks pointing to their parent stack, rather than repeated frame-lists (I think this is the normalization Alexey is referring to above). Firefox Profiler also represents stacks as trees, enabling them to fit longer profiles within a browser window's RAM. The "compression" benefits are better with longer stacks, which I see a lot in Java code!

That's definitely a direction worth exploring.

  • If we think gzip is good enough that we can rely on it, then we can be more backwards-compatible and a bit more redundant. However: I think gzip for transferring and storage is fine, but the profile is always eventually decompressed for reading, and holding profiles in RAM is limiting.

As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex. That being said, having a good representation that works well on the wire as well as in-memory would be nice.

@aalexand
Collaborator

In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?

I would prefer not to special case specific tag names, but rather have a tag unit type that denotes a timestamp and have support for that. Similar to how memory and duration units are handled. The magnitude could be a part of the unit to support the different resolution of the value.
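
In pprof's Go representation, the unit of a numeric label lives in Sample.NumUnit, so such a convention might look roughly like the sketch below. The label key "event_time" and the unit string "milliseconds_since_profile_start" are made-up placeholders for illustration, not standardized values.

package timestamps

import "github.com/google/pprof/profile"

// timestampedSample sketches the idea of marking a numeric label as a timestamp
// via its unit rather than via a special-cased key, with the magnitude (ms, us,
// ns) encoded in the unit string.
func timestampedSample(loc *profile.Location) *profile.Sample {
	return &profile.Sample{
		Location: []*profile.Location{loc},
		Value:    []int64{1500000000},
		NumLabel: map[string][]int64{"event_time": {10}},
		NumUnit:  map[string][]string{"event_time": {"milliseconds_since_profile_start"}},
	}
}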

On migrating to this proto to get call stacks that can be reused between different samples to minimize duplication - yeah, breaking backward compatibility is expensive and would force a lot of changes upon us that are difficult to justify.

One idea I considered in the past that is roughly backward compatible is having a new num_keep_locations field in the Sample message like

message Sample {
  // The ids recorded here correspond to a Profile.location.id.
  // The leaf is at location_id[0].
  repeated uint64 location_id = 1;

  // Number of location IDs to keep from the previous sample. This is an
  // optimization that allows incremental encoding of the sample stack.
  // For example, if the stack is exactly the same as the previous one because
  // only the labels are different, then all of the location IDs can be reused
  // and so num_keep_locations could be set to the length of the location_id
  // field in the previous sample and the location_id field in the current sample
  // would be empty.
  uint64 num_keep_locations = 4;

  // The type and unit of each value is defined by the corresponding
  // entry in Profile.sample_type. All samples must have the same
  // number of values, the same as the length of Profile.sample_type.
  // When aggregating multiple samples into a single sample, the
  // result has a list of values that is the element-wise sum of the
  // lists of the originals.
  repeated int64 value = 2;

  // label includes additional context for this sample. It can include
  // things like a thread id, allocation size, etc
  repeated Label label = 3;
}

But this is not forward compatible (old pprof won't be able to read new profiles, nor even recognize them as new and refuse them) and it's unclear (aka needs to be measured) how much actual profile size savings this would produce for compressed profiles.
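
To illustrate how a reader might handle this, here is a sketch of the expansion step (hypothetical Go types, interpreting "previous sample" as the previously reconstructed stack). Existing profiles leave num_keep_locations at zero and therefore decode exactly as before:

package keeplocations

// decodedSample is a hypothetical in-memory form of the Sample message above,
// reduced to the two fields relevant for stack reconstruction.
type decodedSample struct {
	LocationID       []uint64
	NumKeepLocations uint64
}

// expandStacks rebuilds each sample's full stack by copying the leading
// NumKeepLocations entries of the previous (reconstructed) stack and then
// appending the sample's own location IDs.
func expandStacks(samples []decodedSample) [][]uint64 {
	var prev []uint64
	stacks := make([][]uint64, len(samples))
	for i, s := range samples {
		keep := int(s.NumKeepLocations)
		if keep > len(prev) {
			keep = len(prev) // malformed input; clamp rather than panic in this sketch
		}
		stack := make([]uint64, 0, keep+len(s.LocationID))
		stack = append(stack, prev[:keep]...)
		stack = append(stack, s.LocationID...)
		stacks[i] = stack
		prev = stack
	}
	return stacks
}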

@mhansen
Contributor

mhansen commented Nov 19, 2022

As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex.

This is a fair point, I hadn't considered streaming protobuf decoders. And you could stream in and map to an alternative normalized/tree in-memory structure to save memory.

Streaming protobuf decoders exist, but I think they're fairly niche? e.g. the default protocol buffer libraries don't have streaming support AFAIK. I expect most tool users would use these default libraries and thus be stuck inflating the entire proto into memory, and possibly OOM'ing.

Probably a lot of existing pprof tools (including go tool pprof) use these 'non-streaming' parsers. If we make a format that is so large that it strains memory to read it all in, I think that might limit uptake of the new features, or limit it to short timelines. The Protocol Buffers dev guide says that "Protocol Buffers are not designed to handle large messages". I wonder how large in-practice time-series pprof profiles would be, per second? 1GB would be OK for laptops (but probably OOM many shared webservers), and 10GB would strain laptops. Perhaps if you expect big profiles from this, we might have to update/normalize in-memory representation of go tool pprof at least, which even for that one tool is a nontrivial job (given the pprof go struct is public).

I'm not saying this should prevent us from moving forward (perhaps particularly for small profiles); just noting the challenges I see with big profiles without compression.

@felixge
Author

felixge commented Nov 19, 2022

In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?

I would prefer not to special case specific tag names, but rather have a tag unit type that denotes a timestamp and have support for that. Similar to how memory and duration units are handled. The magnitude could be a part of the unit to support the different resolution of the value.

Sounds good. I'll try to do some experiments with that when I get a chance.

On migrating to this proto to get call stacks that can be reused between different samples to minimize duplication - yeah, breaking backward compatibility is expensive and would force a lot of changes upon us that are difficult to justify.

One idea I considered in the past that is roughly backward compatible is having a new num_keep_locations field in the Sample message like

message Sample {
  // The ids recorded here correspond to a Profile.location.id.
  // The leaf is at location_id[0].
  repeated uint64 location_id = 1;

  // Number of location IDs to keep from the previous sample. This is an
  // optimization that allows incremental encoding of the sample stack.
  // For example, if the stack is exactly the same as the previous one because
  // only the labels are different, then all of the location IDs can be reused
  // and so num_keep_locations could be set to the length of the location_id
  // field in the previous sample and the location_id field in the current sample
  // would be empty.
  uint64 num_keep_locations = 4;

  // The type and unit of each value is defined by the corresponding
  // entry in Profile.sample_type. All samples must have the same
  // number of values, the same as the length of Profile.sample_type.
  // When aggregating multiple samples into a single sample, the
  // result has a list of values that is the element-wise sum of the
  // lists of the originals.
  repeated int64 value = 2;

  // label includes additional context for this sample. It can include
  // things like a thread id, allocation size, etc
  repeated Label label = 3;
}

But this is not forward compatible (old pprof won't be able to read new profiles, nor even recognize them as new and refuse them) and it's unclear (aka needs to be measured) how much actual profile size savings this would produce for compressed profiles.

Nice idea 👍. If we're going to explore this space of forwards incompatible solutions, we should definitely take a look at this approach, along with the tree stuff proposed by @mhansen, and perhaps also stack trace hashing as pioneered by prodfiler/elastic.

@felixge
Author

felixge commented Nov 19, 2022

As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex.

This is a fair point, I hadn't considered streaming protobuf decoders. And you could stream in and map to an alternative normalized/tree in-memory structure to save memory.

Streaming protobuf decoders exist, but I think they're fairly niche? e.g. the default protocol buffer libraries don't have streaming support AFAIK. I expect most tool users would use these default libraries and thus be stuck inflating the entire proto into memory, and possibly OOM'ing.

Probably a lot of existing pprof tools (including go tool pprof) use these 'non-streaming' parsers. If we make a format that is so large that it strains memory to read it all in, I think that might limit uptake of the new features, or limit it to short timelines. The Protocol Buffers dev guide says that "Protocol Buffers are not designed to handle large messages".

I have a lot of thoughts on protobufs 😅. I agree that the standard tooling doesn't make it easy to implement efficient/streaming decoders. That being said, I think it's totally feasible to build them. In fact, we recently added a pproflite package to our Go profiler that provides streaming pprof decoding with zero allocations (only the creation of a new decoder allocates, but they are reusable). We use it to implement efficient profile delta computations.

Of course an ideal uncompressed wire format would be efficient, simple and suitable for in-memory representation. But ... I think it's feasible to make some tradeoffs here. Especially for compatibility purposes.

I wonder how large in-practice time-series pprof profiles would be, per second? 1GB would be OK for laptops (but probably OOM many shared webservers), and 10GB would strain laptops. Perhaps if you expect big profiles from this, we might have to update/normalize in-memory representation of go tool pprof at least, which even for that one tool is a nontrivial job (given the pprof go struct is public).

I'm not saying this should prevent us from moving forward (perhaps particularly for small profiles); just noting the challenges I see with big profiles without compression.

Yeah, the increase in profile size is a challenge when adding timestamps. But given enough optimization work, I think it will be feasible.

@aalexand
Collaborator

Nice idea 👍. If we're going to explore this space of forwards incompatible solutions, we should definitely take a look at this approach, along with the tree stuff proposed by @mhansen, and perhaps also stack trace hashing as pioneered by prodfiler/elastic.

One note is that approaches like using a tree are incompatible in both directions, while the "incremental stack encoding" approach I mentioned at least allows all current profiles to be transparently opened by the new code, since num_keep_locations is unset and so is effectively zero, which means the sample's own location_id list carries the whole stack.

@felixge
Author

felixge commented Nov 19, 2022

One note is that approaches like using a tree are incompatible in both directions, while the "incremental stack encoding" approach I mentioned at least allows all current profiles to be transparently opened by the new code, since num_keep_locations is unset and so is effectively zero, which means the sample's own location_id list carries the whole stack.

I think that just depends on how the decoder works? E.g. if we added a new Sample.stack_id field that points into a stack tree, new versions of pprof would recognize that this field is present and ignore the location list. That would be backwards compatible IMO. Maybe I'm missing something?

@aalexand
Collaborator

That means that on the read path there is an if statement to take the old path or the new path. With the num_keep_locations approach there is no if statement.

@felixge
Author

felixge commented Mar 5, 2023

For now I'm not planning to continue pushing this proposal forward in its current form, so I'm closing this.

More discussions are still happening as part of the OTel Profiling Meetings.
