proposal: Support Timestamps and Labels for Individual Events #728

Closed
wants to merge 13 commits

Conversation


@felixge commented Oct 21, 2022

Motivation

My team at Datadog is using pprof as our data collection format for profiling data (cpu, wall clock, lock contention, allocations, etc.) coming from non-JVM languages (JVM languages use JFR). This works very well for us, and we'd like to continue building on pprof. However, there are two use cases we are interested in that are currently difficult to implement efficiently using pprof.

This proposal outlines these use cases as well as potential changes to the profile.proto format that would support them. We believe that these use cases would also be useful to the OSS community. Additionally, the proposal should meet the signal correlation requirements of the Profiling Vision OTEP, which might allow pprof to be adopted by OpenTelemetry.

Last, but not least, we hope that this proposal would ease the adoption of these features by the Go runtime, e.g. by adding timestamps to CPU profiles. I already implemented a prototype for this.

Also see: #457

UC 1: Thread Timelines (FlameChart)

We would like to break down the self-time of distributed tracing spans using per-thread (or goroutine) flame chart timelines. The data for this would be collected using wall clock sampling, tracing techniques, or a combination of both. This is similar to what fgtrace does, but would only show the threads relevant to the span:

[Figure: per-thread flame chart timeline, showing only the threads relevant to a span]

In order to support this use case, pprof would need the ability to efficiently store timestamps, thread ids and span ids for the individual profiling events that are being recorded. Additionally the pprof UI might be extended to show this data, or the command line could be extended to target perfetto or other flame chart visualization tools.

UC 2: CPU Heatmaps (FlameScope)

Another use case we are interested in is the display of CPU heatmaps, pioneered by FlameScope, which provide a more powerful way to understand application behavior.

[Figure: FlameScope-style CPU heatmap]

This use case is simpler than the first and only requires timestamp information for individual CPU samples. Additionally, the pprof UI could add this visualization, or the command line could output the data in a format understood by FlameScope.

Example

Let's say we have a wallclock profile that contains 3 events:

| time since start | duration | stack trace | trace id | span id | thread id |
| --- | --- | --- | --- | --- | --- |
| 10ms | 15ms | a;b;c;d;e;f;g;h | 1 | 1 | 1 |
| 25ms | 5ms | a;b;c;d;e;f;g;h | 1 | 1 | 1 |
| 40ms | 20ms | a;b;c;d;e;f;g;h | 2 | 2 | 2 |

Note: For the purpose of this example the stack trace is the same for all 3 events.

How could this data be encoded into pprof?

Using Labels (Today)

With the existing profile.proto, one may use labels to store the data like this (shown in pseudo pprof format):

sample_type: [walltime/nanoseconds]

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 1500000000
  label: {time_ms=10}
  label: {trace_id=1}
  label: {span_id=1}
  label: {thread_id=1}
}

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 500000000
  label: {time_ms=25}
  label: {trace_id=1}
  label: {span_id=1}
  label: {thread_id=1}
}

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 2000000000
  label: {time_ms=40}
  label: {trace_id=2}
  label: {span_id=2}
  label: {thread_id=2}
}

Unfortunately, this requires encoding the location_id list three times because the Sample message has no concept of referencing a stack trace by id, and each unique combination of labels requires its own Sample. In practice this leads to very large pprof files.
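
For reference, building such a sample with today's format might look roughly like this in Go, using the github.com/google/pprof/profile package. This is a minimal sketch: a single synthetic frame stands in for the full a..h stack, and the label keys simply mirror the pseudo format above rather than any standardized names.

package main

import (
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	fn := &profile.Function{ID: 1, Name: "a"}
	loc := &profile.Location{ID: 1, Line: []profile.Line{{Function: fn, Line: 1}}}

	p := &profile.Profile{
		SampleType: []*profile.ValueType{{Type: "walltime", Unit: "nanoseconds"}},
		Function:   []*profile.Function{fn},
		Location:   []*profile.Location{loc},
		Sample: []*profile.Sample{{
			Location: []*profile.Location{loc},
			Value:    []int64{1500000000},
			// One numeric label per event attribute, as in the first sample above.
			NumLabel: map[string][]int64{
				"time_ms":   {10},
				"trace_id":  {1},
				"span_id":   {1},
				"thread_id": {1},
			},
		}},
	}
	// Write emits the gzip-compressed pprof wire format.
	if err := p.Write(os.Stdout); err != nil {
		panic(err)
	}
}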

Using Breakdown (Proposed Change)

Using the proposed profile.proto change in this PR, the same data could be encoded like shown below. In practice, this should lead to much smaller pprof sizes.

sample_type: [walltime/nanoseconds]
tick_unit: ms

sample {
  location_id: a
  location_id: b
  location_id: c
  location_id: d
  location_id: e
  location_id: f
  location_id: g
  location_id: h
  value: 3000000000
  breakdown: {
    tick: 10
    tick: 25
    tick: 40
    value: 15
    value: 5
    value: 20
    label_set_id: 1
    label_set_id: 1
    label_set_id: 2
  }
}

labelSet {
  id: 1
  label: {trace_id=1}
  label: {span_id=1}
  label: {thread_id=1}
}

labelSet {
  id: 2
  label: {trace_id=2}
  label: {span_id=2}
  label: {thread_id=2}
}
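
In this encoding, tick, value and label_set_id act as parallel arrays, one entry per event. A consumer could expand them back into per-event records roughly like the sketch below (the Go types are hypothetical mirrors of the pseudo format above, not the actual code in this PR):

package main

import "fmt"

// Breakdown mirrors the proposed per-event arrays: one entry per event.
type Breakdown struct {
	Tick       []int64  // event start times, in Profile.tick_unit
	Value      []int64  // per-event values
	LabelSetID []uint64 // references into the labelSet table
}

// Sample is reduced to the fields relevant for this sketch.
type Sample struct {
	Value     int64
	Breakdown Breakdown
}

// Event is one expanded profiling event.
type Event struct {
	Tick, Value int64
	LabelSetID  uint64
}

// expand turns the parallel arrays of a Breakdown into individual events.
func expand(s Sample) []Event {
	events := make([]Event, 0, len(s.Breakdown.Tick))
	for i, tick := range s.Breakdown.Tick {
		events = append(events, Event{
			Tick:       tick,
			Value:      s.Breakdown.Value[i],
			LabelSetID: s.Breakdown.LabelSetID[i],
		})
	}
	return events
}

func main() {
	s := Sample{
		Value: 3000000000,
		Breakdown: Breakdown{
			Tick:       []int64{10, 25, 40},
			Value:      []int64{15, 5, 20},
			LabelSetID: []uint64{1, 1, 2},
		},
	}
	for _, e := range expand(s) {
		fmt.Printf("t=%d dur=%d labelSet=%d\n", e.Tick, e.Value, e.LabelSetID)
	}
}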

Efficiency

This proposal can lead to a significant size reduction of uncompressed pprofs compared to using the existing label mechanism when adding timestamps and trace ids. Workloads with fewer unique stack traces and longer profiling periods gain the most from this proposal. E.g. in the examples below the gains range from 1.4x to 3.2x, but it's possible to construct examples that see bigger or smaller improvements.

That being said, the benefits mostly disappear after gzip compression. E.g. the same examples see only 1.01x to 1.22x gains after compression. This is perhaps not surprising, given that the duplicated stack traces are easy targets for compression.
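
(For anyone who wants to reproduce rough numbers on their own profiles: the profile package can emit both encodings, so the comparison boils down to something like the sketch below.)

package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

// sizes reports the uncompressed and gzip-compressed encoded sizes of a profile.
func sizes(p *profile.Profile) (uncompressed, compressed int, err error) {
	var raw, gz bytes.Buffer
	if err := p.WriteUncompressed(&raw); err != nil {
		return 0, 0, err
	}
	if err := p.Write(&gz); err != nil { // Write gzips its output
		return 0, 0, err
	}
	return raw.Len(), gz.Len(), nil
}

func main() {
	p, err := profile.Parse(os.Stdin) // pipe a pprof file in on stdin
	if err != nil {
		panic(err)
	}
	u, c, err := sizes(p)
	if err != nil {
		panic(err)
	}
	fmt.Printf("uncompressed=%d bytes gzip=%d bytes\n", u, c)
}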

For more details see felixge/pprof-breakdown and raw results (spreadsheet).

[Figure: uncompressed vs. gzip-compressed size comparison across the example workloads]

Compatibility

The proposed change is fully backwards compatible, but not fully forwards compatible when it comes to labels. The new format puts event-specific labels into the new Breakdown message. This means that profiles taking advantage of the new Breakdown labels won't be completely accessible to old versions of the pprof tool or to other tools consuming pprof data: those tools could still render basic flame graphs, but the label information would be invisible to them.

Next Steps

We're looking for feedback from the pprof maintainers to understand whether "Support Timestamps and Labels for Individual Events" is a compelling feature, or whether the pprof project would rather stay focused on storing pre-aggregated data. Our suggested changes to profile.proto should mostly be seen as a demonstration of the technical feasibility of such a proposal; we're not too attached to the implementation details and are happy to adjust them as needed.

Perhaps the small gains on compressed pprofs are not sufficient justification for the added complexity, and a simpler proposal that standardizes the usage of the existing label mechanism for the use cases outlined above would offer better tradeoffs. Alternatively, further optimizations to this proposal, such as delta encoding and other tweaks, could be explored.

Additionally, we'd like to get feedback from the OSS community and potentially the OTel group to make sure the outlined use cases are clear and useful to others.

@aalexand
Collaborator

aalexand commented Nov 3, 2022

Some initial thoughts:

  • The change is fairly large. Changes to profile.proto always immediately make me think about what it would take to make sure the existing tooling (pprof et cetera) supports it and this change is touching some of the fundamentals of the format.
  • The example uses a single sample type. It would help if the example used 2 sample types to illustrate how the breakdown is represented in that case.
  • The change introduces the notion of "ticks" which is an entirely new concept. I wonder if we could avoid that somehow. I am not opposed to discussing how to make timestamps better represented in profile.proto but going as far as adding a new concept into the schema fields gives me a bit of a pause.
  • Instead of reusing label sets I wonder if normalizing out the stack would make things simpler, similar to this proto.

I need to look into this more though.

@mhansen
Contributor

mhansen commented Nov 11, 2022

I'm not a pprof maintainer, just someone interested in profiler data formats, hoping my 2c are useful:

  • I think there's a big opportunity for pprof to expand into the 'timestamped profiles' space.
  • Pprof is probably the best-positioned player to do so. There aren't many other popular formats doing timestamped profiles: perfetto, chrome profiler, firefox profiler, linux perf.data, speedscope, a few other vendor-specific ones. pprof has the biggest cross-lang ecosystem.
  • Given this existing ecosystem, backwards compatibility vs evolving the protocol is the key tradeoff, and my natural reaction is to try really hard to avoid backwards incompatibility. Or at the very least, offer backwards adapters.
  • Timestamped profiles are a new feature, the timestamps not visible in old viewers. Perhaps this is a natural time to evolve the profile representation in a backwards-incompatible way -- a "v2"? A "big carrot" feature like timestamps would probably encourage uptake of the new version across the ecosystem.
  • If we decide to evolve the profile in a backwards-incompatible way, I think it'd be a big win for in-memory size to represent the stacks as a tree, with stacks pointing to their parent stack, rather than repeated frame-lists (I think this is the normalization Alexey is referring to above). Firefox Profiler also represents stacks as trees, enabling them to fit longer profiles within a browser window's RAM. The "compression" benefits are better with longer stacks, which I see a lot in Java code! (See the sketch below this list.)
  • If we think gzip is good enough that we can rely on it, then we can be more backwards-compatible and a bit more redundant. However: I think gzip for transferring and storage is fine, but the profile is always eventually decompressed for reading, and holding profiles in RAM is limiting.
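
A rough sketch of the tree representation mentioned above (the Go types are hypothetical; neither pprof nor Firefox Profiler actually defines these): each unique stack stores only its leaf frame plus a reference to its caller's stack, so shared prefixes are never repeated.

package stacktree

// stackNode represents one unique stack: its leaf frame plus the stack of its
// caller. Storing stacks this way deduplicates shared prefixes across samples.
type stackNode struct {
	LocationID uint64 // leaf frame of this stack
	Parent     int32  // index of the caller's stackNode, or -1 for a root
}

// sample then references a stack by index instead of repeating a frame list.
type sample struct {
	StackIndex int32
	Value      int64
}

// unwind reconstructs the full frame list (leaf first) for a given stack index.
func unwind(nodes []stackNode, idx int32) []uint64 {
	var frames []uint64
	for i := idx; i >= 0; i = nodes[i].Parent {
		frames = append(frames, nodes[i].LocationID)
	}
	return frames
}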

@aalexand
Collaborator

If the changes are to be backward incompatible then it's not really pprof anymore per se and should probably be done as part of the OpenTelemetry profiling format work instead.

@felixge
Author

felixge commented Nov 17, 2022

@aalexand thanks for the feedback so far. Sorry for the slow response, I was traveling. @mhansen I'll reply to your feedback in my next comment.

The change is fairly large. Changes to profile.proto always immediately make me think about what it would take to make sure the existing tooling (pprof et cetera) supports it and this change is touching some of the fundamentals of the format.

Yeah. I could see this proposal becoming a lot simpler if we decide that the uncompressed size doesn't matter. In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?

The example uses a single sample type. It would help if the example would use 2 sample types to illustrate how the breakdown is represented in that case.

I'll expand the example to include two sample types.

The change introduces the notion of "ticks" which is an entirely new concept. I wonder if we could avoid that somehow. I am not opposed to discussing how to make timestamps better represented in profile.proto but going as far as adding a new concept into the schema fields gives me a bit of a pause.

It'd be nice to be able to specify the time resolution. In my testing I saw ~20% reduction going from ns to us or us to ms. That being said, I'm not very attached to using the term "ticks" for this. Let me know if you have another idea here.

Instead of reusing label sets I wonder if normalizing out the stack would make things simpler, similar to this proto.

That'd be awesome, but I don't see a good way of doing this without breaking backwards compatibility. Based on your reply to @mhansen it seems like you wouldn't consider that to be an option for pprof?

@felixge
Author

felixge commented Nov 17, 2022

@mhansen thanks for your feedback.

  • Given this existing ecosystem, backwards compatibility vs evolving the protocol is the key tradeoff, and my natural reaction is to try really hard to avoid backwards incompatibility. Or at the very least, offer backwards adapters.

+1

  • Timestamped profiles are a new feature, the timestamps not visible in old viewers. Perhaps this is a natural time to evolve the profile representation in a backwards-incompatible way -- a "v2"? A "big carrot" feature like timestamps would probably encourage uptake of the new version across the ecosystem.

I think I'm with @aalexand. If we decide to completely break backwards compatibility, it should probably be done as a new OTel format.

  • If we decide to evolve the profile in a backwards-incompatible way, I think it'd be a big win for in-memory size to represent the stacks as a tree, with stacks pointing to their parent stack, rather than repeated frame-lists (I think this is the normalization Alexey is referring to above). Firefox Profiler also represents stacks as trees, enabling them to fit longer profiles within a browser window's RAM. The "compression" benefits are better with longer stacks, which I see a lot in Java code!

That's definitely a direction worth exploring.

  • If we think gzip is good enough that we can rely on it, then we can be more backwards-compatible and a bit more redundant. However: I think gzip for transferring and storage is fine, but the profile is always eventually decompressed for reading, and holding profiles in RAM is limiting.

As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex. That being said, having a good representation that works well on the wire as well as in-memory would be nice.

@aalexand
Collaborator

In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?

I would prefer not to special case specific tag names, but rather have a tag unit type that denotes a timestamp and have support for that. Similar to how memory and duration units are handled. The magnitude could be a part of the unit to support the different resolution of the value.
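
In pprof's Go representation, the unit of a numeric label lives in Sample.NumUnit, so such a convention might look roughly like the sketch below. The label key "event_time" and the unit string "milliseconds_since_profile_start" are made-up placeholders for illustration, not standardized values.

package timestamps

import "github.com/google/pprof/profile"

// timestampedSample sketches the idea of marking a numeric label as a timestamp
// via its unit rather than via a special-cased key, with the magnitude (ms, us,
// ns) encoded in the unit string.
func timestampedSample(loc *profile.Location) *profile.Sample {
	return &profile.Sample{
		Location: []*profile.Location{loc},
		Value:    []int64{1500000000},
		NumLabel: map[string][]int64{"event_time": {10}},
		NumUnit:  map[string][]string{"event_time": {"milliseconds_since_profile_start"}},
	}
}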

On migrating to this proto to get call stacks that can be reused between different samples to minimize duplication - yeah, breaking backward compatibility is expensive and would force a lot of changes upon us that are difficult to justify.

One idea I considered in the past that is roughly backward compatible is having a new num_keep_locations field in the Sample message like

message Sample {
  // The ids recorded here correspond to a Profile.location.id.
  // The leaf is at location_id[0].
  repeated uint64 location_id = 1;

  // Number of location IDs to keep from the previous sample. This is an
  // optimization that allows incremental encoding of the sample stack.
  // For example, if the stack is exactly the same as the previous one because
  // only the labels are different, then all of the location IDs can be reused
  // and so num_keep_locations could be set to the length of the location_id
  // field in the previous sample and the location_id field in the current sample
  // would be empty.
  uint64 num_keep_locations = 4;

  // The type and unit of each value is defined by the corresponding
  // entry in Profile.sample_type. All samples must have the same
  // number of values, the same as the length of Profile.sample_type.
  // When aggregating multiple samples into a single sample, the
  // result has a list of values that is the element-wise sum of the
  // lists of the originals.
  repeated int64 value = 2;

  // label includes additional context for this sample. It can include
  // things like a thread id, allocation size, etc
  repeated Label label = 3;
}

But this is not forward compatible (old pprof won't be able to read new profiles, nor even recognize them as new and refuse them) and it's unclear (aka needs to be measured) how much actual profile size savings this would produce for compressed profiles.
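
To illustrate how a reader might handle this, here is a sketch of the expansion step (hypothetical Go types, interpreting "previous sample" as the previously reconstructed stack). Existing profiles leave num_keep_locations at zero and therefore decode exactly as before:

package keeplocations

// decodedSample is a hypothetical in-memory form of the Sample message above,
// reduced to the two fields relevant for stack reconstruction.
type decodedSample struct {
	LocationID       []uint64
	NumKeepLocations uint64
}

// expandStacks rebuilds each sample's full stack by copying the leading
// NumKeepLocations entries of the previous (reconstructed) stack and then
// appending the sample's own location IDs.
func expandStacks(samples []decodedSample) [][]uint64 {
	var prev []uint64
	stacks := make([][]uint64, len(samples))
	for i, s := range samples {
		keep := int(s.NumKeepLocations)
		if keep > len(prev) {
			keep = len(prev) // malformed input; clamp rather than panic in this sketch
		}
		stack := make([]uint64, 0, keep+len(s.LocationID))
		stack = append(stack, prev[:keep]...)
		stack = append(stack, s.LocationID...)
		stacks[i] = stack
		prev = stack
	}
	return stacks
}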

@mhansen
Contributor

mhansen commented Nov 19, 2022

As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex.

This is a fair point, I hadn't considered streaming protobuf decoders. And you could stream in and map to an alternative normalized/tree in-memory structure to save memory.

Streaming protobuf decoders exist, but I think they're fairly niche? e.g. the default protocol buffer libraries don't have streaming support AFAIK. I expect most tool users would use these default libraries and thus be stuck inflating the entire proto into memory, and possibly OOM'ing.

Probably a lot of existing pprof tools (including go tool pprof) use these 'non-streaming' parsers. If we make a format that is so large that it strains memory to read it all in, I think that might limit uptake of the new features, or limit it to short timelines. The Protocol Buffers dev guide says that "Protocol Buffers are not designed to handle large messages". I wonder how large in-practice time-series pprof profiles would be, per second? 1GB would be OK for laptops (but probably OOM many shared webservers), and 10GB would strain laptops. Perhaps if you expect big profiles from this, we might have to update/normalize in-memory representation of go tool pprof at least, which even for that one tool is a nontrivial job (given the pprof go struct is public).

I'm not saying this should prevent us from moving forward (perhaps particularly for small profiles); just noting the challenges I see with big profiles without compression.

@felixge
Author

felixge commented Nov 19, 2022

In that case it might be enough to standardize on label names for timestamps. Is that a direction you'd be interested in exploring?

I would prefer not to special case specific tag names, but rather have a tag unit type that denotes a timestamp and have support for that. Similar to how memory and duration units are handled. The magnitude could be a part of the unit to support the different resolution of the value.

Sounds good. I'll try to do some experiments with that when I get a chance.

On migrating to this proto to get call stacks that can be reused between different samples to minimize duplication - yeah, breaking backward compatibility is expensive and would force a lot of changes upon us that are difficult to justify.

One idea I considered in the past that is roughly backward compatible is having a new num_keep_locations field in the Sample message like

message Sample {
  // The ids recorded here correspond to a Profile.location.id.
  // The leaf is at location_id[0].
  repeated uint64 location_id = 1;

  // Number of location IDs to keep from the previous sample. This is an
  // optimization that allows incremental encoding of the sample stack.
  // For example, if the stack is exactly the same as the previous one because
  // only the labels are different, then all of the location IDs can be reused
  // and so num_keep_locations could be set to the length of the location_id
  // field in the previous sample and the location_id field in the current sample
  // would be empty.
  uint64 num_keep_locations = 4;

  // The type and unit of each value is defined by the corresponding
  // entry in Profile.sample_type. All samples must have the same
  // number of values, the same as the length of Profile.sample_type.
  // When aggregating multiple samples into a single sample, the
  // result has a list of values that is the element-wise sum of the
  // lists of the originals.
  repeated int64 value = 2;

  // label includes additional context for this sample. It can include
  // things like a thread id, allocation size, etc
  repeated Label label = 3;
}

But this is not forward compatible (old pprof won't be able to read new profiles, nor even recognize them as new and refuse them) and it's unclear (aka needs to be measured) how much actual profile size savings this would produce for compressed profiles.

Nice idea 👍. If we're going to explore this space of forwards incompatible solutions, we should definitely take a look at this approach, along with the tree stuff proposed by @mhansen, and perhaps also stack trace hashing as pioneered by prodfiler/elastic.

@felixge
Author

felixge commented Nov 19, 2022

As far as I'm concerned, the in-memory representation of a profile can be different from the transfer/storage format. It just makes encoding/decoding a little more complex.

This is a fair point, I hadn't considered streaming protobuf decoders. And you could stream in and map to an alternative normalized/tree in-memory structure to save memory.

Streaming protobuf decoders exist, but I think they're fairly niche? e.g. the default protocol buffer libraries don't have streaming support AFAIK. I expect most tool users would use these default libraries and thus be stuck inflating the entire proto into memory, and possibly OOM'ing.

Probably a lot of existing pprof tools (including go tool pprof) use these 'non-streaming' parsers. If we make a format that is so large that it strains memory to read it all in, I think that might limit uptake of the new features, or limit it to short timelines. The Protocol Buffers dev guide says that "Protocol Buffers are not designed to handle large messages".

I have a lot of thoughts on protobufs 😅. I agree that the standard tooling doesn't make it easy to implement efficient/streaming decoders. That being said, I think it's totally feasible to build them. In fact, we recently added a pproflite package to our Go profiler that provides streaming pprof decoding with zero allocations (only the creation of a new decoder allocates, but they are reusable). We use it to implement efficient profile delta computations.

Of course an ideal uncompressed wire format would be efficient, simple and suitable for in-memory representation. But ... I think it's feasible to make some tradeoffs here. Especially for compatibility purposes.

I wonder how large in-practice time-series pprof profiles would be, per second? 1GB would be OK for laptops (but probably OOM many shared webservers), and 10GB would strain laptops. Perhaps if you expect big profiles from this, we might have to update/normalize in-memory representation of go tool pprof at least, which even for that one tool is a nontrivial job (given the pprof go struct is public).

I'm not saying this should prevent us from moving forward (perhaps particularly for small profiles); just noting the challenges I see with big profiles without compression.

Yeah, the increase in profile size is a challenge when adding timestamps. But given enough optimization work, I think it will be feasible.

@aalexand
Collaborator

Nice idea 👍. If we're going to explore this space of forwards incompatible solutions, we should definitely take a look at this approach, along with the tree stuff proposed by @mhansen, and perhaps also stack trace hashing as pioneered by prodfiler/elastic.

One note is that approaches like using a tree are incompatible in both directions, while the "incremental stack encoding" approach I mentioned at least allows all current profiles to be transparently opened by the new code, since num_keep_locations is unset and so is effectively zero, which means the sample's own location_id list carries the whole stack.

@felixge
Author

felixge commented Nov 19, 2022

One note is that approaches like using a tree are incompatible in both directions, while the "incremental stack encoding" approach I mentioned at least allows all current profiles to be transparently opened by the new code, since num_keep_locations is unset and so is effectively zero, which means the sample's own location_id list carries the whole stack.

I think that just depends on how the decoder works? E.g. if we added a new Sample.stack_id field that points into a stack tree, new versions of pprof would recognize that this field is present and ignore the location list. That would be backwards compatible IMO. Maybe I'm missing something?

@aalexand
Collaborator

That means that on the read path there is an if statement to take the old path or the new path. With the num_keep_locations approach there is no if statement.

@felixge
Author

felixge commented Mar 5, 2023

For now I'm not planning to continue pushing this proposal forward in its current form, so I'm closing this.

More discussions are still happening as part of the OTel Profiling Meetings.
