Proposal: Adding profiling as a support event type #139

Closed
MovieStoreGuy opened this issue Oct 22, 2020 · 33 comments
Labels
release:after-ga Not required before GA release, and not going to work on before GA

@MovieStoreGuy
Contributor

MovieStoreGuy commented Oct 22, 2020

Profiling events

There is a growing view that performance monitoring and application monitoring (tracking the time spent in functions or methods, versus how long it takes to serve a request) are nearly identical concerns, and that both fall under the umbrella of observability (understanding how your service is performing).

How is this different from tracing?

Conventional tracing follows a user's request through the application to show the time spent in different operations. However, it can miss background operations that indirectly impact the request flow.

For example, take a rate-limiting service that has a background sync to share state with other nodes:

// Sketch from the proposal, with syntax fixed; ratelimit and limits are
// application-defined helpers.
import (
    "net/http"

    "go.opentelemetry.io/otel/trace"
)

func ShouldRateLimit(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Pick up the span carried in the request context so this
        // middleware's work is attributed to the request's trace.
        span := trace.SpanFromContext(r.Context())
        defer span.End()

        key, err := ratelimit.GetKey(r)
        if err != nil {
            http.Error(w, http.StatusText(http.StatusBadRequest), http.StatusBadRequest)
            return
        }
        if limits.Key(key).Exceed() {
            w.WriteHeader(http.StatusTooManyRequests) // return 429 status code
            return
        }
        next.ServeHTTP(w, r)
    })
}

func (l *limits) SyncLimits() {
    l.cache.RLock()
    defer l.cache.RUnlock()
    for _, limit := range l.cache.entries { // entries: placeholder for the cached limits
        // publish data to each node or distributed cache
        // update internal values with shared updates
        _ = limit
    }
}

In the above example, I can clearly see how ShouldRateLimit impacts a request's processing time, since the request context can be used to link spans together. However, there is a hidden cost in SyncLimits that currently cannot be exposed: it runs independently of inbound requests, and therefore cannot (and should not) share the same context.

Now, the SyncLimits function could emit metrics to help expose runtime performance issues, but that approach is problematic because:

  • As a developer, I need to know in advance what to start observing in order to diagnose the problem
  • The problem may disappear due to the nature of the issue (race conditions, Heisenbugs)
  • It is hard to measure the performance of one function relative to the entire application
  • Deadlocks / livelocks cannot be measured easily without elaborate code orchestration

Suggestion

At least within the Go community, https://github.com/google/pprof has been the leading tool for answering these kinds of questions, and it enjoys first-party support within Go. Moreover, AWS has its own solution, https://aws.amazon.com/codeguru/, which offers something similar for JVM-based applications.

Desired outcomes of data:

  • Show cumulative runtime of functions (could also derive percentage from this data)
  • Map resource usage (CPU, Memory, and I/O) to internal methods / functions

Desired outcomes of orchestration:

  • Low friction when adding profiling support (as an example, pprof adds a single handler to perform software-based profiling; see the sketch below this list)
  • Should not require major modifications of existing code (it should not require adding functions that would complicate existing logic)
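
To make the "single handler" point concrete, here is a minimal Go sketch of pprof's low-friction model: a blank import of net/http/pprof registers the /debug/pprof/* handlers on the default mux, with no changes to application logic (the port is illustrative).

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side-effect import: registers /debug/pprof/* handlers
)

func main() {
    // Profiling endpoints are now served from the default mux alongside
    // whatever handlers the application already registers there.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}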

I understand that software-based profiling is not 100% accurate, as per the write-up at https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md. However, it could give amazing insight into hidden application performance, helping to improve reliability and performance and to surface resource issues that are hard to discover with the existing events being emitted.

@jkwatson
Contributor

FYI, JFR is probably the top JVM profiling tool, as it's built into the JVM these days.

@MovieStoreGuy
Contributor Author

That is awesome to know @jkwatson :D I don't often work with JVM-based languages but I will 100% have a look :D

@iNikem
Contributor

iNikem commented Oct 22, 2020

@jkwatson @MovieStoreGuy The top profiling tool for JVM is async-profiler :)

@jkwatson
Contributor

@jkwatson @MovieStoreGuy The top profiling tool for JVM is async-profiler :)

The docs on that are seriously out of date...they still reference JFR as a commercial product. I guess that's true if you're profiling java 7, but I don't wish that on anyone.

@iNikem
Contributor

iNikem commented Oct 22, 2020

What docs are out of date?

@jkwatson
Contributor

well, now I can't find the ones I was just looking at, so /shrug.
Also, this probably isn't the place to argue about specific profiling tools. :)

@andrewhsu added the release:after-ga label Oct 23, 2020
@MovieStoreGuy
Contributor Author

I agree with @jkwatson. I appreciate you bringing JVM tools to my attention, but that is not the focus of this proposal :)

@rakyll

rakyll commented Mar 25, 2021

We're interested in being able to collect CPU, memory, contention and other profiles with OpenTelemetry and have representations of profiles in OTLP and support in the collector. We are currently also looking into existing data model alternatives such as pprof as an option given its wide use in open source and language support.

We want to enable cases where we can use OpenTelemetry attributes to label profiles as well. pprof has support for labelling (an example can be seen at https://rakyll.org/profiler-labels/).
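
As a concrete illustration of that labelling mechanism, here is a minimal Go sketch using runtime/pprof's Do and Labels; the label keys and handler function are hypothetical examples, not an OpenTelemetry API.

package main

import (
    "context"
    "runtime/pprof"
)

// handleCheckout shows profiler labels in action: any CPU samples taken
// while the callback runs carry these labels, so profiles can later be
// filtered or grouped by them (e.g. per tenant or per endpoint).
func handleCheckout(ctx context.Context, tenant string) {
    pprof.Do(ctx, pprof.Labels("tenant", tenant, "endpoint", "/checkout"),
        func(ctx context.Context) {
            // ... request handling work to be profiled ...
        })
}

func main() {
    handleCheckout(context.Background(), "acme")
}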

As of today, it's very difficult for our users to enable profiling at a later time, especially in production. They need to add CodeGuru Profiler libraries, rebuild, and redeploy. As more and more of them link in OpenTelemetry for other telemetry collection, we want to enable cases where profile collection can be turned on dynamically at runtime. This use case will require the OpenTelemetry client libraries to speak to the collector (or another control plane) to enable/disable collection.

@thegreystone

thegreystone commented Apr 1, 2021

Not sure if this will help, but I thought I'd chip in with what we're doing at Datadog. For the continuous profiler (which integrates with our tracer), we're using our own profiling libraries for most platforms, and on the JVM our own agent built on JFR. For the JVM we've added our own profiling events for various kinds of profiling (e.g. rate-limited exception profiling). For non-JVM languages we partly use pprof as the serialization format (some data doesn't fit well into the model, so it's currently an archive with multiple files in it). For the JVM we use JFR as the serialization format.

There are a few interesting initiatives for JFR in recent and upcoming versions, such as a new allocation profiler in JDK 16 and much faster stack trace capturing (I believe JDK 17). We (Datadog) are also considering contributing an all-new, full-process, proper CPU profiler and some neat new capabilities allowing you to, for example, easily implement your own dynamic wall-clock profiler.

@MovieStoreGuy
Contributor Author

It has been some time since I opened this, but I'd like to know how I could help speed up whatever is required to make this part of the default OTel offering.

@tedsuo
Contributor

tedsuo commented Apr 21, 2021

Hi @MovieStoreGuy. We're pretty heads down getting metrics and logs completed, as well as expanding and improving library instrumentation. There probably will not be a lot of bandwidth from the current community until these components are stable; apologies in advance, it will probably be slow going. However, profiling is definitely top priority after metrics and logs!

If you, @thegreystone, and others are interested in contributing work towards this project, I would suggest the following steps, which any new signal would need to take:

  1. Create a prototype in (ideally) two or three languages.
  2. Write an OTEP with the proposed specification, based on those prototypes.

If there is a group willing to put in the time to prototype, we can help by creating an OTel SIG for this work (a repo plus a Slack channel for discussion). But again, I'm concerned that the spec reviewers and language maintainers are fully committed, so there may not be a lot of bandwidth for review or assistance until we clear the deck. I hate saying "next year", but six months to complete metrics and the remaining current initiatives is probably realistic. If there are well-thought-out proposals and prototypes by then, it would definitely give this project a speed boost. :)

@aalexand

We (owners of https://github.com/google/pprof repo) would be curious what it would take to standardize on the profile.proto as the wire format for profiling data in OTel.

@thegreystone RE "some data doesn't fit well into the model, so it's currently an archive with multiple files in it" - do you mind elaborating on that?
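
For concreteness, a small Go sketch of what standardizing on profile.proto would buy consumers: the github.com/google/pprof/profile package already parses the format, so tooling could consume profiles uniformly across producers. The input file name here is illustrative.

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/google/pprof/profile"
)

func main() {
    f, err := os.Open("cpu.pprof") // illustrative input file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // profile.Parse understands both gzip-compressed and raw profile.proto.
    p, err := profile.Parse(f)
    if err != nil {
        log.Fatal(err)
    }
    // Each sample carries a stack (locations), one value per sample type,
    // and string labels -- a natural hook for correlating profiles with
    // other observability signals.
    for _, s := range p.Sample {
        fmt.Println(len(s.Location), s.Value, s.Label)
    }
}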

@alolita
Member

alolita commented Jul 27, 2021

Is pprof being evaluated? It would be great to have a formal issue in the community repo. Ty!

@MovieStoreGuy
Contributor Author

Hey @alolita ,

Which community repo are you referring to?

@ymotongpoo

@alolita do you mean this repository? https://github.com/open-telemetry/community

If yes, could you point out which SIGs or teams to chime in with on this topic?

@jsuereth
Contributor

Here's the donation process for contributing code.

  • If pprof itself will be contributed over time, then we need to do the long-form process (similar to other major technologies we've pulled in).
  • If the proposal (right now) is just the ability to send profiles in OTLP and correlate with other observability signals, perhaps just an OTEP is enough for the initial discussion.

No matter which process is in place, we should have a location where we collect documentation on:

  • Current state of the art for profiling (what technologies, outside of pprof, are used, across languages, etc.)
  • The need for profiling as an observability signal (you have some of this written down here)
  • What would be contributed to OpenTelemetry (AFAIK - this OTEP is just for the protocol)
  • Why OpenTelemetry is the best fit.

@mhansen

mhansen commented Oct 2, 2021

Current state of the art for profiling (what technologies, outside of pprof, are used, across languages, etc.)

I think I can help with this. I've just researched the ecosystem of profilers, profile data formats, data format converters, and profile analysis UIs: https://www.markhansen.co.nz/profilerpedia/. I'm probably missing a few, but I think I've covered most of the main ones. I hope this can be a useful starting point for the standardisation process.

@mhansen

mhansen commented Nov 20, 2021

FYI, I've now made a website for Profilerpedia (it's not just a Google Sheet any more): https://profilerpedia.markhansen.co.nz/, and the site renders directed graphs of profilers, their data formats, the transitive closure of data formats you can convert to, and UIs that can read those formats.

For example, the transitive set of profilers that are convertible to pprof (warning: huge graph, and some conversions are lossy): https://profilerpedia.markhansen.co.nz/formats/pprof/#converts-from-transitive

@thomasdullien

thomasdullien commented Apr 7, 2022

For what it's worth: we've been running prodfiler's continuous profiling service for the last 15 months and have gained extensive experience with the various footguns involved in collecting profiling data and how to make use of it. Would be more than happy to share what we've learnt and what to watch out for, or otherwise assist in the design process.

A few things to keep in mind:

  1. Issues when pre-aggregating the data too much
  2. Data volume / data efficiency

On (1): For a good user experience, it is often necessary for users to drill down into fine-grained profiling event data, which means filtering profiling events by things like container, thread, and timeframe. This creates problems when the data is pre-aggregated too early or at too coarse a granularity. The ideal format for the recipient is actual individual sampling events; this ideal then needs to be balanced against other requirements.

On (2): It's important to be careful about data volume. Given that the ideal format sends individual samples, and that one wants to sample at anywhere between 20 Hz and 200 Hz per core, we are looking at 20 * 2^6 to 200 * 2^6 events per second in the worst case on a 64-core server. This means that sending out full stack traces for each event quickly becomes prohibitive: a Java method name can easily run 32-64 characters, and a deep Java stack can be 128+ frames.

So if we look at:

  • 2^7 frames per stack,
  • 2^6 characters per frame,
  • 2^6 cores, and
  • 2^5-2^6 samples per second per core,

we are looking at up to 2^25 bytes per second (~32 MB) just for profiling data.

We ended up solving this by not transmitting full stack traces, only hashes of traces, which reduces the amount of data dramatically.
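
A minimal sketch of that hash-based scheme, under assumed names (traceHash, recordSample, and the in-memory sentTraces set are illustrative, not prodfiler's actual implementation): the agent uploads a stack's frames only the first time its hash is seen, and afterwards ships the hash alone.

package main

import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
)

// traceHash computes a stable identifier for a stack trace (a list of
// program counters) so repeated samples can be shipped as (hash, count)
// instead of the full frame list.
func traceHash(pcs []uint64) [32]byte {
    h := sha256.New()
    var buf [8]byte
    for _, pc := range pcs {
        binary.LittleEndian.PutUint64(buf[:], pc)
        h.Write(buf[:])
    }
    var out [32]byte
    copy(out[:], h.Sum(nil))
    return out
}

// sentTraces tracks which traces the backend already knows about; full
// frames are uploaded once per hash, then only hashes travel on the wire.
var sentTraces = map[[32]byte]bool{}

func recordSample(pcs []uint64) (hash [32]byte, needFullTrace bool) {
    hash = traceHash(pcs)
    if sentTraces[hash] {
        return hash, false // backend has the frames; send the hash only
    }
    sentTraces[hash] = true
    return hash, true // first sighting: send frames alongside the hash
}

func main() {
    h, first := recordSample([]uint64{0x40123a, 0x401b7f})
    fmt.Printf("hash=%x first=%v\n", h[:4], first)
}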

Happy to help & provide more input!

@jpkrohling
Member

Tagging @brancz, who should have an opinion or two about this.

@petethepig
Member

At Pyroscope we've been building an open source continuous profiling platform for over a year now. We integrate with many different profilers from various languages and other open source projects in our agents:

  • Go: pprof
  • Ruby: rbspy
  • Python: py-spy
  • Java: async-profiler
  • eBPF: profile.py from bcc-tools
  • PHP: phpspy
  • .NET: dotnet trace
  • Rust: pprof-rs

Since we've had to deal with supporting all of these different profile formats in order to store them, we are also looking forward to an agreed-upon standardized format for profiles, especially as more tooling gets created to analyze and interact with profiles.

For example, we recently created an otelpyroscope package to link traces to profiles. Thanks to label support in pprof, this was really easy to implement.

On the other hand, some agents report profiling data in a format that doesn't support "labels", which makes an integration like this impossible, since labels are needed to link profiles to other types of telemetry data.

Another example of where standardization would be useful: to support Java profiles from async-profiler, we had to write a JFR parser in Go so that we could ingest them. Again, if all profilers used (or at least supported) one output format, this would have been much easier.

All that being said, every profiler on this list also has its own quirks and nuances in its output format, which makes supporting them all far more complicated than if they shared a standardized format.

Happy to help provide our thoughts and experience as we've gone through supporting many profiling formats across languages and projects and would love to help contribute to this effort.

@jhalliday

With the metrics effort hitting release-candidate stage (yay!), we're hopefully approaching a period when reviewers have a bit more time available. However, that only matters if there is something to review...
I have some time available to discuss ideas/requirements and start prototyping profiling support, mainly with a JVM focus. Is anyone else available to contribute seed work, perhaps for Go or another language? If we hit critical mass then requesting a new SIG probably makes sense; if not, I'll just work in my own space for now.

@mtwo
Member

mtwo commented May 26, 2022

This is perfect timing! We discussed the project roadmap during the in-person community meeting at KubeCon last week, and profiling support was the second most popular topic, after logging (which is already in flight)! The process that you mentioned (contributing seed work, writing requirements, forming a SIG) is what we used for logging, and I think it makes sense to follow it here as well.

Do people want to discuss this on a call sometime next week? Any objections to 8:00 AM PT on Friday, June 3rd?

@Rperry2174
Contributor

Rperry2174 commented May 27, 2022

@mtwo I shared the steps you mentioned the other day in a Slack channel here with a bunch of profiling developers, several of whom have expressed interest.

Anyway would love to chat Friday!

@mtwo
Member

mtwo commented May 31, 2022

I've created a meeting in the OpenTelemetry calendar for 8:00 AM PT this Friday for us to meet!

@ahaw023

ahaw023 commented Jun 14, 2022

Would be good to include eBPF tools like Pixie.

@brancz

brancz commented Jun 14, 2022

Parca, Pixie and prodfiler are all eBPF based and participating in this.

@Rperry2174
Contributor

Hi all, as many of you know, there has been a working group of many of the people in this thread meeting to come up with a collective vision for profiling. A PR has been submitted detailing that vision, and we'd love to get more feedback on it!

Please check it out and comment if you have any feedback. If you are generally in agreement, we'd love to get more approvals from the community members who have expressed interest in this (even if you are not part of the OTel org)!

#212

tigrannajaryan pushed a commit that referenced this issue Sep 22, 2022
This change proposes high-level items that define our long-term vision for profiling support in the OpenTelemetry project.

A group of open source maintainers, vendors, end-users, and developers excited about profiling / OTel have been meeting for ~8 weeks now, and this document was created collectively in order to share it with the broader community for feedback. We've had ~15 people contribute directly to the document and over 60 people attend our meetings over the past couple of months.

This idea of "Adding profiling as a support event type" has also been discussed at length in [this issue](#139) created in October 2020. 

If this proposal is accepted/approved then we will proceed with filling out this [project tracking issue](open-telemetry/opentelemetry-specification#2731) and following the other procedures outlined in the [project management instructions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/project-management.md). 

Comments, ideas, feedback, etc. are all very welcome and highly appreciated!
@gillg

gillg commented Feb 12, 2023

Are there any experiments or alpha tests around this subject?
I would like to bring up an initiative I found in the .NET ecosystem, dotnet/diagnostics#2948 (comment); the idea is to use the pprof API as a kind of standard.
There is also a very new project (last comment) which allows you to use Grafana Phlare directly to store profiles.

@Rperry2174
Contributor

Hi @gillg, we are actively doing tests around this subject right now. You can follow the progress of our most recent benchmarks here, but yes, we are definitely planning on something close to pprof.

The majority of the discussion is happening in the #otel-profiles channel in the cncf slack. Would love to have you hop in and give your thoughts there!

@brunobat

I guess this proposal: open-telemetry/community#1918 might fix this issue.

@ayewo

ayewo commented Jan 31, 2024

@brunobat Came here to post the same thing.

@trask
Member

trask commented Apr 15, 2024

Closed by #239

@trask closed this as completed Apr 15, 2024