This repository was archived by the owner on Nov 17, 2025. It is now read-only.

Proposal: Adding profiling as a support event type #139

@MovieStoreGuy

Description

Profiling events

There is a growing view that performance monitoring and application monitoring (tracking the time spent in functions or methods, versus how long it takes to serve a request) are near identical, and that both fall under the umbrella of observability (understanding how your service is performing).

How is this different from tracing?

Conventional tracing shows how a user's request flows through the application and how much time is spent in each operation. However, it can miss background operations that indirectly impact the request flow.

For example, take a rate-limiting service that has a background sync to share state among other nodes:

func ShouldRateLimit(next http.Handler) http.Handler {
   return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         span := trace.SpanFromContext(r.Context())
         defer span.End()

         key, err := ratelimit.GetKey(r)
         if err != nil {
              http.Error(w, "bad request", http.StatusBadRequest)
              return
         }
         if limits.Key(key).Exceed() {
              http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
              return
         }
         next.ServeHTTP(w, r)
   })
}

func (l *limits) SyncLimits() {
    l.cache.RLock()
    defer l.cache.RUnlock()
    for _, limit := range l.cache {
          // publish data to each node or distributed cache
          // update internal values with shared updates
          _ = limit
    }
}

In the above example, I can clearly see how ShouldRateLimit impacts request processing time, since the context carried by the request can be used to link spans together. But there is a hidden cost in SyncLimits that currently cannot be exposed: it runs independently of inbound requests and therefore cannot (and should not) share the same context.

Now, the SyncLimits function could emit metrics to help expose runtime performance issues, but that approach is problematic because:

  • As a developer, I need to know in advance what to observe in order to diagnose the problem
  • The problem may disappear because of its nature (race conditions, Heisenbugs)
  • It is hard to measure the performance of one function relative to the entire application
  • Deadlocks / livelocks cannot easily be measured without elaborate code orchestration

Suggestion

At least within the Go community, https://github.com/google/pprof has been the leading tool for answering these kinds of questions, and profiling has first-party support within Go. Moreover, AWS also has its own solution, https://aws.amazon.com/codeguru/, which offers something similar for JVM-based applications.

Desired outcomes of data:

  • Show cumulative runtime of functions (could also derive percentage from this data)
  • Map resource usage (CPU, Memory, and I/O) to internal methods / functions

Desired outcomes of orchestration:

  • Low friction when adding profiling support (as an example, pprof adds a single handler to perform software-based profiling)
  • Should not require major modifications to existing code (should not require adding functions that would complicate existing logic)

I understand that software-based profiling is not 100% accurate, as discussed in the write-up at https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md. However, it could give amazing insight into hidden application performance, helping to increase reliability and performance and to discover resource issues that are hard to find with the existing event types.

Metadata

Assignees

No one assigned

    Labels

    release:after-ga (Not required before GA release, and not going to work on before GA)
