Profiling events
There is a growing view that performance monitoring and application monitoring (tracking the time spent in functions or methods, versus how long it takes to serve a request) are near-identical concerns, both falling under the umbrella of Observability: understanding how your service is performing.
How is this different from tracing?
Conventional tracing follows a user's request through the application, showing the time spent in each operation. However, it can miss background operations that indirectly impact the request flow.
e.g. take a rate-limiting service that has a background sync to share state among other nodes:
func ShouldRateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, span := otel.Tracer("ratelimit").Start(r.Context(), "ShouldRateLimit")
		defer span.End()

		key, err := ratelimit.GetKey(r)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if limits.Key(key).Exceed() {
			// return 429 status code
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
func (l *limits) SyncLimits() {
	l.mu.RLock()
	defer l.mu.RUnlock()

	for _, limit := range l.cache {
		// publish data to each node or distributed cache
		// update internal values with shared updates
		_ = limit
	}
}
In the above example, I can clearly see how ShouldRateLimit impacts a request's processing time, since the context carried by the request links its spans together. SyncLimits, however, carries a hidden cost that currently cannot be exposed: it runs independently of inbound requests and therefore cannot (and should not) share a request's context.
Now, the SyncLimits function could emit metrics to help expose runtime performance issues, but that approach is problematic for several reasons:
- As a developer, I need to know in advance what to observe in order to diagnose a problem
- The problem may disappear before it can be measured, given the nature of such issues (race conditions, Heisenbugs)
- Metrics make it hard to measure the performance of one function relative to the entire application
- Deadlocks and livelocks cannot easily be measured without elaborate code orchestration
Suggestion
At least within the Go community, https://github.com/google/pprof has been the leading tool for answering these kinds of questions, and it has first-party support within Go. Moreover, AWS also has its own solution, https://aws.amazon.com/codeguru/, which offers something similar for JVM-based applications.
Desired outcomes of data:
- Show cumulative runtime of functions (percentages could also be derived from this data)
- Map resource usage (CPU, Memory, and I/O) to internal methods / functions
Desired outcomes of orchestration:
- Low friction when adding profiling support (as an example, pprof adds a single handler to perform software-based profiling)
- Should not require major modifications to existing code (no new functions that would complicate existing logic)
I understand that software-based profiling is not 100% accurate, as per the write-up here: https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md. However, it could give amazing insight into hidden application performance, helping to increase reliability and performance and to uncover resource issues that are hard to discover with the existing events being emitted.