runtime/trace: execution tracer overhaul #60773
Labels
compiler/runtime
Issues related to the Go compiler and/or runtime.
NeedsFix
The path to resolution is known, but the work has not been done.
Milestone
Execution tracer overhaul
Authored by mknyszek@google.com with a mountain of input from others.
In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan Boreham for your invaluable feedback.
Background
Original design document.
Go execution traces provide a moment-to-moment view of what happens in a Go program over some duration. This information is invaluable in understanding program behavior over time and can be leveraged to achieve significant performance improvements. Because Go has a runtime, it can provide deep information about program execution without any external dependencies, making traces particularly attractive for large deployments.
Unfortunately limitations in the trace implementation prevent widespread use.
For example, the process of analyzing execution traces scales poorly with the size of the trace. Traces need to be parsed in their entirety to do anything useful with them, making them impossible to stream. As a result, trace parsing and validation has very high memory requirements for large traces.
Also, Go execution traces are designed to be internally consistent, but don't provide any way to align with other kinds of traces, for example OpenTelemetry traces and Linux sched traces. Alignment with higher level tracing mechanisms is critical to connecting business-level tasks with resource costs. Meanwhile alignment with lower level traces enables a fully vertical view of application performance to root out the most difficult and subtle issues.
Lastly, the implementation of the execution tracer has evolved organically over time and it shows. The codebase also has many old warts and some age-old bugs that make collecting traces difficult, and seem broken. Furthermore, many significant decision decisions were made over the years but weren't thoroughly documented; those decisions largely exist solely in old commit messages and breadcrumbs left in comments within the codebase itself.
Thanks to work in Go 1.21 cycle, the execution tracer's run-time overhead was reduced from about -10% throughput and +10% request latency in web services to about 1% in both for most applications. This reduced overhead in conjunction with making traces more scalable enables some exciting and powerful new opportunities for traces.
Goals
The goal of this document is to define an alternative implementation for Go execution traces that scales up to large Go deployments.
Specifically, the design presented aims to achieve:
This document also describes the existing state of the tracer in detail and explains how we got there.
Design
Link to design document.
CC @felixge @nsrip-dd @prattmic @aclements @rhysh @dominikh @bboreham @thepudds
The text was updated successfully, but these errors were encountered: