Authored by mknyszek@google.com with a mountain of input from others.
In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan Boreham for your invaluable feedback.
Original design document from 2014.
Go execution traces provide a moment-to-moment view of what happens in a Go program over some duration. This information is invaluable in understanding program behavior over time and can be leveraged to achieve significant performance improvements. Because Go has a runtime, it can provide deep information about program execution without any external dependencies, making traces particularly attractive for large deployments.
Unfortunately, limitations in the trace implementation prevent widespread use.
For example, the process of analyzing execution traces scales poorly with the size of the trace. Traces need to be parsed in their entirety to do anything useful with them, making them impossible to stream. As a result, trace parsing and validation has very high memory requirements for large traces.
Also, Go execution traces are designed to be internally consistent, but don't provide any way to align with other kinds of traces, for example OpenTelemetry traces and Linux sched traces. Alignment with higher level tracing mechanisms is critical to connecting business-level tasks with resource costs. Meanwhile alignment with lower level traces enables a fully vertical view of application performance to root out the most difficult and subtle issues.
Thanks to work in the Go 1.21 cycle, the execution tracer's run-time overhead was reduced from about -10% throughput and +10% request latency in web services to about 1% in both for most applications. This reduced overhead, in conjunction with making traces more scalable, enables some exciting and powerful new opportunities for the diagnostic.
Lastly, the implementation of the execution tracer has evolved organically over time and it shows. The codebase also has many old warts and some age-old bugs that make collecting traces difficult and make them seem broken. Furthermore, many significant design decisions were made over the years but weren't thoroughly documented; those decisions largely exist solely in old commit messages and breadcrumbs left in comments within the codebase itself.
The goal of this document is to define an alternative implementation for Go execution traces that scales up to large Go deployments.
Specifically, the design presented aims to achieve:
- Make trace parsing require a small fraction of the memory it requires today.
- Streamable traces, to enable analysis without storage.
- Fix age-old bugs and present a path to clean up the implementation.
Furthermore, this document will present the existing state of the tracer in detail and explain why it's like that to justify the changes being made.
The design is broken down into four parts:
- Timestamps and sequencing.
- Orienting against threads instead of Ps.
- Partitioning.
- Wire format cleanup.
These four parts are roughly ordered by how fundamental they are to the trace design, and so the former sections are more like concrete proposals, while the latter sections are more like design suggestions that would benefit from prototyping. The earlier parts can also be implemented without considering the latter parts.
Each section includes the history and design of the existing system as well, both to document the current system in one place and to make it easier to compare against the newly proposed system. That requires, however, a lot of prose, which can obscure the bigger picture. Here are the highlights of each section without that additional context.
Timestamps and sequencing.
- Compute timestamps from the OS's monotonic clock (`nanotime`).
- Use per-goroutine sequence numbers for establishing a partial order of events (as before).
Orienting against threads (Ms) instead of Ps.
- Batch trace events by M instead of by P.
- Use lightweight M synchronization for trace start and stop.
- Simplify syscall handling.
- All syscalls have a full duration in the trace.
Partitioning.
- Traces are sequences of fully self-contained partitions that may be streamed.
- Each partition has its own stack table and string table.
- Partitions are purely logical: consecutive batches with the same ID.
- In general, parsers need state from the previous partition to get accurate timing information.
- Partitions have an "earliest possible" timestamp to allow for imprecise analysis without a previous partition.
- Partitions are bound by both a maximum wall time and a maximum size (determined empirically).
- Traces contain an optional footer delineating partition boundaries as byte offsets.
- Emit batch lengths to allow for rapidly identifying all batches within a partition.
Wire format cleanup.
- More consistent naming scheme for event types.
- Separate out "reasons" that a goroutine can block or stop from the event types.
- Put trace stacks, strings, and CPU samples in dedicated batches.
For years, the Go execution tracer has used the `cputicks` runtime function for obtaining a timestamp. On most platforms, this function queries the CPU for a tick count with a single instruction. (Intuitively, a "tick" goes by roughly every CPU clock period, but in practice this clock usually has a constant rate that's entirely independent of CPU frequency.) Originally, the execution tracer used this stamp exclusively for ordering trace events.
Unfortunately, many modern CPUs don't provide such a clock that is stable across CPU cores, meaning that even though cores might synchronize with one another, the clock read-out on each CPU is not guaranteed to be ordered in the same direction as that synchronization.
This led to traces with inconsistent timestamps.
To combat this, the execution tracer was modified to use a global sequence counter that was incremented synchronously for each event. Each event would then have a sequence number that could be used to order it relative to other events on other CPUs, and the timestamps could just be used solely as a measure of duration on the same CPU. However, this approach led to tremendous overheads, especially on multiprocessor systems.
That's why in Go 1.7 the implementation changed so that each goroutine had its own sequence counter. The implementation also cleverly avoids including the sequence number in the vast majority of events by observing that running goroutines can't actually be taken off their thread until they're synchronously preempted or yield themselves. Any event emitted while the goroutine is running is trivially ordered after the start. The only non-trivial ordering cases left are those where an event is emitted by or on behalf of a goroutine that is not actively running. (Note: I like to summarize this case as "emitting events at a distance," because the scheduling resource itself is not emitting the event bound to it.) These cases need to be ordered with respect to each other and with a goroutine starting to run (i.e. the `GoStart` event).
In the current trace format, there are only two such cases: the `GoUnblock` event, which indicates that a blocked goroutine may start running again and is useful for identifying scheduling latencies, and `GoSysExit`, which is used to determine the duration of a syscall but may be emitted from a different P than the one the syscall was originally made on. (For more details on the `GoSysExit` case, see the next section.) Furthermore, there are versions of the `GoStart`, `GoUnblock`, and `GoSysExit` events that omit a sequence number to save space if the goroutine just ran on the same P as the last event, since that's also a case where the events are trivially serialized.
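To make the sequencing idea concrete, here is a toy sketch (not the runtime's actual parser; event and function names are illustrative) of how a consumer could use explicit per-goroutine sequence numbers to totally order "at a distance" events, regardless of the order in which they arrive across batches:

```go
package main

import "fmt"

// gEvent is a trace event bound to a goroutine, carrying the explicit
// per-goroutine sequence number used for "events at a distance".
type gEvent struct {
	name string
	goid uint64
	seq  uint64
}

// order repeatedly scans the pending events and admits each one whose
// sequence number is the next expected for its goroutine, yielding a
// valid total order independent of arrival (batch) order.
func order(pending []gEvent) []gEvent {
	next := map[uint64]uint64{}
	var out []gEvent
	for len(out) < len(pending) {
		progressed := false
		for _, e := range pending {
			if e.seq == next[e.goid] {
				out = append(out, e)
				next[e.goid]++
				progressed = true
			}
		}
		if !progressed {
			break // malformed trace: a sequence number is missing
		}
	}
	return out
}

func main() {
	// Events as they might appear across two batches, out of order.
	evs := []gEvent{
		{"GoStart", 1, 1},
		{"GoUnblock", 1, 0},
	}
	for _, e := range order(evs) {
		fmt.Println(e.name) // GoUnblock, then GoStart
	}
}
```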
In the end, this approach successfully reduced the trace overhead from over 3x
to 10-20%.
However, it turns out that the trace parser still requires timestamps to be in order, leading to the infamous "time stamps out of order" error when `cputicks` inevitably doesn't actually emit timestamps in order.
Ps are a purely virtual resource; they don't actually map down directly to
physical CPU cores at all, so it's not even reasonable to assume that the same P
runs on the same CPU for any length of time.
Although we can work around this issue, I propose we try to rely on the
operating system's clock instead and fix up timestamps as a fallback.
The main motivation behind this change is alignment with other tracing systems.
It's already difficult enough to try to internally align the `cputicks` clock, but making it work with clocks used by other traces, such as those produced by distributed tracing systems, is even more difficult.
On Linux, the `clock_gettime` syscall, called `nanotime` in the runtime, takes around 15ns on average when called in a loop, compared to the 10ns of `cputicks`. Trivially replacing all `cputicks` calls in the current tracer with `nanotime` reveals a small performance difference that depends largely on the granularity of each result. Today, the result of `cputicks` is divided by 64. On a 3 GHz processor, this amounts to a granularity of about 20 ns.
Replacing that with `nanotime` and no time division (i.e. nanosecond granularity) results in a 0.22% geomean regression in throughput in the Sweet benchmarks, and the trace size increases by 17%. Dividing `nanotime` by 64, we see approximately no regression and a trace size decrease of 1.7%. Overall, there seems to be little performance downside to using `nanotime`, provided we pick an appropriate timing granularity: what we lose by calling `nanotime`, we can easily regain by sacrificing a small amount of precision. And it's likely that most of the precision below 128 nanoseconds or so is noise, given the average cost of a single call into the Go scheduler (~250 nanoseconds).
To give us plenty of precision, I propose a target timestamp granularity of 64
nanoseconds.
This should be plenty to give us fairly fine-grained insights into Go program
behavior while also keeping timestamps small.
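As a small illustration of what a 64-nanosecond granularity means in practice (a sketch only: `nanotime` is internal to the runtime, so the raw reading is passed in here, and the names are hypothetical):

```go
package main

import "fmt"

const traceTimeDiv = 64 // proposed target granularity in nanoseconds

// traceClockNow converts a raw monotonic clock reading (nanoseconds)
// into a coarse trace timestamp. Dividing by a power of two keeps the
// conversion cheap and the resulting varint-encoded values small.
func traceClockNow(nanotimeNanos int64) int64 {
	return nanotimeNanos / traceTimeDiv
}

// approxNanos recovers an approximate nanosecond offset from a trace
// timestamp; the result is accurate to within traceTimeDiv nanoseconds.
func approxNanos(traceTS int64) int64 {
	return traceTS * traceTimeDiv
}

func main() {
	raw := int64(1_000_123) // ~1ms since some arbitrary epoch
	ts := traceClockNow(raw)
	fmt.Println(ts, approxNanos(ts)) // 15626 1000064: within 64ns of raw
}
```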
As for sequencing, I believe we must retain the per-goroutine sequence number
design as-is.
Relying solely on a timestamp, even a good one, has significant drawbacks.
For one, issues arise when timestamps are identical: the parser needs to decide
on a valid ordering and has no choice but to consider every possible ordering of
those events without additional information.
While such a case is unlikely with nanosecond-level timestamp granularity, it
totally precludes making timestamp granularity more coarse, as suggested in the
previous paragraph.
A sequencing system that's independent of the system's clock also retains the
ability of the tracing system to function despite a broken clock (modulo
returning an error when timestamps are out of order, which again I think we
should just work around).
Even `clock_gettime` might be broken on some machines!
How would a tracing system continue to function despite a broken clock? For that I propose making the trace parser fix up timestamps that don't line up with the partial order. The basic idea is that if the parser discovers a partial order edge between two events A and B, and A's timestamp is later than B's, then the parser applies A's timestamp to B. B's new timestamp is in turn propagated later on in the algorithm in case another partial order edge is discovered between B and some other event C, and those events' timestamps are also out-of-order.
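The fixup idea described above can be sketched as follows (a toy model, not the actual parser: events and partial-order edges are simplified into plain slices):

```go
package main

import "fmt"

// event is a trace event with a timestamp; the names are illustrative.
type event struct {
	name string
	ts   int64
}

// fixupTimestamps walks partial-order edges (from, to) in the order the
// parser discovers them. If an edge's source has a later timestamp than
// its destination, the destination's timestamp is pulled forward; the
// repaired value then propagates through any later edges.
func fixupTimestamps(events []*event, edges [][2]int) {
	for _, e := range edges {
		from, to := events[e[0]], events[e[1]]
		if from.ts > to.ts {
			// The clock misbehaved: B must happen after A per the
			// partial order, so apply A's timestamp to B.
			to.ts = from.ts
		}
	}
}

func main() {
	evs := []*event{
		{"GoUnblock", 100}, // A
		{"GoStart", 90},    // B: timestamp out of order w.r.t. A
		{"GoBlock", 95},    // C: out of order w.r.t. the repaired B
	}
	// Edges discovered by the parser: A before B, B before C.
	fixupTimestamps(evs, [][2]int{{0, 1}, {1, 2}})
	fmt.Println(evs[1].ts, evs[2].ts) // 100 100
}
```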
There's one last issue with timestamps here on platforms for which the runtime
doesn't have an efficient nanosecond-precision clock at all, like Windows.
(Ideally, we'd make use of system calls to obtain a higher-precision clock, like
QPC,
but calls to this API can take upwards of a hundred nanoseconds on modern
hardware, and even then the resolution is on the order of a few hundred
nanoseconds.)
On Windows, we can just continue to use `cputicks` to get the precision and rely on the timestamp fixup logic in the parser, at the cost of being unable to reliably align traces with other diagnostic data that uses the system clock.
Today the Go runtime batches trace events by P. That is, trace events are grouped in batches that all happened, in order, on the same P. A batch is represented in the runtime as a buffer, 32 KiB in size, which is attached to a P when it's actively being written to. Events are written to this P's buffer in their encoded form. This design choice allows most event writes to elide synchronization with the rest of the runtime and linearizes the trace with respect to a P, which is crucial to sequencing without requiring a global total order.
Batching traces by any core scheduling resource (G, M, or P) could in principle have similar properties. At a glance, there are a few reasons Ps make a better choice. One reason is that there are generally a small number of Ps compared to Ms and Gs, minimizing the maximum number of buffers that can be active at any given time. Another reason is convenience. When batching, tracing generally requires some kind of synchronization with all instances of its companion resource type to get a consistent trace. (Think of a buffer which has events written to it very infrequently. It needs to be flushed at some point before the trace can be considered complete, because it may contain critical information needed to sequence events in other buffers.) Furthermore, synchronization when starting a trace is also generally useful, as that provides an opportunity to inspect the state of the world and write down details about it, simplifying validation. Stopping the world is a convenient way to get the attention of all Ps in the Go runtime.
However, there are problems with batching by P that make traces more complex than necessary.
The core of these problems lies with the `GoSysExit` event. This event requires special arrangements in both the runtime and in trace validation to ensure a consistent trace. The difficulty with this event is that it's emitted by a goroutine that was blocked in a syscall and lost its P, and because it doesn't have a P, it might race with the runtime enabling and disabling tracing. Therefore, it needs to wait until it has a P to avoid that race. (Note: the tracing system does have a place to put events when there is no P available, but that doesn't help in this case. The tracing system uses the fact that it stops the world to synchronize with `GoSysExit`, by preventing it from writing an event until the trace system can finish initialization.)
The problem with `GoSysExit` stems from a fundamental mismatch: Ms emit events, but only Ps are preemptible by the Go scheduler. This really extends to any situation where we'd like an M to emit an event when it doesn't have a P, say, for example, when it goes to sleep after dropping its P and not finding any work in the scheduler, and it's one reason why we don't have any M-specific events at all today.
So, suppose we batch trace events by M instead. In the case of `GoSysExit`, it would always be valid to write to a trace buffer, because any synchronization would happen on the M instead of the P, so there would be no races with stopping the world. However, this also means the tracer can't simply stop the world, because stopping the world is built around stopping user Go code, which runs with a P. So, the tracer would have to use something else (more on this later).
Although `GoSysExit` is simpler, `GoSysBlock` becomes slightly more complex in the case where the P is retaken by `sysmon`. In the per-P world, it could be written into the buffer of the taken P, so it didn't need any synchronization. In the per-M world, it becomes an event that happens "at a distance" and so needs a sequence number from the syscalling goroutine for the trace consumer to establish a partial order.
However, we can do better by reformulating the syscall events altogether.
Firstly, I propose always emitting the full time range for each syscall.
This is a quality-of-life choice that may increase trace overheads slightly, but
also provides substantially more insight into how much time is spent in
syscalls.
Syscalls already take ~250 nanoseconds at a baseline on Linux and are unlikely
to get faster in the future for security reasons (due to Spectre and Meltdown)
and the additional event would never contain a stack trace, so writing it out
should be quite fast.
(Nonetheless, we may want to reconsider emitting these events for cgo calls.)
The new events would be called, for example, `GoSyscallBegin` and `GoSyscallEnd`. If a syscall blocks, `GoSyscallEnd` is replaced with `GoSyscallEndBlocked` (more on this later).
Secondly, I propose adding an explicit event for stealing a P.
In the per-M world, keeping just `GoSysBlock` to represent both a goroutine's state transition and the state of a P is not feasible, because we're revealing the fact that fundamentally two threads are racing with one another on the syscall exit. An explicit P-stealing event, for example `ProcSteal`, would be required to be ordered against a `GoSyscallEndBlocked`, with the former always happening before. The precise semantics of the `ProcSteal` event would be those of a `ProcStop`, but performed by another thread. Because this means events can now happen to a P "at a distance," the `ProcStart`, `ProcStop`, and `ProcSteal` events all need sequence numbers.
Note that the naive emission of `ProcSteal` and `GoSyscallEndBlocked` will cause them to race, but the interaction of the two represents an explicit synchronization within the runtime, so the parser can always safely wait for the `ProcSteal` to emerge in the frontier before proceeding. The timestamp order may also not be right, but since we've already committed to fixing broken timestamps in general, this skew will be fixed up by that mechanism for presentation.
(In practice, I expect the skew to be quite small, since it only happens if the
retaking of a P races with a syscall exit.)
Per-M batching might also incur a higher memory cost for tracing, since there are generally more Ms than Ps. I suspect this isn't actually too big of an issue since the number of Ms is usually close to the number of Ps. In the worst case, there may be as many Ms as Gs! However, if we also partition the trace, then the number of active buffers will only be proportional to the number of Ms that actually ran in a given time window, which is unlikely to be an issue. Still, if this does become a problem, a reasonable mitigation would be to simply shrink the size of each trace buffer compared to today. The overhead of the tracing slow path is vanishingly small, so doubling its frequency would likely not incur a meaningful compute cost.
Other than those three details, per-M batching should function identically to the current per-P batching: trace events may already be safely emitted without a P (modulo `GoSysExit` synchronization), so we're not losing anything else with the change.
Instead, however, what we gain is a deeper insight into thread execution. Thread information is currently present in execution traces, but difficult to interpret because it's always tied to P start and stop events. A switch to per-M batching forces traces to treat Ps and Ms orthogonally.
Given all this, I propose switching to per-M batching. The only remaining question to resolve is trace synchronization for Ms.
(As an additional minor consideration, I would also like to propose adding the batch length to the beginning of each batch. Currently (and in the foreseeable future), the trace consumer needs to iterate over the entire trace once to collect all the batches for ordering. We can speed up this process tremendously by allowing the consumer to just collect the information it needs.)
The runtime already contains a mechanism to execute code on every M via signals, `doAllThreadsSyscall`. However, traces have different requirements than `doAllThreadsSyscall`, and I think we can exploit the specifics of these requirements to achieve a more lightweight alternative.
First, observe that getting the attention of every M for tracing is not strictly
necessary: Ms that never use the tracer need not be mentioned in the trace at
all.
This observation allows us to delegate trace state initialization to the M
itself, so we can synchronize with Ms at trace start simply by atomically
setting a single global flag.
If an M ever writes into the trace buffer, then it will initialize its state by emitting an event indicating what it was doing when tracing started. For example, if the first event the M is going to emit is `GoBlock`, then it will emit an additional event before it indicating that the goroutine had been running since the start of the trace (a hypothetical `GoRunning` event).
Disabling tracing is slightly more complex, as the tracer needs to flush every
M's buffer or identify that its buffer was never written to.
However, this is fairly straightforward to do with per-M seqlocks. Specifically, Ms would double-check the `trace.enabled` flag under the seqlock, and anything trying to stop a trace would first disable the flag, then iterate over every M to make sure it observed that its seqlock was unlocked. This guarantees that every M observed or will observe the new flag state.
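The seqlock protocol can be sketched outside the runtime like this (a simplified model with hypothetical names; the real implementation lives on the `m` struct and cannot spin like this on the stop path without also yielding):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// mTraceState is a sketch of per-M trace state.
type mTraceState struct {
	seq atomic.Uint64 // seqlock: odd while the M is writing events
}

var traceEnabled atomic.Bool

// writeEvent is the M's event-writing path: take the seqlock, then
// double-check the enabled flag under it before writing.
func (m *mTraceState) writeEvent(emit func()) {
	m.seq.Add(1) // becomes odd: writer critical section
	if traceEnabled.Load() {
		emit()
	}
	m.seq.Add(1) // becomes even again
}

// stopTrace disables tracing, then waits until every M is observed
// outside its writer critical section. After this returns, every M has
// either seen the flag cleared or finished its in-flight write.
func stopTrace(ms []*mTraceState) {
	traceEnabled.Store(false)
	for _, m := range ms {
		for m.seq.Load()%2 != 0 {
			// spin: M is mid-write and will release the seqlock
		}
	}
}

func main() {
	traceEnabled.Store(true)
	m := &mTraceState{}
	wrote := 0
	m.writeEvent(func() { wrote++ })
	stopTrace([]*mTraceState{m})
	m.writeEvent(func() { wrote++ }) // flag is down: nothing written
	fmt.Println(wrote)               // 1
}
```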
There's just one problem with all this: an M might be running and yet never emit an event. This case is critical to capturing system dynamics.
As far as I can tell, there are three cases here: the M is running user code without emitting any trace events (e.g. a tight loop), the M is in a system call or C code the whole time, or the M is `sysmon`, a special thread that always runs without a P.
The first case is fairly easy to deal with: because it's running with a P, we can just preempt it and establish an invariant that any preemption always emits an event if the M's trace state has not been initialized.
The second case is a little more subtle, but luckily not very complex to implement. The tracer can identify whether the M has a G blocked in a syscall (or, equivalently, has called into C code) just before disabling tracing globally by checking `readgstatus(m.curg) == _Gsyscall` on each M's G and writing the result down. If the tracer can see that the M never wrote to its buffer after it disables tracing, it can safely conclude that the M was still in a syscall at the moment tracing was disabled, since otherwise it would have written to the buffer.
Note that since we're starting to read `gp.atomicstatus` via `readgstatus`, we now need to ensure consistency, from the tracer's perspective, between the G's internal status and the events we emit representing those status changes. Thus, we need to make sure we hold the M's trace seqlock across any `gp.atomicstatus` transitions, which will require a small refactoring of the runtime and the tracer.
For the third case, we can play basically the same trick as in the second case, but instead check whether sysmon is blocked on a note. If it's not, it's running. And if its buffer is empty when tracing is disabled, meaning it never emitted any events (e.g. on stopping to block on a note), then the tracer can assume it was running the whole time.
A big change with this synchronization mechanism is that the tracer no longer obtains the state of all goroutines when a trace starts, only of those that run or execute syscalls. The remedy is simply to add that information back in.
Since the only cases not covered are goroutines in `_Gwaiting` or `_Grunnable`, the tracer can set the `_Gscan` bit on the status to ensure the goroutine can't transition out. At that point, it collects a stack trace and writes out a status event for the goroutine into a buffer that isn't attached to any particular M. To avoid ABA problems with a goroutine transitioning out of `_Gwaiting` or `_Grunnable` and then back in, we also need an atomically-accessed flag for each goroutine that indicates whether an event has been written for that goroutine yet.
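The per-goroutine flag amounts to a compare-and-swap claim, sketched here (hypothetical names; the real flag would also fold in the generation number so it resets at each partition boundary):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gTraceState sketches the per-goroutine trace flag.
type gTraceState struct {
	statusTraced atomic.Bool
}

// emitStatusOnce returns true only for the caller that wins the race to
// write this goroutine's status event. If the goroutine transitions out
// of _Gwaiting/_Grunnable and back (the ABA case), the second attempt
// loses the CAS and no duplicate status event is written.
func (g *gTraceState) emitStatusOnce() bool {
	return g.statusTraced.CompareAndSwap(false, true)
}

func main() {
	var g gTraceState
	fmt.Println(g.emitStatusOnce(), g.emitStatusOnce()) // true false
}
```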
The result of all this synchronization is that the tracer only lightly perturbs application execution to start and stop traces. In theory, a stop-the-world is no longer required at all, but in practice (especially with partitioning), a brief stop-the-world to begin tracing dramatically simplifies some of the additional synchronization necessary. Even so, starting and stopping a trace will now be substantially more lightweight than before.
The structure of the execution trace's binary format limits what you can do with traces.
In particular, to view just a small section of the trace, the entire trace needs to be parsed, since a batch containing an early event may appear at the end of the trace. This can happen if, for example, a P that stays relatively idle throughout the trace wakes up once at the beginning and once at the end, but never writes enough to the trace buffer to flush it.
To remedy this, I propose restructuring traces into a stream of self-contained partitions, called "generations." More specifically, a generation is a collection of trace batches that represents a complete trace. In practice this means each generation boundary is a global buffer flush. Each trace batch will have a generation number associated with it, and generations will not interleave: a generation boundary is identified in the trace wherever a batch with a different generation number appears.
The size of each generation will be roughly constant: new generation boundaries will be created when either the partition reaches some threshold size or some maximum amount of wall-clock time has elapsed. The exact size threshold and wall-clock limit will be determined empirically, though my general expectation for the size threshold is around 16 MiB.
Generation boundaries will add a substantial amount of implementation complexity, but the cost is worth it, since it'll fundamentally enable new use-cases.
The trace parser will need to be able to "stitch" together events across generation boundaries, and for that I propose the addition of a new `GoStatus` event. This event is a generalization of the `GoInSyscall` and `GoWaiting` events. It will be emitted the first time a goroutine is about to be mentioned in a trace, and will carry a state enum argument indicating what state the goroutine was in before the next event to be emitted. To couple events more tightly to states, the state emitted for a goroutine will be derived from the event about to be emitted (except for the "waiting" and "syscall" cases, where it's explicit).
The trace parser will be able to use these events to validate continuity across
generations.
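The continuity check can be sketched as follows (a toy model: the state names and maps are illustrative stand-ins for the parser's real per-goroutine bookkeeping):

```go
package main

import "fmt"

// stitch checks that the status each goroutine advertises via GoStatus
// at the start of a generation matches the state implied by the
// previous generation's final event for that goroutine.
func stitch(prevFinal, status map[uint64]string) error {
	for goid, st := range status {
		if prev, ok := prevFinal[goid]; ok && prev != st {
			return fmt.Errorf("goroutine %d: generation starts %q but previously ended %q",
				goid, st, prev)
		}
	}
	return nil
}

func main() {
	prev := map[uint64]string{1: "running", 2: "waiting"}
	next := map[uint64]string{1: "running", 2: "waiting"}
	fmt.Println(stitch(prev, next)) // <nil>: generations stitch cleanly
}
```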
This global buffer flush can be implemented as an extension to the aforementioned M synchronization design by replacing the `enabled` flag with a generation counter and doubling up much of the trace state. The generation counter will be used to index into the trace state, allowing us to dump and then discard old trace state, preparing it for the next generation while the current generation is actively being generated. In other words, it will work akin to ending a trace, but for just one half of the trace state.
One final subtlety created by partitioning is that now we may no longer have a
stop-the-world in between complete traces (since a generation is a complete
trace).
This means some events may be active when a trace is starting, and we'll lose
that information.
For user-controlled event types (user tasks and user regions), I propose doing
nothing special.
It's already true that if a task or region was in progress we lose that
information, and it's difficult to track that efficiently given how the API is
structured.
For range event types we control (sweep, mark assist, STW, etc.), I propose
adding the concept of "active" events which are used to indicate that a
goroutine or P has been actively in one of these ranges since the start of a
generation.
These events will be emitted alongside the `GoStatus` event for a goroutine. (This wasn't mentioned above, but we'll also need a `ProcStatus` event, so we'll also emit these "active" events as part of that process.)
Since we're modifying the trace anyway, this is a good opportunity to clean up the event types, making them simpler to implement, simpler to parse and understand, and more uniform overall.
The three biggest changes I would like to propose are
- A uniform naming scheme,
- Separation of the reasons that goroutines might stop or block from the event type.
- Placing strings, stacks, and CPU samples in their own dedicated batch types.
Firstly, for the naming scheme, I propose the following rules (which are not strictly enforced, only guidelines for documentation):
- Scheduling resources, such as threads and goroutines, have events related to them prefixed with the related resource (i.e. "Thread," "Go," or "Proc").
- Scheduling resources have "Create" and "Destroy" events.
- Scheduling resources have generic "Status" events used for indicating their state at the start of a partition (replaces `GoInSyscall` and `GoWaiting`).
- Scheduling resources have "Start" and "Stop" events to indicate when that resource is in use. The connection between resources is understood through context today.
  - Goroutines also have "Block" events, which are like "Stop" but require an "Unblock" before the goroutine can "Start" again.
  - Note: Ps are exempt since they aren't a true resource; in practice they're more like a best-effort semaphore. There's only "Start" and "Stop" for them.
- Events representing ranges of time come in pairs, with the start event having the "Begin" suffix and the end event having the "End" suffix.
- Events have a prefix corresponding to the deepest resource they're associated with.
Secondly, I propose moving the reasons why a goroutine resource might stop or block into an argument. This choice is useful for backwards compatibility, because the most likely change to the trace format at any given time will be the addition of more detail, for example in the form of a new reason to block.
Thirdly, I propose placing strings, stacks, and CPU samples in their own dedicated batch types. This, in combination with batch sizes in each batch header, will allow the trace parser to quickly skim over a generation and find all the strings, stacks, and CPU samples. This makes certain tasks faster, but more importantly, it simplifies parsing and validation, since the full tables are just available up-front. It also means that the event batches are able to have a much more uniform format (one byte event type, followed by several varints) and we can keep all format deviations separate.
Beyond these big three, there are a myriad of additional tweaks I would like to propose:
- Remove the `FutileWakeup` event, which is no longer used (even in the current implementation).
- Remove the `TimerGoroutine` event, which is no longer used (even in the current implementation).
- Rename `GoMaxProcs` to `ProcsChange`.
  - This needs to be written out at trace startup, as before.
- Redefine `GoSleep` as a `GoBlock` and `GoUnblock` pair.
- Break out `GoStartLabel` into a more generic `GoLabel` event that applies a label to a goroutine the first time it emits an event in a partition.
- Change the suffixes of pairs of events representing some activity to be `Begin` and `End`.
- Ensure all GC-related events contain the string `GC`.
- Add sequence numbers to events where necessary:
  - `GoStart` still needs a sequence number.
  - `GoUnblock` still needs a sequence number.
- Eliminate the `{GoStart,GoUnblock,Go}Local` events for uniformity.
- Because traces are no longer bound by stopping the world, and can indeed now start during the GC mark phase, we need a way to identify that a GC mark is currently in progress. Add the `GCMarkActive` event.
- Eliminate inline strings and make all strings referenced from a table for uniformity.
  - Because of partitioning, the size of the string table is unlikely to become large in practice.
  - This means all events can have a fixed size (expressed in terms of the number of arguments) and we can drop the argument count bits from the event type.
A number of these tweaks above will likely make traces bigger, mostly due to additional information and the elimination of some optimizations for uniformity.
One idea to regain some of this space is to compress goroutine IDs in each generation by maintaining a lightweight mapping. Each time a goroutine would emit an event that needs the current goroutine ID, it'll check a g-local cache. This cache will contain the goroutine's alias ID for the current partition and the partition number. If the goroutine does not have an alias for the current partition, it increments a global counter to acquire one and writes down the mapping of that counter to its full goroutine ID in a global table. This table is then written out at the end of each partition. But this is optional.
As discussed in the overview, this design is intended to be implemented in parts. Some parts are easy to integrate into the existing tracer, such as the switch to nanotime and the traceback-related improvements. Others, such as per-M batching, partitioning, and the change to the event encoding, are harder.
While in theory all could be implemented separately as an evolution of the existing tracer, it's simpler and safer to hide the changes behind a feature flag, like a `GOEXPERIMENT` flag, at least to start with.
Especially for per-M batching and partitioning, it's going to be quite
complicated to try to branch in all the right spots in the existing tracer.
Instead, we'll basically have two trace implementations temporarily living
side-by-side, with a mostly shared interface.
Where they differ (and there are really only a few spots), the runtime can explicitly check for the `GOEXPERIMENT` flag.
This also gives us an opportunity to polish the new trace implementation.
The implementation will also require updating the trace parser to make tests work. I suspect this will result in what is basically a separate trace parser that's invoked when the new header version is identified. Once we have a basic trace parser that exports the same API, we can work on improving the internal trace parser API to take advantage of the new features described in this document.
This document presents a large change to the existing tracer under the banner of "traces at scale" which is a bit abstract. To better understand what that means, let's explore a few concrete use-cases that this design enables.
Because the trace in this design is partitioned, it's now possible for the runtime to carry around the most recent trace partition. The runtime could then expose this partition to the application as a snapshot of what the application recently did. This is useful for debugging at scale, because it enables the application to collect data exactly when something goes wrong. For instance:
- A server's health check fails. The server takes a snapshot of recent execution and puts it into storage before exiting.
- Handling a request exceeds a certain latency threshold. The server takes a snapshot of the most recent execution state and uploads it to storage for future inspection.
- A program crashes when executing in a deployed environment and leaves behind a core dump. That core dump contains the recent execution state.
Here's a quick design sketch:
```go
package trace

// FlightRecorder represents a flight recording configuration.
type FlightRecorder struct { ... }

// NewFlightRecorder creates a new flight recording configuration.
func NewFlightRecorder() *FlightRecorder

// Start begins process-wide flight recording. Only one FlightRecorder
// may be started at once. If another FlightRecorder is started, this
// function will return an error.
//
// This function can be used concurrently with Start and Stop.
func (fr *FlightRecorder) Start() error

// TakeSnapshot collects information about the execution of this program
// from the last few seconds and writes it to w. This FlightRecorder must
// have been started to take a snapshot.
func (fr *FlightRecorder) TakeSnapshot(w io.Writer) error

// Stop ends process-wide flight recording. Stop must be called on the
// same FlightRecorder that started recording.
func (fr *FlightRecorder) Stop() error
```
- The runtime accumulates trace buffers as usual, but when it makes a partition, it puts the trace data aside as it starts a new one.
- Once the new partition is finished, the previous is discarded (buffers are reused).
- At any point, the application can request a snapshot of the current trace
state.
- The runtime immediately creates a partition from whatever data it has, puts that together with the previously-accumulated partition, and hands it off to the application for reading.
- The runtime then continues accumulating trace data in a new partition while the application reads the trace data.
- The application is almost always guaranteed two partitions, with one being as large as a partition would be in a regular trace (the second one is likely to be smaller).
- Support for flight recording in the `/debug/pprof/trace` endpoint.
Today some power users collect traces at scale by setting a very modest sampling rate, e.g. 1 second out of every 1000 seconds that a service is up. This has resulted in collecting a tremendous amount of useful data about execution at scale, but this use-case is currently severely limited by trace performance properties.
With this design, it should be reasonable to increase the sampling rate by an order of magnitude and collect much larger traces for offline analysis.
Because the new design allows for partitions to be streamed and the trace encoding is much faster to process, it's conceivable that a service could process its own trace continuously and aggregate and filter only what it needs. This online processing avoids the need to store traces for future inspection.
The kinds of processing I imagine for this includes:
- Detailed task latency breakdowns over long time horizons.
- CPU utilization estimates for different task types.
- Semantic processing and aggregation of user log events.
- Fully customizable flight recording (at a price).
Tooling can be constructed that interleaves Go execution traces with traces produced via:
```
perf sched record --clockid CLOCK_MONOTONIC <command>
```
JFR or "Java Flight Recorder" is an execution tracing system for Java applications. JFR is highly customizable with a sophisticated wire format. It supports very detailed configuration of events via an XML configuration file, including the ability to enable/disable stack traces per event and set a latency threshold an event needs to exceed in order to be sampled (e.g. "only sample file reads if they exceed 20ms"). It also supports custom user-defined binary-encoded events.
By default, JFR offers a low overhead configuration (<1%) for "continuous tracing" and slightly higher overhead configuration (1-2%) for more detailed "profile tracing." These configurations are just default configuration files that ship with the JVM.
Continuous tracing is special in that it accumulates data into an internal global ring buffer which may be dumped at any time. Notably this data cannot be recovered from a crashed VM, though it can be recovered in a wide variety of other cases. Continuous tracing is disabled by default.
The JFR encoding scheme is quite complex, but achieves low overhead partially through varints. JFR traces are also partitioned, enabling scalable analysis.
In many ways the existing Go execution tracer looks a lot like JFR, just without partitioning, and with a simpler encoding.
KUTrace is a system-wide execution tracing system for Linux. It achieves low overhead, high resolution, and staggering simplicity by cleverly choosing to trace on each kernel transition (the "goldilocks" point in the design space, as the author puts it).
It uses a fairly simple 8-byte-word-based encoding scheme to keep trace writing fast, and exploits the very common case of a system call returning quickly (>90%) to pack two events into each word when possible.
Given that it targets a different domain (whole-system kernel tracing), this design didn't take too many of its insights.
CTF or "Common Trace Format" is more of a meta tracing system. It defines a binary format, a metadata format (to describe that binary format), as well as a description language. In essence, it contains all the building blocks of a trace format without defining one for a specific use-case.
Traces in CTF are defined as a series of streams of events. Events in a stream are grouped into self-contained packets. CTF contains many of the same concepts that we define here (packets are batches; the streams described in this document are per-thread, etc.), though it largely leaves interpretation of the trace data up to the one defining a CTF-based trace format.
The trace format relies heavily on LEB128-encoded integers. While this choice makes the trace quite compact (achieving 4-6 bytes/event, with some tricks to keep each encoded integer relatively small), it comes at a cost to decoding speed since LEB128 is well-known for being relatively slow to decode compared to similar integer encodings. (Note: this is only a hunch at present; this needs to be measured.) (The cost at encoding time is dwarfed by other costs, like tracebacks, so it's not on the critical path for this redesign.)
This section proposes a possible future trace encoding that is simpler, but without any change would produce a much larger trace. It then proposes two possible methods of compressing the trace from there.
To start with, let's redefine the format as a series of 4-byte words of data.
Each event has a single header word that includes the event type (8 bits), space for a 4-bit event reason, and the timestamp delta (20 bits). Note that every event requires a timestamp. At a granularity of 64 nanoseconds, a 20-bit delta gives us a timestamp range of about 67 milliseconds, which is more than enough for most cases. In the case where we can't fit the delta in 20 bits, we'll emit a new "timestamp" event which is 2 words wide, containing the timestamp delta from the start of the partition. This gives us a 7-byte delta, which should be plenty for any trace.
Each event's header is then followed by a number of 4-byte arguments. The minimum number of arguments is fixed per the event type, and will be made evident in self-description. A self-description table at the start of the trace could indicate (1) whether there are a variable number of additional arguments, (2) which argument indicates how many there are, and (3) whether any arguments are byte arrays (integer followed by data). Byte array data lengths are always rounded up to 4 bytes. Possible arguments include, for example, per-goroutine sequence numbers, goroutine IDs, thread IDs, P IDs, stack IDs, and string IDs. Within a partition, most of these trivially fit in 32 bits:
- Per-goroutine sequence numbers easily fit provided each partition causes a global sequence number reset, which is straightforward to arrange.
- Thread IDs come from the OS as `uint32` on all platforms we support.
- P IDs trivially fit within 32 bits.
- Stack IDs are local to a partition and as a result trivially fit in 32 bits. Assuming a partition is O(MiB) in size (so, O(millions) of events), even if each event has a unique stack ID, it'll fit.
- String IDs follow the same logic as stack IDs.
- Goroutine IDs, when compressed as described in the event cleanup section, will easily fit in 32 bits.
The full range of possible encodings for an event follows from the rules above.
This format change, before any compression, will result in an encoded trace that is 2-3x larger. Plugging the size of each proposed event above into the breakdown of a trace such as this 288 MiB one produced by felixge@ reveals an increase in encoded trace size of a little over 2x: the aforementioned 288 MiB trace grows to around 640 MiB, not including additional timestamp events, the batch header in this scheme, or any tables at the end of each partition.
This increase in trace size is likely unacceptable since there are several cases where holding the trace in memory is desirable. Luckily, the data in this format is very compressible.
This design notably doesn't handle inline strings. The easy thing to do would be to always put them in the string table, but then they're held onto until at least the end of a partition, which isn't great for memory use in the face of many non-cacheable strings. This design would have to be extended to support inline strings, perhaps rounding up their length to a multiple of 4 bytes.
One possible route from here is to simply choose a different integer compression scheme. There exist many fast block-based integer compression schemes, ranging from Group Varint Encoding (GVE) and PrefixVarint to SIMD-friendly bit-packing schemes.
Inline strings would likely have to be excepted from this compression scheme, and would likely need to be sent out-of-band.
Another alternative compression scheme is to have none, and instead expect the trace consumer to perform their own compression, if they want the trace to be compressed. This certainly simplifies the runtime implementation, and would give trace consumers a choice between trace encoding speed and trace size. For a sufficiently fast compression scheme (perhaps LZ4), it's possible this could rival integer compression in encoding overhead.
More investigation on this front is required.
Traces already contain CPU profile samples, but are missing some information compared to what Go CPU profile pprof protos contain. We should consider bridging that gap, such that it is straightforward to extract a CPU profile from an execution trace.
This might look like just emitting another section to the trace that contains basically the entire pprof proto. This can likely be added as a footer to the trace.