proposal: runtime/trace: flight recording #63185
Comments
I agree with all of the advantages of having flight recording in the runtime. As for

I think that program logic will be the far more useful trigger for the majority of use cases, and certainly easier to make use of. On top of that, this flight recorder will be much more straightforward to integrate with existing telemetry solutions, compared to external solutions, which will either not be designed for that (Gotraceui) or will have to be written first.
Just to say, the most frequent "I wish I knew what happened just before" for me is OOM, which isn't (AFAIK) something you can trigger on once the kernel makes its decision. Maybe we could trigger on hitting GOMEMLIMIT?
Nice! How can we get CPU profile samples to show up in the flight-recorder data stream? The existing APIs for execution traces and CPU profiles have a start time and an end time, so those two pair well together. For this it seems that we'd need a way to ask for

The panics in

You've said that the data from
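For context, the way the two currently pair up is by sharing the same start/stop window, roughly like this (file names and the sleep are placeholders for real application logic):

```go
package main

import (
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

// captureWindow records an execution trace and a CPU profile over the same
// window, so the CPU samples can be correlated with the trace events.
func captureWindow(d time.Duration) error {
	tf, err := os.Create("trace.out")
	if err != nil {
		return err
	}
	defer tf.Close()
	pf, err := os.Create("cpu.pprof")
	if err != nil {
		return err
	}
	defer pf.Close()

	if err := pprof.StartCPUProfile(pf); err != nil {
		return err
	}
	if err := trace.Start(tf); err != nil {
		pprof.StopCPUProfile()
		return err
	}
	time.Sleep(d) // stand-in for the interesting part of the program
	trace.Stop()
	pprof.StopCPUProfile()
	return nil
}

func main() {
	if err := captureWindow(5 * time.Second); err != nil {
		panic(err)
	}
}
```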
It turns out that's quite difficult to do because Linux provides no opportunity to dump any information when an OOM occurs; the OOM killer simply SIGKILLs processes. Unfortunately

With that being said, there are still some best-effort things we might be able to do. Programs that don't use

Either way though, it's still best-effort. Unless the snapshotting fully stops the world, the program may continue executing and OOM before the trace actually gets fully dumped. (Even then, it's still possible (but less likely) that it OOMs before the runtime successfully stops every running goroutine.)

Once upon a time there was a patch proposed to Linux to allow for halting a process when it hit container limits, so that another process in a tiny memory reservation could inspect it. One could imagine that if this other process created a core dump of the halted process, we could write some code to extract any active trace buffers from the core dump into a trace.

I'm not sure where this leaves us. Perhaps it suggests that
If you happen to have CPU profiling running, it'll just work, but you make a good point that there's no good way to have it included all the time (i.e. no intention of producing a CPU profile). It seems to me like that should maybe be another option, either on the
Hm, that's a good point. I'll update the API to return an error on
There's currently no list of all the Ms per generation because we don't have M events, but yes, everything within a partition is namespaced by that partition's "generation number." The proposed trace parsing API exposes partition changes as a

FWIW, the trace parsing API already does this exact kind of "stitching." Every partition is an entirely self-contained trace, which means all goroutines (and Ps) and their statuses get named in every partition. The trace parser uses this information to validate the stitching: new partitions' goroutine statuses need to match where that goroutine left off in the previous partition.
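If it helps make that concrete, here's a rough sketch that walks a trace with the golang.org/x/exp/trace parser and counts generation boundaries (the Sync events); names should match the experimental package, but double-check against its docs:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	f, err := os.Open("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := trace.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}

	// Count Sync events, which mark the boundaries between self-contained
	// generations (partitions) in the trace.
	syncs := 0
	for {
		ev, err := r.ReadEvent()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if ev.Kind() == trace.EventSync {
			syncs++
		}
	}
	fmt.Println("generation boundaries:", syncs)
}
```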
This proposal has been added to the active column of the proposals project
I think this is waiting on an implementation and experience from using that implementation.
That's correct. I plan to have one in golang.org/x/exp/trace soon.
I made a mistake in the commit message so gopherbot couldn't connect the CL to this issue, but the experimental API has now landed via https://go.dev/cl/550257. It is available in golang.org/x/exp/trace for Go 1.22 only. There are a few caveats with this implementation that will not be true with a real runtime implementation.
I don't think any of these are significant enough to detract from the usefulness of the experiment, but I wanted to bring it up in case one of these does become an issue. We can also explore ways to improve the experiment to make it more representative, if one of them is indeed a problem. Please give it a try!
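For anyone kicking the tires, basic usage looks roughly like this (the output path and error handling are illustrative; check the package docs for the exact method set):

```go
package main

import (
	"bytes"
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	// Set up the flight recorder and leave it running for the life of the program.
	fr := trace.NewFlightRecorder()
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	// ... the application does its work here ...

	// When something interesting happens, snapshot the last few seconds of
	// execution and write it out for later analysis.
	var buf bytes.Buffer
	if _, err := fr.WriteTo(&buf); err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("flight.trace", buf.Bytes(), 0o644); err != nil {
		log.Fatal(err)
	}
}
```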
The

One limitation I encountered is that the

As for preserving traces after OOM, one possible approach could be to use a separate agent process with a shared memory buffer to write the traces into. That agent could then detect that the main process crashed and write the contents of the memory buffer into a file or other storage. However, this would require keeping all the flight recorder data in a single pre-allocated buffer.
Thanks for trying it out!
Yeah, that's a limitation of the experiment which lives out-of-tree for convenience, but it technically shouldn't be necessary if it's implemented in the runtime. I'd hoped to allow one of each type of consumer (one active flight recorder, one active |
@mknyszek, we've discussed allowing the caller to control the set of events that the trace includes. That would enable runtime maintainers to use the execution trace to grant access to events that are niche and expensive, or in the other direction would enable low-overhead access to a subset of events. Maybe this is how we build/improve delta profiles for #67942, and how we give access to goroutine tags/labels in monotonic profiles (heap, mutex, etc) for #23458. If that's the direction we go, I think we'll want either a/ multiple
Multiple

The only way I can imagine this working efficiently is via something like https://go.dev/cl/594595. TL;DR: We have some light infrastructure for emitting "experimental" events now, and this CL formalizes that a bit better by allowing us to split the event stream. It's at least a step toward that.

If we have separate buffer streams, you could imagine a new registration consists of an

It's certainly not off the table, but I think we're a little ways off still.
I'd like to point out that in Linux it's now possible to be actively informed by the kernel, through an epoll event, when a control group is under memory pressure, via the PSI subsystem: https://systemd.io/MEMORY_PRESSURE/. This could be used as a signal to dump a trace and/or to trigger GC proactively.
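For reference, hooking that signal up from Go might look roughly like the following; the cgroup path, thresholds, and callback are illustrative, and the PSI trigger interface needs a reasonably recent kernel:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// watchMemoryPressure arms a PSI trigger on a cgroup v2 memory.pressure file
// and invokes onPressure each time the kernel reports the trigger firing.
func watchMemoryPressure(onPressure func()) error {
	f, err := os.OpenFile("/sys/fs/cgroup/memory.pressure", os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	// Fire when "some" tasks are stalled on memory for >=150ms within a 1s window.
	if _, err := f.WriteString("some 150000 1000000"); err != nil {
		return err
	}

	fds := []unix.PollFd{{Fd: int32(f.Fd()), Events: unix.POLLPRI}}
	for {
		n, err := unix.Poll(fds, -1)
		if err == unix.EINTR {
			continue
		}
		if err != nil {
			return err
		}
		if n > 0 && fds[0].Revents&unix.POLLERR != 0 {
			return fmt.Errorf("PSI triggers not supported on this kernel/file")
		}
		if n > 0 && fds[0].Revents&unix.POLLPRI != 0 {
			onPressure()
		}
	}
}

func main() {
	log.Fatal(watchMemoryPressure(func() {
		log.Println("memory pressure: snapshot the flight recorder and/or run a GC here")
	}))
}
```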
This has been "waiting for implementation experience". Do we have enough experience to think about whether to add this to Go 1.24?
We've had some experience with flight recording inside Google, and we've also gotten some good feedback from @martin-sucha. I think a few things are clear from our experiences so far:
Much of this can be refined and improved once we have an implementation that lives fully inside the runtime. Lastly, from my point of view, I don't know what other feedback we would be waiting on at this point. CC @cagedmantis
Have all remaining concerns about this proposal been addressed? The API is at #63185 (comment) in the "Design" heading.
Those comments on

As a first quirk, assume an organization has site-wide use of a package that makes periodic calls to

As for the two types of runtime/trace usage interfering in the other direction: how do the results of using the package-scoped tracer change in response to calling those methods on a
That is a very good point. This is something that hasn't come up in practice because it's not actually possible with the prototype to have both forms going simultaneously. Relaxing the requirement of "all tracing must be stopped and started" is probably not too hard to add to the implementation. As long as there is only a single flight recorder, and a single window size, then we can define that window size once
I see what you're getting at, and I think the sketch implementation described in the original post couples the SetPeriod/SetSize parameters with the rate at which

One way to resolve this, to decouple SetPeriod/SetSize from anything observable with

(Note: There currently exists a runtime
Looks good to me, thanks.
I think that sounds good, though I'm not familiar enough with the mechanics to know what else that would imply. For comparison, runtime/pprof's CPU profiler involves a goroutine waking every 100 ms, and that leads to a sometimes 200 ms delay in stopping those profiles. Is there potential for that style of delay here? When the flight recorder is not in use, does it have an implication on the volume of data that the runtime would stockpile before handing off to the

Say we have a very busy app, where typical transactions are under 200 µs and the slow ones we want to find take 5 ms. Maybe it's something like Memcached/Redis, or MySQL/Postgres. If an operator configures the flight recorder for that app and requests a small memory limit (maybe equivalent to what the app would generate over 50 ms), would they end up with an entire "minimum granularity of 1 second" window resident in memory? I think that's what you're implying, and I think it's ok. Maybe an experience report from an owner of an app like that would inform the decision on that window size or could help to justify leaving the GODEBUG override. (My own experience has been with apps where 1 second is long enough to see most problems, and short enough to not cause new ones.)
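To put rough, purely illustrative numbers on the 50 ms scenario above: if such an app emits on the order of 1 MiB of trace data per second, a budget equivalent to 50 ms of data is about 50 KiB, while a 1-second minimum window keeps roughly 1 MiB resident, about 20 times the requested budget.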
No. If tracing stops early, from either perspective, we can always call
Nope! In theory, the buffers can be handed off immediately to the writer, the question just becomes "who frees the buffer." If the flight recorder is active, then it may still "hold a reference" to the buffer. (I imagine we'll need some gentle reference counting scheme, or something, to get this right, but it's doable.)
I also think that's what I'm implying, but I'm getting a bit thrown off by "small memory limit". 😅 Do you mean a small argument to
This discussion seems to be not relevant given the "only one tracer" limitation. I think we agree that if there are multiple tracers, then SetPeriod/SetSize apply to the one tracer they are called on, and not the others. In discussion I was confused about "period" thinking it had something to do with sampling, which obviously it cannot. SetDuration might be a better name.

It should also be made clearer what happens if SetDuration/SetSize conflict. Does the bigger one win? And how does anyone know what kinds of sizes to pass? There is no documentation of the relationship between duration and bytes. Maybe the API should be SetMaxDuration and SetMaxSize? Or SetMinDuration and SetMaxSize...? Or...? Should they be dropped entirely? Users need some way to understand how to use these, and it's unclear how to use them at the moment.
To expand on what this would mean: the buffer would be kept as small as possible to satisfy the min duration, but if that exceeds max size, then max size wins over the duration. That seems consistent with your comment about SetSize "limiting the damage" of SetDuration. If you just want as much trace as can fit in, say, 16 MiB, you'd SetMaxSize to 16 MiB and SetMinDuration to (effectively) infinity.
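In code, that configuration would read something like the following (SetMaxSize and SetMinDuration are the hypothetical renamed knobs from this discussion, not methods that exist today):

```go
// Hypothetical knobs from this discussion; not part of any released API.
fr := trace.NewFlightRecorder()
fr.SetMaxSize(16 << 20)                         // keep at most ~16 MiB of trace data
fr.SetMinDuration(time.Duration(math.MaxInt64)) // effectively no duration floor
if err := fr.Start(); err != nil {
	log.Fatal(err)
}
```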
Should there be some way for the application to detect that the max size limit is capping the duration? A runtime metric tracking the current duration or size of the trace data in the buffers perhaps? Would such a metric be useful at all?
I'll just note that the original docs for
(EDIT: That being said, I'm all for clearer names and stricter semantics, personally.)
I support @aclements' description of the semantics for

One caveat I would like to add is that

I would also like to mention in the documentation that there is an internal minimum for the minimum duration, and that the setting might be ignored. If the minimum is too low, a simple implementation (call

Finally, I agree we should add some guidance on how to set max size and min duration to the documentation. For min duration, how about:
For max size:
If this sounds OK overall, I'll workshop it a bit more and put all this back up into the original proposal. CC @cagedmantis
Based on this description of the duration limit and the one in #63185 (comment), it seems like we're describing a maximum duration. I.e. the flight recorder will hold on to no longer than the given duration of recent activity.

FWIW I don't think a minimum duration makes sense for a flight recorder. If we have a minimum duration the buffer should hold, surely holding more than that duration would be acceptable? Meaning we'd have an unlimited flight recording if a size limit is not set? That said, if I replace "minimum duration" with "maximum duration" then the descriptions and guidance in #63185 (comment) seem reasonable to me.

In a continuous profiling context, we would likely use both. We'd have a duration limit of ~1 minute since that's how often we upload data, and it'd be a waste to upload substantially overlapping traces. And we'd have a size limit to prevent OOMing user programs, and keep bandwidth & storage costs under control.

For some prior art here, the Java JDK Flight Recorder (JFR) has
I see your point, but I can still see the argument both ways. The window size is, IMO, a bit more of a target than it is a strict minimum or maximum. It's much fuzzier to me than
I think where I got confused with Austin's comment was the relationship between duration and size in bytes. Rereading it, I think Austin is suggesting that
I had pictured a
The original implementation I had in mind would always keep 2 generations around (the currently-accumulating one and the previous one), but adjust the rate at which generations are produced based on the duration and size parameters. It would track how many bytes live in buffers and force the creation of a new generation in response to

One downside is that this implementation strategy has visible effects on concurrent calls to

Your proposed implementation is also reasonable, and it might be helpful to decouple the internal generation policy from the flight recorder policy. But it also means the duration and size parameters will be less precise, always rounded up to the minimum granularity the runtime exposes. I believe it would be backwards compatible (since the wire format doesn't follow compatibility guarantees) to change the implementation if one or the other is a mistake, so we can experiment and start with whatever's easier.
Does this mean that during the |
@martin-sucha No. In my mind,
In the diagnostics meeting today, we discussed
I think the API in this proposal is not quite right. Having different configurations for different subscribers of trace data is going to be very complex to implement in the general case. It's going to be tricky for duration and size, but even worse if we consider things like opting into specific events in the future (filtering the event stream effectively requires parsing it, which is much slower than emitting it). I would like to propose an alternative API that makes these knobs,
This API is substantially different, but gets at the heart of what the size and duration parameters really represent, and makes them global settings, so there's no conflict between subscribers. It also presents a coherent semantics for these durations with respect to regular tracing. It also aligns even more closely with the existing proposal for CPU profiling configuration (#42502). Lastly, it allows for multiple trace subscribers (that is, the equivalent of multiple calls to

My one concern with this API is that the sync period is arguably an implementation detail, but I think we've reached a point where it cannot be. Without the concept of a synchronization point in the trace, many important properties of the new trace format simply cannot be relied upon (like the fact that large traces can be parsed using a reasonable amount of memory). While most of the time users do not need to be aware of this at all, for those working with and integrating with traces deeply, these details matter a lot.

If this API sketch is of interest to those on this issue, I will file a separate proposal to define:
Extending the tracer for multiple subscribers is actually not that hard to implement. The trace reader goroutine just needs to spit trace data out into multiple io.Writers, effectively acting as a MultiWriter itself. The advantage of the subscription mechanism is that different tracing sources can come in and out concurrently at different times. This means that net/http/pprof need not return an error if someone tries to grab a trace manually while, say, a diagnostics suite occasionally performs continuous monitoring duties by enabling tracing briefly. But we can also cut down this API to allow only one subscriber. That would be OK with me too.
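Conceptually the fan-out is no more involved than io.MultiWriter; a sketch, ignoring subscriber lifetimes and locking:

```go
package tracefan

import "io"

// fanOut drains flushed trace batches and copies each one to every
// subscriber, behaving like an io.MultiWriter over the trace byte stream.
func fanOut(batches <-chan []byte, subscribers []io.Writer) error {
	w := io.MultiWriter(subscribers...)
	for b := range batches {
		if _, err := w.Write(b); err != nil {
			// A real implementation would drop only the failing subscriber.
			return err
		}
	}
	return nil
}
```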
How does net/http/pprof end up sharing the same
I don't have a feel for the cost of EventSync events. What's the difference between syncing every 1000 ms, 100 ms, or 10 ms? If "10 ms" is so great, why don't we do that unconditionally?
This sounds helpful... we expect to have an OS thread for each GOMAXPROCS, and for each OS thread to have a pretty big system stack. (Is that true for all threads, or only for the first one?) Can we convert that into justification for a general-purpose default for the sync interval (like "if we sync every 100 ms, that's 10% more memory usage than you'd get from the OS threads' stacks themselves, even if the app didn't have its own memory needs")?

Maybe I still don't understand the root we're trying to get to. It seems like every consumer of the execution trace data stream can make its own decisions about when to discard data (because it's more bytes than the consumer has in its budget, or because it covers more seconds than the consumer has been told it needs to record). But the one thing that the consumers need to agree on is how often the runtime will generate the sync events: that's the source of the need for some global config (or mutual exclusion). Is that right?
Yeah, that's a good point; I don't have a good answer. We could have net/http/pprof expose a
You raise a good point. The documentation should state this, but yes, there's a CPU cost to generating synchronization events more frequently. The cost is roughly proportional to the number of Ms and Gs. Off the top of my head, I don't know what the cost is, but it's certainly measurable.
The cost of synchronization events is proportional to not only OS threads, but the number of goroutines (since we record information about even waiting goroutines too). I'm not totally sure I follow your "10% more memory" example though; can you explain more?
Yeah, that's mostly right. I think in a hypothetical future scenario where we want to optionally expose "expensive" events (like, say, an allocation trace, which is currently behind a
How does one create a

If it has a global nature, why not make the methods package-level functions, or make a global variable of

And, how does the unrelated code know if there is already an instance of
+1 to making the global nature of
@rhysh in addition to the CPU costs mentioned in @mknyszek's reply, I think it's also important to consider the size overhead. More

@mknyszek I have a question about
Proposal: runtime/trace flight recording
Updated: 23 September 2023
Background
"Flight recording" is a technique in which trace data is kept in a conceptual circular buffer, flushed upon request. The purpose of this technique is to capture traces of interesting program behavior, even when one does not know ahead of time when that will happen. For instance, if the web service fails a health check, or the web service takes an abnormally long time to handle a request. Specifically, the web service can identify such conditions when they happen, but the programmer setting up the environment can't predict when exactly they will occur. Starting tracing after something interesting happens also tends not to be useful, because the program has already executed the interesting part.
The Java ecosystem has had this for years through Java's flight recorder. Once the JVM's flight recorder is enabled, the JVM can obtain a trace representing the last few seconds of time. This trace can come from triggers set up in JMX, or by passing a flag to the JVM that dumps a trace on exit.
With the implementation of #60773 now approaching a stable point, hopefully in Go 1.22 we'll have all traces become a series of self-contained partitions. This implementation change presents an opportunity to easily add something similar to the Go execution tracer by always retaining at least one partition that can be snapshotted at any time.
This is also enabled by work in the Go 1.21 release to make traces dramatically cheaper. Because flight recording relies on waiting until something interesting happens, tracing needs to be enabled for a much longer period of time. Enabling flight recording across, for example, a small portion of a production fleet, becomes much more palatable when the tracing itself isn't too expensive.
Design
The core of the design is a new API in the runtime/trace package to enable flight recording. This means that programs can be instrumented with their own triggers.
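A sketch of that API surface, using the names referenced throughout this proposal (the constructor name and exact signatures are illustrative rather than final):

```go
package trace // runtime/trace

// A FlightRecorder tracks a moving window over the execution trace produced
// by the runtime.
type FlightRecorder struct{ /* opaque */ }

// NewFlightRecorder creates a new flight recording configuration.
func NewFlightRecorder() *FlightRecorder

// SetPeriod sets approximately how much recent execution time the recorder
// should retain. It must be called before Start.
func (r *FlightRecorder) SetPeriod(d time.Duration)

// SetSize caps the memory used to retain trace data, limiting the damage
// caused by a large period. It must be called before Start.
func (r *FlightRecorder) SetSize(bytes int)

// Start begins retaining trace data; Stop releases it.
func (r *FlightRecorder) Start() error
func (r *FlightRecorder) Stop() error

// WriteTo snapshots the most recent partitions and writes them to w as a
// valid, self-contained trace. It returns an error if called concurrently
// with itself.
func (r *FlightRecorder) WriteTo(w io.Writer) (n int64, err error)
```

Implementation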
Because #60773 already partitions traces, most of the implementation work is already done. There are just a few additional changes that need to be made.
First, the values from SetPeriod and SetSize need to adjust runtime-internal variables when (*FlightRecorder).Start is called. The period already exists, but the size in bytes does not. The tracer will need to track a partition's size, and the background goroutine that creates new partitions will need to be woken when the partition's size exceeds that value.

Next, the trace reader goroutine puts buffers it takes off the queue onto a free list when it's done with them. When flight recording is enabled simultaneously with a trace.Start, the trace reader will need to put these buffers on some other list in case the flight recorder needs them. The task of putting those buffers on the reuse list will be handled by whichever goroutine creates a new partition, and it will only place the oldest partition's buffers on the reuse list. This way, at least one full partition will always be available for the flight recorder.

Lastly, when (*FlightRecorder).WriteTo is called, it will need to create a new partition. Then, it will need to take ownership of any outstanding trace buffers by setting a flag on each buffer preventing it from going on the reuse list. This will likely require a brief stop-the-world, but the actual copying will not happen while the world is stopped. Once copying is done, (*FlightRecorder).WriteTo will wait until the trace reader is done with the partitions for the buffers it was using (if a trace reader is active) and then place all the buffers it marked on the reuse list.

Discussion
SetPeriod and SetSize could give more rigid guarantees, but it's both complex to implement and not terribly useful. The primary use-case for SetPeriod is to allow users to ask for longer traces (for example, if a web service's "long request" means something much longer than a handful of seconds). Meanwhile, the primary use-case of SetSize is to control memory overheads and limit the damage caused by a large SetPeriod.

WriteTo could allow multiple goroutines to call it, since it could easily serialize them internally. However, this can create some bad situations. For instance, consider some snapshot trigger condition that causes multiple goroutines to call WriteTo. The call is heavyweight and they'll queue up behind each other; the longest one will likely take quite a while to resolve, and the application will be significantly disrupted. It'll also produce traces that aren't very useful (consisting of short partitions corresponding approximately to the duration of the last WriteTo call), unless we also allow multiple goroutines to read the same partition's buffers. However, that would be fairly complicated to implement and doesn't really add much value either, since it's just duplicate data. The current design side-steps these issues, and reduces the risk of run-time panics, by simply returning an error in this case.
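In practice, trigger code can enforce that single-caller expectation itself with a simple guard; for example, a sketch against the experimental golang.org/x/exp/trace API (the helper and file names are made up):

```go
package diag

import (
	"log"
	"os"
	"sync/atomic"
	"time"

	"golang.org/x/exp/trace"
)

var snapshotting atomic.Bool

// maybeSnapshot dumps the flight recorder at most once at a time; concurrent
// triggers are dropped instead of queueing up behind each other.
func maybeSnapshot(fr *trace.FlightRecorder) {
	if !snapshotting.CompareAndSwap(false, true) {
		return
	}
	defer snapshotting.Store(false)

	f, err := os.Create("snapshot-" + time.Now().Format("20060102T150405") + ".trace")
	if err != nil {
		log.Printf("creating snapshot file: %v", err)
		return
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Printf("flight recorder snapshot failed: %v", err)
	}
}
```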
External circular buffer
@dominikh has suggested adding a similar feature to gotraceui. Because the partitioning is actually baked into the trace's structure, it's possible for trace consumers to implement something similar themselves. The only thing missing is a standard streaming endpoint for execution traces (see follow-up work).
However, there are a few advantages to adding support directly to the standard library and runtime:

- Programs don't need to rely on the runtime/trace package's annotations, but can instead decide when to grab a snapshot directly.
- The runtime can expose control over the recording window and its memory use (SetPeriod and SetSize). Any external solution will be at the mercy of the runtime's defaults.

Despite these advantages, it's likely worth pursuing support for such a use-case even if the API described in this proposal is made available. A shortcoming of this document's proposal is that there's no way to trigger a snapshot explicitly against trace data, only program logic. Handling traces externally also means the ability to perform ad-hoc analyses without the need for additional instrumentation.
Follow-up work
Add support for trace streaming endpoints to net/http/pprof

As stated in the discussion of the "external circular buffer" alternative, we could support that alternative easily and well by just adding a standard debugging endpoint for streaming trace data. It probably makes the most sense to just add new query parameters to the existing trace endpoint; the details of that can be hashed out elsewhere.
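For reference, the existing endpoint already streams a trace for a caller-chosen duration once net/http/pprof is wired up; a minimal setup (the address and duration are illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers, including /debug/pprof/trace
)

func main() {
	// A client can then fetch, say, a 5-second execution trace with:
	//   GET http://localhost:6060/debug/pprof/trace?seconds=5
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```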