runtime: diagnostics improvements tracking issue #57175

Open
10 of 19 tasks
mknyszek opened this issue Dec 8, 2022 · 64 comments
mknyszek commented Dec 8, 2022

As the Go user base grows, more and more Go developers are seeking to understand the performance of their programs and reduce resource costs. However, they are locked into the relatively limited diagnostic tools we provide today. Some teams build their own tools, but right now that requires a large investment. This issue extends to the Go team as well, where we often put significant effort into ad-hoc performance tooling to analyze the performance of Go itself.

This is a tracking issue for improving the state of Go runtime diagnostics and their tooling, focusing primarily on runtime/trace traces and heap analysis tooling.

To do this work, we the Go team are collaborating with @felixge and @nsrip-dd and with input from others in the Go community. We currently have a virtual sync every 2 weeks (starting 2022-12-07), Thursdays at 11 AM NYC time. Please ping me at mknyszek -- at -- golang.org for an invite if you're interested in attending. This issue will be updated regularly with meeting notes from those meetings.

Below is what we currently plan to work on and explore, organized by broader effort and roughly prioritized. Note that this almost certainly will change as work progresses and more may be added.

Runtime tracing

Tracing usability

Tracing performance

Heap analysis (see #57447)

  • Update viewcore's internal core file libraries (gocore and core) to work with Go at tip.
  • Ensure that gocore and core are well-tested, and tested at tip.
  • Make gocore and core externally-visible APIs, allowing Go developers to build on top of them.
  • Write go.dev documentation on how to do heap analysis with viewcore.

CC @aclements @prattmic @felixge @nsrip-dd @rhysh @dominikh

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 8, 2022
@mknyszek mknyszek added this to the Go1.21 milestone Dec 8, 2022
mknyszek commented Dec 8, 2022

2022-12-07 Sync

Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh

Notes:

  • pprof labels in execution traces
    • Michael K: I need to follow up on the CL and issue.
    • Michael P: Have you considered runtime/trace regions?
    • Nick: Yes, but it doesn't quite hit our use-cases.
    • Rhys: Could use logs instead. Is the point to follow goroutines?
    • Felix: Yeah.
    • Rhys: Started parsing the trace format for some internal tooling.
    • Felix: Prefer labels because inheritance. Mostly like it for profile tools on top, but maybe doesn't matter for tracing tools. The parent goroutine gets recorded so things like request IDs can be tracked in post-processing.
      • Also useful for attributing CPU time.
    • Michael P: Tasks are inherited, but inheritance is processed downstream.
      • Would be nice to have pprof labels just so you don't have to think about which one to use.
    • Michael K: Useful even just to bridge pprof and runtime/trace.
    • Austin: Agreed. We can one day get to the point of deprecating the old APIs as well.
    • Rhys: RE: Attributing CPU time, can see at better than 10 ms of granularity already (even if Ps aren't CPU time, it's still time the rest of the app couldn't run).
    • Michael P: There's an issue about measuring CPU time online. (TODO: Find it.)
  • Trace parsing API
    • Michael K: How important is this? Priority?
    • Felix: Important for the community, but we can't make use of it in our case. Will the trace format change in between releases?
    • Michael K: I think we can always guarantee a trace format for a release.
    • Michael P: How high level should this API be?
      • cmd/trace has two levels:
        • Low-level that understands format and events.
        • Higher-level that understands relationships between goroutines, etc.
    • Michael K: Page trace splits this into "parser" and "simulator." The latter is more stateful.
    • Felix: Intuitive feeling toward lower level API.
    • Rhys: +1 to low level.
    • Austin: Scalability of processing traces.
      • Currently not in a good state in low or high level format (currently requires the whole trace).
      • Can fix trace wire format for low-level parsing scalability issues, but it's much less clear how to do this for the high-level format.
    • Austin: Flight recorder idea.
      • Interacts interestingly with simulation. Current trace snapshots everything.
      • Solved this in debuglog; reads its own tail and keeps local state updated.
      • Complicated trade-offs in this space.
    • Felix: We use a lot of JFR, one thing that's nice is it's broken down into self-contained chunks.
  • Michael K sent out a very half-baked trace format revamp. (Thanks for the comments! Far from ready to share more widely.)
    • The next step is to measure the actual current overhead.
      • Maybe add a mode to Sweet?
      • Rhys: Have been collecting CPU profiles and execution traces. 20% of CPU time during execution trace is for execution trace itself. 95% of overhead is collecting stack traces.
        • Collect 1 second every 5000 seconds and no one complains. People do complain about goroutine profiles every 2 minutes.
      • Michael K: Shooting for KUTrace overhead, so making stack traces optional/faster is just step 1.
      • Felix: Trace effect on tail latency.
        • Rhys: Traces are my view of tail latency.
      • Felix: Benchmark for pathological cases and worst case.
      • Austin: Linked to trace overhead issue, Dmitry proposed switching to frame pointer unwinding.
      • Felix: At some point implemented frame pointer unwinding in userland and it was 50x faster (link).
      • Rhys: Not sure what kind of tool you could build without stack traces in an app that doesn't set pprof labels, tasks, regions, trace logs, etc.
      • Michael K: Integration with structured logging?
      • Michael P: It does add yet another way to add things to runtime/trace.
      • Rhys: The standard library (e.g. database/sql) doesn't currently use runtime/trace at all, maybe it should.
      • Michael K: This connects to deciding what goes into a trace. I think this is a very good idea.
      • Felix: +1. Java world and JFR does this.
  • Deciding what goes into a trace
    • Disabling stack tracing / reduce stack trace depth
    • Filtering by pprof labels
    • Specific event classes
    • Standard library events
    • Rhys: I've made this decision for my organization. Expected that you do profiling for a long running service. No opportunity for app owners to express opinions. People who complained forked the package, turned it off, and now coming back. I kind of want everything.
    • Felix: I would love to be in a place where we can do that, but we get pushback from users when the overhead is too high.
    • Rhys: The question is how close we get within 1% overhead. My choice was to get everything, but less often.
    • Felix: Desire to get more of everything is in conflict with adding more kinds of things in the trace.
    • Michael P: Agreed. Ideally we have tracing that's sufficiently fast that we have on all the time, but if libraries are allowed to add new traces then it could be a problem. It would be nice to turn that off without forking a library.
  • Before next sync:
    • Michael K: Unblock pprof labels patch and benchmarking trace overhead.
    • Felix: I can contribute a worst case benchmark.
      • Currently blocked on pprof labels in trace.
    • Felix: Started to work on gentraceback. Might work on it over the holidays.
      • Trying for bug-for-bug compatibility.
    • Michael P: Austin has been working on this too.

@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 8, 2022
felixge commented Dec 9, 2022

I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into #57159. Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB trace (15 s of prod activity) yesterday, and it was a real eye-opener to see how the current UI struggles.

tbg commented Dec 9, 2022

I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer where it would freeze on me all the time, esp. in the per-goroutine view (/trace?goid=N). I figured out you can download perfetto-compatible JSON data from /jsontrace?goid=N. (/jsontrace gives the default view). This can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace so it's not as great, but I was glad to have something that worked.

thediveo commented Dec 9, 2022

would the pprof labels also show up in goroutine traces?

qmuntal commented Dec 9, 2022

I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?

mknyszek commented Dec 9, 2022

@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.

@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer that's certainly more in the same spirit. Do you have an issue for tracking that already?

qmuntal commented Dec 9, 2022

Do you have an issue for tracking that already?

I still have to prepare the proposal, I plan to submit it next week.

unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).

Not for now, but once I finish this I want to investigate how feasible it is to unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.

qmuntal commented Dec 14, 2022

Do you have an issue for tracking that already?

I do now #57302 😄

@gopherbot

Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead

mknyszek commented Dec 22, 2022

2022-12-22 Sync

Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh

  • Organizational stuff
    • OK to record meetings?
    • Meeting recorded with transcript this week (please ask if you would like to see it).
  • Trace overhead benchmarks
name                                old time/op            new time/op            delta
BiogoIgor                                      17.7s ± 3%             17.5s ± 4%     ~     (p=0.190 n=10+10)
BiogoKrishna                                   15.1s ± 4%             15.1s ± 4%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             5.78s ± 7%             5.76s ±11%     ~     (p=0.853 n=10+10)
BleveQuery                                     2.37s ± 0%             2.37s ± 0%   -0.26%  (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat                 16.9s ± 9%             16.9s ± 7%     ~     (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1             36.7s ± 1%             44.4s ± 2%  +21.01%  (p=0.000 n=10+10)
GoBuildKubelet                                 47.0s ± 2%             48.8s ± 3%   +3.72%  (p=0.000 n=10+10)
GoBuildKubeletLink                             8.89s ± 2%             8.88s ± 4%     ~     (p=0.720 n=10+9)
GoBuildIstioctl                                45.9s ± 1%             47.8s ± 2%   +4.09%  (p=0.000 n=10+10)
GoBuildIstioctlLink                            9.07s ± 2%             8.99s ± 2%     ~     (p=0.095 n=10+9)
GoBuildFrontend                                15.7s ± 4%             16.1s ± 2%   +2.45%  (p=0.043 n=10+10)
GoBuildFrontendLink                            1.38s ± 2%             1.37s ± 3%     ~     (p=0.529 n=10+10)
GopherLuaKNucleotide                           27.9s ± 0%             27.9s ± 1%     ~     (p=0.853 n=10+10)
MarkdownRenderXHTML                            256ms ± 2%             256ms ± 2%     ~     (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest                 618µs ± 7%             657µs ±10%   +6.30%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             722µs ± 6%             773µs ± 4%   +6.96%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  508µs ± 3%             532µs ± 3%   +4.73%  (p=0.000 n=10+10)

name                                old average-RSS-bytes  new average-RSS-bytes  delta
BiogoIgor                                     68.8MB ± 2%            71.8MB ± 4%   +4.40%  (p=0.000 n=10+10)
BiogoKrishna                                  4.42GB ± 0%            4.42GB ± 0%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             194MB ± 2%             198MB ± 3%   +1.91%  (p=0.008 n=9+10)
BleveQuery                                     536MB ± 0%             537MB ± 1%     ~     (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat                 444MB ± 1%             446MB ± 0%   +0.41%  (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1             132MB ± 1%             142MB ± 4%   +7.61%  (p=0.000 n=10+10)
GoBuildKubelet                                1.75GB ± 1%            1.85GB ± 1%   +5.51%  (p=0.000 n=10+10)
GoBuildIstioctl                               1.35GB ± 1%            1.42GB ± 1%   +5.49%  (p=0.000 n=10+9)
GoBuildFrontend                                511MB ± 2%             543MB ± 1%   +6.31%  (p=0.000 n=10+9)
GopherLuaKNucleotide                          37.0MB ± 1%            40.4MB ± 2%   +9.24%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           21.8MB ± 3%            24.0MB ± 3%  +10.14%  (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest                5.40GB ± 1%            5.38GB ± 1%     ~     (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest            5.72GB ± 1%            5.71GB ± 1%     ~     (p=0.971 n=10+10)
Tile38KNearestLimit100Request                 7.26GB ± 0%            7.25GB ± 0%     ~     (p=0.739 n=10+10)

name                                old peak-RSS-bytes     new peak-RSS-bytes     delta
BiogoIgor                                     95.9MB ± 4%            98.5MB ± 3%   +2.70%  (p=0.030 n=10+10)
BiogoKrishna                                  4.49GB ± 0%            4.49GB ± 0%     ~     (p=0.356 n=9+10)
BleveIndexBatch100                             282MB ± 3%             284MB ± 4%     ~     (p=0.436 n=10+10)
BleveQuery                                     537MB ± 0%             538MB ± 1%     ~     (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat                 485MB ± 1%             483MB ± 0%     ~     (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1             180MB ± 2%             193MB ± 3%   +7.19%  (p=0.000 n=10+10)
GopherLuaKNucleotide                          39.8MB ± 3%            46.0MB ±20%  +15.56%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           22.1MB ± 3%            25.5MB ± 7%  +15.45%  (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest                5.70GB ± 1%            5.68GB ± 1%   -0.45%  (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest            5.93GB ± 1%            5.91GB ± 2%     ~     (p=0.631 n=10+10)
Tile38KNearestLimit100Request                 7.47GB ± 1%            7.46GB ± 0%     ~     (p=0.579 n=10+10)

name                                old peak-VM-bytes      new peak-VM-bytes      delta
BiogoIgor                                      802MB ± 0%             803MB ± 0%   +0.11%  (p=0.000 n=10+10)
BiogoKrishna                                  5.24GB ± 0%            5.24GB ± 0%   +0.01%  (p=0.001 n=10+10)
BleveIndexBatch100                            1.79GB ± 0%            1.79GB ± 0%   +0.05%  (p=0.000 n=8+8)
BleveQuery                                    3.53GB ± 1%            3.53GB ± 1%     ~     (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat                1.21GB ± 0%            1.16GB ± 4%     ~     (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1             875MB ± 0%             884MB ± 0%   +1.02%  (p=0.000 n=10+10)
GopherLuaKNucleotide                           733MB ± 0%             734MB ± 0%   +0.11%  (p=0.000 n=9+10)
MarkdownRenderXHTML                            733MB ± 0%             734MB ± 0%   +0.10%  (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest                6.42GB ± 0%            6.39GB ± 1%     ~     (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest            6.62GB ± 1%            6.61GB ± 2%     ~     (p=0.927 n=10+10)
Tile38KNearestLimit100Request                 8.16GB ± 1%            8.18GB ± 0%     ~     (p=0.649 n=10+8)

name                                old p50-latency-ns     new p50-latency-ns     delta
Tile38WithinCircle100kmRequest                  144k ± 3%              159k ± 3%  +10.56%  (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest              215k ± 1%              232k ± 2%   +7.91%  (p=0.000 n=9+10)
Tile38KNearestLimit100Request                   347k ± 2%              373k ± 1%   +7.21%  (p=0.000 n=10+10)

name                                old p90-latency-ns     new p90-latency-ns     delta
Tile38WithinCircle100kmRequest                  908k ± 6%              956k ± 9%   +5.22%  (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest             1.07M ± 4%             1.11M ± 5%   +4.33%  (p=0.001 n=10+10)
Tile38KNearestLimit100Request                  1.03M ± 3%             1.05M ± 4%   +2.64%  (p=0.011 n=10+10)

name                                old p99-latency-ns     new p99-latency-ns     delta
Tile38WithinCircle100kmRequest                 7.55M ± 9%             7.93M ±13%     ~     (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest             7.81M ± 8%             8.39M ± 2%   +7.36%  (p=0.000 n=10+8)
Tile38KNearestLimit100Request                  2.03M ± 4%             2.08M ± 5%   +2.52%  (p=0.019 n=10+10)

name                                old ops/s              new ops/s              delta
Tile38WithinCircle100kmRequest                 9.73k ± 7%             9.16k ±11%   -5.83%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             8.31k ± 6%             7.77k ± 4%   -6.55%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  11.8k ± 3%             11.3k ± 3%   -4.51%  (p=0.000 n=10+10)
  • Introduction: Bryan Boreham, Grafana Labs
    • Questions within the team about whether useful information has been derived from Go execution traces.
    • Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
    • Michael K: Interesting data point about usability.
    • Michael P: Hard to link application behavior to trace.
    • Bryan: Example: channels. Still don't really know where to find that data.
    • Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
    • Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
      • Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
      • Tooling on the way to get open sourced.
  • Heap analysis plan (proposal: x/debug: make the core and gocore packages public #57447)
    • Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
  • https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
    • Austin: Does Perfetto do better with instantaneous events?
      • Michael P: Yes, there's a 20px wide arrow but we have so many.
      • Rhys: Hold shift, draw a box. If you aim well, you get what you want.
    • Rhys: Why is there only one timestamp on some events?
      • Austin: We can add another timestamp.
      • Michael P: Syscall fast path does a lot less.
  • pprof labels in traces
    • Michael K: I think I've unblocked Nick. Michael and I are reviewing.
  • runtime.gentraceback cleanup
    • Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
  • Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
  • Action items:
    • We're slowing down for the holidays, so no strong expectations
    • Michael K:
      • Try to land execution trace benchmarking.
      • Might look into heap analysis stuff.
      • After break, might want to start working on trace format more seriously.
  • Happy holidays!

mknyszek commented Jan 13, 2023

2023-01-05 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole

  • Organizational stuff
  • Benchmarks: Can we add a goroutine ping pong example? (Felix)
    • Tracer benchmarks all show relatively low overhead. Can we add a benchmark that demonstrates the worst case?
    • Austin: Sweet probably isn’t the right place because that’s application-level. Maybe add to Bent?
    • Felix: Next step on these benchmarks? Land MK’s trace benchmark support?
    • Austin: It’s certainly fine to land. We don’t have a good way to integrate these “extra dimensions” into our automated benchmarking.
    • AI(austin): Bring up monitoring extra benchmarking dimensions.
    • Austin: “Unit benchmarks” would be the perfect place for a ping pong benchmark (we already have one in the main repo), but we never quite got to integrating these into automated monitoring.
  • Are only GC STW recorded? Would it make sense to record other STW events (read metrics, goroutine profile, heap dump)? (Felix)
    • Rhys: You get ProcStop events
    • Austin: Yeah, you’re right that we trace high level GC STW events.
    • Rhys: Currently the GC traces the “best case” STW, which can be really misleading.
    • Austin: We could definitely have a “stopping the world” and a “world stopped”. Maybe don’t need that for start.
    • Felix: That would be great. We’re investigating rare long STWs right now.
    • Rhys: Starting the world can take a while. Problems with heap lock contention. I would love to have more visibility into the runtime locks.
    • Austin: Runtime locks are a bit of a mess. I also wonder if they should be “scalable”.
    • Rhys: I’d love to discuss that. C&R office hours?
    • Austin: Perfect.
    • Conclusion: Let’s add events for all STWs and also separate “stopping” from “stopped”.
  • Updates on Perfetto UI (Felix and Nick)
    • Add to UI CL: https://go.dev/cl/457716 
    • Felix: The JSON currently produced by the trace tool is basically compatible with Perfetto. Doesn’t let us open really large traces without splitting, which was one of the hopes. And it takes a while to load. I was able to use the command-line tools to convert a 280 MB trace into a 9.8 GB JSON trace and load that into Perfetto, but it took 20 minutes. Nick has been working on outputting proto directly, which will hopefully produce less data than JSON.
    • Rhys: When I tried this a while ago, the connection of data flow wasn’t quite right.
    • Felix: This CL doesn’t fix that. I’m hoping it’s an upstream issue, which they’re pretty responsive to. I’m hoping protobuf will just make it go away, since that’s their canonical input.
    • Nick: Stack traces seem to be missing from protobuf, which we definitely want. We might need upstream changes to support that.
    • Felix: I suspect there may be some long tail of issues. But the initial plan would be to keep both viewers until we feel this is solid.
    • Austin: How does the streaming work?
    • Felix: They have an in-memory column store with a SQL interface on top of it. Large traces would still be a problem because they’d need to be loaded fully into memory.
    • Austin: In principle we could swap out that column store for our own streaming thing, but that sounds like a significant amount of work.
    • Felix: On Go Time someone said they only use runtime trace when they’re really desperate and then they can’t figure it out anyway. Most people don’t think about their program from the perspective of the scheduler. I’d like to have different pivoting, like one timeline per G (or M). We sort of have that in the goroutine analysis, but that only shows on-CPU time. Dominik did that in gotraceui.
  • Updates on pprof labels (Nick)
    • Nick: In MK’s recent comments on pprof labels CL, he wondered about a size limit on labels being recorded in the trace. Thinking about trace overhead. Users can also add arbitrary logs (limited by trace buffer size). My thought is that users are deciding to make these as big or as small as they want.
    • Austin: My reaction is “do what the user said”
    • Rhys: It seems like we already don’t have a limit on the pprof labels (number/length/etc) and maybe it would have been good to have a limit, but we already don’t.
    • Bryan: For me it’s more important to be able to find out how much damage you’re doing with this data. Inevitably people want one more byte than the limit and will be frustrated.
    • Felix: Two sides to this problem: how to get the data in the trace while keeping overhead low, and the other is keeping the memory usage low for keeping all these labels. For trace overhead, I’m thinking we want two or three levels of filtering: filter what events, filter events by properties (e.g., duration). JFR supports both of these. And potentially a way to modify events (maybe too far), like truncation. At some point you can almost guarantee fixed-cost tracing. E.g., turn off everything except profile events; now you have timestamps on profile events without all the other overhead.
    • Austin: MK and I have definitely been thinking in that direction. The current trace viewer is almost purpose-built for analyzing the scheduler and needs to understand how a lot of events relate. But if we open up reading traces, the trace viewer becomes just another tool and maybe it’s fine for it to say “I need these events” (kind of like “perf sched” or similar).
    • Felix: I can ask my Java colleagues about how this works in JFR.
    • Rhys: Curious how you’re thinking about filtering.
    • Felix: Simpler is better. You could imagine a callback, but that’s not simple. Probably something like runtime/metrics where you can discover the events and select.
    • Rhys: Definitely need a header saying which events are included.
    • Felix: Agreed. Also nice for viewers so they don’t have to hard-code all of the events.

mknyszek commented Jan 19, 2023

2023-01-19 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole

  • Felix: gentraceback iterator refactoring
    • Felix: What's the progress?
    • Austin: Made progress. Running into issues with write barriers and trying to knock down all the write barriers one by one. Big open question of testing; so many horrible corner cases. No good answers.
    • Felix: Tried to do it incrementally instead of all at once; also painful. RE: testing, would it be useful to have the ability to instrument a PC and do a traceback from there?
    • Austin: That would help. The worst parts are weird though, like signals. If we had a good way to inject a symbol, like a breakpoint, that would help a lot.
      • Idea: could use hardware breakpoints via perf-event-open (Linux only, but at least architecture-independent) which could get enough coverage for Austin to be happy.
      • Could potentially synthesize other signal tests from a single signal.
    • Felix: I'll give it a shot.
    • Michael K: What work could we do in parallel?
    • Felix: Could write a frame pointer unwinder separately for tracing just to get an idea of the overhead.
      • Austin: +1. Tricky things include logic in gentraceback for filtering out frames. Maybe it doesn't matter for the trace viewer (i.e. don't filter). Also inline unwinding. Trying to totally separate inline unwinding in gentraceback. Once it's its own separate thing, it'd be straightforward to plumb that into a frame pointer unwinder.
      • Michael K: Could we skip inline unwinding for the experiment?
      • Austin: Yeah.
      • Michael P: +1 to separating out inline unwinding. Already "runtime_expandFinalInlineFrame" in the runtime which is a good reference point for this.
      • Felix: Also all the complexity with cgo traceback, but we should just ignore that for the experiment.
      • Michael K: The cgo traceback tests are also really flaky, and if we could have better testing around that that would be great.
  • Felix: Perfetto UI blues … (timeline bug, link bug, stack traces, large traces, small screens, protocol buffer format) … gotraceui w/ wasm? Having an online tool with independent release cycle is tempting?
    • CL out that makes Perfetto work. Limitations:
      • Limited for very large traces as-is.
      • Doesn't seem easy to make it work as well as go tool trace (bugs). e.g. timelines not named correctly. Events not connected correctly.
        • Harder: getting stack traces to show up. Nick has tried to make it work. Protobuf format doesn't have an obvious stack trace format?
        • Nick: Not a one-to-one mapping between Catapult format and Perfetto. Can stick a single location in the Perfetto format, but not a full stack trace. Little things in the protobuf format that aren't well-documented. e.g. string interning only works if you include a number in the header.
        • Michael K: MP and I looked into this. Perfetto knows how to do this for some traces, but it’s built into a C++ library, so we’d have to rewrite that in Go or call into it from Go. I’m not sure it even has strong backwards compatibility.
        • Michael P: There is the Perfetto tool that runs the RPC server. (trace_processor.) That loads into a SQLite in-memory DB, but does do better than the fully in-browser implementation. It can do bigger traces, though is still limited. That seems like enough of an improvement to me.
        • Felix: I have a 280MB trace that gets split into 90 parts for 15 seconds on a busy server. Maybe we should start with deciding what size trace we want to have a good experience for.
        • Michael K: I think 280MB is a big trace, though it’s only 15 seconds. I think we should be targeting bigger than that. It’s easy to get a 1GB trace. But we can start with Perfetto as long as it’s better and work toward that.
        • Austin: Is that better with Perfetto?
        • Felix: I think it would be better. Maybe 5x better, so a second at a time (don’t quote me on that).
        • Michael P: The trace_processor is better, but still limited by the in-memory SQLite DB. Presumably that could be on disk. I don’t know if the trace loading is also linear in the trace size.
        • Rhys: What do you even do with an execution trace that large? How do you get value out of that?
        • Felix: This trace was from a colleague from an instance that was struggling with pauses. It looked like a straggling procstop. It was debugging the behavior of a whole application that was behaving poorly.
        • Rhys: So you were looking for behavior that was pretty zoomed-out.
        • Felix: Yeah.
        • Michael K: Part of the problem with existing traces is the usability of this. I think it’s a valid question about whether big traces are all that useful. Sometimes you’re not even really sure what you’re looking for. Say I wanted to run a full trace on every invocation of the compiler. You don’t necessarily know what you’re looking for to improve compiler speed.
        • Austin: I bet if you were to profile the space of large trace file, the vast majority of that would not be useful to you looking at it at a high level. Suggests a solution here for filtering is to just reduce what goes into the trace.
        • 280MB Trace Size Breakdown
        • Michael K: Maybe just proc start/proc stop for what Felix was describing.
        • Rhys: But once you find the problem, you want more detail. It's hard to catch the end of a garbage collection cycle because of the rules of starting a trace during a GC cycle.
        • Michael K: Fixing the mark phase issue should be easier than before.
        • Austin: Awesome breakdown!
      • User group said "please don't do this" because Perfetto isn't nice to small screens.
      • Felix: gotraceui
        • Viewing timelines for goroutines is great.
        • Would like Dominik to talk about gotraceui some more.
        • I want to be intentional about choosing Perfetto.
        • Michael K: I think the dependency on gio was a concern.
        • Dominik: Gio (the UI library I use) supports wasm, so it should be fairly
          straightforward to have gotraceui run in the browser if we want to go
          down that road.
        • Dominik: I still rely on loading entire traces into memory (but using significantly less memory than current go tool trace), but with the upcoming format changes, streaming data might be possible.  We currently load everything into memory because when the user zooms out far enough, we need all events to compute what we display. But we could probably precompute these zoom levels, similar to mipmaps.
        • Dominik: For the current trace format, gotraceui needs roughly 30x the size of the trace in memory. so a 300 MB trace needs 9 GB.
        • Michael K: I have been thinking about an HTML UI that does something like Google Maps tiles to scale. We could skip a lot of work if we could take gotraceui as the UI, but port it into something more portable than Gio. OTOH, it’s even more work to build something from scratch.
        • Dominik: WRT gotraceui's use of Gio, there'll be pretty rich UI, and I don't fancy writing UIs in HTML/JS. But all of the processing of trace data could live externally
        • Michael P: It’s not necessarily a hard requirement that the Go project itself ship a trace viewer. We have to now because there’s no API. But if we shipped an API, it wouldn’t be a hard requirement. Much like we don’t ship a debugger.
        • Michael K: One option is that we ignore the UI situation entirely and build something that you can parse separately and ship something really bare later. In the meantime, point at a little tool that will shove it into trace_processor and point people at Perfetto. For a brief time, stop shipping our own. It’s very convenient that you only need a Go installation to view these traces, but I think you’re right that we could stop shipping a UI. We could also keep the existing UI working/limping while we do other things in parallel.
        • Felix: Is Dominik looking for contributors? (That comes with its own overheads)
        • Dominik: I'm usually not big on contributions in the form of code; but ideas and feedback are hugely appreciated
        • Michael K: We don’t have to make a decision on using Perfetto now. Maybe we should plug along for two more weeks (with Perfetto) and figure out if we can fix the issues without too much effort, and then make a hard decision on what to do at the next meeting.
        • 👍
  • Felix: traceutils anonymize & breakdown and ideas: (flamescope, graphviz, tracer overhead)
    • Implemented anonymization of traces. Breakdowns, too.
    • Tracer overhead tool that uses profile samples in the trace to identify overheads.
  • Felix: Format: Consistent message framing, remove varint padding for stacks
    • 4 different cases for how an event can be laid out.
    • Maybe a way to skip messages and layouts it doesn't understand.
    • Austin: self-descriptive header giving lengths for each opcode
    • Michael K: Any state in the trace makes things hard to push it up into OTel, since that’s completely stateless.
    • Felix: We’re actually trying to do two things in OTel. Including binary data blobs, like pprof and JFRs. And something to send stateful things like stack traces, etc, where you can refer back to them efficiently.
    • David: For trace I wouldn’t expect a stateful protocol to be introduced any time soon. But for profiling it may be a possibility.
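Austin's suggestion of a self-descriptive header could be sketched roughly as follows: the trace header declares a payload length for every opcode, so a consumer can skip events whose layout it doesn't understand instead of aborting. All names and the encoding here are hypothetical, not the actual Go trace format.

```go
package main

import "fmt"

// header is a hypothetical trace header mapping each opcode to its
// payload length in bytes, allowing consumers to skip unknown events.
type header struct {
	lengths map[byte]int // opcode -> payload length
}

// next decodes one event starting at buf[off]. Events with opcodes the
// consumer doesn't recognize can still be skipped, because the header
// declares their length.
func (h *header) next(buf []byte, off int) (op byte, payload []byte, newOff int, err error) {
	if off >= len(buf) {
		return 0, nil, off, fmt.Errorf("EOF")
	}
	op = buf[off]
	n, ok := h.lengths[op]
	if !ok {
		return 0, nil, off, fmt.Errorf("opcode %d missing from header", op)
	}
	end := off + 1 + n
	if end > len(buf) {
		return 0, nil, off, fmt.Errorf("truncated event")
	}
	return op, buf[off+1 : end], end, nil
}

func main() {
	h := &header{lengths: map[byte]int{1: 4, 2: 0, 9: 8}}
	// Opcode 9 (8-byte payload) is "unknown" to this consumer but still
	// skippable because the header declares its length.
	buf := []byte{1, 0xa, 0xb, 0xc, 0xd, 9, 0, 0, 0, 0, 0, 0, 0, 0, 2}
	for off := 0; off < len(buf); {
		op, payload, next, err := h.next(buf, off)
		if err != nil {
			panic(err)
		}
		fmt.Printf("op=%d len=%d\n", op, len(payload))
		off = next
	}
}
```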

@mknyszek commented Feb 2, 2023

2023-02-02 Sync

Attendees: @aclements @felixge @nsrip-dd @thepudds @bboreham @dashpole @mknyszek @prattmic

  • Felix: Discuss results from frame pointer unwinding experiments (blog, sweet results) and next steps
    • Targeted ping-pong example, worst case. Worth noting that the stack depth in that benchmark is 2. Went from +773% -> +30%, apparently doing 50% more work too!
    • Sweet: 10% -> 2% overhead!
    • Michael K: Michael P mentioned missed cases.
    • Michael P: Inlined frames are one example. Maybe we just accept slightly less accurate traces in the tracer.
    • Austin: +1 to missing inlined frames, but we can also expand that after the fact.
    • Michael K: Do you need the binary for that?
    • Austin: Today, yes.
    • Felix: The tracer already de-duplicates stack traces. If we do inline expansion at the end, there's probably not that much work to do.
    • Michael P: Other avenue, do we need stack traces on every event? Maybe remove stack traces for some events?
    • Michael K: Where does the rest of the time go?
    • Felix: In the blog post. Frame pointer unwinding is only 9% of the trace overhead. 28% is cputicks. 21% is stack put.
    • Austin: Shocked that cputicks is 28%. It's one instruction. I guess that's a good sign?
    • Austin: (FP unwinding is also relevant for #53286. In that case it’s the kernel’s FP unwinder, but it means our FP data is going to have to be high quality for both.)
    • Thepudds: Or maybe an option for sampling of stack traces?
    • Michael K: I think it depends. As traces are used today, you probably want 100% sampling. For larger scale aggregation, I think it's a solid option.
    • Michael K: Dream of nanotime to line up clocks.
    • Austin: It might not be that bad. RDTSC is serializing so the extra math in nanotime might not make much of a difference in overhead.
    • Michael K: We should definitely pursue this, at least for tracing.
    • Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.
    • Austin: Status of FPs is better than it once was. Saving grace of the tracer is you often don't have an assembly frame on the stack. Talked about making vet complain if you clobber the frame pointer in assembly code. Would be surprised if there are problems in the compiler generated code; worry much more about assembly.
    • Felix: Worried about stack shrinking / relocation. Growing shouldn't happen while in unwinding, but not sure about shrinking.
    • Austin: I think you always see a fully formed stack.
    • Felix: There's no chance of seeing the stack mid-move?
    • Austin: The goroutine that's getting moved has to be stopped.
    • Nick: If unwinding happens asynchronously then it's a problem, like CPU profiling. We could use gentraceback in the difficult cases.
    • Felix: Plan on working on better unwind testing. That machinery could be used to harden frame pointer unwinding as well.
    • Michael K and Austin: Not a blocker to have the testing.
    • Austin: FP on x86 is specified as part of the Go internal ABI. If the compiler is messing that up that's a violation of the ABI and definitely a bug. Doesn't apply to hand-written assembly.
    • thepudds: One of the older CLs mentioned its approach depended on the stack not being copied while walking the frames, along with the comment “currently ok, but won't be if we preempt at loop backedges”... but maybe that old concern is not a current concern....
    • Michael K: I think loop backedges aren't a concern, and async preemption as it exists shouldn't be an issue.
    • Michael P: Traceback itself would just disable preemption just for consistency, but just because it's in the runtime package, we won't ever async preempt.
    • Austin: I'm not sure why loop backedges would be a concern.
    • Michael K: I don't think we should block on inline expansion, but maybe cgo tracebacks.
    • Austin: As an intermediate step, use gentraceback for if there’s a cgoTracebacker and cgo on the stack. Will work for 99% of our users.
    • Felix: Like the idea of making it the new default, but with the ability to switch back.
    • Michael K: We could add a GODEBUG flag
  • Felix: Flight recorder / ring buffer mode
    • Felix: We’d like to capture traces of slow spans. Wait for a p99 response and then get the last N MB of trace. I’m currently working on an experiment to see if this can be done in user space.
    • Michael K: I think finding the oldest batch is an O(N) operation. Ordering the batches is difficult because we assume everything will show up eventually.
    • Austin: The tracer is really stateful, so it's really difficult to actually manage a ring buffer. debuglog is a ring buffer, and what it does is consume its own format in order to manage a snapshot of the state.
    • Felix: I’d be okay with getting a non-perfect trace at the end. At least understand what the goroutines are doing. Maybe we could every once in a while emit a “synchronization” event. If a complete redesign of the format is required, [flight recorder mode] is something we’d be interested in.
    • Michael K: I’d like to find out what the overhead of writing the trace is. Say you have no stack traces, where is the rest of the time going? That’s important information for redesigning the trace format. I’ve already been thinking about redesigning the format. At the cost of using more space, it has to end up less stateful. Regularly synchronizing is one way to do that. That’s kind of where I was going: a “trace” is really a collection of self-contained traces. With the tooling able to be more resilient at the edges. Synchronization wouldn’t necessarily be STW, but you have a ragged barrier across the Ps that sync them all to the next trace chunk. That gets complicated in a ring buffer. I was thinking of gathering the requirements for a new trace format. Because there’s so much state, it’s hard to make it completely stateless without ballooning the trace.
    • Felix: JFR does that ... splitting the stream up into self-contained chunks.
    • Michael K: We’re definitely on the same page [wrt flight recorder]. The Go team arrived at this, too. We’re also trying to make ELF core dumps the source of truth for heap analysis. Ideally we’d be able to pull the ring buffer out of a core dump so you can see exactly what was happening before crashing.
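A user-space flight recorder along the lines discussed might look roughly like the following sketch: it retains only the most recent self-contained trace chunks under a byte budget, evicting the oldest first. The type and field names are invented for illustration; a real implementation would hold encoded trace batches that each start from a synchronization point.

```go
package main

import "fmt"

// flightRecorder keeps only the most recent self-contained trace chunks,
// in the spirit of the ring-buffer mode discussed above.
type flightRecorder struct {
	chunks   [][]byte
	maxBytes int
	curBytes int
}

// add appends a new chunk and evicts the oldest chunks until the total
// retained size fits the budget again (always keeping at least one).
func (r *flightRecorder) add(chunk []byte) {
	r.chunks = append(r.chunks, chunk)
	r.curBytes += len(chunk)
	for r.curBytes > r.maxBytes && len(r.chunks) > 1 {
		r.curBytes -= len(r.chunks[0])
		r.chunks = r.chunks[1:]
	}
}

// snapshot returns the retained chunks, oldest first. Because each chunk
// is self-contained, the result is parseable without earlier history.
func (r *flightRecorder) snapshot() [][]byte {
	out := make([][]byte, len(r.chunks))
	copy(out, r.chunks)
	return out
}

func main() {
	r := &flightRecorder{maxBytes: 8}
	r.add([]byte("chunk1")) // 6 bytes retained
	r.add([]byte("chunk2")) // 12 > 8: evicts chunk1
	r.add([]byte("c3"))     // 6+2 = 8: fits
	for _, c := range r.snapshot() {
		fmt.Println(string(c))
	}
}
```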

@qmuntal commented Feb 2, 2023

Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.

FYI: #57302 is hitting this as well, as I'm implementing SEH unwinding using the frame pointer. Whatever the fix for that turns out to be, it would be good to take SEH into account as well.
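For readers unfamiliar with the mechanism under discussion: frame-pointer unwinding walks the chain of saved frame pointers, reading the return address stored next to each one (on amd64, per the Go internal ABI, the caller's frame pointer is saved at [fp] and the return address sits at [fp+8]). The toy model below simulates that walk over a fake stack; the addresses and values are made up for illustration, and a real unwinder reads goroutine stack memory directly.

```go
package main

import "fmt"

// fpUnwind models frame-pointer unwinding: each frame saves the caller's
// frame pointer at [fp] and the return address at [fp+8]. Here "mem" is a
// map from address to word, standing in for the goroutine stack.
func fpUnwind(mem map[uint64]uint64, fp uint64, max int) []uint64 {
	var pcs []uint64
	for i := 0; i < max && fp != 0; i++ {
		pc, ok := mem[fp+8]
		if !ok {
			break // ran off the modeled stack
		}
		pcs = append(pcs, pc)
		fp = mem[fp] // follow the saved frame pointer to the caller
	}
	return pcs
}

func main() {
	// Three frames: leaf at fp=0x100, its caller at 0x200, and the
	// outermost frame at 0x300 (whose saved frame pointer is 0).
	mem := map[uint64]uint64{
		0x100: 0x200, 0x108: 0x1111,
		0x200: 0x300, 0x208: 0x2222,
		0x300: 0x000, 0x308: 0x3333,
	}
	fmt.Printf("%#x\n", fpUnwind(mem, 0x100, 32))
}
```

This is also why the stack-copying and assembly-clobbering concerns above matter: if a saved frame pointer isn't adjusted when a stack moves, the chain walk reads garbage.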

gopherbot pushed a commit to golang/benchmarks that referenced this issue Feb 14, 2023
This change does a lot at once, but it's mostly refactoring. First, it
moves most of the profile abstraction out of benchmarks/internal/driver
and into a new shared package called diagnostics. It also renames
profiles to diagnostics to better capture the breadth of what this
mechanism collects. Then, it adds support for turning on diagnostics
from configuration files. Next, it adds support for generating
additional configurations to capture the overhead of collecting
diagnostics, starting with CPU profiling. Lastly, it adds support for
the new Trace diagnostic.

(This change also fixes a bug in go-build where Linux perf flags weren't
being propagated.)

In the future, core dumps could easily be folded into this new
diagnostics abstraction.

For golang/go#57175.

Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754
Reviewed-on: https://go-review.googlesource.com/c/benchmarks/+/459095
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@mknyszek commented Feb 16, 2023

2023-02-16 Sync

Attendees: @mknyszek @aclements @felixge @nsrip-dd @prattmic @dominikh @thepudds @pmbauer @dashpole @rhysh

  • 468301: runtime: delete gentraceback
    • Austin: Needs more testing.
    • Austin: Nice things to do as a result, listed in the issue. e.g.
      • Simpler defer processing
      • CPU profiles have a low limit on frames it'll capture.
        • Iterator makes this much more tenable to fix.
      • Years-old squirrely bug in the race detector.
    • Felix: I’m happy to look into testing using perf, but I’m not sure when I can get to it.
    • Rhys: If there are more frames than you want to record, could you add some context by including N outermost frames and M innermost frames. Maybe a “runtime._Elided” frame in the middle.
    • Michael P: We’ve thought about doing that for panic tracebacks.
  • 463835: runtime: frame pointer unwinding for tracer (Felix Geisendörfer): WIP, but almost ready for review
    • Are slight stack trace differences acceptable?
      • Michael K: I think that’s fine. As we move toward letting people parse the format, I think lining up traces with stacks from other sources could become more of a problem.
      • Felix: The current patch passes most of the tests of tracebacks in traces.
    • Should it use an unwinder interface similar to austin’s patches?
    • Could systemstack be changed to push frame pointers? Otherwise the caller frame is lost. Naive attempts to make this change caused crashes.
      • Austin: Yes please.
    • Weird issue with syscalls on BSDs losing a frame.
      • Austin: That’s probably lazy assembly.
      • Felix: Another option is to only enable FP unwinding on Linux for now.
      • Austin: As long as it works on Linux, Windows, and Darwin I’m happy.
    • Cgo unwinders
      • Austin: It’s fine to take the slow path if the current goroutine has cgo frames and there’s a cgo unwinder.
    • Felix: I got inlining to work (when traces are finalized). Benchmark numbers are still holding.
  • Michael K: Once all of the backtrace stuff is settled, I want to try using the monotonic clock (nanotime) rather than CPU ticks.
    • Nick: Could you record nanotime at the beginning of a batch and then CPU ticks after that?
    • Michael P: To do that safely, you’d need to know when you migrate CPUs. Linux’s restartable sequences can get you that.
    • Michael K: There might not be a performance gap between nanotime and cputicks.
    • Austin: If there’s even a performance gap, you could push more of the nanotime computation into the trace reader.
$ benchstat -col '.name@(CPUTicks Nanotime)' /tmp/bench
goos: linux
goarch: amd64
pkg: runtime
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    │  CPUTicks   │              Nanotime               │
    │   sec/op    │   sec/op     vs base                │
*-8   10.75n ± 0%   16.11n ± 0%  +49.88% (p=0.000 n=20)
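Nick's batch-timestamp idea above could be sketched as follows: record a (nanotime, cputicks) pair when a batch starts, store only raw tick counts per event, and convert to nanoseconds in the trace reader. The tick rate here is invented, a real reader would calibrate it from paired timestamps at both ends of the batch, and, as Michael P notes, doing this safely also requires knowing when the thread migrates CPUs.

```go
package main

import "fmt"

// batch pairs a wall-clock-style timestamp with a TSC reading taken at
// the same moment, so per-event tick counts can be converted back to
// nanoseconds by the trace reader.
type batch struct {
	baseNanos  int64   // nanotime() captured when the batch was started
	baseTicks  int64   // cputicks() captured at the same moment
	ticksPerNs float64 // calibrated tick rate, e.g. 3.0 for a 3 GHz TSC
}

// eventTime converts an event's raw tick count into an absolute
// nanosecond timestamp relative to the batch's base reading.
func (b *batch) eventTime(ticks int64) int64 {
	return b.baseNanos + int64(float64(ticks-b.baseTicks)/b.ticksPerNs)
}

func main() {
	b := &batch{baseNanos: 1_000_000, baseTicks: 500, ticksPerNs: 3.0}
	fmt.Println(b.eventTime(500)) // at the batch start
	fmt.Println(b.eventTime(800)) // 300 ticks = 100ns later
}
```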
  • runtime: copystack doesn't adjust frame pointers on arm64 · Issue #58432 Felix Geisendörfer
    • It was relatively easy to fix once I understood what was going on, but there appear to be dragons there.
    • Boolean in the runtime does a double check of FPs on stack copies.
    • Would like to treat arm64 as a separate issue, but I plan to get to it.
  • 460541: runtime: reduce sysmon goroutine preemption (Felix Geisendörfer)
    • Michael P: There are likely issues here with special cases in the scheduler. Not sure they're easy to fix.
  • cmd/pprof: macOS 12.6.1 (M1) profile overcounts system calls (again) #57722 (Felix Geisendörfer)
    • Michael P: C reproducer and handing off to Apple (if it works) seems like a reasonable next step. No guarantee we'll get a fix though.
  • proposal: runtime: add per-goroutine CPU stats · Issue #41554 (Felix Geisendörfer)
    • Felix: Initial justification was along the lines of billing, which seems better served by pprof. Then it shifted to fast control loops to throttle users. It seems better to have scheduling priorities, but barring that it seems good to let user space do priorities.
    • Michael P: We’ve been discussing having tracing that’s cheap enough to have on all the time, and a parsing library. Maybe a user could do this by enabling tracing and parsing their own trace. Is this generally the right approach to user throttling at all?
    • Rhys: I think a cheap trace that can be parsed in the app is good and flexible. I’m not sure per-goroutine stats is the right approach. E.g., if I use the net/http client, there are a bunch of goroutines involved that I don’t control but I want to understand the latency of.
    • Felix: One trade-off of the trace route is the latency of reading your own trace.
    • Rhys: It would be useful if the app could say, “I need a ragged barrier ASAP and I’m willing to take some performance hit.”
    • Michael K: The other complication is how fast we can make the parsing. That might add unacceptable latency.
    • Felix: I think the “explain analyze” case is not the best example. The most difficult is trying to throttle a user of the database that’s doing something you don’t want. In that case you don’t know ahead of time, so you’d be doing the ragged barrier all the time.
    • Michael P: I think that’s a good argument for actual priorities in the scheduler. If you have some background goroutine watching for bad behavior, that might not get scheduled if there’s bad behavior.
    • Austin: Swirling around problems that people have been thinking about for decades. Would love to see a summary of the current state-of-the-art is here.
    • Michael K: Probably only OS APIs.
    • Austin: That's not a bad thing. If it's a good API, we can consider replicating it.
  • AIs
    • Michael K: Writing down trace requirements in earnest
    • Michael K: Testing for x/debug
    • Michael P: Need to review Austin's CL stack.
    • Michael P: debug/gosym proposal.
    • Felix: Clean up the tracer FP unwind patch (for amd64) to get it ready for review.
    • Austin: Try to keep moving along gentraceback stack. Think about test-hook-coverage aspect.

@mknyszek commented Mar 3, 2023

2023-03-02 Sync

Attendees: @mknyszek @prattmic @felixge @nsrip-dd @aclements @thepudds @rhysh @bboreham

  • Michael K: I'm 70% of the way to a trace v2 (producer, consumer, trace format), and 40% of the way to writing it up.
    • Most of the time is being spent detangling the existing tracer, documenting it, and using that to justify next decisions. Hopefully I'll have a draft to share before next time.
    • [Michael K proceeds to go into way too much detail about this. Highlights below. A public document will follow.]
      • Let's use the system clock (e.g. clock_gettime) instead of RDTSC (for a number of reasons).
      • There are a very small number of places where you really need to understand the exact order of events. The current tracer takes advantage of that and I believe we need to retain this. Timestamps aren't enough.
      • Attach traces to Ms, not Ps. There’s a lot of complexity around GoSysExit racing with trace start. Thinking about ragged start and making the parser robust to that.
        • This choice forces us into avoiding a stop-the-world.
      • Trace binary format ended up being more about consumer efficiency than producer efficiency, but still more efficient on both sides.
        • Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings.
        • Trace events are sequences of 4-byte words whose internal structure respects byte boundaries and field alignment, to allow encoding/decoding events to just be memcpys and state management.
        • Using Felix's 280 MiB trace breakdown as a motivating example. By my calculations the current design would use around 10% more. Personally that seems acceptable for the other gains.
        • Every G event has an explicit G ID, but it's derived from a "G context" event. G IDs are also compressed.
    • Michael K: We could make the stack table faster by only checking the hash instead of an exact match. Small chance of error.
    • Rhys: Let's be cautious about making sure that traces actually work.
    • Michael K: That's a good point. We should put an explicit bound on the likelihood of error. If it's astronomically small, is that fine?
    • Rhys: Astronomical is good.
    • Rhys: Would the new trace format still enumerate every goroutine? Currently can get stuck in many-millisecond STW waiting for tracing to enumerate all goroutines.
    • Michael K: My plan was no. Goroutine profiles if you want that?
    • Rhys: That's good. Yeah, you should be able to correlate a goroutine profile with a corresponding STW event in a trace. Happy about no STW in general for traces too.
    • Rhys: RE: correlating things with traces, do we want to keep things orthogonal in general? Thinking about CPU profile events in traces.
    • Michael P: I see where you're coming from in that you might want just the CPU profile events from a trace (with timestamps) and it's weird to get the whole trace and throw most of it away. We discussed having an API for configuring the trace and which events get emitted, so that might be a good place for that.
    • Austin: There's a relevant issue about making CPU profiles more configurable as well, so maybe that's a good place for it too?
    • Michael P: I think there are a lot of API questions here. Do you configure CPU profile in tracing at the CPU profile side or at the tracing side? The most natural way sounds like the tracing side because that's your actual output format, but I'm not sure. And then it gets complicated if you turn on CPU profiling in the tracing API and then you separately turn on CPU profiling, is that allowed? Right now you can't turn on profiling twice. And that's even more complicated, if we let you figure the sampling rate and they're not the same.
    • Rhys: One of the difficulties I've had in using execution traces and CPU profiles at the same time, even though the CPU profile doesn't exactly stream its output while it's going, is that it's tricky to juggle two different output formats while trying to put them into a single zip file to upload to blob storage. A single buffer would be handy.
    • Michael P: A single buffer is ideal, but we don't have a converter that could pull a CPU profile out of a trace. We're missing information.
    • Rhys: For one, we're definitely missing goroutine labels, though there's a patch out for that. We're also missing /proc/<pid>/maps for binary/symbol information.
    • Austin: It occurs to me that Linux perf basically starts with /proc/<pid>/maps.
    • Michael P: Perhaps we should also dump build information. We've been brainstorming about including this information for PGO.
    • Michael K: There's room for as much as we want at the beginning of the trace, basically, so I'm all for adding more there.
  • Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
    • Felix: Would the old trace format fit in the new parser?
    • Michael K: That was my goal. We'd have to do some retrofitting, but the old parser already exists. Caveat: parsing old traces would still have the same overall properties as the trace parser currently does.
  • Felix: Frame pointer unwinding patch for tracer is ready to review. It’s only amd64 for now and a bit rough around the edges. We should discuss what needs to be done before landing. Cgo is still missing, but I’m working on adding that.

@dominikh commented Mar 4, 2023

Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings

Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.

Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.

I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
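The pointer-free layout referred to above can be illustrated with a sketch like this: events store plain integers and indices into shared tables rather than pointers, so a slice holding millions of events contains nothing for the garbage collector to scan. The field names are illustrative, not gotraceui's actual types.

```go
package main

import "fmt"

// rawEvent contains no pointers: stacks and argument data are referenced
// by index into shared tables, which keeps GC scanning cost at zero for
// the (potentially enormous) events slice.
type rawEvent struct {
	Ts    int64  // timestamp
	Type  uint8  // event type
	G     uint64 // goroutine ID
	StkID uint32 // index into a shared stack table
}

type parsedTrace struct {
	events []rawEvent // pointer-free, so cheap for the garbage collector
	stacks [][]uint64 // indexed by rawEvent.StkID
}

func main() {
	t := parsedTrace{
		stacks: [][]uint64{{0x1111, 0x2222}},
	}
	t.events = append(t.events, rawEvent{Ts: 10, Type: 1, G: 7, StkID: 0})
	ev := t.events[0]
	fmt.Println(ev.G, len(t.stacks[ev.StkID]))
}
```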

@mknyszek commented Mar 6, 2023

Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings

Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.

It does not. It only cares about the initial state of all Ms (including goroutines running on them), and generally only mentions goroutines that actually emit events. For goroutines that aren't running, there are only two cases where we actually care about the initial state of a goroutine: whether it was blocked, or whether it was waiting. In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.

The trace still needs to indicate if a goroutine (and M) is in a syscall or if it's running. In the new design, this information is emitted together at the first call into the tracer by that M for that partition. The timestamp needs to be back-dated to the start of the partition. There's some imprecision with this back-dating but it's only relevant at the very start of a trace. The worst case is that a goroutine may appear to have been running or in a syscall at the start of a trace for longer than it actually was. The amount of imprecision here is bounded by the time delta between the global (serialized) declaration of a new partition and when an M has its buffer flushed and/or is notified (via an atomic) that tracing has started, which I expect in general to be very short and non-blocking. (We can also explicitly bound the time by telling the M what time it was contacted for a new partition.)

Note that the details above imply that when a new partition starts, a running M may have been in a tight loop and so hasn't emitted any events for the last partition, in which case we need to preempt it to have it dump its initial state. Generally, though, moving partitions forward doesn't even have to involve preemption.

Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.

I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.

That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
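The state inference described above can be sketched as a tiny state machine: the trace doesn't record every goroutine's state at a partition boundary, but the first event that mentions a goroutine implies what its state must have been. The event names and shapes below are simplified stand-ins for real trace events.

```go
package main

import "fmt"

// Simplified stand-ins for real trace event types.
const (
	evGoStart   = iota // goroutine begins running
	evGoUnblock        // goroutine transitions out of a blocked state
)

type ev struct {
	Type int
	G    uint64 // goroutine ID
}

// initialStates infers, for each goroutine, its state at the partition
// boundary from the first event in the partition that mentions it.
func initialStates(events []ev) map[uint64]string {
	states := make(map[uint64]string)
	for _, e := range events {
		if _, seen := states[e.G]; seen {
			continue // only the first event per goroutine is informative
		}
		switch e.Type {
		case evGoUnblock:
			states[e.G] = "blocked" // it must have been blocked to unblock
		case evGoStart:
			states[e.G] = "runnable" // it must have been waiting to run
		}
	}
	return states
}

func main() {
	evs := []ev{{evGoUnblock, 1}, {evGoStart, 1}, {evGoStart, 2}}
	fmt.Println(initialStates(evs))
}
```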

@dominikh commented Mar 6, 2023

In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.

That has two implications, however:

  1. goroutines that don't unblock during the trace will be unaccounted for
  2. the states of all goroutines can't be determined without looking at the entire trace

I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view? However that wouldn't really fit into an M-centric format…

That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).

I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)

@mknyszek commented Mar 6, 2023

That has two implications, however:

  1. goroutines that don't unblock during the trace will be unaccounted for
  2. the states of all goroutines can't be determined without looking at the entire trace

I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view?

Both of those things are good points.

Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated). At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.

However, I can definitely see that getting the information you describe has utility and we don't want to lose that. In the last meeting we discussed how goroutine profiles could be used to fill this gap. As a baseline, it should be fairly straightforward to correlate a goroutine profile's STW timestamp with a STW event in the trace. Taking that one step further, we could explicitly mention that the STW was for a goroutine profile in the trace. (In theory we could also dump the goroutine profile into the trace, like we do with CPU samples. I am not opposed to this, but I probably wouldn't do it to start with.)

You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable? Perhaps there's some use-case I'm totally missing. FTR, I fully recognize that we're losing something here in the trace, but I argue the net benefit is worth that cost.

Also I just want to disclaim the design details in the last paragraph: subject to change in the first document draft. :) That's just where my head's at right now. It may turn out that the per-M synchronization I have in mind is too complex.

However that wouldn't really fit into an M-centric format…

I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?

I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)

That's another good point. To be clear, I do plan to have an API with some level of abstraction and not quite just []byte-to-type. :) Events will be opaque and fields will be accessed through methods, so we have a lot of wiggle room. However, something like the Link field I think requires keeping the whole trace in memory, because you never know when someone might want to access an event from a long long time ago (though I haven't thought this through). In theory an accessor can be arbitrarily complicated and even re-parse the trace to find the event, I suppose. :P

My general hope and expectation is that the vast majority of users should never have to look at the API at all, and instead rely on tools built with it. And those that do use the API don't need to understand the file format, just the execution model it presents (which I think is somewhat unavoidable).

@dominikh
Copy link
Member

dominikh commented Mar 6, 2023

Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated).

I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?

At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.

One use case of looking at execution traces as they are now is debugging synchronization issues. Imagine having an N:M producer/consumer model using goroutines and channels, and we're debugging why producers are blocking. The reason might be that all of the consumers are stuck, which is only evident if we can see them be stuck. If they're already stuck at the beginning of the trace then they would be invisible in the new implementation.

More generally speaking, a lot of users aren't interested in the per-P or per-M views and instead want to see what each goroutine is doing (see also the per-goroutine timelines in gotraceui.) It turns out that per-G views are useful for debugging correctness and performance issues in user code and that traces aren't only useful for debugging the runtime.

You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable?

In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.

At least this would give people the choice if they want to tolerate STW for more detailed traces.

However that wouldn't really fit into an M-centric format…

I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?

Probably, yeah.

@mknyszek
Copy link
Contributor Author

mknyszek commented Mar 6, 2023

I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?

It's not quite that it didn't pan out and more that it just doesn't work with a per-M approach given other design constraints.

The ragged barrier I mentioned in an earlier design sketch is the forEachP one which is ultimately still P-focused. Part of the reason I want to switch to a per-M approach is to remove the GoSysExit complexity that comes from the fact that goroutines can in fact run without Ps sometimes. That complexity is part of the event's semantics, so it tends to leak everywhere.

A per-M approach can side-step a lot of that complexity, but it means we need a way to synchronize all Ms that doesn't involve waiting until the M gets back into the scheduler. What I wrote above is a rough sketch of a proposed lightweight synchronization mechanism that most of the time doesn't require preemption. I think that in general we can't require preemption in a per-M approach if we want to be able to simplify the no-P edge cases and also get events out of e.g. sysmon, which always runs without a P. (EDIT: D'oh. I keep forgetting that the current tracer can indeed emit events without a P. So that's really more just that we don't currently have a great way of tracking Ms in general. I would like to add more explicit M-related events. The GoSysExit point still stands because it races with a trace stop-the-world, which is the main source of complexity. If we synchronize via Ms that goes away.)

(In effect, I am proposing to shift the GoSysExit complexity somewhere else, but I hope that in the end it will be less complexity overall because the M synchronization details can probably be written in a way such that the details don't leak as much.)

@dominikh
Copy link
Member

dominikh commented Mar 6, 2023

An aside that might steer you closer to a per-M approach: I tried adding per-M timelines to gotraceui using the current format and found it impossible due to the current event sorting logic. I ran into scenarios where a P would start on an M while the M was still blocked in a syscall.

@mknyszek
Copy link
Contributor Author

mknyszek commented Jul 10, 2023

Thanks! The source of that is the notetsleepg in gcBgMarkStartWorkers which calls entersyscallblock. This case is kind of interesting, because it is a blocking syscall (the thread blocks in FUTEX_WAIT but the goroutine gets descheduled and its P is released). It's just one invoked by the runtime itself.

I'd argue maybe this one is working as intended? I see two possible problems here:

  • It is kind of noisy for the user debugging their own code to see this extra syscall that's entirely runtime-produced.
  • There are frames skipped which make it very unclear what the actual source of the syscall is. Maybe the skip count is the right amount for any syscalls going through the syscall package, but it's not right for this case. This stack trace makes it appear as if the GC somehow represents a syscall but it's just a side-effect.

Maybe we capture more stack frames? That makes noisiness potentially worse, but in this case I think it would help a lot to identify "OK it's just the GC spinning up workers and waiting for them to start" from the extra stack frames. TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem and I'd be inclined to hide fewer frames in the trace itself (and leave it up to the UI to hide things), but I'm biased since those frames are ~always useful to me personally. :)

(From a performance perspective, I doubt including those frames would hurt much. Stack traces are likely to be just as dedup-able and the extra few frames would only increase trace size slightly.)

@dominikh
Copy link
Member

Based on your analysis I agree that this is working as intended. I'm also in favor of capturing more stack frames. With a proper stack, this event might even be useful to users, since it's a possible source of latency. And if it's too noisy, the UI should allow filtering it away.

TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem

It wouldn't be the first time that Gotraceui analyses stack traces, either. I already do so to walk backwards out of the runtime for events such as blocking channel sends (the user doesn't care that this happens in runtime.chansend1).

@mknyszek
Copy link
Contributor Author

mknyszek commented Jul 21, 2023

2023-07-20 Sync

Attendees: @mknyszek @prattmic @rhysh @bboreham @dominikh

  • Michael K: Continuing to make progress, but nothing new to talk about.
  • Michael P: FYI the tree is going to be reopened for development imminently.
  • Rhys: Filed some bugs recently about lock contention. Filed an issue about a runtime/metrics metric to measure runtime-internal lock contention. Michael P mentioned that it would be nice to report runtime-internal lock contention in the contention profile.
  • Dominik: I briefly wondered about sync.Mutex and execution tracing and if there could be a (cheap) unique ID for mutexes to analyze contention in detail (i.e. find goroutines blocked on the same mutex).
  • Michael K: They do have a unique ID: their address.
  • Dominik: The problem with the address would be address reuse.
  • Michael P: Addresses could be reused, but you're tracking contention on small timescales, and unless a GC occurs, addresses won't be reused.
  • Rhys: There is sometimes lock contention in the http2 implementation: something grabs the lock for the whole transport, then does a blocking write to the http2 connection, but blocks everything (seen for up to 15 minutes). These are just bugs, but the way I've found them is via goroutine profiles. The unique ID would be helpful to filter down which goroutine owns and/or is impacted by the lock.
  • Michael P: What does it mean for a goroutine to hold a lock? Locks can be released by other goroutines.
  • Rhys: Would still help for a lot of cases.
  • Michael P: My old team's locking was complicated. I made a wrapper for sync.Mutex that enforced unlocking by the same goroutine, and there was exactly one case found in the relatively large codebase where a different goroutine released the lock.
  • Rhys: It's useful to see how goroutines interact. I see this information when mutexes are contended, but not when they're uncontended. I wonder if it would be useful to occasionally sample uncontended locks as well to understand the structure of the program when it's not overloaded.
  • Michael P: That's an interesting idea. Would you imagine that it would be integrated into a contention profile? (Contention is zero on a sample.) Or would it be a separate profile?
  • Rhys: I was thinking of execution traces, e.g. this goroutine would have unblocked this other goroutine if they were executing differently. Right now I get no data about locks that might become contended until they are.
  • Michael P: In the limit, it would be nice if every lock and unlock was a trace event, but that's very expensive.
  • Dominik: It would probably be too expensive to emit lock/unlock events for uncontended locks, right?
  • Rhys: Maybe we can do 1% or 1 per million of lock/unlock pairs.
  • Michael P: Uncontended lock is already pretty fast: CAS and you're done. But maybe a (non-atomic, per-P) counter or something is fine with a slow path if that counter gets big.
  • Rhys: Also applies to channels.
  • Dominik: And the unlock checks a flag to tell if the lock was sampled?
  • Michael K: I think that flag would have to be on the mutex.
  • Michael P: (a) It's not immediately obvious that we should sample the unlock (but that would be a nice property). (b) There are unused state bits in the mutex state. We probably couldn't use it for a counter, but we could probably use it for a flag.
  • Rhys: I think this nice property is important for execution traces. "Had this unlock not happened, this lock would not be able to proceed."
  • Bryan: The mention of RWMutex made me think about getting data on how useful RWMutex actually is. For example, how often the program is actually taking advantage of the read lock. I often wonder if the implementation cost is worth it.
  • Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.
  • Michael K: You can imagine that the lock implementation detects it by writing down the G.
  • Michael P: The problem here is that you don't have to unlock from the same G. You can do weird things like handing off lock ownership and (surprisingly) sidestepping deadlocks. This makes detection more difficult.
  • Dominik: Would putting the address in the trace be difficult?
  • Michael K: I don't think it would be that difficult. It would make the trace bigger, but you can have a side-table.
  • Michael P: You could imagine a block context event or something that includes that address and comes after the block event.
  • Dominik: I'm not sure how much address reuse actually matters. The address will refer to the same mutex for lock and unlock pairs, and if two goroutines block on the same address'd mutex, it has to be the same mutex, too. Sampling all locks/unlocks would be a different story, however.
    • Although I'm not sure how this interacts with stack moving/the elusive "moving GC".
  • Rhys: Is there data that would be helpful for you for debugging the three issues I filed (61426, 61427, 61428)? In other words, is there any way we could improve runtime diagnostics that would make those issues easier to resolve?
  • Michael K: I think runtime-internal mutex contention might be sufficiently covered by CPU profiles? Because they lock up the whole thread, so samples land there. This isn't true for sync.Mutex.
  • Rhys: Doesn't capture details in certain cases, e.g. channel lock contention. Can't distinguish individual mutexes in a profile. Filed an issue about this a while ago. It would be useful in this case too to have some identifier for the lock.
  • Michael P: The sync.Mutex contention profile would benefit from a unique identifier or something to distinguish whether it's one mutex or 5. Possibly via a label.
  • Rhys: The contention profile accumulates over all time, so attaching labels to it means that until we figure out something clever, that also leaks memory over all time. So we can't have labels.
  • Michael P: sync.Mutex can appear on a CPU profile because the contended locker has to call into the runtime scheduler. It's just not quite as expensive as calling futex.
  • Michael P: I think the problem is slightly less bad for mutex profiles, but once you turn them on they keep everything forever. Could have a diff profiling mode.
  • Michael K: Could a diff profiling mode have a label?
  • Rhys: An execution trace is start and stop, but you can't get goroutine labels from it, and you can't get allocation events. Well, you kind of can with the HeapAlloc event.
  • Michael K: That's super interesting; I think HeapAlloc being a good sampler is sort of accidental, but that's a good use-case. I'm not certain that was intended…
  • Michael P: Hyrum's Law!
  • Rhys: We do collect a stack every half-megabyte to put into an allocation profile. If we put that into an execution trace too, that would be fine. Over time we could see that a certain type of request allocates a certain type of memory looking over many samples.
  • Michael K: Wrote an allocation tracer similar to the execution trace that emits an event for every alloc and free. Had about a 30% allocation overhead, but the traces were huge.
  • Dominik: There's a GODEBUG flag to trace all allocs, isn't there?
  • Michael K: Yeah, but it's really, really slow, since it just writes to STDERR.

@tbg
Copy link

tbg commented Jul 24, 2023

Just a small comment,

Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.

https://github.com/sasha-s/go-deadlock does this (at runtime) by recording and comparing the stack frames seen for each (manually instrumented) mutex (very expensive, we only use it in nightly stress testing). Of course this requires observing a re-entrant lock acquisition. It seems difficult to do this sort of thing statically, but even doing it cheaply enough to have it always on would be helpful.

@prattmic
Copy link
Member

prattmic commented Jul 24, 2023

@tbg We discussed detection briefly. It is unclear to me how to do robust deadlock detection given that lock and unlock are not required to be on the same goroutine. e.g., consider:

  1. Goroutine 1 calls RLock.
  2. Goroutine 2 takes "ownership" of this critical section.
  3. Goroutine 1 calls RLock again.
  4. Goroutine 2 calls RUnlock.

Step 3 naively looks like a deadlock, but as long as goroutine 2 is guaranteed to make progress it is not. How does go-deadlock handle this situation?

@tbg
Copy link

tbg commented Jul 26, 2023

It gives a false positive:

https://github.com/sasha-s/go-deadlock/blob/5afde13977e624ab3bd64e5801f75f9e8eb1f41b/deadlock.go#L284-L300

It's difficult to handle locks that pass between goroutines. Which is a shame, because it's also pretty rare.

@mknyszek
Copy link
Contributor Author

mknyszek commented Aug 9, 2023

2023-08-03 Sync

Attendees: @mknyszek @rhysh @nsrip-dd @felixge @bboreham @dominikh

  • Michael K: First pass implemented, no parser yet. (CL) Veered away from doc a little based on learnings from implementation. Couldn’t get around a brief STW on trace start. Advancing the partitions or stopping the trace doesn’t need STW.
  • Rhys: Can you quantify “brief STW”?
  • Michael K: It stops the world, prevents sysmon from doing its thing, sets global trace.on value, emits an event per P but this might not need to be done during STW. Briefly spins waiting for things to exit syscall. Couldn’t figure out how to get around that.
  • Rhys: Syscall stuff is a wild card, but otherwise it sounds like less STW than what we see from a GC?
  • Michael K: Yeah, it’s extremely brief. Getting attention of every G in a non-racy way was difficult and changed the design a bit.
  • Dominik: Will STW be recorded?
  • Michael K: There's currently a bug that it doesn't, but it easily can.
  • Rhys: Some of the applications I collect traces from have long GCs, and tracer can't start or stop during a GC cycle, so I miss things, especially when I'm collecting only 1 second traces. Does the new tracer need to wait for a GC to end?
  • Michael K: It does, but that restriction can be easily lifted in the future. The trace just needs to contain more information about what's currently in progress.
  • Felix: Can waiting for syscall exit could cause a very long stop-the-world since goroutines might be blocking on I/O?
  • Michael K: Ah, no. To clarify, the syscall exit is just making sure that there are no threads in a very short critical section on the syscall exit path. Threads are guaranteed to drain from that critical section.
  • Rhys: Could the spinning for that wait still be a problem, say if that thread gets descheduled?
  • Michael K: It could, but it's very short (hundreds of nanoseconds at most), and working around that would require the tracer to be wait-free. Though, any thread writing to the trace is indeed wait-free. The core of the design is that every thread holds its own seqlock (thread-local atomic counter) while it's writing events. The tracer then makes sure nothing is writing to the trace by ensuring the counter is not observed to be an odd number (indicating that it's in a critical section).
  • Michael K: As an aside, I think I may have also mitigated some of the issues with the CPU profile samples being inserted into the tracer only in a best-effort manner. I basically just spun up a different goroutine because it was simpler. There may still be some data loss across trace generations (partitions) since the CPU profile buffer isn't forcibly flushed on each generation. We can probably do a bit more to minimize that, but on the whole the loss should be small.
  • Rhys: Does this mean we also get M information for each profile sample?
  • Michael K: Yup!
  • Nick: You mentioned you're planning on writing a parser. There was some discussion of fixing up timestamps. Have you reached that point yet?
  • Michael K: Not yet, but I expect it to be fairly straightforward. The algorithm is going to treat event partial orders as the source of truth and try to treat timestamps as annotations. It can't in general, but my goal is to keep it as close to that as possible. The timestamp fixup will just happen when timestamps don't make sense with respect to the partial order.
  • Dominik: Will we end up with two parsers, one of the old, one for the new?
  • Michael K: My goal was no, but I'm starting with a parser for the new one and then I'll go back and retrofit the old format.
  • Dominik: Can fixing up timestamps introduce severe inaccuracies?
  • Michael K: No more inaccurate than the timestamps being out of order in the first place! As a side note, I plan to write a wire format spec at some point.
  • Dominik: Would it be possible for two events to have the same timestamp?
  • Michael K: Yeah, it'll be possible.
  • Rhys: It's useful in order to identify events uniquely. But that's fine actually, a good alternative is the byte offset in the trace which is what I already do.
  • Michael K: Yeah, I think you'll have to keep doing that.
  • Rhys: One interesting property about the current trace is that CPU profile samples are put in the trace buffer a bit after the stack is collected. So if you look closely in the trace you can see it shifted right quite a bit and get an idea of how long it takes to write a trace event.

@mknyszek
Copy link
Contributor Author

mknyszek commented Aug 20, 2023

2023-08-17 Sync

Attendees: @mknyszek @aclements @felixge @nsrip-dd @rhysh @dominikh

  • Austin: Wondering if anyone had seen runtime.step or runtime.pcvalue in their profiles, re: PC value data encoding experiments.
  • Rhys: I have an app that does a lot of copystack. 6% in runtime.pcvalue+runtime.step in the profile with “the most pcvalue”
  • Austin: That's where I'd expect to see it hit the hardest. I suspect it’s pretty low on average, but some apps hit it hard.
  • Rhys: This app starts a lot of goroutines that grow their stacks.
  • Michael K: I’ve been working on the trace parser. I’m getting to the partial ordering part of the parser. I got all of the structural elements out of the way. I plan to start with just a streaming API and to put it in x/exp. What would be the best way to have this parser both exist publicly and use it from within std? Exp isn’t great for this because we don’t want to vendor exp. Maybe we could vendor x/debug?
  • Austin: What’s so bad about vendoring from exp?
  • Michael K: I suspect Russ will be like “we do not want to depend on exp in any official capacity”. I do want this to live in two places so people can start using it, with the knowledge that the API might break. The long-term plan is to have a stable API, but the devil’s in the details of how events look.
  • Michael K: Leaning toward opaque value types. I want to leave the option open for lazy decoding. E.g., switch over event type, then say “event.GoStart()” and that returns the event as a GoStart.
  • Rhys: I will note there's a lot of stuff vendored into cmd.
  • Felix: Lazy decoding sounds good from the perspective of online processing.
  • Michael K: Right now it’s a completely streaming API, but I want to be able to derive one of these streamers from a ReaderAt. Mmap the entire trace, and the reader either does a buffered read or keeps just offsets to minimize the number of copies that happen. We could keep references into a mmapped byte slice.
  • Rhys: For the programs I have that read execution trace data, often they first sort it by goroutine. I don’t know if there’s a way to do that that’s still streaming without loading the entire thing into memory. If there was a way to have multiple iterators without having to reparse the entire file…
  • Michael K: Because of partial orders across Ms, things have to come out in time order and obey the partial order. If you’re just streaming it out, the parser already has to have state for every G and P, you could process it in total order, but copy the events out to separate output streams for each G, etc.
  • Rhys: Could we have an API that allows filtering over the streaming data, without discarding all of the work it does to figure out where the chunk boundaries are? But maybe calculating the partial orders is expensive enough that you have to keep everything in memory.
  • Michael K: No matter what you have to buffer an entire partition at a time.
  • Felix: Re sorting events by G, how do you handle events that refer to two goroutines?
  • Rhys: I look at the goroutine that emitted the event. E.g., if one goroutine unblocks another, we have the Unblock on the source and a GoStart on the target.
  • Michael K: I was expecting the parser API to be pretty low level. E.g., for go tool trace, I expect this to sit below the current high-level allocation-heavy work it does (at least to start). If you want to do higher-level analysis on top of that, it should be possible. If you want to keep more state around, that should be up to the consumer.
  • Austin: I think there are asymptotic issues at play. You can write processed streams out to disk, and I think that might be necessary to scale up arbitrarily.
  • Rhys: Pprof is an example of writing to disk. E.g., you can process a pprof with a focus flag and write out a much smaller profile that you can then work with further. Maybe the first time you run “go tool trace” it rewrites the trace in sorted order and flags it as already-sorted. Or add filtering in the future. Going to disk is fine, but I want to keep things in the same format so the ecosystem keeps working.
  • Dominik (Chat): what I found is that no order is convenient for all analyses of the events. All of G, P, and M orders are useful
  • Michael K: That might be doable. I’m reluctant to have a second format.
  • Michael K: Filtering by G is interesting… Early on, Russ said to me that a trace is most useful when you have the full context.
  • Austin: If people have a lightweight way to refer to events, is that useful? It wouldn't have all the order information (so you couldn't seek the streamer to that point, though you could do that but it would be much heavier-weight).
  • Many: nod
  • Rhys: The equivalent of focus for an execution trace is "I have a 5min execution trace, but I just want to see this 30 second window starting here." One of the things you can do with go tool pprof is you can have 100 profiles, but all these samples are from the same machine so you can merge them all together by adding a label. The equivalent in the execution trace might be to process a trace and add extra information, say what request a goroutine was handling, and write that back out to be processed by other tools.
  • Michael K: For focusing on a time window, partitions get you that. Slicing a trace is very easy now. You still have to go over the trace to figure out the time windows (which is very fast), but then you can copy out the partitions you need. Long-term I proposed the trace itself could write out a partition table at the end so you don’t have to do that yourself, but I’m parking a bunch of features so I can just get this done. (Also self-description.)
  • Felix: On our end we're creating tasks for every request that we're tracing as part of DataDog's distributed tracing, and we're using the logging feature to log the span ID (request ID). We've been able to build this ourselves but it would be nice to have this for all the tools. We're currently considering goroutines involved the way pprof does (inherited from parents). Having a way to slice-and-dice based on these user-level log messages seems useful.
  • Michael K: Is the issue with the current API that you can have tasks, but you can’t associate your own data with it? If you could associate arbitrary data with the task, would that be helpful?
  • (A task just has a single string name. You can then log with key/value pairs.)
  • Dominik (Chat): Yeah. Right now, the only custom data we have is logging strings, and the association is implicit, and really bound to a G, not a task. Arguably it could also be useful for regions.
  • Felix: I don’t necessarily need the information associated at task creation time, as long as tools agree on how to process this.
  • Rhys: More on annotations, when you make an HTTP request you don't know what the method is, the account ID, etc. They're things you might want to associate with a task but you don't know until later. There are also places where interesting things happen and it's not possible to annotate with tasks and regions. There's DNS requests that happen, for instance. I think it's powerful to be able to add instrumentation after the fact and be able to write that out and use that with all the existing tools.
  • Michael K: Writing out an updated trace should be straightforward. I'm not sure I follow the instrumentation after the fact.
  • Rhys: The idea is to look for events with a particular stack that look a particular way.
  • Michael K: This annotation seems somewhat fragile, and so I was thinking of std packages pushing trace events into the execution trace.
  • Rhys: It is fragile, but it only changes once every six months (at most) and it's easy to update. It also depends on having more than just one stack frame of context.
  • Michael K: Yeah, I abandoned that idea. It's not actually meaningfully faster.
  • Rhys: Also sometimes I want to go back in time and figure out how many resources (CPU time, memory allocations) it took to figure out that we even had an HTTP request to handle.
  • Dominik (Chat): We should put runtime/metrics into the trace.
  • Michael K: Agreed. Michael P recently suggested the same thing.
  • Michael K: Felix, you mentioned issues with worker pools. Couldn't it be instrumented with regions and contexts passed in? You don't necessarily need the library to instrument.
  • Felix: I consider all the various background goroutines as part of that, and that's basically impossible to instrument.
  • Nick (Chat): The main missing spot is non-blocking channel operations. If the worker pool is also instrumented, it doesn't matter. But otherwise you only have the correlation if there's some synchronization that points to the worker goroutine.
  • Dominik (Chat): Nick, but regions are annotated with the task from the context, isn't that sufficient?

@mknyszek
Copy link
Contributor Author

2023-08-31 Sync

Attendees: @mknyszek @rhysh @bboreham

  • Most people were out. Michael K and Rhys discussed a few things, but there isn't much to note down that wasn't mentioned later. Bryan had to leave early.

@mknyszek
Copy link
Contributor Author

2023-09-14 Sync

Attendees: @rhysh @prattmic @mknyszek @felixge @nsrip-dd

  • https://go.dev/cl/527875: trace: add experimental execution trace parsing API
  • https://go.dev/issue/62627: proposal: x/debug/trace: add package for parsing execution traces
  • https://go.dev/issue/57071: runtime/metrics: add a way to measure runtime-internal lock contention
    • Have code, including for Michael P’s suggestion of getting call stacks
    • How to test??? There’s no elective blocking on runtime-internal locks, as there is in sync.
  • OTel. Adding profiling as a supported data type

TODO(mknyszek): Go back and fill in the notes from the recording.

@mknyszek
Contributor Author

2023-09-28 Sync

Cancelled due to Gophercon US.

@mknyszek
Contributor Author

2023-10-12 Sync

Attendees: @mknyszek @prattmic @felixge @rhysh @nsrip-dd @bboreham

  • Michael K: Working on fixing bugs in the Go 1.22 tracer. Based on how this goes we'll decide whether it will be enabled by default in 1.22. Should be possible to opt out either way. The existing big test is very good at finding problems.
  • Felix: We’d be interested in contributing to the new parsing library so it can read < go1.22 traces.
  • Michael K: Parser source of truth will be in stdlib and copied to /x/ on a regular basis. Vendoring stuff into the stdlib the other way around is more painful.
  • Felix: Sharing go1.21 trace STW analysis results from Datadog services.
    • Interested in flight recorder. Tried to build a version of it with the current trace. While we're doing that we were looking to see if we could ship it. The concern was STW pauses. How bad are the pauses? For most services it's OK. We saw on average between 5 and 40 ms. 15k - 200k goroutines. Proportional to the number of goroutines. 300ns per goroutine approximately. Internal service using 2M goroutines, that one took 530 ms. The removal of pauses in the new tracer will matter. Will be nice to get some real-world experience with flight recording.
    • Michael K: Only one STW pause on starting tracing and it's "constant" time.
    • Rhys: Out of curiosity, is anything in the runtime proportional to the number of Ms?
    • Michael K: Not really. There are some things that are explicitly per-M, but the GC and scheduler only really care about running Ms. The burden is usually on Ms without Ps to synchronize with the runtime.
    • Michael P: Are you by any chance increasing the max thread count?
    • Rhys: I haven't worked with programs that would exceed the limit since before the limit existed.
  • Rhys: https://go.dev/issue/63340 “I am also open to deleting /gc/pauses:seconds entirely.” (Is this the right group to discuss runtime/metrics deprecation posture?)
    • Thanks to Michael P for looking into it. I like the split you've come up with.
    • Michael K: This is not too hard of a case because we can offer an off-ramp for the metric. If the metric has a clear use-case, we don't want to remove it without offering an alternative. Deprecation and waiting a few releases is fine. Users have an opportunity to determine if the metric doesn't exist anymore.
    • Michael P: The documentation as it exists is not great for discovering metrics or whether they're deprecated. Also, should there be a GODEBUG flag for reenabled deprecated metrics?
    • Michael K: RE: documentation, I was thinking of listing all deprecated metrics in perpetuity. We shouldn't reintroduce a dead metric with a different meaning.
    • Michael P: One issue is that you have to go to the docs at all to identify that a metric is deprecated. There's no vet tool.
    • Rhys: Two ideas. The first idea is to let the metric be there but present no data. The second idea is to have a metric to query whether any deprecated metrics are being used.
    • Michael K: We could do both?
    • Rhys: GODEBUG is interesting because the app owner can fix their program without fixing the code of a library, which might be hard for them to do.
    • Michael P: I'm not thrilled about the zero value idea because there are quite a few metrics that stay zero for a long time. We could use a max uint value, but it's still awkward.
    • Felix: +1 to zero being a bad value to indicate deprecation. The metrics we released into Go 1.21 were originally broken and all zeroes, and so that would be indistinguishable from deprecation.
    • Rhys: Is the Go version in the go.mod file something that's available for this purpose? Basically, you can update your Go toolchain but metrics would retain their old names.
    • Michael P: IIUC if the Go version in the main module is less than the Go version of the toolchain, GODEBUG values will automatically be set. If we wanted to rely on this, just using GODEBUG seems to make sense.
    • Michael K: What do we do about metric values that don't make sense in a new runtime? Hm, maybe the zero value is fine in that case.
    • Michael P: I agree it's a theoretical concern, but can't think of a case where the zero value wouldn't work. If you're collecting a bunch of metrics and summing them to get a total, if you're not aware of a new metric then it might not include that. For example, some new memory class.
    • Michael K: For that specific case, I think it's OK. We have a catch-all memory metric (other) and a total. So we could just add new memory into the catch-all for compatibility.
    • Rhys: If you set the version of runtime metrics to be Go 1.21 for Go 1.22, do you not see new metrics?
    • Michael K: New metrics are fairly harmless, so it might be OK.
    • Rhys: Moving metrics around could be tricky in some cases. Let's discuss the details on the issue tracker.
    • Michael P: Do we emulate an old version of Go, or do we just do the bare minimum for compatibility? It can probably be argued both ways. GODEBUG traditionally hasn't been the former, but we are also trying to make it more comprehensive. Let's discuss further on the issue tracker. For the STW metrics proposal we can move forward on that by simply keeping both.
    • Bryan: post-meeting note: wanted to link prometheus/client_golang#994; it seems to me the Prometheus Go library could host some opinions about what are a good set of metrics to collect today.  Currently it implements "some set from 2017" and "everything". (We will follow-up on this in the next meeting.)
  • Michael K: Should Ps be part of the trace parsing API? They're kind of an internal implementation detail.
    • Felix: How likely is it that there's a future where Ps go away? What about rendering P-based timelines?
    • Michael K: I'm not certain of the value of rendering P-based timelines when you have per-M timelines.
    • Rhys: Currently I use Ps to guess what runqueue a G was in.
    • Michael P: As someone who works on the runtime, I'm biased toward knowing implementation details and having per-P details. I agree a per-P timeline mostly isn't directly useful, but it does compress the view. Minor but handy that more fits on screen.
    • Felix: +1. It impacts the visualization. The use-case of understanding the amount of runnable goroutines.
    • Michael K: There is an accepted proposal for the number of goroutines in each state, which may resolve that use-case.

@mknyszek
Contributor Author

mknyszek commented Oct 26, 2023

2023-10-26 Sync

Attendees: @mknyszek @prattmic @felixge @rhysh

  • Michael K: New tracer in good shape, stability-wise.
  • Michael K: Demo. (Summary: live trace analysis of a service under load. Ask for the recording if you're curious!)
    • Felix: How much CPU overhead for trace parsing?
    • Michael K: Parser should be able to do 26 MiB/s. Demo is ingesting ~4.6 MiB/s.
    • Rhys: Maybe we could use something like this to build a top for Go, e.g. show CPU usage of goroutines in a list?
  • Michael K: New tracer Sweet benchmark results. (Raw data and new vs. old tracing overhead comparison.)
    • Michael K: New tracer is slightly slower than old tracer. Fixed some problems already, but needs more debugging. Trace size is ~12-20% bigger. Haven’t analyzed why yet.
    • Michael K: Previous tracer had an optimization for goroutines running on the same P again that they were running on before. Got rid of this because of complexity.
    • Michael K: Ps have to have sequence numbers too.
    • Michael K: If you have a region that starts before tracing starts, you’re going to lose that information today.
    • Michael K: If you slice the trace, it’s no worse than if you started the trace at this point.
    • Rhys: Did you get rid of listing all goroutines at the start of the trace?
    • Michael K: Yes.
    • Felix: Can you publish these results somewhere?
    • Michael K: yes.
    • Felix: Trace size increase is a little worrying, but probably okay given the other improvements around partitioning.
    • Michael K: Will make the parser faster, but might not have a chance to work on smaller trace time.
    • Michael K: Current trace code has zero abstraction over trace events. Can’t take the new event API and plug it into the old one. It would be difficult to fit the new clean syscall handling into the old event model. Probably need to fork cmd/trace. I could use help with this if somebody is interested?
    • Rhys: I got a lot of code relying on the old trace events without a lot of abstraction.
    • Felix: I could make some time next week.
    • Michael K: Would be great, I want it to go in, and ideally be on by default.
    • Michael K: Tracer seems stable, haven’t gotten a crash in one week. One edge case is C calling back into Go. I haven’t had a chance to debug this yet. Doesn’t crash in Sweet otherwise.
    • Felix: Will you have time to fix the cgo stuff?
    • Michael K: Yes.
  • Rhys: Seeking tradeoff advice on runtime-internal mutex profiler. Collecting the call stack once we know that we’re going to keep the record extends the critical section “probably too much”. In a production app with GMP=36 and MutexProfileFraction=10 and a high level of contention on a particular channel lock, CPU usage increased by 0.3x. So, capture stack earlier? Larger M, or larger stack requirement in lock2?
    • Rhys: Where do we store call stacks for contended locks when we’re not able to allocate memory? Made a field on M that can store one of them. If there is a second it chooses between the two using the ns of contention value. I’d like it to be fair. I delay capturing the stack into M until I know I’m going to save it. This means I have to delay until we capture the timestamp. We only know the duration once we get the lock (?). … It was using too much CPU. Maybe mutex profile fraction is too high and should be scaled back? It seems like I need to move capturing the stack out of the critical section, but where can I store it? Worried about putting 32 pointer-size variables on the stack for every call to lock.
    • Michael P: Can you do it in unlock?
    • Rhys: Then I need somewhere to stash the duration. That would mean making every lock/channel bigger?
    • Rhys: Are locks always unlocked by same M, no hand-offs?
    • Michael P&M: Yes
    • (more discussion)
    • Rhys: I’ll stash a pointer to the lock on the M and capture the stack in unlock.
  • Felix: Deep stacks grinding Go runtime to a halt.
    • Michael K: Do you use MEMLIMIT? Will trigger GC more often if stack size grows.
    • Felix: Will share more info once we debug this.
  • Michael K: How should GOMAXPROCS be represented in the trace?
    • Currently, when Ps are named they stay alive forever in the parser. If a GOMAXPROCS change reduces the number of Ps, nothing happens in the trace; those Ps just stay idle.
    • Question: should GOMAXPROCS represent some kind of Idle->NotExist transitions for these Ps? Or is this OK?
    • Felix: It's interesting to understand when Ps actually disappear, but I haven't had a use-case for it before. If it's easy to add, I don't see a reason not to add it.
    • Michael K: I don't think it's too complex, but I'd have to think through everything. It would be something like a ProcCreate/ProcDestroy event pair.
    • Felix: There are a few places where GOMAXPROCS emits events even when the number of procs doesn't change.
    • Michael K: It's emitted on every start-the-world. To be clear, it would only emit create/destroy events for the diff in GOMAXPROCS.
    • Rhys: How hard would it be to identify what runqueue a goroutine came from?
    • Michael P: We're talking about Ps being in _Pidle, but that also can't have anything on its runqueue.
    • Rhys: It can have a non-empty mcache. But runqueues are just the more interesting thing to me today. What becomes harder to track if we don't have a ProcDestroy event?
    • Michael P: What's the reason to not make procresize have not-exist transitions? Seems like the consensus is that it's maybe not critical, but it would be easier to make sense of the traces.
    • Michael K: Mostly just worried because we don't have much time until the freeze.
    • Rhys: The wire protocol includes the count of Ps. This seems like it's a question about what the parsing API should present to applications using the parsing API. If the parsing code can figure out which Ps are dead, then it can just expose that.
    • Michael K: It's a bit harder to get it in the parser today because the parser likes to have a 1:1 correspondence between wire format events and the events it emits. The new events would actually be easier.
    • Rhys: Uber has a library that changes GOMAXPROCS but not super often.
    • Michael K: Even a few times an hour, it's probably not going to be a big deal.
    • Michael P: Given that the API change here is pretty small and it's something we can leave for now, we can do it in Go 1.23.

@mknyszek
Contributor Author

Since it came up in the last sync and I don't want it to be missed, I updated the previous sync notes with a link to the raw performance data from Sweet: Raw data and new vs. old tracing overhead comparison.

@narqo
Contributor

narqo commented Oct 27, 2023

Demo. (Summary: live trace analysis of a service under load. Ask for the recording if you're curious!)

@mknyszek if you could upload the recording to a public channel, and add the link to the file to the notes, this would be cool

@mknyszek
Contributor Author

mknyszek commented Nov 9, 2023

@narqo Hey, sorry for not getting back to you; I've been working on getting stuff (like the tracer :)) in before the code freeze. I'll get back to this soon. Thanks for your interest and sorry for the delay.

@mknyszek
Contributor Author

mknyszek commented Nov 9, 2023

2023-11-09 Sync

  • Attendees: @rhysh @mknyszek @prattmic @nsrip-dd @bboreham @felixge
  • Michael K and Felix: Status update. New tracer will land for Go 1.22, gonna try to get cmd/trace ported in time. We have a freeze exception for that. Whether it's enabled by default will depend on cmd/trace being finished in time.
  • Bryan: post-meeting note from two meetings ago: prometheus/client_golang#994; it seems to me the Prometheus Go library could host some opinions about what are a good set of metrics to collect today.  Currently it implements "some set from 2017" and "everything".
    • Michael P: Were you thinking of documentation as to what's the most interesting or an API that would expose this?
    • Bryan: I'm thinking that Prometheus would just pick the good set. The Go runtime internally could document the good set.
    • Felix: Having a good set of metrics as a default for libraries that ingest metrics is valuable and we're interested in that too. I'm not sure if that should be curated by the runtime or if some provider can make that choice. I'm also involved in an internal project to determine what the most interesting metrics are that should be in a default dashboard that doesn't overwhelm people.
    • Rhys: The discussion on the Prometheus issue seems to be about histograms and bucket alignment, etc. Is this about interesting metrics or names of metrics?
    • Bryan: I didn't mean to focus on histogram metrics. IIRC the last time this came up, the change added 88 metrics per process. The set that was present in 2017 is probably not a great set today, but the set of everything will probably make people unhappy. My target is to remove some of the ones from 2017 and to add some new interesting ones.
    • Rhys: Could this be a blog post or some living document describing the most interesting metrics?
    • Bryan: That would help.
    • Michael K: I'm slightly worried about picking a 'good set' because it may hide the fact that more detailed metrics exist. Perhaps that's just a communication problem.
    • Felix: I can get away probably with exposing the metrics we want that are useful to us, but then there's still the dashboard question.
    • Rhys: My strategy is to first record across the cluster if there's something interesting to look at, and then separately outside the metrics system to have more data at a per-process level in a cheaper way. e.g. goroutine profiles, heap profiles, etc. They're available and they're cheap.
    • Michael P: Internally at Google it tends to work this way. The metrics system will only collect a small subset of metrics company-wide, but every process has a dump of everything that's always available to look at.
    • Felix: This came up in the OpenTelemetry profiling discussions. Pixie would store things like profiles on the host, but it's difficult to run queries to learn things about aggregate behavior (e.g. bash scripts to aggregate the data).
    • Michael P: Pricing per sample. Histograms are charged per bucket so they're more expensive. But our histogram metrics are some of our most interesting metrics.
    • Michael K: I think I'm convinced we should do this.
    • Rhys: Looking forward to having an encoding/json library that can serialize histograms and doesn't choke up on +Inf and -Inf.
    • Felix: Why does it currently choke?
    • Rhys: Maybe I should open an issue with the details.
    • Rhys: We should add cross-linking between ReadMemStats and runtime/metrics.
    • Michael P: Along those lines, someone sent me some sample code that used ReadMemStats. It made me realize the docs for that don't even mention runtime/metrics.
    • Rhys: Should we add an API to runtime/metrics to write data into MemStats?
  • Michael P/Rhys: What to do about mutex profile semantics: https://go-review.googlesource.com/c/go/+/528657/6..8/src/runtime/mprof.go#b669
    • Michael P: Last time Rhys was discussing whether we should collect stack samples at lock or unlock. We decided unlock would be simpler to implement and would match mutex profiles better. That's true, but we've realized that even though both are collecting on unlock, there's a difference in semantics. The sample value in the sync.Mutex profile is how long a waiter waited for the lock. In contrast, the sample value in the current runtime-lock profile is how long the lock holder made waiters wait. Rhys proposed in the CL to copy the sync.Mutex behavior so that we do record how long waiters are waiting. I like this. But we almost certainly can't do that for 1.22. Should we submit what we have or should we wait for 1.23?
    • Rhys: This would mean the way we use the key in lock futex would be unnecessary for the core function of locks. It's only necessary for the profiling. I don't know if this going to slow down locks on Linux.
    • Michael P: I don't think it should affect performance on the fast path, only if the lock is contended. If the lock is contended you'll be sleeping anyway. But we'll see.
    • Michael K: Maintenance cost? If the correct semantics need a full rewrite then I'm not sure it's necessarily worth landing.
    • Rhys: That's not my only concern with the patch. An earlier version of the patch did show an increase in CPU time. I wrote a benchmark for the newest version and it does slow down the contended case from 6µs per op to 7µs per op. I'm wary about landing it now. Perhaps we can add a GOEXPERIMENT and put the awkward semantics behind that.
    • Michael P: I support a GOEXPERIMENT. We need tests to make sure the GOEXPERIMENT doesn't break.
    • Michael K: +1 to GOEXPERIMENT.
    • Rhys: Do GOEXPERIMENTs have an ongoing cost? Some GOEXPERIMENTs end up as runtime/metrics and stick around for 4 releases.
    • Michael P: I think you're thinking of GODEBUG which is technically different. GODEBUG is set at run time, GOEXPERIMENT is set at build time. At this point, all GOEXPERIMENTS are completely unsupported. All GODEBUGs we add specifically for compatibility, so it's kinda in-between, because there are old GODEBUGs that are more in the unsupported case.
    • Felix: Small question re: the GOEXPERIMENT. Do you have to rebuild the toolchain?
    • Michael P: No. There are a few that may be specifically for the compiler that need to be set when building the compiler, but in general they can just be set at go build time. Though, that makes me think that perhaps this could be a GODEBUG. The main benefit of the GOEXPERIMENT is that it would be a build flag and you could be absolutely sure there's no performance impact. There would still be a flag check for GODEBUG.
    • Rhys: It seems like there are a bunch of "ifs" already in lock/unlock. I'm also worried that teams have set the mutex profile fraction that's appropriate for their use of sync.Mutex, but that's no longer appropriate for runtime locks.
    • Michael K: Is this actually a breaking change?
    • Rhys: CPU went up by +0.5x in the earlier version I mentioned before. They run with a mutex profile fraction of 10 and GOMAXPROCS=36.
    • Michael P: I think this is a real concern in the final version. This might be a good example of a compatibility GODEBUG added for the final version. My hope would be that we get things to a point where it's not a problem for anyone.
    • Rhys: That's my hope too, but we're less than 2 weeks from the freeze so I think we just didn't make it for Go 1.22.
    • Michael P: I think the conclusion we're coming to is that we can add a GODEBUG, but it would be off for Go 1.22. It wouldn't affect anyone by default. In Go 1.23 we'll fix the semantics and turn it on by default, but they'll be able to turn it off.
    • Michael K: GOEXPERIMENT or GODEBUG?
    • Rhys: Not worried about the extra branch; it's only in the contended case.
    • Michael P: If that's not acceptable now, it also won't be acceptable later.
  • Bryan: we found we were spending up to 5% of CPU in runtime.morestack ref runtime: new goroutines can spend excessive time in morestack #18138
    • The default behavior of gRPC is to fire up a new goroutine per unary call. Eventually someone noticed in a profile that it's spending 5% of its time in morestack. It has an option to use a worker pool. The end result of this realization is that everyone is going to build worker pools and we lose the idea of goroutines along the way. I'd like to avoid that. There is mention on the issue that PGO could solve the issue. It did take a very long time for someone to notice.
    • Rhys: I don't think that we have a profile that shows how big a goroutine's current stack needs are, or how big it's grown to. It shows up in the CPU profile but we don't have the numbers.
    • Michael P: I agree we could have better visibility. Since a recent release the starting stack size is adjusted based on average overall stack needs. It's fairly conservative. It would be interesting to see if it's firing at all.
    • Michael K: The current strategy is conservative because it's a global knob. If we made it more fine-grained we could be aggressive. Keith (who implemented the adaptive stack size stuff) has had some thoughts about it. But he went with the global knob to begin with because it wasn't very hard to implement.
    • Rhys: This is 5% of the CPU time that's spent, but it's on the critical path for processing requests. One thing on my mind is that I'd like to see that in execution traces. What do you think about stack growth events?
    • Michael K: Stack growth happens a lot so I'd be concerned about the volume of events. I'd be in favor of it if we sampled it.
    • Rhys: Sampling will make it harder to reason about. You can kind of already infer this from the CPU sample event, too.
    • Bryan: The number of people reporting the issue may not be related to the number of people experiencing the issue.
    • Michael P: That's a good point. Part of the confusion is that 18138 doesn't mention or include adaptive stack growth. Maybe we should have closed the issue and made a new one. Definitely still worth posting on the issue. RE: stack growth events, I'm less concerned about their volume. Stack growth happens the most when new goroutines are created, so the number of stack growth events should be linearly related to the goroutine creation events.
    • Nick: You can kind of figure out how big a stack needed to be from a traceback, since you know the size of each function. Maybe that can help.
    • Michael K: This is actually fairly easy to do in gdb as it turns out.
    • Rhys: That's a great point. That requires the executable though, and there aren't many tools that we discuss here that require the executable. Could that information go into the symbol table that we ship with goroutine and CPU profiles?
    • Nick (in chat): I thought about this mainly in a PGO context, where you know the frame sizes at compile time. But I guess that wouldn't tell what they were at profile collection time, to your point.
    • Michael P: You can attach to samples but not frames in the pprof format, which is a bit annoying.
    • Felix: If you have more thoughts on what's annoying or missing in pprof, there's a group in OTel discussing pprof and possibly merging the profiling formats, so I could take those ideas to that group.

@mknyszek
Contributor Author

mknyszek commented Dec 7, 2023

2023-12-07 Sync

Attendees: @rhysh @mknyszek @prattmic @bboreham @nsrip-dd @felixge

  • Michael K: Everything's in for the new tracer. Lots of cleanup to do for next release, like moving old traces behind the new API. Working on a prototype of the flight recorder for x/exp.
  • Michael P: There's also x/debug that needs fixing, but I might not get to it before the holidays.
  • Felix: RE: making the new API compatible with old traces, I thought that might be too hard?
  • Michael K: The difficulty was with exporting the new traces behind the old (internal) API. I don't think it'll be hard to export the old traces behind the new API. It's just work.
  • Michael K: Russ Cox was concerned about the performance of traceAcquire, since there's a trace.gen atomic load to check if tracing is enabled. I didn't notice any problems in scheduler-heavy microbenchmarks on amd64 or arm64, but his point is that coroutine switches might be happening an order of magnitude more often, so every little bit (especially with respect to atomics) counts a lot more. He suggested adding a non-atomic bool that's set during the trace start STW and then unset when tracing stops later. This would replace the first check of trace.gen, so if tracing is disabled there's no atomic load at all. I'm pretty sure this works because we always double-check trace.gen, but I have to think about it more in the case of repeated trace starts and stops.
    • Michael P: The atomic load of trace.gen adds a happens-before edge, so there shouldn't be any flip-flopping. I think it sounds reasonable. I was going to suggest making trace stop and start get the attention of every M and set a bool on them, but that's more complicated.
  • Felix: I'm curious what y'alls thoughts are about coroutines and diagnostics.
    • Michael K: In tracing it's just handing off goroutines.
    • Michael P: +1, it's just goroutines. The CPU profile will treat them as separate goroutines.
    • Michael K: That reminds me; there was some discussion about tracebacks merging the stack traces from multiple goroutines.
    • Felix: Will there be a parent/child relationship between the coroutines?
    • Michael P: I would expect so.
    • Michael K: I'll note that this is only for iter.Pull.
    • Michael P: The proposal isn't yet fully accepted, so this can still be discussed on the proposal. This probably won't stop the proposal though, unless we decide that the current proposal makes it impossible to create sensible information for debugging. Delve might have some fun with this.
  • Topic: OS Metrics (Process RSS) for runtime/metrics (Felix Geisendörfer)
    • Backstory: playing a lot with runtime/metrics, especially the memory classes. I've been writing a blog post explaining the memory classes and how this relates to OS metrics.
    • One thing I noticed that's lacking is estimating how much non-Go memory a process is using. Our strategy is to take a best guess of the physical memory used by the Go runtime, and subtract that from the OS's RSS numbers… assuming there's no shenanigans like ballasts going on (which should be less common now). From that perspective, does it sound reasonable to add OS metrics? At some point you have to reconcile those numbers anyway. Lots of libraries collect these metrics differently making for a confusing landscape. Are there any good reasons not to make this a proposal?
    • Bryan: I like the suggestion, but I think it's going to be very hard to get meaning from the OS numbers. RSS includes memory the OS just hasn't kicked out yet. If you want to know if you're running out of memory you need the PSI metrics from cgroups v2.
    • Felix: Are you specifically thinking about the page cache or something else?
    • Bryan: If you write a bunch of stuff to disk it'll sit in your RSS. But specifically that's not the working set, which is a much tighter definition.
    • Rhys: Are you talking about a write syscall or something like a memory mapped file?
    • Bryan: I'm talking about a write syscall. I proposed a talk to KubeCon along the lines of "why does my process OOM." I proposed this to try and force myself to figure this out.
    • Michael P: I have a similar concern to Bryan's. There are so many metrics and it's hard to make sense of them. But Felix is also right: at some point somebody has to understand it. The metrics also differ on different OSes. Should we have different metric sets for different OSes?
    • Felix: I would also like to point out there might be some low-hanging fruit. One particular issue is looking at your memory usage and getting surprised that it doesn't match with the heap profile.
    • Michael K: At the bare minimum, I think we should communicate that the heap profile isn't exactly intended to map directly to actual memory usage.
    • Felix: Except that in practice I find the heap profile to be pretty accurate, within single-digit percent error.
    • Rhys: Haven't been closely involved with teams debugging OOMs, but my understanding is that the heap profile is confusing and insufficient for this use-case. Confusing because it's off by a factor of GOGC, and insufficient because it only includes reports of allocations once they've been around for a cycle or two. You don't get data on the most recent allocations that caused the OOM.
    • Michael P: Just double-checked the Linux kernel source that Bryan is correct. The RSS is effectively the number of pages mapped into your page table. So if you map a file, the faulted in pages are in your page table. If your memory is low then those pages get zapped. That's the core reason why it's usually confusing. Anonymous memory is pretty straightforward, but files get complicated.
    • Felix: Thanks for confirming that, but I think that just shows that it's important to give users more guidance. I think what most people want is what happened just before dying, but it's really hard because the OOM killer sends a SIGKILL. But at the same time, if only 20% caused the failure, what about the other 80%? Maybe that's more important, and we can get that information.
    • Bryan: I second that the post-mortem heap profile is difficult to get.
    • Michael K: The state of OOMs on Linux is fairly dismal, and we could spend a lot of time working around it, but I wonder if maybe we should put some effort into engaging with that community to change things. Regarding where to export OS metrics, I think Jonathan Amsterdam was starting discussions around a standard library metrics package. If that allows users to hook in their own metrics, we could provide an x/debug package that inserts platform-specific memory metrics. I feel more comfortable with that because platforms all provide different metrics and the x/ repositories have less strict compatibility guarantees, so we could evolve with the various platforms more naturally. The barrier to fixing up these metrics as things change also goes down. (Not said in the meeting, but in my mind this is like the difference between the "syscall" package, now frozen, and the "x/sys/unix" package which is allowed to continue to evolve relatively fluidly.) If we could spend some time and apply the expertise in this virtual room to do this work once, I think that would bring a big benefit to a lot of people without that much additional work. As a final note, I've also run into plenty of situations where, for example, some system provides a metric like "bytes of memory in use" with a similarly vague description, and figuring out exactly what that means and how it's derived is endlessly frustrating.
    • Rhys: The frequent cause of OOMs is not giant processes on one machine, but usually tiny processes with small resource limits. So if a process is OOMing, it's doing so on a regular basis. So what I want to do is stash a heap profile to disk every once in a while. But it only includes data from a couple GCs ago, so it'll always be stale. RE: RSS, I don't know if it's helpful, but I thought there was a syscall on Linux where you can ask which parts of memory are actually resident and get back a bitmap.
    • Felix: /proc/* has this bitmap.
    • Bryan: For the case where you really only have anonymous memory and no swap (most processes), the Go runtime could figure out what the limit is and generate the post-mortem diagnostic close to the limit.
    • Michael P: I agree we could just do it ourselves and perhaps should. cgroups even have an event that sends a signal when close to the limit. My biggest annoyance with this limit is that all sorts of reclaimable memory gets counted against you, even though the kernel throws it away as soon as you get close to the limit. So you might spend a lot of time thrashing for memory when the kernel could just drop some memory instead. It's annoying that today these would have to be two completely different systems that don't know about each other.
    • Felix: Just to recap, it's probably not worth a proposal for runtime/metrics, but maybe it could be in an x/debug package that hooks into some more general metrics package, like what Jonathan Amsterdam is starting discussions around.
    • Michael K: That's my gut feeling, but I don't want to say for sure. Maybe I'm wrong about runtime/metrics being in the wrong place. In theory we could export to expvar, but that package is in a weird place. It's not a great API, yet there are people out there using it.
    • Michael P: I think we should invite Jonathan Amsterdam to this meeting. I'll talk to him.
  • Topic: Why is CPU profiling so different from other profile types (start/stop, given that there’s no streaming in the current implementation). Implications on emitting labels for other profile types, or including them in execution trace (especially heap profile sample events?). (Rhys)
    • Michael K: I think CPU profiles are different because they implicitly require a duration (first Start and then later Stop). I'm not really sure why it's not just always-on, but my assumption is it's higher overhead. Regarding emitting heap profile sample events into execution traces, I'm hoping to use the new tracing infrastructure to bring in certain optional diagnostics into traces, like allocation events (if sampled, that's approximately like the heap profile), page allocation events, or possibly even pointer-write events to identify precise lifetimes for heap objects. This information would be useless to most people, but super useful for future runtime/compiler work.
    • Felix: Thank you Rhys for pointing out the footgun of CPU profiles accumulating data over time without freeing it. That's a lesson I probably avoided learning the hard way. A workaround could be to 
    • Rhys: Michael K said CPU profiles are expensive. It's 0.6% or 0.7%, which is not huge. Maybe we could have something on all the time that samples even less frequently.
    • Michael P: I find myself using Linux perf for the initialization part, since that's what you're stuck with. I wonder if in the SIGPROF handler we could check if we're at an async preempt safepoint and use frame pointer unwinding at that point. So we could reduce the overhead quite a bit. Signal delivery probably costs a bit, but I imagine tracebacks are the biggest cost.
    • Michael P: The pprof.Profile API lets you make a custom profile. Has anyone actually done this? I don't see this much.
    • Rhys: I've done it to count concurrent HTTP requests to find places where we didn't close the response body. Also, custom profiles are only for active handles. It doesn't help with accumulating things like bytes.
    • Felix: Have we ever measured the overhead of the CPU profile? I'm just wondering how much we're going to get.
    • Rhys: When I look at execution trace data with timestamps and call stacks, there's a minimum time delta between execution trace events that include call stacks. My understanding from that is that a normal traceback takes ~30µs. I also don't know how much time it takes the kernel to do all its timer stuff. Linux perf would have the answers.
  • No meeting in two weeks since it would be on December 21st. See you next year!

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 4, 2024

2024-01-04 Sync

Attendees: @nsrip-dd @felixge @bboreham @prattmic @dominikh @rhysh @thepudds @mknyszek Daniel Schwartz-Narbonne

  • Bryan: It came up that GOMEMLIMIT should replace a heap ballast, but we've found in practice that it doesn't work and we've had to roll back to a heap ballast. [Thread.]
    • One example is a web proxy that handles a ton of requests, so if you do nothing you get a really small heap and a frequent GC. GOMEMLIMIT makes you pick a number, so you can't burst up a bit. In practice we didn't like having to pick that number, because a number too low causes issues.
    • Felix: What about GOGC?
    • Bryan: You need to have it at 1000 or 2000, so you get these really big spikes if you do nothing else. Some combination of the two ought to work, but we had a couple of attempts but had to roll it back in production. In practice we set a heap ballast of ~1 GiB.
    • Michael K: For some context, Bryan's situation involves overcommit of container memory, so simply setting GOMEMLIMIT and GOGC=off is problematic, since the runtime will just use all memory up to GOMEMLIMIT. We discussed both the idea of setting GOGC to some value and a much higher GOMEMLIMIT (closer to the true maximum, after overcommit), as well as other alternatives. The memory target proposal is a much closer fit, but it didn't get much traction. A similar issue also suggesting GOMEMLIMIT isn't good enough for all use cases was filed by someone. That was a little different since it involved a persistent memory leak, but the recommendations are about the same.
    • Rhys: Ballasts are trivial, why have them in the runtime?
    • Felix: Quick reply to ballast is that it messes with heap inuse profiles and metrics in ways that are confusing if you don't know there's a ballast.
    • Bryan: We believed that you could map memory without Go touching it, but we saw in some specific cases that the runtime actually does end up zeroing the memory, paging it all in. +1 to Felix's point.
    • Rhys: There's a blog post that points out a whole bunch of things that need to line up for it to work. It's only easy if you get the subtle parts right.
    • thepudds: Is the memory target something that can at least be approximated in user space?
    • Michael K: There are two ways: one is a ballast, and one is with GOGC. The downside of the latter is that you're not going to get perfect responsiveness, but if you're just using it as an optimization, that's probably fine.
    • thepudds: Someone could implement it as a package.
    • Bryan: The mismatch between k8s CPU numbers and GOMAXPROCS is a pain, but yeah when the GC kicks up in response to memory pressure (as defined by GOMEMLIMIT) it can become a real problem.
    • thepudds: Is SetFinalizer responsive enough?
    • Michael K: Yeah, the Go compiler actually does this and it works out OK.
    • Rhys: In the compiler we know that nothing is going to attach a finalizer that just blocks forever; if you know that's true in your application, then it's responsive enough. But you can also make a finalizer that blocks all other finalizers from running, which is a hazard.
  • Nick/Daniel: Adding GC roots to the inuse heap profile to improve memory leak debugging (proof-of-concept CL)
    • Michael P: Seems interesting. We considered modifying the heap dump tooling to emit a valid core file, so the same tooling works on crashes and heap dumps. I think the core idea of figuring out why memory is held live is very valuable.
    • Bryan: Wanted to call out that I got viewcore working again for 1.20 and 1.21. Can you give more details as to what this looks like?
    • Nick: These are labels attached to profile samples that tell you something about the root keeping the object alive. The prototype just picks a random number of roots and it's non-deterministic. RE: core dumps, I don't know how hard it is to handle core dumps operationally. In our context, we know our users might not feel great about sending a full dump of memory to us.
    • Daniel: The nice thing about this is that it does tie into the existing heap profile, and not yet another new type of thing.
    • Bryan: About getting the data, CRIU is interesting but not something I've tried. It is part of k8s though. It's an alpha feature as of 1.26, haven't checked what it is today. You can ask for a checkpoint of a container. I assume that's analogous to a core file. → Link
    • Michael P: You raise a good point about the PII stuff. A heap profile doesn't actually contain the contents of memory. In theory if we're making these artificial core files, we could exclude non-pointer data. Also agreed that it's nice to have things in the same heap profile format.
    • Daniel: Dumping the heap graph could be really useful, but our prototype doesn't do that.
    • Bryan: I have to think for a reasonably sized program it takes too long to STW to dump the full heap graph. You probably want some kind of virtual memory copy-on-write thing. In .NET you could get a dump of the heap graph. Just trying to get back there personally.
    • Daniel: One of the nice things about this approach is that we're already doing most of the work in the GC. The overhead is about 7% slower when disabled, but 12% slower when enabled. With some more careful tuning it might be possible to get this to zero when off, and lower when enabled.
    • Bryan: Yeah, if it's possible to get it cheap enough maybe that's a good path.
    • Daniel: You could also get a partial graph and that could be useful.
    • Rhys: Can the chain be something simpler like the type? What if we represent the chains as a graph of types? Is that small enough that that could fit in a new kind of hash table? Is that enough information to be useful?
    • Michael P: I don't really know the answer to your question as to whether that's enough, but I'll note that the roots aren't really call stacks at all. The chain that it's reachable through is indeed more about types. It's unclear to me exactly what information we want.
    • Rhys: The mprof hash table today has a list of call stacks as a unique key. What if this were a different sort of profile where instead of a list of call stacks, it's a list of types?
    • thepudds: Almost anything here would be a big step forward. Even if you just had a number of bytes tied to some root, it can be an 80% solution.
    • Michael K: I worry that the investment to make this fast would be really high. 7% represents >1 years of investment in improving the GC hot paths. But I could be wrong, there could be a way to make this fast. (Disclaimer: I have not looked closely at the patch.)
    • Daniel: How hard would it be to have a GC visitor framework? Are there other use-cases? RE: double counting, I don't think that should be an issue. Doesn't the GC only mark objects once?
    • Michael K: GC is inherently racy for performance reasons. It’s possible (though rare) that objects get marked twice.
    • thepudds: The ideal solution means zero cost when it's off, but then you pay the cost when you dump something, but it is also more work to analyze a new kind of dump.
    • Nick: It makes a lot of sense to me that the bar for performance here is really high. But it also wasn't totally clear to me whether the cost would be worth it, which is one reason why I wanted to bring it up here.
    • Rhys: What is standing in the way of a process snapshotting its own memory? Aside from it being hard, is there type information that's hard to get, or something that gets in the way of a process analyzing its own memory? If we want zero overhead and not have the runtime commit to supporting new things, maybe we can have the runtime support a simple thing that lets programs do introspection into their own memory.
    • Michael K: This sounds similar to what Daniel was saying about the GC visitor framework. I vaguely remember checkmark mode doing something like this.
    • Bryan: The important information is mostly in internal packages, so it's a bit of the antithesis of what Go normally exports, but if it's a delve-like thing, then why not?
    • Michael K: On a final note, it would be great to fix this. It's definitely a gap that Go has compared to other languages.
  • Michael K: Flight recorder prototype landed. Please try it out!
  • Felix: Compiler regression in go1.22?
    • Trying out PGO in 1.22; in a benchmark decoding protobuf data, even a baseline build was 4% slower on darwin/arm64.
    • Michael K: I got a message on Slack about a possible regression. It was 5% faster on linux/amd64 and 5% slower on darwin/arm64. I've been meaning to come back to that. Nothing really stood out in the profile. I was also looking at the performance dashboard and saw this yesterday. Looks like the regression came from August 2023, which was mostly compiler CLs.
    • Michael P: You're probably already aware of this, but in case you're not: if you're benchmarking a tight loop, it might turn out to be a code alignment thing. If it's protobuf it's probably big enough.
    • Felix: I'll double-check. I'll also double check to see exactly what platforms the regression happens on.

@mknyszek
Copy link
Contributor Author

2024-01-18 Sync

Attendees: @prattmic @mknyszek @rhysh @bboreham @felixge @nsrip-dd @thepudds

  • Felix: How can callers of metrics.Read understand if a metric value is the same because it has not changed since the last read or because it has not been updated (e.g. by the GC). I.e. How can one avoid computing incorrect /cpu/classes/ deltas?
    • Felix: Considering using GC count metrics, but it doesn't feel right.
    • Michael K: We should just fix this. The issue is non-monotonicity of some metrics (that are computed as subtractions) because other metrics are computed only once per GC cycle.
    • Rhys: You can use idle time as an in-band proxy for whether anything changed.
    • Felix: How would I interpret an increase in the number of GC cycles but no change in stats?
    • Rhys: It could be a system with very broken timers.
    • Felix: Not urgent, just wanted to bring it to this group's attention. Wanted to avoid manipulating metrics from the runtime before they reach the user because they're harder to interpret.
    • Michael K: I mostly added the gc metrics to replace GCCPUFraction.
    • Felix: Yeah, one problem with this metric was also that it counted the GC fraction since the program started.
    • Michael P: There might be a race between GC count and the CPU metrics. If a GC STW happens in between while the metrics are being read, they might be inconsistent.
  • Felix: testing: writeProfiles is not called after panic #65129
    • Felix: We were debugging a test hang in CI, but we had a test that panicked, which made it hard to get a trace out.
    • Michael K: We could call traceAdvance on a fatal panic, and then the trace tool would have to understand that it's OK to have garbage data at the tail of the trace.
    • Michael P: The testing package already has a cleanup function that could stop acquiring the execution trace and other diagnostics. It's weird that it doesn't.
    • Felix: Do we see any good reasons for not doing this in the defers that are already in place? My only concern is what would happen if the panic was so bad that these functions would hang. But I think that problem already exists, so we wouldn't be creating a new problem.
    • Rhys: Complex proposal: we spoke months ago about how great it would be if we could extract an execution trace from a core dump. Is that a thing that we want to do? We could have the testing package be the thing that uses it. Flushing traces still has a problem because the data is being fed through an io.Writer, which has no strong way to fully flush, especially when plumbed across many goroutines.
    • Michael P: Would like to have this, but I'm more skeptical about the testing package using this. OSes can make it difficult to enable core files. On Linux there's no other way to request a core file except for changing the global core pattern.
    • Felix: Worried about being able to do this in CI environments. For example, it might be really hard to set this up in, for example, GitHub Actions. Just looking up how to get core dumps on macOS seems complicated.
    • Rhys: I wasn't necessarily thinking of a core file, but the parent process can attach in the same way a debugger might. Just like ptrace, it would stop the process before it fully crashed. I imagine this would be more portable.
    • Michael K: I think ptrace is going to be less portable probably; delve is only available on two platforms currently and that's part of the reason why. Also, if we export a core dump file via WriteHeapDump, then we can have a portable way to dump this data. If flight recording is enabled, we can even lift the trace data directly from the flight recorder buffers, which we know will be whole.
  • Michael P: Shout-out that one of our internal users was noticing runtime lock contention in their profiles and I was able to figure out what was going on. Thank you to Rhys for contributing runtime lock contention to the mutex profiles!
  • Michael K: We're gonna be shifting gears a little bit toward efficiency and performance analysis this year. Before that happens, I want to get some trace cleanup done, and specifically the last big chunk of work is getting the old trace parser behind the new API. If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know! Dominik offered to pitch in a little bit.
    • Felix: If I can find some time I'll see if that's something I can help with. I'll also talk to Dominik.
  • Felix: Update: have not confirmed the compiler regression from 2 weeks ago. I'll ping someone on Slack if I find anything. I also found a fun PGO thing where the average goroutine stack size went up 30% after enabling PGO.
    • Michael P: You can look at the assembly to check stack frame sizes, but there's a compiler bisect tool that can be used to identify which decisions the compiler made actually resulted in the observed change. It was created for the loop variable scope change. It's called something like x/tools/cmd/bisect.
    • Felix: The challenge would be to write a reproducer, since this is a big application that's using a very large number of goroutines.
    • Michael P: If there's a common stack, then you could look at those frames specifically. I can send instructions on how to find out what's inlined due to PGO.
    • Felix: Sometimes I see a different stack trace from PGO binaries because of inlined frame attribution differences.
    • Michael P: I think what you're referring to is if a function calls a closure which gets inlined, its name gets more complicated and confusing. Otherwise it shouldn't really impact things.
    • Felix: It does make profiles harder to compare.
    • Michael P: Matthew Dempsky has been thinking about how to adjust these names.

@dominikh
Copy link
Member

dominikh commented Jan 18, 2024

If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know!

I don't have ideas regarding overall design yet. However, since there'll be no way around reading old traces entirely into memory to sort them, I'd like to suggest using the trace parser of Gotraceui. It started out as a copy of go tool trace's parser, but has been optimized to use less memory, put no pressure on the GC, and it uses a smarter way of merging batches that lowers CPU time. It's how Gotraceui is able to load traces in a fraction of the time and memory needed by the old go tool trace.

It's at https://github.com/dominikh/gotraceui/tree/8fbc7cfaeb3cebed8890efbc030a62a7f1ff3f81/trace — since it started as a direct copy of Go's parser, the git log for the folder should show all relevant changes, and includes some benchmarks.

There are some changes that might have to be reverted if we want to support very old traces. My parser dropped support for Go 1.10 and older, and it doesn't handle EvFutileWakeup; futile wakeups haven't been a thing since Nov 2015.

The only change that might not be great is 2c5675443eebc969ee07cd4f2063c2d7476f7b9b which removes support for trace formats older than Go 1.11. That change could be reverted if need be, but traces that old wouldn't benefit from my improvements to batch merging.

Edit: I've nearly completed an implementation of the conversion.

@mknyszek
Copy link
Contributor Author

mknyszek commented Mar 9, 2024

2024-02-01 Sync

Attendees: @prattmic @felixge @dominikh @nsrip-dd @rhysh @bboreham @mknyszek

  • Felix: RE: Parsing old traces in the new API. I see that Dominik has implemented it. We've tried it and it seems to work really well.
    • Michael K: On my TODO list to review, I will get to it soon. And thank you again to Dominik for working on that!
  • Felix: RE: Panics in tests cause broken execution traces.
    • Got reviews. Thanks!
  • Felix: RE: Tried PGO, found increase in stack memory usage.
    • Have a profiler tool that takes a goroutine profile and a binary, and produces a plausible stack frame size profile.
    • Worth a proposal as a profile type?
    • Michael P: Could be another sample type in the goroutine profile, possibly. Your proposed way of estimating the stack size should be pretty close. It probably omits some wrapper functions that add a few bytes. If we do make this a real profile type, we can put in the real stack size. The runtime has to look at the goroutine anyway.
    • Rhys: How common is it for goroutines to idle with a small number of stack frames, but transiently grow their stacks pretty large? That could hide stack space with this way of estimating the stack size. RE: expanding the goroutine profile, could we add extra little debug things to execution traces to get this data? For example, when stack growths happen.
    • Felix: Lots of good ideas. RE: the concern about goroutines that grow and then shrink again, are you worried about capturing the HWM of stack size?
    • Rhys: 
    • Felix: Wouldn't this be the same issue with a live heap profile?
    • Rhys: I think there's a bound on the live heap.
    • Michael P: I think there are two types of growing being discussed here. One is how much of the stack is being used, and the other is how much stack space is allocated. Rhys's case suggests that stacks shouldn't shrink too often, since we need a full GC to shrink. I think the case where the runtime directly records the allocated size handles that.
    • Michael K: I think Felix's approach would still work well if you take many samples. Eventually you'd catch the spikes. You're going into the problem already knowing your stack memory usage is high.
    • Felix: Another reason this would work well in practice is that you only care about it when you have a lot of goroutines, because that's when you have a lot of stack space.
  • Michael K: Flush generation on panic and accept broken trace tails
    • Michael P: In theory if you have a core dump, it shouldn't be too difficult to simulate what traceAdvance does, and finish up the trace. That way you can fully extract the trace out of the core dump.
    • Rhys: Does this only work well if the trace is being directed to an os.File, or does it work well in the general case? What happens when the io.Writer goroutine needs to execute?
    • Michael P: When a panic starts we freeze the world. RE: what Michael was talking about, we were thinking more about 
    • Michael K: I was thinking more about crashes as somewhat isolated, so that most of the application could still continue running. Maybe that's wrong.
    • Michael P: We could investigate that, but we'd have to not freeze the world.
  • Michael P: Just merged SetCrashOutput, it lets you specify an additional FD to write out panic string in addition to stderr.
    • You could imagine the flight recorder has a similar API that writes out to an os.File, which will be a lot more robust than a general io.Writer.
  • Michael K: Still need to look into the slight regressions we've seen due to recent diagnostics work. It's only been microbenchmarks, so we expect this to not actually impact real applications. Rolling out at a bigger scale hasn't shown any significant regressions. See https://perf.golang.org/dashboard/?benchmark=regressions&unit=&repository=go&branch=master&days=180&end=2024-02-01T16%3A37 for the benchmark list we have.
  • Rhys: Did a bit of investigation as well: runtime: ~5% regression in ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-8 sec/op at 450ecbe #64455.

@mknyszek
Copy link
Contributor Author

mknyszek commented Mar 9, 2024

2024-02-15 Sync

Attendees: @nsrip-dd @rhysh @mknyszek @bboreham @felixge @prattmic @dashpole

  • Sysmon Preemption
    • Michael P: Felix's patch came up when somebody reported a go1.22 regression that might have been caused by preemption. https://go.dev/issue/65647
    • Felix: Came up on my end this week as well. I found the preemptions useful to debug OS latency, seeing 500ms goroutine time slices in the execution trace gave me confidence that OS scheduling latency was high.
    • Schedviz tool: Michael P started a prototype to combine kernel scheduler traces and Go execution traces, effectively adding a CPU dimension to the execution trace.
  • Michael K: Default metrics set for OpenTelemetry
    • Following up on a previous discussion.
    • Michael K: Exporting all metrics might be too much. I wrote up a proposal for an API to give a default set of metrics that are widely useful. Michael P convinced me that this should maybe start as documentation rather than an API.
    • Michael K: API would take a go language version and give back a set of metrics. If you ask for go1.17, you get the defaults for that version.
      • func Default(toolchain string) ([]Description, error)
    • Bryan: There is a prometheus client golang issue (link?). Maybe we (grafana) could dig up information on which go metrics are collected by our customers in the cloud via prometheus. We can scan dashboards and queries for 30d and report on this in aggregate.
    • Michael K: That would be useful to have. I came up with a default set in my doc, but I’m not super happy with it. I could share the doc, but feels like it might get deprecated right away.
    • Michael K: The default set would have strict compatibility guarantees. Anything in the set would be supported effectively forever. You could unconditionally rely on those metrics to still exist (but they might become 0).
    • Rhys: Questions about RSS, VSS, why heap profile is ½ of RSS come up often. Maybe the GC guide could address some of this. I don’t understand why the API to get good metrics needs to be in the core rather than a 3rd party thing.
    • Michael K: Yeah, could also go into /x/ as a middle ground.
    • Rhys: I don’t know what it means for a metric to be a default. Different apps have different needs. This is also coming up b/c metrics are expensive. They are small relative to heap profiles which many folks collect.
    • Michael K: I had the same thoughts – why is this a problem at all.
    • Bryan: We charge $8 / month for a thousand time series. I think runtime/metrics has ~100 metrics, that would be 80 cents per month. This is multiplied by the number of processes you run.
    • Felix: Usually what's expensive is metrics with tags with high cardinality. runtime/metrics are typically exceptions to this because they're process-global. I think it's cheap-ish, but different people have different opinions.
    • Rhys: I agree that putting them as time series in practice gets expensive, but there's a lot of data we can get out of the runtime that we wouldn't turn into a time series. Like heap profiles and execution traces. In practice I'd like to have all the data sitting around.
    • David: Our motivation is primarily around stability. Unless it's very very expensive, and it sounds like it's not, that's probably not much of a concern for us. We would like to be able to say "here's 1.0, we won't break you."
    • Bryan: In Prometheus there will be 3 options: the 2017 set, all of them, and a recommended set. Aren't heap profiles just as large? (e.g. you have 100s of stacks and frames; if you have a bunch of heap profiles, isn't that the same?)
    • Rhys: I'm not sure we're looking at this the same way. For stability, could we have a string translation API?
    • David: That seems OK. The library would hide this translation from users. The harder question is: is there a set of metrics that we think will continue to be useful in Go 1.30?
    • Rhys: I would expect most runtime/metrics to not change. Recently, the STW metric changed.
    • Michael P: I see where you're coming from David. I think to Rhys' point that most metrics don't change. But there are also details about, for example, memory classes. New classes can get added, and some might not make sense as the GC changes.
    • Felix: I think the argument of stability resonates with me. I recently built a dashboard that breaks down memory and it could break. I also think I got what Bryan was saying earlier. I think you can see the heap profiles similarly to metrics, but the blob profile has different performance characteristics. Also, they tend to get taken less frequently.
    • David: It's also good to have some space for metrics to be introduced that are in development.
    • Michael K: Shall we move forward with documenting metrics? Would this work for you David?
    • David: What this would look like is we would have a documented set that depends on Go's documented set. This would be fine to unblock us. Going forward, it would be good to have a way to make this more maintainable and detect breakage from the Go side.
    • Felix: Last I looked it seemed like the Go metrics support in OTel was incomplete. What's the plan?
    • David: We plan to throw away what we have and start from scratch. We have two approaches in progress: one would be to have a stable set, and the other would be to have some configuration so that users that want different metrics or more metrics don't have to wait for us.
    • Felix: What's the best way to follow along this stream of work?
    • David: There is an issue on the semantic conventions repo: Semantic conventions for go runtime metrics open-telemetry/semantic-conventions#535
    • David: There are also WIP discussions/proposals in the Go sig: runtime instrumentation replacement proposal
  • OpenTelemetry OTEP 239 (pprofextended format proposed as new profiling signal format) - Felix Geisendörfer
    • I wanted to call some attention to this; it's in the last-call stage and will be merged soon.
    • At this point the spec could still be changed, but the big picture will probably stay the same.
    • pprofextended: open-telemetry/opentelemetry-proto-profile@2cf711b...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2
    • Michael K: I'll bring it up to the team.
    • Michael P: Alexey has been following this, which gives me confidence that there will be some alignment.
    • Felix: Yeah, he basically co-designed this with Dmitri from Grafana. But AFAIK he was not the one who could make the call on what ultimately happens with the pprof project. It's more a question of whether OTel determines pprof's future.
    • Michael P: Yeah, it seems like the question is how they're going to be kept in sync (or not).
    • Felix: Divergence would be highly undesirable, because great pains were taken to make pprofextended maintain compatibility.
  • Felix: Deterministic Testing for Go
    • Felix: A colleague approached me with questions on how to make the Go runtime more deterministic for testing.
    • Michael K: We have considered adding scheduler hooks or more deterministic scheduling.
    • Michael P: runtime: scheduler testing controls #54475
    • Rhys: There are some places where the Go project intentionally randomizes.
    • Felix: I think my colleague was actually trying to randomize but with a fixed seed.
    • Michael K: Race mode already randomizes a whole bunch of things. You could imagine setting the seed for that.
    • Felix: A seed that is used for everything would be interesting.
    • Michael P: Some of the randomness isn't explicit, like two threads racing for the same work. It needs explicit support to become deterministic.
    • Rhys: You could imagine a record-replay type system.
    • Michael K: It occurs to me that an execution trace might have the information you need for that already. It's not designed for it, but it might work.
    • Michael P: That's interesting, it might work. There will still be some weird things you need to do, for example to make an async preemption land at exactly some instruction.
