runtime: diagnostics improvements tracking issue #57175

Open
10 of 19 tasks
mknyszek opened this issue Dec 8, 2022 · 64 comments
mknyszek commented Dec 8, 2022

As the Go user base grows, more and more Go developers are seeking to understand the performance of their programs and reduce resource costs. However, they are locked into the relatively limited diagnostic tools we provide today. Some teams build their own tools, but right now that requires a large investment. This issue extends to the Go team as well, where we often put significant effort into ad-hoc performance tooling to analyze the performance of Go itself.

This is a tracking issue for improving the state of Go runtime diagnostics and their tooling, focusing primarily on runtime/trace traces and heap analysis tooling.

To do this work, we the Go team are collaborating with @felixge and @nsrip-dd and with input from others in the Go community. We currently have a virtual sync every 2 weeks (starting 2022-12-07), Thursdays at 11 AM NYC time. Please ping me at mknyszek -- at -- golang.org for an invite if you're interested in attending. This issue will be updated regularly with meeting notes from those meetings.

Below is what we currently plan to work on and explore, organized by broader effort and roughly prioritized. Note that this almost certainly will change as work progresses and more may be added.

Runtime tracing

Tracing usability

Tracing performance

Heap analysis (see #57447)

  • Update viewcore's internal core file libraries (gocore and core) to work with Go at tip.
  • Ensure that gocore and core are well-tested, and tested at tip.
  • Make gocore and core externally-visible APIs, allowing Go developers to build on top of them.
  • Write go.dev documentation on how to do heap analysis with viewcore.

CC @aclements @prattmic @felixge @nsrip-dd @rhysh @dominikh

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 8, 2022
@mknyszek mknyszek added this to the Go1.21 milestone Dec 8, 2022
mknyszek commented Dec 8, 2022

2022-12-07 Sync

Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh

Notes:

  • pprof labels in execution traces
    • Michael K: I need to follow up on the CL and issue.
    • Michael P: Have you considered runtime/trace regions?
    • Nick: Yes, but it doesn't quite hit our use-cases.
    • Rhys: Could use logs instead. Is the point to follow goroutines?
    • Felix: Yeah.
    • Rhys: Started parsing the trace format for some internal tooling.
    • Felix: Prefer labels because inheritance. Mostly like it for profile tools on top, but maybe doesn't matter for tracing tools. The parent goroutine gets recorded so things like request IDs can be tracked in post-processing.
      • Also useful for attributing CPU time.
    • Michael P: Tasks are inherited, but inheritance is processed downstream.
      • Would be nice to have pprof labels just so you don't have to think about which one to use.
    • Michael K: Useful even just to bridge pprof and runtime/trace.
    • Austin: Agreed. We can one day get to the point of deprecating the old APIs as well.
    • Rhys: RE: Attributing CPU time, can see at better than 10 ms of granularity already (even if Ps aren't CPU time, it's still time the rest of the app couldn't run).
    • Michael P: There's an issue about measuring CPU time online. (TODO: Find it.)
  • Trace parsing API
    • Michael K: How important is this? Priority?
    • Felix: Important for the community, but we can't make use of it in our case. Will the trace format change in between releases?
    • Michael K: I think we can always guarantee a trace format for a release.
    • Michael P: How high level should this API be?
      • cmd/trace has two levels:
        • Low-level that understands format and events.
        • Higher-level that understands relationships between goroutines, etc.
    • Michael K: Page trace splits this into "parser" and "simulator." The latter is more stateful.
    • Felix: Intuitive feeling toward lower level API.
    • Rhys: +1 to low level.
    • Austin: Scalability of processing traces.
      • Currently not in a good state in low or high level format (currently requires the whole trace).
      • Can fix trace wire format for low-level parsing scalability issues, but it's much less clear how to do this for the high-level format.
    • Austin: Flight recorder idea.
      • Interacts interestingly with simulation. Current trace snapshots everything.
      • Solved this in debuglog; reads its own tail and keeps local state updated.
      • Complicated trade-offs in this space.
    • Felix: We use a lot of JFR, one thing that's nice is it's broken down into self-contained chunks.
  • Michael K sent out a very half-baked trace format revamp. (Thanks for the comments! Far from ready to share more widely.)
    • The next step is to measure the actual current overhead.
      • Maybe add a mode to Sweet?
      • Rhys: Have been collecting CPU profiles and execution traces. 20% of CPU time during execution trace is for execution trace itself. 95% of overhead is collecting stack traces.
        • Collect 1 second every 5000 seconds and no one complains. People do complain about goroutine profiles every 2 minutes.
      • Michael K: Shooting for KUTrace overhead, so making stack traces optional/faster is just step 1.
      • Felix: Trace effect on tail latency.
        • Rhys: Traces are my view of tail latency.
      • Felix: Benchmark for pathological cases and worst case.
      • Austin: Linked to trace overhead issue, Dmitry proposed switching to frame pointer unwinding.
      • Felix: At some point implemented frame pointer unwinding in userland and it was 50x faster (link).
      • Rhys: Not sure what kind of tool you could build without stack traces in an app that doesn't set pprof labels, tasks, regions, trace logs, etc.
      • Michael K: Integration with structured logging?
      • Michael P: It does add yet another way to add things to runtime/trace.
      • Rhys: The standard library (e.g. database/sql) doesn't currently use runtime/trace at all, maybe it should.
      • Michael K: This connects to deciding what goes into a trace. I think this is a very good idea.
      • Felix: +1. Java world and JFR does this.
  • Deciding what goes into a trace
    • Disabling stack tracing / reduce stack trace depth
    • Filtering by pprof labels
    • Specific event classes
    • Standard library events
    • Rhys: I've made this decision for my organization. Expected that you do profiling for a long running service. No opportunity for app owners to express opinions. People who complained forked the package, turned it off, and now coming back. I kind of want everything.
    • Felix: I would love to be in a place where we can do that, but we get pushback from users when the overhead is too high.
    • Rhys: The question is how close we get within 1% overhead. My choice was to get everything, but less often.
    • Felix: Desire to get more of everything is in conflict with adding more kinds of things in the trace.
    • Michael P: Agreed. Ideally we have tracing that's sufficiently fast that we have on all the time, but if libraries are allowed to add new traces then it could be a problem. It would be nice to turn that off without forking a library.
  • Before next sync:
    • Michael K: Unblock pprof labels patch and benchmarking trace overhead.
    • Felix: I can contribute a worst case benchmark.
      • Currently blocked on pprof labels in trace.
    • Felix: Started to work on gentraceback. Might work on it over the holidays.
      • Trying for bug-for-bug compatibility.
    • Michael P: Austin has been working on this too.

@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 8, 2022
felixge commented Dec 9, 2022

I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into #57159. Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB trace (15 s of prod activity) yesterday, and it was a real eye-opener to see how the current UI struggles.

tbg commented Dec 9, 2022

I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer where it would freeze on me all the time, esp. in the per-goroutine view (/trace?goid=N). I figured out you can download perfetto-compatible JSON data from /jsontrace?goid=N. (/jsontrace gives the default view). This can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace so it's not as great, but I was glad to have something that worked.

thediveo commented Dec 9, 2022

would the pprof labels also show up in goroutine traces?

qmuntal commented Dec 9, 2022

I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?

mknyszek commented Dec 9, 2022

@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.

@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer that's certainly more in the same spirit. Do you have an issue for tracking that already?

qmuntal commented Dec 9, 2022

Do you have an issue for tracking that already?

I still have to prepare the proposal, I plan to submit it next week.

unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).

Not for now, but once I finish this I want to investigate how feasible it is to unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.

qmuntal commented Dec 14, 2022

Do you have an issue for tracking that already?

I do now #57302 😄

@gopherbot

Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead

mknyszek commented Dec 22, 2022

2022-12-22 Sync

Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh

  • Organizational stuff
    • OK to record meetings?
    • Meeting recorded with transcript this week (please ask if you would like to see it).
  • Trace overhead benchmarks
name                                old time/op            new time/op            delta
BiogoIgor                                      17.7s ± 3%             17.5s ± 4%     ~     (p=0.190 n=10+10)
BiogoKrishna                                   15.1s ± 4%             15.1s ± 4%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             5.78s ± 7%             5.76s ±11%     ~     (p=0.853 n=10+10)
BleveQuery                                     2.37s ± 0%             2.37s ± 0%   -0.26%  (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat                 16.9s ± 9%             16.9s ± 7%     ~     (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1             36.7s ± 1%             44.4s ± 2%  +21.01%  (p=0.000 n=10+10)
GoBuildKubelet                                 47.0s ± 2%             48.8s ± 3%   +3.72%  (p=0.000 n=10+10)
GoBuildKubeletLink                             8.89s ± 2%             8.88s ± 4%     ~     (p=0.720 n=10+9)
GoBuildIstioctl                                45.9s ± 1%             47.8s ± 2%   +4.09%  (p=0.000 n=10+10)
GoBuildIstioctlLink                            9.07s ± 2%             8.99s ± 2%     ~     (p=0.095 n=10+9)
GoBuildFrontend                                15.7s ± 4%             16.1s ± 2%   +2.45%  (p=0.043 n=10+10)
GoBuildFrontendLink                            1.38s ± 2%             1.37s ± 3%     ~     (p=0.529 n=10+10)
GopherLuaKNucleotide                           27.9s ± 0%             27.9s ± 1%     ~     (p=0.853 n=10+10)
MarkdownRenderXHTML                            256ms ± 2%             256ms ± 2%     ~     (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest                 618µs ± 7%             657µs ±10%   +6.30%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             722µs ± 6%             773µs ± 4%   +6.96%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  508µs ± 3%             532µs ± 3%   +4.73%  (p=0.000 n=10+10)

name                                old average-RSS-bytes  new average-RSS-bytes  delta
BiogoIgor                                     68.8MB ± 2%            71.8MB ± 4%   +4.40%  (p=0.000 n=10+10)
BiogoKrishna                                  4.42GB ± 0%            4.42GB ± 0%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             194MB ± 2%             198MB ± 3%   +1.91%  (p=0.008 n=9+10)
BleveQuery                                     536MB ± 0%             537MB ± 1%     ~     (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat                 444MB ± 1%             446MB ± 0%   +0.41%  (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1             132MB ± 1%             142MB ± 4%   +7.61%  (p=0.000 n=10+10)
GoBuildKubelet                                1.75GB ± 1%            1.85GB ± 1%   +5.51%  (p=0.000 n=10+10)
GoBuildIstioctl                               1.35GB ± 1%            1.42GB ± 1%   +5.49%  (p=0.000 n=10+9)
GoBuildFrontend                                511MB ± 2%             543MB ± 1%   +6.31%  (p=0.000 n=10+9)
GopherLuaKNucleotide                          37.0MB ± 1%            40.4MB ± 2%   +9.24%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           21.8MB ± 3%            24.0MB ± 3%  +10.14%  (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest                5.40GB ± 1%            5.38GB ± 1%     ~     (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest            5.72GB ± 1%            5.71GB ± 1%     ~     (p=0.971 n=10+10)
Tile38KNearestLimit100Request                 7.26GB ± 0%            7.25GB ± 0%     ~     (p=0.739 n=10+10)

name                                old peak-RSS-bytes     new peak-RSS-bytes     delta
BiogoIgor                                     95.9MB ± 4%            98.5MB ± 3%   +2.70%  (p=0.030 n=10+10)
BiogoKrishna                                  4.49GB ± 0%            4.49GB ± 0%     ~     (p=0.356 n=9+10)
BleveIndexBatch100                             282MB ± 3%             284MB ± 4%     ~     (p=0.436 n=10+10)
BleveQuery                                     537MB ± 0%             538MB ± 1%     ~     (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat                 485MB ± 1%             483MB ± 0%     ~     (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1             180MB ± 2%             193MB ± 3%   +7.19%  (p=0.000 n=10+10)
GopherLuaKNucleotide                          39.8MB ± 3%            46.0MB ±20%  +15.56%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           22.1MB ± 3%            25.5MB ± 7%  +15.45%  (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest                5.70GB ± 1%            5.68GB ± 1%   -0.45%  (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest            5.93GB ± 1%            5.91GB ± 2%     ~     (p=0.631 n=10+10)
Tile38KNearestLimit100Request                 7.47GB ± 1%            7.46GB ± 0%     ~     (p=0.579 n=10+10)

name                                old peak-VM-bytes      new peak-VM-bytes      delta
BiogoIgor                                      802MB ± 0%             803MB ± 0%   +0.11%  (p=0.000 n=10+10)
BiogoKrishna                                  5.24GB ± 0%            5.24GB ± 0%   +0.01%  (p=0.001 n=10+10)
BleveIndexBatch100                            1.79GB ± 0%            1.79GB ± 0%   +0.05%  (p=0.000 n=8+8)
BleveQuery                                    3.53GB ± 1%            3.53GB ± 1%     ~     (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat                1.21GB ± 0%            1.16GB ± 4%     ~     (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1             875MB ± 0%             884MB ± 0%   +1.02%  (p=0.000 n=10+10)
GopherLuaKNucleotide                           733MB ± 0%             734MB ± 0%   +0.11%  (p=0.000 n=9+10)
MarkdownRenderXHTML                            733MB ± 0%             734MB ± 0%   +0.10%  (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest                6.42GB ± 0%            6.39GB ± 1%     ~     (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest            6.62GB ± 1%            6.61GB ± 2%     ~     (p=0.927 n=10+10)
Tile38KNearestLimit100Request                 8.16GB ± 1%            8.18GB ± 0%     ~     (p=0.649 n=10+8)

name                                old p50-latency-ns     new p50-latency-ns     delta
Tile38WithinCircle100kmRequest                  144k ± 3%              159k ± 3%  +10.56%  (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest              215k ± 1%              232k ± 2%   +7.91%  (p=0.000 n=9+10)
Tile38KNearestLimit100Request                   347k ± 2%              373k ± 1%   +7.21%  (p=0.000 n=10+10)

name                                old p90-latency-ns     new p90-latency-ns     delta
Tile38WithinCircle100kmRequest                  908k ± 6%              956k ± 9%   +5.22%  (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest             1.07M ± 4%             1.11M ± 5%   +4.33%  (p=0.001 n=10+10)
Tile38KNearestLimit100Request                  1.03M ± 3%             1.05M ± 4%   +2.64%  (p=0.011 n=10+10)

name                                old p99-latency-ns     new p99-latency-ns     delta
Tile38WithinCircle100kmRequest                 7.55M ± 9%             7.93M ±13%     ~     (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest             7.81M ± 8%             8.39M ± 2%   +7.36%  (p=0.000 n=10+8)
Tile38KNearestLimit100Request                  2.03M ± 4%             2.08M ± 5%   +2.52%  (p=0.019 n=10+10)

name                                old ops/s              new ops/s              delta
Tile38WithinCircle100kmRequest                 9.73k ± 7%             9.16k ±11%   -5.83%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             8.31k ± 6%             7.77k ± 4%   -6.55%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  11.8k ± 3%             11.3k ± 3%   -4.51%  (p=0.000 n=10+10)
  • Introduction: Bryan Boreham, Grafana Labs
    • Questions within the team about whether useful information has been derived from Go execution traces.
    • Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
    • Michael K: Interesting data point about usability.
    • Michael P: Hard to link application behavior to trace.
    • Bryan: Example: channels. Still don't really know where to find that data.
    • Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
    • Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
      • Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
      • Tooling on the way to get open sourced.
  • Heap analysis plan (proposal: x/debug: make the core and gocore packages public #57447)
    • Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
  • https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
    • Austin: Does Perfetto do better with instantaneous events?
      • Michael P: Yes, there's a 20px wide arrow but we have so many.
      • Rhys: Hold shift, draw a box. If you aim well, you get what you want.
    • Rhys: Why is there only one timestamp on some events?
      • Austin: We can add another timestamp.
      • Michael P: Syscall fast path does a lot less.
  • pprof labels in traces
    • Michael K: I think I've unblocked Nick. Michael and I are reviewing.
  • runtime.gentraceback cleanup
    • Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
  • Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
  • Action items:
    • We're slowing down for the holidays, so no strong expectations
    • Michael K:
      • Try to land execution trace benchmarking.
      • Might look into heap analysis stuff.
      • After break, might want to start working on trace format more seriously.
  • Happy holidays!

mknyszek commented Jan 13, 2023

2023-01-05 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole

  • Organizational stuff
  • Benchmarks: Can we add a goroutine ping pong example? (Felix)
    • Tracer benchmarks all show relatively low overhead. Can we add a benchmark that demonstrates the worst case?
    • Austin: Sweet probably isn’t the right place because that’s application-level. Maybe add to Bent?
    • Felix: Next step on these benchmarks? Land MK’s trace benchmark support?
    • Austin: It’s certainly fine to land. We don’t have a good way to integrate these “extra dimensions” into our automated benchmarking.
    • AI(austin): Bring up monitoring extra benchmarking dimensions.
    • Austin: “Unit benchmarks” would be the perfect place for a ping pong benchmark (we already have one in the main repo), but we never quite got to integrating these into automated monitoring.
  • Are only GC STW recorded? Would it make sense to record other STW events (read metrics, goroutine profile, heap dump)? (Felix)
    • Rhys: You get ProcStop events
    • Austin: Yeah, you’re right that we trace high level GC STW events.
    • Rhys: Currently the GC traces the “best case” STW, which can be really misleading.
    • Austin: We could definitely have a “stopping the world” and a “world stopped”. Maybe don’t need that for start.
    • Felix: That would be great. We’re investigating rare long STWs right now.
    • Rhys: Starting the world can take a while. Problems with heap lock contention. I would love to have more visibility into the runtime locks.
    • Austin: Runtime locks are a bit of a mess. I also wonder if they should be “scalable”.
    • Rhys: I’d love to discuss that. C&R office hours?
    • Austin: Perfect.
    • Conclusion: Let’s add events for all STWs and also separate “stopping” from “stopped”.
  • Updates on Perfetto UI (Felix and Nick)
    • Add to UI CL: https://go.dev/cl/457716 
    • Felix: The JSON currently produced by the trace tool is basically compatible with Perfetto. Doesn’t let us open really large traces without splitting, which was one of the hopes. And it takes a while to load. I was able to use the command-line tools to convert a 280 MB trace into a 9.8 GB JSON trace and load that into Perfetto, but it took 20 minutes. Nick has been working on outputting proto directly, which will hopefully produce less data than JSON.
    • Rhys: When I tried this a while ago, the connection of data flow wasn’t quite right.
    • Felix: This CL doesn’t fix that. I’m hoping it’s an upstream issue, which they’re pretty responsive to. I’m hoping protobuf will just make it go away, since that’s their canonical input.
    • Nick: Stack traces seem to be missing from protobuf, which we definitely want. We might need upstream changes to support that.
    • Felix: I suspect there may be some long tail of issues. But the initial plan would be to keep both viewers until we feel this is solid.
    • Austin: How does the streaming work?
    • Felix: They have an in-memory column store with a SQL interface on top of it. Large traces would still be a problem because they’d need to be loaded fully into memory.
    • Austin: In principle we could swap out that column store for our own streaming thing, but that sounds like a significant amount of work.
    • Felix: On Go Time someone said they only use runtime trace when they’re really desperate and then they can’t figure it out anyway. Most people don’t think about their program from the perspective of the scheduler. I’d like to have different pivoting, like one timeline per G (or M). We sort of have that in the goroutine analysis, but that only shows on-CPU time. Dominik did that in gotraceui.
  • Updates on pprof labels (Nick)
    • Nick: In MK’s recent comments on pprof labels CL, he wondered about a size limit on labels being recorded in the trace. Thinking about trace overhead. Users can also add arbitrary logs (limited by trace buffer size). My thought is that users are deciding to make these as big or as small as they want.
    • Austin: My reaction is “do what the user said”
    • Rhys: It seems like we already don’t have a limit on the pprof labels (number/length/etc) and maybe it would have been good to have a limit, but we already don’t.
    • Bryan: For me it’s more important to be able to find out how much damage you’re doing with this data. Inevitably people want one more byte than the limit and will be frustrated.
    • Felix: Two sides to this problem: how to get the data in the trace while keeping overhead low, and the other is keeping the memory usage low for keeping all these labels. For trace overhead, I’m thinking we want two or three levels of filtering: filter what events, filter events by properties (e.g., duration). JFR supports both of these. And potentially a way to modify events (maybe too far), like truncation. At some point you can almost guarantee fixed-cost tracing. E.g., turn off everything except profile events; now you have timestamps on profile events without all the other overhead.
    • Austin: MK and I have definitely been thinking in that direction. The current trace viewer is almost purpose-built for analyzing the scheduler and needs to understand how a lot of events relate. But if we open up reading traces, the trace viewer becomes just another tool and maybe it’s fine for it to say “I need these events” (kind of like “perf sched” or similar).
    • Felix: I can ask my Java colleagues about how this works in JFR.
    • Rhys: Curious how you’re thinking about filtering.
    • Felix: Simpler is better. You could imagine a callback, but that’s not simple. Probably something like runtime/metrics where you can discover the events and select.
    • Rhys: Definitely need a header saying which events are included.
    • Felix: Agreed. Also nice for viewers so they don’t have to hard-code all of the events.

mknyszek commented Jan 19, 2023

2023-01-19 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole

  • Felix: gentraceback iterator refactoring
    • Felix: What's the progress?
    • Austin: Made progress. Running into issues with write barriers and trying to knock down all the write barriers one by one. Big open question of testing; so many horrible corner cases. No good answers.
    • Felix: Tried to do it incrementally instead of all at once; also painful. RE: testing, would it be useful to have the ability to instrument a PC and do a traceback from there?
    • Austin: That would help. The worst parts are weird though, like signals. If we had a good way to inject a symbol, like a breakpoint, that would help a lot.
      • Idea: could use hardware breakpoints via perf-event-open (Linux only, but at least architecture-independent) which could get enough coverage for Austin to be happy.
      • Could potentially synthesize other signal tests from a single signal.
    • Felix: I'll give it a shot.
    • Michael K: What work could we do in parallel?
    • Felix: Could write a frame pointer unwinder separately for tracing just to get an idea of the overhead.
      • Austin: +1. Tricky things include logic in gentraceback for filtering out frames. Maybe it doesn't matter for the trace viewer (i.e. don't filter). Also inline unwinding. Trying to totally separate inline unwinding in gentraceback. Once it's its own separate thing, it'd be straightforward to plumb that into a frame pointer unwinder.
      • Michael K: Could we skip inline unwinding for the experiment?
      • Austin: Yeah.
      • Michael P: +1 to separating out inline unwinding. Already "runtime_expandFinalInlineFrame" in the runtime which is a good reference point for this.
      • Felix: Also all the complexity with cgo traceback, but we should just ignore that for the experiment.
      • Michael K: The cgo traceback tests are also really flaky, and if we could have better testing around that that would be great.
  • Felix: Perfetto UI blues … (timeline bug, link bug, stack traces, large traces, small screens, protocol buffer format) … gotraceui w/ wasm? Having an online tool with independent release cycle is tempting?
    • CL out that makes Perfetto work. Limitations:
      • Limited for very large traces as-is.
      • Doesn't seem easy to make it work as well as go tool trace (bugs). e.g. timelines not named correctly. Events not connected correctly.
        • Harder: getting stack traces to show up. Nick has tried to make it work. Protobuf format doesn't have an obvious stack trace format?
        • Nick: Not a one-to-one mapping between Catapult format and Perfetto. Can stick a single location in the Perfetto format, but not a full stack trace. Little things in the protobuf format that aren't well-documented. e.g. string interning only works if you include a number in the header.
        • Michael K: MP and I looked into this. Perfetto knows how to do this for some traces, but it’s built into a C++ library, so we’d have to rewrite that in Go or call into it from Go. I’m not sure it even has strong backwards compatibility.
        • Michael P: There is the Perfetto tool that runs the RPC server. (trace_processor.) That loads into a SQLite in-memory DB, but does do better than the fully in-browser implementation. It can do bigger traces, though is still limited. That seems like enough of an improvement to me.
        • Felix: I have a 280MB trace that gets split into 90 parts for 15 seconds on a busy server. Maybe we should start with deciding what size trace we want to have a good experience for.
        • Michael K: I think 280MB is a big trace, though it’s only 15 seconds. I think we should be targeting bigger than that. It’s easy to get a 1GB trace. But we can start with Perfetto as long as it’s better and work toward that.
        • Austin: Is that better with Perfetto?
        • Felix: I think it would be better. Maybe 5x better, so a second at a time (don’t quote me on that).
        • Michael P: The trace_processor is better, but still limited by the in-memory SQLite DB. Presumably that could be on disk. I don’t know if the trace loading is also linear in the trace size.
        • Rhys: What do you even do with an execution trace that large? How do you get value out of that?
        • Felix: This trace was from a colleague from an instance that was struggling with pauses. It looked like a straggling procstop. It was debugging the behavior of a whole application that was behaving poorly.
        • Rhys: So you were looking for behavior that was pretty zoomed-out.
        • Felix: Yeah.
        • Michael K: Part of the problem with existing traces is the usability of this. I think it’s a valid question about whether big traces are all that useful. Sometimes you’re not even really sure what you’re looking for. Say I wanted to run a full trace on every invocation of the compiler. You don’t necessarily know what you’re looking for to improve compiler speed.
        • Austin: I bet if you were to profile the space of large trace file, the vast majority of that would not be useful to you looking at it at a high level. Suggests a solution here for filtering is to just reduce what goes into the trace.
        • 280MB Trace Size Breakdown
        • Michael K: Maybe just proc start/proc stop for what Felix was describing.
        • Rhys: But once you find the problem, you want more detail. It's hard to catch the end of a garbage collection cycle because of the rules of starting a trace during a GC cycle.
        • Michael K: Fixing the mark phase issue should be easier than before.
        • Austin: Awesome breakdown!
      • User group said "please don't do this" because Perfetto isn't nice to small screens.
      • Felix: gotraceui
        • Viewing timelines for goroutines is great.
        • Would like Dominik to talk about gotraceui some more.
        • I want to be intentional about choosing Perfetto.
        • Michael K: I think the dependency on gio was a concern.
        • Dominik: Gio (the UI library I use) supports wasm, so it should be fairly
          straightforward to have gotraceui run in the browser if we want to go
          down that road.
        • Dominik: I still rely on loading entire traces into memory (but using significantly less memory than current go tool trace), but with the upcoming format changes, streaming data might be possible.  We currently load everything into memory because when the user zooms out far enough, we need all events to compute what we display. But we could probably precompute these zoom levels, similar to mipmaps.
        • Dominik: For the current trace format, gotraceui needs roughly 30x the size of the trace in memory. so a 300 MB trace needs 9 GB.
        • Michael K: I have been thinking about an HTML UI that does something like Google Maps tiles to scale. We could skip a lot of work if we could take gotraceui as the UI, but port it into something more portable than Gio. OTOH, it’s even more work to build something from scratch.
        • Dominik: WRT gotraceui's use of Gio, there'll be pretty rich UI, and I don't fancy writing UIs in HTML/JS. But all of the processing of trace data could live externally
        • Michael P: It’s not necessarily a hard requirement that the Go project itself ship a trace viewer. We have to now because there’s no API. But if we shipped an API, it wouldn’t be a hard requirement. Much like we don’t ship a debugger.
        • Michael K: One option is that we ignore the UI situation entirely and build something that you can parse separately and ship something really bare later. In the meantime, point at a little tool that will shove it into trace_processor and point people at Perfetto. For a brief time, stop shipping our own. It’s very convenient that you only need a Go installation to view these traces, but I think you’re right that we could stop shipping a UI. We could also keep the existing UI working/limping while we do other things in parallel.
        • Felix: Is Dominik looking for contributors? (That comes with its own overheads)
        • Dominik: I'm usually not big on contributions in the form of code; but ideas and feedback are hugely appreciated
        • Michael K: We don’t have to make a decision on using Perfetto now. Maybe we should plug along for two more weeks (with Perfetto) and figure out if we can fix the issues without too much effort, and then make a hard decision on what to do at the next meeting.
        • 👍
  • Felix: traceutils anonymize & breakdown and ideas: (flamescope, graphviz, tracer overhead)
    • Implemented anonymization of traces. Breakdowns, too.
    • Tracer overhead tool that uses profile samples in the trace to identify overheads.
  • Felix: Format: Consistent message framing, remove varint padding for stacks
    • 4 different cases for how an event can be laid out.
    • Maybe a way to skip messages and layouts it doesn't understand.
    • Austin: self-descriptive header giving lengths for each opcode
    • Michael K: Any state in the trace makes things hard to push it up into OTel, since that’s completely stateless.
    • Felix: We’re actually trying to do two things in OTel. Including binary data blobs, like pprof and JFRs. And something to send stateful things like stack traces, etc, where you can refer back to them efficiently.
    • David: For trace I wouldn’t expect a stateful protocol to be introduced any time soon. But for profiling it may be a possibility.
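Austin's suggestion of a self-descriptive header could be sketched roughly as follows: the trace header declares a payload length for every opcode, so a consumer can skip events whose layout it doesn't understand instead of aborting. All names and the encoding here are hypothetical, not the actual Go trace format.

```go
package main

import "fmt"

// header is a hypothetical trace header mapping each opcode to its
// payload length in bytes, allowing consumers to skip unknown events.
type header struct {
	lengths map[byte]int // opcode -> payload length
}

// next decodes one event starting at buf[off]. Events with opcodes the
// consumer doesn't recognize can still be skipped, because the header
// declares their length.
func (h *header) next(buf []byte, off int) (op byte, payload []byte, newOff int, err error) {
	if off >= len(buf) {
		return 0, nil, off, fmt.Errorf("EOF")
	}
	op = buf[off]
	n, ok := h.lengths[op]
	if !ok {
		return 0, nil, off, fmt.Errorf("opcode %d missing from header", op)
	}
	end := off + 1 + n
	if end > len(buf) {
		return 0, nil, off, fmt.Errorf("truncated event")
	}
	return op, buf[off+1 : end], end, nil
}

func main() {
	h := &header{lengths: map[byte]int{1: 4, 2: 0, 9: 8}}
	// Opcode 9 (8-byte payload) is "unknown" to this consumer but still
	// skippable because the header declares its length.
	buf := []byte{1, 0xa, 0xb, 0xc, 0xd, 9, 0, 0, 0, 0, 0, 0, 0, 0, 2}
	for off := 0; off < len(buf); {
		op, payload, next, err := h.next(buf, off)
		if err != nil {
			panic(err)
		}
		fmt.Printf("op=%d len=%d\n", op, len(payload))
		off = next
	}
}
```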

@mknyszek commented Feb 2, 2023

2023-02-02 Sync

Attendees: @aclements @felixge @nsrip-dd @thepudds @bboreham @dashpole @mknyszek @prattmic

  • Felix: Discuss results from frame pointer unwinding experiments (blog, sweet results) and next steps
    • Targeted ping-pong example, worst case. Worth noting that the stack depth in that benchmark is 2. Went from +773% -> +30%, apparently doing 50% more work too!
    • Sweet: 10% -> 2% overhead!
    • Michael K: Michael P mentioned missed cases.
    • Michael P: Inlined frames are one example. Maybe we just accept slightly less accurate traces in the tracer.
    • Austin: +1 to missing inlined frames, but we can also expand that after the fact.
    • Michael K: Do you need the binary for that?
    • Austin: Today, yes.
    • Felix: The tracer already de-duplicates stack traces. If we do inline expansion at the end, there's probably not that much work to do.
    • Michael P: Other avenue, do we need stack traces on every event? Maybe remove stack traces for some events?
    • Michael K: Where does the rest of the time go?
    • Felix: In the blog post. Frame pointer unwinding is only 9% of the trace overhead. 28% is cputicks. 21% is stack put.
    • Austin: Shocked that cputicks is 28%. It's one instruction. I guess that's a good sign?
    • Austin: (FP unwinding is also relevant for #53286. In that case it’s the kernel’s FP unwinder, but it means our FP data is going to have to be high quality for both.)
    • Thepudds: Or maybe an option for sampling of stack traces?
    • Michael K: I think it depends. As traces are used today, you probably want 100% sampling. For larger scale aggregation, I think it's a solid option.
    • Michael K: Dream of nanotime to line up clocks.
    • Austin: It might not be that bad. RDTSC is serializing so the extra math in nanotime might not make much of a difference in overhead.
    • Michael K: We should definitely pursue this, at least for tracing.
    • Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.
    • Austin: Status of FPs is better than it once was. Saving grace of the tracer is you often don't have an assembly frame on the stack. Talked about making vet complain if you clobber the frame pointer in assembly code. Would be surprised if there are problems in the compiler generated code; worry much more about assembly.
    • Felix: Worried about stack shrinking / relocation. Growing shouldn't happen while in unwinding, but not sure about shrinking.
    • Austin: I think you always see a fully formed stack.
    • Felix: There's no chance of seeing the stack mid-move?
    • Austin: The goroutine that's getting moved has to be stopped.
    • Nick: If unwinding happens asynchronously then it's a problem, like CPU profiling. We could use gentraceback in the difficult cases.
    • Felix: Plan on working on better unwind testing. That machinery could be used to harden frame pointer unwinding as well.
    • Michael K and Austin: Not a blocker to have the testing.
    • Austin: FP on x86 is specified as part of the Go internal ABI. If the compiler is messing that up that's a violation of the ABI and definitely a bug. Doesn't apply to hand-written assembly.
    • thepudds: One of the older CLs mentioned its approach depended on the stack not being copied while walking the frames, along with the comment “currently ok, but won't be if we preempt at loop backedges”... but maybe that old concern is not a current concern....
    • Michael K: I think loop backedges aren't a concern, and async preemption as it exists shouldn't be an issue.
    • Michael P: Traceback itself would just disable preemption just for consistency, but just because it's in the runtime package, we won't ever async preempt.
    • Austin: I'm not sure why loop backedges would be a concern.
    • Michael K: I don't think we should block on inline expansion, but maybe cgo tracebacks.
    • Austin: As an intermediate step, use gentraceback for if there’s a cgoTracebacker and cgo on the stack. Will work for 99% of our users.
    • Felix: Like the idea of making it the new default, but with the ability to switch back.
    • Michael K: We could add a GODEBUG flag
  • Felix: Flight recorder / ring buffer mode
    • Felix: We’d like to capture traces of slow spans. Wait for a p99 response and then get the last N MB of trace. I’m currently working on an experiment to see if this can be done in user space.
    • Michael K: I think finding the oldest batch is an O(N) operation. Ordering the batches is difficult because we assume everything will show up eventually.
    • Austin: The tracer is really stateful, so it's really difficult to actually manage a ring buffer. debuglog is a ring buffer, and what it does is consume its own format in order to manage a snapshot of the state.
    • Felix: I’d be okay with getting a non-perfect trace at the end. At least understand what the goroutines are doing. Maybe we could every once in a while emit a “synchronization” event. If a complete redesign of the format is required, [flight recorder mode] is something we’d be interested in.
    • Michael K: I’d like to find out what the overhead of writing the trace is. Say you have no stack traces, where is the rest of the time going? That’s important information for redesigning the trace format. I’ve already been thinking about redesigning the format. At the cost of using more space, it has to end up less stateful. Regularly synchronizing is one way to do that. That’s kind of where I was going: a “trace” is really a collection of self-contained traces. With the tooling able to be more resilient at the edges. Synchronization wouldn’t necessarily be STW, but you have a ragged barrier across the Ps that sync them all to the next trace chunk. That gets complicated in a ring buffer. I was thinking of gathering the requirements for a new trace format. Because there’s so much state, it’s hard to make it completely stateless without ballooning the trace.
    • Felix: JFR does that ... splitting the stream up into self-contained chunks.
    • Michael K: We’re definitely on the same page [wrt flight recorder]. The Go team arrived at this, too. We’re also trying to make ELF core dumps the source of truth for heap analysis. Ideally we’d be able to pull the ring buffer out of a core dump so you can see exactly what was happening before crashing.
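A user-space flight recorder along the lines discussed might look roughly like the following sketch: it retains only the most recent self-contained trace chunks under a byte budget, evicting the oldest first. The type and field names are invented for illustration; a real implementation would hold encoded trace batches that each start from a synchronization point.

```go
package main

import "fmt"

// flightRecorder keeps only the most recent self-contained trace chunks,
// in the spirit of the ring-buffer mode discussed above.
type flightRecorder struct {
	chunks   [][]byte
	maxBytes int
	curBytes int
}

// add appends a new chunk and evicts the oldest chunks until the total
// retained size fits the budget again (always keeping at least one).
func (r *flightRecorder) add(chunk []byte) {
	r.chunks = append(r.chunks, chunk)
	r.curBytes += len(chunk)
	for r.curBytes > r.maxBytes && len(r.chunks) > 1 {
		r.curBytes -= len(r.chunks[0])
		r.chunks = r.chunks[1:]
	}
}

// snapshot returns the retained chunks, oldest first. Because each chunk
// is self-contained, the result is parseable without earlier history.
func (r *flightRecorder) snapshot() [][]byte {
	out := make([][]byte, len(r.chunks))
	copy(out, r.chunks)
	return out
}

func main() {
	r := &flightRecorder{maxBytes: 8}
	r.add([]byte("chunk1")) // 6 bytes retained
	r.add([]byte("chunk2")) // 12 > 8: evicts chunk1
	r.add([]byte("c3"))     // 6+2 = 8: fits
	for _, c := range r.snapshot() {
		fmt.Println(string(c))
	}
}
```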

@qmuntal commented Feb 2, 2023

Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.

FYI: #57302 is hitting this as well, as I'm implementing SEH unwinding using the frame pointer. Whatever the fix for that turns out to be, it would be good to take SEH into account as well.
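For readers unfamiliar with the mechanism under discussion: frame-pointer unwinding walks the chain of saved frame pointers, reading the return address stored next to each one (on amd64, per the Go internal ABI, the caller's frame pointer is saved at [fp] and the return address sits at [fp+8]). The toy model below simulates that walk over a fake stack; the addresses and values are made up for illustration, and a real unwinder reads goroutine stack memory directly.

```go
package main

import "fmt"

// fpUnwind models frame-pointer unwinding: each frame saves the caller's
// frame pointer at [fp] and the return address at [fp+8]. Here "mem" is a
// map from address to word, standing in for the goroutine stack.
func fpUnwind(mem map[uint64]uint64, fp uint64, max int) []uint64 {
	var pcs []uint64
	for i := 0; i < max && fp != 0; i++ {
		pc, ok := mem[fp+8]
		if !ok {
			break // ran off the modeled stack
		}
		pcs = append(pcs, pc)
		fp = mem[fp] // follow the saved frame pointer to the caller
	}
	return pcs
}

func main() {
	// Three frames: leaf at fp=0x100, its caller at 0x200, and the
	// outermost frame at 0x300 (whose saved frame pointer is 0).
	mem := map[uint64]uint64{
		0x100: 0x200, 0x108: 0x1111,
		0x200: 0x300, 0x208: 0x2222,
		0x300: 0x000, 0x308: 0x3333,
	}
	fmt.Printf("%#x\n", fpUnwind(mem, 0x100, 32))
}
```

This is also why the stack-copying and assembly-clobbering concerns above matter: if a saved frame pointer isn't adjusted when a stack moves, the chain walk reads garbage.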

gopherbot pushed a commit to golang/benchmarks that referenced this issue Feb 14, 2023
This change does a lot at once, but it's mostly refactoring. First, it
moves most of the profile abstraction out of benchmarks/internal/driver
and into a new shared package called diagnostics. It also renames
profiles to diagnostics to better capture the breadth of what this
mechanism collects. Then, it adds support for turning on diagnostics
from configuration files. Next, it adds support for generating
additional configurations to capture the overhead of collecting
diagnostics, starting with CPU profiling. Lastly, it adds support for
the new Trace diagnostic.

(This change also fixes a bug in go-build where Linux perf flags weren't
being propagated.)

In the future, core dumps could easily be folded into this new
diagnostics abstraction.

For golang/go#57175.

Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754
Reviewed-on: https://go-review.googlesource.com/c/benchmarks/+/459095
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@mknyszek commented Feb 16, 2023

2023-02-16 Sync

Attendees: @mknyszek @aclements @felixge @nsrip-dd @prattmic @dominikh @thepudds @pmbauer @dashpole @rhysh

  • 468301: runtime: delete gentraceback
    • Austin: Needs more testing.
    • Austin: Nice things to do as a result, listed in the issue. e.g.
      • Simpler defer processing
      • CPU profiles have a low limit on frames it'll capture.
        • Iterator makes this much more tenable to fix.
      • Years-old squirrely bug in the race detector.
    • Felix: I’m happy to look into testing using perf, but I’m not sure when I can get to it.
    • Rhys: If there are more frames than you want to record, could you add some context by including N outermost frames and M innermost frames. Maybe a “runtime._Elided” frame in the middle.
    • Michael P: We’ve thought about doing that for panic tracebacks.
  • 463835: runtime: frame pointer unwinding for tracer (Felix Geisendörfer): WIP, but almost ready for review
    • Are slight stack trace differences acceptable?
      • Michael K: I think that’s fine. As we move toward letting people parse the format, I think lining up traces with stacks from other sources could become more of a problem.
      • Felix: The current patch passes most of the tests of tracebacks in traces.
    • Should it use an unwinder interface similar to austin’s patches?
    • Could systemstack be changed to push frame pointers? Otherwise the caller frame is lost. Naive attempts to make this change caused crashes.
      • Austin: Yes please.
    • Weird issue with syscalls on BSDs losing a frame.
      • Austin: That’s probably lazy assembly.
      • Felix: Another option is to only enable FP unwinding on Linux for now.
      • Austin: As long as it works on Linux, Windows, and Darwin I’m happy.
    • Cgo unwinders
      • Austin: It’s fine to take the slow path if the current goroutine has cgo frames and there’s a cgo unwinder.
    • Felix: I got inlining to work (when traces are finalized). Benchmark numbers are still holding.
  • Michael K: Once all of the backtrace stuff is settled, I want to try using the monotonic clock (nanotime) rather than CPU ticks.
    • Nick: Could you record nanotime at the beginning of a batch and then CPU ticks after that?
    • Michael P: To do that safely, you’d need to know when you migrate CPUs. Linux’s restartable sequences can get you that.
    • Michael K: There might not be a performance gap between nanotime and cputicks.
    • Austin: If there’s even a performance gap, you could push more of the nanotime computation into the trace reader.
$ benchstat -col '.name@(CPUTicks Nanotime)' /tmp/bench
goos: linux
goarch: amd64
pkg: runtime
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    │  CPUTicks   │              Nanotime               │
    │   sec/op    │   sec/op     vs base                │
*-8   10.75n ± 0%   16.11n ± 0%  +49.88% (p=0.000 n=20)
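Nick's batch-timestamp idea above could be sketched as follows: record a (nanotime, cputicks) pair when a batch starts, store only raw tick counts per event, and convert to nanoseconds in the trace reader. The tick rate here is invented, a real reader would calibrate it from paired timestamps at both ends of the batch, and, as Michael P notes, doing this safely also requires knowing when the thread migrates CPUs.

```go
package main

import "fmt"

// batch pairs a wall-clock-style timestamp with a TSC reading taken at
// the same moment, so per-event tick counts can be converted back to
// nanoseconds by the trace reader.
type batch struct {
	baseNanos  int64   // nanotime() captured when the batch was started
	baseTicks  int64   // cputicks() captured at the same moment
	ticksPerNs float64 // calibrated tick rate, e.g. 3.0 for a 3 GHz TSC
}

// eventTime converts an event's raw tick count into an absolute
// nanosecond timestamp relative to the batch's base reading.
func (b *batch) eventTime(ticks int64) int64 {
	return b.baseNanos + int64(float64(ticks-b.baseTicks)/b.ticksPerNs)
}

func main() {
	b := &batch{baseNanos: 1_000_000, baseTicks: 500, ticksPerNs: 3.0}
	fmt.Println(b.eventTime(500)) // at the batch start
	fmt.Println(b.eventTime(800)) // 300 ticks = 100ns later
}
```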
  • runtime: copystack doesn't adjust frame pointers on arm64 · Issue #58432 Felix Geisendörfer
    • It was relatively easy to fix once I understood what was going on, but there appear to be dragons there.
    • Boolean in the runtime does a double check of FPs on stack copies.
    • Would like to treat arm64 as a separate issue, but I plan to get to it.
  • 460541: runtime: reduce sysmon goroutine preemption (Felix Geisendörfer)
    • Michael P: There are likely issues here with special cases in the scheduler. Not sure they're easy to fix.
  • cmd/pprof: macOS 12.6.1 (M1) profile overcounts system calls (again) #57722 (Felix Geisendörfer)
    • Michael P: C reproducer and handing off to Apple (if it works) seems like a reasonable next step. No guarantee we'll get a fix though.
  • proposal: runtime: add per-goroutine CPU stats · Issue #41554 (Felix Geisendörfer)
    • Felix: Initial justification was along the lines of billing, which seems better served by pprof. Then it shifted to fast control loops to throttle users. It seems better to have scheduling priorities, but barring that it seems good to let user space do priorities.
    • Michael P: We’ve been discussing having tracing that’s cheap enough to have on all the time, and a parsing library. Maybe a user could do this by enabling tracing and parsing their own trace. Is this generally the right approach to user throttling at all?
    • Rhys: I think a cheap trace that can be parsed in the app is good and flexible. I’m not sure per-goroutine stats is the right approach. E.g., if I use the net/http client, there are a bunch of goroutines involved that I don’t control but I want to understand the latency of.
    • Felix: One trade-off of the trace route is the latency of reading your own trace.
    • Rhys: It would be useful if the app could say, “I need a ragged barrier ASAP and I’m willing to take some performance hit.”
    • Michael K: The other complication is how fast we can make the parsing. That might add unacceptable latency.
    • Felix: I think the “explain analyze” case is not the best example. The most difficult is trying to throttle a user of the database that’s doing something you don’t want. In that case you don’t know ahead of time, so you’d be doing the ragged barrier all the time.
    • Michael P: I think that’s a good argument for actual priorities in the scheduler. If you have some background goroutine watching for bad behavior, that might not get scheduled if there’s bad behavior.
    • Austin: Swirling around problems that people have been thinking about for decades. Would love to see a summary of the current state-of-the-art is here.
    • Michael K: Probably only OS APIs.
    • Austin: That's not a bad thing. If it's a good API, we can consider replicating it.
  • AIs
    • Michael K: Writing down trace requirements in earnest
    • Michael K: Testing for x/debug
    • Michael P: Need to review Austin's CL stack.
    • Michael P: debug/gosym proposal.
    • Felix: Clean up the tracer FP unwind patch (for amd64) to get it ready for review.
    • Austin: Try to keep moving along gentraceback stack. Think about test-hook-coverage aspect.

@mknyszek commented Mar 3, 2023

2023-03-02 Sync

Attendees: @mknyszek @prattmic @felixge @nsrip-dd @aclements @thepudds @rhysh @bboreham

  • Michael K: I'm 70% of the way to a trace v2 (producer, consumer, trace format), and 40% of the way to writing it up.
    • Most of the time is being spent detangling the existing tracer, documenting it, and using that to justify next decisions. Hopefully I'll have a draft to share before next time.
    • [Michael K proceeds to go into way too much detail about this. Highlights below. A public document will follow.]
      • Let's use the system clock (e.g. clock_gettime) instead of RDTSC (for a number of reasons).
      • There are a very small number of places where you really need to understand the exact order of events. The current tracer takes advantage of that and I believe we need to retain this. Timestamps aren't enough.
      • Attach traces to Ms, not Ps. There’s a lot of complexity around GoSysExit racing with trace start. Thinking about ragged start and making the parser robust to that.
        • This choice forces us into avoiding a stop-the-world.
      • Trace binary format ended up being more about consumer efficiency than producer efficiency, but still more efficient on both sides.
        • Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings.
        • Trace events are sequences of 4-byte words whose internal structure respects byte boundaries and field alignment, to allow encoding/decoding events to just be memcpys and state management.
        • Using Felix's 280 MiB trace breakdown as a motivating example. By my calculations the current design would use around 10% more. Personally that seems acceptable for the other gains.
        • Every G event has an explicit G ID, but it's derived from a "G context" event. G IDs are also compressed.
    • Michael K: We could make the stack table faster by only checking the hash instead of an exact match. Small chance of error.
    • Rhys: Let's be cautious about making sure that traces actually work.
    • Michael K: That's a good point. We should put an explicit bound on the likelihood of error. If it's astronomically small, is that fine?
    • Rhys: Astronomical is good.
    • Rhys: Would the new trace format still enumerate every goroutine? Currently can get stuck in many-millisecond STW waiting for tracing to enumerate all goroutines.
    • Michael K: My plan was no. Goroutine profiles if you want that?
    • Rhys: That's good. Yeah, you should be able to correlate a goroutine profile with a corresponding STW event in a trace. Happy about no STW in general for traces too.
    • Rhys: RE: correlating things with traces, do we want to keep things orthogonal in general? Thinking about CPU profile events in traces.
    • Michael P: I see where you're coming from in that you might want just the CPU profile events from a trace (with timestamps) and it's weird to get the whole trace and throw most of it away. We discussed having an API for configuring the trace and which events get emitted, so that might be a good place for that.
    • Austin: There's a relevant issue about making CPU profiles more configurable as well, so maybe that's a good place for it too?
    • Michael P: I think there are a lot of API questions here. Do you configure CPU profile in tracing at the CPU profile side or at the tracing side? The most natural way sounds like the tracing side because that's your actual output format, but I'm not sure. And then it gets complicated if you turn on CPU profiling in the tracing API and then you separately turn on CPU profiling, is that allowed? Right now you can't turn on profiling twice. And that's even more complicated, if we let you figure the sampling rate and they're not the same.
    • Rhys: One of the difficulties I've had in using execution traces and CPU profiles at the same time, even though the CPU profile doesn't exactly stream its output while it's going, is that it's tricky to juggle two different output formats while trying to put them into a single zip file to upload to blob storage. A single buffer would be handy.
    • Michael P: A single buffer is ideal, but we don't have a converter that could pull a CPU profile out of a trace. We're missing information.
    • Rhys: For one, we're definitely missing goroutine labels, though there's a patch out for that. We're also missing /proc/<pid>/maps for binary/symbol information.
    • Austin: It occurs to me that Linux perf basically starts with /proc/<pid>/maps.
    • Michael P: Perhaps we should also dump build information. We've been brainstorming about including this information for PGO.
    • Michael K: There's room for as much as we want at the beginning of the trace, basically, so I'm all for adding more there.
  • Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
    • Felix: Would the old trace format fit in the new parser?
    • Michael K: That was my goal. We'd have to do some retrofitting, but the old parser already exists. Caveat: parsing old traces would still have the same overall properties as the trace parser currently does.
  • Felix: Frame pointer unwinding patch for tracer is ready to review. It’s only amd64 for now and a bit rough around the edges. We should discuss what needs to be done before landing. Cgo is still missing, but I’m working on adding that.

@dominikh commented Mar 4, 2023

Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings

Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.

Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.

I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
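The pointer-free layout referred to above can be illustrated with a sketch like this: events store plain integers and indices into shared tables rather than pointers, so a slice holding millions of events contains nothing for the garbage collector to scan. The field names are illustrative, not gotraceui's actual types.

```go
package main

import "fmt"

// rawEvent contains no pointers: stacks and argument data are referenced
// by index into shared tables, which keeps GC scanning cost at zero for
// the (potentially enormous) events slice.
type rawEvent struct {
	Ts    int64  // timestamp
	Type  uint8  // event type
	G     uint64 // goroutine ID
	StkID uint32 // index into a shared stack table
}

type parsedTrace struct {
	events []rawEvent // pointer-free, so cheap for the garbage collector
	stacks [][]uint64 // indexed by rawEvent.StkID
}

func main() {
	t := parsedTrace{
		stacks: [][]uint64{{0x1111, 0x2222}},
	}
	t.events = append(t.events, rawEvent{Ts: 10, Type: 1, G: 7, StkID: 0})
	ev := t.events[0]
	fmt.Println(ev.G, len(t.stacks[ev.StkID]))
}
```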

@mknyszek commented Mar 6, 2023

Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings

Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.

It does not. It only cares about the initial state of all Ms (including goroutines running on them), and generally only mentions goroutines that actually emit events. For goroutines that aren't running, there are only two cases where we actually care about the initial state of a goroutine: whether it was blocked, or whether it was waiting. In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.

The trace still needs to indicate if a goroutine (and M) is in a syscall or if it's running. In the new design, this information is emitted together at the first call into the tracer by that M for that partition. The timestamp needs to be back-dated to the start of the partition. There's some imprecision with this back-dating but it's only relevant at the very start of a trace. The worst case is that a goroutine may appear to have been running or in a syscall at the start of a trace for longer than it actually was. The amount of imprecision here is bounded by the time delta between the global (serialized) declaration of a new partition and when an M has its buffer flushed and/or is notified (via an atomic) that tracing has started, which I expect in general to be very short and non-blocking. (We can also explicitly bound the time by telling the M what time it was contacted for a new partition.)

Note that the details above imply that when a new partition starts, a running M may have been in a tight loop and so hasn't emitted any events for the last partition, in which case we need to preempt it to have it dump its initial state. Generally, though, moving partitions forward doesn't even have to involve preemption.

Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.

I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.

That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
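The state inference described above can be sketched as a tiny state machine: the trace doesn't record every goroutine's state at a partition boundary, but the first event that mentions a goroutine implies what its state must have been. The event names and shapes below are simplified stand-ins for real trace events.

```go
package main

import "fmt"

// Simplified stand-ins for real trace event types.
const (
	evGoStart   = iota // goroutine begins running
	evGoUnblock        // goroutine transitions out of a blocked state
)

type ev struct {
	Type int
	G    uint64 // goroutine ID
}

// initialStates infers, for each goroutine, its state at the partition
// boundary from the first event in the partition that mentions it.
func initialStates(events []ev) map[uint64]string {
	states := make(map[uint64]string)
	for _, e := range events {
		if _, seen := states[e.G]; seen {
			continue // only the first event per goroutine is informative
		}
		switch e.Type {
		case evGoUnblock:
			states[e.G] = "blocked" // it must have been blocked to unblock
		case evGoStart:
			states[e.G] = "runnable" // it must have been waiting to run
		}
	}
	return states
}

func main() {
	evs := []ev{{evGoUnblock, 1}, {evGoStart, 1}, {evGoStart, 2}}
	fmt.Println(initialStates(evs))
}
```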

@dominikh commented Mar 6, 2023

In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.

That has two implications, however:

  1. goroutines that don't unblock during the trace will be unaccounted for
  2. the states of all goroutines can't be determined without looking at the entire trace

I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view? However that wouldn't really fit into an M-centric format…

That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).

I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)

@mknyszek commented Mar 6, 2023

That has two implications, however:

  1. goroutines that don't unblock during the trace will be unaccounted for
  2. the states of all goroutines can't be determined without looking at the entire trace

I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view?

Both of those things are good points.

Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated). At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.

However, I can definitely see that getting the information you describe has utility and we don't want to lose that. In the last meeting we discussed how goroutine profiles could be used to fill this gap. As a baseline, it should be fairly straightforward to correlate a goroutine profile's STW timestamp with a STW event in the trace. Taking that one step further, we could explicitly mention that the STW was for a goroutine profile in the trace. (In theory we could also dump the goroutine profile into the trace, like we do with CPU samples. I am not opposed to this, but I probably wouldn't do it to start with.)

You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable? Perhaps there's some use-case I'm totally missing. FTR, I fully recognize that we're losing something here in the trace, but I argue the net benefit is worth that cost.

Also I just want to disclaim the design details in the last paragraph: subject to change in the first document draft. :) That's just where my head's at right now. It may turn out that the per-M synchronization I have in mind is too complex.

However that wouldn't really fit into an M-centric format…

I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?

I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)

That's another good point. To be clear, I do plan to have an API with some level of abstraction and not quite just []byte-to-type. :) Events will be opaque and fields will be accessed through methods, so we have a lot of wiggle room. However, something like the Link field I think requires keeping the whole trace in memory, because you never know when someone might want to access an event from a long long time ago (though I haven't thought this through). In theory an accessor can be arbitrarily complicated and even re-parse the trace to find the event, I suppose. :P

My general hope and expectation is that the vast majority of users should never have to look at the API at all, and instead rely on tools built with it. And those that do use the API don't need to understand the file format, just the execution model it presents (which I think is somewhat unavoidable).

@dominikh
Copy link
Member

dominikh commented Mar 6, 2023

Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated).

I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?

At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.

One use case of looking at execution traces as they are now is debugging synchronization issues. Imagine having an N:M producer/consumer model using goroutines and channels, and we're debugging why producers are blocking. The reason might be that all of the consumers are stuck, which is only evident if we can see them be stuck. If they're already stuck at the beginning of the trace then they would be invisible in the new implementation.

More generally speaking, a lot of users aren't interested in the per-P or per-M views and instead want to see what each goroutine is doing (see also the per-goroutine timelines in gotraceui.) It turns out that per-G views are useful for debugging correctness and performance issues in user code and that traces aren't only useful for debugging the runtime.

You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable?

In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.

At least this would give people the choice if they want to tolerate STW for more detailed traces.

However that wouldn't really fit into an M-centric format…

I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?

Probably, yeah.

@mknyszek
Copy link
Contributor Author

mknyszek commented Mar 6, 2023

I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?

It's not quite that it didn't pan out and more that it just doesn't work with a per-M approach given other design constraints.

The ragged barrier I mentioned in an earlier design sketch is the forEachP one which is ultimately still P-focused. Part of the reason I want to switch to a per-M approach is to remove the GoSysExit complexity that comes from the fact that goroutines can in fact run without Ps sometimes. That complexity is part of the event's semantics, so it tends to leak everywhere.

A per-M approach can side-step a lot of that complexity, but it means we need a way to synchronize all Ms that doesn't involve waiting until the M gets back into the scheduler. What I wrote above is a rough sketch of a proposed lightweight synchronization mechanism that most of the time doesn't require preemption. I think that in general we can't require preemption in a per-M approach if we want to be able to simplify the no-P edge cases and also get events out of e.g. sysmon, which always runs without a P. (EDIT: D'oh. I keep forgetting that the current tracer can indeed emit events without a P. So that's really more just that we don't currently have a great way of tracking Ms in general. I would like to add more explicit M-related events. The GoSysExit point still stands because it races with a trace stop-the-world, which is the main source of complexity. If we synchronize via Ms that goes away.)

(In effect, I am proposing to shift the GoSysExit complexity somewhere else, but I hope that in the end it will be less complexity overall because the M synchronization details can probably be written in a way such that the details don't leak as much.)

@dominikh
Copy link
Member

dominikh commented Mar 6, 2023

An aside that might steer you closer to a per-M approach: I tried adding per-M timelines to gotraceui using the current format and found it impossible due to the current event sorting logic. I ran into scenarios where a P would start on an M while the M was still blocked in a syscall.

@mknyszek
Copy link
Contributor Author

mknyszek commented Jul 10, 2023

Thanks! The source of that is the notetsleepg in gcBgMarkStartWorkers which calls entersyscallblock. This case is kind of interesting, because it is a blocking syscall (the thread blocks in FUTEX_WAIT but the goroutine gets descheduled and its P is released). It's just one invoked by the runtime itself.

I'd argue maybe this one is working as intended? I see two possible problems here:

  • It is kind of noisy for the user debugging their own code to see this extra syscall that's entirely runtime-produced.
  • There are frames skipped which make it very unclear what the actual source of the syscall is. Maybe the skip count is the right amount for any syscalls going through the syscall package, but it's not right for this case. This stack trace makes it appear as if the GC somehow represents a syscall but it's just a side-effect.

Maybe we capture more stack frames? That makes noisiness potentially worse, but in this case I think it would help a lot to identify "OK it's just the GC spinning up workers and waiting for them to start" from the extra stack frames. TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem and I'd be inclined to hide fewer frames in the trace itself (and leave it up to the UI to hide things), but I'm biased since those frames are ~always useful to me personally. :)

(From a performance perspective, I doubt including those frames would hurt much. Stack traces are likely to be just as dedup-able and the extra few frames would only increase trace size slightly.)

@dominikh
Copy link
Member

Based on your analysis I agree that this is working as intended. I'm also in favor of capturing more stack frames. With a proper stack, this event might even be useful to users, since it's a possible source of latency. And if it's too noisy, the UI should allow filtering it away.

TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem

It wouldn't be the first time that Gotraceui analyses stack traces, either. I already do so to walk backwards out of the runtime for events such as blocking channel sends (the user doesn't care that this happens in runtime.chansend1).

@mknyszek
Copy link
Contributor Author

mknyszek commented Jul 21, 2023

2023-07-20 Sync

Attendees: @mknyszek @prattmic @rhysh @bboreham @dominikh

  • Michael K: Continuing to make progress, but nothing new to talk about.
  • Michael P: FYI the tree is going to be reopened for development imminently.
  • Rhys: Filed some bugs recently about lock contention. Filed an issue about a runtime/metrics metric to measure runtime-internal lock contention. Michael P mentioned that it would be nice to report runtime-internal lock contention in the contention profile.
  • Dominik: I briefly wondered about sync.Mutex and execution tracing and if there could be a (cheap) unique ID for mutexes to analyze contention in detail (i.e. find goroutines blocked on the same mutex).
  • Michael K: They do have a unique ID: their address.
  • Dominik: The problem with the address would be address reuse.
  • Michael P: Addresses could be reused, but you're tracking contention on small timescales, and unless a GC occurs, addresses won't be reused.
  • Rhys: There is sometimes lock contention in the http2 implementation: something grabs the lock for the whole transport, then does a blocking write to the http2 connection, but blocks everything (seen for up to 15 minutes). These are just bugs, but the way I've found them is via goroutine profiles. The unique ID would be helpful to filter down which goroutine owns and/or is impacted by the lock.
  • Michael P: What does it mean for a goroutine to hold a lock? Locks can be released by other goroutines.
  • Rhys: Would still help for a lot of cases.
  • Michael P: My old team's locking was complicated. I made a wrapper for sync.Mutex that enforced unlocking by the same goroutine, and there was exactly one case found in the relatively large codebase where a different goroutine released the lock.
  • Rhys: It's useful to see how goroutines interact. I see this information when mutexes are contended, but not when they're uncontended. I wonder if it would be useful to occasionally sample uncontended locks as well to understand the structure of the program when it's not overloaded.
  • Michael P: That's an interesting idea. Would you imagine that it would be integrated into a contention profile? (Contention is zero on a sample.) Or would it be a separate profile?
  • Rhys: I was thinking of execution traces, e.g. this goroutine would have unblocked this other goroutine if they were executing differently. Right now I get no data about locks that might become contended until they are.
  • Michael P: In the limit, it would be nice if every lock and unlock was a trace event, but that's very expensive.
  • Dominik: It would probably be too expensive to emit lock/unlock events for uncontended locks, right?
  • Rhys: Maybe we can do 1% or 1 per million of lock/unlock pairs.
  • Michael P: Uncontended lock is already pretty fast: CAS and you're done. But maybe a (non-atomic, per-P) counter or something is fine with a slow path if that counter gets big.
  • Rhys: Also applies to channels.
  • Dominik: And the unlock checks a flag to tell if the lock was sampled?
  • Michael K: I think that flag would have to be on the mutex.
  • Michael P: (a) It's not immediately obvious that we should sample the unlock (but that would be a nice property). (b) There are unused state bits in the mutex state. We probably couldn't use it for a counter, but we could probably use it for a flag.
  • Rhys: I think this nice property is important for execution traces. "Had this unlock not happened, this lock would not be able to proceed."
  • Bryan: The mention of RWMutex made me think about getting data on how useful RWMutex actually is. For example, how often the program is actually taking advantage of the read lock. I often wonder if the implementation cost is worth it.
  • Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.
  • Michael K: You can imagine that the lock implementation detects it by writing down the G.
  • Michael P: The problem here is that you don't have to unlock from the same G. You can do weird things like handing off lock ownership and (surprisingly) sidestepping deadlocks. This makes detection more difficult.
  • Dominik: Would putting the address in the trace be difficult?
  • Michael K: I don't think it would be that difficult. It would make the trace bigger, but you can have a side-table.
  • Michael P: You could imagine a block context event or something that includes that address and comes after the block event.
  • Dominik: I'm not sure how much address reuse actually matters. The address will refer to the same mutex for lock and unlock pairs, and if two goroutines block on the same address'd mutex, it has to be the same mutex, too. Sampling all locks/unlocks would be a different story, however.
    • Although I'm not sure how this interacts with stack moving/the elusive "moving GC".
  • Rhys: Is there data that would be helpful for you for debugging the three issues I filed (61426, 61427, 61428)? In other words, is there any way we could improve runtime diagnostics that would make those issues easier to resolve?
  • Michael K: I think runtime-internal mutex contention might be sufficiently covered by CPU profiles? Because they lock up the whole thread, so samples land there. This isn't true for sync.Mutex.
  • Rhys: Doesn't capture details in certain cases, e.g. channel lock contention. Can't distinguish individual mutexes in a profile. Filed an issue about this a while ago. It would be useful in this case too to have some identifier for the lock.
  • Michael P: The sync.Mutex contention profile would benefit from a unique identifier or something to distinguish whether it's one mutex or 5. Possibly via a label.
  • Rhys: The contention profile accumulates over all time, so attaching labels to it means that until we figure out something clever, that also leaks memory over all time. So we can't have labels.
  • Michael P: sync.Mutex can appear on a CPU profile because the contended locker has to call into the runtime scheduler. It's just not quite as expensive as calling futex.
  • Michael P: I think the problem is slightly less bad for mutex profiles, but once you turn them on they keep everything forever. Could have a diff profiling mode.
  • Michael K: Could a diff profiling mode have a label?
  • Rhys: An execution trace is start and stop, but you can't get goroutine labels from it, and you can't get allocation events. Well, you kind of can with the HeapAlloc event.
  • Michael K: That's super interesting; I think HeapAlloc being a good sampler is sort of accidental, but that's a good use-case. I'm not certain that was intended…
  • Michael P: Hyrum's Law!
  • Rhys: We do collect a stack every half-megabyte to put into an allocation profile. If we put that into an execution trace too, that would be fine. Over time we could see that a certain type of request allocates a certain type of memory looking over many samples.
  • Michael K: Wrote an allocation tracer similar to the execution trace that emits an event for every alloc and free. Had about a 30% allocation overhead, but the traces were huge.
  • Dominik: There's a GODEBUG flag to trace all allocs, isn't there?
  • Michael K: Yeah, but it's really, really slow, since it just writes to STDERR.

@tbg
Copy link

tbg commented Jul 24, 2023

Just a small comment,

Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.

https://github.com/sasha-s/go-deadlock does this (at runtime) by recording and comparing the stack frames seen for each (manually instrumented) mutex (very expensive, we only use it in nightly stress testing). Of course this requires observing a re-entrant lock acquisition. It seems difficult to do this sort of thing statically, but even doing it cheaply enough to have it always on would be helpful.

@prattmic
Copy link
Member

prattmic commented Jul 24, 2023

@tbg We discussed detection briefly. It is unclear to me how to do robust deadlock detection given that lock and unlock are not required to be on the same goroutine. e.g., consider:

  1. Goroutine 1 calls RLock.
  2. Goroutine 2 takes "ownership" of this critical section.
  3. Goroutine 1 calls RLock again.
  4. Goroutine 2 calls RUnlock.

Step 3 naively looks like a deadlock, but as long as goroutine 2 is guaranteed to make progress it is not. How does go-deadlock handle this situation?

@tbg
Copy link

tbg commented Jul 26, 2023

It gives a false positive:

https://github.com/sasha-s/go-deadlock/blob/5afde13977e624ab3bd64e5801f75f9e8eb1f41b/deadlock.go#L284-L300

It's difficult to handle locks that pass between goroutines. Which is a shame, because it's also pretty rare.

@mknyszek
Copy link
Contributor Author

mknyszek commented Aug 9, 2023

2023-08-03 Sync

Attendees: @mknyszek @rhysh @nsrip-dd @felixge @bboreham @dominikh

  • Michael K: First pass implemented, no parser yet. (CL) Veered away from doc a little based on learnings from implementation. Couldn’t get around a brief STW on trace start. Advancing the partitions or stopping the trace doesn’t need STW.
  • Rhys: Can you quantify “brief STW”?
  • Michael K: It stops the world, prevents sysmon from doing its thing, sets global trace.on value, emits an event per P but this might not need to be done during STW. Briefly spins waiting for things to exit syscall. Couldn’t figure out how to get around that.
  • Rhys: Syscall stuff is a wild card, but otherwise it sounds like less STW than what we see from a GC?
  • Michael K: Yeah, it’s extremely brief. Getting attention of every G in a non-racy way was difficult and changed the design a bit.
  • Dominik: Will STW be recorded?
  • Michael K: There's currently a bug that it doesn't, but it easily can.
  • Rhys: Some of the applications I collect traces from have long GCs, and tracer can't start or stop during a GC cycle, so I miss things, especially when I'm collecting only 1 second traces. Does the new tracer need to wait for a GC to end?
  • Michael K: It does, but that restriction can be easily lifted in the future. The trace just needs to contain more information about what's currently in progress.
  • Felix: Can waiting for syscall exit could cause a very long stop-the-world since goroutines might be blocking on I/O?
  • Michael K: Ah, no. To clarify, the syscall exit is just making sure that there are no threads in a very short critical section on the syscall exit path. Threads are guaranteed to drain from that critical section.
  • Rhys: Could the spinning for that wait still be a problem, say if that thread gets descheduled?
  • Michael K: It could, but it's very short (hundreds of nanoseconds at most), and working around that would require the tracer to be wait-free. Though, any thread writing to the trace is indeed wait-free. The core of the design is that every thread holds its own seqlock (thread-local atomic counter) while it's writing events. The tracer then makes sure nothing is writing to the trace by ensuring the counter is not observed to be an odd number (indicating that it's in a critical section).
  • Michael K: As an aside, I think I may have also mitigated some of the issues with the CPU profile samples being inserted into the tracer only in a best-effort manner. I basically just spun up a different goroutine because it was simpler. There may still be some data loss across trace generations (partitions) since the CPU profile buffer isn't forcibly flushed on each generation. We can probably do a bit more to minimize that, but on the whole the loss should be small.
  • Rhys: Does this mean we also get M information for each profile sample?
  • Michael K: Yup!
  • Nick: You mentioned you're planning on writing a parser. There was some discussion of fixing up timestamps. Have you reached that point yet?
  • Michael K: Not yet, but I expect it to be fairly straightforward. The algorithm is going to treat event partial orders as the source of truth and try to treat timestamps as annotations. It can't in general, but my goal is to keep it as close to that as possible. The timestamp fixup will just happen when timestamps don't make sense with respect to the partial order.
  • Dominik: Will we end up with two parsers, one of the old, one for the new?
  • Michael K: My goal was no, but I'm starting with a parser for the new one and then I'll go back and retrofit the old format.
  • Dominik: Can fixing up timestamps introduce severe inaccuracies?
  • Michael K: No more inaccurate than the timestamps being out of order in the first place! As a side note, I plan to write a wire format spec at some point.
  • Dominik: Would it be possible for two events to have the same timestamp?
  • Michael K: Yeah, it'll be possible.
  • Rhys: It's useful in order to identify events uniquely. But that's fine actually, a good alternative is the byte offset in the trace which is what I already do.
  • Michael K: Yeah, I think you'll have to keep doing that.
  • Rhys: One interesting property about the current trace is that CPU profile samples are put in the trace buffer a bit after the stack is collected. So if you look closely in the trace you can see it shifted right quite a bit and get an idea of how long it takes to write a trace event.

@mknyszek
Copy link
Contributor Author

mknyszek commented Aug 20, 2023

2023-08-17 Sync

Attendees: @mknyszek @aclements @felixge @nsrip-dd @rhysh @dominikh

  • Austin: Wondering if anyone had seen runtime.step or runtime.pcvalue in their profiles, re: PC value data encoding experiments.
  • Rhys: I have an app that does a lot of copystack. 6% in runtime.pcvalue+runtime.step in the profile with “the most pcvalue”
  • Austin: That's where I'd expect to see it hit the hardest. I suspect it’s pretty low on average, but some apps hit it hard.
  • Rhys: This app starts a lot of goroutines that grow their stacks.
  • Michael K: I’ve been working on the trace parser. I’m getting to the partial ordering part of the parser. I got all of the structural elements out of the way. I plan to start with just a streaming API and to put it in x/exp. What would be the best way to have this parser both exist publicly and use it from within std? Exp isn’t great for this because we don’t want to vendor exp. Maybe we could vendor x/debug?
  • Austin: What’s so bad about vendoring from exp?
  • Michael K: I suspect Russ will be like “we do not want to depend on exp in any official capacity”. I do want this to live in two places so people can start using it, with the knowledge that the API might break. The long-term plan is to have a stable API, but the devil’s in the details of how events look.
  • Michael K: Leaning toward opaque value types. I want to leave the option open for lazy decoding. E.g., switch over event type, then say “event.GoStart()” and that returns the event as a GoStart.
  • Rhys: I will note there's a lot of stuff vendored into cmd.
  • Felix: Lazy decoding sounds good from the perspective of online processing.
  • Michael K: Right now it’s a completely streaming API, but I want to be able to derive one of these streamers from a ReaderAt. Mmap the entire trace, and the reader either does a buffered read or keeps just offsets to minimize the number of copies that happen. We could keep references into a mmapped byte slice.
  • Rhys: For the programs I have that read execution trace data, often they first sort it by goroutine. I don’t know if there’s a way to do that that’s still streaming without loading the entire thing into memory. If there was a way to have multiple iterators without having to reparse the entire file…
  • Michael K: Because of partial orders across Ms, things have to come out in time order and obey the partial order. If you’re just streaming it out, the parser already has to have state for every G and P, you could process it in total order, but copy the events out to separate output streams for each G, etc.
  • Rhys: Could we have an API that allows filtering over the streaming data, without discarding all of the work it does to figure out where the chunk boundaries are? But maybe calculating the partial orders is expensive enough that you have to keep everything in memory.
  • Michael K: No matter what you have to buffer an entire partition at a time.
  • Felix: Re sorting events by G, how do you handle events that refer to two goroutines?
  • Rhys: I look at the goroutine that emitted the event. E.g., if one goroutine unblocks another, we have the Unblock on the source and a GoStart on the target.
  • Michael K: I was expecting the parser API to be pretty low level. E.g., for go tool trace, I expect this to sit below the current high-level allocation-heavy work it does (at least to start). If you want to do higher-level analysis on top of that, it should be possible. If you want to keep more state around, that should be up to the consumer.
  • Austin: I think there are asymptotic issues at play. You can write processed streams out to disk, and I think that might be necessary to scale up arbitrarily.
  • Rhys: Pprof is an example of writing to disk. E.g., you can process a pprof with a focus flag and write out a much smaller profile that you can then work with further. Maybe the first time you run “go tool trace” it rewrites the trace in sorted order and flags it as already-sorted. Or add filtering in the future. Going to disk is fine, but I want to keep things in the same format so the ecosystem keeps working.
  • Dominik (Chat): what I found is that no order is convenient for all analyses of the events. All of G, P, and M orders are useful
  • Michael K: That might be doable. I’m reluctant to have a second format.
  • Michael K: Filtering by G is interesting… Early on, Russ said to me that a trace is most useful when you have the full context.
  • Austin: If people have a lightweight way to refer to events, is that useful? It wouldn't have all the order information (so you couldn't seek the streamer to that point, though you could do that but it would be much heavier-weight).
  • Many: nod
  • Rhys: The equivalent of focus for an execution trace is "I have a 5min execution trace, but I just want to see this 30 second window starting here." One of the things you can do with go tool pprof is you can have 100 profiles, but all these samples are from the same machine so you can merge them all together by adding a label. The equivalent in the execution trace might be to process a trace and add extra information, say what request a goroutine was handling, and write that back out to be processed by other tools.
  • Michael K: For focusing on a time window, partitions get you that. Slicing a trace is very easy now. You still have to go over the trace to figure out the time windows (which is very fast), but then you can copy out the partitions you need. Long-term I proposed the trace itself could write out a partition table at the end so you don’t have to do that yourself, but I’m parking a bunch of features so I can just get this done. (Also self-description.)
  • Felix: On our end we're creating tasks for every request that we're tracing as part of DataDog's distributed tracing, and we're using the logging feature to log the span ID (request ID). We've been able to build this ourselves but it would be nice to have this for all the tools. We're currently considering goroutines involved the way pprof does (inherited from parents). Having a way to slice-and-dice based on these user-level log messages seems useful.
  • Michael K: Is the issue with the current API that you can have tasks, but you can’t associate your own data with it? If you could associate arbitrary data with the task, would that be helpful?
  • (A task just has a single string name. You can then log with key/value pairs.)
  • Dominik (Chat): Yeah. Right now, the only custom data we have is logging strings, and the association is implicit, and really bound to a G, not a task. Arguably it could also be useful for regions.
  • Felix: I don’t necessarily need the information associated at task creation time, as long as tools agree on how to process this.
  • Rhys: More on annotations, when you make an HTTP request you don't know what the method is, the account ID, etc. They're things you might want to associate with a task but you don't know until later. There are also places where interesting things happen and it's not possible to annotate with tasks and regions. There's DNS requests that happen, for instance. I think it's powerful to be able to add instrumentation after the fact and be able to write that out and use that with all the existing tools.
  • Michael K: Writing out an updated trace should be straightforward. I'm not sure I follow the instrumentation after the fact.
  • Rhys: The idea is to look for events with a particular stack that look a particular way.
  • Michael K: This annotation seems somewhat fragile, and so I was thinking of std packages pushing trace events into the execution trace.
  • Rhys: It is fragile, but it only changes once every six months (at most) and it's easy to update. It also depends on having more than just one stack frame of context.
  • Michael K: Yeah, I abandoned that idea. It's not actually meaningfully faster.
  • Rhys: Also sometimes I want to go back in time and figure out how many resources (CPU time, memory allocations) it took to figure out that we even had an HTTP request to handle.
  • Dominik (Chat): We should put runtime/metrics into the trace.
  • Michael K: Agreed. Michael P recently suggested the same thing.
  • Michael K: Felix, you mentioned issues with worker pools. Couldn't it be instrumented with regions and contexts passed in? You don't necessarily need the library to instrument.
  • Felix: I consider all the various background goroutines as part of that, and that's basically impossible to instrument.
  • Nick (Chat): The main missing spot is non-blocking channel operations. If the worker pool is also instrumented, it doesn't matter. But otherwise you only have the correlation if there's some synchronization that points to the worker goroutine.
  • Dominik (Chat): Nick, but regions are annotated with the task from the context, isn't that sufficient?

@mknyszek
Copy link
Contributor Author

2023-08-31 Sync

Attendees: @mknyszek @rhysh @bboreham

  • Most people were out. Michael K and Rhys discussed a few things, but there isn't much to note down that wasn't mentioned later. Bryan had to leave early.

@mknyszek
Copy link
Contributor Author

2023-09-14 Sync

Attendees: @rhysh @prattmic @mknyszek @felixge @nsrip-dd

  • https://go.dev/cl/527875: trace: add experimental execution trace parsing API
  • https://go.dev/issue/62627: proposal: x/debug/trace: add package for parsing execution traces
  • https://go.dev/issue/57071: runtime/metrics: add a way to measure runtime-internal lock contention
    • Have code, including for Michael P’s suggestion of getting call stacks
    • How to test??? There’s no elective blocking on runtime-internal locks, as there is in sync.
  • OTel. Adding profiling as a supported data type

TODO(mknyszek): Go back and fill in the notes from the recording.

@mknyszek
Contributor Author

2023-09-28 Sync

Cancelled due to Gophercon US.

@mknyszek
Contributor Author

2023-10-12 Sync

Attendees: @mknyszek @prattmic @felixge @rhysh @nsrip-dd @bboreham

  • Michael K: Working on fixing bugs in the Go 1.22 tracer. Based on how this goes we'll decide whether it will be enabled by default in 1.22. Should be possible to opt out either way. The existing big test is very good at finding problems.
  • Felix: We’d be interested in contributing to the new parsing library so it can read < go1.22 traces.
  • Michael K: Parser source of truth will be in stdlib and copied to /x/ on a regular basis. Vendoring stuff into the stdlib the other way around is more painful.
  • Felix: Sharing go1.21 trace STW analysis results from Datadog services.
    • Interested in flight recorder. Tried to build a version of it with the current trace. While we're doing that we were looking to see if we could ship it. The concern was STW pauses. How bad are the pauses? For most services it's OK. We saw on average between 5 and 40 ms. 15k - 200k goroutines. Proportional to the number of goroutines. 300ns per goroutine approximately. Internal service using 2M goroutines, that one took 530 ms. The removal of pauses in the new tracer will matter. Will be nice to get some real-world experience with flight recording.
    • Michael K: Only one STW pause on starting tracing and it's "constant" time.
    • Rhys: Out of curiosity, is anything in the runtime proportional to the number of Ms?
    • Michael K: Not really. There are some things that are explicitly per-M, but the GC and scheduler only really care about running Ms. The burden is usually on Ms without Ps to synchronize with the runtime.
    • Michael P: Are you by any chance increasing the max thread count?
    • Rhys: I haven't worked with programs that would exceed the limit since before the limit existed.
  • Rhys: https://go.dev/issue/63340 “I am also open to deleting /gc/pauses:seconds entirely.” (Is this the right group to discuss runtime/metrics deprecation posture?)
    • Thanks to Michael P for looking into it. I like the split you've come up with.
    • Michael K: This is not too hard of a case because we can offer an off-ramp for the metric. If the metric has a clear use-case, we don't want to remove it without offering an alternative. Deprecation and waiting a few releases is fine. Users have an opportunity to determine if the metric doesn't exist anymore.
    • Michael P: The documentation as it exists is not great for discovering metrics or whether they're deprecated. Also, should there be a GODEBUG flag for reenabled deprecated metrics?
    • Michael K: RE: documentation, I was thinking of listing all deprecated metrics in perpetuity. We shouldn't reintroduce a dead metric with a different meaning.
    • Michael P: One issue is that you have to go to the docs at all to identify that a metric is deprecated. There's no vet tool.
    • Rhys: Two ideas. The first idea is to let the metric be there but present no data. The second idea is to have a metric to query whether any deprecated metrics are being used.
    • Michael K: We could do both?
    • Rhys: GODEBUG is interesting because the app owner can fix their program without fixing the code of a library, which might be hard for them to do.
    • Michael P: I'm not thrilled about the zero value idea because there are quite a few metrics that stay zero for a long time. We could use a max uint value, but it's still awkward.
    • Felix: +1 to zero being a bad value to indicate deprecation. The metrics we released into Go 1.21 were originally broken and all zeroes, and so that would be indistinguishable from deprecation.
    • Rhys: Is the Go version in the go.mod file something that's available for this purpose? Basically, you can update your Go toolchain but metrics would retain their old names.
    • Michael P: IIUC if the Go version in the main module is less than the Go version of the toolchain, GODEBUG values will automatically be set. If we wanted to rely on this, just using GODEBUG seems to make sense.
    • Michael K: What do we do about metric values that don't make sense in a new runtime? Hm, maybe the zero value is fine in that case.
    • Michael P: I agree it's a theoretical concern, but can't think of a case where the zero value wouldn't work. If you're collecting a bunch of metrics and summing them to get a total, if you're not aware of a new metric then it might not include that. For example, some new memory class.
    • Michael K: For that specific case, I think it's OK. We have a catch-all memory metric (other) and a total. So we could just add new memory into the catch-all for compatibility.
    • Rhys: If you set the version of runtime metrics to be Go 1.21 for Go 1.22, do you not see new metrics?
    • Michael K: New metrics are fairly harmless, so it might be OK.
    • Rhys: Moving metrics around could be tricky in some cases. Let's discuss the details on the issue tracker.
    • Michael P: Do we emulate an old version of Go, or do we just do the bare minimum for compatibility? It can probably be argued both ways. GODEBUG traditionally hasn't been the former, but we are also trying to make it more comprehensive. Let's discuss further on the issue tracker. For the STW metrics proposal we can move forward on that by simply keeping both.
    • Bryan: post-meeting note: wanted to link prometheus/client_golang#994; it seems to me the Prometheus Go library could host some opinions about what are a good set of metrics to collect today.  Currently it implements "some set from 2017" and "everything". (We will follow-up on this in the next meeting.)
  • Michael K: Should Ps be part of the trace parsing API? They're kind of an internal implementation detail.
    • Felix: How likely is it that there's a future where Ps go away? What about rendering P-based timelines?
    • Michael K: I'm not certain of the value of rendering P-based timelines when you have per-M timelines.
    • Rhys: Currently I use Ps to guess what runqueue a G was in.
    • Michael P: As someone who works on the runtime, I'm biased toward knowing implementation details and having per-P details. I agree a per-P timeline mostly isn't directly useful, but it does compress the view. Minor but handy that more fits on screen.
    • Felix: +1. It impacts the visualization. The use-case of understanding the amount of runnable goroutines.
    • Michael K: There is an accepted proposal for the number of goroutines in each state, which may resolve that use-case.

@mknyszek
Contributor Author

mknyszek commented Oct 26, 2023

2023-10-26 Sync

Attendees: @mknyszek @prattmic @felixge @rhysh

  • Michael K: New tracer in good shape, stability-wise.
  • Michael K: Demo. (Summary: live trace analysis of a service under load. Ask for the recording if you're curious!)
    • Felix: How much CPU overhead for trace parsing?
    • Michael K: Parser should be able to do 26 MiB/s. Demo is ingesting ~4.6 MiB/s.
    • Rhys: Maybe we could use something like this to build a top for Go, e.g. show CPU usage of goroutines in a list?
  • Michael K: New tracer Sweet benchmark results. (Raw data and new vs. old tracing overhead comparison.)
    • Michael K: New tracer is slightly slower than old tracer. Fixed some problems already, but needs more debugging. Trace size is ~12-20% bigger. Haven’t analyzed why yet.
    • Michael K: Previous tracer had an optimization for goroutines running on the same P again that they were running on before. Got rid of this because of complexity.
    • Michael K: Ps have to have sequence numbers too.
    • Michael K: If you have a region that starts before tracing starts, you’re going to lose that information today.
    • Michael K: If you slice the trace, it’s no worse than if you started the trace at this point.
    • Rhys: Did you get rid of listing all goroutines at the start of the trace?
    • Michael K: Yes.
    • Felix: Can you publish these results somewhere?
    • Michael K: yes.
    • Felix: Trace size increase is a little worrying, but probably okay given the other improvements around partitioning.
    • Michael K: Will make the parser faster, but might not have a chance to work on smaller trace time.
    • Michael K: Current trace code has zero abstraction over trace events. Can’t take the new event API and plug it into the old one. It would be difficult to fit the new clean syscall handling into the old event model. Probably need to fork cmd/trace. I could use help with this if somebody is interested?
    • Rhys: I got a lot of code relying on the old trace events without a lot of abstraction.
    • Felix: I could make some time next week.
    • Michael K: Would be great, I want it to go in, and ideally be on by default.
    • Michael K: Tracer seems stable, haven’t gotten a crash in one week. One edge case is C calling back into Go. I haven’t had a chance to debug this yet. Doesn’t crash in Sweet otherwise.
    • Felix: Will you have time to fix the cgo stuff?
    • Michael K: Yes.
  • Rhys: Seeking tradeoff advice on runtime-internal mutex profiler. Collecting the call stack once we know that we’re going to keep the record extends the critical section “probably too much”. In a production app with GMP=36 and MutexProfileFraction=10 and a high level of contention on a particular channel lock, CPU usage increased by 0.3x. So, capture stack earlier? Larger M, or larger stack requirement in lock2?
    • Rhys: Where do we store call stacks for contended locks when we’re not able to allocate memory? Made a field on M that can store one of them. If there is a second it chooses between the two using the ns of contention value. I’d like it to be fair. I delay capturing the stack into M until I know I’m going to save it. This means I have to delay until we capture the timestamp. We only know the duration once we get the lock (?). … It was using too much CPU. Maybe mutex profile fraction is too high and should be scaled back? It seems like I need to move capturing the stack out of the critical section, but where can I store it? Worried about putting 32 pointer-size variables on the stack for every call to lock.
    • Michael P: Can you do it in unlock?
    • Rhys: Then I need somewhere to stash the duration. That would mean making every lock/channel bigger?
    • Rhys: Are locks always unlocked by same M, no hand-offs?
    • Michael P&M: Yes
    • (more discussion)
    • Rhys: I’ll stash a pointer to the lock on the M and capture the stack in unlock.
  • Felix: Deep stacks grinding Go runtime to a halt.
    • Michael K: Do you use MEMLIMIT? Will trigger GC more often if stack size grows.
    • Felix: Will share more info once we debug this.
  • Michael K: How should GOMAXPROCS be represented in the trace?
    • Currently, when Ps are named they stay alive forever in the parser. If a GOMAXPROCS change reduces the number of Ps, nothing happens in the trace; those Ps just stay idle.
    • Question: should GOMAXPROCS represent some kind of Idle->NotExist transitions for these Ps? Or is this OK?
    • Felix: It's interesting to understand when Ps actually disappear, but I haven't had a use-case for it before. If it's easy to add, I don't see a reason not to add it.
    • Michael K: I don't think it's too complex, but I'd have to think through everything. It would be something like a ProcCreate/ProcDestroy event pair.
    • Felix: There are a few places where GOMAXPROCS emits events even when the number of procs doesn't change.
    • Michael K: It's emitted on every start-the-world. To be clear, it would only emit create/destroy events for the diff in GOMAXPROCS.
    • Rhys: How hard would it be to identify what runqueue a goroutine came from?
    • Michael P: We're talking about Ps being in _Pidle, but that also can't have anything on its runqueue.
    • Rhys: It can have a non-empty mcache. But runqueues are just the more interesting thing to me today. What becomes harder to track if we don't have a ProcDestroy event?
    • Michael P: What's the reason to not make procresize have not-exist transitions? Seems like the consensus is that it's maybe not critical, but it would be easier to make sense of the traces.
    • Michael K: Mostly just worried because we don't have much time until the freeze.
    • Rhys: The wire protocol includes the count of Ps. This seems like it's a question about what the parsing API should present to applications using the parsing API. If the parsing code can figure out which Ps are dead, then it can just expose that.
    • Michael K: It's a bit harder to get it in the parser today because the parser likes to have a 1:1 correspondence between wire format events and the events it emits. The new events would actually be easier.
    • Rhys: Uber has a library that changes GOMAXPROCS but not super often.
    • Michael K: Even a few times an hour, it's probably not going to be a big deal.
    • Michael P: Given that the API change here is pretty small and it's something we can leave for now, we can do it in Go 1.23.

@mknyszek
Contributor Author

Since it came up in the last sync and I don't want it to be missed, I updated the previous sync notes with a link to the raw performance data from Sweet: Raw data and new vs. old tracing overhead comparison.

@narqo
Contributor

narqo commented Oct 27, 2023

Demo. (Summary: live trace analysis of a service under load. Ask for the recording if you're curious!)

@mknyszek if you could upload the recording to a public channel, and add the link to the file to the notes, this would be cool

@mknyszek
Contributor Author

mknyszek commented Nov 9, 2023

@narqo Hey, sorry for not getting back to you; I've been working on getting stuff (like the tracer :)) in before the code freeze. I'll get back to this soon. Thanks for your interest and sorry for the delay.

@mknyszek
Contributor Author

mknyszek commented Nov 9, 2023

2023-11-09 Sync

  • Attendees: @rhysh @mknyszek @prattmic @nsrip-dd @bboreham @felixge
  • Michael K and Felix: Status update. New tracer will land for Go 1.22, gonna try to get cmd/trace ported in time. We have a freeze exception for that. Whether it's enabled by default will depend on cmd/trace being finished in time.
  • Bryan: post-meeting note from two meetings ago: prometheus/client_golang#994; it seems to me the Prometheus Go library could host some opinions about what are a good set of metrics to collect today.  Currently it implements "some set from 2017" and "everything".
    • Michael P: Were you thinking of documentation as to what's the most interesting or an API that would expose this?
    • Bryan: I'm thinking that Prometheus would just pick the good set. The Go runtime internally could document the good set.
    • Felix: Having a good set of metrics as a default for libraries that ingest metrics is valuable and we're interested in that too. I'm not sure if that should be curated by the runtime or if some provider can make that choice. I'm also involved in an internal project to determine what the most interesting metrics are that should be in a default dashboard that doesn't overwhelm people.
    • Rhys: The discussion on the Prometheus issue seems to be about histograms and bucket alignment, etc. Is this about interesting metrics or names of metrics?
    • Bryan: I didn't mean to focus on histogram metrics. IIRC the last time this came up, the change added 88 metrics per process. The set that was present in 2017 is probably not a great set today, but the set of everything will probably make people unhappy. My target is to remove some of the ones from 2017 and to add some new interesting ones.
    • Rhys: Could this be a blog post or some living document describing the most interesting metrics?
    • Bryan: That would help.
    • Michael K: I'm slightly worried about picking a 'good set' because it may hide the fact that more detailed metrics exist. Perhaps that's just a communication problem.
    • Felix: I can get away probably with exposing the metrics we want that are useful to us, but then there's still the dashboard question.
    • Rhys: My strategy is to first record across the cluster if there's something interesting to look at, and then separately outside the metrics system to have more data at a per-process level in a cheaper way. e.g. goroutine profiles, heap profiles, etc. They're available and they're cheap.
    • Michael P: Internally at Google it tends to work this way. The metrics system will only collect a small subset of metrics company-wide, but every process has a dump of everything that's always available to look at.
    • Felix: This came up in the OpenTelemetry profiling discussions. Pixie would store things like profiles on the host, but it's difficult to run queries to learn things about aggregate behavior (e.g. bash scripts to aggregate the data).
    • Michael P: Pricing per sample. Histograms are charged per bucket so they're more expensive. But our histogram metrics are some of our most interesting metrics.
    • Michael K: I think I'm convinced we should do this.
    • Rhys: Looking forward to having an encoding/json library that can serialize histograms and doesn't choke up on +Inf and -Inf.
    • Felix: Why does it currently choke?
    • Rhys: Maybe I should open an issue with the details.
    • Rhys: We should add cross-linking between ReadMemStats and runtime/metrics.
    • Michael P: Along those lines, someone sent me some sample code that used ReadMemStats. It made me realize the docs for that don't even mention runtime/metrics.
    • Rhys: Should we add an API to runtime/metrics to write data into MemStats?
  • Michael P/Rhys: What to do about mutex profile semantics: https://go-review.googlesource.com/c/go/+/528657/6..8/src/runtime/mprof.go#b669
    • Michael P: Last time Rhys was discussing whether we should collect stack samples at lock or unlock. We decided unlock would be simpler to implement and would match mutex profiles better. That's true, but we've realized that even though both are collecting on unlock, there's a difference in semantics. The sample value in the sync.Mutex profile is how long a waiter waited for the lock. In contrast, the sample value in the current runtime-lock profile is how long the lock holder made waiters wait. Rhys proposed in the CL to copy the sync.Mutex behavior so that we do record how long waiters are waiting. I like this. But we almost certainly can't do that for 1.22. Should we submit what we have or should we wait for 1.23?
    • Rhys: This would mean the way we use the key in lock futex would be unnecessary for the core function of locks. It's only necessary for the profiling. I don't know if this going to slow down locks on Linux.
    • Michael P: I don't think it should affect performance on the fast path, only if the lock is contended. If the lock is contended you'll be sleeping anyway. But we'll see.
    • Michael K: Maintenance cost? If the correct semantics need a full rewrite then I'm not sure it's necessarily worth landing.
    • Rhys: That's not my only concern with the patch. An earlier version of the patch did show an increase in CPU time. I wrote a benchmark for the newest version and it does slow down the contended case from 6µs per op to 7µs per op. I'm wary about landing it now. Perhaps we can add a GOEXPERIMENT and put the awkward semantics behind that.
    • Michael P: I support a GOEXPERIMENT. We need tests to make sure the GOEXPERIMENT doesn't break.
    • Michael K: +1 to GOEXPERIMENT.
    • Rhys: Do GOEXPERIMENTs have an ongoing cost? Some GOEXPERIMENTs end up as runtime/metrics and stick around for 4 releases.
    • Michael P: I think you're thinking of GODEBUG which is technically different. GODEBUG is set at run time, GOEXPERIMENT is set at build time. At this point, all GOEXPERIMENTS are completely unsupported. All GODEBUGs we add specifically for compatibility, so it's kinda in-between, because there are old GODEBUGs that are more in the unsupported case.
    • Felix: Small question re: the GOEXPERIMENT. Do you have to rebuild the toolchain?
    • Michael P: No. There are a few that may be specifically for the compiler that need to be set when building the compiler, but in general they can just be set at go build time. Though, that makes me think that perhaps this could be a GODEBUG. The main benefit of the GOEXPERIMENT is that it would be a build flag and you could be absolutely sure there's no performance impact. There would still be a flag check for GODEBUG.
    • Rhys: It seems like there are a bunch of "ifs" already in lock/unlock. I'm also worried that teams have set the mutex profile fraction that's appropriate for their use of sync.Mutex, but that's no longer appropriate for runtime locks.
    • Michael K: Is this actually a breaking change?
    • Rhys: CPU went up by +0.5x in the earlier version I mentioned before. They run with a mutex profile fraction of 10 and GOMAXPROCS=36.
    • Michael P: I think this is a real concern in the final version. This might be a good example of a compatibility GODEBUG added for the final version. My hope would be that we get things to a point where it's not a problem for anyone.
    • Rhys: That's my hope too, but we're less than 2 weeks from the freeze so I think we just didn't make it for Go 1.22.
    • Michael P: I think the conclusion we're coming to is that we can add a GODEBUG, but it would be off for Go 1.22. It wouldn't affect anyone by default. In Go 1.23 we'll fix the semantics and turn it on by default, but they'll be able to turn it off.
    • Michael K: GOEXPERIMENT or GODEBUG?
    • Rhys: Not worried about the extra branch; it's only in the contended case.
    • Michael P: If that's not acceptable now, it also won't be acceptable later.
  • Bryan: we found we were spending up to 5% of CPU in runtime.morestack ref runtime: new goroutines can spend excessive time in morestack #18138
    • The default behavior of gRPC is to fire up a new goroutine per unary call. Eventually someone noticed in a profile that it's spending 5% of its time in morestack. It has an option to use a worker pool. The end result of this realization is that everyone is going to build worker pools and we lose the idea of goroutines along the way. I'd like to avoid that. There is mention on the issue that PGO could solve the issue. It did take a very long time for someone to notice.
    • Rhys: I don't think that we have a profile that shows how big a goroutine's current stack needs are, or how big it's grown to. It shows up in the CPU profile but we don't have the numbers.
    • Michael P: I agree we could have better visibility. Since a recent release the starting stack size is adjusted based on average overall stack needs. It's fairly conservative. It would be interesting to see if it's firing at all.
    • Michael K: The current strategy is conservative because it's a global knob. If we made it more fine-grained we could be aggressive. Keith (who implemented the adaptive stack size stuff) has had some thoughts about it. But he went with the global knob to begin with because it wasn't very hard to implement.
    • Rhys: This is 5% of the CPU time that's spent, but it's on the critical path for processing requests. One thing on my mind is that I'd like to see that in execution traces. What do you think about stack growth events?
    • Michael K: Stack growth happens a lot so I'd be concerned about the volume of events. I'd be in favor of it if we sampled it.
    • Rhys: Sampling will make it harder to reason about. You can kind of already infer this from the CPU sample event, too.
    • Bryan: The number of people reporting the issue may not be related to the number of people experiencing the issue.
    • Michael P: That's a good point. Part of the confusion is that 18138 doesn't mention or include adaptive stack growth. Maybe we should have closed the issue and made a new one. Definitely still worth posting on the issue. RE: stack growth events, I'm less concerned about their volume. Stack growth happens the most when new goroutines are created, so the number of stack growth events should be linearly related to the goroutine creation events.
    • Nick: You can kind of figure out how big a stack needed to be from a traceback, since you know the size of each function. Maybe that can help.
    • Michael K: This is actually fairly easy to do in gdb as it turns out.
    • Rhys: That's a great point. That requires the executable though, and there aren't many tools that we discuss here that require the executable. Could that information go into the symbol table that we ship with goroutine and CPU profiles?
    • Nick (in chat): I thought about this mainly in a PGO context, where you know the frame sizes at compile time. But I guess that wouldn't tell what they were at profile collection time, to your point.
    • Michael P: You can attach to samples but not frames in the pprof format, which is a bit annoying.
    • Felix: If you have more thoughts on what's annoying or missing in pprof, there's a group in OTel discussing pprof and possibly merging the profiling formats, so I could take those ideas to that group.

@mknyszek
Contributor Author

mknyszek commented Dec 7, 2023

2023-12-07 Sync

Attendees: @rhysh @mknyszek @prattmic @bboreham @nsrip-dd @felixge

  • Michael K: Everything's in for the new tracer. Lots of cleanup to do for next release, like moving old traces behind the new API. Working on a prototype of the flight recorder for x/exp.
  • Michael P: There's also x/debug that needs fixing, but I might not get to it before the holidays.
  • Felix: RE: making the new API compatible with old traces, I thought that might be too hard?
  • Michael K: The difficulty was with exporting the new traces behind the old (internal) API. I don't think it'll be hard to export the old traces behind the new API. It's just work.
  • Michael K: Russ Cox was concerned about the performance of traceAcquire, since there's a trace.gen atomic load to check if tracing is enabled. I didn't notice any problems in scheduler-heavy microbenchmarks on amd64 or arm64, but his point is that coroutine switches might be happening an order of magnitude more often, so every little bit (especially with respect to atomics) counts a lot more. He suggested adding a non-atomic bool that's set during the trace start STW and then unset when tracing stops later. This would replace the first check of trace.gen, so if tracing is disabled there's no atomic load at all. I'm pretty sure this works because we always double-check trace.gen, but I have to think about it more in the case of repeated trace starts and stops.
    • Michael P: The atomic load of trace.gen adds a happens-before edge, so there shouldn't be any flip-flopping. I think it sounds reasonable. I was going to suggest making trace stop and start get the attention of every M and set a bool on them, but that's more complicated.
  • Felix: I'm curious what y'alls thoughts are about coroutines and diagnostics.
    • Michael K: In tracing it's just handing off goroutines.
    • Michael P: +1, it's just goroutines. The CPU profile will treat them as separate goroutines.
    • Michael K: That reminds me; there was some discussion about tracebacks merging the stack traces from multiple goroutines.
    • Felix: Will there be a parent/child relationship between the coroutines?
    • Michael P: I would expect so.
    • Michael K: I'll note that this is only for iter.Pull.
    • Michael P: The proposal isn't yet fully accepted, so this can still be discussed on the proposal. This probably won't stop the proposal though, unless we decide that the current proposal makes it impossible to create sensible information for debugging. Delve might have some fun with this.
  • Topic: OS Metrics (Process RSS) for runtime/metrics (Felix Geisendörfer)
    • Backstory: playing a lot with runtime/metrics, especially the memory classes. I've been writing a blog post explaining the memory classes and how this relates to OS metrics.
    • One thing I noticed that's lacking is estimating how much non-Go memory a process is using. Our strategy is to take a best guess of the physical memory used by the Go runtime, and subtract that from the OS's RSS numbers… assuming there's no shenanigans like ballasts going on (which should be less common now). From that perspective, does it sound reasonable to add OS metrics? At some point you have to reconcile those numbers anyway. Lots of libraries collect these metrics differently making for a confusing landscape. Are there any good reasons not to make this a proposal?
    • Bryan: I like the suggestion, but I think it's going to be very hard to get meaning from the OS numbers. RSS includes memory the OS just hasn't kicked out yet. If you want to know if you're running out of memory you need the PSI metrics from cgroups v2.
    • Felix: Are you specifically thinking about the page cache or something else?
    • Bryan: If you write a bunch of stuff to disk it'll sit in your RSS. But specifically that's not the working set, which is a much tighter definition.
    • Rhys: Are you talking about a write syscall or something like a memory mapped file?
    • Bryan: I'm talking about a write syscall. I proposed a talk to KubeCon along the lines of "why does my process OOM." I proposed this to try and force myself to figure this out.
    • Michael P: I have a similar concern to Bryan's. There are so many metrics and it's hard to make sense of them. But Felix is also right: at some point somebody has to understand it. The metrics also differ on different OSes. Should we have different metric sets for different OSes?
    • Felix: I would also like to point out there might be some low-hanging fruit. One particular issue is looking at your memory usage and getting surprised that it doesn't match with the heap profile.
    • Michael K: At the bare minimum, I think we should communicate that the heap profile isn't exactly intended to map directly to actual memory usage.
    • Felix: Except that in practice I find the heap profile to be pretty accurate, within single-digit percent error.
    • Rhys: Haven't been closely involved with teams debugging OOMs, but my understanding is that the heap profile is confusing and insufficient for this use-case. Confusing because it's off by a factor of GOGC, and insufficient because it only includes reports of allocations once they've been around for a cycle or two. You don't get data on the most recent allocations that caused the OOM.
    • Michael P: Just double-checked the Linux kernel source that Bryan is correct. The RSS is effectively the number of pages mapped into your page table. So if you map a file, the faulted in pages are in your page table. If your memory is low then those pages get zapped. That's the core reason why it's usually confusing. Anonymous memory is pretty straightforward, but files get complicated.
    • Felix: Thanks for confirming that, but I think that just shows that it's important to give users more guidance. I think what most people want is what happened just before dying, but it's really hard because the OOM killer sends a SIGKILL. But at the same time, if only 20% caused the failure, what about the other 80%? Maybe that's more important, and we can get that information.
    • Bryan: I second that the post-mortem heap profile is difficult to get.
    • Michael K: The state of OOMs on Linux is fairly dismal, and we could spend a lot of time working around it, but I wonder if maybe we should put some effort into engaging with that community to change things. Regarding where to export OS metrics, I think Jonathan Amsterdam was starting discussions around a standard library metrics package. If that allows users to hook in their own metrics, we could provide an x/debug package that inserts platform-specific memory metrics. I feel more comfortable with that because platforms all provide different metrics and the x/ repositories have less strict compatibility guarantees, so we could evolve with the various platforms more naturally. The barrier to fixing up these metrics as things change also goes down. (Not said in the meeting, but in my mind this is like the difference between the "syscall" package, now frozen, and the "x/sys/unix" package which is allowed to continue to evolve relatively fluidly.) If we could spend some time and apply the expertise in this virtual room to do this work once, I think that would bring a big benefit to a lot of people without that much additional work. As a final note, I've also run into plenty of situations where, for example, some system provides a metric like "bytes of memory in use" with a similarly vague description, and figuring out exactly what that means and how it's derived is endlessly frustrating.
    • Rhys: The frequent cause of OOMs is not giant processes on one machine, but usually tiny processes with small resource limits. So if a process is OOMing, it's doing so on a regular basis. So what I want to do is stash a heap profile to disk every once in a while. But it only includes data from a couple GCs ago, so it'll always be stale. RE: RSS, I don't know if it's helpful, but I thought there was a syscall on Linux where you can ask which parts of memory are actually resident and get back a bitmap.
    • Felix: /proc/* has this bitmap.
    • Bryan: For the case where you really only have anonymous memory and no swap (most processes), the Go runtime could figure out what the limit is and generate the post-mortem diagnostic close to the limit.
    • Michael P: I agree we could just do it ourselves and perhaps should. cgroups even have an event that sends a signal when close to the limit. My biggest annoyance with this limit is that all sorts of reclaimable memory gets counted against you, even though the kernel throws it away as soon as you get close to the limit. So you might spend a lot of time thrashing for memory when the kernel could just drop some memory instead. It's annoying that today these would have to be two completely different systems that don't know about each other.
    • Felix: Just to recap, it's probably not worth a proposal for runtime/metrics, but maybe it could be in an x/debug package that hooks into some more general metrics package, like what Jonathan Amsterdam is starting discussions around.
    • Michael K: That's my gut feeling, but I don't want to say for sure. Maybe I'm wrong about runtime/metrics being in the wrong place. In theory we could export to expvar, but that package is in a weird place. It's not a great API, yet there are people out there using it.
    • Michael P: I think we should invite Jonathan Amsterdam to this meeting. I'll talk to him.
  • Topic: Why is CPU profiling so different from other profile types (start/stop, given that there’s no streaming in the current implementation). Implications on emitting labels for other profile types, or including them in execution trace (especially heap profile sample events?). (Rhys)
    • Michael K: I think CPU profiles are different because they implicitly require a duration (first Start and then later Stop). I'm not really sure why it's not just always-on, but my assumption is it's higher overhead. Regarding emitting heap profile sample events into execution traces, I'm hoping to use the new tracing infrastructure to bring in certain optional diagnostics into traces, like allocation events (if sampled, that's approximately like the heap profile), page allocation events, or possibly even pointer-write events to identify precise lifetimes for heap objects. This information would be useless to most people, but super useful for future runtime/compiler work.
    • Felix: Thank you Rhys for pointing out the footgun of CPU profiles accumulating data over time without freeing it. That's a lesson I probably avoided learning the hard way. A workaround could be to 
    • Rhys: Michael K said CPU profiles are expensive. It's 0.6% or 0.7%, which is not huge. Maybe we could have something on all the time that samples even less frequently.
    • Michael P: I find myself using Linux perf for the initialization part, since that's what you're stuck with. I wonder if in the SIGPROF handler we could check if we're at an async preempt safepoint and use frame pointer unwinding at that point. So we could reduce the overhead quite a bit. Signal delivery probably costs a bit, but I imagine tracebacks are the biggest cost.
    • Michael P: The pprof.Profile API lets you make a custom profile. Has anyone actually done this? I don't see this much.
    • Rhys: I've done it to count concurrent HTTP requests to find places where we didn't close the response body. Also, custom profiles are only for active handles. It doesn't help with accumulating things like bytes.
    • Felix: Have we ever measured the overhead of the CPU profile? I'm just wondering how much we're going to get.
    • Rhys: When I look at execution trace data with timestamps and call stacks, there's a minimum time delta between execution trace events that include call stacks. My understanding from that is that a normal traceback takes ~30µs. I also don't know how much time it takes the kernel to do all its timer stuff. Linux perf would have the answers.
  • No meeting in two weeks since it would be on December 21st. See you next year!

@mknyszek
Copy link
Contributor Author

mknyszek commented Jan 4, 2024

2024-01-04 Sync

Attendees: @nsrip-dd @felixge @bboreham @prattmic @dominikh @rhysh @thepudds @mknyszek Daniel Schwartz-Narbonne

  • Bryan: It came up that GOMEMLIMIT should replace a heap ballast, but we've found in practice that it doesn't work and we've had to roll back to a heap ballast. [Thread.]
    • One example is a web proxy that handles a ton of requests, so if you do nothing you get a really small heap and a frequent GC. GOMEMLIMIT makes you pick a number, so you can't burst up a bit. In practice we didn't like having to pick that number, because a number too low causes issues.
    • Felix: What about GOGC?
    • Bryan: You need to have it at 1000 or 2000, so you get these really big spikes if you do nothing else. Some combination of the two ought to work, but we had a couple of attempts but had to roll it back in production. In practice we set a heap ballast of ~1 GiB.
    • Michael K: For some context, Bryan's situation involves overcommit of container memory, so simply setting GOMEMLIMIT and GOGC=off is problematic, since the runtime will just use all memory up to GOMEMLIMIT. We discussed both the idea of setting GOGC to some value and a much higher GOMEMLIMIT (closer to the true maximum, after overcommit), as well as other alternatives. The memory target proposal is a much closer fit, but it didn't get much traction. A similar issue also suggesting GOMEMLIMIT isn't good enough for all use cases was filed by someone. That was a little different since it involved a persistent memory leak, but the recommendations are about the same.
    • Rhys: Ballasts are trivial, why have them in the runtime?
    • Felix: Quick reply to ballast is that it messes with heap inuse profiles and metrics in ways that are confusing if you don't know there's a ballast.
    • Bryan: We believed that you could map memory without Go touching it, but we saw in some specific cases that the runtime actually does end up zeroing the memory, paging it all in. +1 to Felix's point.
    • Rhys: There's a blog post that points out a whole bunch of things that need to line up for it to work. It's only easy if you get the subtle parts right.
    • thepudds: Is the memory target something that can at least be approximated in user space?
    • Michael K: There are two ways: one is a ballast, and one is with GOGC. The downside of the latter is that you're not going to get perfect responsiveness, but if you're just using it as an optimization, that's probably fine.
    • thepudds: Someone could implement it as a package.
    • Bryan: The mismatch between k8s CPU numbers and GOMAXPROCS is a pain, but yeah when the GC kicks up in response to memory pressure (as defined by GOMEMLIMIT) it can become a real problem.
    • thepudds: Is SetFinalizer responsive enough?
    • Michael K: Yeah, the Go compiler actually does this and it works out OK.
    • Rhys: In the compiler we know that nothing is going to attach a finalizer that just blocks forever; if you know that's true in your application, then it's responsive enough. But you can also make a finalizer that blocks all other finalizers from running, which is a hazard.
  • Nick/Daniel: Adding GC roots to the inuse heap profile to improve memory leak debugging (proof-of-concept CL)
    • Michael P: Seems interesting. We considered modifying the heap dump tooling to emit a valid core file, so the same tooling works on crashes and heap dumps. I think the core idea of figuring out why memory is held live is very valuable.
    • Bryan: Wanted to call out that I got viewcore working again for 1.20 and 1.21. Can you give more details as to what this looks like?
    • Nick: These are labels attached to profile samples that tell you something about the root keeping the object alive. The prototype just picks a random number of roots and it's non-deterministic. RE: core dumps, I don't know how hard it is to handle core dumps operationally. In our context, we know our users might not feel great about sending a full dump of memory to us.
    • Daniel: The nice thing about this is that it does tie into the existing heap profile, and not yet another new type of thing.
    • Bryan: About getting the data, CRIU is interesting but not something I've tried. It is part of k8s though. It's an alpha feature as of 1.26, haven't checked what it is today. You can ask for a checkpoint of a container. I assume that's analogous to a core file. → Link
    • Michael P: You raise a good point about the PII stuff. A heap profile doesn't actually contain the contents of memory. In theory if we're making these artificial core files, we could exclude non-pointer data. Also agreed that it's nice to have things in the same heap profile format.
    • Daniel: Dumping the heap graph could be really useful, but our prototype doesn't do that.
    • Bryan: I have to think for a reasonably sized program it takes too long to STW to dump the full heap graph. You probably want some kind of virtual memory copy-on-write thing. In .NET you could get a dump of the heap graph. Just trying to get back there personally.
    • Daniel: One of the nice things about this approach is that we're already doing most of the work in the GC. The overhead is about 7% slower when disabled, but 12% slower when enabled. With some more careful tuning it might be possible to get this to zero when off, and lower when enabled.
    • Bryan: Yeah, if it's possible to get it cheap enough maybe that's a good path.
    • Daniel: You could also get a partial graph and that could be useful.
    • Rhys: Can the chain be something simpler like the type? What if we represent the chains as a graph of types? Is that small enough that that could fit in a new kind of hash table? Is that enough information to be useful?
    • Michael P: I don't really know the answer to your question as to whether that's enough, but I'll note that the roots aren't really call stacks at all. The chain that it's reachable through is indeed more about types. It's unclear to me exactly what information we want.
    • Rhys: The mprof hash table today has a list of call stacks as a unique key. What if this were a different sort of profile where instead of a list of call stacks, it's a list of types?
    • thepudds: Almost anything here would be a big step forward. Even if you just had a number of bytes tied to some root, it can be an 80% solution.
    • Michael K: I worry that the investment to make this fast would be really high. 7% represents >1 years of investment in improving the GC hot paths. But I could be wrong, there could be a way to make this fast. (Disclaimer: I have not looked closely at the patch.)
    • Daniel: How hard would it be to have a GC visitor framework? Are there other use-cases? RE: double counting, I don't think that should be an issue. Doesn't the GC only mark objects once?
    • Michael K: GC is inherently racy for performance reasons. It’s possible (though rare) that objects get marked twice.
    • thepudds: The ideal solution means zero cost when it's off, but then you pay the cost when you dump something, but it is also more work to analyze a new kind of dump.
    • Nick: It makes a lot of sense to me that the bar for performance here is really high. But it also wasn't totally clear to me whether the cost would be worth it, which is one reason why I wanted to bring it up here.
    • Rhys: What is standing in the way of a process snapshotting its own memory? Aside from it being hard, is there type information that's hard to get, or something that gets in the way of a process analyzing its own memory? If we want zero overhead and not have the runtime commit to supporting new things, maybe we can have the runtime support a simple thing that lets programs do introspection into their own memory.
    • Michael K: This sounds similar to what Daniel was saying about the GC visitor framework. I vaguely remember checkmark mode doing something like this.
    • Bryan: The important information is mostly in internal packages, so it's a bit of the antithesis of what Go normally exports, but if it's a delve-like thing, then why not?
    • Michael K: On a final note, it would be great to fix this. It's definitely a gap that Go has compared to other languages.
  • Michael K: Flight recorder prototype landed. Please try it out!
  • Felix: Compiler regression in go1.22?
    • Trying out PGO in 1.22; in a benchmark decoding protobuf data, even a baseline build was 4% slower on darwin/arm64.
    • Michael K: I got a message on Slack about a possible regression. It was 5% faster on linux/amd64 and 5% slower on darwin/arm64. I've been meaning to come back to that. Nothing really stood out in the profile. I was also looking at the performance dashboard and saw this yesterday. Looks like the regression came from August 2023, which was mostly compiler CLs.
    • Michael P: You're probably already aware of this, but in case you're not: if you're benchmarking a tight loop, it might turn out to be a code alignment thing. If it's protobuf it's probably big enough.
    • Felix: I'll double-check. I'll also double check to see exactly what platforms the regression happens on.

@mknyszek
Copy link
Contributor Author

2024-01-18 Sync

Attendees: @prattmic @mknyszek @rhysh @bboreham @felixge @nsrip-dd @thepudds

  • Felix: How can callers of metrics.Read understand if a metric value is the same because it has not changed since the last read or because it has not been updated (e.g. by the GC). I.e. How can one avoid computing incorrect /cpu/classes/ deltas?
    • Felix: Considering using GC count metrics, but it doesn't feel right.
    • Michael K: We should just fix this. The issue is non-monotonicity of some metrics (that are computed as subtractions) because other metrics are computed only once per GC cycle.
    • Rhys: You can use idle time as an in-band proxy for whether anything changed.
    • Felix: How would I interpret an increase in the number of GC cycles but no change in stats?
    • Rhys: It could be a system with very broken timers.
    • Felix: Not urgent, just wanted to bring it to this group's attention. Wanted to avoid manipulating metrics from the runtime before they reach the user because they're harder to interpret.
    • Michael K: I mostly added the gc metrics to replace GCCPUFraction.
    • Felix: Yeah, one problem with this metric was also that it counted the GC fraction since the program started.
    • Michael P: There might be a race between GC count and the CPU metrics. If a GC STW happens in between while the metrics are being read, they might be inconsistent.
  • Felix: testing: writeProfiles is not called after panic #65129
    • Felix: We were debugging a test hang in CI, but we had a test that panicked, which made it hard to get a trace out.
    • Michael K: We could call traceAdvance on a fatal panic, and then the trace tool would have to understand that it's OK to have garbage data at the tail of the trace.
    • Michael P: The testing package already has a cleanup function that could stop acquiring the execution trace and other diagnostics. It's weird that it doesn't.
    • Felix: Do we see any good reasons for not doing this in the defers that are already in place? My only concern is what would happen if the panic was so bad that these functions would hang. But I think that problem already exists, so we wouldn't be creating a new problem.
    • Rhys: Complex proposal: we spoke months ago about how great it would be if we could extract an execution trace from a core dump. Is that a thing that we want to do? We could have the testing package be the thing that uses it. Flushing traces still has a problem because the data is being fed through an io.Writer, which has no strong way to fully flush, especially when plumbed across many goroutines.
    • Michael P: Would like to have this, but I'm more skeptical about the testing package using this. OSes can make it difficult to enable core files. On Linux there's no other way to request a core file except for changing the global core pattern.
    • Felix: Worried about being able to do this in CI environments. For example, it might be really hard to set this up in, for example, GitHub Actions. Just looking up how to get core dumps on macOS seems complicated.
    • Rhys: I wasn't necessarily thinking of a core file, but the parent process can attach in the same way a debugger might. Just like ptrace, it would stop the process before it fully crashed. I imagine this would be more portable.
    • Michael K: I think ptrace is going to be less portable probably; delve is only available on two platforms currently and that's part of the reason why. Also, if we export a core dump file via WriteHeapDump, then we can have a portable way to dump this data. If flight recording is enabled, we can even lift the trace data directly from the flight recorder buffers, which we know will be whole.
  • Michael P: Shout-out that one of our internal users was noticing runtime lock contention in their profiles and I was able to figure out what was going on. Thank you to Rhys for contributing runtime lock contention to the mutex profiles!
  • Michael K: We're gonna be shifting gears a little bit toward efficiency and performance analysis this year. Before that happens, I want to get some trace cleanup done, and specifically the last big chunk of work is getting the old trace parser behind the new API. If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know! Dominik offered to pitch in a little bit.
    • Felix: If I can find some time I'll see if that's something I can help with. I'll also talk to Dominik.
  • Felix: Update: have not confirmed the compiler regression from 2 weeks ago. I'll ping someone on Slack if I find anything. I also found a fun PGO thing where the average goroutine stack size went up 30% after enabling PGO.
    • Michael P: You can look at the assembly to check stack frame sizes, but there's a compiler bisect tool that can be used to identify which decisions the compiler made actually resulted in the observed change. It was created for the loop variable scope change. It's called something like x/tools/cmd/bisect.
    • Felix: The challenge would be to write a reproducer, since this is a big application that's using a very large number of goroutines.
    • Michael P: If there's a common stack, then you could look at those frames specifically. I can send instructions on how to find out what's inlined due to PGO.
    • Felix: Sometimes I see a different stack trace from PGO binaries because of inlined frame attribution differences.
    • Michael P: I think what you're referring to is if a function calls a closure which gets inlined, its name gets more complicated and confusing. Otherwise it shouldn't really impact things.
    • Felix: It does make profiles harder to compare.
    • Michael P: Matthew Dempsky has been thinking about how to adjust these names.

@dominikh
Copy link
Member

dominikh commented Jan 18, 2024

If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know!

I don't have ideas regarding overall design yet. However, since there'll be no way around reading old traces entirely into memory to sort them, I'd like to suggest using the trace parser of Gotraceui. It started out as a copy of go tool trace's parser, but has been optimized to use less memory, put no pressure on the GC, and it uses a smarter way of merging batches that lowers CPU time. It's how Gotraceui is able to load traces in a fraction of the time and memory needed by the old go tool trace.

It's at https://github.com/dominikh/gotraceui/tree/8fbc7cfaeb3cebed8890efbc030a62a7f1ff3f81/trace — since it started as a direct copy of Go's parser, the git log for the folder should show all relevant changes, and includes some benchmarks.

There are some changes that might have to be reverted if we want to support very old traces. My parser dropped support for Go 1.10 and older, and it doesn't handle EvFutileWakeup; futile wakeups haven't been a thing since Nov 2015.

The only change that might not be great is 2c5675443eebc969ee07cd4f2063c2d7476f7b9b which removes support for trace formats older than Go 1.11. That change could be reverted if need be, but traces that old wouldn't benefit from my improvements to batch merging.

Edit: I've nearly completed an implementation of the conversion.

@mknyszek
Copy link
Contributor Author

mknyszek commented Mar 9, 2024

2024-02-01 Sync

Attendees: @prattmic @felixge @dominikh @nsrip-dd @rhysh @bboreham @mknyszek

  • Felix: RE: Parsing old traces in the new API. I see that Dominik has implemented it. We've tried it and it seems to work really well.
    • Michael K: On my TODO list to review, I will get to it soon. And thank you again to Dominik for working on that!
  • Felix: RE: Panics in tests cause broken execution traces.
    • Got reviews. Thanks!
  • Felix: RE: Tried PGO, found increase in stack memory usage.
    • Have a profiler tool that takes a goroutine profile and a binary, and produces a plausible stack frame size profile.
    • Worth a proposal as a profile type?
    • Michael P: Could be another sample type in the goroutine profile, possibly. Your proposed way of estimating the stack size should be pretty close. It probably omits some wrapper functions that add a few bytes. If we do make this a real profile type, we can put in the real stack size. The runtime has to look at the goroutine anyway.
    • Rhys: How common is it for goroutines to idle with a small number of stack frames, but transiently grow their stacks pretty large? That could hide stack space with this way of estimating the stack size. RE: expanding the goroutine profile, could we add extra little debug things to execution traces to get this data? For example, when stack growths happen.
    • Felix: Lots of good ideas. RE: the concern about goroutines that grow and then shrink again, are you worried about capturing the HWM of stack size?
    • Rhys: 
    • Felix: Wouldn't this be the same issue with a live heap profile?
    • Rhys: I think there's a bound on the live heap.
    • Michael P: I think there are two types of growing being discussed here. One is how much of the stack is being used, and the other is how much stack space is allocated. Rhys's case suggests that stacks shouldn't shrink too often, since we need a full GC to shrink. I think the case where the runtime directly records the allocated size handles that.
    • Michael K: I think Felix's approach would still work well if you take many samples. Eventually you'd catch the spikes. You're going into the problem already knowing your stack memory usage is high.
    • Felix: Another reason this would work well in practice is that you only care about it when you have a lot of goroutines, because that's when you have a lot of stack space.
  • Michael K: Flush generation on panic and accept broken trace tails
    • Michael P: In theory if you have a core dump, it shouldn't be too difficult to simulate what traceAdvance does, and finish up the trace. That way you can fully extract the trace out of the core dump.
    • Rhys: Does this only work well if the trace is being directed to an os.File, or does it work well in the general case? What happens when the io.Writer goroutine needs to execute?
    • Michael P: When a panic starts we freeze the world. RE: what Michael was talking about, we were thinking more about 
    • Michael K: I was thinking more about crashes as somewhat isolated, so that most of the application could still continue running. Maybe that's wrong.
    • Michael P: We could investigate that, but we'd have to not freeze the world.
  • Michael P: Just merged SetCrashOutput, it lets you specify an additional FD to write out panic string in addition to stderr.
    • You could imagine the flight recorder has a similar API that writes out to an os.File, which will be a lot more robust than a general io.Writer.
  • Michael K: Still need to look into the slight regressions we've seen due to recent diagnostics work. It's only been microbenchmarks, so we expect this to not actually impact real applications. Rolling out at a bigger scale hasn't shown any significant regressions. See https://perf.golang.org/dashboard/?benchmark=regressions&unit=&repository=go&branch=master&days=180&end=2024-02-01T16%3A37 for the benchmark list we have.
  • Rhys: Did a bit of investigation as well: runtime: ~5% regression in ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-8 sec/op at 450ecbe #64455.

@mknyszek
Copy link
Contributor Author

mknyszek commented Mar 9, 2024

2024-02-15 Sync

Attendees: @nsrip-dd @rhysh @mknyszek @bboreham @felixge @prattmic @dashpole

  • Sysmon Preemption
    • Michael P: Felix's patch came up when somebody reported a go1.22 regression that might have been caused by preemption. https://go.dev/issue/65647
    • Felix: Came up on my end this week as well. I found the preemptions useful to debug OS latency, seeing 500ms goroutine time slices in the execution trace gave me confidence that OS scheduling latency was high.
    • Schedviz tool: Michael P started a prototype to combine kernel scheduler traces and Go execution traces, effectively adding a CPU dimension to the execution trace.
  • Michael K: Default metrics set for OpenTelemetry
    • Following up on a previous discussion.
    • Michael K: Exporting all metrics might be too much. I wrote up a proposal for an API to give a default set of metrics that are widely useful. Michael P convinced me that this should maybe start as documentation rather than an API.
    • Michael K: API would take a go language version and give back a set of metrics. If you ask for go1.17, you get the defaults for that version.
      • func Default(toolchain string) ([]Description, error)
    • Bryan: There is a prometheus client golang issue (link?). Maybe we (grafana) could dig up information on which go metrics are collected by our customers in the cloud via prometheus. We can scan dashboards and queries for 30d and report on this in aggregate.
    • Michael K: That would be useful to have. I came up with a default set in my doc, but I’m not super happy with it. I could share the doc, but feels like it might get deprecated right away.
    • Michael K: The default set would have strict compatibility guarantees. Anything in the set would be supported effectively forever. You could unconditionally rely on those metrics to still exist (but they might become 0).
    • Rhys: Questions about RSS, VSS, why heap profile is ½ of RSS come up often. Maybe the GC guide could address some of this. I don’t understand why the API to get good metrics needs to be in the core rather than a 3rd party thing.
    • Michael K: Yeah, could also go into /x/ as a middle ground.
    • Rhys: I don’t know what it means for a metric to be a default. Different apps have different needs. This is also coming up b/c metrics are expensive. They are small relative to heap profiles which many folks collect.
    • Michael K: I had the same thoughts – why is this a problem at all.
    • Bryan: We charge $8 / month for a thousand time series. I think runtime/metrics has ~100 metrics, that would be 80 cents per month. This is multiplied by the number of processes you run.
    • Felix: Usually what's expensive is metrics with tags with high cardinality. runtime/metrics are typically exceptions to this because they're process-global. I think it's cheap-ish, but different people have different opinions.
    • Rhys: I agree that putting them as time series in practice gets expensive, but there's a lot of data we can get out of the runtime that we wouldn't turn into a time series. Like heap profiles and execution traces. In practice I'd like to have all the data sitting around.
    • David: Our motivation is primarily around stability. Unless it's very very expensive, and it sounds like it's not, that's probably not much of a concern for us. We would like to be able to say "here's 1.0, we won't break you."
    • Bryan: In Prometheus there will be 3 options: the 2017 set, all of them, and a recommended set. Aren't heap profiles just as large? (e.g. you have 100s of stacks and frames; if you have a bunch of heap profiles, isn't that the same?)
    • Rhys: I'm not sure we're looking at this the same way. For stability, could we have a string translation API?
    • David: That seems OK. The library would hide this translation from users. The harder question is: is there a set of metrics that we think will continue to be useful in Go 1.30?
    • Rhys: I would expect most runtime/metrics to not change. Recently, the STW metric changed.
    • Michael P: I see where you're coming from David. I think to Rhys' point that most metrics don't change. But there are also details about, for example, memory classes. New classes can get added, and some might not make sense as the GC changes.
    • Felix: I think the argument of stability resonates with me. I recently built a dashboard that breaks down memory and it could break. I also think I got what Bryan was saying earlier. I think you can see the heap profiles similarly to metrics, but the blob profile has different performance characteristics. Also, they tend to get taken less frequently.
    • David: It's also good to have some space for metrics to be introduced that are in development.
    • Michael K: Shall we move forward with documenting metrics? Would this work for you David?
    • David: What this would look like is we would have a documented set that depends on Go's documented set. This would be fine to unblock us. Going forward, it would be good to have a way to make this more maintainable and detect breakage from the Go side.
    • Felix: Last I looked it seemed like the Go metrics support in OTel was incomplete. What's the plan?
    • David: We plan to throw away what we have and start from scratch. We have two approaches in progress: one would be to have a stable set, and the other would be to have some configuration so that users that want different metrics or more metrics don't have to wait for us.
    • Felix: What's the best way to follow along this stream of work?
    • David: There is an issue on the semantic conventions repo: Semantic conventions for go runtime metrics open-telemetry/semantic-conventions#535
    • David: There are also WIP discussions/proposals in the Go sig: runtime instrumentation replacement proposal
  • OpenTelemetry OTEP 239 (pprofextended format proposed as new profiling signal format) - Felix Geisendörfer
    • I wanted to call some attention to this; it's in the last-call stage and will be merged soon.
    • At this point the spec could still be changed, but the big picture will probably stay the same.
    • pprofextended: open-telemetry/opentelemetry-proto-profile@2cf711b...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2
    • Michael K: I'll bring it up to the team.
    • Michael P: Alexey has been following this, which gives me confidence that there will be some alignment.
    • Felix: Yeah, he basically co-designed this with Dmitri from Grafana. But AFAIK he was not the one who could make the call on what ultimately happens with the pprof project. It's more a question of whether OTel determines pprof's future.
    • Michael P: Yeah, it seems like the question is how they're going to be kept in sync (or not).
    • Felix: Divergence would be highly undesirable, because great pains were taken to make pprofextended maintain compatibility.
  • Felix: Deterministic Testing for Go
    • Felix: A colleague approached me with questions on how to make the Go runtime more deterministic for testing.
    • Michael K: We have considered adding scheduler hooks or more deterministic scheduling.
    • Michael P: runtime: scheduler testing controls #54475
    • Rhys: There are some places where the Go project intentionally randomizes.
    • Felix: I think my colleague was actually trying to randomize but with a fixed seed.
    • Michael K: Race mode already randomizes a whole bunch of things. You could imagine setting the seed for that.
    • Felix: A seed that is used for everything would be interesting.
    • Michael P: Some of the randomness isn't explicit, like two threads racing for the same work. It needs explicit support to become deterministic.
    • Rhys: You could imagine a record-replay type system.
    • Michael K: It occurs to me that an execution trace might have the information you need for that already. It's not designed for it, but it might work.
    • Michael P: That's interesting, it might work. There will still be some weird things you need to do, for example to make an async preemption land at exactly some instruction.
