
runtime: diagnostics improvements tracking issue #57175

Open
15 tasks
mknyszek opened this issue Dec 8, 2022 · 12 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mknyszek
Contributor

mknyszek commented Dec 8, 2022

As the Go user base grows, more and more Go developers are seeking to understand the performance of their programs and reduce resource costs. However, they are locked into the relatively limited diagnostic tools we provide today. Some teams build their own tools, but right now that requires a large investment. This problem extends to the Go team as well: we often put significant effort into ad-hoc performance tooling to analyze the performance of Go itself.

This is a tracking issue for improving the state of Go runtime diagnostics and its tooling, focusing primarily on runtime/trace traces and heap analysis tooling.

To do this work, we the Go team are collaborating with @felixge and @nsrip-dd and with input from others in the Go community. We currently have a virtual sync every 2 weeks (starting 2022-12-07), Thursdays at 11 AM NYC time. Please ping me at mknyszek -- at -- golang.org for an invite if you're interested in attending. This issue will be updated regularly with meeting notes from those meetings.

Below is what we currently plan to work on and explore, organized by broader effort and roughly prioritized. Note that this will almost certainly change as work progresses, and more may be added. (TODO(mknyszek): Expand on this.)

Runtime tracing

Tracing usability

Tracing performance

Heap analysis (see #57447)

  • Update viewcore's internal core file libraries (gocore and core) to work with Go at tip.
  • Ensure that gocore and core are well-tested, and tested at tip.
  • Make gocore and core externally-visible APIs, allowing Go developers to build on top of them.
  • Write go.dev documentation on how to do heap analysis with viewcore.
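
For background on the workflow these items support, here is a minimal sketch of producing a core file that a heap-analysis tool such as viewcore can inspect. It assumes a Linux system with core dumps enabled (e.g. `ulimit -c unlimited`); the allocation loop is purely illustrative, and the viewcore invocation itself is left out since documenting it is what the last item above covers.

```go
// A minimal sketch, assuming a Linux system with core dumps enabled
// (e.g. `ulimit -c unlimited`); the allocation loop is purely illustrative.
package main

import "runtime/debug"

// live keeps the allocations reachable so they show up in the heap snapshot.
var live [][]byte

func main() {
	// "crash" makes the runtime abort (SIGABRT) after printing tracebacks,
	// which produces a core file on platforms that support it. This is
	// equivalent to running the program with GOTRACEBACK=crash.
	debug.SetTraceback("crash")

	// Grow the heap so the core file has something interesting to analyze.
	for i := 0; i < 256; i++ {
		live = append(live, make([]byte, 1<<20))
	}

	// The resulting core file (plus the binary) is what viewcore operates on.
	panic("dump core for heap analysis")
}
```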

CC @aclements @prattmic @felixge @nsrip-dd @rhysh @dominikh

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 8, 2022
@mknyszek mknyszek added this to the Go1.21 milestone Dec 8, 2022
@mknyszek
Contributor Author

mknyszek commented Dec 8, 2022

2022-12-07 Sync

Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh

Notes:

  • pprof labels in execution traces (see the API sketch after these notes)
    • Michael K: I need to follow up on the CL and issue.
    • Michael P: Have you considered runtime/trace regions?
    • Nick: Yes, but it doesn't quite hit our use-cases.
    • Rhys: Could use logs instead. Is the point to follow goroutines?
    • Felix: Yeah.
    • Rhys: Started parsing the trace format for some internal tooling.
    • Felix: Prefer labels because inheritance. Mostly like it for profile tools on top, but maybe doesn't matter for tracing tools. The parent goroutine gets recorded so things like request IDs can be tracked in post-processing.
      • Also useful for attributing CPU time.
    • Michael P: Tasks are inherited, but inheritance is processed downstream.
      • Would be nice to have pprof labels just so you don't have to think about which one to use.
    • Michael K: Useful even just to bridge pprof and runtime/trace.
    • Austin: Agreed. We can one day get to the point of deprecating the old APIs as well.
    • Rhys: RE: Attributing CPU time, can see at better than 10 ms of granularity already (even if Ps aren't CPU time, it's still time the rest of the app couldn't run).
    • Michael P: There's an issue about measuring CPU time online. (TODO: Find it.)
  • Trace parsing API
    • Michael K: How important is this? Priority?
    • Felix: Important for the community, but we can't make use of it in our case. Will the trace format change in between releases?
    • Michael K: I think we can always guarantee a trace format for a release.
    • Michael P: How high level should this API be?
      • cmd/trace has two levels:
        • Low-level that understands format and events.
        • Higher-level that understands relationships between goroutines, etc.
    • Michael K: Page trace splits this into "parser" and "simulator." The latter is more stateful.
    • Felix: Intuitive feeling toward lower level API.
    • Rhys: +1 to low level.
    • Austin: Scalability of processing traces.
      • Currently not in a good state in low or high level format (currently requires the whole trace).
      • Can fix trace wire format for low-level parsing scalability issues, but it's much less clear how to do this for the high-level format.
    • Austin: Flight recorder idea.
      • Interacts interestingly with simulation. Current trace snapshots everything.
      • Solved this in debuglog; reads its own tail and keeps local state updated.
      • Complicated trade-offs in this space.
    • Felix: We use a lot of JFR, one thing that's nice is it's broken down into self-contained chunks.
  • Michael K sent out a very half-baked trace format revamp. (Thanks for the comments! Far from ready to share more widely.)
    • The next step is to measure the actual current overhead.
      • Maybe add a mode to Sweet?
      • Rhys: Have been collecting CPU profiles and execution traces. 20% of CPU time during execution trace is for execution trace itself. 95% of overhead is collecting stack traces.
        • Collect 1 second every 5000 seconds and no one complains. People do complain about goroutine profiles every 2 minutes.
      • Michael K: Shooting for KUTrace overhead, so making stack traces optional/faster is just step 1.
      • Felix: Trace effect on tail latency.
        • Rhys: Traces are my view of tail latency.
      • Felix: Benchmark for pathological cases and worst case.
      • Austin: Linked to trace overhead issue, Dmitry proposed switching to frame pointer unwinding.
      • Felix: At some point implemented frame pointer unwinding in userland and it was 50x faster (link).
      • Rhys: Not sure what kind of tool you could build without stack traces in an app that doesn't set pprof labels, tasks, regions, trace logs, etc.
      • Michael K: Integration with structured logging?
      • Michael P: It does add yet another way to add things to runtime/trace.
      • Rhys: The standard library (e.g. database/sql) doesn't currently use runtime/trace at all, maybe it should.
      • Michael K: This connects to deciding what goes into a trace. I think this is a very good idea.
      • Felix: +1. Java world and JFR does this.
  • Deciding what goes into a trace
    • Disabling stack tracing / reduce stack trace depth
    • Filtering by pprof labels
    • Specific event classes
    • Standard library events
    • Rhys: I've made this decision for my organization. Expected that you do profiling for a long running service. No opportunity for app owners to express opinions. People who complained forked the package, turned it off, and now coming back. I kind of want everything.
    • Felix: I would love to be in a place where we can do that, but we get pushback from users when the overhead is too high.
    • Rhys: The question is how close we get within 1% overhead. My choice was to get everything, but less often.
    • Felix: Desire to get more of everything is in conflict with adding more kinds of things in the trace.
    • Michael P: Agreed. Ideally we have tracing that's sufficiently fast that we can have it on all the time, but if libraries are allowed to add new traces then it could be a problem. It would be nice to turn that off without forking a library.
  • Before next sync:
    • Michael K: Unblock pprof labels patch and benchmarking trace overhead.
    • Felix: I can contribute a worst case benchmark.
      • Currently blocked on pprof labels in trace.
    • Felix: Started to work on gentraceback. Might work on it over the holidays.
      • Trying for bug-for-bug compatibility.
    • Michael P: Austin has been working on this too.
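
For reference on the pprof-labels-versus-tasks/regions discussion in these notes, here is a minimal sketch of the standard-library APIs involved; the label key, task and region names, request ID, and output file are illustrative.

```go
// A minimal sketch, not part of any proposal: the label key, task/region
// names, request ID, and output file are illustrative.
package main

import (
	"context"
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func handleRequest(ctx context.Context, requestID string) {
	// pprof labels set via Do are inherited by goroutines started inside the
	// callback, which is the inheritance property discussed above.
	pprof.Do(ctx, pprof.Labels("request_id", requestID), func(ctx context.Context) {
		// Tasks, regions, and logs are the runtime/trace-native annotations.
		ctx, task := trace.NewTask(ctx, "handleRequest")
		defer task.End()

		defer trace.StartRegion(ctx, "decode").End()
		trace.Log(ctx, "request_id", requestID) // attaches key/value data to the task
		// ... actual request handling would go here ...
	})
}

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	handleRequest(context.Background(), "req-42")
}
```

Labels set this way are inherited by goroutines created inside the callback, whereas task inheritance is reconstructed downstream by trace tooling, which is the trade-off discussed above.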

@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 8, 2022
@felixge
Contributor

felixge commented Dec 9, 2022

I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into #57159 . Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB trace (15s of prod activity) yesterday, and it was a real eye opener to see how much the current UI struggles.

@tbg

tbg commented Dec 9, 2022

I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer, where it would freeze on me all the time, especially in the per-goroutine view (/trace?goid=N). I figured out that you can download Perfetto-compatible JSON data from /jsontrace?goid=N (/jsontrace gives the default view), which can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace, so it's not as great, but I was glad to have something that worked.
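
For anyone who wants to script that workaround, a minimal sketch is below; it assumes `go tool trace trace.out` is already serving on 127.0.0.1:33591 (substitute the address it prints at startup), and the goroutine ID and output filename are placeholders.

```go
// A minimal sketch: download the Perfetto-compatible JSON that the
// `go tool trace` web server exposes, for upload to ui.perfetto.dev.
// The address, goroutine ID, and output filename are placeholders.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:33591/jsontrace?goid=42")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Save the JSON; this file can then be uploaded to ui.perfetto.dev.
	out, err := os.Create("goroutine42.json")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```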

@thediveo

thediveo commented Dec 9, 2022

would the pprof labels also show up in goroutine traces?

@qmuntal
Contributor

qmuntal commented Dec 9, 2022

I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?

@mknyszek
Contributor Author

mknyszek commented Dec 9, 2022

@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.

@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer that's certainly more in the same spirit. Do you have an issue for tracking that already?

@qmuntal
Contributor

qmuntal commented Dec 9, 2022

Do you have an issue for tracking that already?

I still have to prepare the proposal, I plan to submit it next week.

unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).

Not for now, but once I finish this I want to investigate how feasible it is to unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.

@qmuntal
Contributor

qmuntal commented Dec 14, 2022

Do you have an issue for tracking that already?

I do now #57302 😄

@gopherbot

Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead

@mknyszek
Contributor Author

mknyszek commented Dec 22, 2022

2022-12-22 Sync

Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh

  • Organizational stuff
    • OK to record meetings?
    • Meeting recorded with transcript this week (please ask if you would like to see it).
  • Trace overhead benchmarks (see the tracing sketch after these notes)
name                                old time/op            new time/op            delta
BiogoIgor                                      17.7s ± 3%             17.5s ± 4%     ~     (p=0.190 n=10+10)
BiogoKrishna                                   15.1s ± 4%             15.1s ± 4%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             5.78s ± 7%             5.76s ±11%     ~     (p=0.853 n=10+10)
BleveQuery                                     2.37s ± 0%             2.37s ± 0%   -0.26%  (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat                 16.9s ± 9%             16.9s ± 7%     ~     (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1             36.7s ± 1%             44.4s ± 2%  +21.01%  (p=0.000 n=10+10)
GoBuildKubelet                                 47.0s ± 2%             48.8s ± 3%   +3.72%  (p=0.000 n=10+10)
GoBuildKubeletLink                             8.89s ± 2%             8.88s ± 4%     ~     (p=0.720 n=10+9)
GoBuildIstioctl                                45.9s ± 1%             47.8s ± 2%   +4.09%  (p=0.000 n=10+10)
GoBuildIstioctlLink                            9.07s ± 2%             8.99s ± 2%     ~     (p=0.095 n=10+9)
GoBuildFrontend                                15.7s ± 4%             16.1s ± 2%   +2.45%  (p=0.043 n=10+10)
GoBuildFrontendLink                            1.38s ± 2%             1.37s ± 3%     ~     (p=0.529 n=10+10)
GopherLuaKNucleotide                           27.9s ± 0%             27.9s ± 1%     ~     (p=0.853 n=10+10)
MarkdownRenderXHTML                            256ms ± 2%             256ms ± 2%     ~     (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest                 618µs ± 7%             657µs ±10%   +6.30%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             722µs ± 6%             773µs ± 4%   +6.96%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  508µs ± 3%             532µs ± 3%   +4.73%  (p=0.000 n=10+10)

name                                old average-RSS-bytes  new average-RSS-bytes  delta
BiogoIgor                                     68.8MB ± 2%            71.8MB ± 4%   +4.40%  (p=0.000 n=10+10)
BiogoKrishna                                  4.42GB ± 0%            4.42GB ± 0%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             194MB ± 2%             198MB ± 3%   +1.91%  (p=0.008 n=9+10)
BleveQuery                                     536MB ± 0%             537MB ± 1%     ~     (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat                 444MB ± 1%             446MB ± 0%   +0.41%  (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1             132MB ± 1%             142MB ± 4%   +7.61%  (p=0.000 n=10+10)
GoBuildKubelet                                1.75GB ± 1%            1.85GB ± 1%   +5.51%  (p=0.000 n=10+10)
GoBuildIstioctl                               1.35GB ± 1%            1.42GB ± 1%   +5.49%  (p=0.000 n=10+9)
GoBuildFrontend                                511MB ± 2%             543MB ± 1%   +6.31%  (p=0.000 n=10+9)
GopherLuaKNucleotide                          37.0MB ± 1%            40.4MB ± 2%   +9.24%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           21.8MB ± 3%            24.0MB ± 3%  +10.14%  (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest                5.40GB ± 1%            5.38GB ± 1%     ~     (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest            5.72GB ± 1%            5.71GB ± 1%     ~     (p=0.971 n=10+10)
Tile38KNearestLimit100Request                 7.26GB ± 0%            7.25GB ± 0%     ~     (p=0.739 n=10+10)

name                                old peak-RSS-bytes     new peak-RSS-bytes     delta
BiogoIgor                                     95.9MB ± 4%            98.5MB ± 3%   +2.70%  (p=0.030 n=10+10)
BiogoKrishna                                  4.49GB ± 0%            4.49GB ± 0%     ~     (p=0.356 n=9+10)
BleveIndexBatch100                             282MB ± 3%             284MB ± 4%     ~     (p=0.436 n=10+10)
BleveQuery                                     537MB ± 0%             538MB ± 1%     ~     (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat                 485MB ± 1%             483MB ± 0%     ~     (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1             180MB ± 2%             193MB ± 3%   +7.19%  (p=0.000 n=10+10)
GopherLuaKNucleotide                          39.8MB ± 3%            46.0MB ±20%  +15.56%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           22.1MB ± 3%            25.5MB ± 7%  +15.45%  (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest                5.70GB ± 1%            5.68GB ± 1%   -0.45%  (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest            5.93GB ± 1%            5.91GB ± 2%     ~     (p=0.631 n=10+10)
Tile38KNearestLimit100Request                 7.47GB ± 1%            7.46GB ± 0%     ~     (p=0.579 n=10+10)

name                                old peak-VM-bytes      new peak-VM-bytes      delta
BiogoIgor                                      802MB ± 0%             803MB ± 0%   +0.11%  (p=0.000 n=10+10)
BiogoKrishna                                  5.24GB ± 0%            5.24GB ± 0%   +0.01%  (p=0.001 n=10+10)
BleveIndexBatch100                            1.79GB ± 0%            1.79GB ± 0%   +0.05%  (p=0.000 n=8+8)
BleveQuery                                    3.53GB ± 1%            3.53GB ± 1%     ~     (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat                1.21GB ± 0%            1.16GB ± 4%     ~     (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1             875MB ± 0%             884MB ± 0%   +1.02%  (p=0.000 n=10+10)
GopherLuaKNucleotide                           733MB ± 0%             734MB ± 0%   +0.11%  (p=0.000 n=9+10)
MarkdownRenderXHTML                            733MB ± 0%             734MB ± 0%   +0.10%  (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest                6.42GB ± 0%            6.39GB ± 1%     ~     (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest            6.62GB ± 1%            6.61GB ± 2%     ~     (p=0.927 n=10+10)
Tile38KNearestLimit100Request                 8.16GB ± 1%            8.18GB ± 0%     ~     (p=0.649 n=10+8)

name                                old p50-latency-ns     new p50-latency-ns     delta
Tile38WithinCircle100kmRequest                  144k ± 3%              159k ± 3%  +10.56%  (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest              215k ± 1%              232k ± 2%   +7.91%  (p=0.000 n=9+10)
Tile38KNearestLimit100Request                   347k ± 2%              373k ± 1%   +7.21%  (p=0.000 n=10+10)

name                                old p90-latency-ns     new p90-latency-ns     delta
Tile38WithinCircle100kmRequest                  908k ± 6%              956k ± 9%   +5.22%  (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest             1.07M ± 4%             1.11M ± 5%   +4.33%  (p=0.001 n=10+10)
Tile38KNearestLimit100Request                  1.03M ± 3%             1.05M ± 4%   +2.64%  (p=0.011 n=10+10)

name                                old p99-latency-ns     new p99-latency-ns     delta
Tile38WithinCircle100kmRequest                 7.55M ± 9%             7.93M ±13%     ~     (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest             7.81M ± 8%             8.39M ± 2%   +7.36%  (p=0.000 n=10+8)
Tile38KNearestLimit100Request                  2.03M ± 4%             2.08M ± 5%   +2.52%  (p=0.019 n=10+10)

name                                old ops/s              new ops/s              delta
Tile38WithinCircle100kmRequest                 9.73k ± 7%             9.16k ±11%   -5.83%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             8.31k ± 6%             7.77k ± 4%   -6.55%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  11.8k ± 3%             11.3k ± 3%   -4.51%  (p=0.000 n=10+10)
  • Introduction: Bryan Boreham, Grafana Labs
    • Questions within the team about whether useful information has been derived from Go execution traces.
    • Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
    • Michael K: Interesting data point about usability.
    • Michael P: Hard to link application behavior to trace.
    • Bryan: Example: channels. Still don't really know where to find that data.
    • Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
    • Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
      • Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
      • Tooling on the way to get open sourced.
  • Heap analysis plan (proposal: x/debug: make the core and gocore packages public #57447)
    • Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
  • https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
    • Austin: Does Perfetto do better with instantaneous events?
      • Michael P: Yes, there's a 20px wide arrow but we have so many.
      • Rhys: Hold shift, draw a box. If you aim well, you get what you want.
    • Rhys: Why is there only one timestamp on some events?
      • Austin: We can add another timestamp.
      • Michael P: Syscall fast path does a lot less.
  • pprof labels in traces
    • Michael K: I think I've unblocked Nick. Michael and I are reviewing.
  • runtime.gentraceback cleanup
    • Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
  • Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
  • Action items:
    • We're slowing down for the holidays, so no strong expectations
    • Michael K:
      • Try to land execution trace benchmarking.
      • Might look into heap analysis stuff.
      • After break, might want to start working on trace format more seriously.
  • Happy holidays!
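
For context on the overhead numbers above (referenced from the "Trace overhead benchmarks" bullet): the old/new columns compare runs without and with execution tracing enabled. In a standalone program, the equivalent measurement amounts to wrapping the workload with the runtime/trace API; a minimal sketch, with workload() and the output filename as placeholders:

```go
// A minimal sketch of the "same workload, with and without tracing"
// comparison; workload() and the output filename are placeholders.
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

// workload stands in for whatever benchmark body is being measured.
func workload() {
	sum := 0
	for i := 0; i < 1e8; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	start := time.Now()
	workload()
	elapsed := time.Since(start)
	trace.Stop()

	// Comparing elapsed against a run without trace.Start/Stop gives the
	// kind of time/op delta shown in the table above.
	log.Printf("workload took %v with tracing enabled", elapsed)
}
```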

@mknyszek
Contributor Author

2023-01-05 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole

  • Organizational stuff
  • Benchmarks: Can we add a goroutine ping pong example? (Felix)
    • Tracer benchmarks all show relatively low overhead. Can we add a benchmark that demonstrates the worst case?
    • Austin: Sweet probably isn’t the right place because that’s application-level. Maybe add to Bent?
    • Felix: Next step on these benchmarks? Land MK’s trace benchmark support?
    • Austin: It’s certainly fine to land. We don’t have a good way to integrate these “extra dimensions” into our automated benchmarking.
    • AI(austin): Bring up monitoring extra benchmarking dimensions.
    • Austin: “Unit benchmarks” would be the perfect place for a ping pong benchmark (we already have one in the main repo), but we never quite got to integrating these into automated monitoring.
  • Are only GC STWs recorded? Would it make sense to record other STW events (read metrics, goroutine profile, heap dump)? (Felix)
    • Rhys: You get ProcStop events
    • Austin: Yeah, you’re right that we trace high level GC STW events.
    • Rhys: Currently the GC traces the “best case” STW, which can be really misleading.
    • Austin: We could definitely have a “stopping the world” and a “world stopped”. Maybe don’t need that for start.
    • Felix: That would be great. We’re investigating rare long STWs right now.
    • Rhys: Starting the world can take a while. Problems with heap lock contention. I would love to have more visibility into the runtime locks.
    • Austin: Runtime locks are a bit of a mess. I also wonder if they should be “scalable”.
    • Rhys: I’d love to discuss that. C&R office hours?
    • Austin: Perfect.
    • Conclusion: Let’s add events for all STWs and also separate “stopping” from “stopped”.
  • Updates on Perfetto UI (Felix and Nick)
    • Add to UI CL: https://go.dev/cl/457716 
    • Felix: The JSON currently produced by the trace tool is basically compatible with Perfetto. Doesn’t let us open really large traces without splitting, which was one of the hopes. And it takes a while to load. I was able to use the command line tools to convert a 280MB trace into a 9.8GB JSON trace and load that in Perfetto, but it took 20 minutes. Nick has been working on outputting proto directly, which will hopefully produce less data than JSON.
    • Rhys: When I tried this a while ago, the connection of data flow wasn’t quite right.
    • Felix: This CL doesn’t fix that. I’m hoping it’s an upstream issue, which they’re pretty responsive to. I’m hoping protobuf will just make it go away, since that’s their canonical input.
    • Nick: Stack traces seem to be missing from protobuf, which we definitely want. We might need upstream changes to support that.
    • Felix: I suspect there may be some long tail of issues. But the initial plan would be to keep both viewers until we feel this is solid.
    • Austin: How does the streaming work?
    • Felix: They have an in-memory column store with a SQL interface on top of it. Large traces would still be a problem because they’d need to be loaded fully into memory.
    • Austin: In principle we could swap out that column store for our own streaming thing, but that sounds like a significant amount of work.
    • Felix: On Go Time someone said they only use runtime trace when they’re really desperate and then they can’t figure it out anyway. Most people don’t think about their program from the perspective of the scheduler. I’d like to have different pivoting, like one timeline per G (or M). We sort of have that in the goroutine analysis, but that only shows on-CPU time. Dominik did that in gotraceui.
  • Updates on pprof labels (Nick)
    • Nick: In MK’s recent comments on pprof labels CL, he wondered about a size limit on labels being recorded in the trace. Thinking about trace overhead. Users can also add arbitrary logs (limited by trace buffer size). My thought is that users are deciding to make these as big or as small as they want.
    • Austin: My reaction is “do what the user said”
    • Rhys: It seems like we already don’t have a limit on the pprof labels (number/length/etc) and maybe it would have been good to have a limit, but we already don’t.
    • Bryan: For me it’s more important to be able to find out how much damage you’re doing with this data. Inevitably people want one more byte than the limit and will be frustrated.
    • Felix: Two sides to this problem: how to get the data in the trace while keeping overhead low, and the other is keeping the memory usage low for keeping all these labels. For trace overhead, I’m thinking we want two or three levels of filtering: filter what events, filter events by properties (e.g., duration). JFR supports both of these. And potentially a way to modify events (maybe too far), like truncation. At some point you can almost guarantee fixed-cost tracing. E.g., turn off everything except profile events; now you have timestamps on profile events without all the other overhead.
    • Austin: MK and I have definitely been thinking in that direction. The current trace viewer is almost purpose-built for analyzing the scheduler and needs to understand how a lot of events relate. But if we open up reading traces, the trace viewer becomes just another tool and maybe it’s fine for it to say “I need these events” (kind of like “perf sched” or similar).
    • Felix: I can ask my Java colleagues about how this works in JFR.
    • Rhys: Curious how you’re thinking about filtering.
    • Felix: Simpler is better. You could imagine a callback, but that’s not simple. Probably something like runtime/metrics where you can discover the events and select.
    • Rhys: Definitely need a header saying which events are included.
    • Felix: Agreed. Also nice for viewers so they don’t have to hard-code all of the events.
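
The runtime/metrics shape Felix mentions is discover-then-select-then-read; here is a minimal sketch of that existing API. A trace-event selection API modeled on it is only an idea from the discussion above, not something that exists.

```go
// A minimal sketch of the existing runtime/metrics discover-then-read
// pattern; a trace-event selection API shaped like this is only an idea
// from the discussion above, not something that exists.
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Discover every metric the runtime exposes...
	descs := metrics.All()

	// ...then select (here: all of them) and read the chosen subset.
	samples := make([]metrics.Sample, len(descs))
	for i, d := range descs {
		samples[i].Name = d.Name
	}
	metrics.Read(samples)

	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		}
	}
}
```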

@mknyszek
Contributor Author

mknyszek commented Jan 19, 2023

2023-01-19 Sync

  • Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole
  • Felix: gentraceback iterator refactoring
    • Felix: What's the progress?
    • Austin: Made progress. Running into issues with write barriers and trying to knock down all the write barriers one by one. Big open question of testing; so many horrible corner cases. No good answers.
    • Felix: Tried to do it incrementally instead of all at once; also painful. RE: testing, would it be useful to have the ability to instrument a PC and do a traceback from there?
    • Austin: That would help. The worst parts are weird though, like signals. If we had a good way to inject a symbol, like a breakpoint, that would help a lot.
      • Idea: could use hardware breakpoints via perf_event_open (Linux only, but at least architecture-independent), which could get enough coverage for Austin to be happy.
      • Could potentially synthesize other signal tests from a single signal.
    • Felix: I'll give it a shot.
    • Michael K: What work could we do in parallel?
    • Felix: Could write a frame pointer unwinder separately for tracing just to get an idea of the overhead.
      • Austin: +1. Tricky things include logic in gentraceback for filtering out frames. Maybe it doesn't matter for the trace viewer (i.e. don't filter). Also inline unwinding. Trying to totally separate inline unwinding in gentraceback. Once it's its own separate thing, it'd be straightforward to plumb that into a frame pointer unwinder.
      • Michael K: Could we skip inline unwinding for the experiment?
      • Austin: Yeah.
      • Michael P: +1 to separating out inline unwinding. Already "runtime_expandFinalInlineFrame" in the runtime which is a good reference point for this.
      • Felix: Also all the complexity with cgo traceback, but we should just ignore that for the experiment.
      • Michael K: The cgo traceback tests are also really flaky, and if we could have better testing around that that would be great.
  • Felix: Perfetto UI blues … (timeline bug, link bug, stack traces, large traces, small screens, protocol buffer format) … gotraceui w/ wasm? Having an online tool with independent release cycle is tempting?
    • CL out that makes Perfetto work. Limitations:
      • Limited for very large traces as-is.
      • Doesn't seem easy to make it work as well as go tool trace (bugs). e.g. timelines not named correctly. Events not connected correctly.
        • Harder: getting stack traces to show up. Nick has tried to make it work. Protobuf format doesn't have an obvious stack trace format?
        • Nick: Not a one-to-one mapping between Catapult format and Perfetto. Can stick a single location in the Perfetto format, but not a full stack trace. Little things in the protobuf format that aren't well-documented. e.g. string interning only works if you include a number in the header.
        • Michael K: MP and I looked into this. Perfetto knows how to do this for some traces, but it’s built into a C++ library, so we’d have to rewrite that in Go or call into it from Go. I’m not sure it even has strong backwards compatibility.
        • Michael P: There is the Perfetto tool that runs the RPC server. (trace_processor.) That loads into a SQLite in-memory DB, but does do better than the fully in-browser implementation. It can do bigger traces, though is still limited. That seems like enough of an improvement to me.
        • Felix: I have a 280MB trace that gets split into 90 parts for 15 seconds on a busy server. Maybe we should start with deciding what size trace we want to have a good experience for.
        • Michael K: I think 280MB is a big trace, though it’s only 15 seconds. I think we should be targeting bigger than that. It’s easy to get a 1GB trace. But we can start with Perfetto as long as it’s better and work toward that.
        • Austin: Is that better with Perfetto?
        • Felix: I think it would be better. Maybe 5x better, so a second at a time (don’t quote me on that).
        • Michael P: The trace_processor is better, but still limited by the in-memory SQLite DB. Presumably that could be on disk. I don’t know if the trace loading is also linear in the trace size.
        • Rhys: What do you even do with an execution trace that large? How do you get value out of that?
        • Felix: This trace was from a colleague from an instance that was struggling with pauses. It looked like a straggling procstop. It was debugging the behavior of a whole application that was behaving poorly.
        • Rhys: So you were looking for behavior that was pretty zoomed-out.
        • Felix: Yeah.
        • Michael K: Part of the problem with existing traces is the usability of this. I think it’s a valid question about whether big traces are all that useful. Sometimes you’re not even really sure what you’re looking for. Say I wanted to run a full trace on every invocation of the compiler. You don’t necessarily know what you’re looking for to improve compiler speed.
        • Austin: I bet if you were to profile the space of large trace file, the vast majority of that would not be useful to you looking at it at a high level. Suggests a solution here for filtering is to just reduce what goes into the trace.
        • 280MB Trace Size Breakdown
        • Michael K: Maybe just proc start/proc stop for what Felix was describing.
        • Rhys: But once you find the problem, you want more detail. It's hard to catch the end of a garbage collection cycle because of the rules of starting a trace during a GC cycle.
        • Michael K: Fixing the mark phase issue should be easier than before.
        • Austin: Awesome breakdown!
      • User group said "please don't do this" because Perfetto isn't nice to small screens.
      • Felix: gotraceui
        • Viewing timelines for goroutines is great.
        • Would like Dominik to talk about gotraceui some more.
        • I want to be intentional about choosing Perfetto.
        • Michael K: I think the dependency on gio was a concern.
        • Dominik: Gio (the UI library I use) supports wasm, so it should be fairly straightforward to have gotraceui run in the browser if we want to go down that road.
        • Dominik: I still rely on loading entire traces into memory (but using significantly less memory than current go tool trace), but with the upcoming format changes, streaming data might be possible.  We currently load everything into memory because when the user zooms out far enough, we need all events to compute what we display. But we could probably precompute these zoom levels, similar to mipmaps.
        • Dominik: For the current trace format, gotraceui needs roughly 30x the size of the trace in memory, so a 300 MB trace needs 9 GB.
        • Michael K: I have been thinking about an HTML UI that does something like Google Maps tiles to scale. We could skip a lot of work if we could take gotraceui as the UI, but port it into something more portable than Gio. OTOH, it’s even more work to build something from scratch.
        • Dominik: WRT gotraceui's use of Gio, there'll be pretty rich UI, and I don't fancy writing UIs in HTML/JS. But all of the processing of trace data could live externally
        • Michael P: It’s not necessarily a hard requirement that the Go project itself ship a trace viewer. We have to now because there’s no API. But if we shipped an API, it wouldn’t be a hard requirement. Much like we don’t ship a debugger.
        • Michael K: One option is that we ignore the UI situation entirely and build something that you can parse separately and ship something really bare later. In the meantime, point at a little tool that will shove it into trace_processor and point people at Perfetto. For a brief time, stop shipping our own. It’s very convenient that you only need a Go installation to view these traces, but I think you’re right that we could stop shipping a UI. We could also keep the existing UI working/limping while we do other things in parallel.
        • Felix: Is Dominik looking for contributors? (That comes with its own overheads)
        • Dominik: I'm usually not big on contributions in the form of code; but ideas and feedback are hugely appreciated
        • Michael K: We don’t have to make a decision on using Perfetto now. Maybe we should plug along for two more weeks (with Perfetto) and figure out if we can fix the issues without too much effort, and then make a hard decision on what to do at the next meeting.
        • 👍
  • Felix: traceutils anonymize & breakdown and ideas: (flamescope, graphviz, tracer overhead)
    • Implemented anonymization of traces. Breakdowns, too.
    • Tracer overhead tool that uses profile samples in the trace to identify overheads.
  • Felix: Format: Consistent message framing, remove varint padding for stacks
    • 4 different cases for how an event can be laid out.
    • Maybe a way to skip messages and layouts it doesn't understand.
    • Austin: self-descriptive header giving lengths for each opcode
    • Michael K: Any state in the trace makes things hard to push it up into OTel, since that’s completely stateless.
    • Felix: We’re actually trying to do two things in OTel. Including binary data blobs, like pprof and JFRs. And something to send stateful things like stack traces, etc, where you can refer back to them efficiently.
    • David: For trace I wouldn’t expect a stateful protocol to be introduced any time soon. But for profiling it may be a possibility.
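
To make the self-descriptive-header idea concrete, here is a purely hypothetical sketch (none of these types exist in the trace format or any proposal). It uses a fixed payload length per opcode only to keep the sketch short; the discussion above also covers variable-length payloads and stacks.

```go
// Purely hypothetical sketch of the "self-descriptive header" idea: the
// header maps each opcode to a payload length so a reader can skip opcodes
// it does not understand. Real events would likely be variable-length;
// fixed lengths are used here only to keep the sketch short.
package traceformat

import (
	"fmt"
	"io"
)

// Header declares, for every opcode the writer may emit, how many bytes of
// payload follow it.
type Header struct {
	PayloadLen map[byte]uint32
}

// ReadEvent reads one opcode and its payload. A reader that does not know an
// opcode's semantics can still consume (and ignore) it safely, because its
// length is declared up front.
func ReadEvent(r io.Reader, h Header) (op byte, payload []byte, err error) {
	var b [1]byte
	if _, err = io.ReadFull(r, b[:]); err != nil {
		return 0, nil, err
	}
	n, ok := h.PayloadLen[b[0]]
	if !ok {
		return 0, nil, fmt.Errorf("opcode %d not declared in header", b[0])
	}
	payload = make([]byte, n)
	if _, err = io.ReadFull(r, payload); err != nil {
		return 0, nil, err
	}
	return b[0], payload, nil
}
```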
