proposal: runtime/trace: improve Go diagnostics / tracing #29104
Comments
This may need to be separated into multiple issues, but I didn't see an easy way to group them other than referencing the "parent" issue in the text?
Thank you for this report/issue @robaho! Kindly paging @hyangah @rakyll @dvyukov @aclements
Issue 4 is incorrect. When you trace into a goroutine there are entries for each routine (graphs, etc.). The problem is that things are too low-level. It could be helped by libraries and applications putting user events everywhere, but even then the analysis is manual. I am interested in developing something more macro that understands the event dependencies and provides an easier-to-understand performance profile.
The tracer is a work in progress. Progress has been slow recently. It's not clear when we will get to runtime/trace improvements. We do already record the place where a goroutine was created (it shows up in panic crash output, for example). If that can be added to the trace format easily, that seems worthwhile. The larger issues around good summaries and digest presentations are difficult.
@rsc if the trace parsing facilities were moved out of an internal package and made public, then users could develop more advanced analysis tools. Right now you need to copy and paste the code, which is brittle as things advance. At a minimum this could be addressed pretty easily. This is especially important for the user events (annotations), as custom analysis is almost certainly required.
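For context, the user events (annotations) referred to here come from the task/region/log API in runtime/trace. A minimal sketch follows; the task, region, and category names are illustrative, and a trace must already be running (for example via trace.Start or the net/http/pprof trace endpoint):

```go
package orders

import (
	"context"
	"runtime/trace"
)

// handleOrder sketches the runtime/trace annotation API; the names are illustrative.
func handleOrder(ctx context.Context, raw []byte) {
	// A task groups all events that belong to one logical operation.
	ctx, task := trace.NewTask(ctx, "handleOrder")
	defer task.End()

	// A region marks a span of time within the current goroutine.
	trace.WithRegion(ctx, "decode", func() {
		// ... decode raw into an order ...
	})

	// A log message attaches a user event to the task.
	trace.Log(ctx, "orderID", "example-id")
}
```

Going beyond the built-in `go tool trace` views of these events currently means parsing the trace file with a copy of the internal package, which is the brittleness described above.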
Overview
Go offers extensive diagnostic capabilities out of the box, but they are of limited usefulness in a larger, highly concurrent, real-world application utilizing 100+ goroutines.
For the purposes of this proposal, I am using a real-world application. This application uses gRPC and has a single RPC stream active. It also supports the FIX financial protocol via standard TCP sockets (no active connections), and it emits multicast packets as well. It has a web interface and a WebSocket interface (both with no active connections), so there are a lot of network endpoints in use. Although it is a real-world application, it is very small compared to most enterprise applications.
The following screenshot shows a profile and the OS CPU utilization - notice the very heavy system CPU usage:
A tree sample provides a bit more context:
Here is the OS total CPU % breakdown:
Here is the 'web' view for the profile:
Here is a GODEBUG sampling with scheddetail=1 for the same process.
When using the trace tool, the flow is a bit easier to spot, as the 'goroutine analysis' view shows:
And going into the detail for the startMarketData function (exhibit A), we see:
and examining the sync block time, we see:
but examining the scheduler wait time, we see:
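For reference, captures like the ones examined above can be produced with runtime/trace and then opened with `go tool trace`; a minimal sketch (the output file name is assumed):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out") // assumed output file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Record an execution trace for the lifetime of the workload.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the workload to be analyzed ...
}
```

Running `go tool trace trace.out` then serves the 'goroutine analysis' and trace viewer pages referred to in this section.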
Dilemma
Although the profile clearly shows lots of OS activity, and the OS monitor shows heavy CPU utilization, there is no easy way to analyze what is causing the heavy system CPU%. The biggest problem is that the profile is NOT partitioned by goroutine. Using the trace tool, things are better, but there is no aggregate view. Ideally the detail view (exhibit A above) would be shown on the 'analysis page' for all goroutines - currently the only way to get this information is to inspect each routine individually. Something like this (filed as issue #29103):
But more importantly, there needs to be a summary analysis of the "why", rather than a graph to review. For example, if I open the detail for the startMarketData routine above, it should show something like (sample only):
The user can also use the 'trace viewer', which is 100% manual - it is very powerful, but it is difficult to see the forest for the trees.
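A partial workaround for the lack of per-goroutine partitioning in CPU profiles today is profiler labels (runtime/pprof.Do), which tag samples so pprof can group or filter by them. This is only a sketch; the label key/value and the worker name are illustrative:

```go
package main

import (
	"context"
	"runtime/pprof"
)

// runMarketData is an illustrative worker; the label key and value are assumptions.
func runMarketData(ctx context.Context) {
	pprof.Do(ctx, pprof.Labels("worker", "startMarketData"), func(ctx context.Context) {
		// CPU samples taken while this function runs carry the label,
		// so `go tool pprof` can group or filter on it (e.g. -tagfocus).
		// ... consume market data ...
	})
}
```

This requires instrumenting every goroutine by hand, though, and it does not address the system-CPU attribution or the summary analysis asked for above.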
Specific Proposals
1. Goroutines should be nameable by the application or library that creates them. Working with the current naming is more difficult than it needs to be - you need to be an expert in all of the codebases of all of the libraries to even have a basic understanding of what is going on. A name would allow the library creator to provide significant information as to the routine's purpose. This name should be appended to the existing "name" in the diagnostic tools.
2. In lieu of 1, at a minimum, the goroutine name as provided now should be changed to include the source file and line number. Often a single function spawns multiple goroutines, and the idiomatic way is anonymous functions, so you end up with func1, func2, etc. The source file and line would make tracking these down much easier. This would also help when the program does not provide a name as in 1.
3. The trace output should have an option for per-goroutine stack sampling, which would place a 'stack trace event' in the trace file at the sampling interval. It may be possible to sync a profile capture with a trace capture based on timestamps, but having everything in a single capture file simplifies a lot.
4. There should be trace reporting by goroutine instance. Currently with the trace, all of the routines of a single type are lumped together. This does not allow easy detection of processing bottlenecks. For example, if there are 4 routines consuming 15%, is it 1 routine consuming 15%, or 3 consuming 1% each and 1 consuming 12%?
5. Most importantly, there is no public API for reading the trace file to produce custom analysis reports, even though there is an API for adding 'user events' to the trace. The internal/trace package should be made public. Currently, some of this information can be obtained as JSON from the 'go tool trace' web server, but that is undocumented and probably too inefficient for certain analysis needs. The current workaround involves copying and pasting the internal package (a sketch of what a public API might look like follows this list).
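To make proposal 5 concrete, here is a hypothetical sketch of the shape a public trace-parsing API could take. None of these names exist today; they are modeled loosely on what the internal parser provides, and the event fields shown are assumptions:

```go
// Package traceparse is a hypothetical public trace-parsing API (sketch only).
package traceparse

import "io"

// Event is one record decoded from an execution trace.
type Event struct {
	Ts       int64  // timestamp, nanoseconds
	G        uint64 // goroutine ID
	Type     string // e.g. "GoCreate", "GoBlockSync", "UserLog"
	Category string // set for user annotation events
	Message  string // set for user annotation events
}

// Parse decodes an execution trace into events in time order.
func Parse(r io.Reader) ([]Event, error) {
	// Placeholder body; a real implementation would decode the binary trace format.
	panic("hypothetical sketch, not implemented")
}
```

With something like this available, the per-goroutine-instance and per-annotation reports described above could be built outside the toolchain instead of by copying internal/trace.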