debug: improvements to the debug command #10320

Open
5 of 6 tasks
dnephin opened this issue May 31, 2021 · 2 comments · Fixed by #10279
Labels
theme/reliability theme/telemetry Anything related to telemetry or observability type/enhancement Proposed improvement or new feature

Comments

@dnephin
Contributor

dnephin commented May 31, 2021

This issue is to track a number of improvements to the data produced by consul debug. These improvements should make it easier to get useful data from consul debug.

  • only capture a single CPU profile and a single trace for the entire duration, instead of a separate one for each interval. A single CPU profile and trace should contain all the same data, and are easier to consume than 4 or 5 separate profiles.
  • capture a single delta heap profile and a single goroutine profile, instead of a separate one for each interval
  • (debug: use the new metrics stream in debug command #10399) change the metrics endpoint to return a stream of metrics each time the window ends, instead of having to poll for metrics, which results in most metrics being missed.
  • only capture logs once, instead of once per interval
  • add tests for the log capture to show that it properly captures all logs
  • rename cluster.json to members.json to better match the name used by the cli (consul members)
@dnephin dnephin added type/enhancement Proposed improvement or new feature theme/telemetry Anything related to telemetry or observability theme/reliability labels May 31, 2021
@dhiaayachi dhiaayachi linked a pull request Jun 2, 2021 that will close this issue
@banks
Member

banks commented Jun 3, 2021

> capture a single delta heap, and goroutine profile, instead of a separate one for each interval

I'm interested in the thought process behind this one, Daniel. In some cases, seeing a few different point-in-time snapshots of these can help, e.g. to see which object allocations are growing fastest from one time to the next, or how the number of goroutines in a specific location is evolving over time. At any rate, they are cheap, so it seems like we don't lose much by having them there, but maybe I'm missing some benefit of removing them?

The rest of these sound like solid improvements!

@dnephin
Contributor Author

dnephin commented Jun 3, 2021

In general, my thinking is that having multiple copies of the same profile is a distraction and makes it harder to use the debug data. As much as possible we should attempt to move to a debug archive that has a single file of each type. I believe this proposal, once complete, would accomplish this goal. More details on the specifics below.

The heap profile includes 4 views:

  -inuse_space      Display in-use memory size
  -inuse_objects    Display in-use object counts
  -alloc_space      Display allocated memory size
  -alloc_objects    Display allocated object counts

My understanding is that by switching between the alloc and inuse views you can find most problems without needing additional profiles. If we need additional data points, capturing them with more time in between (e.g. 10 or 30 minutes) would probably provide a clearer signal. So even when we do need multiple snapshots, capturing separate debug dumps would be a better option than multiple snapshots that are only 30 seconds apart.
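
To illustrate that one heap capture carries both the inuse and alloc numbers, here is a small sketch using `runtime/pprof` in-process. `heapSummary` is a hypothetical helper; it writes the heap profile in the legacy text format (`debug=1`) and returns the header line, which reports sampled in-use objects and bytes followed by allocated objects and bytes in brackets.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"runtime/pprof"
)

// heapSummary (hypothetical helper) captures one heap profile in the legacy
// text format and returns its first line, which has the shape:
//   heap profile: <inuse objects>: <inuse bytes> [<alloc objects>: <alloc bytes>] @ heap/<rate>
func heapSummary() (string, error) {
	var buf bytes.Buffer
	// debug=1 selects the human-readable text format instead of protobuf.
	if err := pprof.Lookup("heap").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	sc := bufio.NewScanner(&buf)
	if !sc.Scan() {
		return "", fmt.Errorf("empty heap profile")
	}
	return sc.Text(), nil
}

func main() {
	line, err := heapSummary()
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Println(line)
}
```

The same four values are what the `-inuse_*` and `-alloc_*` options select between when a single binary-format profile is opened in `go tool pprof`.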

For the goroutine profile I'm not sure what we expect to see in the different snapshots. If we are worried about missing something in a single snapshot, then another option is to capture multiple snapshots, and then merge them into a single one in the archive.


2 participants