operator debug: fix pprof interval handling #20206

Merged
merged 2 commits into from Mar 25, 2024

@tgross tgross commented Mar 22, 2024

The nomad operator debug command saves a CPU profile for each interval, and names these files based on the interval.

The same function also takes a goroutine profile, heap profile, etc., but is missing the logic to interpolate the interval into the file name. As a result, the operator debug command makes potentially many expensive profile requests and then overwrites the data each time. Update the command to save every profile it scrapes, and number them the same way as the existing CPU profile.
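As a minimal sketch of the naming scheme described above (not the actual Nomad implementation; `profileFileName` is a hypothetical helper invented for illustration), each profile name can carry a zero-padded interval index so that successive captures land in separate files:

```go
package main

import "fmt"

// profileFileName is a hypothetical helper, not a Nomad function. It appends
// a zero-padded interval index to a profile name so that each capture gets
// its own file instead of overwriting the previous one.
func profileFileName(name string, interval int, ext string) string {
	return fmt.Sprintf("%s_%04d.%s", name, interval, ext)
}

func main() {
	// Print the file names for four consecutive captures of two profiles.
	for i := 0; i < 4; i++ {
		fmt.Println(profileFileName("goroutine", i, "prof"))
		fmt.Println(profileFileName("goroutine-debug1", i, "txt"))
	}
}
```

The four-digit padding keeps the files sorted in capture order when listed alphabetically.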

Additionally, the command flags for -pprof-interval and -pprof-duration were validated backwards, so we always coerced -pprof-interval to be the same as -pprof-duration. This always resulted in a single profile being taken at the start of the bundle. Correct the check and change the defaults to be more sensible.

Fixes: #20151


In addition to fixing up the tests as needed, I've tested this locally as follows.

$ nomad operator debug -duration 1m -stale=true -node-id=61c19030  -log-level=trace -pprof-interval=15s
Starting debugger...

Nomad CLI Version: Nomad v1.7.7-dev
BuildDate 2024-03-22T19:40:43Z
Revision f91127fc5492d930dc61af54fd4f1f2a6f01f109+CHANGES
           Region:
        Namespace:
          Servers: (1/1) [continuity.global]
          Clients: (1/1) [61c19030-0ba7-6927-d9b9-a5b9df52f4e4]
         Interval: 30s
         Duration: 1m
   pprof Interval: 15s

Capturing cluster data...
Consul - Skipping, no API address found
    Capture pprofInterval 0000
    Capture interval 0000
    Capture pprofInterval 0001
    Capture interval 0001
    Capture pprofInterval 0002
    Capture pprofInterval 0003
Created debug archive: nomad-debug-2024-03-22-200116Z.tar.gz

This results in the following file tree:

$ tar -xf nomad-debug-2024-03-22-200116Z.tar.gz
$ tree nomad-debug-2024-03-22-200116Z
nomad-debug-2024-03-22-200116Z
├── client
│   └── 61c19030-0ba7-6927-d9b9-a5b9df52f4e4
│       ├── agent-host.json
│       ├── allocs_0000.prof
│       ├── allocs_0001.prof
│       ├── allocs_0002.prof
│       ├── allocs_0003.prof
│       ├── goroutine_0000.prof
│       ├── goroutine_0001.prof
│       ├── goroutine_0002.prof
│       ├── goroutine_0003.prof
│       ├── goroutine-debug1_0000.txt
│       ├── goroutine-debug1_0001.txt
│       ├── goroutine-debug1_0002.txt
│       ├── goroutine-debug1_0003.txt
│       ├── goroutine-debug2_0000.txt
│       ├── goroutine-debug2_0001.txt
│       ├── goroutine-debug2_0002.txt
│       ├── goroutine-debug2_0003.txt
│       ├── heap_0000.prof
│       ├── heap_0001.prof
│       ├── heap_0002.prof
│       ├── heap_0003.prof
│       ├── monitor.log
│       ├── profile_0000.prof
│       ├── profile_0001.prof
│       ├── profile_0002.prof
│       ├── profile_0003.prof
│       ├── threadcreate_0000.prof
│       ├── threadcreate_0001.prof
│       ├── threadcreate_0002.prof
│       ├── threadcreate_0003.prof
│       ├── trace_0000.prof
│       ├── trace_0001.prof
│       ├── trace_0002.prof
│       └── trace_0003.prof
├── cluster
│   ├── agent-self.json
│   ├── cli-flags.json
│   ├── eventstream.json
│   ├── leader.json
│   ├── members.json
│   ├── namespaces.json
│   ├── nodes.json
│   └── regions.json
├── index.html
├── index.json
├── interval
│   ├── 0000
│   │   ├── allocations.json
│   │   ├── csi-plugins.json
│   │   ├── csi-volumes.json
│   │   ├── deployments.json
│   │   ├── evaluations.json
│   │   ├── jobs.json
│   │   ├── license.json
│   │   ├── metrics.json
│   │   ├── nodes.json
│   │   ├── operator-autopilot-health.json
│   │   ├── operator-raft.json
│   │   └── operator-scheduler.json
│   └── 0001
│       ├── allocations.json
│       ├── csi-plugins.json
│       ├── csi-volumes.json
│       ├── deployments.json
│       ├── evaluations.json
│       ├── jobs.json
│       ├── license.json
│       ├── metrics.json
│       ├── nodes.json
│       ├── operator-autopilot-health.json
│       ├── operator-raft.json
│       └── operator-scheduler.json
└── server
    └── continuity.global
        ├── agent-host.json
        ├── allocs_0000.prof
        ├── allocs_0001.prof
        ├── allocs_0002.prof
        ├── allocs_0003.prof
        ├── goroutine_0000.prof
        ├── goroutine_0001.prof
        ├── goroutine_0002.prof
        ├── goroutine_0003.prof
        ├── goroutine-debug1_0000.txt
        ├── goroutine-debug1_0001.txt
        ├── goroutine-debug1_0002.txt
        ├── goroutine-debug1_0003.txt
        ├── goroutine-debug2_0000.txt
        ├── goroutine-debug2_0001.txt
        ├── goroutine-debug2_0002.txt
        ├── goroutine-debug2_0003.txt
        ├── heap_0000.prof
        ├── heap_0001.prof
        ├── heap_0002.prof
        ├── heap_0003.prof
        ├── monitor.log
        ├── profile_0000.prof
        ├── profile_0001.prof
        ├── profile_0002.prof
        ├── profile_0003.prof
        ├── threadcreate_0000.prof
        ├── threadcreate_0001.prof
        ├── threadcreate_0002.prof
        ├── threadcreate_0003.prof
        ├── trace_0000.prof
        ├── trace_0001.prof
        ├── trace_0002.prof
        └── trace_0003.prof

8 directories, 102 files

@tgross tgross added this to the 1.7.x milestone Mar 22, 2024
@tgross tgross added the backport/1.5.x, backport/1.6.x, and backport/1.7.x labels Mar 22, 2024

@jrasell jrasell left a comment

LGTM.

Do we also want a changelog entry for this?

@tgross tgross merged commit 02d98b9 into main Mar 25, 2024
21 checks passed
@tgross tgross deleted the b-operator-debug-interval branch March 25, 2024 13:01
philrenaud pushed a commit that referenced this pull request Apr 18, 2024
Labels
backport/1.5.x, backport/1.6.x, backport/1.7.x, theme/cli, type/bug
Development

Successfully merging this pull request may close these issues.

operator debug throws away all but last node profile
2 participants