Backport of operator debug: fix pprof interval handling into release/1.5.x #20214

hc-github-team-nomad-core

Backport

This PR is auto-generated from #20206 to be assessed for backporting due to the inclusion of the label backport/1.5.x.

The below text is copied from the body of the original PR.


The nomad operator debug command saves a CPU profile for each interval, and names these files based on the interval.

The same function takes a goroutine profile, heap profile, etc., but is missing the logic to interpolate the interval into the file name. As a result, the operator debug command makes potentially many expensive profile requests and then overwrites the data from each one. Update the command to save every profile it scrapes, and number them similarly to the existing CPU profile.
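
As a rough illustration of the naming scheme, the sketch below builds file names in the same shape as the ones in the file tree further down (for example heap_0002.prof or goroutine-debug1_0000.txt). The helper name profileFileName and its parameters are hypothetical and are not taken from the Nomad source; only the resulting name format comes from this PR.

```go
package main

import "fmt"

// profileFileName is a hypothetical helper (not Nomad's actual function).
// It appends a zero-padded, four-digit interval number to the profile name
// so that each capture gets its own file instead of overwriting the last one.
func profileFileName(name string, interval int, ext string) string {
	return fmt.Sprintf("%s_%04d.%s", name, interval, ext)
}

func main() {
	// Four captures of the heap profile produce four distinct files.
	for i := 0; i < 4; i++ {
		fmt.Println(profileFileName("heap", i, "prof"))
	}
	// Text-format goroutine dumps follow the same pattern with a .txt suffix.
	fmt.Println(profileFileName("goroutine-debug1", 0, "txt"))
}
```

Without the interval in the name, every capture of a given profile type would map to the same path, which is how the earlier captures were lost.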

Additionally, the command flags for -pprof-interval and -pprof-duration were validated backwards, so -pprof-interval was always coerced to equal -pprof-duration. The result was that only a single profile was ever taken, at the start of the bundle. Correct the check and change the defaults to be more sensible.

Fixes: #20151


In addition to fixing up the tests as needed, I've tested this locally as follows.

$ nomad operator debug -duration 1m -stale=true -node-id=61c19030 -log-level=trace -pprof-interval=15s
Starting debugger...

Nomad CLI Version: Nomad v1.7.7-dev
BuildDate 2024-03-22T19:40:43Z
Revision f91127fc5492d930dc61af54fd4f1f2a6f01f109+CHANGES
           Region:
        Namespace:
          Servers: (1/1) [continuity.global]
          Clients: (1/1) [61c19030-0ba7-6927-d9b9-a5b9df52f4e4]
         Interval: 30s
         Duration: 1m
   pprof Interval: 15s

Capturing cluster data...
Consul - Skipping, no API address found
    Capture pprofInterval 0000
    Capture interval 0000
    Capture pprofInterval 0001
    Capture interval 0001
    Capture pprofInterval 0002
    Capture pprofInterval 0003
Created debug archive: nomad-debug-2024-03-22-200116Z.tar.gz

This results in the following file tree:

$ tar -xf nomad-debug-2024-03-22-200116Z.tar.gz
$ tree nomad-debug-2024-03-22-200116Z
nomad-debug-2024-03-22-200116Z
├── client
│   └── 61c19030-0ba7-6927-d9b9-a5b9df52f4e4
│       ├── agent-host.json
│       ├── allocs_0000.prof
│       ├── allocs_0001.prof
│       ├── allocs_0002.prof
│       ├── allocs_0003.prof
│       ├── goroutine_0000.prof
│       ├── goroutine_0001.prof
│       ├── goroutine_0002.prof
│       ├── goroutine_0003.prof
│       ├── goroutine-debug1_0000.txt
│       ├── goroutine-debug1_0001.txt
│       ├── goroutine-debug1_0002.txt
│       ├── goroutine-debug1_0003.txt
│       ├── goroutine-debug2_0000.txt
│       ├── goroutine-debug2_0001.txt
│       ├── goroutine-debug2_0002.txt
│       ├── goroutine-debug2_0003.txt
│       ├── heap_0000.prof
│       ├── heap_0001.prof
│       ├── heap_0002.prof
│       ├── heap_0003.prof
│       ├── monitor.log
│       ├── profile_0000.prof
│       ├── profile_0001.prof
│       ├── profile_0002.prof
│       ├── profile_0003.prof
│       ├── threadcreate_0000.prof
│       ├── threadcreate_0001.prof
│       ├── threadcreate_0002.prof
│       ├── threadcreate_0003.prof
│       ├── trace_0000.prof
│       ├── trace_0001.prof
│       ├── trace_0002.prof
│       └── trace_0003.prof
├── cluster
│   ├── agent-self.json
│   ├── cli-flags.json
│   ├── eventstream.json
│   ├── leader.json
│   ├── members.json
│   ├── namespaces.json
│   ├── nodes.json
│   └── regions.json
├── index.html
├── index.json
├── interval
│   ├── 0000
│   │   ├── allocations.json
│   │   ├── csi-plugins.json
│   │   ├── csi-volumes.json
│   │   ├── deployments.json
│   │   ├── evaluations.json
│   │   ├── jobs.json
│   │   ├── license.json
│   │   ├── metrics.json
│   │   ├── nodes.json
│   │   ├── operator-autopilot-health.json
│   │   ├── operator-raft.json
│   │   └── operator-scheduler.json
│   └── 0001
│       ├── allocations.json
│       ├── csi-plugins.json
│       ├── csi-volumes.json
│       ├── deployments.json
│       ├── evaluations.json
│       ├── jobs.json
│       ├── license.json
│       ├── metrics.json
│       ├── nodes.json
│       ├── operator-autopilot-health.json
│       ├── operator-raft.json
│       └── operator-scheduler.json
└── server
    └── continuity.global
        ├── agent-host.json
        ├── allocs_0000.prof
        ├── allocs_0001.prof
        ├── allocs_0002.prof
        ├── allocs_0003.prof
        ├── goroutine_0000.prof
        ├── goroutine_0001.prof
        ├── goroutine_0002.prof
        ├── goroutine_0003.prof
        ├── goroutine-debug1_0000.txt
        ├── goroutine-debug1_0001.txt
        ├── goroutine-debug1_0002.txt
        ├── goroutine-debug1_0003.txt
        ├── goroutine-debug2_0000.txt
        ├── goroutine-debug2_0001.txt
        ├── goroutine-debug2_0002.txt
        ├── goroutine-debug2_0003.txt
        ├── heap_0000.prof
        ├── heap_0001.prof
        ├── heap_0002.prof
        ├── heap_0003.prof
        ├── monitor.log
        ├── profile_0000.prof
        ├── profile_0001.prof
        ├── profile_0002.prof
        ├── profile_0003.prof
        ├── threadcreate_0000.prof
        ├── threadcreate_0001.prof
        ├── threadcreate_0002.prof
        ├── threadcreate_0003.prof
        ├── trace_0000.prof
        ├── trace_0001.prof
        ├── trace_0002.prof
        └── trace_0003.prof

8 directories, 102 files
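
As a sanity check on the numbers above, the following sketch reproduces the capture counts and the 102-file total from the flags used in the test run. It assumes that one capture is taken at the start of each interval and none at the very end of the run, which matches the log output (four pprof captures for 1m at 15s, two interval captures for 1m at 30s) but is inferred from that output rather than taken from the Nomad source. The function names and constants are hypothetical; the per-directory counts are read off the tree.

```go
package main

import (
	"fmt"
	"time"
)

// captureCount estimates how many captures fit in a run, assuming one
// capture at the start of each interval (an assumption based on the log
// output above, not on Nomad's implementation).
func captureCount(duration, interval time.Duration) int {
	if duration <= 0 || interval <= 0 {
		return 0
	}
	return int(duration / interval)
}

// expectedFiles totals the files in the archive using counts read from the
// tree above: 8 profile kinds per agent, 2 agents (one client, one server),
// 2 extra files per agent (agent-host.json, monitor.log), 8 cluster files,
// 2 index files, and 12 JSON files per interval directory.
func expectedFiles(pprofCaptures, intervalCaptures int) int {
	const (
		profileKinds     = 8
		agents           = 2
		extrasPerAgent   = 2
		clusterFiles     = 8
		indexFiles       = 2
		filesPerInterval = 12
	)
	perAgent := profileKinds*pprofCaptures + extrasPerAgent
	return agents*perAgent + clusterFiles + indexFiles + intervalCaptures*filesPerInterval
}

func main() {
	duration := time.Minute
	pprofCaptures := captureCount(duration, 15*time.Second)    // -pprof-interval=15s
	intervalCaptures := captureCount(duration, 30*time.Second) // Interval: 30s
	fmt.Println(pprofCaptures, intervalCaptures, expectedFiles(pprofCaptures, intervalCaptures))
	// prints "4 2 102"
}
```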

@tgross tgross merged commit c6cbcbe into release/1.5.x Mar 25, 2024
23 of 27 checks passed
@tgross tgross deleted the backport/b-operator-debug-interval/especially-rare-caiman branch March 25, 2024 13:24