Fix: perf argument ordering bug silently disabling the profiler and losing collected data by artursarlo · Pull Request #69 · pinterest/gprofiler

artursarlo · 2026-04-08T23:03:06Z

Fix: perf argument ordering bug silently disabling the profiler and losing collected data

Summary

A subtle but impactful argument ordering bug in PerfProcess._get_perf_cmd() caused two compounding issues:

Perf event discovery always failed on certain hosts, disabling the perf profiler entirely.
Collected data was silently lost on hosts where discovery accidentally succeeded, because the wrong process was being profiled.

The Bug

PerfProcess._get_perf_cmd() assembled the final perf record command by concatenating argument groups in this order:

[perf_path(), "record", "-F", ..., "-m", ...]
+ extra_args     # may include "--" followed by a command (e.g. "-- sleep 0.5")
+ self._pid_args # contains "-a" (system-wide) or "--pid <pids>"
+ self._cgroup_args

During event discovery, extra_args includes ["--", "sleep", "0.5"] — the -- separator tells perf record that everything after it is the command to trace. This means self._pid_args (which contains -a for system-wide profiling) was appended after the -- separator, effectively being passed as an argument to sleep rather than to perf:

# What was actually executed:
perf record -F 11 -g -o /tmp/perf_default_event.fp --switch-output=15s,signal --switch-max-files=1 -m 129 -- sleep 0.5 -a
#                                                                                                                        ^^
#                                                                              passed to sleep, not perf ────────────────┘

sleep does not accept -a as a valid argument and exits immediately with a non-zero code before any samples can be collected.

Impact

Impact 1: Perf profiler silently disabled (consistent failure)

discover_appropriate_perf_event() runs perf record -- sleep 0.5 for each supported event type (default, cpu-clock, task-clock) and checks whether any samples were collected. Because sleep exited immediately due to the invalid -a flag, perf record also exited with zero samples every time. The discovery function exhausted all events and raised PerfNoSupportedEvent, logging:

[WARNING] gprofiler.profilers.perf: Failed to determine perf event to use.
This is likely due to GPU machine compatibility issues where perf segfaults.
Perf profiler will be disabled. Other profilers will continue.

This was consistently triggered on hosts where the processes_to_profile was None (system-wide mode), causing the entire perf profiler to be permanently disabled for those hosts.

Impact 2: Collected data silently lost (intermittent failure)

In normal profiling runs (not discovery), extra_args does not contain --. However, self._pid_args was still being placed after extra_args, meaning -a or --pid <pids> came after all event flags (-e cycles, --call-graph dwarf,...). While perf tolerates this in many cases, the ordering is incorrect per perf's argument parsing rules, and on some kernel/perf versions this caused profiling samples to be dropped or the wrong set of processes to be targeted — resulting in inconsistent, partial data collection that was invisible to the caller.

Root Cause

The _get_perf_cmd method in PerfProcess concatenated extra_args before self._pid_args:

# BEFORE (buggy)
+ extra_args      # could contain "--" separator
+ self._pid_args  # "-a" or "--pid X" — ended up after "--"

The Fix

Two files were changed:

`gprofiler/utils/perf_process.py` — root fix

Swapped the order so pid_args always comes before extra_args:

# AFTER (fixed)
+ self._pid_args  # "-a" or "--pid X" — always before "--"
+ extra_args      # events and optional "-- <cmd>" for discovery

This guarantees the resulting command is always well-formed:

perf record -F 11 -g -o ... -m 129   -a   -- sleep 0.5
                                      ^^
                              correctly placed before "--"

`gprofiler/utils/perf.py` — cleanup

Removed the now-unnecessary discovery_pids variable and simplified the PerfProcess construction in discover_appropriate_perf_event to always pass processes_to_profile=None (system-wide), relying on the corrected ordering in _get_perf_cmd to place -a correctly.

How It Was Found

The bug was surfaced by a host emitting the "Failed to determine perf event to use" warning. Manual inspection of the perf data files on the host using perf evlist confirmed events were being recorded (cycles:P), but perf script output showed no stack frames — which led to tracing the discovery logic and ultimately uncovering the argument ordering bug.

Files Changed

gprofiler/utils/perf_process.py — swapped extra_args and self._pid_args order in _get_perf_cmd()
gprofiler/utils/perf.py — cleaned up discover_appropriate_perf_event() to always use processes_to_profile=None for discovery

artursarlo · 2026-04-09T23:56:29Z

Test for cgropup based profiling

Configuration Under Test

--perf-use-cgroups
--perf-max-cgroups 10
--perf-max-docker-containers 1   (initial runs: 0/1, final test: 1 + heartbeat aggressive)
--skip-system-profilers-above 600
--profiling-duration 60
--perf-mode fp

Validation Checklist

✅ Check 1: `-a` placed before `--` in discovery command

Expected: perf record ... -a -- sleep 0.5
Observed (23:45:32):

perf record -F 11 -g -o .../perf_default_event.fp --switch-output=15s,signal \
  --switch-max-files=1 -m 129 -a -- sleep 0.5
                             ^^
                      -a correctly before --

Result: ✅ PASS — -a appears before --, sleep receives no spurious arguments.

✅ Check 2: Perf event discovery succeeds

Expected: Perf event PERF_DEFAULT discovery successful, found N samples
Observed (23:45:32–23:45:35):

[DEBUG] gprofiler.utils.perf: Perf event PERF_DEFAULT discovery: parsing output stream
[DEBUG] gprofiler.utils.perf_process: perf (fp mode) run output
        perf_stderr: rounding mmap pages size to 1024K (256 pages)
                     Captured and wrote 1.431 MB ...perf_default_event.fp.<timestamp>
[DEBUG] gprofiler.utils.perf: Perf event PERF_DEFAULT discovery successful, found 13 samples
[INFO]  gprofiler.utils.perf_process: Stopped perf (fp mode) (exit_code=0)
[DEBUG] gprofiler.profilers.perf: Discovered perf event (discovered_perf_event=PERF_DEFAULT)

Result: ✅ PASS — 13 samples collected, PERF_DEFAULT (hardware cycles) selected. No fallback to software events needed.

✅ Check 3: Absence of "Failed to determine perf event to use"

Expected: No warning about failing to determine perf event.
Observed: The warning Failed to determine perf event to use does not appear anywhere in the log during this run.
Result: ✅ PASS — The bug that was previously disabling the profiler is resolved.

✅ Check 4: Cgroup v2 detection and Docker container discovery

Expected: Cgroups discovered and mapped to cgroup v2 paths under /sys/fs/cgroup/system.slice/.
Observed (23:45:37):

[DEBUG] Found cgroup v2 path for container cdfc2f7c...: 
        /sys/fs/cgroup/system.slice/docker-cdfc2f7c....scope (CPU: 5.4%)
[DEBUG] Found cgroup v2 path for container 0a47afe8...: 
        /sys/fs/cgroup/system.slice/docker-0a47afe8....scope (CPU: 4.23%)
[DEBUG] Found cgroup v2 path for container 3ef0bad9...: 
        /sys/fs/cgroup/system.slice/docker-3ef0bad9....scope (CPU: 3.9%)
... (21 containers total, ranked by CPU usage)
[DEBUG] Found 88 cgroups to analyze
[DEBUG] Analyzed 88 cgroups with resource data
[INFO]  Using individual Docker container profiling: 21 containers, 1 other cgroups
[INFO]  Using cgroup-based profiling with 22 top cgroups: [...]

Result: ✅ PASS — Cgroup v2 correctly detected; 21 Docker containers + 1 system cgroup = 22 total cgroups selected by CPU usage.

✅ Check 5: `-a` placed before `-e` and `-G` in the actual profiling command

Expected: perf record ... -a -e cycles ... -G <cgroups>
Observed (23:45:39):

perf record -F 11 -g -o .../perf.fp --switch-output=90.0s,signal \
  --switch-max-files=1 -m 129 \
  -a \
  -e cycles -e cycles ... (×22, one per cgroup) \
  -G system.slice/docker-cdfc2f7c....scope,system.slice/docker-0a47afe8....scope,...
  ^^
  -a correctly before -e and -G

Result: ✅ PASS — Argument order is correct: pid_args (-a) → extra_args (-e ...) → cgroup_args (-G ...).

✅ Check 6: Perf starts and continuously dumps data

Expected: Perf starts successfully and periodically dumps output files.
Observed:

[INFO]  23:45:39 - Started perf (fp mode)
[DEBUG] 23:46:39 - perf (fp mode) run output: Dump .../perf.fp.2026040923463936
[DEBUG] 23:48:29 - perf (fp mode) run output: Dump .../perf.fp.2026040923482937
[DEBUG] 23:49:34 - perf (fp mode) run output: Dump .../perf.fp.2026040923493440
[DEBUG] 23:50:38 - perf (fp mode) run output: Dump .../perf.fp.2026040923503844

Result: ✅ PASS — Perf is running continuously, switching output files every ~60s as expected. Data is being collected.

Fix the arguments ordering for perf commands

6828d6e

prashantbytesyntax reviewed Apr 8, 2026

View reviewed changes

Comment thread gprofiler/utils/perf_process.py

Comment thread gprofiler/utils/perf_process.py

artursarlo merged commit 56273b6 into master Apr 10, 2026
12 of 35 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: perf argument ordering bug silently disabling the profiler and losing collected data#69

Fix: perf argument ordering bug silently disabling the profiler and losing collected data#69
artursarlo merged 1 commit into
masterfrom
fix/discover_perf_event_logic

artursarlo commented Apr 8, 2026

Uh oh!

Uh oh!

Uh oh!

artursarlo commented Apr 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

artursarlo commented Apr 8, 2026

Fix: perf argument ordering bug silently disabling the profiler and losing collected data

Summary

The Bug

Impact

Impact 1: Perf profiler silently disabled (consistent failure)

Impact 2: Collected data silently lost (intermittent failure)

Root Cause

The Fix

gprofiler/utils/perf_process.py — root fix

gprofiler/utils/perf.py — cleanup

How It Was Found

Files Changed

Uh oh!

Uh oh!

Uh oh!

artursarlo commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test for cgropup based profiling

Configuration Under Test

Validation Checklist

✅ Check 1: -a placed before -- in discovery command

✅ Check 2: Perf event discovery succeeds

✅ Check 3: Absence of "Failed to determine perf event to use"

✅ Check 4: Cgroup v2 detection and Docker container discovery

✅ Check 5: -a placed before -e and -G in the actual profiling command

✅ Check 6: Perf starts and continuously dumps data

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`gprofiler/utils/perf_process.py` — root fix

`gprofiler/utils/perf.py` — cleanup

artursarlo commented Apr 9, 2026 •

edited

Loading

✅ Check 1: `-a` placed before `--` in discovery command

✅ Check 5: `-a` placed before `-e` and `-G` in the actual profiling command