feat: Implement perf event groups, scaled reads, and group snapshots #22
- Introduced `group_fd` field in the perf options structure to allow attaching BPF programs to a group of perf events.
- Updated the `ks_open_perf_event` function to accept `group_fd` and handle group event management.
- Implemented helper functions for managing active members of perf event groups, ensuring that group leaders cannot be detached while active members exist.
- Enhanced the generated code to include necessary checks and structures for handling multiplexed perf events.
- Added tests to validate the new group management features and ensure correct code generation for group-related operations.
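The leader-detach rule from this commit can be sketched in C roughly as follows. This is a minimal illustration only: `struct perf_attachment`, `join_group`, and `detach` are hypothetical names chosen for the example, not the helpers generated by this PR, and the choice of `-EBUSY` as the refusal code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attachment record; field names are illustrative only. */
struct perf_attachment {
    struct perf_attachment *leader; /* NULL for a leader or standalone event */
    int active_members;             /* members currently attached to this leader */
    bool attached;
};

/* Attach `member` to an already attached `leader` and count it as active. */
static void join_group(struct perf_attachment *member,
                       struct perf_attachment *leader)
{
    assert(member != NULL && leader != NULL && leader->attached);
    member->leader = leader;
    member->attached = true;
    leader->active_members++;
}

/* Detach an event. A leader with active members is refused with -EBUSY
 * (assumed error code), so members must be detached first. */
static int detach(struct perf_attachment *att)
{
    if (att == NULL || !att->attached)
        return -EINVAL;
    if (att->active_members > 0)
        return -EBUSY;
    if (att->leader != NULL)
        att->leader->active_members--;
    att->leader = NULL;
    att->attached = false;
    return 0;
}
```

With this shape, detaching members before their leader always succeeds, while detaching the leader first is rejected until the last member is gone.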
- Introduced functions to manage performance event groups, including detection of maximum events and validation of static groups.
- Added support for new performance read functions: `read_raw`, `read_details`, and `read_group`, along with their corresponding structures and handling in the code generation.
- Enhanced the type checker to validate performance event group attachments and ensure no cycles exist in group leader relationships.
- Updated userspace code generation to track usage of new performance read functions and manage group attachments.
- Added tests for new functionality, including validation of oversized static performance event groups and code generation for new read functions.
…erformance event groups; added snapshot index printing functionality; updated userspace code generation tests to verify variable reuse logic.
Design suggestion: collapse the four read verbs into one generic
Fold calls into primary expressions so chained access like read(cache).scaled parses cleanly. Make read() return PerfRead and remove the split raw/details/group helpers. Always request PERF_FORMAT_ID and PERF_FORMAT_GROUP, clamp static group limits to 16, and update docs, examples, and tests.
Description
This update consolidates all perf counter reads around a single, unified PerfRead snapshot API, addressing previous review feedback regarding API bloat and latent silent-truncation bugs. Additionally, this PR formalizes our API philosophy regarding data retrieval (Perf vs. Maps vs. RingBuffers) to ensure clear semantic boundaries moving forward.
Key Technical Changes
Architectural Note: Semantic Boundaries for read() (Pushing back on generic dispatch)
In the previous review, it was suggested that we structure read() as a polymorphic dispatch point to eventually support Maps and RingBuffers. After careful consideration of the current language syntax, this PR advocates separating verb semantics instead. We propose strictly bounding read() to static snapshot retrievals (like Perf counters), rather than making it a universal accessor. Here is the rationale:
By keeping read() strictly for snapshots, [] for state lookups, and dispatch() for event streams, we maintain a clear, predictable, and highly specialized API philosophy.
print("Page-fault perf_event demo attached")
var page = attach(prog, perf_options { perf_type: perf_type_software, perf_config: page_faults, pid: 0, cpu: -1, period: 1 }, 0)
// branch joins cache's perf event group. Adding a member restarts the whole group from zero.
var branch = attach(prog, perf_options { perf_type: perf_type_hardware, perf_config: branch_misses, period: 10000000, inherit: true}, 0)
var cache_count = read(cache).scaled
print("Cache-miss count: %lld", cache_count)
var branch_count = read(branch)
var branch_count = read(branch).scaled
This PR sets read_format to include PERF_FORMAT_GROUP on every event, and reads into a group buffer:
result->raw = (int64_t)group.values[0].value; // values[0] == leader
result->scaled = result->values[0];
With PERF_FORMAT_GROUP, a read(2) on any fd in the group is serviced by the kernel via that event's group_leader and returns the entries leader-first. The fd you pass picks which group, not which entry. So whether you read the cache fd or the branch fd, the buffer comes back as:
nr=2, time_enabled, time_running,
values[0] = { cache_misses_value, cache_id } <- always the leader
values[1] = { branch_misses_value, branch_id }
scaled/raw are hardcoded to values[0], which is the leader. So read(branch).scaled returns the cache-miss count, even though the example labels and uses it as the branch-miss count. The branch value is sitting at values[1], but PerfRead gives you no "which index is me" — so you can't reliably pull it out.
If branch had instead been a standalone event (no group:), read(branch).scaled would be correct — nr=1, values[0] is branch's own. The bug only bites when you read() a non-leader member, which is exactly what the example does.
Store each perf attachment's kernel event id in the internal attachment state and use it to select the matching entry from PERF_FORMAT_GROUP reads. This makes read(member).raw/scaled report the member's counter instead of always returning the group leader. Also clarify the perf read docs and fix a stale example comment.
Good catch, thanks. You’re right: with PERF_FORMAT_GROUP the fd selects the group snapshot, but values[0] is still the leader, so read(branch).scaled was returning the leader count for grouped members. I fixed this by recording each attachment’s PERF_EVENT_IOC_ID in the internal perf attachment state and selecting the matching id from the group read buffer when filling raw/scaled. The full group snapshot remains available through values[]/ids[], and the public PerfAttachment shape stays unchanged. Also cleaned up the stale page-fault example comment and updated the docs to say read(att).raw/scaled refer to the attachment being read.
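For readers following the thread, the id-based selection can be sketched in C as below. This is a rough sketch, not the code generated by this PR: `struct group_read`, `find_member`, and `GROUP_MAX` are names invented for the example, and the buffer layout assumes `read_format` combines PERF_FORMAT_GROUP, PERF_FORMAT_ID, PERF_FORMAT_TOTAL_TIME_ENABLED, and PERF_FORMAT_TOTAL_TIME_RUNNING. The `event_id` argument stands for the value obtained once per attachment through the PERF_EVENT_IOC_ID ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP_MAX 16 /* assumed cap, mirroring the 16-entry snapshot limit */

/* One { value, id } pair as laid out by PERF_FORMAT_GROUP | PERF_FORMAT_ID. */
struct group_entry {
    uint64_t value;
    uint64_t id;
};

/* Buffer filled by read(2) on any fd of the group; entries are leader-first. */
struct group_read {
    uint64_t nr;
    uint64_t time_enabled;
    uint64_t time_running;
    struct group_entry values[GROUP_MAX];
};

/* Return the index of the entry whose kernel id matches `event_id`,
 * or -1 if the event is not part of this snapshot. */
static int find_member(const struct group_read *g, uint64_t event_id)
{
    assert(g != NULL);
    uint64_t n = g->nr < GROUP_MAX ? g->nr : GROUP_MAX;
    for (uint64_t i = 0; i < n; i++) {
        if (g->values[i].id == event_id)
            return (int)i;
    }
    return -1;
}
```

Filling `raw` from `values[find_member(&g, id)].value` rather than from `values[0].value` gives a non-leader member its own count, while a standalone event still resolves to index 0.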
Overview
This PR introduces the ability to group multiple `perf` metrics (e.g., cache misses, branch misses, cycles) into a single scheduling group. This ensures that counters observing the same workload are started and stopped together, solving the issue of misaligned results from independently managed counters.
Additionally, it brings comprehensive multiplex-aware read APIs, static PMU slot limit validations, and fixes several internal userspace codegen edges to stabilize snapshot data consumption.
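As an illustration of what a multiplex-aware read computes, the sketch below extrapolates a raw count by `time_enabled / time_running`. The struct mirrors the `raw`, `scaled`, `time_enabled`, and `time_running` fields named in this PR, but `struct perf_details` and `fill_scaled` are hypothetical names, and the floating-point scaling and the zero result for a never-scheduled counter are assumptions of this example rather than confirmed behavior of the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical details record; field names follow the PR description. */
struct perf_details {
    uint64_t raw;          /* counter value as returned by the kernel */
    uint64_t scaled;       /* estimate corrected for multiplexing */
    uint64_t time_enabled; /* ns the event was enabled */
    uint64_t time_running; /* ns the event was actually on the PMU */
};

/* Fill `scaled` from `raw`, extrapolating when the event was only
 * scheduled for part of the time it was enabled. */
static void fill_scaled(struct perf_details *d)
{
    assert(d != NULL);
    if (d->time_running == 0) {
        d->scaled = 0;          /* never scheduled: nothing to extrapolate */
    } else if (d->time_running >= d->time_enabled) {
        d->scaled = d->raw;     /* no multiplexing: scaled equals raw */
    } else {
        /* Floating point avoids 64-bit overflow in raw * time_enabled;
         * the result is an estimate, not an exact count. */
        d->scaled = (uint64_t)((double)d->raw * (double)d->time_enabled /
                               (double)d->time_running);
    }
}
```

For example, a counter that saw 500 events while running for half of its enabled time is reported as an estimated 1000.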
Key Features & User-Facing Changes
1. High-Level Grouping API
- `group` field: Added a high-level `group` field in `perf_options` to easily attach members to a leader.
- The `group_fd: leader.perf_fd` approach is preserved for backward compatibility.
2. Multiplex-Aware Read APIs
- `read(att)`: Now returns scaled values by default, corrected via `time_enabled / time_running` when PMU multiplexing occurs (matches the raw count if no multiplexing happens).
- `read_raw(att)`: Returns the uncorrected, raw counter values.
- `read_details(att)`: Returns a struct containing `raw`, `scaled`, `time_enabled`, and `time_running`, ideal for manual delta or rate calculations.
- `read_group(leader)`: Captures an atomic snapshot of the entire group. Returns up to 16 ID/value pairs (where `values[]` are pre-scaled according to snapshot timing) and snapshot time fields.
3. Group Lifecycle Management
4. Compile-Time PMU Slot Validation
- … 4 (or dynamically probes `sysfs`), and can be overridden via the `KERNELSCRIPT_PERF_GROUP_MAX_EVENTS` environment variable.
- `perf_type_software` and `perf_type_tracepoint` are correctly excluded from hardware PMU slot counts.
Internal & Codegen Improvements
- … `read_group()` snapshot arrays (`snapshot.ids[i]` / `snapshot.values[i]`).
- … memcpy", preventing invalid C generation from snapshot struct fields.
- … `for` loop counters and subsequent variables of the same name produced duplicate function-level C declarations.
Documentation & Examples
- `examples/perf_cache_miss.ks`: Refactored to use the new `group` API. Added demonstrations of `read_details()` for rate calculation and `read_group()` for iterating through snapshot `id`/`value` pairs.
- `examples/perf_page_fault.ks`: Extended to demonstrate updated perf read semantics.
- … `README.md`, `SPEC.md`, and `BUILTINS.md` to reflect group semantics, read interfaces, and PMU slot constraints.
Test Coverage
- … `group_fd` and high-level `group` paths.
- … `read()`, and helper generation for `read_raw()`, `read_details()`, and `read_group()`.
- … `for` loop counter variable reuse in userspace codegen.
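The slot-limit check listed under Compile-Time PMU Slot Validation could look roughly like the C sketch below. Several points are assumptions made for illustration: the fallback of 4 and the ceiling of 16 are read from the PR text and an earlier commit message, the `sysfs` probe is left out, and `resolve_group_limit`, `hardware_slots`, and `enum perf_kind` are hypothetical names. The actual validation happens in the compiler, so this only mirrors the logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum {
    DEFAULT_GROUP_EVENTS = 4,  /* assumed fallback when nothing is configured */
    MAX_GROUP_EVENTS = 16      /* assumed hard ceiling for static groups */
};

/* Numeric values match PERF_TYPE_* in linux/perf_event.h. */
enum perf_kind { KIND_HARDWARE = 0, KIND_SOFTWARE = 1, KIND_TRACEPOINT = 2 };

/* Resolve the group limit from an override string (for example the value of
 * KERNELSCRIPT_PERF_GROUP_MAX_EVENTS). Missing or invalid input falls back
 * to the default; larger values are clamped to the ceiling. */
static int resolve_group_limit(const char *override)
{
    if (override == NULL || *override == '\0')
        return DEFAULT_GROUP_EVENTS;
    char *end = NULL;
    long v = strtol(override, &end, 10);
    if (*end != '\0' || v < 1)
        return DEFAULT_GROUP_EVENTS;
    return v > MAX_GROUP_EVENTS ? MAX_GROUP_EVENTS : (int)v;
}

/* Count the events that occupy a hardware PMU slot; software and
 * tracepoint events are excluded. */
static int hardware_slots(const enum perf_kind *kinds, size_t n)
{
    assert(kinds != NULL || n == 0);
    int slots = 0;
    for (size_t i = 0; i < n; i++) {
        if (kinds[i] != KIND_SOFTWARE && kinds[i] != KIND_TRACEPOINT)
            slots++;
    }
    return slots;
}
```

A caller would compare `hardware_slots(kinds, n)` against `resolve_group_limit(getenv("KERNELSCRIPT_PERF_GROUP_MAX_EVENTS"))` and reject a static group that exceeds it.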