PAPI Overview
This page provides a general overview of the PAPI library with a discussion of all major features and functionality.
Welcome to PAPI, the Performance API. This overview will provide you with a discussion of how to use the different components and functions of PAPI. The intended audience includes application developers, performance tool writers, and curious students of performance who wish to access performance data to tune and model application performance. You should have some level of familiarity with C and Fortran, and have a basic knowledge of computer architecture and programming.
PAPI is written in C. The function calls in the C interface are declared in the header file papi.h and have the following form:
<returned data type> PAPI_function_name(arg1, arg2, …)
The function calls in the Fortran interface are defined in the source file papi_fwrappers.c and have the following form:
PAPIF_function_name(arg1, arg2, …, check)
As you can see, the C function calls have equivalent Fortran function calls (PAPI_<call> becomes PAPIF_<call>). This is generally true for most function calls, except for the functions that return C pointers to structures, such as PAPI_get_opt and PAPI_get_executable_info, which are either not implemented in the Fortran interface, or implemented with different calling semantics. In the function calls of the Fortran interface, the return code of the corresponding C routine is returned in the argument, check.
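As a rough illustration of this calling convention, the sketch below shows how a Fortran-callable wrapper can forward the return code of a C routine through a trailing check argument. The names demo_accumulate and demof_accumulate are hypothetical and are not part of PAPI; the real wrappers live in papi_fwrappers.c and also deal with compiler-specific name mangling, which is left out here.

```c
#include <stddef.h>

/* Hypothetical C routine: returns 0 on success, a negative code on error. */
int demo_accumulate(long long *total, int increment)
{
    if (total == NULL || increment < 0)
        return -1;              /* stand-in for an error code */
    *total += increment;
    return 0;                   /* stand-in for a success code */
}

/* Hypothetical Fortran-style wrapper: Fortran passes every argument by
 * reference, and the C return code is stored in the last argument. */
void demof_accumulate(long long *total, int *increment, int *check)
{
    *check = demo_accumulate(total, *increment);
}
```

A Fortran caller would then test check after the call, where a C caller would test the function's return value.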
For most architectures, the following relation holds between the pseudo-types listed and Fortran variable types:
Pseudo-type | Fortran type | Description |
---|---|---|
C_INT | INTEGER | Default Integer type |
C_FLOAT | REAL | Default Real type |
C_LONG_LONG | INTEGER*8 | Extended size integer |
C_STRING | CHARACTER*(PAPI_MAX_STR_LEN) | Fortran string |
C_INT FUNCTION | EXTERNAL INTEGER FUNCTION | Fortran function returning integer result |
Array arguments must be of sufficient size to hold the input/output from/to the subroutine for predictable behavior. The array length is indicated either by the accompanying argument or by internal PAPI definitions.
Subroutines accepting C_STRING as an argument are on most implementations capable of reading the character string length as provided by Fortran. In these implementations, the string is truncated or space padded as necessary. For other implementations, the length of the character array is assumed to be of sufficient size. No character string longer than PAPI_MAX_STR_LEN is returned by the PAPIF interface.
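The truncate-or-pad behavior described above can be sketched in plain C. This is only an illustration of the idea, assuming the wrapper knows the length of the Fortran character buffer; copy_to_fortran_string is a hypothetical helper, not a PAPI function.

```c
#include <stddef.h>
#include <string.h>

/* Copy a NUL-terminated C string into a fixed-length Fortran CHARACTER
 * buffer: truncate if it is too long, pad with spaces if it is too short.
 * Fortran strings carry no terminating NUL, so none is written. */
void copy_to_fortran_string(char *dst, size_t dst_len, const char *src)
{
    size_t n = strlen(src);
    if (n > dst_len)
        n = dst_len;                        /* truncate */
    memcpy(dst, src, n);
    memset(dst + n, ' ', dst_len - n);      /* space pad the remainder */
}
```

With an 8-character buffer, a 3-character name is followed by five spaces, while a 12-character name loses its last four characters.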
PAPI provides several components that allow you to monitor system information of CPUs, network cards, graphics accelerator cards, parallel file systems and more. While the CPU components, perf_event and perf_event_uncore, and the sysdetect component are enabled by default, all other components have to be specified during the installation process.
In PAPI there are two types of components:
- Standard: fully initialized after a call to PAPI_library_init (e.g. perf_event and perf_event_uncore);
- Delay Init: fully initialized after a call to any of the PAPI functions that access the component, like PAPI_enum_cmp_event or PAPI_add_event (e.g. cuda and rocm).
After a call to PAPI_library_init, Delay Init components are in an intermediate initialization state. PAPI signals this by setting the disabled flag of the component info structure to PAPI_EDELAY_INIT. Users who do not access the component info structure need not be concerned with delayed initialization. Initialization of a Delay Init component is completed by, for example, a call to PAPI_enum_cmp_event. If that call returns PAPI_OK, the component's disabled flag is updated from PAPI_EDELAY_INIT to PAPI_OK.
The reason PAPI has Delay Init components is to minimize overhead. Some components, like GPU components, may have hundreds of thousands of events that require several minutes to be accessed. If a component with hundreds of thousands of events is configured in PAPI but the user does not need it, it would be unreasonable for the user to wait several minutes for the component to be initialized.
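The general pattern behind delayed initialization can be sketched with a toy model in plain C. Nothing below is PAPI code: demo_component, demo_register, demo_num_events, and the DEMO_ status codes are hypothetical stand-ins, and the "expensive" step is reduced to a single assignment.

```c
/* Hypothetical status codes for this toy model only. */
enum { DEMO_OK = 0, DEMO_DELAY_INIT = -1 };

typedef struct {
    const char *name;
    int disabled;       /* DEMO_DELAY_INIT until first use, then DEMO_OK */
    int num_events;     /* filled in by the expensive initialization */
} demo_component;

/* Cheap registration step, analogous to what happens at library init. */
void demo_register(demo_component *c, const char *name)
{
    c->name = name;
    c->disabled = DEMO_DELAY_INIT;
    c->num_events = 0;
}

/* The first access to the component finishes its initialization on demand. */
int demo_num_events(demo_component *c)
{
    if (c->disabled == DEMO_DELAY_INIT) {
        c->num_events = 3;      /* placeholder for a slow event-table scan */
        c->disabled = DEMO_OK;
    }
    return c->num_events;
}
```

The cost of the slow step is paid only by callers that actually touch the component, which is the trade-off the paragraph above describes.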
To install PAPI with additional components, you have to specify them during configure.
For example, to install PAPI with the CUDA component enabled:
./configure --with-components="cuda"
For CUDA, PAPI requires one environment variable, PAPI_CUDA_ROOT, at both compile time and run time. On Linux you would typically export this variable as shown below. Some systems manage environment variables with software such as module or spack, so consult your system administrator if you use such software. For example:
export PAPI_CUDA_ROOT=/path/to/installed/cuda
If you want to install multiple components, you must specify them as a space-separated list.
Example:
./configure --with-components="appio coretemp cuda nvml"
The following table lists all available components.
Before installing a component, please read further instructions by clicking on the desired component name.
Note: Component names in the table are shown exactly as they must be passed to configure.
Component Name | Description | |
---|---|---|
CPU | ||
perf_event | Linux perf_event CPU counters (default) | |
perf_event_uncore | Linux perf_event CPU uncore and northbridge (default) | |
perfctr | Linux perfctr CPU counters (only used for Linux before 2.6.31.) | |
perfctr_ppc | Linux perfctr CPU counters for IBM PowerPC (9) architecture (only used for Linux before 2.6.31.) | |
perfmon_ia64 | Linux perfmon2 CPU counters for Itanium architecture (only used for Linux before 2.6.31.) | |
perfmon2 | Linux perfmon2 CPU counters (only used for Linux before 2.6.31.) | |
GPU | ||
cuda | CUDA events and metrics via NVIDIA CuPTI interfaces | |
nvml | NVIDIA hardware counters (usage, power, temperature, fan speed, etc) | |
rocm | GPU events and metrics via AMD ROCm-PL API | |
rocm_smi | AMD GPU hardware counters (usage, power, temperature, fan speed, etc) | |
intel_gpu | Intel GPU hardware performance metrics through Intel oneAPI Level Zero Interface | |
Power | ||
host_micpower | Host-side power usage on MIC guest cards | |
libmsr | Measuring and capping power usage on recent Intel architectures using the RAPL interface | |
micpower | Power usage on Intel Xeon Phi (MIC) | |
nvml | NVIDIA hardware counters (usage, power, temperature, fan speed, etc) | |
powercap | Linux powercap energy measurements | |
powercap_ppc | Linux powercap energy measurements for IBM PowerPC (9) architecture | |
rapl | Linux RAPL energy measurements | |
rocm_smi | AMD GPU hardware counters (usage, power, temperature, fan speed, etc) | |
sensors_ppc | Linux sensors_ppc energy measurements | |
Network | ||
infiniband | Linux Infiniband statistics using the sysfs interface | |
net | Linux network driver statistics | |
I/O | ||
appio | Linux I/O system calls | |
io | Linux I/O statistics from /proc/self/io | |
lustre | Lustre filesystem statistics | |
stealtime | Steal time statistics when running in a virtualized environment | |
Other | ||
bgpm | Hardware counters for Blue Gene/Q | |
coretemp | Linux hwmon temperature and other info | |
coretemp_freebsd | FreeBSD hwmon temperature and other info | |
emon | EMON counters for Blue Gene/Q | |
example | Simple example component | |
lmsensors | Linux LMsensor statistics | |
mx | Myricom MX (Myrinet Express) statistics | |
pcp | Performance Co-Pilot | |
sde | Software defined events | |
vmware | Support for VMware (vmguest and pseudo counters) | |
sysdetect | Support for system detection information | |
The links above lead to guides with example code. Our intention is that each example can be run by simply copying it into a file, compiling it, and linking it against the PAPI library.
PAPI counts events that occur on a CPU or other subsystem. There are usually more events to be measured than counter registers to count them in, so PAPI also provides the means to map events to counters. To learn more about events, click here, or on the title above.
In addition to the events that are native to each component, PAPI defines a set of preset events that are standardized across all CPU components. To facilitate the discovery of supported events, PAPI provides query functions to inquire about the availability of specified events. Events are often referred to by name, but internally PAPI uses an opaque code to specify an event. Translation functions are provided to convert between names and codes. For convenience, event codes for a specific component can be collected into event sets. A variety of functions are available to manage event sets. Additionally, a number of options can be set, either for the behavior of the whole library or for an individual event set.
All of these features are described in greater detail below.
Native events comprise the set of all events that are available for a specific component. For CPUs, there are generally far more native events available than can be mapped onto PAPI preset events. For other components, native events are generally the only option available. Click here, or on the title above for more information on native events and examples of their use.
Preset events, also known as predefined events, are a common set of CPU events deemed relevant and useful for application performance tuning. PAPI defines a set of about 100 preset events for CPUs. A given CPU will implement a subset of those, often no more than several dozen. Although the names and calling semantics of preset events are standardized across platforms, the exact definitions are determined by the underlying hardware. Caveat emptor. For more details on preset events and examples of their use, click here, or on the title above.
Several low-level functions can be called to learn more about preset or native events.
- PAPI_query_event returns PAPI_OK if the event can be counted; otherwise it returns an error code.
- PAPI_get_event_info returns a structure containing information about a specific event.
- PAPI_enum_event returns the next event in a sequence, given the event code of a specific event. This function is useful for enumerating a list of events.
For more details on these functions and examples of their use, click here, or on the title above.
A preset or native event can be referenced by name or by event code. Most PAPI functions require an event code, while most user input and output is in terms of names. Two low-level functions are provided to translate between these formats. They are discussed with usage examples here or by clicking on the title above.
Event Sets are user-defined collections of hardware events (preset or native), which are measured together to provide meaningful information. Events in an Event Set must all belong to a single component. Multiple Event Sets can be defined at the same time, but only one per component can be active. For details on managing Event Sets, including function calls and example code, click here or on the title above.
There are a number of options that can globally affect the operation of the entire PAPI library or locally affect a specific event set. These options can be reviewed and set by calling a pair of low-level functions, as described in more detail here and via the title above.
The high-level API (Application Programming Interface) provides the ability to record performance events inside instrumented regions of serial, multi-processing (MPI, SHMEM) and thread (OpenMP, Pthreads) parallel applications. It is designed for simplicity, not flexibility. For more details click here or on the title above.
PAPI provides four simplified functions to get Mflops/s (floating point operation rate), Mflips/s (floating point instruction rate), IPC (instructions per cycle), and EPC (arbitrary events per cycle). For more details click here or on the title above.
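The arithmetic behind two of these rates can be sketched as follows. This is a minimal illustration, not PAPI's implementation: the function names mflops_rate and ipc_rate are hypothetical, and the inputs are assumed to be raw counts and an elapsed time in microseconds obtained elsewhere.

```c
/* Mflop/s: one floating point operation per microsecond equals
 * 10^6 operations per second, so the ratio needs no extra scaling. */
double mflops_rate(long long flop_count, long long elapsed_usec)
{
    return elapsed_usec > 0 ? (double)flop_count / (double)elapsed_usec : 0.0;
}

/* IPC: instructions completed per cycle. */
double ipc_rate(long long instructions, long long cycles)
{
    return cycles > 0 ? (double)instructions / (double)cycles : 0.0;
}
```

Both helpers return 0.0 instead of dividing by zero when the denominator is not positive, which is a choice made for this sketch only.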
The low-level API (Application Programming Interface) manages hardware events in user-defined groups called Event Sets. It is meant for experienced application programmers and tool developers wanting fine-grained measurement and control of the PAPI interface. It provides access to both PAPI preset and native events, and supports all installed components. For more details on the Low Level API, click here or on the title above.
PAPI provides four functions to measure time in microseconds or cycles for either real (wall clock) time or virtual (process) time. These timers use the most accurate timers available on the platform in use. More information on these routines can be found here or by clicking the title above.
This section explains the PAPI functions associated with obtaining hardware and executable information. Code examples along with the corresponding output are included as well.
PAPI supports a number of advanced features beyond simple event counting. You can learn more about these advanced topics by following the title links below.
Hardware Performance Counters are generally a scarce resource. There are often many more events of interest than counters to count them on. Multiplexing is one way around this dilemma. It doesn't come without trade-offs. Click here or the title above to learn more.
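One common way to turn time-sliced counts into full-run estimates is to scale each raw count by the fraction of time its event was actually scheduled on a counter. The sketch below assumes that simple linear scaling; it is not necessarily the exact estimator PAPI uses, and estimate_multiplexed_count is a hypothetical name.

```c
/* Estimate the count an event would have reached had it been counted for
 * the whole run, given that it was only scheduled for part of it. */
long long estimate_multiplexed_count(long long raw_count,
                                     long long active_usec,
                                     long long total_usec)
{
    if (active_usec <= 0)
        return 0;       /* event never ran; no basis for an estimate */
    return (long long)((double)raw_count * (double)total_usec
                       / (double)active_usec);
}
```

The estimate is only as good as the assumption that the event occurs at a uniform rate. Short runs or bursty events can make the scaled value diverge noticeably from the true count, which is the main trade-off of multiplexing.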
PAPI can be used with parallel as well as serial programs. For a discussion of issues that come up in threaded or multiprocess environments, click here or the title above.
Most processors can generate an interrupt when a performance counter exceeds a threshold value. PAPI allows you to attach an interrupt handler to that occurrence so you can perform periodic activities where the period is determined by an event other than time. Learn more by clicking here or the title above.
By using the overflow capabilities of PAPI, it is possible to create profiles of the distribution of various performance events across a selected address space. Learn more by clicking here or on the title above.
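Conceptually, such a profile is a histogram over addresses: each overflow interrupt contributes one sample to the bucket covering the interrupted address. The sketch below shows that bookkeeping under simplifying assumptions (fixed-size buckets, a caller-provided counter array). The demo_profile type and demo_profile_record function are hypothetical and do not reflect PAPI's own data structures.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start;        /* first address covered by the profile */
    size_t bucket_bytes;    /* bytes of address space per bucket */
    size_t num_buckets;     /* number of entries in hits */
    unsigned *hits;         /* caller-provided array of counters */
} demo_profile;

/* Called once per counter overflow with the interrupted address.
 * Returns 1 if the sample fell inside the profiled range, 0 otherwise. */
int demo_profile_record(demo_profile *p, uintptr_t address)
{
    if (address < p->start)
        return 0;
    size_t index = (size_t)(address - p->start) / p->bucket_bytes;
    if (index >= p->num_buckets)
        return 0;
    p->hits[index]++;
    return 1;
}
```

Smaller buckets give finer resolution at the cost of a larger array, and buckets with many hits point to the code regions where the chosen event occurs most often.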
Sometimes things don't go as planned. Most PAPI routines will tell you when that happens. It's always a good idea to check if things worked and let someone know if they didn't. To learn more about the return codes that PAPI provides, and how to turn them into meaningful messages, click here or the title above.
Many of the code snippets in this Overview and in the PAPI man pages refer to a routine called handle_error. One possible implementation of this routine is shown here.
A collection of simple utility commands is available in the src/utils directory. See individual utilities for details on usage.
Utility Name | Description |
---|---|
papi_avail | provides availability and detail information for PAPI preset events |
papi_clockres | prints clock latency and resolution |
papi_cost | provides costs of execution for PAPI start/stop, read and accum |
papi_command_line | executes PAPI preset or native events from the command line |
papi_decode | decodes PAPI preset events into a csv format suitable for PAPI_encode_events |
papi_event_chooser | given a list of named events, lists other events that can be counted with them |
papi_mem_info | provides information on the memory architecture of the current processor |
papi_native_avail | provides detailed information for PAPI native events |