PAPI Overview

djwoun edited this page Sep 13, 2024 · 67 revisions

This page provides a general overview of the PAPI library with a discussion of all major features and functionality.

  1. Intended Audience
  2. C and Fortran Calling Interfaces
  3. Components
  4. Example Code
  5. Events
  6. Getting and Setting Options
  7. PAPI Counter Interfaces
  8. PAPI Timers
  9. PAPI System Information
  10. Advanced PAPI Features
  11. PAPI Error Handling
  12. PAPI Utilities

Intended Audience

Welcome to PAPI, the Performance API. This overview will provide you with a discussion of how to use the different components and functions of PAPI. The intended audience includes application developers, performance tool writers, and curious students of performance who wish to access performance data to tune and model application performance. You should have some level of familiarity with C and Fortran, and have a basic knowledge of computer architecture and programming.


C and Fortran Calling Interfaces

PAPI is written in C. The function calls in the C interface are declared in the header file papi.h and have the following form:

<returned data type> PAPI_function_name(arg1, arg2, …)

The function calls in the Fortran interface are defined in the source file papi_fwrappers.c and have the following form:

PAPIF_function_name(arg1, arg2, …, check) 

As the two forms show, C function calls generally have equivalent Fortran function calls (PAPI_<call> becomes PAPIF_<call>). The exceptions are functions that return C pointers to structures, such as PAPI_get_opt and PAPI_get_executable_info, which are either not implemented in the Fortran interface or implemented with different calling semantics. In the Fortran interface, the return code of the corresponding C routine is returned in the final argument, check.

For most architectures, the following relation holds between the pseudo-types listed and Fortran variable types:

Pseudo-type     Fortran type                  Description
C_INT           INTEGER                       Default Integer type
C_FLOAT         REAL                          Default Real type
C_LONG_LONG     INTEGER*8                     Extended size integer
C_STRING        CHARACTER*(PAPI_MAX_STR_LEN)  Fortran string
C_INT FUNCTION  EXTERNAL INTEGER FUNCTION     Fortran function returning integer result

For predictable behavior, array arguments must be large enough to hold the data passed to or returned from the subroutine. The array length is indicated either by the accompanying argument or by internal PAPI definitions.

On most implementations, subroutines that accept a C_STRING argument can read the character string length provided by Fortran. In these implementations, the string is truncated or space padded as necessary. On other implementations, the character array is assumed to be of sufficient size. No character string longer than PAPI_MAX_STR_LEN is returned by the PAPIF interface.


Components

PAPI provides several components that allow you to monitor system information of CPUs, network cards, graphics accelerator cards, parallel file systems and more. While the CPU components, perf_event and perf_event_uncore, and the sysdetect component are enabled by default, all other components have to be specified during the installation process.

Component Types

In PAPI there are two types of components:

  • Standard: fully initialized after a call to PAPI_library_init (e.g. perf_event and perf_event_uncore);
  • Delay Init: fully initialized after a call to any of the PAPI functions that access the component, like PAPI_enum_cmp_event or PAPI_add_event (e.g. cuda and rocm).

After calling PAPI_library_init, Delay Init components are in an intermediate initialization state. This is conveyed to the user by setting the disabled flag of the component info structure to PAPI_EDELAY_INIT. Users who do not access the component info structure do not need to be concerned with delayed initialization. Initialization of a Delay Init component is completed by a call that accesses the component, such as PAPI_enum_cmp_event. If the call returns PAPI_OK, the component's disabled flag is updated from PAPI_EDELAY_INIT to PAPI_OK.

The reason PAPI has Delay Init components is to minimize overhead. Some components, like GPU components, may have hundreds of thousands of events that require several minutes to be accessed. If a component with hundreds of thousands of events is configured in PAPI but the user does not need it, it would be unreasonable for the user to wait several minutes for the component to be initialized.

Installation of Components

To install PAPI with additional components, you have to specify them during configure.

For example, to install PAPI with the CUDA component enabled:

./configure --with-components="cuda"

For CUDA, PAPI requires one environment variable, PAPI_CUDA_ROOT, both when compiling and at runtime. On Linux you would typically export this variable as shown below. Some systems use software such as module or spack to manage environment variables, so consult your sysadmin if you have such management software. For example:

export PAPI_CUDA_ROOT=/path/to/installed/cuda

If you want to install multiple components, you must specify them as a space-separated list.

Example:

./configure --with-components="appio coretemp cuda nvml"

List of Components

The following table lists all available components.

Before installing a component, please read further instructions by clicking on the desired component name.

Note: The name of the components in the table is shown as it must be used in configure.

Component Name Description
CPU
perf_event Linux perf_event CPU counters (default)
perf_event_uncore Linux perf_event CPU uncore and northbridge (default)
perfctr Linux perfctr CPU counters (only used for Linux before 2.6.31)
perfctr_ppc Linux perfctr CPU counters for IBM PowerPC (9) architecture (only used for Linux before 2.6.31)
perfmon_ia64 Linux perfmon2 CPU counters for Itanium architecture (only used for Linux before 2.6.31)
perfmon2 Linux perfmon2 CPU counters (only used for Linux before 2.6.31)
GPU
cuda CUDA events and metrics via NVIDIA CuPTI interfaces
nvml NVIDIA hardware counters (usage, power, temperature, fan speed, etc)
rocm GPU events and metrics via AMD ROCm-PL API
rocm_smi AMD GPU hardware counters (usage, power, temperature, fan speed, etc)
intel_gpu Intel GPU hardware performance metrics through Intel oneAPI Level Zero Interface
Power
host_micpower Host-side power usage on MIC guest cards
libmsr Measuring and capping power usage on recent Intel architectures using the RAPL interface
micpower Power usage on Intel Xeon Phi (MIC)
nvml NVIDIA hardware counters (usage, power, temperature, fan speed, etc)
powercap Linux powercap energy measurements
powercap_ppc Linux powercap energy measurements for IBM PowerPC (9) architecture
rapl Linux RAPL energy measurements
rocm_smi AMD GPU hardware counters (usage, power, temperature, fan speed, etc)
sensors_ppc Linux sensors_ppc energy measurements
Network
infiniband Linux Infiniband statistics using the sysfs interface
net Linux network driver statistics
I/O
appio Linux I/O system calls
io Linux I/O statistics from /proc/self/io
lustre Lustre filesystem statistics
stealtime Stealtime filesystem statistics
Other
bgpm Hardware counters for Blue Gene/Q
coretemp Linux hwmon temperature and other info
coretemp_freebsd FreeBSD hwmon temperature and other info
emon EMON counters for Blue Gene/Q
example Simple example component
lmsensors Linux LMsensor statistics
mx Myricom MX (Myrinet Express) statistics
pcp Performance Co-Pilot
sde Software defined events
vmware Support for VMware (vmguest and pseudo counters)
sysdetect Support for system detection information

Example Code

The above links provide guides through example code. It is our intention that this code will be executable by simply copying it into a file, compiling it, and linking it to the PAPI library.


Events

PAPI counts events that occur on a CPU or other subsystem. There are usually more events to be measured than counter registers to count them in, so PAPI also provides the means to map events to counters. To learn more about events, click here, or on the title above.

In addition to the events that are native to each component, PAPI defines a set of preset events that are standardized across all CPU components. To facilitate the discovery of supported events, PAPI provides query functions to inquire about the availability of specified events. Events are often referred to by name, but internally PAPI uses an opaque code to specify an event. Translation functions are provided to convert between names and codes. For convenience, event codes for a specific component can be collected into event sets. A variety of functions are available to manage event sets. Additionally, a number of options can be set, either for the behavior of the whole library or for an individual event set.

All of these features are described in greater detail below.

Native events comprise the set of all events that are available for a specific component. For CPUs, there are generally far more native events available than can be mapped onto PAPI preset events. For other components, native events are generally the only option available. Click here, or on the title above for more information on native events and examples of their use.

Preset events, also known as predefined events, are a common set of CPU events deemed relevant and useful for application performance tuning. PAPI defines a set of about 100 preset events for CPUs. A given CPU will implement a subset of those, often no more than several dozen. Although the names and calling semantics of preset events are standardized across platforms, the exact definitions are determined by the underlying hardware. Caveat emptor. For more details on preset events and examples of their use, click here, or on the title above.

Several low-level functions can be called to learn more about preset or native events.

PAPI_query_event returns PAPI_OK if the event can be counted, or an error code if it cannot;

PAPI_get_event_info returns a structure containing information about a specific event; and

PAPI_enum_event returns the next event in a sequence, given the event code of a specific event. This function is useful for enumerating a list of events.

For more details on these functions and examples of their use, click here, or on the title above.

A preset or native event can be referenced by name or by event code. Most PAPI functions require an event code, while most user input and output is in terms of names. Two low-level functions are provided to translate between these formats. They are discussed with usage examples here or by clicking on the title above.

Event Sets are user-defined collections of hardware events (preset or native), which are measured together to provide meaningful information. Events in an Event Set must all belong to a single component. Multiple Event Sets can be defined at the same time, but only one per component can be active. For details on managing Event Sets, including function calls and example code, click here or on the title above.

Getting and Setting Options

There are a number of options that can globally affect the operation of the entire PAPI library or locally affect a specific event set. These options can be reviewed and set by calling a pair of low-level functions, as described in more detail here and via the title above.


PAPI Counter Interfaces

The high-level API (Application Programming Interface) provides the ability to record performance events inside instrumented regions of serial, multi-processing (MPI, SHMEM) and thread (OpenMP, Pthreads) parallel applications. It is designed for simplicity, not flexibility. For more details click here or on the title above.

PAPI provides four simplified functions to get Mflops/s (floating point operation rate), Mflips/s (floating point instruction rate), IPC (instructions per cycle), and EPC (arbitrary events per cycle). For more details click here or on the title above.

The low-level API manages hardware events in user-defined groups called Event Sets. It is meant for experienced application programmers and tool developers wanting fine-grained measurement and control of the PAPI interface. It provides access to both PAPI preset and native events, and supports all installed components. For more details on the Low Level API, click here or on the title above.


PAPI Timers

PAPI provides four functions to measure time in microseconds or cycles for either real (wall clock) time or virtual (process) time. These timers use the most accurate timers available on the platform in use. More information on these routines can be found here or by clicking the title above.


PAPI System Information

This section explains the PAPI functions associated with obtaining hardware and executable information. Code examples along with the corresponding output are included as well.


Advanced PAPI Features

PAPI supports a number of advanced features beyond simple event counting. You can learn more about these advanced topics by following the title links below.

Hardware Performance Counters are generally a scarce resource. There are often many more events of interest than counters to count them on. Multiplexing is one way around this dilemma. It doesn't come without trade-offs. Click here or the title above to learn more.

PAPI can be used with parallel as well as serial programs. For a discussion of issues that come up in threaded or multiprocess environments, click here or the title above.

Most processors can generate an interrupt when a performance counter exceeds a threshold value. PAPI allows you to attach an interrupt handler to that occurrence so you can perform periodic activities where the period is determined by an event other than time. Learn more by clicking here or the title above.

By using the overflow capabilities of PAPI, it is possible to create profiles of the distribution of various performance events across a selected address space. Learn more by clicking here or on the title above.


PAPI Error Handling

Sometimes things don't go as planned. Most PAPI routines will tell you when that happens. It's always a good idea to check if things worked and let someone know if they didn't. To learn more about the return codes that PAPI provides, and how to turn them into meaningful messages, click here or the title above.

Many of the code snippets in this Overview and in the PAPI man pages refer to a routine called handle_error. One possible implementation of this routine is shown here.


PAPI Utilities

A collection of simple utility commands is available in the src/utils directory. See individual utilities for details on usage.

Utility Name Description
papi_avail provides availability and detail information for PAPI preset events
papi_clockres prints clock latency and resolution
papi_cost provides costs of execution for PAPI start/stop, read and accum
papi_command_line executes PAPI preset or native events from the command line
papi_decode decodes PAPI preset events into a csv format suitable for PAPI_encode_events
papi_event_chooser given a list of named events, lists other events that can be counted with them
papi_mem_info provides information on the memory architecture of the current processor
papi_native_avail provides detailed information for PAPI native events