# Measuring performance in HIP applications

Understanding how well HIP applications perform is a vital part of the development process. Two main techniques, **profiling** and **tracing**, collect information about how well an application is performing. **Profiling** is the statistical collection of the cumulative time that threads spend in each program component. **Tracing** records both **when** and **for how long** threads spend time in each application component. Since HIP applications use either an AMD or a CUDA backend, the profiling tools from each platform are available for use.

## Event based timing

Events in HIP are used with streams to check the progress of work that has been submitted and establish dependencies between workflows. They can also be used to time the execution of work, such as kernels and memory copies. Here is how they fit into the picture of a HIP application.

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/hip_components.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Components of a HIP application. Events are associated with streams, and provide a way to time the duration of work in a stream. </figcaption>
</figure>

## Example application

The code [mat_mult_profiling.cpp](mat_mult_profiling.cpp) contains a complete example where events are used to time the execution of the host to device memory copy as well as the timing of the matrix multiplication kernel. The data type **hipEvent_t** stores event data. 

### Source code changes

In [mat_mult_profiling.cpp](mat_mult_profiling.cpp) we use the function **hipEventCreate** to create two events **t1** and **t2**, as seen in line 111.

```C++
    // mat_mult_profiling.cpp:111

    // Create events for the memory copies and kernel runs
    hipEvent_t t1=0, t2=0;
    // Create the events
    H_ERRCHK(hipEventCreate(&t1));
    H_ERRCHK(hipEventCreate(&t2));
```

Now we wish to use these events to time the upload of host matrices **A_h** and **B_h** to the compute device. The HIP function **hipEventRecord** inserts an event into the "flow" of a stream. We haven't talked in depth about HIP streams yet; at this stage we can think of a stream as a queue to which work is submitted. Since we are not using a particular stream, the work goes to the default stream (denoted by 0). We insert event `t1` into the default stream, perform the memory copies, then insert `t2` after the copies.

```C++
    // Record the start event into the default stream
    H_ERRCHK(hipEventRecord(t1,0));
    
    // Perform the memory copies
    H_ERRCHK(hipMemcpy(A_d, A_h, nbytes_A, hipMemcpyHostToDevice));
    H_ERRCHK(hipMemcpy(B_d, B_h, nbytes_B, hipMemcpyHostToDevice));
    
    // Record the stop event into the default stream
    H_ERRCHK(hipEventRecord(t2,0));
```

The function **hipEventSynchronize** waits until events reach a complete status. Then we can use the function **hipEventElapsedTime** to get the time elapsed between the two events. The helper function **h_get_event_time_ms** takes care of calling these functions, prints performance measurement information, and returns the number of milliseconds between the two events.

```C++
    // Total number of Bytes copied
    size_t total_bytes = nbytes_A + nbytes_B;

    // Get the elapsed time in milliseconds
    float elapsed_ms = h_get_event_time_ms(t1, t2, "memcpy", &total_bytes);
```

The source code of **h_get_event_time_ms** is in <a href="../include/hip_helper.hpp">hip_helper.hpp</a> and reproduced below:

```C++
// Get how much time elapsed between two events that were recorded
float h_get_event_time_ms(
        // Assumes start and stop events have been recorded
        // with the hipEventRecord() function
        hipEvent_t t1,
        hipEvent_t t2,
        const char* message, 
        size_t* nbytes) {
    
    // Make sure the stop and start events have finished
    H_ERRCHK(hipEventSynchronize(t2));
    H_ERRCHK(hipEventSynchronize(t1));

    // Elapsed time in milliseconds
    float elapsed_ms=0;

    // Convert the time into milliseconds
    H_ERRCHK(hipEventElapsedTime(&elapsed_ms, t1, t2));
        
    // Print the timing message if necessary
    if ((message != NULL) && (strlen(message)>0)) {
        std::printf("Time for event \"%s\": %.3f ms", message, elapsed_ms);
        
        // Print transfer rate if nbytes is not NULL
        if (nbytes != NULL) {
            double io_rate_MBs = h_get_io_rate_MBs(
                elapsed_ms, 
                *nbytes
            );
            std::printf(" (%.2f MB/s)", io_rate_MBs);
        }
        std::printf("\n");
    }
    
    return elapsed_ms;
}
```
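The helper **h_get_io_rate_MBs** is not reproduced here. As a rough sketch of what such a conversion might look like, the hypothetical function `io_rate_MBs` below turns an elapsed time in milliseconds and a byte count into a rate in MB/s. The function name and the choice of 1 MB = 10^6 bytes are assumptions for illustration and may differ from the implementation in hip_helper.hpp.

```C++
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for h_get_io_rate_MBs (not the course implementation).
// Converts an elapsed time in milliseconds and a byte count
// into a transfer rate in MB/s.
// Assumption: 1 MB = 1.0e6 bytes.
double io_rate_MBs(float elapsed_ms, std::size_t nbytes) {
    // A rate is only meaningful for a positive elapsed time
    assert(elapsed_ms > 0.0f);

    // Milliseconds to seconds
    double seconds = static_cast<double>(elapsed_ms) * 1.0e-3;

    // Bytes to megabytes
    double megabytes = static_cast<double>(nbytes) * 1.0e-6;

    return megabytes / seconds;
}
```

For instance, `io_rate_MBs(1.0f, 1000000)` evaluates to approximately 1000 MB/s, since one million bytes moved in one millisecond is one megabyte per millisecond.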

We can reuse the events to time the execution of the kernel. 

```C++
    // Record the start event into the default stream
    H_ERRCHK(hipEventRecord(t1,0));

    // Launch the kernel using hipLaunchKernelGGL method
    hipLaunchKernelGGL(mat_mult, 
            grid_nblocks, 
            block_size, sharedMemBytes, 0, 
            A_d, B_d, C_d,
            N1_A,
            N0_C,
            N1_C
    );

    // Record the stop event into the default stream 
    H_ERRCHK(hipEventRecord(t2,0));

    // Get the elapsed time in milliseconds
    elapsed_ms = h_get_event_time_ms(t1, t2, "mat_mult kernel", NULL);
```

When we are finished with the events we can destroy them with the **hipEventDestroy** function.

```C++
    // Destroy events
    H_ERRCHK(hipEventDestroy(t1));
    H_ERRCHK(hipEventDestroy(t2));
```

In this manner we instrument the uploads, downloads, and kernel execution in the source file [mat_mult_profiling.cpp](mat_mult_profiling.cpp). Now we run the instrumented code and view the timing results. Change directory to **course_material/L5_Profiling** and run the following code.

## Import the environment

The command below brings the `run` and `build` commands within reach of the Jupyter notebook.

In [6]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:../install/bin"

# At a Bash terminal you need to do this instead
# source ../env

In [7]:
!build mat_mult_profiling.exe; run mat_mult_profiling.exe

[ 66%] Built target hip_helper
[100%] Built target mat_mult_profiling.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
Time for event "memcpy": 0.914 ms (1738.94 MB/s)
Time for event "mat_mult kernel": 0.809 ms
Maximum error (infinity norm) is: 2.28882e-05


## Performance measurement with AMD tools

AMD has a number of tools available to help with collection of performance data. The low-level AMD profiler tool **ROC-profiler** (rocprof) has the ability to collect traces and information from hardware performance counters. Tools like [Omnitrace](https://github.com/AMDResearch/omnitrace) expand on the information collected by `rocprof` to include CPU resources and system metrics like GPU temperature and power usage. Tools like [Omniperf](https://github.com/AMDResearch/omniperf) use information from rocprof to help understand **how well** an application is performing in relation to peak performance, using reports such as roofline analysis and making information collected by rocprof understandable through graphical interfaces.

### HIP application traces with rocprof

Collection of HIP application traces with **rocprof** is accomplished with the **--hip-trace** and **--hsa-trace** flags. Tracing with **rocprof** only seems to work with the AMD HIP backend at present. Here is what a typical profiling command looks like.

In [8]:
!rocprof --hip-trace --hsa-trace -o rocprof_trace/result.csv mat_mult_profiling.exe

RPL: on '240507_144305' from '/opt/rocm-6.0.2' in '/nethome/tpotter/Pelagos/Projects/HIP_Course/course_material/L5_Profiling'
RPL: profiling '"mat_mult_profiling.exe"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_240507_144305_1332270'
RPL: result dir '/tmp/rpl_data_240507_144305_1332270/input_results_240507_144305'
ROCtracer (1332292):
ROCProfiler: input from "/tmp/rpl_data_240507_144305_1332270/input.xml"
  0 metrics
    HSA-trace(*)
    HSA-activity-trace()
    HIP-trace(*)
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (

Inside the **rocprof_trace** folder you will find the following files:

| file | purpose |
| --- | --- |
| result.sysinfo.txt | System information on available devices |
| result.copy_stats.csv | Statistics on all IO calls |
| result.hip_stats.csv | Statistics on non-IO HIP function calls |
| result.hsa_stats.csv | Statistics on HSA function calls |
| result.stats.csv | Statistics on all kernel calls |
| result.db | SQLite3 database of profiling information |
| result.json | Trace information in JSON format |
| result.csv | Information on kernels such as **mat_mult** |

We can view the trace file **rocprof_trace/result.json** in a web browser. The site below provides a user interface for viewing trace information, which can also be used offline.

[https://ui.perfetto.dev/](https://ui.perfetto.dev/)

Download the trace file **result.json** to your computer and open it with the Perfetto UI in your web browser.

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/Perfetto_UI.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Viewing rocprof application traces with Perfetto UI.</figcaption>
</figure>

If you zoom in (using the `wasd` keys) you can see calls in GPU threads, COPY threads and HOST threads on the CPU. Notice how the **hipEventRecord** function is executed before and after the **hipMemcpy** calls and the **mat_mult** kernel execution. If you click on the **mat_mult** function you can see how long the kernel took to execute.

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/Perfetto_UI_kernel.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Determining the time for a kernel call</figcaption>
</figure>


### Hardware performance counters with rocprof

Hardware performance counters are registers in a processor that count events, such as the number of wavefronts executed or the number of cache misses. Rocprof can collect performance counter values for kernels. The following command lists the derived metrics that can be collected:

In [9]:
!rocprof --list-derived

RPL: on '240507_144318' from '/opt/rocm-6.0.2' in '/nethome/tpotter/Pelagos/Projects/HIP_Course/course_material/L5_Profiling'
Derived metrics:

  gpu-agent1 : TCC_EA1_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC EA1s.
      TCC_EA1_RDREQ_32B_sum = sum(TCC_EA1_RDREQ_32B,16)

  gpu-agent1 : TCC_EA1_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC EA1s.
      TCC_EA1_RDREQ_sum = sum(TCC_EA1_RDREQ,16)

  gpu-agent1 : TCC_EA1_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC EA1s.
      TCC_EA1_WRREQ_sum = sum(TCC_EA1_WRREQ,16)

  gpu-agent1 : TCC_EA1_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC EA1s.
      TCC_EA1_WRREQ_64B_sum = sum(TCC_EA1_WRREQ_64B,16)

  gpu-agent1 : TCC_WRREQ1_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
      TCC_WRREQ1_STALL_max 

We can specify the counters to collect in a file such as [rocprof_counters.txt](rocprof_counters.txt). Here we specify some commonly used metrics for collection. Each **pmc** line is a unique experiment involving an individual run of the code. In this example we collect statistics for the **mat_mult** kernel, for kernel dispatches in the range 0:64 on GPU 0.

```txt
# Cache hits and Cache misses
pmc: TCC_HIT_sum, TCC_MISS_sum

# Total video memory fetched and written
pmc: FETCH_SIZE, WRITE_SIZE

# Percentage of time the GPU was busy, total wavefronts executed
pmc: GPUBusy, Wavefronts

# Average number of vector and scalar instructions executed per work-item
pmc: VALUInsts, SALUInsts

# Average number of vector and scalar fetch instructions per work-item
pmc: VFetchInsts, SFetchInsts

# Average number of vector write instructions per work-item
pmc: VWriteInsts

# Average number of LDS (shared memory) and GDS (global data share) read or write instructions per work-item
pmc: LDSInsts, GDSInsts

# Percentage of active vector ALU threads in a wave, percentage of GPU time vector and scalar instructions are processed
pmc: VALUUtilization, VALUBusy, SALUBusy

# Percentage of fetch, write, atomic, and other instructions that hit the L2 cache
pmc: L2CacheHit

# Percentage of time the memory unit is active (including stalled), and just stalled, percentage of time the write unit is stalled
pmc: MemUnitBusy, MemUnitStalled, WriteUnitStalled

# Percentage of time ALUs are stalled by shared memory access, percentage of GPU time local memory is stalled by bank conflicts
pmc: ALUStalledByLDS, LDSBankConflict

# Dispatch range, which kernel dispatches to profile
range: 0 : 64
# Which GPUs to profile
gpu: 0
# Names of kernels to profile
kernel: mat_mult
```

Then we can use rocprof to collect the data for these counters.

In [11]:
!rocprof -i rocprof_counters.txt -o rocprof_counters/result.csv mat_mult_profiling.exe

RPL: on '240507_144333' from '/opt/rocm-6.0.2' in '/nethome/tpotter/Pelagos/Projects/HIP_Course/course_material/L5_Profiling'
RPL: profiling '"mat_mult_profiling.exe"'
RPL: input file 'rocprof_counters.txt'
RPL: output dir '/tmp/rpl_data_240507_144333_1334300'
RPL: result dir '/tmp/rpl_data_240507_144333_1334300/input0_results_240507_144333'
ROCProfiler: input from "/tmp/rpl_data_240507_144333_1334300/input0.xml"
  gpu_index = 0
  kernel = mat_mult
  range = 0:64
  2 metrics
    TCC_HIT_sum, TCC_MISS_sum
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:     

If your chosen performance counters are supported, then the file [rocprof_counters/result.csv](rocprof_counters/result.csv) should contain the collected counter values for each profiled kernel dispatch. The file [rocprof_counters/example.csv](rocprof_counters/example.csv) is an example file collected with rocprof on **mat_mult_profiling.exe**. This [page](https://docs.amd.com/bundle/ROCProfiler-User-Guide-v5.1/page/rocprof_Command_Line_Tool.html) has information on what the keys in the CSV file mean.

### Rocprof under a job manager

Rocprof runs fine under a job manager like SLURM. You just need to make an output file for each process launched. For example, on SLURM the `$SLURM_JOBID` and `$SLURM_PROCID` environment variables are helpful in separating the output. Put the rocprof commands in a script called **profile.sh**.

```bash
#!/bin/bash
rocprof -i rocprof_counters.txt -o rocprof_counters/result-$SLURM_JOBID-$SLURM_PROCID.csv mat_mult_profiling_mpi.exe
```

Then you can run the script with **srun** like this, so that the environment variable **$SLURM_PROCID** is picked up from within the script for each process.

```bash
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./profile.sh
```

A complete example for using rocprof with an MPI-enabled application is in **course_material/L2_Using_HIP_On_Setonix/rocprof_mpi**.

### Rocprofiler API

If you'd like to instrument code with profiling calls, the **[rocprofiler API](https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/doc/rocprofiler_spec.md)** is available.

### Tracing with Omnitrace

[Omnitrace](https://github.com/AMDResearch/omnitrace) is an AMD research project to collect performance information on a program at runtime. It supports programs written in C, C++, Fortran and Python, as well as compute frameworks like OpenCL and HIP. Load the modules for Omnitrace (you will find these commands in either the welcome letter or in Lesson 2), then compile the software with `make`.

```bash
cd course_material/L5_Profiling
make
```

Then we can use Omnitrace to make a trace of **mat_mult_profiling.exe**.

```bash
omnitrace-instrument -- mat_mult_profiling.exe
```

Or we can have Omnitrace **instrument** the application ahead of time, writing an instrumented executable that is then launched with `omnitrace-run`. This is useful if we want to run an application with MPI support.

```bash
omnitrace-instrument -v -1 -o mat_mult_profiling.inst.exe -- mat_mult_profiling.exe
omnitrace-run -- mat_mult_profiling.inst.exe
```

If you look in the subfolders

* **omnitrace-mat_mult_profiling-output**
* **omnitrace-mat_mult_profiling.inst-output**

either in **course_material/L5_Profiling** or in the example folder **course_material/L5_Profiling/omnitrace_example**, there are subfolders with dates on them. In those subfolders are `*.proto` files for use with Perfetto. Download the **.proto** file to your computer and open it with [ui.perfetto.dev](https://ui.perfetto.dev) in a similar way to the JSON trace files from rocprof. You should see when and for how long functions are executed on the host and for how long kernels are executed on the device, along with a more detailed set of metrics such as CPU frequency and power consumption.

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/omnitrace.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Examining the output from Omnitrace using <a href="https://ui.perfetto.dev">ui.perfetto.dev</a></figcaption>
</figure>

### Performance measurement with Omniperf

The AMD research tool [Omniperf](https://github.com/AMDResearch/omniperf) is a powerful tool for measuring the performance of applications on AMD Instinct GPUs like the MI250X on Setonix. It can perform analyses such as [Roofline Analysis](https://en.wikipedia.org/wiki/Roofline_model). Load the Omniperf modules, using the module load commands from either the welcome letter or from Lesson 2. Then use Omniperf like this to make an analysis.

```bash
omniperf profile -n mat_mult -- mat_mult_profiling.exe -o mat_mult.csv
```

The collected hardware information is in a directory called **workloads/mat_mult**. You can view the output in text format using the command

```bash
omniperf analyze -p workloads/mat_mult/mi200 &> analysis.txt
```

or, if you have Omniperf installed on your laptop, you can see the results in your web browser. Download the **workloads** directory to your computer and run this command.

```bash
omniperf analyze -p workloads/mat_mult/mi200 --gui
```

Then you should be able to go to the location [http://127.0.0.1:8050](http://127.0.0.1:8050) and view the profiling information collected. An example data collection is in **course_material/L5_Profiling/omniperf_example**.

#### Roofline models with Omniperf

The **[Arithmetic intensity](https://en.wikipedia.org/wiki/Roofline_model#Arithmetic_intensity)** of an algorithm is the number of floating point operations (FLOPs) computed per byte transferred. It helps us gauge if an algorithm is likely to be constrained by either the bandwidth or floating point performance of a compute resource. In matrix multiplication the input matrix **A** is of size ($N_{0,C}, N_{1,A}$) and **B** is of size ($N_{1,A}, N_{1,C}$). Every element of matrix **C** requires $N_{1,A}$ loads from **A**, $N_{1,A}$ loads from **B**, and 1 store to **C**. It also requires $N_{1,A}$ multiplications and $N_{1,A}$ additions. The arithmetic intensity of matrix multiplication is then

$$ a = \frac{2N_{1,A}}{(2N_{1,A}+1)b} $$

where **b** is the number of bytes stored per element. When $N_{1,A}$ is large the theoretical arithmetic intensity for matrix multiplication is

$$ a \approx \frac{1}{b}. $$
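As a worked example, assume 32-bit floating point elements ($b=4$ bytes) and an inner dimension of $N_{1,A}=1024$ (a value chosen here purely for illustration, not taken from the example program). Then

$$ a = \frac{2 \times 1024}{(2 \times 1024 + 1) \times 4} = \frac{2048}{8196} \approx 0.2499 \mbox{ FLOP/byte}, $$

which is already very close to the large-$N_{1,A}$ limit of $1/b = 0.25$ FLOP/byte.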

If a processor has a peak floating point performance of $F_{P}$ FLOP/second, and a particular level of memory can feed that processor at a peak bandwidth of $B_{P}$ bytes/second, then we can calculate a floating point limit that is dependent on memory bandwidth.

$$F_{B} = a  \frac{\mbox{FLOP}}{\mbox{byte}} B_{P}\frac{\mbox{byte}}{\mbox{second}} = a B_{P} \frac{\mbox{FLOP}}{\mbox{second}}$$ 

The actual attainable floating point performance will be either $F_{B}$ or $F_{P}$, whichever is lower. If we set $F_{B} = F_{P}$ then we can solve for the crossover point in arithmetic intensity.

$$a_{0}=\frac{F_{P}}{B_{P}}$$

Therefore the limit (or roofline) on performance is as follows:

$$
F = \left \{
\begin{array}{rl}
aB_{P} & \mbox{if} \space a<\frac{F_{P}}{B_{P}},\\
F_{P}& \mbox{otherwise}
\end{array}
\right .
$$

For example, a single compute device in an AMD MI250X GPU processor has a peak 32-bit floating point processing rate of $F_{P} = 23.95$ TFLOPS and a peak memory bandwidth of $B_{P}=1.6$ TB/s from global memory. Problems will be constrained by memory bandwidth up to an arithmetic intensity of

$$a_{0}=\frac{23.95}{1.6} \approx 15$$
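The roofline formula is simple to evaluate directly. The sketch below uses hypothetical function names (`roofline_tflops` and `crossover_intensity`, which are not part of the course code), with floating point rates in TFLOP/s and bandwidth in TB/s so that the arithmetic intensity is in FLOP/byte.

```C++
#include <algorithm>
#include <cassert>

// Attainable floating point rate under the roofline model.
// a   : arithmetic intensity in FLOP/byte
// F_P : peak floating point rate in TFLOP/s
// B_P : peak memory bandwidth in TB/s
// Returns min(a * B_P, F_P) in TFLOP/s.
double roofline_tflops(double a, double F_P, double B_P) {
    assert(a >= 0.0 && F_P > 0.0 && B_P > 0.0);
    return std::min(a * B_P, F_P);
}

// Arithmetic intensity (FLOP/byte) at which the bandwidth limit
// meets the peak floating point limit.
double crossover_intensity(double F_P, double B_P) {
    assert(B_P > 0.0);
    return F_P / B_P;
}
```

With the figures quoted above, `roofline_tflops(0.25, 23.95, 1.6)` gives about 0.4 TFLOP/s, so a kernel with an arithmetic intensity of 0.25 FLOP/byte is limited by global memory bandwidth to a small fraction of the 23.95 TFLOP/s peak.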

Shown below is a roofline plot of **mat_mult_profiling.cpp**, showing the various rooflines for the L1 cache, L2 cache, and global memory (HBM). The crossover point for 32-bit floating point arithmetic and global memory is correctly situated around **a=15**. Performance when loading from cache appears to be close to optimal at the theoretical arithmetic intensity of a=0.25; however, significant gains in performance look possible if we can improve loads from main memory.

<figure style="margin-left:0; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/roofline_plot.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Roofline model created from mat_mult_profiling.exe</figcaption>
</figure>

If you just want **pdf** versions of the roofline models then run this command

```bash
omniperf profile -n mat_mult --roof-only -- mat_mult_profiling.exe
```

The pdf files will be available in **workloads/mat_mult/mi200**.

## Performance measurement with NVIDIA tools

HIP applications that use the CUDA backend (when CUDA is available and the environment variable `HIP_PLATFORM` is set to `nvidia`) have access to the NVIDIA performance measurement tools such as [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) and [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute). Here we briefly cover how to use these tools.

### Tracing with Nsight Systems

The command line application **nsys** can collect traces on **mat_mult_profiling.exe**. Make sure you have re-compiled **mat_mult_profiling.exe** with the `HIP_PLATFORM` environment variable set to `nvidia`.

In [12]:
!nsys profile -o nsys_trace/results mat_mult_profiling.exe

/bin/bash: line 1: nsys: command not found


Then you can use this command under Linux to view the application trace 

```bash
nsys-ui nsys_trace/results.nsys-rep
```

It is important to note that when using the NVIDIA backend, HIP is a **thin layer** over CUDA. NVIDIA performance tools will report usage for the underlying CUDA functions instead of the HIP labelled functions. For example, Nsight Systems will report a call to **cudaDeviceSynchronize** instead of **hipDeviceSynchronize**. One has to make the mental mapping between HIP and CUDA API calls.

### Hardware collection with Nsight Compute

Nsight Compute has the ability to collect hardware performance counters. However, this ability needs either administrator access or access granted to performance counters at the OS level. If this access is possible then the following command will collect hardware performance counters on **mat_mult_profiling.exe**.

In [13]:
!ncu -f -o ncu_counters/results mat_mult_profiling.exe

/bin/bash: line 1: ncu: command not found


Then you can run the following command to view the hardware performance counter information:

```bash
ncu-ui
```

## Summary

This chapter covers how to measure performance in HIP applications. HIP events are tools within the HIP framework to measure the execution time of kernels or memory copies. External tools such as `rocprof` can trace applications to collect information on **when** and for **how long** compute resources are used, as well as collecting low level information from hardware performance counters. Higher level tools like **Omnitrace** and **Omniperf** collect additional information and make the information obtained through rocprof more accessible through GUI-based reporting tools. Since HIP is a cross-platform environment, we conclude the chapter by walking through some performance monitoring tools that NVIDIA backends can make use of.

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the <a href="https://pawsey.org.au">Pawsey Supercomputing Centre</a>.<br>
</address>