# Measuring performance in HIP applications

Understanding how well a HIP application performs is a vital part of the development process. Two main techniques, **profiling** and **tracing**, collect information about application performance. **Profiling** is the statistical collection of the cumulative time that threads spend in each program component. **Tracing** records both **when** and **for how long** threads spend time in each application component. Since HIP applications use either an AMD or a CUDA backend, the profiling tools from each platform are available for use.
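
To make the distinction concrete, here is a minimal sketch in plain C++ (no HIP required). The names **TraceRecord** and **profile_from_trace** are hypothetical and exist only for this illustration. A trace keeps a start time and a duration for every call, whereas a profile keeps only the cumulative time per component.

```C++
#include <map>
#include <string>
#include <vector>

// One trace record: which component ran, when it started, and for how long.
// (Hypothetical type, for illustration only.)
struct TraceRecord {
    std::string name;
    double start_ms;
    double duration_ms;
};

// Collapse a trace into a profile: cumulative time per component.
// The start times are discarded, which is exactly the information
// that a trace retains and a profile does not.
std::map<std::string, double> profile_from_trace(
        const std::vector<TraceRecord>& trace) {

    std::map<std::string, double> profile;
    for (const TraceRecord& rec : trace) {
        profile[rec.name] += rec.duration_ms;
    }
    return profile;
}
```

A trace with two **memcpy** records and one **mat_mult** record collapses to a profile with two entries, one per component, each holding the summed duration.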

## Event-based profiling

Events in HIP are used to check the progress of work that has been submitted and to establish dependencies between workflows. They can also be used to time the execution of work such as kernels and memory copies. The code [mat_mult_profiling.cpp](mat_mult_profiling.cpp) contains a complete example where events are used to time the host to device memory copy as well as the matrix multiplication kernel. The data type **hipEvent_t** stores event data.

### Source code changes

In [mat_mult_profiling.cpp](mat_mult_profiling.cpp) we use the function **hipEventCreate** to create two events **t1** and **t2** as follows:

```C++
    // Create events for the memory copies and kernel runs
    hipEvent_t t1=0, t2=0;
    // Create the events
    H_ERRCHK(hipEventCreate(&t1));
    H_ERRCHK(hipEventCreate(&t2));
```

Now we wish to use these events to time the upload of host matrices **A_h** and **B_h** to the compute device. The HIP function **hipEventRecord** inserts the event into the "flow" of a stream. We haven't talked in depth about HIP streams yet and at this stage we can think of them as a queue to which work is submitted. Since we are not using a particular stream we are using the default stream (denoted by 0). We insert t1 into the default stream, perform the memory copies, and insert t2 after the copies are complete.

```C++
    // Record the start event into the default stream
    H_ERRCHK(hipEventRecord(t1,0));
    
    // Perform the memory copies
    H_ERRCHK(hipMemcpy(A_d, A_h, nbytes_A, hipMemcpyHostToDevice));
    H_ERRCHK(hipMemcpy(B_d, B_h, nbytes_B, hipMemcpyHostToDevice));
    
    // Record the stop event into the default stream
    H_ERRCHK(hipEventRecord(t2,0));
```

The function **hipEventSynchronize** waits until events are complete. Then we can use the function **hipEventElapsedTime** to get the time elapsed between the two events. The helper function **h_get_event_time_ms** takes care of calling these functions, prints performance measurement information, and returns the number of milliseconds between the two events.

```C++
    // Total number of Bytes copied
    size_t total_bytes = nbytes_A + nbytes_B;

    // Get the elapsed time in milliseconds
    float elapsed_ms = h_get_event_time_ms(t1, t2, "memcpy", &total_bytes);
```

The source code of **h_get_event_time_ms** is in <a href="../include/hip_helper.hpp">hip_helper.hpp</a> and reproduced below:

```C++
// Get how much time elapsed between two events that were recorded
float h_get_event_time_ms(
        // Assumes start and stop events have been recorded
        // with the hipEventRecord() function
        hipEvent_t t1,
        hipEvent_t t2,
        const char* message, 
        size_t* nbytes) {
    
    // Make sure the stop and start events have finished
    H_ERRCHK(hipEventSynchronize(t2));
    H_ERRCHK(hipEventSynchronize(t1));

    // Elapsed time in milliseconds
    float elapsed_ms=0;

    // Convert the time into milliseconds
    H_ERRCHK(hipEventElapsedTime(&elapsed_ms, t1, t2));
        
    // Print the timing message if necessary
    if ((message != NULL) && (strlen(message)>0)) {
        std::printf("Time for event \"%s\": %.3f ms", message, elapsed_ms);
        
        // Print transfer rate if nbytes is not NULL
        if (nbytes != NULL) {
            double io_rate_MBs = h_get_io_rate_MBs(
                elapsed_ms, 
                *nbytes
            );
            std::printf(" (%.2f MB/s)", io_rate_MBs);
        }
        std::printf("\n");
    }
    
    return elapsed_ms;
}
```

We can reuse the events to time the execution of the kernel. 

```C++
    // Record the start event into the default stream
    H_ERRCHK(hipEventRecord(t1,0));

    // Launch the kernel using hipLaunchKernelGGL method
    hipLaunchKernelGGL(mat_mult, 
            grid_nblocks, 
            block_size, sharedMemBytes, 0, 
            A_d, B_d, C_d,
            N1_A,
            N0_C,
            N1_C
    );

    // Record the stop event into the default stream 
    H_ERRCHK(hipEventRecord(t2,0));

    // Get the elapsed time in milliseconds
    elapsed_ms = h_get_event_time_ms(t1, t2, "mat_mult kernel", NULL);
```



In this manner we instrument the uploads, downloads, and kernel execution in the source file [mat_mult_profiling.cpp](mat_mult_profiling.cpp). Now we run the instrumented code and view the timing results. Change directory to **L5_Profiling** and run the following code.

In [6]:
!make mat_mult_profiling.exe; ./mat_mult_profiling.exe

make: 'mat_mult_profiling.exe' is up to date.
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
Time for event "memcpy": 0.506 ms (3141.97 MB/s)
Time for event "mat_mult kernel": 2.876 ms
Maximum error (infinity norm) is: 2.28882e-05


## Performance measurement with AMD tools

The AMD profiler **rocprof** can collect application traces and information from hardware performance counters.

### HIP application traces with rocprof

An application trace records when functions execute and how long they take. HIP application traces are collected with **rocprof** using the **--hip-trace** flag. Tracing with **rocprof** only seems to work with the **AMD** backend at present.

In [7]:
!rocprof --hip-trace -o rocprof_trace/result.csv ./mat_mult_profiling.exe

RPL: on '230327_162637' from '/opt/rocm-5.4.1' in '/home/toby/Pelagos/Projects/HIP_Course/course_material/L5_Profiling'
RPL: profiling '"./mat_mult_profiling.exe"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_230327_162637_23484'
RPL: result dir '/tmp/rpl_data_230327_162637_23484/input_results_230327_162637'
ROCtracer (23510):
    HIP-trace(*)
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
Time for event "memcpy": 0.385 ms (4129.99 MB/s)
Time for event "mat_mult kernel": 1.665 ms
Maximum error (infinity norm) is: 2.28882e-05
hsa_copy_deps: 0
scan ops data 3:4                                           

Inside the **rocprof_trace** folder you will find the following files:

| file | purpose |
| --- | --- |
| profile.sysinfo.txt | System information on available devices |
| profile.copy_stats.csv | Statistics on all IO calls |
| profile.hip_stats.csv | Statistics on non-IO HIP function calls |
| profile.stats.csv | Statistics on all kernel calls |
| profile.db | SQLITE3 database of profiling information |
| profile.json | Trace information in JSON format |

We can view the trace file in a web browser. The Perfetto UI at the following site provides a user interface for viewing trace information.

[https://ui.perfetto.dev/](https://ui.perfetto.dev/)

Download the trace file **profile.json** to your computer and open it with the Perfetto UI in your web browser.

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/Perfetto_UI.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Viewing rocprof application traces with Perfetto UI.</figcaption>
</figure>

If you zoom in (using the "wasd" keys) you can see calls in GPU threads, COPY threads and HOST threads on the CPU. Notice how the **hipEventRecord** function is executed before and after the **hipMemcpy** calls and the **mat_mult** kernel execution. If you click on the **mat_mult** function you can see how long the kernel took to execute.

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/Perfetto_UI_kernel.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Determining the time for a kernel call</figcaption>
</figure>


In general, the files generated with a trace follow this naming pattern:

|File|Purpose|
|:--- | :--- |
|\<experiment_name\>.copy_stats.csv| Profile of all IO operations such as performed by **hipMemcpy**. |
|\<experiment_name\>.db| SQLite database of profiling results |
|\<experiment_name\>.hip_stats.csv | Profile of all HIP calls such as **hipDeviceSynchronize** |
|\<experiment_name\>.stats.csv | Profile of all kernel calls such as **mat_mult** |
|\<experiment_name\>.json | Application trace to be viewed with Chrome tracing or Perfetto |
|\<experiment_name\>.sysinfo.txt | System information obtained through rocminfo. |

### Hardware performance counters with rocprof

Hardware performance counters are special-purpose registers in a processor that count events, such as the number of wavefronts executed or the number of cache misses. Rocprof can collect performance counters on kernels. The derived metrics that can be captured are listed with this command:

In [1]:
!rocprof --list-derived

RPL: on '230329_173950' from '/opt/rocm-5.4.1' in '/home/toby/Pelagos/Projects/HIP_Course/course_material/L5_Profiling'
Derived metrics:

  gpu-agent0 : GPU_UTIL : Percentage of the time that GUI is active
      GPU_UTIL = 100*GRBM_GUI_ACTIVE/GRBM_COUNT

  gpu-agent0 : CP_UTIL : Percentage of the GRBM_GUI_ACTIVE time that any of the Command Processor (CPG/CPC/CPF) blocks are busy
      CP_UTIL = 100*GRBM_CP_BUSY/GRBM_GUI_ACTIVE

  gpu-agent0 : SPI_UTIL : Percentage of the GRBM_GUI_ACTIVE time that any of the Shader Pipe Interpolators (SPI) are busy in the shader engine(s)
      SPI_UTIL = 100*GRBM_SPI_BUSY/GRBM_GUI_ACTIVE

  gpu-agent0 : TA_UTIL : Percentage of the GRBM_GUI_ACTIVE time that any of the Texture Pipes (TA) are busy in the shader engine(s).
      TA_UTIL = 100*GRBM_TA_BUSY/GRBM_GUI_ACTIVE

  gpu-agent0 : GDS_UTIL : Percentage of the GRBM_GUI_ACTIVE time that the Global Data Share (GDS) is busy.
      GDS_UTIL = 100*GRBM_GDS_BUSY/GRBM_GUI_ACTIVE

  gpu-agent0 : EA_UTIL : Pe

We can specify the counters to collect in a file such as [rocprof_counters.txt](rocprof_counters.txt). Here we specify some commonly used metrics for collection. Each **pmc** line is a unique experiment involving an individual run of the code. In this example we collect stats for the **mat_mult** kernel, for kernel dispatches in the range 0 to 64, on GPU 0.

```txt
# Cache hits and Cache misses
pmc: TCC_HIT_sum, TCC_MISS_sum

# Total video memory fetched and written
pmc: FETCH_SIZE, WRITE_SIZE

# Percentage of time the GPU was busy, total wavefronts executed
pmc: GPUBusy, Wavefronts

# Average number of vector and scalar instructions executed per work-item
pmc: VALUInsts, SALUInsts

# Average number of vector and scalar fetch instructions per work-item
pmc: VFetchInsts, SFetchInsts

# Average number of vector write instructions per work-item
pmc: VWriteInsts

# Average number of LDS (shared memory) and GDS (global data share) read or write instructions per work-item
pmc: LDSInsts, GDSInsts

# Percentage of active vector ALU threads in a wave, percentage of GPU time vector and scalar instructions are processed
pmc: VALUUtilization, VALUBusy, SALUBusy

# Percentage of fetch, write, atomic, and other instructions that hit the L2 cache
pmc: L2CacheHit

# Percentage of time the memory unit is active (including stalled), and just stalled, percentage of time the write unit is stalled
pmc: MemUnitBusy, MemUnitStalled, WriteUnitStalled

# Percentage of time ALU's are stalled by shared memory access, percentage of GPU time local memory is stalled by bank conflicts
pmc: ALUStalledByLDS, LDSBankConflict

# Dispatch range, which kernel dispatches to profile
range: 0 : 64
# Which GPU's to profile
gpu: 0
# Names of kernels to profile
kernel: mat_mult
```

Then we can use rocprof to collect the data for these counters.

In [4]:
!rocprof -i rocprof_counters.txt -o rocprof_counters/result.csv ./mat_mult_profiling.exe

RPL: on '230329_174856' from '/opt/rocm-5.4.1' in '/home/toby/Pelagos/Projects/HIP_Course/course_material/L5_Profiling'
RPL: profiling '"./mat_mult_profiling.exe"'
RPL: input file 'rocprof_counters.txt'
RPL: output dir '/tmp/rpl_data_230329_174856_84187'
RPL: result dir '/tmp/rpl_data_230329_174856_84187/input0_results_230329_174856'
ROCProfiler: input from "/tmp/rpl_data_230329_174856_84187/input0.xml"
  gpu_index = 0
  kernel = mat_mult
  range = 0:64
  2 metrics
    TCC_HIT_sum, TCC_MISS_sum
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
Time for event "memcpy": 1.370 ms (1160.10 MB/s)
Time for event "m

If your chosen performance counters are supported, then the file [rocprof_counters/result.csv](rocprof_counters/result.csv) should contain the collected counter values for each profiled kernel dispatch. The file [rocprof_counters/example.csv](rocprof_counters/example.csv) is an example file collected with rocprof on **mat_mult_profiling.exe**. This [page](https://docs.amd.com/bundle/ROCProfiler-User-Guide-v5.1/page/rocprof_Command_Line_Tool.html) has information on what the keys in the CSV file mean.

### Rocprof under a job manager

Rocprof runs fine under a job manager like SLURM; you just need to make a separate output file for each process launched. For example, on SLURM the **SLURM_JOBID** and **SLURM_PROCID** environment variables are helpful in separating the output. Put the rocprof commands in a script called **profile.sh**.

```bash
#!/bin/bash
rocprof -i rocprof_counters.txt -o rocprof_counters/result-$SLURM_JOBID-$SLURM_PROCID.csv ./mat_mult_profiling.exe
```

Then you can run the script from **srun** like this so it picks up the environment variable **$SLURM_PROCID** from within the script.

```bash
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./profile.sh
```

### Rocprofiler API

If you'd like to instrument code with profiling calls, the **[rocprofiler API](https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/doc/rocprofiler_spec.md)** is available.

### Tracing with Omnitrace

[Omnitrace](https://github.com/AMDResearch/omnitrace) is an AMD research project to collect performance information on a program at runtime. It supports programs written in C, C++, Fortran and Python, as well as compute frameworks like OpenCL and HIP. We load the Omnitrace module using something like: 

```bash
module load omnitrace/version
```

Then we can use Omnitrace to make a trace of **mat_mult_profiling.exe**.

```bash
cd omni_trace
omnitrace -- ../mat_mult_profiling.exe
```

If you look in the sub folder **omni_trace/omnitrace-mat_mult_profiling-output** there is a folder with the date of the trace. Download the **.proto** file and open it with [ui.perfetto.dev](https://ui.perfetto.dev) in a similar way to the json trace files from rocprof. You should see when and for how long functions are executed on the host and for how long kernels are executed on the device, along with a more detailed set of metrics such as CPU frequency and power consumption.

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/omnitrace.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Examining the output from Omnitrace using <a href="https://ui.perfetto.dev">ui.perfetto.dev</a></figcaption>
</figure>

### Performance measurement with Omniperf

[Omniperf](https://github.com/AMDResearch/omniperf) is a powerful AMD research tool for measuring the performance of applications on AMD Instinct GPUs like the MI250X on Setonix. It can perform analyses such as [Roofline Analysis](https://en.wikipedia.org/wiki/Roofline_model).

## Performance measurement with NVIDIA tools

HIP applications that use the CUDA backend (i.e. compiled with HIP_PLATFORM=nvidia) have access to the NVIDIA performance measurement tools such as [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) and [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute). Here we briefly cover how to use these tools.

### Tracing with Nsight Systems

The command line application **nsys** can collect traces on **mat_mult_profiling.exe**.

In [13]:
!nsys profile -o nsys_trace/results ./mat_mult_profiling.exe

Try the 'nsys status --environment' command to learn more.

Try the 'nsys status --environment' command to learn more.

Device id: 0
	name:                                    NVIDIA GeForce RTX 3060 Laptop GPU
	global memory size:                      6226 MB
	available registers per block:           65536 
	maximum shared memory size per block:    49 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,64)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65535,65535)
Time for event "memcpy": 0.152 ms (10464.39 MB/s)
Time for event "mat_mult kernel": 0.452 ms
Maximum error (infinity norm) is: 2.28882e-05
Generating '/tmp/nsys-report-ce04.qdstrm'
Failed to create '/home/toby/Pelagos/Projects/HIP_Course/course_material/L5_Profiling/nsys_trace/results.nsys-rep': File exists.
Use `--force-overwrite true` to overwrite existing files.
Generated:
    /tmp/nsys-report-077a.nsys-rep


Then you can use this command under Linux to view the application trace:

```bash
nsys-ui nsys_trace/results.nsys-rep
```

When using the NVIDIA backend it is important to note that HIP is a thin layer over CUDA. NVIDIA performance tools report the underlying CUDA functions instead of the HIP-labelled functions. For example, Nsight Systems will report a call to **cudaDeviceSynchronize** instead of **hipDeviceSynchronize**. One has to make the mental mapping between HIP and CUDA.
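
As a rule of thumb, many HIP runtime functions correspond to a CUDA runtime function whose name differs only in the prefix, as in the **hipDeviceSynchronize** example above. The sketch below encodes that heuristic in a hypothetical helper, **guess_cuda_name**. It is only a hint for searching a trace: not every HIP call has a counterpart of this form, so check the result against the HIP documentation.

```C++
#include <string>

// Rule of thumb only: swap a leading "hip" for "cuda" to guess the CUDA
// runtime name that an NVIDIA tool is likely to report. Names that do
// not start with "hip" (such as kernel names) are returned unchanged.
// (Hypothetical helper, for illustration only.)
std::string guess_cuda_name(const std::string& hip_name) {

    const std::string hip_prefix = "hip";

    if (hip_name.compare(0, hip_prefix.size(), hip_prefix) == 0) {
        return "cuda" + hip_name.substr(hip_prefix.size());
    }

    return hip_name;
}
```

With this heuristic **hipEventRecord** becomes **cudaEventRecord**, while a kernel name such as **mat_mult** is left as it is.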

### Hardware performance counters with Nsight Compute

Nsight Compute can collect hardware performance counters; however, this needs either administrator access or access to performance counters granted at the OS level. If this access is available then the following command will collect hardware performance counters on **mat_mult_profiling.exe**.

In [7]:
!ncu -f -o ncu_counters/results ./mat_mult_profiling.exe

==PROF== Connected to process 9050 (/home/toby/Pelagos/Projects/HIP_Course/course_material/L5_Profiling/mat_mult_profiling.exe)
Device id: 0
	name:                                    NVIDIA GeForce RTX 3060 Laptop GPU
	global memory size:                      6226 MB
	available registers per block:           65536 
	maximum shared memory size per block:    49 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,64)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65535,65535)
Time for event "memcpy": 0.140 ms (11341.40 MB/s)
==PROF== Profiling "mat_mult" - 0: 0%....50%....100% - 9 passes
Time for event "mat_mult kernel": 200.655 ms
Maximum error (infinity norm) is: 2.28882e-05
==PROF== Disconnected from process 9050
==PROF== Report: /home/toby/Pelagos/Projects/HIP_Course/course_material/L5_Profiling/ncu_counters/results.ncu-rep


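Note that the event-based time for the **mat_mult** kernel in this output (200.655 ms) is inflated by the profiler, which replays the kernel over several passes to collect the counters. Then you can run the following command to view the hardware performance counter information:

```bash
ncu-ui
```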

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>