# Measuring performance in HIP applications

Understanding how well a HIP application performs is a vital part of the development process. The two main techniques, **profiling** and **tracing**, collect information about where an application spends its time. **Profiling** is the statistical collection of the cumulative time that threads spend in each program component. **Tracing** records both **when** and **for how long** threads spend time in each application component. Since HIP applications use either an AMD or a CUDA backend, the profiling tools from each platform are available for use.
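
The difference between the two can be illustrated with a small host-only sketch. It is not how any particular profiler works internally (real profilers often sample rather than record every interval); the type **TraceRecord** and the function **profile_from_trace** are hypothetical names chosen for illustration. A trace keeps one record per interval, and a profile is what remains when those records are summed per component.

```C++
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One trace record: which component ran, when it started, and for how long.
// (Hypothetical type, for illustration only.)
struct TraceRecord {
    std::string component;
    double start_ms;
    double duration_ms;
};

// Collapse a trace into a profile: the cumulative time per component.
// The "when" information (start_ms) is discarded in the process.
std::map<std::string, double> profile_from_trace(
        const std::vector<TraceRecord>& trace) {

    std::map<std::string, double> profile;
    for (const TraceRecord& rec : trace) {
        assert(rec.duration_ms >= 0.0);
        profile[rec.component] += rec.duration_ms;
    }
    return profile;
}
```

A trace with two **memcpy** intervals and one **mat_mult** interval therefore reduces to a profile with two entries, one per component.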

## Event based profiling

Events in HIP are used to check the progress of work that has been submitted and to establish dependencies between workflows. They can also be used to time the execution of work such as kernels and memory copies. The code [mat_mult_profiling.cpp](mat_mult_profiling.cpp) contains a complete example where events are used to time the host-to-device memory copies as well as the matrix multiplication kernel. The data type **hipEvent_t** stores event data.

### Source code changes

In [mat_mult_profiling.cpp](mat_mult_profiling.cpp) we use the function **hipEventCreate** to create two events **t1** and **t2** as follows:

```C++
    // Create events for the memory copies and kernel runs
    hipEvent_t t1=0, t2=0;
    // Create the events
    H_ERRCHK(hipEventCreate(&t1));
    H_ERRCHK(hipEventCreate(&t2));
```

Now we wish to use these events to time the upload of host matrices **A_h** and **B_h** to the compute device. The HIP function **hipEventRecord** inserts the event into the "flow" of a stream. We haven't talked in depth about HIP streams yet and at this stage we can think of them as a queue to which work is submitted. Since we are not using a particular stream we are using the default stream (denoted by 0). We insert t1 into the default stream, perform the memory copies, and insert t2 after the copies are complete.

```C++
    // Start the recorder
    H_ERRCHK(hipEventRecord(t1,0));
    
    // Perform the memory copies
    H_ERRCHK(hipMemcpy(A_d, A_h, nbytes_A, hipMemcpyHostToDevice));
    H_ERRCHK(hipMemcpy(B_d, B_h, nbytes_B, hipMemcpyHostToDevice));
    
    // Stop the recorder
    H_ERRCHK(hipEventRecord(t2,0));
```

The function **hipEventSynchronize** blocks the host until an event has completed. Once both events are complete, the function **hipEventElapsedTime** returns the time elapsed between them. The helper function **h_get_event_time_ms** takes care of calling these functions, prints performance measurement information, and returns the number of milliseconds between the two events.

```C++
    // Total number of Bytes copied
    size_t total_bytes = nbytes_A + nbytes_B;

    // Get the elapsed time in milliseconds
    float elapsed_ms = h_get_event_time_ms(t1, t2, "memcpy", &total_bytes);
```

The source code of **h_get_event_time_ms** is in <a href="../include/hip_helper.hpp">hip_helper.hpp</a> and reproduced below:

```C++
// Get how much time elapsed between two events that were recorded
float h_get_event_time_ms(
        // Assumes start and stop events have been recorded
        // with the hipEventRecord() function
        hipEvent_t t1,
        hipEvent_t t2,
        const char* message, 
        size_t* nbytes) {
    
    // Make sure the stop and start events have finished
    H_ERRCHK(hipEventSynchronize(t2));
    H_ERRCHK(hipEventSynchronize(t1));

    // Elapsed time in milliseconds
    float elapsed_ms=0;

    // Convert the time into milliseconds
    H_ERRCHK(hipEventElapsedTime(&elapsed_ms, t1, t2));
        
    // Print the timing message if necessary
    if ((message != NULL) && (strlen(message)>0)) {
        std::printf("Time for event \"%s\": %.3f ms", message, elapsed_ms);
        
        // Print transfer rate if nbytes is not NULL
        if (nbytes != NULL) {
            double io_rate_MBs = h_get_io_rate_MBs(
                elapsed_ms, 
                *nbytes
            );
            std::printf(" (%.2f MB/s)", io_rate_MBs);
        }
        std::printf("\n");
    }
    
    return elapsed_ms;
}
```
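
The helper **h_get_io_rate_MBs** is called above but not reproduced here. The following is a minimal sketch of what such a conversion could look like, assuming that 1 MB means 10^6 bytes; the real helper in hip_helper.hpp may use a different convention. The name **io_rate_MBs_sketch** is hypothetical.

```C++
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for h_get_io_rate_MBs.
// Converts an elapsed time in milliseconds and a byte count
// into a transfer rate in MB/s, assuming 1 MB = 10^6 bytes.
double io_rate_MBs_sketch(float elapsed_ms, std::size_t nbytes) {
    // A rate is only defined for a positive elapsed time
    assert(elapsed_ms > 0.0f);

    const double seconds = static_cast<double>(elapsed_ms) / 1.0e3;
    const double megabytes = static_cast<double>(nbytes) / 1.0e6;
    return megabytes / seconds;
}
```

Under this assumption, copying 1,000,000 bytes in 1 ms corresponds to a rate of about 1000 MB/s.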

We can reuse the events to time the execution of the kernel. 

```C++
    // Record the kernel timer
    H_ERRCHK(hipEventRecord(t1,0));

    // Launch the kernel using hipLaunchKernelGGL method
    hipLaunchKernelGGL(mat_mult, 
            grid_nblocks, 
            block_size, sharedMemBytes, 0, 
            A_d, B_d, C_d,
            N1_A,
            N0_C,
            N1_C
    );

    // Stop the kernel timer
    H_ERRCHK(hipEventRecord(t2,0));

    // Get the elapsed time in milliseconds
    elapsed_ms = h_get_event_time_ms(t1, t2, "mat_mult kernel", NULL);
```



In this manner we instrument the uploads, downloads, and kernel execution in the source file [mat_mult_profiling.cpp](mat_mult_profiling.cpp). Now we run the instrumented code and view the timing results.

In [2]:
!make mat_mult_profiling.exe; ./mat_mult_profiling.exe

make: 'mat_mult_profiling.exe' is up to date.
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
Time for event "memcpy": 0.476 ms (3341.60 MB/s)
Time for event "mat_mult kernel": 5.442 ms
Maximum error (infinity norm) is: 2.28882e-05
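
A kernel time such as the one printed above can be converted into an arithmetic throughput figure, which is easier to compare between devices. The following is a minimal sketch, assuming that **mat_mult** computes a dense matrix product in which C has N0_C rows and N1_C columns and the shared dimension is N1_A, so that the kernel performs roughly 2*N0_C*N1_C*N1_A floating point operations. The function name **matmul_gflops_sketch** is hypothetical and is not part of hip_helper.hpp.

```C++
#include <cassert>
#include <cstddef>

// Estimate the throughput of a dense matrix multiplication in GFLOP/s.
// Assumes one multiply and one add per term of each dot product.
double matmul_gflops_sketch(
        float elapsed_ms,
        std::size_t N0_C,
        std::size_t N1_C,
        std::size_t N1_A) {

    // Throughput is only defined for a positive elapsed time
    assert(elapsed_ms > 0.0f);

    // Each of the N0_C*N1_C elements of C needs N1_A multiply-adds
    const double flops = 2.0
        * static_cast<double>(N0_C)
        * static_cast<double>(N1_C)
        * static_cast<double>(N1_A);

    const double seconds = static_cast<double>(elapsed_ms) / 1.0e3;
    return flops / seconds / 1.0e9;
}
```

The value returned by **h_get_event_time_ms** for the kernel could be passed in as `elapsed_ms`, together with the matrix dimensions used at launch.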


## Performance measurement with AMD tools

#### HSA application traces

```bash
rocprof --hsa-trace -o amd_profiles/result.csv ./mat_mult_profiling.exe
```

Along with the CSV output, the trace is written to the file **result.json**. To view it, open Google Chrome and go to the address

chrome://tracing

then load **result.json**.
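
Files loaded by chrome://tracing follow the Chrome Trace Event Format, in which each timed interval is a JSON object with a name, a phase, a start timestamp, and a duration, both in microseconds. The sketch below formats one such "complete" event. It is an illustration of the file format only, not of rocprof's implementation, and the fields that rocprof actually writes may differ. The function name **chrome_complete_event** is hypothetical.

```C++
#include <cassert>
#include <string>

// Format one "complete" event (phase "X") for chrome://tracing.
// Timestamps and durations are given in microseconds.
// Note: the name is not JSON-escaped, so it must not contain
// quotes or backslashes.
std::string chrome_complete_event(
        const std::string& name,
        long long start_us,
        long long duration_us,
        int pid,
        int tid) {

    assert(duration_us >= 0);

    return std::string("{\"name\":\"") + name
        + "\",\"ph\":\"X\""
        + ",\"ts\":" + std::to_string(start_us)
        + ",\"dur\":" + std::to_string(duration_us)
        + ",\"pid\":" + std::to_string(pid)
        + ",\"tid\":" + std::to_string(tid)
        + "}";
}
```

A list of such objects, separated by commas and enclosed in square brackets, forms a file that the viewer can display as a timeline.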

#### Performance counters

The tool **rocprof** can also collect hardware performance counters for kernels. We specify the list of counters, and the kernels they should apply to, in the file **[rocprof_counters.txt](rocprof_counters.txt)**.

```bash
rocprof -i rocprof_counters.txt --timestamp on --stats -o amd_profiles/result.csv ./mat_mult_profiling.exe
```

## Performance measurement with NVIDIA tools

### Profiling with nvprof

Historically, NVIDIA's [NVVP](http://uob-hpc.github.io/2015/05/27/nvvp-import-opencl.html) offered limited functionality for profiling OpenCL events. However, profiling support for OpenCL has largely disappeared with the introduction of Nsight Compute and Nsight Systems.




## Performance measurement with open-source tools

### Tau

[Tau](https://www.cs.uoregon.edu/research/tau/home.php) is a commonly used open-source profiling and tracing toolkit for HPC applications. For OpenCL applications it provides both profiling and tracing functionality.

#### Profiling

The Tau application **tau_exec** can be used to collect profiling information. Profiling information can then be visualised with the Tau applications **paraprof** (GUI), or **pprof** (command-line).

We set the environment variable **PROFILEDIR=./tau** to tell Tau where to put files.

In [5]:
%env PROFILEDIR=./tau

env: PROFILEDIR=./tau


Then we use the following call to **tau_exec** to collect profiling information for OpenCL calls.

In [10]:
!tau_exec -T serial -opencl ./mat_mult_profiling.exe

	               name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz 
	 global memory size: 16468 MB
	    max buffer size: 8234 MB
	     max local size: (8192,8192,8192)
	     max work-items: 8192
device id: -1629247368.
command id: 94891404868872.
vendor id: 0.
Got a bogus start! 2 .TAU application
Time for event "Uploading Buffer A": 0.216 ms (2461.77 MB/s)
Time for event "Uploading Buffer B": 0.323 ms (3268.10 MB/s)
Time for event "Running kernel": 53.618 ms
Time for event "Downloading Buffer C": 0.456 ms (4706.89 MB/s)


Now have a look at the contents of the tau directory.

In [11]:
!ls ./tau

events.0.edf   profile.0.0.2  tau.trc		  tautrace.0.0.2.trc
profile.0.0.0  profile.txt    tautrace.0.0.0.trc  trace.json
profile.0.0.1  tau.edf	      tautrace.0.0.1.trc


Use the Tau application **pprof** to get a text mode profile of the app.

In [12]:
!pprof > ./tau/profile.txt

We see from the profile that the call to **mat_mult** took approximately **12 ms**. This is similar to what was measured from the profiling interface.

#### Tracing with Google Chrome tracing

For tracing we set the environment variables **TRACEDIR=./tau** and **TAU_TRACE=1**.

In [13]:
%env TAU_TRACE=1
%env TRACEDIR=./tau

env: TAU_TRACE=1
env: TRACEDIR=./tau


Capture OpenCL information as before.

In [14]:
!tau_exec -T serial -opencl ./mat_mult_profiling.exe

	               name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz 
	 global memory size: 16468 MB
	    max buffer size: 8234 MB
	     max local size: (8192,8192,8192)
	     max work-items: 8192
device id: 2022348632.
command id: 94684603067064.
vendor id: 0.
Got a bogus start! 2 .TAU application
Time for event "Uploading Buffer A": 0.180 ms (2955.50 MB/s)
Time for event "Uploading Buffer B": 0.302 ms (3497.29 MB/s)
Time for event "Running kernel": 52.092 ms
Time for event "Downloading Buffer C": 0.499 ms (4305.15 MB/s)


Now merge the trace files and convert the result into a downloadable JSON document.

In [15]:
!cd tau; echo 'y' | tau_treemerge.pl

/opt/tau/2.31.1/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.0.edf events.0.edf tautrace.0.0.0.trc tautrace.0.0.1.trc tautrace.0.0.2.trc tau.trc
tau.trc exists; override [y]? tautrace.0.0.0.trc: 418 records read.
tautrace.0.0.1.trc: 6 records read.
tautrace.0.0.2.trc: 41 records read.


In [16]:
!cd tau; tau_trace2json ./tau.trc ./tau.edf -chrome -ignoreatomic -o trace.json

<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/Chrome_trace.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Google Chrome tracing render.</figcaption>
</figure>

In this instance we see that the kernel **mat_mult** has taken approximately 22 ms to complete.

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>