# Measuring performance in OpenCL applications

Understanding how well an OpenCL application performs is a vital part of the development process. The two main techniques, **profiling** and **tracing**, collect information about how well an application is performing. **Profiling** is the statistical collection of the cumulative time that threads spend in each program component. **Tracing** records both **when** and **for how long** threads spend time in each application component. Although many vendors have largely abandoned their OpenCL performance measurement tools, the OpenCL standard itself provides a profiling interface, and a few open-source and commercial tools are still available.

## Event based profiling

Command queues can capture timing information for the commands they process. To time commands submitted to an OpenCL command queue we enable the profiling flag **CL_QUEUE_PROFILING_ENABLE** during command queue creation. The elapsed time may then be extracted directly from profiling events. In the code [mat_mult_profiling.cpp](mat_mult_profiling.cpp) we set the profiling flag to CL_TRUE.

```C++
    // mat_mult_profiling.cpp source

    // Do we enable profiling?
    cl_bool profiling = CL_TRUE;
```

Then from within **h_create_command_queues** in <a href="../include/cl_helper.hpp">cl_helper.hpp</a>, the profiling flag CL_QUEUE_PROFILING_ENABLE is incorporated into the command queue properties and passed to [clCreateCommandQueue](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clCreateCommandQueue.html). Note that clCreateCommandQueue is deprecated as of OpenCL 2.0 in favour of clCreateCommandQueueWithProperties, so the compiler may warn about its use.

```C++
    // cl_helper.hpp source

    // Manage bit fields for the command queue properties
    if (profiling_enable == CL_TRUE) {
        queue_properties = queue_properties | CL_QUEUE_PROFILING_ENABLE;    
    }

    // Allocate memory for the command queues
    cl_command_queue *command_queues = (cl_command_queue*)calloc(num_command_queues, sizeof(cl_command_queue));

    // Fill command queues in a Round-Robin fashion
    for (cl_uint n=0; n<num_command_queues; n++) {
        command_queues[n] = clCreateCommandQueue(
            contexts[n % num_devices],
            devices[n % num_devices],
            queue_properties,
            &errcode    
        );
        h_errchk(errcode, "Creating a command queue");        
    }
```

The function [clGetEventProfilingInfo](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clGetEventProfilingInfo.html) extracts information such as the start and end times (in nanoseconds) for an OpenCL event associated with a queued command. We use the helper function **h_get_event_time_ms** in <a href="../include/cl_helper.hpp">cl_helper.hpp</a> to extract the elapsed time in milliseconds.

```C++

// cl_helper.hpp source

cl_double h_get_event_time_ms(
        cl_event *event, 
        const char* message, 
        size_t* nbytes) {
    
    // Make sure the event has finished
    h_errchk(clWaitForEvents(1, event), message);
    
    // Start and end times
    cl_ulong t1, t2;
        
    // Fetch the start and end times in nanoseconds
    h_errchk(
        clGetEventProfilingInfo(
            *event,
            CL_PROFILING_COMMAND_START,
            sizeof(cl_ulong),
            &t1,
            NULL
        ),
        "Fetching start time for event"
    );

    h_errchk(
        clGetEventProfilingInfo(
            *event,
            CL_PROFILING_COMMAND_END,
            sizeof(cl_ulong),
            &t2,
            NULL
        ),
        "Fetching end time for event"
    );
    
    // Convert the time into milliseconds
    cl_double elapsed = (cl_double)(t2-t1)*(cl_double)1.0e-6;
        
    // Print the timing message if necessary
    if (strlen(message)>0) {
        std::printf("Time for event \"%s\": %.3f ms", message, elapsed);
        
        // Print transfer rate if nbytes is specified
        if (nbytes != NULL) {
            cl_double io_rate_MBs = h_get_io_rate_MBs(
                elapsed, 
                *nbytes
            );
            std::printf(" (%.2f MB/s)", io_rate_MBs);
        }
        std::printf("\n");
    }
    
    return elapsed;
}
```
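
The helper **h_get_io_rate_MBs** called above is not listed here. As a rough sketch of what such a conversion might look like, the function below turns an elapsed time in milliseconds and a byte count into a transfer rate. The name `io_rate_MBs_sketch` is hypothetical, and the choice of 1 MB = 10^6 bytes is an assumption; the real helper in cl_helper.hpp may differ.

```C++
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for h_get_io_rate_MBs. The real helper lives in
// cl_helper.hpp and may define a megabyte differently.
double io_rate_MBs_sketch(double elapsed_ms, std::size_t nbytes) {
    // A negative elapsed time indicates a bug in the caller
    assert(elapsed_ms >= 0.0);

    // Avoid dividing by zero when the timer did not advance
    if (elapsed_ms == 0.0) {
        return 0.0;
    }

    // Assumption: 1 MB = 10^6 bytes
    double megabytes = static_cast<double>(nbytes) * 1.0e-6;
    double seconds = elapsed_ms * 1.0e-3;
    return megabytes / seconds;
}
```

Under this definition, 1,000,000 bytes moved in 1 ms corresponds to 1000 MB/s.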

Every command submitted to a command queue may have an event associated with it. 
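
Besides the start and end times, clGetEventProfilingInfo can also report when a command was queued (CL_PROFILING_COMMAND_QUEUED) and when it was submitted to the device (CL_PROFILING_COMMAND_SUBMIT). The sketch below shows how those four timestamps might be turned into a breakdown of waiting time and execution time. It deliberately uses plain 64-bit integers instead of OpenCL types so that it stands alone; the struct and function names are hypothetical and are not part of cl_helper.hpp.

```C++
#include <cassert>
#include <cstdint>

// Hypothetical container for the four event timestamps, in nanoseconds,
// as they would be fetched with clGetEventProfilingInfo
struct EventTimestamps {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submit;  // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// Hypothetical breakdown of an event's lifetime, in milliseconds
struct EventBreakdownMs {
    double queue_wait;   // from enqueue to submission to the device
    double launch_wait;  // from submission to the start of execution
    double execution;    // from start to end of execution
};

EventBreakdownMs breakdown_ms(const EventTimestamps& t) {
    // This sketch assumes the timestamps arrive in chronological order
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);

    // Nanoseconds to milliseconds
    const double ns_to_ms = 1.0e-6;

    return EventBreakdownMs{
        static_cast<double>(t.submit - t.queued) * ns_to_ms,
        static_cast<double>(t.start - t.submit) * ns_to_ms,
        static_cast<double>(t.end - t.start) * ns_to_ms
    };
}
```

A large **queue_wait** or **launch_wait** relative to **execution** would suggest that the command spent most of its lifetime waiting rather than running.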

### Instrumenting the buffer copy

We construct a **cl_event** object and use that event to collect timing information. For example, during writes to a device buffer we pass in a **cl_event** object to collect timing information.

```C++
    // mat_mult_profiling.cpp source

    // Event to collect timing information for the buffer copy
    cl_event io_event;

    H_ERRCHK(
        clEnqueueWriteBuffer(
            command_queue,
            A_d,
            blocking,
            0,
            nbytes_A,
            A_h,
            0,
            NULL,
            &io_event
        ) 
    );

```

Then, we use **h_get_event_time_ms** to extract the elapsed time and print out the transfer rate.

```C++
    // Time how long it takes to complete event
    cl_double upload_A_ms = h_get_event_time_ms(
        &io_event, 
        "Uploading Buffer A",
        &nbytes_A
    );
```

The buffer copies from **B_h** to **B_d**, and then from **C_d** to **C_h** are also instrumented in a similar way. From the previous call to **h_get_event_time_ms** we know **io_event** is in a complete state, so we can reuse it.

### Instrumenting the kernel

Similarly, the **[clEnqueueNDRangeKernel](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueNDRangeKernel.html)** command accepts an OpenCL Event. 

```C++
    // Event for the kernel
    cl_event kernel_event;
    
    // Now enqueue the kernel
    H_ERRCHK(
        clEnqueueNDRangeKernel(
            command_queue,
            kernel,
            work_dim,
            NULL,
            global_size,
            local_size,
            0,
            NULL,
            &kernel_event
        ) 
    );

    // Time how long it takes to complete event
    cl_double run_kernel_ms = h_get_event_time_ms(
        &kernel_event, 
        "Running kernel",
        NULL
    );
```

In this manner we instrument the uploads, downloads, and kernel execution in the source file [mat_mult_profiling.cpp](mat_mult_profiling.cpp). Now we run the instrumented code and print out the results.
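
Once each step returns an elapsed time, the values can be combined into a simple time budget. The following is a minimal sketch, assuming the times have already been collected with **h_get_event_time_ms**; the `TimedStep` struct and the functions are hypothetical and not part of the course code.

```C++
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record of one instrumented step and its elapsed time
struct TimedStep {
    std::string name;
    double elapsed_ms;
};

// Sum the elapsed times of all steps
double total_ms(const std::vector<TimedStep>& steps) {
    double total = 0.0;
    for (const TimedStep& s : steps) {
        assert(s.elapsed_ms >= 0.0);
        total += s.elapsed_ms;
    }
    return total;
}

// Percentage of the total taken by one step (0 if the total is zero)
double share_percent(double part_ms, double whole_ms) {
    return (whole_ms > 0.0) ? 100.0 * part_ms / whole_ms : 0.0;
}

// Print each step with its time and its share of the total
void print_time_budget(const std::vector<TimedStep>& steps) {
    double total = total_ms(steps);
    for (const TimedStep& s : steps) {
        std::printf("%-22s %8.3f ms %5.1f%%\n",
            s.name.c_str(), s.elapsed_ms, share_percent(s.elapsed_ms, total));
    }
    std::printf("%-22s %8.3f ms\n", "Total", total);
}
```

As an illustration, feeding in the times from the sample run shown in the Tau section (0.041 ms and 0.041 ms for the uploads, 0.437 ms for the kernel, 0.213 ms for the download) gives a total of 0.732 ms, of which the kernel accounts for roughly 60%.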

## Compile and run the application

The makefile is set to compile the example [mat_mult_profiling.cpp](mat_mult_profiling.cpp). The program creates and fills matrices **A** and **B** with random numbers in the range [0-1] and then uses OpenCL to compute the solution in matrix **C**. The matrices are written to the following files in binary format:

* arrayA.dat
* arrayB.dat
* arrayC.dat

On your terminal change directory to **L3_Matrix_Multiplication** and compile and run with these commands (without the exclamation mark !).

In [8]:
!make clean; make; ./mat_mult_profiling.exe

rm -r *.exe
g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_mult_profiling.cpp\
	-o mat_mult_profiling.exe -lOpenCL
In file included from mat_mult_profiling.cpp:18:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
  337 |         command_queues[n] = clCreateCommandQueue(
      |                             ~~~~~~~~~~~~~~~~~~~~^
  338 |             contexts[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~
  339 |             devices[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~
  340 |             queue_properties,
      |             ~~~~~~~~~~~~~~~~~
  341 |             &errcode

## Open-source profiling tools

### Tau

[Tau](https://www.cs.uoregon.edu/research/tau/home.php) is a commonly used open-source profiling and tracing toolkit for HPC applications. For OpenCL applications it provides both profiling and tracing functionality.

#### Profiling

The Tau application **tau_exec** can be used to collect profiling information. Profiling information can then be visualised with the Tau applications **paraprof** (GUI), or **pprof** (command-line).

We set the environment variable **PROFILEDIR=./tau** to tell Tau where to put its files.

In [1]:
%env PROFILEDIR=./tau

env: PROFILEDIR=./tau


Then we use the following call to **tau_exec** to collect profiling information for OpenCL calls.

In [2]:
!tau_exec -T serial -opencl ./mat_mult_profiling.exe

	               name: NVIDIA GeForce RTX 3060 Laptop GPU 
	 global memory size: 6226 MB
	    max buffer size: 1556 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024
device id: 181352528.
command id: 93995599091328.
vendor id: 0.
Got a bogus start! 2 .TAU application
Time for event "Uploading Buffer A": 0.041 ms (12879.26 MB/s)
Time for event "Uploading Buffer B": 0.041 ms (25560.37 MB/s)
Time for event "Running kernel": 0.437 ms
Time for event "Downloading Buffer C": 0.213 ms (10094.81 MB/s)
Maximum error (infinity norm) is: 2.28882e-05


Now have a look at the contents of the tau directory

In [3]:
!ls ./tau

events.0.edf   profile.0.0.2  tautrace.0.0.0.trc  tau.trc
profile.0.0.0  profile.txt    tautrace.0.0.1.trc  trace.json
profile.0.0.1  tau.edf	      tautrace.0.0.2.trc


Use the Tau application **pprof** to get a text mode profile of the app.

In [4]:
!pprof > ./tau/profile.txt

We see from the profile that the call to **mat_mult** took approximately **12ms**. This is similar to what was measured from the profiling interface.

#### Tracing with Google Chrome tracing

For tracing we set the environment variables **TRACEDIR=./tau** and **TAU_TRACE=1**.

In [5]:
%env TAU_TRACE=1
%env TRACEDIR=./tau

env: TAU_TRACE=1
env: TRACEDIR=./tau


Then we capture OpenCL information as before.

In [6]:
!tau_exec -T serial -opencl ./mat_mult_profiling.exe

	               name: NVIDIA GeForce RTX 3060 Laptop GPU 
	 global memory size: 6226 MB
	    max buffer size: 1556 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024
device id: 478091040.
command id: 94395327787920.
vendor id: 0.
Got a bogus start! 2 .TAU application
Time for event "Uploading Buffer A": 0.041 ms (12889.23 MB/s)
Time for event "Uploading Buffer B": 0.041 ms (25580.17 MB/s)
Time for event "Running kernel": 0.449 ms
Time for event "Downloading Buffer C": 0.202 ms (10600.51 MB/s)
Maximum error (infinity norm) is: 2.28882e-05


Now merge the trace files and convert the result into a downloadable JSON document.

In [7]:
!cd tau; echo 'y' | tau_treemerge.pl

/opt/tau/2.31.1/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.0.edf events.0.edf tautrace.0.0.0.trc tautrace.0.0.1.trc tautrace.0.0.2.trc tau.trc
tau.trc exists; override [y]? tautrace.0.0.0.trc: 450 records read.
tautrace.0.0.1.trc: 6 records read.
tautrace.0.0.2.trc: 41 records read.


In [8]:
!cd tau; tau_trace2json ./tau.trc ./tau.edf -chrome -ignoreatomic -o trace.json

Using the file manager on the left, download the file **tau/trace.json** to your computer. Then in your browser go to [https://ui.perfetto.dev](https://ui.perfetto.dev) and load the trace for viewing on your local machine.


<figure style="margin-left:0; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/Chrome_trace.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Tracing OpenCL calls with <a href="https://ui.perfetto.dev">ui.perfetto.dev</a>.</figcaption>
</figure>

In this instance we see that the kernel **mat_mult** has taken approximately 0.45ms to complete.
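
Timings gathered from OpenCL events can also be viewed in the same trace viewer without Tau. As a hedged sketch, the function below writes a list of timed spans in the Chrome Trace Event JSON format, using "complete" events (`"ph":"X"`) whose `ts` and `dur` fields are in microseconds. The `TraceSpan` struct and the function name are hypothetical, the process and thread ids are fixed at 0 for simplicity, and the sketch assumes span names contain no characters that need escaping in JSON.

```C++
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one timed command. Times are in microseconds,
// the unit used by the Chrome Trace Event format.
struct TraceSpan {
    std::string name;
    double start_us;
    double duration_us;
};

// Serialise spans as a JSON object with a "traceEvents" array.
// Assumption: names contain no quotes or backslashes.
std::string to_chrome_trace_json(const std::vector<TraceSpan>& spans) {
    std::string json = "{\"traceEvents\":[";
    for (std::size_t i = 0; i < spans.size(); ++i) {
        assert(spans[i].duration_us >= 0.0);
        // Separate array elements with commas
        if (i > 0) {
            json += ",";
        }
        // "X" marks a complete event with a start time and a duration
        json += "{\"name\":\"" + spans[i].name + "\"";
        json += ",\"ph\":\"X\"";
        json += ",\"ts\":" + std::to_string(spans[i].start_us);
        json += ",\"dur\":" + std::to_string(spans[i].duration_us);
        json += ",\"pid\":0,\"tid\":0}";
    }
    json += "]}";
    return json;
}
```

Start times can be made relative to the first event by subtracting its start timestamp, and nanosecond values from the profiling interface need to be divided by 1000 to obtain microseconds.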

## Commercial profiling tools

### CLTracer

[CLTracer](https://www.cltracer.com/) is a commercial product that profiles OpenCL calls for Windows and Linux OpenCL applications. It requires a GUI to run and provides:

* A timeline of OpenCL calls, separated into API and kernel calls
* Tables of time spent in each call
    * Global and local kernel size recorded
    * Size of transfers recorded
* Time spent in the API vs time spent blocking
* Breakdown of time spent in kernels
* Breakdown of time spent in queues

Setting up a project and running a trace for [mat_mult_profiling.cpp](mat_mult_profiling.cpp) was straightforward. Unfortunately, I don't see a way to fetch information from the command line.

#### Timeline view

The timeline view shows when and for how long each OpenCL call lasts. The overview shows that setting 

<figure style="margin-left:auto; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/cltracer_timeline_overview.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">CLTracer timeline overview.</figcaption>
</figure>

If we zoom in to the kernel region we see that executing the kernel only took around 2.9ms and that it takes less time to upload and download arrays than the kernel spends executing.

<figure style="margin-left:auto; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/cltracer_timeline_zoom.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">CLTracer timeline, zoomed in on kernel region.</figcaption>
</figure>

#### Tables

The timeline can also be viewed in tabular format. In addition to the time for each OpenCL call, you can see the global and local sizes of the kernels as well as the sizes of buffer uploads and downloads.

<figure style="margin-left:auto; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/cltracer_tables.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">CLTracer table of OpenCL calls.</figcaption>
</figure>

For more information on available tools within CLTracer please see the CLTracer [Documentation](https://www.cltracer.com/docs).

## Vendor profiling tools

### AMD

#### HSA application traces

The AMD utility **rocprof** has the ability to collect traces of HSA calls:

```bash
rocprof --hsa-trace -o rocprof_trace/result.csv ./mat_mult_profiling.exe
```

Then copy the file **rocprof_trace/result.json** back to your computer and go to the address

[https://ui.perfetto.dev](https://ui.perfetto.dev)

to load **result.json** and display the HSA trace. This is of limited utility as you'll need to guess which HSA calls correspond to OpenCL calls. 

#### Performance counters

**rocprof** can also collect performance counters on kernels. We specify a list of counters and the kernels they should apply to in the file **[rocprof_counters.txt](rocprof_counters.txt)**.

```bash
rocprof -i rocprof_counters.txt --timestamp on --stats -o rocprof_counters/result.csv ./mat_mult_profiling.exe
```

### NVIDIA

#### Profiling with nvprof

Historically there was limited functionality for profiling OpenCL events with NVIDIA's [NVVP](http://uob-hpc.github.io/2015/05/27/nvvp-import-opencl.html). However, profiling support for OpenCL has largely disappeared with the introduction of Nsight Compute and Nsight Systems.




<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>