# Kernel Tuner Tutorial

## Intermediate Hands-on

In this hands-on we will look at three features of Kernel Tuner that have been recently introduced to you: **search space restrictions**, **caching**, and **output verification**.

But first, if you have not done it already, it is time to install and import `kernel_tuner` and its dependencies.

In [1]:
%pip install kernel_tuner

import numpy as np
import kernel_tuner as kt
import collections

Collecting kernel_tuner
  Downloading kernel_tuner-1.1.3-py3-none-any.whl.metadata (12 kB)
Collecting python-constraint2<3.0.0,>=2.1.0 (from kernel_tuner)
  Downloading python_constraint2-2.2.3-cp311-cp311-manylinux_2_35_x86_64.whl.metadata (10 kB)
Collecting xmltodict (from kernel_tuner)
  Downloading xmltodict-0.14.2-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading kernel_tuner-1.1.3-py3-none-any.whl (151 kB)
Downloading python_constraint2-2.2.3-cp311-cp311-manylinux_2_35_x86_64.whl (3.1 MB)
Downloading xmltodict-0.14.2-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict, python-constraint2, kernel_tuner
Successfully installed kernel_tuner-1.1.3 python-constraint2-2.2.3 xmltodict-0.14.2


To work with these features we will use a matrix multiplication kernel.

Matrix multiplication is one of the most well-known and widely-used linear algebra operations, and is frequently used to demonstrate the high-performance computing capabilities of GPUs. As such, matrix multiplication presents a familiar starting point for many GPU programmers. More information about matrix multiplication can be found on [Wikipedia](https://en.wikipedia.org/wiki/Matrix_multiplication).

The following cell contains the code of a matrix multiplication kernel that uses shared memory. The content of the cell is written to the `matmul_shared.cu` file. You only need to execute the cell once, because this hands-on does not require changing the kernel implementation.

This kernel assumes that the width and height of the matrices `A`, `B`, and `C` are equal to `WIDTH`, which is known at compile time. Of course, you'll want a more flexible solution in reality, but this is just an example kernel to demonstrate how to use Kernel Tuner.

In [2]:
%%writefile matmul_shared.cu

#define WIDTH 512

__global__ void matmul_kernel(float *C, float *A, float *B) {

    __shared__ float sA[block_size_y][block_size_x];
    __shared__ float sB[block_size_y][block_size_x];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = blockIdx.x * block_size_x + tx;
    int y = blockIdx.y * block_size_y + ty;

    float sum = 0.0;
    int k,kb;

    for (k=0; k<WIDTH; k+=block_size_x) {
        __syncthreads();
        sA[ty][tx] = A[y*WIDTH+k+tx];
        sB[ty][tx] = B[(k+ty)*WIDTH+x];
        __syncthreads();

        for (kb=0; kb<block_size_x; kb++) {
            sum += sA[ty][kb] * sB[kb][tx];
        }

    }

    C[y*WIDTH+x] = sum;
}

Writing matmul_shared.cu
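
To make the tiling scheme of the kernel easier to follow, here is a minimal CPU sketch of the same loop structure in plain Python. It is not part of Kernel Tuner; the function names `matmul_naive` and `matmul_tiled`, the small 8x8 size, and the tile size of 4 are all illustrative assumptions. Each pass over `k` copies one tile of `A` and one tile of `B` into local lists (standing in for `sA` and `sB`) before accumulating the partial products.

```python
import random

def matmul_naive(A, B, width):
    """Reference product on flat row-major lists: C[y][x] = sum_k A[y][k] * B[k][x]."""
    return [sum(A[y * width + k] * B[k * width + x] for k in range(width))
            for y in range(width) for x in range(width)]

def matmul_tiled(A, B, width, tile):
    """CPU mimic of the kernel's loops for a square tile-by-tile thread block."""
    assert width % tile == 0, "width must be a multiple of the tile size"
    C = [0.0] * (width * width)
    for by in range(width // tile):            # plays the role of blockIdx.y
        for bx in range(width // tile):        # plays the role of blockIdx.x
            acc = [[0.0] * tile for _ in range(tile)]   # one `sum` per thread
            for k in range(0, width, tile):
                # every "thread" (ty, tx) loads one element of A and one of B
                sA = [[A[(by * tile + ty) * width + k + tx] for tx in range(tile)]
                      for ty in range(tile)]
                sB = [[B[(k + ty) * width + bx * tile + tx] for tx in range(tile)]
                      for ty in range(tile)]
                for ty in range(tile):
                    for tx in range(tile):
                        for kb in range(tile):
                            acc[ty][tx] += sA[ty][kb] * sB[kb][tx]
            for ty in range(tile):
                for tx in range(tile):
                    C[(by * tile + ty) * width + bx * tile + tx] = acc[ty][tx]
    return C

random.seed(0)
width, tile = 8, 4
# small integer values keep the float arithmetic exact, so == is safe here
A_small = [float(random.randint(-3, 3)) for _ in range(width * width)]
B_small = [float(random.randint(-3, 3)) for _ in range(width * width)]
C_naive = matmul_naive(A_small, B_small, width)
C_tiled = matmul_tiled(A_small, B_small, width, tile)
print(C_tiled == C_naive)
```

The sketch uses a single `tile` value for both dimensions, which corresponds to a square thread block in the CUDA kernel.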


Before running the code we need to allocate input and output matrices, and add some tuning parameters.

In [3]:
# matrix width needs to match the value in the kernel source
problem_size = (512, 512)

A = np.random.randn(*problem_size).astype(np.float32)
B = np.random.randn(*problem_size).astype(np.float32)
C = np.zeros_like(A)

args = [C, A, B]

tune_params = collections.OrderedDict()
tune_params["block_size_x"] = [2**i for i in range(0, 11)]
tune_params["block_size_y"] = [2**i for i in range(0, 11)]

It is now your turn to add some **search space restrictions**. You are free to add as many restrictions as you want, but one in particular is required for the kernel to produce correct results: the thread block must be **exactly** square. The kernel declares `sA` and `sB` with `block_size_y` rows, yet its inner loop indexes the rows of `sB` with a counter that runs up to `block_size_x`.

Remember that restrictions are specified as either a Python list containing strings, each string being one restriction, or as a callable object that returns `True` if the configuration is valid and `False` otherwise.

In [5]:
# EXERCISE 1: Define the required search space restriction for the matrix multiplication kernel
restrict = ["block_size_y==block_size_x"]
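
As a rough illustration of what such a restriction does to the search space, the sketch below enumerates the configurations by hand in plain Python. The helper `is_square` is hypothetical, and the dictionary-based calling convention is an assumption made for this example; check the API documentation for the exact form Kernel Tuner expects from a callable. The limit of 1024 threads per block is likewise an assumption about the device.

```python
from itertools import product

block_sizes = [2**i for i in range(0, 11)]  # same values as in tune_params

def is_square(config):
    """Hypothetical callable restriction: accept only square thread blocks."""
    return config["block_size_x"] == config["block_size_y"]

# the unrestricted search space is the Cartesian product of all values
all_configs = [{"block_size_x": bx, "block_size_y": by}
               for bx, by in product(block_sizes, block_sizes)]

# keep only the configurations that satisfy the restriction
square = [c for c in all_configs if is_square(c)]

# assumption: the device allows at most 1024 threads per block
launchable = [c for c in square
              if c["block_size_x"] * c["block_size_y"] <= 1024]

print(len(all_configs), len(square), len(launchable))
```

Under these assumptions, the string restriction `"block_size_y==block_size_x"` and the callable express the same condition; the string form is simply shorter to write.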

To enable **caching** of intermediate results during tuning, Kernel Tuner needs to know the name of the cache file. The name is given as a string; Kernel Tuner automatically appends the `.json` extension if it is missing.

In [6]:
# define a string containing the cache file name
cache_name = "my_cache_file.json"
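
The idea behind a tuning cache can be sketched in a few lines of plain Python. This is only a conceptual illustration, not Kernel Tuner's implementation or file format: the function `tune_with_cache`, the comma-joined keys, and the stand-in `fake_benchmark` are all hypothetical.

```python
import json
import os
import tempfile

def tune_with_cache(configs, benchmark, cache_path):
    """Benchmark each configuration once; reuse stored results on later runs."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    for config in configs:
        key = ",".join(str(v) for v in config)
        if key not in cache:
            cache[key] = benchmark(config)
            # persist after every new measurement so an interrupted run loses nothing
            with open(cache_path, "w") as f:
                json.dump(cache, f)
    return cache

calls = []

def fake_benchmark(config):
    """Stand-in for a real measurement; records which configs were actually run."""
    calls.append(config)
    return float(sum(config))

cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.json")
configs = [(1, 1), (2, 2), (4, 4)]
first = tune_with_cache(configs, fake_benchmark, cache_path)
# the second run only benchmarks the one configuration that is new
second = tune_with_cache(configs + [(8, 8)], fake_benchmark, cache_path)
print(len(calls))
```

The practical benefit is the same as in the sketch: configurations that were already benchmarked do not have to be compiled and run again when tuning is restarted.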

Do not forget to pass the restrictions to the `tune_kernel` function and enable caching as documented in Kernel Tuner's [API](https://KernelTuner.github.io/kernel_tuner/stable/user-api.html).

In [7]:
if not restrict:
    print("Error: you must first define a search space restriction! (Exercise 1)")

# Call the tuner with the restricted search space
else:
    results, env = kt.tune_kernel("matmul_kernel", "matmul_shared.cu",
                                  problem_size, args, tune_params, restrictions=restrict,
                                  cache=cache_name, verbose=True, lang="cupy")

    print(f"Number of configurations: {len(results)}")


Using: Tesla T4
Searchspace has 6 configurations after restrictions.
block_size_x=1, block_size_y=1, time=50.751ms
block_size_x=2, block_size_y=2, time=7.970ms
block_size_x=4, block_size_y=4, time=1.419ms
block_size_x=8, block_size_y=8, time=0.438ms
block_size_x=16, block_size_y=16, time=0.285ms
block_size_x=32, block_size_y=32, time=0.272ms
best performing configuration:
block_size_x=32, block_size_y=32, time=0.272ms
Number of configurations: 6
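
Once tuning has finished, the measurements can be processed like any other Python data. The sketch below hardcodes the six timings printed above into a list named `logged` so that it runs without a GPU; it assumes each entry is a dictionary holding the tunable parameters plus a `time` field in milliseconds, which mirrors the printed output but should be checked against the actual `results` object.

```python
# timings copied from the tuning output above (Tesla T4), in milliseconds
logged = [
    {"block_size_x": 1, "block_size_y": 1, "time": 50.751},
    {"block_size_x": 2, "block_size_y": 2, "time": 7.970},
    {"block_size_x": 4, "block_size_y": 4, "time": 1.419},
    {"block_size_x": 8, "block_size_y": 8, "time": 0.438},
    {"block_size_x": 16, "block_size_y": 16, "time": 0.285},
    {"block_size_x": 32, "block_size_y": 32, "time": 0.272},
]

# pick the fastest and slowest configurations by measured time
best = min(logged, key=lambda r: r["time"])
worst = max(logged, key=lambda r: r["time"])
speedup = worst["time"] / best["time"]

print(f"best: {best['block_size_x']}x{best['block_size_y']} at {best['time']} ms")
print(f"speedup over slowest configuration: {speedup:.1f}x")
```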


### Output verification

There are times, like with this matrix multiplication kernel, when some tuning configurations may produce wrong results.

It is important to catch this as early as possible. Kernel Tuner lets you pass a reference answer to the `tune_kernel` function; the results produced by every configuration are then compared against it.

The reference answer is a Python list that matches the kernel argument list (`args` in our case) in size and order, with `None` for every element that does not need to be compared. When working with floating-point values, Kernel Tuner also lets you specify a tolerance.

Again, refer to the [API](https://KernelTuner.github.io/kernel_tuner/stable/user-api.html) for more information.

In [8]:
# compute the reference result, e.g. by using NumPy
reference = A.dot(B)

# EXERCISE 2: Correctly construct the answer list required by Kernel Tuner
answer = [reference, None, None]
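
To illustrate how an answer list with `None` entries and an absolute tolerance can be interpreted, here is a minimal sketch in plain Python. The function `verify` is hypothetical and is not Kernel Tuner's verification code; it only shows the idea of skipping `None` entries and comparing the rest element by element.

```python
import math

def verify(answer, outputs, atol):
    """Compare each kernel output to its reference, skipping None entries."""
    if len(answer) != len(outputs):
        raise ValueError("answer must have one entry per kernel argument")
    for expected, actual in zip(answer, outputs):
        if expected is None:
            continue  # no comparison requested for this argument
        if len(expected) != len(actual):
            return False
        if not all(math.isclose(e, a, rel_tol=0.0, abs_tol=atol)
                   for e, a in zip(expected, actual)):
            return False
    return True

# toy data: only the first argument (the output) is checked
reference_c = [1.0, 2.0, 3.0]
demo_answer = [reference_c, None, None]
good_outputs = [[1.00001, 2.0, 2.99999], [0.0], [0.0]]
bad_outputs = [[1.0, 2.0, 3.1], [0.0], [0.0]]

print(verify(demo_answer, good_outputs, atol=1e-4))
print(verify(demo_answer, bad_outputs, atol=1e-4))
```

A tolerance is needed because a GPU kernel and a CPU reference generally accumulate floating-point sums in a different order, so their results rarely match bit for bit.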


Now, we are ready to call the tuner again with output verification enabled.

In [9]:
if not answer:
    print("Error: you must first set up the answer list correctly! (Exercise 2)")

# Call the tuner with output verification enabled
else:
    results, env = kt.tune_kernel("matmul_kernel", "matmul_shared.cu",
                                  problem_size, args, tune_params, restrictions=restrict,
                                  answer=answer, lang="cupy", atol=1e-4)

    print(f"Number of configurations: {len(results)}")

Using: Tesla T4
block_size_x=1, block_size_y=1, time=50.734ms
block_size_x=2, block_size_y=2, time=7.967ms
block_size_x=4, block_size_y=4, time=1.422ms
block_size_x=8, block_size_y=8, time=0.435ms
block_size_x=16, block_size_y=16, time=0.284ms
block_size_x=32, block_size_y=32, time=0.266ms
best performing configuration:
block_size_x=32, block_size_y=32, time=0.266ms
Number of configurations: 6
