Most of the examples show how to use Kernel Tuner to tune a CUDA, OpenCL, or C kernel, each demonstrating a particular use case of Kernel Tuner. The exceptions are test_vector_add.py and test_vector_add_parameterized.py, which show how to write tests for GPU kernels with Kernel Tuner.
**Note:** Please do not use the examples as performance benchmarks. The examples here were created specifically to highlight certain features of Kernel Tuner. Please contact the developers if you are interested in benchmarking Kernel Tuner.
Below we list the example applications and the features they illustrate.
- [CUDA] [CUDA-C++] [OpenCL] [C] [Fortran] [OpenACC-C++] [OpenACC-Fortran]
- use Kernel Tuner to tune a simple kernel
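A minimal sketch of what tuning a simple kernel looks like. The kernel name, array sizes, and parameter values below are illustrative, not taken from the example itself, and actually running `tune()` requires the kernel_tuner package and a CUDA-capable device:

```python
import numpy as np

# Minimal CUDA vector-add kernel; block_size_x is inserted by Kernel Tuner
# as a compile-time constant for each configuration it benchmarks.
kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

size = 1000000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
args = [c, a, b, np.int32(size)]

# Candidate values for the thread-block size.
tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

def tune():
    # Deferred import: this call needs a GPU to execute.
    import kernel_tuner
    return kernel_tuner.tune_kernel("vector_add", kernel_string,
                                    size, args, tune_params)
```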
- [CUDA] [OpenCL]
- use a 2-dimensional problem domain with 2-dimensional thread blocks in a simple and clean example
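A sketch of the 2D case, with hypothetical kernel and sizes. Passing a 2-tuple as the problem size and tuning both `block_size_x` and `block_size_y` lets Kernel Tuner derive a 2D grid by dividing each dimension of the problem size by the matching block size:

```python
import numpy as np

# Naive CUDA kernel over a 2D domain; block_size_x/y are tunable.
kernel_string = """
__global__ void scale(float *out, float *in, int w, int h) {
    int x = blockIdx.x * block_size_x + threadIdx.x;
    int y = blockIdx.y * block_size_y + threadIdx.y;
    if (x < w && y < h) {
        out[y * w + x] = 2.0f * in[y * w + x];
    }
}
"""

w, h = 4096, 2048
inp = np.random.randn(h * w).astype(np.float32)
out = np.zeros_like(inp)
args = [out, inp, np.int32(w), np.int32(h)]

# A 2D problem size: Kernel Tuner divides each dimension by the
# corresponding block size to compute the grid dimensions.
problem_size = (w, h)

tune_params = {
    "block_size_x": [16, 32, 64],
    "block_size_y": [2, 4, 8, 16],
}

def tune():
    import kernel_tuner  # needs a CUDA device
    return kernel_tuner.tune_kernel("scale", kernel_string,
                                    problem_size, args, tune_params)
```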
- [CUDA] [OpenCL]
- pass a filename instead of a string with code
- use 2-dimensional thread blocks and tiling in both dimensions
- tell Kernel Tuner to compute the grid dimensions for 2D thread blocks with tiling
- use the restrictions option to limit the search to only valid configurations
- use a user-defined performance metric like GFLOP/s
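The options in this list can be combined in a single `tune_kernel` call. A hedged sketch with hypothetical parameter names and values (the kernel source is omitted; note that the kernel source argument may also be a filename rather than a string):

```python
from collections import OrderedDict

# Hypothetical tunable parameters for a tiled 2D kernel.
tune_params = OrderedDict()
tune_params["block_size_x"] = [16, 32, 64]
tune_params["block_size_y"] = [4, 8, 16]
tune_params["tile_size_x"] = [1, 2, 4]
tune_params["tile_size_y"] = [1, 2, 4]

# With tiling, each thread block covers block_size * tile_size elements
# per dimension, so tell Kernel Tuner what to divide the problem size by
# when computing the grid dimensions.
grid_div_x = ["block_size_x", "tile_size_x"]
grid_div_y = ["block_size_y", "tile_size_y"]

# Restrictions limit the search to valid configurations, here by
# bounding the total number of threads per block.
restrictions = ["block_size_x*block_size_y >= 64",
                "block_size_x*block_size_y <= 1024"]

# A user-defined metric computed from the measured time in milliseconds;
# the flop count here is a placeholder, not taken from a real kernel.
flops = 2 * 4096 * 4096
metrics = OrderedDict()
metrics["GFLOP/s"] = lambda p: (flops / 1e9) / (p["time"] / 1e3)

def tune(kernel_string, problem_size, args):
    import kernel_tuner  # needs a GPU
    return kernel_tuner.tune_kernel("kernel", kernel_string, problem_size,
                                    args, tune_params,
                                    grid_div_x=grid_div_x,
                                    grid_div_y=grid_div_y,
                                    restrictions=restrictions,
                                    metrics=metrics)
```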
There are several examples centered around the convolution kernel [CUDA] [OpenCL]:
- [CUDA] [OpenCL]
- use tunable parameters for tuning for multiple input sizes
- pass constant memory arguments to the kernel
- write output to a json file
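A sketch of the constant-memory and JSON-output parts, with hypothetical symbol names and sizes. `cmem_args` maps the name of a `__constant__` symbol in the kernel source to the host array to copy into it, and the list of result dicts returned by `tune_kernel` serializes directly to JSON:

```python
import json
import numpy as np

# Hypothetical 17x17 convolution filter destined for constant memory.
filter_size = 17
conv_filter = np.random.randn(filter_size * filter_size).astype(np.float32)

# Maps the __constant__ symbol name in the kernel source to a host array.
cmem_args = {"d_filter": conv_filter}

def tune(kernel_string, problem_size, args, tune_params):
    import kernel_tuner  # needs a CUDA device
    results, env = kernel_tuner.tune_kernel("convolution_kernel",
                                            kernel_string, problem_size,
                                            args, tune_params,
                                            cmem_args=cmem_args)
    # One dict per benchmarked configuration; dump them as JSON.
    with open("convolution_results.json", "w") as fh:
        json.dump(results, fh, indent=2)
    return results
```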
- [CUDA] [OpenCL]
- use the convolution kernel for separable filters
- write output to a csv file using Pandas
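Since `tune_kernel` returns its results as a flat list of dicts, writing a CSV file with Pandas is a one-liner. The results below are fabricated placeholders to keep the sketch self-contained:

```python
import pandas as pd

# Hypothetical tuning results, one dict per benchmarked configuration,
# in the shape tune_kernel returns them.
results = [
    {"block_size_x": 32,  "tile_size_x": 1, "time": 1.84},
    {"block_size_x": 64,  "tile_size_x": 2, "time": 1.12},
    {"block_size_x": 128, "tile_size_x": 4, "time": 0.97},
]

# A list of dicts converts directly into a DataFrame.
df = pd.DataFrame(results)
df.to_csv("separable_convolution.csv", index=False)
```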
- [CUDA] [OpenCL]
- use run_kernel to compute a reference answer
- verify the output of every benchmarked kernel
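A sketch of the verification pattern, with illustrative names: run one known-good configuration once with `run_kernel` to obtain a reference result, then pass it via the `answer` option so every benchmarked configuration is checked against it:

```python
import numpy as np

# `answer` must have one entry per kernel argument; use None for
# arguments the kernel does not write, so they are not compared.
def make_answer(reference_output, num_args):
    return [reference_output] + [None] * (num_args - 1)

def reference_and_verify(kernel_string, size, tune_params):
    import kernel_tuner  # needs a CUDA device

    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    args = [np.zeros_like(a), a, b, np.int32(size)]

    # run_kernel executes a single configuration and returns the kernel
    # arguments after execution; element 0 holds the computed output.
    res = kernel_tuner.run_kernel("vector_add", kernel_string, size,
                                  args, {"block_size_x": 128})
    answer = make_answer(res[0], len(args))

    # Every benchmarked configuration is now verified against `answer`.
    return kernel_tuner.tune_kernel("vector_add", kernel_string, size,
                                    args, tune_params, answer=answer)
```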
- [CUDA]
- allocate page-locked host memory from Python
- overlap transfers to and from the GPU with computation
- tune parameters in the host code in combination with those in the kernel
- use the lang="C" option and set compiler options
- pass a list of filenames instead of strings with kernel code
- [CUDA] [OpenCL]
- use vector types and shuffle instructions (shuffle is only available in CUDA)
- tune the number of thread blocks the kernel is executed with
- tune the partial loop unrolling factor of a for-loop
- tune a pipeline that consists of two kernels
- tune with a custom output verification function
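A custom verification function can replace the default element-wise comparison when only part of the output is meaningful, e.g. the single result of a reduction. A sketch, assuming the `verify` callback receives the reference answer, the kernel output, and a tolerance:

```python
import numpy as np

# Compare only the first output element within a tolerance, instead of
# the default element-wise comparison of the full arrays.
def verify_partial_reduce(answer, result_host, atol=None):
    if atol is None:
        atol = 1e-6
    return bool(np.isclose(answer[0], result_host[0], atol=atol))

def tune(kernel_string, size, args, tune_params, answer):
    import kernel_tuner  # needs a GPU
    return kernel_tuner.tune_kernel("sum_reduce", kernel_string, size,
                                    args, tune_params,
                                    answer=answer,
                                    verify=verify_partial_reduce)
```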
- [CUDA]
- use scipy to compute a reference answer and verify all benchmarked kernels
- express that the number of thread blocks depends on the values of tunable parameters
- [CUDA]
- overlap transfers with device mapped host memory
- tune different implementations of an algorithm
- [CUDA]
- in-thread-block 2D reduction using the CUB library
- C++ in CUDA kernel code
- tune multiple kernels in a pipeline