Kernel Float

Kernel Float is a header-only library for CUDA that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.

Summary

CUDA natively offers several reduced precision floating-point types (__half, __nv_bfloat16, __nv_fp8_e4m3, __nv_fp8_e5m2) and vector types (e.g., __half2, __nv_fp8x4_e4m3, float3). However, working with these types is cumbersome: mathematical operations require intrinsics (e.g., __hadd2 performs addition for __half2), type conversion is awkward (e.g., __nv_cvt_halfraw2_to_fp8x2 converts float16 to float8), and some functionality is missing (e.g., one cannot convert a __half to __nv_bfloat16).

Kernel Float resolves this by offering a single data type kernel_float::vec<T, N> that stores N elements of type T. Internally, the data is stored as a fixed-size array of elements. Operators (like +, *, &&) are overloaded so that the best available intrinsic for the given types is selected automatically. Many mathematical functions (like log, exp, sin) and common operations (such as sum, range, for_each) are also available.
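
To make the idea concrete, here is a minimal host-side C++ sketch of a fixed-size vector type with an overloaded operator and a reduction. It is a toy illustration only and not Kernel Float's implementation: the names toy_vec, toy_sum, and toy_demo are hypothetical, and the real library additionally selects hardware intrinsics where they exist.

```cpp
// Toy illustration only: NOT Kernel Float's actual implementation.
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for kernel_float::vec<T, N>: N elements of type T
// stored in a fixed-size array.
template <typename T, std::size_t N>
struct toy_vec {
    T items[N];

    T& operator[](std::size_t i) {
        assert(i < N);  // bounds check, for the sketch only
        return items[i];
    }
    const T& operator[](std::size_t i) const {
        assert(i < N);
        return items[i];
    }
};

// Element-wise addition. A real library could dispatch here to an intrinsic
// (for example __hadd2 for a pair of __half values) when one is available.
template <typename T, std::size_t N>
toy_vec<T, N> operator+(const toy_vec<T, N>& a, const toy_vec<T, N>& b) {
    toy_vec<T, N> out{};
    for (std::size_t i = 0; i < N; i++) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Horizontal reduction, similar in spirit to the sum operation named above.
template <typename T, std::size_t N>
T toy_sum(const toy_vec<T, N>& v) {
    T total = T(0);
    for (std::size_t i = 0; i < N; i++) {
        total = total + v[i];
    }
    return total;
}

// Small helper that exercises the sketch: adds two 4-element vectors
// element-wise and returns the sum of the result.
inline float toy_demo() {
    toy_vec<float, 4> a{{1.0f, 2.0f, 3.0f, 4.0f}};
    toy_vec<float, 4> b{{0.5f, 0.5f, 0.5f, 0.5f}};
    return toy_sum(a + b);
}
```

Because the element type and the length are both template parameters, switching toy_vec<float, 4> to another type or size requires no change to the arithmetic itself, which is the property the library's vec<T, N> is designed around.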

By using this library, developers can avoid the complexity of working with reduced precision floating-point types in CUDA and focus on their applications.

Features

In a nutshell, Kernel Float offers the following features:

Single type vec<T, N> that unifies all vector types.
Operator overloading to simplify programming.
Support for half-precision (16-bit) floating-point arithmetic, with a fallback to single precision for unsupported operations.
Support for quarter-precision (8-bit) floating-point types.
Easy integration as a single header file.
Written for C++17.
Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).

Example

See the examples directory for more examples.

Below is a simple example of a CUDA kernel that adds a constant to the input array and writes the results to the output array. Each thread processes two elements. Notice how easy it would be to change the precision (for example, double to half) or the vector size (for example, 4 instead of 2 items per thread).

#include "kernel_float.h"
namespace kf = kernel_float;

__global__ void kernel(const kf::vec<half, 2>* input, float constant, kf::vec<float, 2>* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = input[i] + kf::cast<half>(constant);
}

Here is how the same kernel would look without Kernel Float.

__global__ void kernel(const __half* input, float constant, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half in0 = input[2 * i + 0];
    __half in1 = input[2 * i + 1];
    __half2 a = __halves2half2(in0, in1);
    float b = float(constant);
    __half c = __float2half(b);
    __half2 d = __half2half2(c);
    __half2 e = __hadd2(a, d);
    __half f = __low2half(e);
    __half g = __high2half(e);
    float out0 = __half2float(f);
    float out1 = __half2float(g);
    output[2 * i + 0] = out0;
    output[2 * i + 1] = out1;
}

Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.

Installation

This is a header-only library. Copy the file single_include/kernel_float.h to your project and include it:

#include "kernel_float.h"

Use the provided Makefile to generate this single-include header file if it is outdated:

make
