
Hemi 2

Pre-release
Released by @harrism on 18 Sep 03:23 · 18 commits to master since this release

Hemi 2: Simpler, More Portable CUDA C++

Hemi 2 simplifies writing portable CUDA C/C++ code. With Hemi,

  • You can write parallel loops inline in your CPU code and run them on your GPU;
  • You can easily write code that compiles and runs on either the CPU or the GPU;
  • You can easily launch C++ lambda functions as GPU kernels;
  • Launch configuration details such as thread block size and grid size become an optimization detail rather than a requirement.

With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.

#include "hemi/parallel_for.h"

// SAXPY: y = a*x + y, written once and compiled for either the GPU or the CPU.
void saxpy(int n, float a, const float *x, float *y)
{
    hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
        y[i] = a * x[i] + y[i];
    });
}
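
For context, a minimal way to call saxpy might look like the sketch below. It assumes CUDA unified (managed) memory so that the same pointers are valid whether the loop runs on the host or on the device; a CPU-only build could use ordinary allocations instead.

#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *x, *y;

    // Managed memory is accessible from both host and device code.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 2.0f, x, y);
    cudaDeviceSynchronize();   // wait for asynchronous GPU work before reading y

    cudaFree(x);
    cudaFree(y);
    return 0;
}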

New Features

  • hemi::launch() for launching portable functions either as parallel kernels on the device or as serial functions on the host (see the first sketch after this list).
  • hemi::cudaLaunch() for launching CUDA kernels (portable or otherwise).
  • hemi::parallel_for() for expressing in-line parallel loops that are launched as CUDA kernels (or run on the host).
  • Support for GPU lambdas with HEMI_LAMBDA. GPU lambdas can be defined in host code and launched on the device using hemi::launch() or hemi::parallel_for().
  • Automatic parallel execution configuration with hemi::launch(), hemi::cudaLaunch(), and hemi::parallel_for(). This leaves the specification of the thread block and grid size up to the runtime, so that execution configuration becomes an optimization rather than a requirement.
  • Grid-stride range-based for loops with the hemi::grid_stride_range() helper (see the second sketch after this list).
  • Complete overhaul resulting in greater portability and improved simplicity.
  • New and improved samples.
  • Tests!
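
As a rough illustration of hemi::launch() and hemi::cudaLaunch(), the first sketch below launches the same work either portably or as an explicit CUDA kernel. The hemi/launch.h header name and the HEMI_LAUNCHABLE macro come from the Hemi repository rather than from these notes, so treat those details as assumptions.

#include "hemi/launch.h"   // assumed header name; see the Hemi repository
#include <cstdio>

// Assumed: HEMI_LAUNCHABLE marks a function that nvcc builds as a __global__
// kernel and that a host compiler builds as an ordinary function.
HEMI_LAUNCHABLE void hello()
{
    printf("Hello from a Hemi kernel\n");
}

void launch_examples()
{
    // Portable launch: runs the lambda as a GPU kernel under nvcc, or as a
    // serial host call otherwise. No <<<grid, block>>> configuration is
    // given; Hemi chooses the execution configuration automatically.
    hemi::launch([=] HEMI_LAMBDA () {
        printf("Hello from a GPU lambda\n");
    });

    // Explicit CUDA launch of the kernel above, again with an automatic
    // execution configuration.
    hemi::cudaLaunch(hello);
}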

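The second sketch shows a grid-stride range-based for loop, again with header names taken from the Hemi repository as an assumption. Each thread steps through the range by the total number of threads in the grid, so the loop is correct for whatever block and grid size the runtime picks, and on the host it degenerates to a plain sequential loop.

#include "hemi/launch.h"             // assumed header names; see the
#include "hemi/grid_stride_range.h"  // Hemi repository

void saxpy_grid_stride(int n, float a, const float *x, float *y)
{
    hemi::launch([=] HEMI_LAMBDA () {
        // Grid-stride loop: portable across any execution configuration.
        for (auto i : hemi::grid_stride_range(0, n))
            y[i] = a * x[i] + y[i];
    });
}
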
Enjoy Hemi 2. Please report any issues via the GitHub issue tracker.