
Hemi 2

Pre-release
Released by @harrism on 18 Sep 03:23 · 18 commits to master since this release

Hemi 2: Simpler, More Portable CUDA C++

Hemi 2 simplifies writing portable CUDA C/C++ code. With Hemi,

  • You can write parallel loops inline in your CPU code and run them on your GPU;
  • You can easily write code that compiles and runs on either the CPU or the GPU;
  • You can easily launch C++ lambda functions as GPU kernels;
  • Launch configuration details such as thread block size and grid size become an optimization detail rather than a requirement.

With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.

#include "hemi/parallel_for.h"

// SAXPY: y = a*x + y, written once and compiled for either the GPU or the CPU.
void saxpy(int n, float a, const float *x, float *y)
{
    hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
        y[i] = a * x[i] + y[i];
    });
}
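
For context, a minimal way to call saxpy might look like the sketch below. It assumes CUDA unified (managed) memory so that the same pointers are valid whether the loop runs on the host or on the device; a CPU-only build could use ordinary allocations instead.

#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *x, *y;

    // Managed memory is accessible from both host and device code.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 2.0f, x, y);
    cudaDeviceSynchronize();   // wait for asynchronous GPU work before reading y

    cudaFree(x);
    cudaFree(y);
    return 0;
}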

New Features

  • hemi::launch() for launching portable functions either as parallel kernels on the device or as serial functions on the host (see the first sketch after this list).
  • hemi::cudaLaunch() for launching CUDA kernels (portable or otherwise).
  • hemi::parallel_for() for expressing in-line parallel loops that are launched as CUDA kernels (or run on the host).
  • Support for GPU lambdas with HEMI_LAMBDA. GPU lambdas can be defined in host code and launched on the device using hemi::launch() or hemi::parallel_for().
  • Automatic parallel execution configuration with hemi::launch(), hemi::cudaLaunch(), and hemi::parallel_for(). This leaves the specification of the thread block and grid size up to the runtime, so that execution configuration becomes an optimization rather than a requirement.
  • Grid-stride range-based for loops with the hemi::grid_stride_range() helper (see the second sketch after this list).
  • Complete overhaul resulting in greater portability and improved simplicity.
  • New and improved samples.
  • Tests!
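
As a rough illustration of hemi::launch() and hemi::cudaLaunch(), the first sketch below launches the same work either portably or as an explicit CUDA kernel. The hemi/launch.h header name and the HEMI_LAUNCHABLE macro come from the Hemi repository rather than from these notes, so treat those details as assumptions.

#include "hemi/launch.h"   // assumed header name; see the Hemi repository
#include <cstdio>

// Assumed: HEMI_LAUNCHABLE marks a function that nvcc builds as a __global__
// kernel and that a host compiler builds as an ordinary function.
HEMI_LAUNCHABLE void hello()
{
    printf("Hello from a Hemi kernel\n");
}

void launch_examples()
{
    // Portable launch: runs the lambda as a GPU kernel under nvcc, or as a
    // serial host call otherwise. No <<<grid, block>>> configuration is
    // given; Hemi chooses the execution configuration automatically.
    hemi::launch([=] HEMI_LAMBDA () {
        printf("Hello from a GPU lambda\n");
    });

    // Explicit CUDA launch of the kernel above, again with an automatic
    // execution configuration.
    hemi::cudaLaunch(hello);
}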

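The second sketch shows a grid-stride range-based for loop, again with header names taken from the Hemi repository as an assumption. Each thread steps through the range by the total number of threads in the grid, so the loop is correct for whatever block and grid size the runtime picks, and on the host it degenerates to a plain sequential loop.

#include "hemi/launch.h"             // assumed header names; see the
#include "hemi/grid_stride_range.h"  // Hemi repository

void saxpy_grid_stride(int n, float a, const float *x, float *y)
{
    hemi::launch([=] HEMI_LAMBDA () {
        // Grid-stride loop: portable across any execution configuration.
        for (auto i : hemi::grid_stride_range(0, n))
            y[i] = a * x[i] + y[i];
    });
}
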
Enjoy Hemi 2. Please report any issues via the GitHub issue tracker.