
NVIDIA cub support #3200

Open · sjperkins opened this issue on Aug 6, 2018 · 3 comments
Labels: CUDA (CUDA related issue/PR), feature_request
Assignee: seibert

sjperkins (Contributor) commented:
Feature request

tl;dr I'd like to use NVIDIA cub primitives in a jitted CUDA function, but am not sure how to achieve this.

NVIDIA cub contains many useful warp-, block-, and device-wide primitives (sorts, segmented algorithms, reductions) for CUDA GPU programming. Additionally, these primitives are optimised and maintained for different architectures, reducing the maintenance burden on the user.

I see that cub's device-wide radix sort is wrapped in pyculib_sorting. While it's possible to wrap templated device-wide function calls, I wondered whether there is a reasonable way to support the block- and warp-wide algorithms, which are defined by templated trait/policy classes?

As I understand it, Numba only supports the CUDA Python specification, so supporting templated C++ code in a Pythonic style does not seem obvious to me. However, I don't fully understand Numba's internal typing mechanism, so it isn't clear to me whether it's easy to extend the CUDA Python spec directly, or via some extension in a user library.
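
For reference, the kind of hand-written code I'd like to avoid looks roughly like the sketch below: a Hillis-Steele inclusive scan over a single 128-thread block in today's CUDA Python. This is purely illustrative, not an existing or proposed API; cub's BlockScan encapsulates and tunes exactly this pattern per-architecture.

import numpy as np
from numba import cuda, float64

BLOCK = 128

@cuda.jit
def inclusive_scan(data):
    # Hand-rolled Hillis-Steele inclusive scan over one thread block.
    tid = cuda.threadIdx.x
    sm = cuda.shared.array(BLOCK, float64)
    sm[tid] = data[tid]
    cuda.syncthreads()
    step = 1
    while step < BLOCK:
        # Read before the in-place update, with barriers on both sides,
        # so no thread overwrites a value another thread still needs.
        addend = sm[tid - step] if tid >= step else 0.0
        cuda.syncthreads()
        sm[tid] += addend
        cuda.syncthreads()
        step *= 2
    data[tid] = sm[tid]

arr = cuda.to_device(np.ones(BLOCK, dtype=np.float64))
inclusive_scan[1, BLOCK](arr)  # arr now holds [1, 2, ..., 128]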

sjperkins (Contributor, Author) commented:

To express this another way: given the BlockScan API

template<
    typename T, 
    int BLOCK_DIM_X, 
    BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, 
    int BLOCK_DIM_Y = 1, 
    int BLOCK_DIM_Z = 1, 
    int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

is it possible to express the above as

block_scan = cub.BlockScan(numba.float64, 32, cub.BLOCK_SCAN_RAKING, 32, 1)
temp_storage = block_scan.temp_storage()

# In the cub example this is `int thread_data[4]`;
# the closest Numba analogue is presumably a per-thread local array:
thread_data = numba.cuda.local.array(4, numba.int64)

block_scan(temp_storage)(thread_data, thread_data)

sjperkins (Contributor, Author) commented:

After some digging around in the docs, it seems very doable to create a Numba BlockScan type, parameterised on the above template parameters.

It also seems that it may be possible to use the @overload and @overload_method decorators to hook in native C++/CUDA code (using cppimport, for example). Could I get some comment on this?
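
For reference, the basic @overload pattern from numba.extending looks like the following; clamp is just a hypothetical placeholder here, nothing cub-specific, and whether this extension point is usable from the CUDA target would still need confirming:

from numba import njit
from numba.extending import overload

def clamp(x, lo, hi):
    # Pure-Python placeholder; Numba substitutes the implementation
    # returned below when this is called from jitted code.
    raise NotImplementedError

@overload(clamp)
def clamp_overload(x, lo, hi):
    def impl(x, lo, hi):
        return min(max(x, lo), hi)
    return impl

@njit
def demo(x):
    return clamp(x, 0.0, 1.0)

demo(2.5)  # -> 1.0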

stuartarchibald added the feature_request and CUDA (CUDA related issue/PR) labels on Aug 9, 2018
seibert self-assigned this on Aug 9, 2018
seibert (Contributor) commented Aug 27, 2018:

A general implementation of CUB primitives in Numba will be very challenging, because CUB is a C++ template library while Numba translates directly from Python to PTX using libNVVM. Using CUB-style primitives within CUDA Python functions would require either reimplementing the templating mechanism and algorithms from scratch, or refashioning Numba's CUDA target into a CUDA Python to CUDA C++ translator and then using NVRTC to compile the resulting function.

While both of these options are interesting, they are out of scope for Numba itself and would be external projects (that still could depend on Numba).
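
To make the NVRTC route concrete, runtime compilation of a CUDA C++ string to PTX looks roughly like this ctypes sketch; error checking is omitted, and the library name, kernel, and call pattern are illustrative assumptions rather than anything Numba ships:

import ctypes

nvrtc = ctypes.CDLL("libnvrtc.so")  # assumed to be on the loader path

src = b"""
extern "C" __global__ void axpy(float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
}
"""

# nvrtcCreateProgram / nvrtcCompileProgram / nvrtcGetPTX are the core
# NVRTC entry points; a real wrapper would check every return code.
prog = ctypes.c_void_p()
nvrtc.nvrtcCreateProgram(ctypes.byref(prog), src, b"axpy.cu", 0, None, None)
nvrtc.nvrtcCompileProgram(prog, 0, None)

size = ctypes.c_size_t()
nvrtc.nvrtcGetPTXSize(prog, ctypes.byref(size))
ptx = ctypes.create_string_buffer(size.value)
nvrtc.nvrtcGetPTX(prog, ptx)
# ptx.value now holds PTX that the CUDA driver API could load.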
