
NVIDIA cub support #3200

Open · sjperkins opened this issue on Aug 6, 2018 · 3 comments
Labels: CUDA (CUDA related issue/PR), feature_request
Assignee: seibert

sjperkins (Contributor) commented:
Feature request

tl;dr I'd like to use NVIDIA cub primitives in a jitted CUDA function, but am not sure how to achieve this.

NVIDIA cub contains many useful warp-, block-, and device-wide primitives (sorts, segmented algorithms, reductions) for CUDA GPU programming. Additionally, these primitives are optimised and maintained for different architectures, reducing the maintenance burden on the user.

I see that cub's device-wide radix sort is wrapped in pyculib_sorting. While it's possible to wrap templated device-wide function calls, I wondered whether there is a reasonable way to support the block- and warp-wide algorithms, which are defined by templated trait/policy classes?

As I understand it, Numba only supports the CUDA Python specification, so supporting templated C++ code in a Pythonic style does not seem obvious to me. However, I don't fully understand Numba's internal typing mechanism, so it isn't clear to me whether it's easy to extend the CUDA Python spec directly, or via some extension in a user library.
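
For reference, the kind of hand-written code I'd like to avoid looks roughly like the sketch below: a Hillis-Steele inclusive scan over a single 128-thread block in today's CUDA Python. This is purely illustrative, not an existing or proposed API; cub's BlockScan encapsulates and tunes exactly this pattern per-architecture.

import numpy as np
from numba import cuda, float64

BLOCK = 128

@cuda.jit
def inclusive_scan(data):
    # Hand-rolled Hillis-Steele inclusive scan over one thread block.
    tid = cuda.threadIdx.x
    sm = cuda.shared.array(BLOCK, float64)
    sm[tid] = data[tid]
    cuda.syncthreads()
    step = 1
    while step < BLOCK:
        # Read before the in-place update, with barriers on both sides,
        # so no thread overwrites a value another thread still needs.
        addend = sm[tid - step] if tid >= step else 0.0
        cuda.syncthreads()
        sm[tid] += addend
        cuda.syncthreads()
        step *= 2
    data[tid] = sm[tid]

arr = cuda.to_device(np.ones(BLOCK, dtype=np.float64))
inclusive_scan[1, BLOCK](arr)  # arr now holds [1, 2, ..., 128]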

sjperkins (Contributor, Author) commented:

To express this another way: given the BlockScan API

template<
    typename T, 
    int BLOCK_DIM_X, 
    BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, 
    int BLOCK_DIM_Y = 1, 
    int BLOCK_DIM_Z = 1, 
    int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockScan< T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH >

is it possible to express the above as

block_scan = cub.BlockScan(numba.float64, 32, cub.BLOCK_SCAN_RAKING, 32, 1)
temp_storage = block_scan.temp_storage()

# In the cub example this is `int thread_data[4]`;
# the closest Numba analogue is presumably a per-thread local array:
thread_data = numba.cuda.local.array(4, numba.int64)

block_scan(temp_storage)(thread_data, thread_data)

sjperkins (Contributor, Author) commented:

After some digging around in the docs, it seems very doable to create a Numba BlockScan type, parameterised on the above template parameters.

It also seems that it may be possible to use the @overload and @overload_method decorators to hook in native C++/CUDA code (using cppimport, for example). Could I get some comment on this?
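
For reference, the basic @overload pattern from numba.extending looks like the following; clamp is just a hypothetical placeholder here, nothing cub-specific, and whether this extension point is usable from the CUDA target would still need confirming:

from numba import njit
from numba.extending import overload

def clamp(x, lo, hi):
    # Pure-Python placeholder; Numba substitutes the implementation
    # returned below when this is called from jitted code.
    raise NotImplementedError

@overload(clamp)
def clamp_overload(x, lo, hi):
    def impl(x, lo, hi):
        return min(max(x, lo), hi)
    return impl

@njit
def demo(x):
    return clamp(x, 0.0, 1.0)

demo(2.5)  # -> 1.0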

stuartarchibald added the feature_request and CUDA (CUDA related issue/PR) labels on Aug 9, 2018
seibert self-assigned this on Aug 9, 2018
seibert (Contributor) commented Aug 27, 2018:

A general implementation of CUB primitives in Numba will be very challenging, because CUB is a C++ template library while Numba translates directly from Python to PTX using libNVVM. Using CUB-style primitives within CUDA Python functions would require either reimplementing the templating mechanism and algorithms from scratch, or refashioning Numba's CUDA target into a CUDA Python to CUDA C++ translator and then using NVRTC to compile the resulting function.

While both of these options are interesting, they are out of scope for Numba itself and would be external projects (that still could depend on Numba).
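
To make the NVRTC route concrete, runtime compilation of a CUDA C++ string to PTX looks roughly like this ctypes sketch; error checking is omitted, and the library name, kernel, and call pattern are illustrative assumptions rather than anything Numba ships:

import ctypes

nvrtc = ctypes.CDLL("libnvrtc.so")  # assumed to be on the loader path

src = b"""
extern "C" __global__ void axpy(float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
}
"""

# nvrtcCreateProgram / nvrtcCompileProgram / nvrtcGetPTX are the core
# NVRTC entry points; a real wrapper would check every return code.
prog = ctypes.c_void_p()
nvrtc.nvrtcCreateProgram(ctypes.byref(prog), src, b"axpy.cu", 0, None, None)
nvrtc.nvrtcCompileProgram(prog, 0, None)

size = ctypes.c_size_t()
nvrtc.nvrtcGetPTXSize(prog, ctypes.byref(size))
ptx = ctypes.create_string_buffer(size.value)
nvrtc.nvrtcGetPTX(prog, ptx)
# ptx.value now holds PTX that the CUDA driver API could load.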
