tl;dr: I'd like to use NVIDIA cub primitives in a jitted CUDA function, but am not sure how to achieve this.
NVIDIA cub contains many useful warp-, block-, and device-wide primitives (sorts, segmented algorithms, reductions) for CUDA GPU programming. Additionally, the primitives are optimised and maintained for different architectures, reducing the maintenance overhead for users.
I see that cub's device-wide radix sort is wrapped in pyculib_sorting. While it's possible to wrap templated device-wide function calls, I wondered whether there is a reasonable way to support the block- and warp-wide algorithms, which are parameterised by templated trait/policy classes.
As I understand Numba, it only supports the CUDA Python specification, so supporting templated C++ code in a Python style does not seem obvious to me. However, I don't fully understand Numba's internal typing mechanism, so it isn't clear to me whether it's easy to extend the CUDA Python spec directly, or via some extension in a user library.
Another way of expressing this is given the BlockScan API:
template <
    typename T,
    int BLOCK_DIM_X,
    BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING,
    int BLOCK_DIM_Y = 1,
    int BLOCK_DIM_Z = 1,
    int PTX_ARCH = CUB_PTX_ARCH>
class cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>
is it possible to express the above as
block_scan = cub.BlockScan(numba.float64, 32, cub.BLOCK_SCAN_RAKING, 32, 1)
temp_storage = block_scan.temp_storage()
# In the cub example this is
#     int thread_data[4]
# Not sure if this is the correct analogue for numba
thread_data = numba.int64(4)
block_scan(temp_storage)(thread_data, thread_data)
After some digging around in the docs, it seems that it's very doable to create a Numba BlockScan type, parameterised on the above template parameters.
It seems that it may be possible to use the @overload and @overload_method decorators to back these with native C++/CUDA code (using cppimport, for example). Could I get some comment on this?
A general implementation of CUB primitives in Numba will be very challenging because CUB is a C++ template library, and Numba translates directly from Python to PTX using libNVVM. Being able to use CUB-style primitives within CUDA Python functions would require reimplementing the templating mechanism and algorithms from scratch, or refashioning the CUDA target in Numba as a CUDA Python to CUDA C++ translator, and then using NVRTC to compile the resulting function.
While both of these options are interesting, they are out of scope for Numba itself and would be external projects (that still could depend on Numba).