
Deviations from CUDA


There are a couple of deviations from the CUDA runtime specification.

SetDevice

Every thread is required to call SetDevice before any other function is invoked, with the exception of queries for the number of available devices and their properties. SetDevice will find or create the logical device associated with the calling thread. Two or more threads can operate on the same logical device at the same time (see the implementation notes on Contention).
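
A minimal sketch of this rule, assuming the vuda::setDevice and vuda::getDeviceCount entry points used in the VUDA samples; the worker function is a placeholder:

    #include <vuda.hpp>
    #include <thread>

    // Each thread must bind itself to a logical device before making any
    // other VUDA call; device-count/property queries are the only exception.
    void worker(int device)
    {
        vuda::setDevice(device); // finds or creates the logical device for this thread
        // ... allocations, copies and kernel launches issued by this thread
        //     now target this device
    }

    int main()
    {
        int count = 0;
        vuda::getDeviceCount(&count); // allowed before setDevice

        // two threads sharing logical device 0 at the same time is permitted
        std::thread t0(worker, 0), t1(worker, 0);
        t0.join();
        t1.join();
        return 0;
    }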

Kernel launches

Kernels in VUDA are simply Vulkan shader modules created at run-time from code in the SPIR-V intermediate language. VUDA does not know or care what the original source of the kernel was, be it GLSL, HLSL, OpenCL or something else. There is always some overhead associated with the first launch of a kernel that specifies a new shader module, since the shader module has to be created.

The syntax for kernel launches differs from that of CUDA, but mimics cudaLaunchKernel closely.

There are two different ways of launching a kernel in VUDA:

  • Passing the explicit filename of a SPIR-V binary file, e.g. *.spv. This introduces an overhead from the file access (see the launch sketch after this list).

  • Passing an embedded string directly to VUDA. One way of creating an embeddable source string is to use the unix tool xxd, e.g. for a SPIR-V binary with filename "kernel.spv":

    xxd -i kernel.spv kernel.spv.h

    This will create an unsigned char array that can be passed to a VUDA launch call (see the embedded code sample). Alternatively, clspv provides options for emitting the binary as a C initializer list, suitable for embedding a kernel.
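
A minimal launch sketch, assuming the vuda::launchKernel signature used in the VUDA samples; the module name "add.spv", its entry point "main" and the kernel arguments are placeholders:

    #include <vuda.hpp>

    int main()
    {
        vuda::setDevice(0);

        const int N = 1024;
        int a[N], b[N], c[N];
        for(int i = 0; i < N; ++i) { a[i] = -i; b[i] = i * i; }

        int *dev_a, *dev_b, *dev_c;
        vuda::malloc((void**)&dev_a, N * sizeof(int));
        vuda::malloc((void**)&dev_b, N * sizeof(int));
        vuda::malloc((void**)&dev_c, N * sizeof(int));
        vuda::memcpy(dev_a, a, N * sizeof(int), vuda::memcpyHostToDevice);
        vuda::memcpy(dev_b, b, N * sizeof(int), vuda::memcpyHostToDevice);

        // launch the shader module "add.spv" at entry point "main" on stream 0,
        // with 'blocks' blocks of 'threads' threads each (CUDA terminology)
        const int blocks = 128, threads = 128, stream_id = 0;
        vuda::launchKernel("add.spv", "main", stream_id, blocks, threads, dev_a, dev_b, dev_c, N);

        // copying the result back to the host also synchronizes the stream
        vuda::memcpy(c, dev_c, N * sizeof(int), vuda::memcpyDeviceToHost);

        vuda::free(dev_a);
        vuda::free(dev_b);
        vuda::free(dev_c);
        return 0;
    }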

Cache and shared memory configurations

There is no way to specify a preferred cache configuration by calling cudaFuncSetCacheConfig, e.g. to increase the size of the L1 cache at the expense of a smaller shared memory cache or vice versa. The amount of shared memory is determined by the Vulkan specification. Likewise, there is no way to specify the cacheability of global memory accesses, i.e. through the compiler arguments -Xptxas -dlcm=cg or -Xptxas -dlcm=ca. To summarize, there is no way to perform the 2x2 experiment with load caching and L1 size, {ca, cg} x {small L1, large L1}. Finally, there is no way to change the shared memory bank size via a function like cudaFuncSetSharedMemConfig.

Device properties

Some device properties are not available through GetDeviceProperties, as they are vendor specific or simply do not have an appropriate counterpart in Vulkan.
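
A small query sketch, assuming vuda::getDeviceCount, vuda::getDeviceProperties and a vuda::deviceProp structure that mirrors cudaDeviceProp; which fields carry meaningful values depends on the Vulkan driver and vendor:

    #include <vuda.hpp>
    #include <cstdio>

    int main()
    {
        // device-count and property queries are the one exception to the
        // "call SetDevice first" rule
        int deviceCount = 0;
        vuda::getDeviceCount(&deviceCount);

        for(int deviceID = 0; deviceID < deviceCount; ++deviceID)
        {
            vuda::deviceProp prop;
            vuda::getDeviceProperties(&prop, deviceID);
            // only properties with a Vulkan counterpart are filled in;
            // vendor-specific fields may hold default values
            std::printf("device %d: %s\n", deviceID, prop.name);
        }
        return 0;
    }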