Implementation Details


File structure

VUDA is a header-only library; it works by including the file vuda.h located in the root of the library. Any source file can include VUDA, as the library supports multiple translation units (a minimal usage sketch follows the list of subfolders below). The library has three subfolders:

  • /api, exposes the library interface.
  • /setup, contains macros, type definitions and the debug implementation.
  • /state, contains the actual implementation details.
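
As a usage illustration, the sketch below includes vuda.h from an ordinary C++ source file. The vuda::setDevice, vuda::malloc, vuda::memcpy and vuda::free names are assumed to mirror their CUDA runtime counterparts; check the headers under /api for the exact spellings and signatures.

```cpp
// main.cpp - a minimal sketch; the vuda::* names are assumed to mirror the CUDA runtime.
#include <vuda.h>

int main()
{
    vuda::setDevice(0);                        // bind this thread to device 0

    const int N = 1024;
    int host[N] = {};
    int* dev = nullptr;

    vuda::malloc((void**)&dev, N * sizeof(int));
    vuda::memcpy(dev, host, N * sizeof(int), vuda::memcpyHostToDevice);

    // ... launch a SPIR-V kernel, copy the result back ...

    vuda::free(dev);
    return 0;
}
```

A second source file can include vuda.h in exactly the same way; because the library supports multiple translation units, this does not cause multiple-definition problems.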

The internal layout

VUDA has three singleton classes

  • Instance: VUDA uses a single Vulkan instance, which is created on the first VUDA function call (a lazy-initialization sketch follows this list).
  • Interface logical devices: mapping of physical devices to logical devices.
  • Interface thread info: mapping of threads to logical devices.
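
Below is a minimal sketch of the lazy-initialization pattern such a singleton can use, illustrating "created on the first VUDA function call". The class and member names are hypothetical, not the library's own.

```cpp
#include <vulkan/vulkan.h>

// Hypothetical illustration: a Meyers singleton that creates the Vulkan instance
// the first time any caller asks for it.
class instance_singleton
{
public:
    static instance_singleton& get()
    {
        static instance_singleton single;   // constructed once, on first use; thread-safe since C++11
        return single;
    }
    VkInstance handle() const { return m_instance; }

private:
    instance_singleton()
    {
        VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
        app.apiVersion = VK_API_VERSION_1_1;

        VkInstanceCreateInfo info = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
        info.pApplicationInfo = &app;

        vkCreateInstance(&info, nullptr, &m_instance);
    }
    ~instance_singleton() { vkDestroyInstance(m_instance, nullptr); }

    VkInstance m_instance = VK_NULL_HANDLE;
};
```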

Lock contention

There are multiple scenarios in which an application can experience lock contention (not an exhaustive list); a sketch of the underlying create-once locking pattern follows the list.

  • If multiple threads attempt to assign a new target device via SetDevice at the same time. Creating a logical device requires an exclusive lock.
  • If multiple threads attempt to launch the same kernel for the first time simultaneously. Creating a shader module requires an exclusive lock.
  • Submitting to the same stream from multiple threads at the same time. Appending to a stream/queue requires an exclusive lock.
  • Most creation and freeing of resources requires an exclusive lock, e.g. internal memory allocations, streams/queues and events.
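
Several of these cases follow the same "create once under an exclusive lock, then read under a shared lock" pattern. The sketch below is a generic illustration of that pattern using std::shared_mutex and a hypothetical shader-module cache; it is not VUDA's actual data structure.

```cpp
#include <memory>
#include <shared_mutex>
#include <string>
#include <unordered_map>

struct shader_module { /* VkShaderModule handle, pipeline, ... */ };

class shader_cache
{
public:
    // The first caller per identifier pays for creation under the exclusive lock
    // (the contention point); later callers only take the shared lock.
    std::shared_ptr<shader_module> get_or_create(const std::string& identifier)
    {
        {
            std::shared_lock<std::shared_mutex> read(m_mtx);
            auto it = m_modules.find(identifier);
            if (it != m_modules.end())
                return it->second;
        }
        std::unique_lock<std::shared_mutex> write(m_mtx);
        auto& entry = m_modules[identifier];
        if (!entry)
            entry = std::make_shared<shader_module>();   // e.g. vkCreateShaderModule(...)
        return entry;
    }

private:
    std::shared_mutex m_mtx;
    std::unordered_map<std::string, std::shared_ptr<shader_module>> m_modules;
};
```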

Streams and queues

Vulkan can expose several queue families for a device. VUDA will look for a dedicated family that exposes compute and transfer capabilities and, for this family, create as many queues as the family exposes. These queues serve as streams.
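
The selection described above can be illustrated with the plain Vulkan API. The heuristic below (prefer a family that advertises compute but not graphics, fall back to any compute-capable family) is a sketch of one possible policy, not necessarily the exact one the library applies.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Find a dedicated compute/transfer queue family; fall back to any compute-capable family.
// Per the Vulkan specification, a compute-capable queue also supports transfer operations.
uint32_t find_compute_family(VkPhysicalDevice physical)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physical, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physical, &count, families.data());

    uint32_t fallback = UINT32_MAX;
    for (uint32_t i = 0; i < count; ++i)
    {
        const bool compute  = (families[i].queueFlags & VK_QUEUE_COMPUTE_BIT) != 0;
        const bool graphics = (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) != 0;

        if (compute && !graphics)
            return i;            // dedicated family; families[i].queueCount queues become streams
        if (compute && fallback == UINT32_MAX)
            fallback = i;
    }
    return fallback;
}
```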

For host-to-device transfers, VUDA will look for a dedicated family that exposes the transfer capability (todo).

Synchronization behaviors

All memcpy/memset operations in VUDA conform to the API synchronization behavior.

VUDA does not have any notion of a (legacy) default stream. All streams are equivalent and behave as CUDA streams do under the per-thread default stream setting (see stream synchronization behavior).
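
A sketch of what this means in practice: two host threads can target the same device and submit to different streams without implicit synchronization between them. The vuda::launchKernel call follows the pattern in the project README; vuda::streamSynchronize is assumed to mirror cudaStreamSynchronize, so verify the exact names in /api.

```cpp
#include <vuda.h>
#include <thread>

// Each thread submits to its own stream; streams are equivalent and independent.
void worker(const int stream_id, int* dev_data, const int N)
{
    vuda::setDevice(0);                       // per-thread device association
    // blocks/threads chosen arbitrarily; "kernel.spv" is a placeholder shader module
    vuda::launchKernel("kernel.spv", "main", stream_id, 128, 128, dev_data, N);
    vuda::streamSynchronize(stream_id);       // assumed name, mirroring cudaStreamSynchronize
}

int main()
{
    vuda::setDevice(0);
    const int N = 1 << 20;
    int *dev_a = nullptr, *dev_b = nullptr;
    vuda::malloc((void**)&dev_a, N * sizeof(int));
    vuda::malloc((void**)&dev_b, N * sizeof(int));

    std::thread t0(worker, 0, dev_a, N);      // stream 0
    std::thread t1(worker, 1, dev_b, N);      // stream 1
    t0.join();
    t1.join();

    vuda::free(dev_a);
    vuda::free(dev_b);
    return 0;
}
```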

Memory heap allocations

VUDA has a simple memory allocator that allocates in chunks of 256 MB from each heap in use, unless a specific allocation has a larger requirement.
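
A rough sketch of that policy, assuming a simple bump-pointer sub-allocator per chunk; the names and data structures are illustrative only, not the library's own.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kDefaultChunkSize = 256ull * 1024 * 1024;   // 256 MB

struct chunk
{
    std::size_t capacity;
    std::size_t offset = 0;          // bump pointer for this sketch
};

struct heap_allocator
{
    std::vector<chunk> chunks;

    // Returns (chunk index, offset) for a sub-allocation of 'size' bytes.
    std::pair<std::size_t, std::size_t> allocate(std::size_t size)
    {
        for (std::size_t i = 0; i < chunks.size(); ++i)
        {
            if (chunks[i].capacity - chunks[i].offset >= size)
            {
                const std::size_t off = chunks[i].offset;
                chunks[i].offset += size;
                return { i, off };
            }
        }
        // No room: create a new chunk of at least 256 MB, larger only if the request demands it.
        chunks.push_back({ std::max(kDefaultChunkSize, size), size });
        return { chunks.size() - 1, 0 };
    }
};
```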

Below is how the different memory allocations are mapped to specific Vulkan memory types; a memory-type selection sketch follows the list.

  • cudaMalloc allocates from a memory type with DEVICE_LOCAL (usually not HOST_VISIBLE, unless the GPU shares memory with the host, as with Intel HD Graphics or an AMD APU). This type is guaranteed to exist by the Vulkan specification.
  • cudaHostAlloc with cudaHostAllocWriteCombined allocates HOST_VISIBLE and HOST_COHERENT. With this memory type, CPU writes are write-combined and CPU reads are uncached. This type is also used internally for staging uploads to the device. This type is guaranteed to exist by the Vulkan specification.
  • cudaHostAlloc with cudaHostAllocDefault (or cudaMallocHost) allocates HOST_VISIBLE and HOST_CACHED (not necessarily HOST_COHERENT). If no such memory type exists, it will fall back to HOST_VISIBLE and HOST_COHERENT. With this type of memory, CPU reads and writes go through the CPU cache hierarchy. This type is used internally for staging downloads from the device.
  • (todo) Some devices have special heaps that are particularly well suited for host-to-device transfers.
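
The mapping above can be expressed with the plain Vulkan API. The helper below picks the first memory type containing a preferred set of property flags and otherwise falls back to a second set, matching the HOST_CACHED-to-HOST_COHERENT fallback described for cudaMallocHost; it is a sketch, not the library's actual selection code.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Pick a memory type whose propertyFlags contain 'preferred'; otherwise fall back to 'fallback'.
uint32_t find_memory_type(VkPhysicalDevice physical,
                          VkMemoryPropertyFlags preferred,
                          VkMemoryPropertyFlags fallback)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(physical, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
        if ((props.memoryTypes[i].propertyFlags & preferred) == preferred)
            return i;

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
        if ((props.memoryTypes[i].propertyFlags & fallback) == fallback)
            return i;

    return UINT32_MAX;   // no suitable type found
}

// Example: the cudaMallocHost-style request described above.
// uint32_t idx = find_memory_type(physical,
//     VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT,
//     VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
```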

Device memory heaps and types can be inspected in the Vulkan hardware database.