In [1]:
GPU_PKG_NAME = "AMDGPU"; include("common_definitions.jl");

Not everything is peachy in kernel land; some things that you can easily do on the Julia host, you can't so easily do when executing on the GPU.

For example, Julia on the host has access to a fast RNG that can be called from multiple threads:

In [20]:
X = rand(4, 8)

4×8 Matrix{Float64}:
 0.992543  0.300562   0.247534  0.248511  …  0.517549  0.705719  0.253429
 0.396288  0.0460697  0.595105  0.390061     0.291689  0.865059  0.453826
 0.928109  0.959506   0.139073  0.786756     0.12094   0.682744  0.173941
 0.831055  0.701795   0.318094  0.452323     0.490569  0.221822  0.416229

When using a GPU computing library, it's easy to get random numbers from the vendor's RNG library:

In [2]:
X = GPUMOD.rand(4, 8)

4×8 ROCMatrix{Float32}:
 0.257441  0.553337  0.519232    0.486345  …  0.975905  0.034095  0.398251
 0.973872  0.310398  0.273645    0.457593     0.440681  0.481178  0.373012
 0.10861   0.895063  0.370645    0.084692     0.198199  0.608729  0.581116
 0.5859    0.270586  0.00915819  0.486293     0.060693  0.756005  0.29821

However, note that this call is driven by the host; generating random numbers directly inside a GPU kernel is much trickier, and only became convenient recently (and only for CUDA users).

In [14]:
using BenchmarkTools

if GPU_PKG_NAME == "CUDA"
    @kernel function kernel(X)
        idx = @index(Global, Linear)
        X[idx] += GPUMOD.rand()
    end
    k = kernel(GpuBackend)
    function bench()
        kernels = [k(X; ndrange=32) for i in 1:100]
        wait.(kernels);
    end
    @benchmark bench()
else
    println(":'(")
end

:'(


For the AMD and Intel users out there, or for CUDA users who can't use this functionality, we can fall back to generating the random numbers from the host and explicitly passing them into the GPU kernel. We'll make sure to allocate one random number for each thread that'll be launched; if you launch multiple blocks, or use multiple dimensions, make sure to account for that!
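
To make the sizing caveat concrete, here is a small sketch in plain Julia. The helpers `nrandoms` and `linear_index` are hypothetical (they are not part of any GPU package), `rand(Float32, n)` stands in for `GPUMOD.rand(n)` so the example runs without a GPU, and the sketch assumes the kernel's linear index walks `ndrange` in Julia's usual column-major order.

```julia
# Hypothetical helpers, for illustration only.

# Number of random values one launch needs: one per element of `ndrange`.
nrandoms(ndrange::Integer) = Int(ndrange)
nrandoms(ndrange::Dims) = prod(ndrange)

# Linear index of the thread at Cartesian position `I` within `ndrange`,
# assuming column-major ordering.
linear_index(ndrange::Dims, I::Integer...) = LinearIndices(ndrange)[I...]

ndrange = (4, 8)                          # a two-dimensional launch
R = rand(Float32, nrandoms(ndrange))      # CPU stand-in for GPUMOD.rand
last_idx = linear_index(ndrange, 4, 8)    # index used by the last thread
```

With these sizes the last thread reads the final element of `R`, so no thread indexes out of bounds.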

In [11]:
@kernel function kernel(X, R)
    idx = @index(Global, Linear)
    X[idx] += R[idx]
end
k = kernel(GpuBackend)
function bench()
    kernels = [k(X, GPUMOD.rand(32); ndrange=32)]
    wait.(kernels);
end
@benchmark bench()

BenchmarkTools.Trial: 2087 samples with 1 evaluation.
 Range (min … max):  1.349 ms …   3.400 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.360 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.387 ms ± 130.514 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

This will work OK, but it won't give you very good performance if the host has to allocate a fresh array of random numbers before every kernel launch. Instead, we could pre-allocate a large buffer of random numbers, and pass an integer index that selects which row of the buffer each launch uses.

In [12]:
@kernel function kernel(X, R, ridx)
    idx = @index(Global, Linear)
    X[idx] += R[ridx, idx]
end
k = kernel(GpuBackend)
function bench()
    R = GPUMOD.rand(100, 32)
    kernels = [k(X, R, i; ndrange=32) for i in 1:100]
    wait.(kernels);
end
@benchmark bench()

BenchmarkTools.Trial: 585 samples with 1 evaluation.
 Range (min … max):  7.339 ms … 53.477 ms  ┊ GC (min … max): 0.00% … 13.03%
 Time  (median):     7.450 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.548 ms ±  6.011 ms  ┊ GC (mean ± σ):  1.97% ±  2.32%

Clearly, this still leaves something to be desired: what happens when we run out of random numbers? We could launch only as many kernels at a time as we have unique rows of random numbers, then refill the random array and continue launching kernels. That works, although to be safe you'll have to wait for all of the currently executing kernels to finish before refilling.
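
The bookkeeping for that refill strategy might look like the sketch below. `RandomPool` and `next_row!` are hypothetical names, the pool lives in a CPU `Matrix` so the example runs anywhere (on a GPU you would presumably fill it with `GPUMOD.rand` instead), and the `sync` callback is a placeholder for whatever waits on the outstanding kernels.

```julia
using Random

# Hypothetical helper: a pool of pre-generated random rows plus a cursor.
mutable struct RandomPool
    data::Matrix{Float32}   # one row per launch, one column per thread
    next::Int               # index of the next unused row
end

RandomPool(nlaunches::Int, nthreads::Int) =
    RandomPool(rand(Float32, nlaunches, nthreads), 1)

# Return the index of an unused row, refilling the pool once it is exhausted.
# `sync` runs before the refill; on a GPU it should block until every kernel
# that still reads from the pool has finished.
function next_row!(pool::RandomPool; sync = () -> nothing)
    if pool.next > size(pool.data, 1)
        sync()
        rand!(pool.data)    # overwrite the whole pool with fresh numbers
        pool.next = 1
    end
    row = pool.next
    pool.next += 1
    return row
end

pool = RandomPool(4, 32)
nsyncs = Ref(0)
rows = [next_row!(pool; sync = () -> (nsyncs[] += 1)) for _ in 1:10]
```

Each value in `rows` would be passed as `ridx` to one kernel launch; the pool is refilled (and `sync` called) every time the cursor runs past the last row.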

Is there anything further we can do? If you're on an AMD GPU, you can do something a bit fancier: use the hostcall mechanism to have the GPU request random numbers from the CPU as needed.

In [7]:
if GPU_PKG_NAME == "AMDGPU"
    @kernel function kernel(X, hc)
        idx = @index(Global, Linear)
        R = hostcall!(hc) # ask the host for a device array of random numbers
        X[idx] += R[idx]
    end
    k = kernel(GpuBackend)

    # The hostcall returns a ROCDeviceArray{Float32,1,1} and takes no arguments;
    # continuous=true keeps it available for repeated calls.
    hc = HostCall(ROCDeviceArray{Float32,1,1}, Tuple{}; continuous=true) do
        R = AMDGPU.rand(32)
        return rocconvert(R) # make it device-compatible
    end

    function bench()
        kernels = [k(X, hc; ndrange=32) for i in 1:100]
        wait.(kernels);
    end
    @benchmark bench()
else
    println(":(")
end

BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.112 s …  1.129 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.126 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.124 s ± 6.626 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

This approach, written as-is, will perform poorly, since the hostcall task is currently single-threaded and spends most of its time communicating with the GPU; real applications will generally launch far larger kernels, which should make this approach feasible in certain situations.


Regardless of which kind of GPU you use, situations like these inevitably come up, and aren't just related to random numbers. You should always aim to structure your program to let the CPU do as little work as possible, and let the GPU do the heavy lifting. This might be accomplished by pre-allocating large buffers all at once, using task parallelism to minimize the latency of CPU-bound operations, and even trying your hand at re-implementing functionality (like RNGs) as GPU kernels.

(Of course, it's probably best if you don't implement an RNG by hand for anything security-related!)
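
As a rough illustration of re-implementing an RNG so that it could run inside a kernel, here is a minimal sketch of a per-thread xorshift32 generator in plain Julia. The names `xorshift32`, `seed_state`, `unit_float`, and `add_noise!` are hypothetical, the `for` loop stands in for the GPU threads so the example runs on the CPU, and the seeding scheme is an assumption chosen for brevity rather than statistical quality.

```julia
# One xorshift32 step: maps a nonzero 32-bit state to the next nonzero state.
function xorshift32(s::UInt32)
    s ⊻= s << 13
    s ⊻= s >> 17
    s ⊻= s << 5
    return s
end

# Hypothetical seeding: give every thread its own nonzero starting state.
function seed_state(idx::Integer, seed::UInt32)
    s = (idx % UInt32) * 0x9e3779b9 + seed   # UInt32 arithmetic wraps around
    return s == 0 ? 0x00000001 : s           # xorshift must never start at zero
end

# Map the top 24 bits of the state to a Float32 in [0, 1).
unit_float(s::UInt32) = Float32(s >> 8) / 16777216f0

# Stand-in for the kernel body: each "thread" idx advances its own state
# and adds one random number to its element of X.
function add_noise!(X, states)
    for idx in 1:length(X)
        s = xorshift32(states[idx])
        states[idx] = s
        X[idx] += unit_float(s)
    end
    return X
end

X = zeros(Float32, 4, 8)
states = UInt32[seed_state(i, 0x12345678) for i in 1:length(X)]
add_noise!(X, states)
```

The body of the loop uses only integer shifts, XORs, and one array read and write per thread, so it should carry over to a kernel in which `idx` comes from `@index(Global, Linear)` and `states` lives in GPU memory; the states persist between launches, so no host round-trip is needed.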