In [1]:
using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()

using CUDA, BenchmarkTools


[32m[1m  Activating[22m[39m project at `~/course`
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mPrecompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mPrecompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]


# Array programming

In this notebook, I'll explain how to use the `CuArray` type to program the GPU. This is a convenient programming model that does not require detailed knowledge of the GPU, but there are still some noteworthy tips and tricks that can significantly impact performance.

It all starts with the `CuArray` type, which serves a dual purpose:

- a managed container for GPU memory
- a way to dispatch to operations that execute on the GPU

In [2]:
A = CuArray([1. 2.; 3. 4.])


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0  2.0
 3.0  4.0

A common shorthand for creating a `CuArray` is the `cu` function. It behaves like a recursive but opinionated constructor:
- it descends into structures, e.g., `cu(Adjoint([1, 2])) == Adjoint(CuArray([1, 2]))`
- it converts slow `Float64` into much faster `Float32`

In [3]:
cu([1. 2.; 3. 4.]')


2×2 adjoint(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}) with eltype Float32:
 1.0  3.0
 2.0  4.0

In [4]:
# compare to
CuArray([1. 2.; 3. 4.]')


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0  3.0
 2.0  4.0

Memory management will be discussed in detail in a later notebook, but for now it's enough to remember that a `CuArray` is **a CPU object representing memory on the GPU**. The memory is freed automatically once all references to the array are gone and the garbage collector runs.

The goal of `CuArray` is to make it easy to program GPUs using array operations:

In [5]:
# this will automatically use CUBLAS
A * A


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  7.0  10.0
 15.0  22.0

In [6]:
# whereas this operation will use a native broadcast kernel
A .* A


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0   4.0
 9.0  16.0

This works by specializing certain methods with a GPU-specialized implementation, either for:
- compatibility: not all CPU implementations work on the GPU
- performance: GPUs have a different programming model so might require optimized implementations

This generally works pretty well: the goal is to get as close to the CPU `Array` type's functionality as possible, and entire applications have been built on top of `CuArray`'s array functionality.

## Higher-order functionality

The broadcast expression `A .* A` may look like a simple, special-purpose element-wise multiplication, but it is syntactic sugar for a much more generic operation:

In [6]:
elwise_op(a, b) = a * b
broadcast(elwise_op, A, A)


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0   4.0
 9.0  16.0

This is an example of a higher-order operation, i.e., an operation that takes a function as an argument. This is a very powerful concept, because it makes it possible to *compose* the library definition of an operation, here `broadcast`, with user-provided code. This is possible in Julia because we have a JIT compiler, and in many cases makes it possible to write custom GPU code without ever having to write a kernel function.

Passing user code to a function in Julia can also be done using the `do-block` syntax:

In [7]:
broadcast(A, A) do a, b
    a * b
end


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0   4.0
 9.0  16.0

But the most convenient syntax is of course the dot syntax, where e.g. `f.(A .+ B)` is equivalent to broadcasting a function that computes `f(a + b)` for each element `a` in `A` and `b` in `B`. It's important to note that such a dot expression, although it performs multiple operations, results in only a single broadcast invocation, i.e., we have **syntactic dot fusion**.
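As a minimal sketch of the fusion described above, the snippet below checks that a dotted expression matches a single explicit `broadcast` call. It uses plain CPU `Array`s so it runs without a GPU; the names `X`, `Y`, and `out` are illustrative, and the same expressions are expected to work on `CuArray` inputs.

```julia
X = [1.0 2.0; 3.0 4.0]
Y = [3.0 2.0; 1.0 0.0]

# The dotted expression is lowered to one fused broadcast over both inputs...
fused = sqrt.(X .+ Y)

# ...which is equivalent to broadcasting a single combined function.
explicit = broadcast((x, y) -> sqrt(x + y), X, Y)

# `.=` fuses the assignment as well, writing into existing storage.
out = similar(X)
out .= sqrt.(X .+ Y)

fused == explicit == out
```

Because the whole expression is one broadcast, no intermediate array for `X .+ Y` is created.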

Julia defines a variety of these higher-order operations, many of which are implemented by the GPU back-ends. For example, there's also `map`, similar to `broadcast` but without the, well, broadcasting property that allows for mismatching sizes:

In [9]:
map(A) do a
    a * 2
end


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 2.0  4.0
 6.0  8.0
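
The broadcasting property mentioned above can be seen in a small sketch, again on CPU arrays so it runs without a GPU (the names `M`, `col`, and `row` are illustrative): dimensions of size one are virtually expanded to match the other argument.

```julia
M   = [1 2; 3 4]
col = [10, 20]     # 2-element vector, treated as a 2×1 column
row = [10 20]      # 1×2 row matrix

# The column is repeated across the columns of M, the row across its rows.
plus_col = M .+ col   # [11 12; 23 24]
plus_row = M .+ row   # [11 22; 13 24]
```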

### Reductions

Another important higher-order operation is `mapreduce`, which can be used to map & reduce any part of an N-dimensional array. For example, one of the simplest invocations:

In [10]:
A = cu([1, 2])
reduce(+, A)


3

This reduces to a scalar, which requires synchronizing the GPU. Instead, you can also reduce to a one-element array, which will then only synchronize the GPU when fetching the contents of that array. This is done with the `dims` keyword, which specifies which dimensions to reduce over:

In [11]:
reduce(+, A; dims=1)


1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 3

The `dims` keyword also makes it possible to perform multiple reductions at once, e.g., when the last dimension of an array represents the batch:

In [13]:
A = CUDA.rand(10, 10, 3)
reduce(+, A; dims=[1,2])


1×1×3 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}:
[:, :, 1] =
 53.496254

[:, :, 2] =
 47.58113

[:, :, 3] =
 46.423695

The `reduce` operation is part of a family of operations that all build on the `mapreduce` operation, with specializations like `sum`, `prod`, `any`, `all`, etc.
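
To make the relationship concrete, here is a small sketch (on a CPU `Vector` with the illustrative name `V`, so it runs without a GPU) that spells a few of these specializations out as explicit `mapreduce` or `reduce` calls:

```julia
V = [1.0, -2.0, 3.0]

# `sum(f, V)` maps `f` over the elements and reduces with `+`.
sum_sq = mapreduce(abs2, +, V)            # 1 + 4 + 9
same_sum = sum_sq == sum(abs2, V)

# `prod` reduces with `*`.
same_prod = reduce(*, V) == prod(V)

# `any(pred, V)` maps a predicate and reduces with logical or.
any_neg = mapreduce(x -> x < 0, |, V)
same_any = any_neg == any(x -> x < 0, V)
```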

A version of `reduce` that maintains the intermediate results is `accumulate`:

In [26]:
A = cu(ones(5))
accumulate(+, A)


5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0
 2.0
 3.0
 4.0
 5.0

## Module-level functionality

Some Julia APIs do not take an array argument, and as such cannot be specialized for GPU execution. Examples include: `rand`, `fill`, `zeros`, etc. For these functions, CUDA provides unexported replacements, e.g., `CUDA.rand`, `CUDA.fill`, etc.

In [5]:
CUDA.zeros(1)


1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.0

In [8]:
CUDA.rand(Float64, 2, 2)


2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.810514  0.969421
 0.403801  0.672175

## Common issues

Before trying this out, let's take a look at some issues that are common when using arrays to program the GPU.

### Scalar iteration

A key performance issue comes from the fact that a `CuArray` instance is a CPU object representing a chunk of memory on the GPU. That means we invoke the GPU for every CPU operation invoked on a CuArray. That is OK for array operations, where the GPU will have to do a bunch of work, but is very bad when you have CPU code performing a bunch of small scalar operations:

In [None]:
A = CuArray(1:10)
A_sum = zero(eltype(A))
for I in eachindex(A)
    A_sum += A[I]
end
A_sum


55

This programming pattern, iterating over the array and fetching one scalar at a time (hence 'scalar iteration'), is so slow that CUDA.jl warns about it. With the above snippet the situation is even worse: not only does every iteration require a GPU operation to fetch an element, but the `getindex` call is also the only array operation, so the actual summation doesn't even run on the GPU!

The solution here is to use the `sum` function that performs the entire operation as a single step.
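
A minimal sketch of that replacement follows. The name `xs` is illustrative, and the fallback to a CPU array is an assumption added only so the snippet also runs on a machine without a usable GPU:

```julia
using CUDA

# Assumption: fall back to a CPU array when no GPU is usable, so the snippet still runs.
xs = CUDA.functional() ? CuArray(1:10) : collect(1:10)

# One reduction call instead of ten separate element fetches.
xs_sum = sum(xs)
```

The loop-based version and this one compute the same value, but here the summation runs as a single array operation.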

To disallow scalar iteration, use the `allowscalar` function:

In [None]:
CUDA.allowscalar(false)
A[1]


LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

You should generally disallow scalar iteration like this! It is not disallowed by default in interactive sessions because allowing it simplifies porting CPU code, and because it is easy to trigger scalar iteration from paths that are not performance-sensitive (e.g. display methods):

In [None]:
A'


1×10 adjoint(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}) with eltype Int64:
 1  2  3  4  5  6  7  8  9  10

In [None]:
view(A', :, :)


ErrorException: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

Because of how Julia's type system works, it's easy to trigger non GPU-specialized methods when using array wrappers. Still, for non-interactive code it's recommended to always disable scalar iteration.

Sometimes, however, scalar iteration is perfectly fine. For example, when fetching the result of a reduction:

In [None]:
A = CUDA.rand(1024)
R = sum(A; dims=1)
CUDA.@allowscalar R[]


The above can be useful when the result of a reduction isn't immediately used, because reducing to an array can be executed asynchronously, while reducing to a scalar cannot.

It's also possible to avoid the issue of scalar iteration altogether by using unified memory, but more on that in a later notebook.

### Calling into C libraries

Another common issue arises when calling CPU-specific code, e.g. in some C library, using a GPU array. This generally does not work, because GPU pointers are not dereferenceable on the CPU. To prevent this from crashing, we introduce a GPU-specific pointer type and disallow conversions:

In [None]:
ccall(:whatever, Nothing, (Ptr{Float32},), CUDA.rand(1))


LoadError: ArgumentError: cannot take the CPU address of a CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}

In that case, either you need to use different (supported) array operations, or fix the implementation in CUDA.jl. Such a fix can mean using functions from a CUDA library, using existing operations, or writing your own kernel.

Once again, if you need to support this, you can also consider using unified memory. More on that later.

## Exercise: Matrix RMSE

As a simple exercise, try to implement a function that computes the RMSE of two matrices on the GPU using array operations:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (A_i - B_i)^2}
$$

Benchmark the implementation against the CPU version.

Let's start with a CPU implementation:

In [11]:
rmse(A, B) = sqrt(sum((A-B).^2) / length(A))

A = rand(1024, 1024)
B = rand(1024, 1024)
rmse(A, B)


0.40838313f0

To 'port' this to the GPU, just change the type of the input arrays to `CuArray` and the computation just works:

In [12]:
dA = CuArray(A)
dB = CuArray(B)
rmse(dA, dB)


0.40838313f0

Results look identical, so let's try benchmarking.

In [13]:
using BenchmarkTools
@benchmark rmse($A, $B)


BenchmarkTools.Trial: 3531 samples with 1 evaluation.
 Range (min … max):  1.268 ms …   4.252 ms  ┊ GC (min … max): 0.00% … 12.15%
 Time  (median):     1.305 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.408 ms ± 239.687 μs  ┊ GC (mean ± σ):  6.75% ± 10.84%

In [14]:
@benchmark rmse($dA, $dB)


BenchmarkTools.Trial: 4026 samples with 1 evaluation.
 Range (min … max):  332.168 μs … 742.883 ms  ┊ GC (min … max): 0.00% … 0.57%
 Time  (median):     867.591 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.235 ms ±  16.529 ms  ┊ GC (mean ± σ):  0.16% ± 0.01%

The GPU version is faster: the median drops from about 1.3 ms to about 0.87 ms, and the best case to about 0.33 ms. Note that this only measures the computation itself, not the time needed to transfer the data to the GPU.
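
One possible refinement, sketched here under assumptions: the `rmse` definition above materializes `A-B` and its element-wise square as temporary arrays. The hypothetical `rmse_fused` below passes both inputs to `mapreduce` directly. On `CuArray` inputs this is expected to avoid those intermediates, though that depends on the CUDA.jl version and has not been benchmarked here. The example uses small CPU matrices (`P`, `Q`) so it runs without a GPU, and `rmse_reference` repeats the notebook's definition under a different name.

```julia
# Same formula as the notebook's `rmse`, renamed to avoid redefining it.
rmse_reference(A, B) = sqrt(sum((A - B) .^ 2) / length(A))

# Hypothetical variant: map each pair to its squared difference and reduce with `+`.
rmse_fused(A, B) = sqrt(mapreduce((a, b) -> (a - b)^2, +, A, B) / length(A))

P = [1.0 2.0; 3.0 4.0]
Q = [1.0 4.0; 3.0 0.0]

# Both compute sqrt((0 + 4 + 0 + 16) / 4).
agree = rmse_fused(P, Q) ≈ rmse_reference(P, Q)
```

Whether this is actually faster should be checked with `@benchmark` on the target hardware.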