In [2]:
# Choose a backend
# CUDA, AMD, or CPU

# If you have no GPU you can still follow along
# You might want to install a kernel with threads enabled:
# `IJulia.installkernel("Julia 1.6.2 Threads", "--threads=auto")`, then restart
# this notebook with that kernel.

const BACKEND = :CUDA

:CUDA

In [11]:
using Pkg
Pkg.activate(string(BACKEND, "Env"))

# Install packages
# pkg"add KernelAbstractions, Adapt"

# if BACKEND == :CUDA
#     pkg"add CUDAKernels, CUDA"
# elseif BACKEND == :AMD
#     pkg"add ROCMKernels, AMDGPU"
# end

[32m[1m  Activating[22m[39m environment at `~/juliacon21-gpu_workshop/kernelabstractions/CUDAEnv/Project.toml`


In [12]:
using KernelAbstractions, Adapt

In [13]:
if BACKEND == :CUDA
    using CUDA, CUDAKernels
    const ArrayT = CuArray
    const Device = CUDADevice
elseif BACKEND == :AMD
    using AMDGPU, ROCMKernels
    const ArrayT = ROCArray
    const Device = ROCDevice
else # BACKEND == :CPU
    const ArrayT = Array
    const Device = CPU
end

CUDADevice

# Writing your first kernel in KernelAbstractions

Let's implement the classic: $z = \alpha x + y$

In [14]:
@kernel function saxpy!(z, α, x, y)
    I = @index(Global)
    @inbounds z[I] = α * x[I] + y[I]
end

saxpy! (generic function with 5 methods)

Inside `@kernel` you have access to a dialect that lets you implement GPU-style
kernels. One example is the `@index` macro, which calculates a valid index. You can
ask for your current `Global`, `Group`, or `Local` index.

They are also available in different styles; more about that later.
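
To make the relation between the three kinds of index concrete, here is a minimal plain-Julia sketch for the one-dimensional case. It assumes 1-based indices and groups laid out one after another; `global_index` is a hypothetical helper for illustration, not a KernelAbstractions function.

```julia
# Hypothetical helper, not part of KernelAbstractions: recover the global
# index from a 1-based group index and a 1-based local index.
global_index(group, localidx, groupsize) = (group - 1) * groupsize + localidx

groupsize = 16

# The first work-item of the first group maps to global index 1.
first_idx = global_index(1, 1, groupsize)

# The last work-item of the eighth group maps to global index 128.
last_idx = global_index(8, 16, groupsize)

# Every (local, group) pair maps to a distinct global index in 1:128.
all_idx = [global_index(g, l, groupsize) for l in 1:groupsize, g in 1:8]
```

Under this assumption, 8 groups of 16 work-items cover a global index space of 128 elements exactly once.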

# How are indices derived

KernelAbstractions has two concepts used to derive indices.

1. `workgroupsize`
2. `ndrange`

The `workgroupsize` is a local block of threads that are co-executed. The `ndrange`
specifies the global index space. This index space will be subdivided by the `workgroupsize`
and each group will be executed in parallel.

In [15]:
ndrange = (128,)
workgroupsize = (16,)
blocks, workgroupsize, dynamic = KernelAbstractions.NDIteration.partition(ndrange, workgroupsize)

((8,), (16,), KernelAbstractions.NDIteration.NoDynamicCheck())

Both `ndrange` and `workgroupsize` can be of arbitrary dimensionality.

In [16]:
ndrange = (128,256,512)
workgroupsize = (16,)
blocks, workgroupsize, dynamic = KernelAbstractions.NDIteration.partition(ndrange, workgroupsize)

((8, 256, 512), (16, 1, 1), KernelAbstractions.NDIteration.NoDynamicCheck())
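
The block counts printed above can be reproduced with a few lines of plain Julia. This is only a sketch of the arithmetic, assuming the workgroup size is padded with ones and the block count is rounded up per dimension. `sketch_partition` is a hypothetical helper, not the library's implementation, and its third return value is a plain `Bool` rather than the check object that `partition` returns.

```julia
# Hypothetical helper that mimics the block counts shown above; it is not
# the library's implementation.
function sketch_partition(ndrange::NTuple{N,Int}, workgroupsize::Tuple) where {N}
    # Pad the workgroup size with ones up to the dimensionality of ndrange.
    padded = ntuple(i -> i <= length(workgroupsize) ? workgroupsize[i] : 1, N)
    # Number of blocks per dimension, rounding up.
    blocks = cld.(ndrange, padded)
    # True when the blocks overhang the index space in some dimension.
    needs_check = any(blocks .* padded .!= ndrange)
    return blocks, padded, needs_check
end

p1 = sketch_partition((128,), (16,))            # ((8,), (16,), false)
p2 = sketch_partition((128, 256, 512), (16,))   # ((8, 256, 512), (16, 1, 1), false)
p3 = sketch_partition((100,), (16,))            # ((7,), (16,), true)
```

The last call shows an `ndrange` that is not a multiple of the workgroup size: seven blocks of 16 cover 112 indices, so a launch would have to guard against the 12 that fall outside the range.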

# Launching Kernels

Before we launch any kernel we first need to instantiate it for our backend.

In [17]:
kernel = saxpy!(Device())

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

Note that we now have a `Kernel` object with some information in the type domain.

The first type parameter is the device used; the second and third contain information
about the launch configuration (workgroup size, then `ndrange`). Here both are
`DynamicSize`, meaning none was given. Lastly, there is the type of the function that
will actually be called.

## Static vs Dynamic launch configuration

In [18]:
# Let's allocate some data

x = adapt(ArrayT, rand(64, 32))
y = adapt(ArrayT, rand(64, 32))
z = similar(x)
nothing

In [19]:
kernel(z, 0.01, x, y)

LoadError:     Can not partition kernel!

    You created a dynamically sized kernel, but forgot to provide runtime
    parameters for the kernel. Either provide them statically if known
    or dynamically.
    NDRange(Static):  KernelAbstractions.NDIteration.DynamicSize
    NDRange(Dynamic): nothing
    Workgroupsize(Static):  KernelAbstractions.NDIteration.DynamicSize
    Workgroupsize(Dynamic): nothing


In [None]:
kernel(z, 0.01, x, y, ndrange=size(z), workgroupsize=32)

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x0000000006b918c0, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

In [20]:
# I can leave the workgroupsize up to the backend.

kernel(z, 0.01, x, y, ndrange=size(z))

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x00000000089ffa20, CuContext(0x0000000005100ac0, instance 6afe723c4151a0c0)))
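
To check what a launch like the ones above should compute, a host-side reference is handy. The sketch below is plain Julia and does not use KernelAbstractions. `saxpy_reference!`, `x_host`, `y_host`, and `z_host` are hypothetical names introduced here, with the same array size and scalar as in the launches above.

```julia
# Hypothetical host-side reference, not part of KernelAbstractions.
# It applies the same per-element update as the kernel body, one index at a time.
function saxpy_reference!(z, α, x, y)
    for I in eachindex(z, x, y)
        @inbounds z[I] = α * x[I] + y[I]
    end
    return z
end

x_host = rand(64, 32)
y_host = rand(64, 32)
z_host = saxpy_reference!(similar(x_host), 0.01, x_host, y_host)
```

After waiting on the event returned by the launch, the device result can be copied back with `adapt(Array, z)` and compared against such a reference using `≈`.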

If I wanted, I could have instantiated the kernel with static size information.

In [21]:
kernel = saxpy!(Device(), (16,))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [22]:
iterspace, dynamic = KernelAbstractions.partition(kernel, size(x), nothing)
nothing

In [23]:
KernelAbstractions.NDIteration.blocks(iterspace)

4×32 CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)  CartesianIndex(1, 2)  …  CartesianIndex(1, 32)
 CartesianIndex(2, 1)  CartesianIndex(2, 2)     CartesianIndex(2, 32)
 CartesianIndex(3, 1)  CartesianIndex(3, 2)     CartesianIndex(3, 32)
 CartesianIndex(4, 1)  CartesianIndex(4, 2)     CartesianIndex(4, 32)

In [24]:
KernelAbstractions.NDIteration.workitems(iterspace)

16×1 CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)
 CartesianIndex(2, 1)
 CartesianIndex(3, 1)
 CartesianIndex(4, 1)
 CartesianIndex(5, 1)
 CartesianIndex(6, 1)
 CartesianIndex(7, 1)
 CartesianIndex(8, 1)
 CartesianIndex(9, 1)
 CartesianIndex(10, 1)
 CartesianIndex(11, 1)
 CartesianIndex(12, 1)
 CartesianIndex(13, 1)
 CartesianIndex(14, 1)
 CartesianIndex(15, 1)
 CartesianIndex(16, 1)
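
The two outputs above fit together as follows: each of the 4×32 blocks holds 16×1 work-items, which tiles the 64×32 array. The plain-Julia sketch below checks that arithmetic. The mapping in `expand_index` is an assumption about how a block and a work-item combine into a global index, and the helper is hypothetical, not a KernelAbstractions function.

```julia
blocks    = CartesianIndices((4, 32))   # same shape as `blocks(iterspace)` above
workitems = CartesianIndices((16, 1))   # same shape as `workitems(iterspace)` above

# Assumed mapping: global = (block - 1) * workgroup extent + work-item.
# Hypothetical helper for illustration only.
expand_index(b, w) =
    CartesianIndex((Tuple(b) .- 1) .* size(workitems) .+ Tuple(w))

# Enumerate the global index of every work-item in every block.
covered = [expand_index(b, w) for w in workitems, b in blocks]
```

Under this assumption, `covered` contains each index of the 64×32 array exactly once.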

In [48]:
kernel = saxpy!(Device(), (16,), (1024, 32))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.StaticSize{(1024, 32)}, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [49]:
# Launching a mismatched kernel should error
# Fixme(vchuravy)

kernel(z, 0.01, x, y, ndrange=size(z))

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x000000001d79e990, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

## Dependencies

Note how the previous examples all returned an `<:Event`. KernelAbstractions launches
all kernels asynchronously and it is up to the programmer (YOU!) to ensure that kernels
are properly synchronized with the host, other kernels, and GPUArrays.
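
As an analogy only, the same pattern appears with Julia's built-in tasks: starting work returns a handle immediately, and the result is safe to read only after waiting on that handle. The sketch below uses `Threads.@spawn` from Base, with a `Task` standing in for the `Event`. It does not use KernelAbstractions, and the names `result`, `task`, and `total` are made up for illustration.

```julia
# Analogy only: a Base `Task` stands in for the `Event` returned by a launch.
result = zeros(4)

task = Threads.@spawn begin
    sleep(0.1)        # pretend the work takes a while
    result .= 1.0     # the "kernel" writes its output
end

# Reading `result` at this point would race with the task, so wait first.
wait(task)
total = sum(result)
```

The cells below follow the same discipline with events: wait on the event before reading from the host, or pass it as a dependency to the next launch.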

In [26]:
kernel = saxpy!(Device())

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [30]:
# 1. Allocate data
x = adapt(ArrayT, rand(64, 32))
y = adapt(ArrayT, rand(64, 32))
z = similar(x)

# Note: CUDA.jl uses asynchronous allocations, and we are moving data from host
#       to the device.

allocation_event = Event(Device())

# 2. Kernel event, kernel needs to synchronize against allocation and data
#    movement from above.

kernel_event = kernel(z, 0.01, x, y;
                      ndrange=size(z), dependencies=allocation_event)

# 3.
# Scenario A: reading `z` from the host
wait(kernel_event)
adapt(Array, z)

# Scenario B: Using `z` in the next kernel
kernel_event = kernel(x, 0.01, z, y;
               ndrange=size(z), dependencies=kernel_event)

# Note: We need to wait on `x` now

# Scenario C: Using `z` as part of GPUArrays
wait(Device(), kernel_event)
zz = z.^2 # Broadcast expression is dependent on `z`
nothing

### Dependencies and CUDA task based programming


# Using the memory hierarchy on a GPU