In [22]:
using Pkg
# Install packages
# pkg"add KernelAbstractions, Adapt"

# Choose a backend
# CUDA:
# pkg"add CUDAKernels, CUDA"

# AMD: 
# pkg"add ROCKernels, AMDGPU"

# If you have no GPU you can still follow along
# You might want to install a kernel with threads enabled:
# `IJulia.installkernel("Julia 1.6.2 Threads", "--threads=auto")` and restart
# this notebook with that kernel.

In [27]:
using KernelAbstractions, Adapt

In [7]:
const BACKEND = :CUDA

:CUDA

In [8]:
if BACKEND == :CUDA
    using CUDA, CUDAKernels
    const ArrayT = CuArray
    const Device = CUDADevice
elseif BACKEND == :AMD
    using AMDGPU, ROCKernels
    const ArrayT = ROCArray
    const Device = ROCDevice
else  # any other value, e.g. :CPU, falls back to the CPU
    const ArrayT = Array
    const Device = CPU
end

CUDADevice

# Writing your first kernel in KernelAbstractions

Let's implement the classic: $z = \alpha x + y$

In [11]:
@kernel function saxpy!(z, α, x, y)
    I = @index(Global)
    @inbounds z[I] = α * x[I] + y[I]
end

saxpy! (generic function with 5 methods)
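For checking results later, it can help to have a plain CPU loop that computes the same thing. The following is a minimal sketch in ordinary Julia; `saxpy_reference!` is a hypothetical helper name introduced here for illustration and is not part of KernelAbstractions.

```julia
# Hypothetical reference implementation (not part of KernelAbstractions):
# a serial loop that performs the same update as the kernel body.
function saxpy_reference!(z, α, x, y)
    @assert size(z) == size(x) == size(y)
    for I in eachindex(z, x, y)
        @inbounds z[I] = α * x[I] + y[I]
    end
    return z
end

x_ref = rand(64, 32)
y_ref = rand(64, 32)
z_ref = similar(x_ref)
saxpy_reference!(z_ref, 0.01, x_ref, y_ref)
```

The loop body is the same expression as in the kernel; the only difference is that here the index `I` comes from a serial loop rather than from `@index`.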

Inside `@kernel` you have access to a dialect that lets you implement GPU-style
kernels. One example is the `@index` macro, which calculates a valid index. You can
ask for your current `Global`, `Group`, or `Local` index.

They are also available in different styles, more about that later.
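How the three kinds of index relate can be sketched in plain Julia, without a GPU. This assumes 1-based linear indices and a one-dimensional index space that the workgroup size divides evenly; `global_index` is a hypothetical helper for illustration, not a KernelAbstractions function.

```julia
# Hypothetical helper: linear global index of a work-item, given its
# 1-based group index, its 1-based local index, and the workgroup size.
global_index(group, local_idx, groupsize) = (group - 1) * groupsize + local_idx

groupsize = 16
ngroups = 8

# Enumerate every (group, local) pair; groups vary slowest.
indices = [global_index(g, l, groupsize) for g in 1:ngroups for l in 1:groupsize]

# Every global index in 1:128 is produced exactly once.
covers_everything = indices == collect(1:128)
```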

# How are indices derived

KernelAbstractions has two concepts used to derive indices.

1. `workgroupsize`
2. `ndrange`

The `workgroupsize` describes a local block of threads that are co-executed. The `ndrange`
specifies the global index space. This index space is subdivided into groups of size
`workgroupsize`, and the groups are executed in parallel.

In [43]:
ndrange = (128,)
workgroupsize = (16,)
blocks, workgroupsize, dynamic = KernelAbstractions.NDIteration.partition(ndrange, workgroupsize)

((8,), (16,), KernelAbstractions.NDIteration.NoDynamicCheck())

Both `ndrange` and `workgroupsize` can be of arbitrary dimensionality. A `workgroupsize`
with fewer dimensions than the `ndrange` is padded with ones.

In [41]:
ndrange = (128,256,512)
workgroupsize = (16,)
blocks, workgroupsize, dynamic = KernelAbstractions.NDIteration.partition(ndrange, workgroupsize)

((8, 256, 512), (16, 1, 1), KernelAbstractions.NDIteration.NoDynamicCheck())
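The arithmetic behind these two results can be reproduced in a few lines of plain Julia. This is a sketch for illustration only, not the library's implementation: `sketch_partition` is a hypothetical name, it assumes the workgroup size has no more dimensions than the `ndrange`, and it returns a plain `Bool` where the library returns a check object. My understanding is that the library reports a dynamic bounds check when the workgroup size does not divide the `ndrange` evenly; treat that as an assumption.

```julia
# Hypothetical re-implementation of the partition arithmetic (illustration only).
function sketch_partition(ndrange::NTuple{N,Int}, workgroupsize::NTuple{M,Int}) where {N,M}
    # Pad the workgroup size with ones up to the dimensionality of ndrange.
    padded = ntuple(i -> i <= M ? workgroupsize[i] : 1, N)
    # Number of groups per dimension, rounded up.
    blocks = cld.(ndrange, padded)
    # A bounds check is needed when the groups overshoot the ndrange.
    needs_check = blocks .* padded != ndrange
    return blocks, padded, needs_check
end

sketch_partition((128,), (16,))
sketch_partition((128, 256, 512), (16,))
sketch_partition((100,), (16,))   # 16 does not divide 100
```

In the last call the seven groups cover 112 indices, so the trailing work-items of the final group would fall outside the index space and must be masked off.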

# Launching Kernels

Before we launch any kernel we first need to instantiate it for our backend.

In [25]:
kernel = saxpy!(Device())

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

Note that we now have a `Kernel` object with some information in the type domain.

The first type parameter is the device used; the second and third contain information
about the launch configuration (the workgroup size and the `ndrange`). Here both are
`DynamicSize`, meaning none was given.
Lastly, there is the type of the actual function that is going to be called.

## Static vs Dynamic launch configuration

In [31]:
# Let's allocate some data

x = adapt(ArrayT, rand(64, 32))
y = adapt(ArrayT, rand(64, 32))
z = similar(x)
nothing

In [29]:
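# No ndrange is given here and none is baked into the kernel,
# so this launch has no index space to iterate over.
kernel(z, 0.01, x, y)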

In [33]:
kernel(z, 0.01, x, y, ndrange=size(z), workgroupsize=32)

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x0000000006b918c0, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

In [34]:
# I can leave the workgroupsize up to the backend.

kernel(z, 0.01, x, y, ndrange=size(z))

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x000000001bcfdfc0, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

If I wanted, I could instead instantiate the kernel with static size information, here a workgroup size of 16.

In [36]:
kernel = saxpy!(Device(), (16,))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [47]:
iterspace, dynamic = KernelAbstractions.partition(kernel, size(x), nothing)
nothing

In [44]:
KernelAbstractions.NDIteration.blocks(iterspace)

4×32 CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)  CartesianIndex(1, 2)  …  CartesianIndex(1, 32)
 CartesianIndex(2, 1)  CartesianIndex(2, 2)     CartesianIndex(2, 32)
 CartesianIndex(3, 1)  CartesianIndex(3, 2)     CartesianIndex(3, 32)
 CartesianIndex(4, 1)  CartesianIndex(4, 2)     CartesianIndex(4, 32)

In [45]:
KernelAbstractions.NDIteration.workitems(iterspace)

16×1 CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)
 CartesianIndex(2, 1)
 CartesianIndex(3, 1)
 CartesianIndex(4, 1)
 CartesianIndex(5, 1)
 CartesianIndex(6, 1)
 CartesianIndex(7, 1)
 CartesianIndex(8, 1)
 CartesianIndex(9, 1)
 CartesianIndex(10, 1)
 CartesianIndex(11, 1)
 CartesianIndex(12, 1)
 CartesianIndex(13, 1)
 CartesianIndex(14, 1)
 CartesianIndex(15, 1)
 CartesianIndex(16, 1)
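The two outputs above fit together as follows: combining each block index with each work-item index yields every global index of the 64×32 array exactly once. The sketch below checks this in plain Julia. `expand_index` is a hypothetical helper written for this illustration; it mirrors the idea, not the library's internal code.

```julia
blocks = CartesianIndices((4, 32))      # same shape as the blocks shown above
workitems = CartesianIndices((16, 1))   # same shape as the work-items shown above

# Hypothetical helper: global index of a work-item `w` inside block `b`.
expand_index(b, w) =
    CartesianIndex((Tuple(b) .- 1) .* size(workitems) .+ Tuple(w))

# Collect the global index of every (block, work-item) pair.
covered = [expand_index(b, w) for b in blocks for w in workitems]

# 4 * 32 blocks times 16 work-items gives 2048 indices, one per array element.
complete = length(covered) == 64 * 32 &&
           Set(covered) == Set(CartesianIndices((64, 32)))
```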

In [48]:
kernel = saxpy!(Device(), (16,), (1024, 32))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.StaticSize{(1024, 32)}, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [49]:
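# The kernel was instantiated with a static ndrange of (1024, 32), but size(z) is (64, 32).
# Launching with a mismatched ndrange should error; currently it does not.
# Fixme(vchuravy)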

kernel(z, 0.01, x, y, ndrange=size(z))

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x000000001d79e990, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

## Dependencies

Note how the previous examples all returned an `Event`.

### Dependencies and CUDA task based programming

## Integrating with GPUArrays

# Using the memory hierarchy on a GPU