In [2]:
# Choose a backend
# CUDA, AMD, or CPU

# If you have no GPU you can still follow along
# You might want to install a kernel with threads enabled, e.g.
# `IJulia.installkernel("Julia 1.6.2 Threads", "--threads=auto")`, and restart
# this notebook with that kernel.

const BACKEND = :CUDA

:CUDA

In [5]:
using Pkg
Pkg.activate(string(BACKEND, "Env"))

# Install packages
# pkg"add KernelAbstractions, Adapt, OffsetArrays"

# if BACKEND == :CUDA
#     pkg"add CUDAKernels, CUDA"
# elseif BACKEND == :AMD
#     pkg"add ROCMKernels, AMDGPU"
# end

[32m[1m  Activating[22m[39m environment at `~/juliacon21-gpu_workshop/kernelabstractions/CUDAEnv/Project.toml`


In [4]:
using KernelAbstractions, Adapt, OffsetArrays

In [11]:
if BACKEND == :CUDA
    using CUDA, CUDAKernels
    const ArrayT = CuArray
    const Device = CUDADevice
elseif BACKEND == :AMD
    using AMDGPU, ROCMKernels
    const ArrayT = ROCArray
    const Device = ROCDevice
elseif BACKEND == :CPU
    const ArrayT = Array
    const Device = CPU
end

CUDADevice

# Writing your first kernel in KernelAbstractions

Let's implement the classic saxpy operation: $z = \alpha x + y$

In [14]:
@kernel function saxpy!(z, α, x, y)
    I = @index(Global)
    @inbounds z[I] = α * x[I] + y[I]
end

saxpy! (generic function with 5 methods)

Inside `@kernel` you have access to a dialect that lets you implement GPU-style
kernels. One example is the `@index` macro, which calculates a valid index. You can
ask for your current `Global`, `Group`, or `Local` index.

They are also available in different styles, more about that later.

# How are indices derived?

KernelAbstractions has two concepts used to derive indices.

1. `workgroupsize`
2. `ndrange`

The `workgroupsize` is a local block of threads that are co-executed. The `ndrange`
specifies the global index space. This index space will be subdivided by the `workgroupsize`
and each group will be executed in parallel.
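
As a rough mental model in one dimension, a global index can be thought of as a group index combined with a local index. The sketch below is plain Julia, not KernelAbstractions' actual implementation, and the helper `global_index` is hypothetical and only illustrates the arithmetic:

```julia
# Hypothetical helper (not part of KernelAbstractions): combine a 1-based group
# index and a 1-based local index into a 1-based global index.
global_index(group, lidx, wgsize) = (group - 1) * wgsize + lidx

ndrange = 128
wgsize  = 16
ngroups = cld(ndrange, wgsize)   # number of workgroups needed to cover ndrange

# Enumerate every (group, local) pair, group by group.
indices = [global_index(g, l, wgsize) for g in 1:ngroups for l in 1:wgsize]

@assert ngroups == 8
@assert indices == collect(1:ndrange)   # every global index appears exactly once
```

Each workgroup owns a contiguous block of `wgsize` global indices, which is why the groups can be executed independently of one another.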

In [15]:
ndrange = (128,)
workgroupsize = (16,)
blocks, workgroupsize, dynamic = KernelAbstractions.NDIteration.partition(ndrange, workgroupsize)

((8,), (16,), KernelAbstractions.NDIteration.NoDynamicCheck())

Both `ndrange` and `workgroupsize` can be of arbitrary dimensionality. A `workgroupsize` with fewer dimensions than `ndrange` is padded with ones:

In [16]:
ndrange = (128,256,512)
workgroupsize = (16,)
blocks, workgroupsize, dynamic = KernelAbstractions.NDIteration.partition(ndrange, workgroupsize)

((8, 256, 512), (16, 1, 1), KernelAbstractions.NDIteration.NoDynamicCheck())
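
The numbers returned by `partition` can be reproduced with a few lines of plain Julia. This is a simplified sketch under the assumption that partitioning amounts to padding the workgroup size with ones and rounding the block count up. `sketch_partition` is a hypothetical name, and the `Bool` it returns stands in for the `NoDynamicCheck()`/`DynamicCheck()` marker:

```julia
# Hypothetical stand-in for `partition`, written with plain tuples.
function sketch_partition(ndrange::NTuple{N,Int}, workgroupsize::Tuple) where {N}
    # Pad the workgroup size with ones up to the dimensionality of `ndrange`.
    wg = ntuple(d -> d <= length(workgroupsize) ? workgroupsize[d] : 1, N)
    # Number of workgroups per dimension, rounded up.
    blocks = cld.(ndrange, wg)
    # If a dimension is not evenly divisible, some work-items fall outside
    # `ndrange` and have to be masked out by a bounds check.
    needs_check = any(rem.(ndrange, wg) .!= 0)
    return blocks, wg, needs_check
end

@assert sketch_partition((128,), (16,)) == ((8,), (16,), false)
@assert sketch_partition((128, 256, 512), (16,)) == ((8, 256, 512), (16, 1, 1), false)
@assert sketch_partition((100,), (16,)) == ((7,), (16,), true)
```

The last call shows the case that does not divide evenly: seven groups of 16 cover 112 indices, so the trailing 12 work-items must be skipped.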

# Launching Kernels

Before we launch any kernel we first need to instantiate it for our backend.

In [17]:
kernel = saxpy!(Device())

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

Note that we now have a `Kernel` object with some information in the type domain.

The first type parameter is the device used; the second and third contain information
about the launch configuration (workgroup size and ndrange). Here both are `DynamicSize`, meaning none was given.
Lastly, there is the type of the actual function that is going to be called.

## Static vs Dynamic launch configuration

In [18]:
# Let's allocate some data

x = adapt(ArrayT, rand(64, 32))
y = adapt(ArrayT, rand(64, 32))
z = similar(x)
nothing

In [19]:
kernel(z, 0.01, x, y)

LoadError:     Can not partition kernel!

    You created a dynamically sized kernel, but forgot to provide runtime
    parameters for the kernel. Either provide them statically if known
    or dynamically.
    NDRange(Static):  KernelAbstractions.NDIteration.DynamicSize
    NDRange(Dynamic): nothing
    Workgroupsize(Static):  KernelAbstractions.NDIteration.DynamicSize
    Workgroupsize(Dynamic): nothing


In [None]:
kernel(z, 0.01, x, y, ndrange=size(z), workgroupsize=32)

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x0000000006b918c0, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

In [20]:
# I can leave the workgroupsize up to the backend.

kernel(z, 0.01, x, y, ndrange=size(z))

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x00000000089ffa20, CuContext(0x0000000005100ac0, instance 6afe723c4151a0c0)))

If I wanted I could have instantiated the kernel with static size information.

In [21]:
kernel = saxpy!(Device(), (16,))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [22]:
iterspace, dynamic = KernelAbstractions.partition(kernel, size(x), nothing)
nothing

In [23]:
KernelAbstractions.NDIteration.blocks(iterspace)

4×32 CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)  CartesianIndex(1, 2)  …  CartesianIndex(1, 32)
 CartesianIndex(2, 1)  CartesianIndex(2, 2)     CartesianIndex(2, 32)
 CartesianIndex(3, 1)  CartesianIndex(3, 2)     CartesianIndex(3, 32)
 CartesianIndex(4, 1)  CartesianIndex(4, 2)     CartesianIndex(4, 32)

In [24]:
KernelAbstractions.NDIteration.workitems(iterspace)

16×1 CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)
 CartesianIndex(2, 1)
 CartesianIndex(3, 1)
 CartesianIndex(4, 1)
 CartesianIndex(5, 1)
 CartesianIndex(6, 1)
 CartesianIndex(7, 1)
 CartesianIndex(8, 1)
 CartesianIndex(9, 1)
 CartesianIndex(10, 1)
 CartesianIndex(11, 1)
 CartesianIndex(12, 1)
 CartesianIndex(13, 1)
 CartesianIndex(14, 1)
 CartesianIndex(15, 1)
 CartesianIndex(16, 1)

In [48]:
kernel = saxpy!(Device(), (16,), (1024, 32))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.StaticSize{(1024, 32)}, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [49]:
# Launching a mismatched kernel should error
# Fixme(vchuravy)

kernel(z, 0.01, x, y, ndrange=size(z))

CUDAKernels.CudaEvent(CuEvent(Ptr{Nothing} @0x000000001d79e990, CuContext(0x0000000004fbff40, instance cc1d00557c414dc1)))

## Dependencies

Note how the previous examples all returned an `<:Event`. KernelAbstractions launches
all kernels asynchronously and it is up to the programmer (YOU!) to ensure that kernels
are properly synchronized with the host, other kernels, and GPUArrays.

In [26]:
kernel = saxpy!(Device())

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_saxpy!)}(gpu_saxpy!)

In [30]:
# 1. Allocate data
x = adapt(ArrayT, rand(64, 32))
y = adapt(ArrayT, rand(64, 32))
z = similar(x)

# Note: CUDA.jl uses asynchronous allocations, and we are moving data from host
#       to the device.

allocation_event = Event(Device())

# 2. Kernel event, kernel needs to synchronize against allocation and data
#    movement from above.

kernel_event = kernel(z, 0.01, x, y;
                      ndrange=size(z), dependencies=allocation_event)

# 3.
# Scenario A: reading `z` from the host
wait(kernel_event)
adapt(Array, z)

# Scenario B: Using `z` in the next kernel
kernel_event = kernel(x, 0.01, z, y;
               ndrange=size(z), dependencies=kernel_event)

# Note: We need to wait on `x` now

# Scenario C: Using `z` as part of GPUArrays
wait(Device(), kernel_event)
zz = z.^2 # Broadcast expression is dependent on `z`
nothing
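
The same dependency chain can be tried without a GPU. The following is a minimal sketch, assuming the event-based KernelAbstractions version used in this notebook and its `CPU()` backend. The array `w` and the event names are illustrative only:

```julia
using KernelAbstractions

@kernel function saxpy!(z, α, x, y)
    I = @index(Global)
    @inbounds z[I] = α * x[I] + y[I]
end

x = rand(64, 32)
y = rand(64, 32)
z = similar(x)
w = similar(x)

kernel = saxpy!(CPU(), (16,))

# The first launch writes `z`; the second reads `z`, so it depends on the first.
first_event  = kernel(z, 0.01, x, y; ndrange=size(z))
second_event = kernel(w, 2.0, z, y; ndrange=size(w), dependencies=first_event)

# Waiting on the last event of the chain is enough before reading on the host.
wait(second_event)

@assert z ≈ 0.01 .* x .+ y
@assert w ≈ 2.0 .* z .+ y
```

Writing the second result into a separate array `w` keeps the check simple, because `x` stays available for comparison on the host.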

### Dependencies and CUDA task-based programming

CUDA proper organises concurrent execution of kernels on so-called `Streams`.
Launching a kernel on a stream means that it will execute serially with respect
to other kernels on that stream, and concurrently with kernels launched on different
streams.

KernelAbstractions' dependency system was partly motivated by exposing a consistent
concurrent kernel execution story that would work across multiple GPU and CPU platforms.

CUDA.jl 3.0 now uses asynchronous allocations and task-based concurrency, which partly
obviates the need for KernelAbstractions' dependency system. Each Julia task now
has a task-local stream that is used to launch both user kernels and library code.

From a KernelAbstractions perspective, dependency management has not changed,
since KernelAbstractions kernels will launch concurrently to the task-local stream.

# Using the memory hierarchy on a GPU

One of the main motivations behind KernelAbstractions is making the memory hierarchy
available, while still being able to execute the code on the CPU. The primary accessors
are `@localmem`, which defines a memory region local to a group of threads (called shared memory in CUDA),
and `@synchronize`, which synchronizes memory accesses on that memory.

Let's write a diffusion kernel:

In [12]:
@kernel function diffusion!(data, a, dt, dx, dy)
    i, j = @index(Global, NTuple)

    @inbounds begin
        dij   = data[i, j]
        dim1j = data[i-1, j]
        dijm1 = data[i, j-1]
        dip1j = data[i+1, j]
        dijp1 = data[i, j+1]

        dij += a * dt * (
            (dim1j - 2 * dij + dip1j)/dx^2 +
            (dijm1 - 2 * dij + dijp1)/dy^2)
        
        data[i, j] = dij
    end
end

diffusion! (generic function with 5 methods)

In [13]:
N  = 64
dx = 0.01 # x-grid spacing
dy = 0.01 # y-grid spacing
a  = 0.001
dt = dx^2 * dy^2 / (2.0 * a * (dx^2 + dy^2)) # Largest stable time step

0.024999999999999998
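
The expression for `dt` is the stability limit of the explicit (forward-time, centered-space) scheme that the kernel is meant to implement. A short derivation, writing $u^n_{i,j}$ for `data[i, j]` at time step $n$ (the symbol $u$ is introduced here only for illustration):

$$
u^{n+1}_{i,j} = u^n_{i,j} + a\,\Delta t \left( \frac{u^n_{i-1,j} - 2u^n_{i,j} + u^n_{i+1,j}}{\Delta x^2} + \frac{u^n_{i,j-1} - 2u^n_{i,j} + u^n_{i,j+1}}{\Delta y^2} \right)
$$

This update is stable as long as

$$
a\,\Delta t \left( \frac{1}{\Delta x^2} + \frac{1}{\Delta y^2} \right) \le \frac{1}{2}
\quad\Longrightarrow\quad
\Delta t \le \frac{\Delta x^2\,\Delta y^2}{2a\,(\Delta x^2 + \Delta y^2)}
$$

With $\Delta x = \Delta y = 0.01$ and $a = 0.001$ the bound evaluates to $10^{-8} / (4 \cdot 10^{-7}) = 0.025$, which matches the value printed above up to floating-point rounding.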

In [17]:
domain = OffsetArray(zeros(N+2, N+2), 0:(N+1), 0:(N+1))
domain[16:32, 16:32] .= 5
domain = adapt(ArrayT, domain)
nothing

In [15]:
diffusion_kernel = diffusion!(Device())

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_diffusion!)}(gpu_diffusion!)

In [18]:
wait(diffusion_kernel(domain, a, dt, dx, dy;
                 ndrange=(N, N)))

#### What is wrong with this kernel?

1. Think about data races
2. Consider the data access pattern 

#### Task

Implement a mirror boundary condition

In [19]:
@kernel function diffusion_lmem!(out, @Const(data), a, dt, dx, dy)
    i, j   = @index(Global, NTuple)
    li, lj = @index(Local, NTuple)
    lmem = @localmem eltype(data) (@groupsize()[1] + 2, @groupsize()[2] + 2)
    @uniform ldata = OffsetArray(lmem, 0:(@groupsize()[1]+1), 0:(@groupsize()[2]+1))

    # Load data from global to local buffer; work-items on the edge of the
    # workgroup (judged by their local index) also load the adjacent halo cell.
    @inbounds begin
        ldata[li, lj] = data[i, j]
        if li == 1
            ldata[li-1, lj] = data[i-1, j]
        end
        if li == @groupsize()[1]
            ldata[li+1, lj] = data[i+1, j]
        end
        if lj == 1
            ldata[li, lj-1] = data[i, j-1]
        end
        if lj == @groupsize()[2]
            ldata[li, lj+1] = data[i, j+1]
        end
    end
    @synchronize()

    @inbounds begin
        dij   = ldata[li, lj]
        dim1j = ldata[li-1, lj]
        dijm1 = ldata[li, lj-1]
        dip1j = ldata[li+1, lj]
        dijp1 = ldata[li, lj+1]

        dij += a * dt * (
            (dim1j - 2 * dij + dip1j)/dx^2 +
            (dijm1 - 2 * dij + dijp1)/dy^2)
        
        out[i, j] = dij
    end
end

diffusion_lmem! (generic function with 5 methods)

#### What did we change?

1. Split output from input so that we don't have cross-workgroup races
2. Loaded data into a local memory buffer
3. Marked `data` with `@Const`

##### What is `@uniform`?

A necessary annotation for CPU execution. It means that the variable is shared
across the workgroup. Similarly, `@private` introduces a variable that is private
to each thread of the workgroup. You will need to use it if you need to hold on to data across a `@synchronize`.

##### Index kinds
In this example we used the index kind `NTuple`. Other options are `Linear` and
`Cartesian`. A `Linear` index is simply an integer, and a `Cartesian` index is an
n-dimensional index object. Only use `NTuple` if you need to statically reason
about the dimensionality of your workgroup.
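
The relationship between the three kinds can be illustrated with Base Julia alone. This sketch only shows how the representations map onto each other for a small index space; it does not call into KernelAbstractions:

```julia
ndrange = (4, 3)

cartesian = CartesianIndices(ndrange)  # all n-dimensional indices of the space
linear    = LinearIndices(ndrange)     # maps n-dimensional indices to integers

I = cartesian[6]                       # the 6th index in column-major order
@assert I == CartesianIndex(2, 2)      # "Cartesian" style: an index object
@assert Tuple(I) == (2, 2)             # "NTuple" style: a plain tuple
@assert linear[I] == 6                 # "Linear" style: a single integer

# A tuple can be destructured, which requires knowing the dimensionality.
i, j = Tuple(I)
@assert (i, j) == (2, 2)
```

Destructuring into `i, j` is what the diffusion kernel does, and it only works because the kernel is written for exactly two dimensions.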

In [21]:
out = similar(domain)
nothing

In [23]:
# Using a static workgroupsize here is important so that the `@groupsize`
# can use that information
diffusion_lmem_kernel = diffusion_lmem!(Device(), (16, 16))

KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_diffusion_lmem!)}(gpu_diffusion_lmem!)

In [24]:
wait(diffusion_lmem_kernel(out, domain, a, dt, dx, dy;
                           ndrange=(N, N)))
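
To check a kernel like this one, a host-side reference is handy. The following plain-Julia sketch performs one explicit diffusion step on an ordinary 1-based matrix whose outermost rows and columns play the role of the halo. `diffusion_reference!` is a hypothetical helper, not part of the workshop code, and the indices are shifted by one relative to the `OffsetArray` used above:

```julia
# Hypothetical host-side reference: one explicit diffusion step, reading from
# `data` and writing to `out`, leaving the halo cells untouched.
function diffusion_reference!(out, data, a, dt, dx, dy)
    nx, ny = size(data)
    for j in 2:ny-1, i in 2:nx-1
        dij = data[i, j]
        out[i, j] = dij + a * dt * (
            (data[i-1, j] - 2 * dij + data[i+1, j]) / dx^2 +
            (data[i, j-1] - 2 * dij + data[i, j+1]) / dy^2)
    end
    return out
end

N  = 64
dx = 0.01
dy = 0.01
a  = 0.001
dt = dx^2 * dy^2 / (2.0 * a * (dx^2 + dy^2))

data = zeros(N + 2, N + 2)
data[17:33, 17:33] .= 5     # same patch as `domain[16:32, 16:32]`, shifted by the halo
out = copy(data)
diffusion_reference!(out, data, a, dt, dx, dy)

@assert out[25, 25] ≈ 5          # deep inside the patch nothing changes
@assert out[17, 25] ≈ 3.75       # an edge cell has three neighbours at 5 and one at 0
@assert sum(out) ≈ sum(data)     # the patch is far from the boundary, so mass is conserved
```

At the stability limit with `dx == dy`, each cell simply becomes the average of its four neighbours, which makes the expected values easy to verify by hand. The device result can be compared against such a reference after copying it back to the host.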

### Other topics:
- Reflection
- @print


In [30]:
@kernel function kernel_print()
    I = @index(Global)
    @print("Hello from thread ", I, "!\n")
end

kernel_print (generic function with 5 methods)

In [31]:
wait(kernel_print(Device())(ndrange=(4,)))

Hello from thread 1!
Hello from thread 2!
Hello from thread 3!
Hello from thread 4!


In [26]:
@macroexpand @kernel function saxpy!(z, α, x, y)
    I = @index(Global)
    @inbounds z[I] = α * x[I] + y[I]
end

quote
    function cpu_saxpy!(__ctx__, z, α, x, y; )
        let
            $(Expr(:aliasscope))
            begin
                var"##N#303" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                begin
                    #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:263 =#
                    for var"##I#302" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                        #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:264 =#
                        (KernelAbstractions.__validindex)(__ctx__, var"##I#302") || continue
                        #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:265 =#
                        I = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#302")
                        #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:266 =#
                        begin
                            $(Expr(:inbounds, true))


In [28]:
@ka_code_typed diffusion_lmem_kernel(out, domain, a, dt, dx, dy, ndrange=(N, N))

CodeInfo(
1 ─── %1   = Core.getfield(##overdub_arguments#258, 2)::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{2, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}}
│     %2   = Core.getfield(##overdub_arguments#258, 4)::OffsetMatrix{Float64, CuDeviceMatrix{Float64, 1}}
│     %3   = Core.getfield(##overdub_arguments#258, 5)::Float64
│     %4   = Core.getfield(##overdub_arguments#258, 6)::Float64
│     %5   = Core.getfield(##overdub_arguments#258, 7)::Float64
│     %6   = Core.getfield(##overdub_arguments#258, 8)::Float64
│     %7   = Base.getfield(

In [29]:
@ka_code_llvm diffusion_lmem_kernel(out, domain, a, dt, dx, dy, ndrange=(N, N))

LoadError: @ka_code_llvm does not support GPU kernels