# Enzyme with GPU 

Now, we want to take the previous examples and do similar examples on GPU 

* 1) with CUDA.jl 
* 2) with KernelAbstractions

With CUDA, as far as I know, only writing your own kernels is supported, not cuBLAS. All of my attempts to differentiate e.g. a standard `mul!` don't work. 

## Enzyme with CUDA.jl

We take the last example from the CPU notebook and try to do it on GPU with `CUDA.jl`

For this purpose we have to write CUDA kernels

In [1]:
pwd()

"/p/tmp/maxgelbr/code/SpeedyExperiments.jl/scripts"

In [2]:
import Pkg 
Pkg.activate(".") # make sure this is really the right environment
using Enzyme, Test, CUDA, SpeedyExperiments, LinearAlgebra, Adapt
CUDA.allowscalar(false)

[32m[1m  Activating[22m[39m project at `/p/tmp/maxgelbr/code/SpeedyExperiments.jl/scripts`
┌ Info: Precompiling SpeedyExperiments [0d28f6d9-48d7-458f-b3e7-42cde83a05c7]
└ @ Base loading.jl:1423
[33m[1m│ [22m[39m- If you have SpeedyExperiments checked out for development and have
[33m[1m│ [22m[39m  added CUDAKernels as a dependency but haven't updated your primary
[33m[1m│ [22m[39m  environment's manifest file, try `Pkg.resolve()`.
[33m[1m│ [22m[39m- Otherwise you may need to report an issue with SpeedyExperiments


Should be the scripts folder, otherwise change it. If you are using it for the first time, you might need to `]dev ..` the package again. In the `SpeedyExperiments` there are some utilities for GPU usage, like e.g:

In [3]:
?DeviceArray

search: [0m[1mD[22m[0m[1me[22m[0m[1mv[22m[0m[1mi[22m[0m[1mc[22m[0m[1me[22m[0m[1mA[22m[0m[1mr[22m[0m[1mr[22m[0m[1ma[22m[0m[1my[22m Cu[0m[1mD[22m[0m[1me[22m[0m[1mv[22m[0m[1mi[22m[0m[1mc[22m[0m[1me[22m[0m[1mA[22m[0m[1mr[22m[0m[1mr[22m[0m[1ma[22m[0m[1my[22m [0m[1mD[22m[0m[1me[22m[0m[1mv[22m[0m[1mi[22m[0m[1mc[22m[0m[1me[22mSp[0m[1ma[22m[0m[1mr[22mseA[0m[1mr[22mr[0m[1ma[22m[0m[1my[22m Cu[0m[1mD[22m[0m[1me[22m[0m[1mv[22m[0m[1mi[22m[0m[1mc[22m[0m[1me[22mM[0m[1ma[22mt[0m[1mr[22mix



```
DeviceArray(x)
```

Returns a `CuArray` when CUDA is used, otherwise a regular `Array`


We adjust the example from the other notebook, but this time we use the GPU. In order to save some time writing the kernel, we just do an elementwise multiplication. Now let's try to take the derivative, for this we use `autodiff_deferred` instead of the regular `autodiff`. The syntax is the same, but it is adjusted for GPU usage. Note that we have to execute this as a kernel as well. 

In [4]:
function element_mul_kernel!(C, A, B)
    i = threadIdx().x
    C[i] = A[i]*B[i]
    return nothing
end

function grad_mul_kernel!(C, dC, A, dA, B, dB)
    Enzyme.autodiff_deferred(element_mul_kernel!, Const, Duplicated(C, dC), Duplicated(A, dA), Duplicated(B, dB))
    return nothing
end

X_1 = DeviceArray(rand(3,3))
∂X_1 = zero(X_1) # input, hence zero 
X_2 = DeviceArray(rand(size(X_1)...))
∂X_2 = zero(X_2) # input, hence zero
Y = DeviceArray(zeros(size(X_1)...))
∂Y = fill!(similar(Y), 1); # output, hence something

In [5]:
@cuda threads=length(X_1) grad_mul_kernel!(Y, ∂Y, X_1, ∂X_1, X_2, ∂X_2)

CUDA.HostKernel{typeof(grad_mul_kernel!), NTuple{6, CuDeviceMatrix{Float64, 1}}}(grad_mul_kernel!, CuFunction(Ptr{Nothing} @0x000000000310c010, CuModule(Ptr{Nothing} @0x0000000005c623d0, CuContext(0x0000000002b402c0, instance e0ff3f39bda82691))), CUDA.KernelState(Ptr{Nothing} @0x00002b20e5a00000))

In [6]:
@test Y ≈ X_1 .* X_2

[32m[1mTest Passed[22m[39m
  Expression: Y ≈ X_1 .* X_2
   Evaluated: [0.518367141735663 0.04514307627053391 0.183222686204985; 0.43110858915650235 0.3155912165251231 0.13808752179289513; 0.2107084030051926 0.06296262389166143 0.1821020051992087] ≈ [0.518367141735663 0.04514307627053391 0.183222686204985; 0.43110858915650235 0.3155912165251231 0.13808752179289513; 0.2107084030051926 0.06296262389166143 0.1821020051992087]

In [7]:
@test ∂X_1 ≈ X_2

[32m[1mTest Passed[22m[39m
  Expression: ∂X_1 ≈ X_2
   Evaluated: [0.5825494050811111 0.9523803367679741 0.3894609012488429; 0.5902793013826094 0.5402133039887466 0.8606560948422022; 0.23473898524613512 0.08905694353173121 0.7137234003704515] ≈ [0.5825494050811111 0.9523803367679741 0.3894609012488429; 0.5902793013826094 0.5402133039887466 0.8606560948422022; 0.23473898524613512 0.08905694353173121 0.7137234003704515]

In [8]:
@test ∂X_2 ≈ X_1

[32m[1mTest Passed[22m[39m
  Expression: ∂X_2 ≈ X_1
   Evaluated: [0.8898252014582151 0.04740026072328707 0.4704520675052727; 0.7303467835424993 0.5841974164555143 0.1604444825528284; 0.8976284990933854 0.7069928676502096 0.25514366644653985] ≈ [0.8898252014582151 0.04740026072328707 0.4704520675052727; 0.7303467835424993 0.5841974164555143 0.1604444825528284; 0.8976284990933854 0.7069928676502096 0.25514366644653985]

Now, we do the same but with a struct.

 If we use CUDA with custom structs, we have to make sure to use `Adapt` to make our structs avaliable to CUDA, [as explained here](https://cuda.juliagpu.org/stable/tutorials/custom_structs/). The struct also has to have (parametric) types defined, i.e without the `{S,T,U}` we would get an error!

In [9]:
struct PreComputeMul{S,T,U} 
    Y::S
    X_1::T
    X_2::U
    
    function PreComputeMul(X,Y,Z) 
        @assert length(X) == length(Y) == length(Z)
        new{typeof(X),typeof(Y),typeof(Z)}(X,Y,Z)
    end 
end

function Adapt.adapt_structure(to, m::PreComputeMul)
    Y = Adapt.adapt_structure(to, m.Y)
    X_1 = Adapt.adapt_structure(to, m.X_1)
    X_2 = Adapt.adapt_structure(to, m.X_2)
    PreComputeMul(Y, X_1, X_2)
end

function element_mul_kernel!(C::PreComputeMul)
    i = threadIdx().x
    C.Y[i] = C.X_1[i]*C.X_2[i]
    return nothing
end

function compute_elementwise_mul!(X::PreComputeMul)
    @cuda threads=length(X.Y) element_mul_kernel!(X.Y, X.X_1, X.X_2)
end 

compute_elementwise_mul! (generic function with 1 method)

In [10]:
X_1 = DeviceArray(rand(3,3))
∂X_1 = zero(X_1) # input, hence zero 
X_2 = DeviceArray(rand(size(X_1)...))
∂X_2 = zero(X_2) # input, hence zero
Y = DeviceArray(zeros(size(X_1)...))
∂Y = fill!(similar(Y), 1) # output, hence something

X = PreComputeMul(Y, X_1, X_2)
∂X = PreComputeMul(∂Y, ∂X_1, ∂X_2)

PreComputeMul{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}([1.0 1.0 1.0; 1.0 1.0 1.0; 1.0 1.0 1.0], [0.0 0.0 0.0; 0.0 0.0 0.0; 0.0 0.0 0.0], [0.0 0.0 0.0; 0.0 0.0 0.0; 0.0 0.0 0.0])

The first way to compute this would also work without the `adapt_structure`:

In [11]:
@cuda threads=length(X.Y) element_mul_kernel!(X.Y, X.X_1, X.X_2)

CUDA.HostKernel{typeof(element_mul_kernel!), Tuple{CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}}}(element_mul_kernel!, CuFunction(Ptr{Nothing} @0x0000000011905e60, CuModule(Ptr{Nothing} @0x0000000011905cb0, CuContext(0x0000000002b402c0, instance e0ff3f39bda82691))), CUDA.KernelState(Ptr{Nothing} @0x00002b20e5a00000))

But for kernels that actually use the structs, we need it:

In [12]:
@cuda threads=length(X.Y) element_mul_kernel!(X)

CUDA.HostKernel{typeof(element_mul_kernel!), Tuple{PreComputeMul{CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}}}}(element_mul_kernel!, CuFunction(Ptr{Nothing} @0x00000000115165b0, CuModule(Ptr{Nothing} @0x0000000007fb2fb0, CuContext(0x0000000002b402c0, instance e0ff3f39bda82691))), CUDA.KernelState(Ptr{Nothing} @0x00002b20e5a00000))

Test if the implementation works on GPU

In [13]:
@test X.Y ≈ X.X_1 .* X.X_2

[32m[1mTest Passed[22m[39m
  Expression: X.Y ≈ X.X_1 .* X.X_2
   Evaluated: [0.12712573852891057 0.23446725910034724 0.012028230568762222; 0.16900630393232668 0.6076334912688396 0.26773287122182476; 0.04430743480845625 0.06708683278909719 0.41290929450218006] ≈ [0.12712573852891057 0.23446725910034724 0.012028230568762222; 0.16900630393232668 0.6076334912688396 0.26773287122182476; 0.04430743480845625 0.06708683278909719 0.41290929450218006]

Now, we can take the derivative similar to the example without the struct

In [14]:
function grad_elementwise_mul_kernel!(A, dA)
    Enzyme.autodiff_deferred(element_mul_kernel!, Const, Duplicated(A, dA))
    return nothing
end

grad_elementwise_mul_kernel! (generic function with 1 method)

In [15]:
@cuda threads=length(X.Y) grad_elementwise_mul_kernel!(X, ∂X)

CUDA.HostKernel{typeof(grad_elementwise_mul_kernel!), Tuple{PreComputeMul{CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}}, PreComputeMul{CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}}}}(grad_elementwise_mul_kernel!, CuFunction(Ptr{Nothing} @0x00000000121f7a10, CuModule(Ptr{Nothing} @0x000000001163bac0, CuContext(0x0000000002b402c0, instance e0ff3f39bda82691))), CUDA.KernelState(Ptr{Nothing} @0x00002b20e5a00000))

In [16]:
@test ∂X_1 ≈ X_2

[32m[1mTest Passed[22m[39m
  Expression: ∂X_1 ≈ X_2
   Evaluated: [0.6757225470713533 0.6629617469870471 0.48239405723117035; 0.3404947156585928 0.708240345050117 0.28412692076096935; 0.1254104958726736 0.07725581087465527 0.7143499334334222] ≈ [0.6757225470713533 0.6629617469870471 0.48239405723117035; 0.3404947156585928 0.708240345050117 0.28412692076096935; 0.1254104958726736 0.07725581087465527 0.7143499334334222]

In [17]:
@test ∂X_2 ≈ X_1

[32m[1mTest Passed[22m[39m
  Expression: ∂X_2 ≈ X_1
   Evaluated: [0.1881330422965547 0.35366634676273145 0.02493445014186424; 0.4963551449115172 0.8579481464386808 0.9423002597035267; 0.353299255378438 0.8683726444596267 0.5780210442767031] ≈ [0.1881330422965547 0.35366634676273145 0.02493445014186424; 0.4963551449115172 0.8579481464386808 0.9423002597035267; 0.353299255378438 0.8683726444596267 0.5780210442767031]

## Enzyme with KernelAbstractions.jl 

Now we do the same but with `KernelAbstractions`. `KernelAbstractions` has the advantage that it works both on CPU and GPU, as we will also see in these examples. 

In [18]:
using KernelAbstractions, CUDAKernels, KernelGradients 

The syntax for the kernel is almost the same! The `@Const` marks input arguments that are not allowed to be mutated or aliases of other input arguments. It would work without it, but this optimizes the kernel. 


In [19]:
@kernel function KA_element_mul_kernel!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)
    C[i,j] = A[i,j] * B[i,j]
end

KA_element_mul_kernel! (generic function with 5 methods)

We have to launch it a bit different though. First we define a wrapper around the kernel

In [20]:
function element_mul!(a, b, c)
    @assert size(a) == size(b) == size(c)
  
    device = KernelAbstractions.get_device(a) # here we determine if the array is on GPU or CPU
    n = device isa GPU ? 256 : 4   # we split how 
    kernel! = KA_element_mul_kernel!(device, n)
    kernel!(a, b, c, ndrange=size(c)) 
end


element_mul! (generic function with 1 method)

Then, we launch it. With KernelAbstractions, we have to wait for all computations to be done manually.

In [21]:
X_1 = DeviceArray(rand(256,256))
X_2 = DeviceArray(rand(size(X_1)...))
Y = DeviceArray(zeros(size(X_1)...))

ev = element_mul!(Y, X_1, X_2)
wait(ev)

In [22]:
@test Y ≈ X_1 .* X_2

[32m[1mTest Passed[22m[39m
  Expression: Y ≈ X_1 .* X_2
   Evaluated: [0.3337103560525394 0.29402818493817334 … 0.06369297065406779 0.3525451580850144; 0.5848881294428624 0.0018374382075833579 … 0.6849936555907357 0.20579007416570225; … ; 0.17890218595999355 0.1726866399127674 … 0.33404988632629096 0.019546283728313463; 0.015449132913712192 0.5957471748490338 … 0.019021526400236976 0.42824716696059906] ≈ [0.3337103560525394 0.29402818493817334 … 0.06369297065406779 0.3525451580850144; 0.5848881294428624 0.0018374382075833579 … 0.6849936555907357 0.20579007416570225; … ; 0.17890218595999355 0.1726866399127674 … 0.33404988632629096 0.019546283728313463; 0.015449132913712192 0.5957471748490338 … 0.019021526400236976 0.42824716696059906]

If we'd want to, we could also write this a bit more compact: 

In [30]:
?SpeedyExperiments.device

```
device()
```

Return currently used device for KernelAbstractions, either `CPU` or `CUDADevice`


In [23]:
X_1 = DeviceArray(rand(256,256))
X_2 = DeviceArray(rand(size(X_1)...))
Y = DeviceArray(zeros(size(X_1)...))

const device = SpeedyExperiments.device()
n = device isa GPU ? 256 : 4   # we split how 

wait(KA_element_mul_kernel!(device(), n)(Y, X_1, X_2, ndrange=size(Y)))

In [24]:
@test Y ≈ X_1 .* X_2

[32m[1mTest Passed[22m[39m
  Expression: Y ≈ X_1 .* X_2
   Evaluated: [0.2305129485809078 0.31983404024174666 … 0.8747413994716363 0.7215888755411229; 0.5359386887744276 0.24342308941446172 … 0.0554909996797386 0.018149699166728067; … ; 0.018582370211715143 0.17224335049959472 … 0.697381590135942 0.0322861905012264; 0.03082025174373685 0.5592708971320206 … 0.3280267978897599 0.10224962869549795] ≈ [0.2305129485809078 0.31983404024174666 … 0.8747413994716363 0.7215888755411229; 0.5359386887744276 0.24342308941446172 … 0.0554909996797386 0.018149699166728067; … ; 0.018582370211715143 0.17224335049959472 … 0.697381590135942 0.0322861905012264; 0.03082025174373685 0.5592708971320206 … 0.3280267978897599 0.10224962869549795]

Now, let's compute the derivatives for this we have to call `autodiff` with the kernel function which in turn is called with the device and the work group size. Then we can call the kernel event created by `autodiff` with the `Duplicated` inputs 

In [25]:
X_1 = DeviceArray(rand(256,256))
∂X_1 = zero(X_1) # input, hence zero 
X_2 = DeviceArray(rand(size(X_1)...))
∂X_2 = zero(X_2) # input, hence zero
Y = DeviceArray(zeros(size(X_1)...))
∂Y = fill!(similar(Y), 1); # output, hence something

In [26]:
∇! = autodiff(KA_element_mul_kernel!(device(), n))
wait(∇!(Duplicated(Y, ∂Y), Duplicated(X_1,∂X_1), Duplicated(X_2,∂X_2); ndrange=size(Y)))

In [27]:
@test ∂X_2 ≈ X_1

[32m[1mTest Passed[22m[39m
  Expression: ∂X_2 ≈ X_1
   Evaluated: [0.3001683813417353 0.33651105804500103 … 0.7295775379907252 0.7061833667391895; 0.47716068498995634 0.6277487742370701 … 0.23263185499492556 0.25489388396277823; … ; 0.23988070270197315 0.1709625408576816 … 0.7560901137758668 0.8881407587692896; 0.36252468468521826 0.5508976978441852 … 0.3456348272514411 0.5863218842991311] ≈ [0.3001683813417353 0.33651105804500103 … 0.7295775379907252 0.7061833667391895; 0.47716068498995634 0.6277487742370701 … 0.23263185499492556 0.25489388396277823; … ; 0.23988070270197315 0.1709625408576816 … 0.7560901137758668 0.8881407587692896; 0.36252468468521826 0.5508976978441852 … 0.3456348272514411 0.5863218842991311]

In [28]:
@test ∂X_1 ≈ X_2

[32m[1mTest Passed[22m[39m
  Expression: ∂X_1 ≈ X_2
   Evaluated: [0.34226680300317047 0.634724995713039 … 0.6093939470517675 0.36332649317030075; 0.8225014923248329 0.5508262962663629 … 0.22134173809291824 0.931128813263804; … ; 0.9546090339230084 0.41419958027403914 … 0.29882720500222815 0.883099791145679; 0.3222634071419628 0.5638991498576006 … 0.5608314859908383 0.6926284720247491] ≈ [0.34226680300317047 0.634724995713039 … 0.6093939470517675 0.36332649317030075; 0.8225014923248329 0.5508262962663629 … 0.22134173809291824 0.931128813263804; … ; 0.9546090339230084 0.41419958027403914 … 0.29882720500222815 0.883099791145679; 0.3222634071419628 0.5638991498576006 … 0.5608314859908383 0.6926284720247491]

Now, let's do this with a struct. First just the regular forward execution:

In [40]:
struct KAPreComputeMul{S,T,U} 
    Y::S
    X_1::T
    X_2::U
    
    function KAPreComputeMul(X,Y,Z) 
        @assert size(X) == size(Y) == size(Z)
        new{typeof(X),typeof(Y),typeof(Z)}(X,Y,Z)
    end 
end

@kernel function KAstruct_element_mul_kernel!(A)
    i, j = @index(Global, NTuple)
    A.Y[i,j] = A.X_1[i,j] * A.X_2[i,j]
end




In [37]:
X_1 = DeviceArray(rand(256,256))
X_2 = DeviceArray(rand(size(X_1)...))
Y = DeviceArray(zeros(size(X_1)...))

X = PreComputeMul(Y, X_1, X_2);

In [41]:
wait(KAstruct_element_mul_kernel!(device(), n)(X, ndrange=size(X.Y)))

In [42]:
@test Y ≈ X_1 .* X_2

[32m[1mTest Passed[22m[39m
  Expression: Y ≈ X_1 .* X_2
   Evaluated: [0.7101595915781801 0.008120219688601554 … 0.060897683383751386 0.2754529449739506; 0.14205550322857488 0.6473728339280315 … 0.5340696792862343 0.08308460257713372; … ; 0.5787314607833485 0.30471006301420117 … 0.24251569087196242 0.07301991058396372; 0.44550218641539463 0.37441406755553897 … 0.0013741518998793985 0.7474249479979399] ≈ [0.7101595915781801 0.008120219688601554 … 0.060897683383751386 0.2754529449739506; 0.14205550322857488 0.6473728339280315 … 0.5340696792862343 0.08308460257713372; … ; 0.5787314607833485 0.30471006301420117 … 0.24251569087196242 0.07301991058396372; 0.44550218641539463 0.37441406755553897 … 0.0013741518998793985 0.7474249479979399]

Great, now let's also compute some derivatives here:

In [43]:
X_1 = DeviceArray(rand(256,256))
∂X_1 = zero(X_1) # input, hence zero 
X_2 = DeviceArray(rand(size(X_1)...))
∂X_2 = zero(X_2) # input, hence zero
Y = DeviceArray(zeros(size(X_1)...))
∂Y = fill!(similar(Y), 1) # output, hence something

X = PreComputeMul(Y, X_1, X_2)
∂X = PreComputeMul(∂Y, ∂X_1, ∂X_2)

PreComputeMul{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}([1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0; … ; 1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0])

In [44]:
∇! = autodiff(KAstruct_element_mul_kernel!(device(), n))
wait(∇!(Duplicated(X, ∂X); ndrange=size(Y)))

In [45]:
@test ∂X_2 ≈ X_1

[32m[1mTest Passed[22m[39m
  Expression: ∂X_2 ≈ X_1
   Evaluated: [0.15391591677219585 0.646132074657528 … 0.5414569874027779 0.9861771809225744; 0.8413944025227638 0.5735657203260538 … 0.2298950840328039 0.8395484685139407; … ; 0.6099415975391191 0.5083973024770879 … 0.21089068454151105 0.26577656580681575; 0.09945135277963224 0.34378484866526393 … 0.19222273798221923 0.7089355837673377] ≈ [0.15391591677219585 0.646132074657528 … 0.5414569874027779 0.9861771809225744; 0.8413944025227638 0.5735657203260538 … 0.2298950840328039 0.8395484685139407; … ; 0.6099415975391191 0.5083973024770879 … 0.21089068454151105 0.26577656580681575; 0.09945135277963224 0.34378484866526393 … 0.19222273798221923 0.7089355837673377]

In [46]:
@test ∂X_1 ≈ X_2

[32m[1mTest Passed[22m[39m
  Expression: ∂X_1 ≈ X_2
   Evaluated: [0.6873005577895346 0.4717280706893876 … 0.7711996393728344 0.06964887144950993; 0.1304670292398673 0.1352036301140137 … 0.1295697835184515 0.9733663940943449; … ; 0.8660550077294288 0.6877076928486632 … 0.6646838321316613 0.3120682840128023; 0.7708405757472858 0.601594311518908 … 0.8565597958017475 0.3152767598966769] ≈ [0.6873005577895346 0.4717280706893876 … 0.7711996393728344 0.06964887144950993; 0.1304670292398673 0.1352036301140137 … 0.1295697835184515 0.9733663940943449; … ; 0.8660550077294288 0.6877076928486632 … 0.6646838321316613 0.3120682840128023; 0.7708405757472858 0.601594311518908 … 0.8565597958017475 0.3152767598966769]