In [1]:
## Setting up a custom stylesheet in IJulia
file = open("./../style.css") # A .css file in the parent folder of this notebook file
styl = read(file, String) # Read the file
HTML("$styl") # Output as HTML

## General limitations


- poor Float64 performance: GPU
- no recursion : CUDA.jl, though CUDA’s support is limited by the GPU
- kernel must return nothing : CUDA
- no kernel varargs: CUDA, and so far I have not seen CUDA.jl kernels with VarArgs
- no strings : CUDA (aside: a workaround is using arrays of Char, but I haven’t seen CUDA.jl examples of those so far)
- must have type-inferred code : CUDA C is statically typed
- no garbage collection on device : CUDA C has manual memory management
- kernel cannot allocate, and only isbits types in device arrays: CUDA C has no garbage collection, and Julia has no manual deallocations, let alone on the device to deal with data that live independently of the CuArray.
- no try-catch-finally in kernel: CUDA C does not support exception handling on device (v11.5.1 docs, Programming Guide I.4.7)
- no scalar indexing of CuArray: a scalar getindex in host code has to copy the value to the CPU, and CUDA.jl does not launch a kernel for a scalar setindex such as a[1] += 1 because it’s not worth it
- calls to CPU-only runtime library: GPU can’t have a version of every low-level CPU function Julia has
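
As a rough illustration of the isbits restriction in the list above, the following CPU-only sketch checks a few types with Base's `isbitstype`. The structs `Particle` and `Labeled` are hypothetical names introduced only for this example; no GPU is needed to run it.

```julia
# CPU-only check: which types are plain bits and could live in a device array?

struct Particle          # hypothetical immutable struct with only plain-data fields
    x::Float32
    v::Float32
end

struct Labeled           # hypothetical struct holding a String (heap-allocated)
    name::String
    x::Float32
end

for T in (Float32, NTuple{3, Int32}, Particle, String, Vector{Float64}, Labeled)
    println(rpad(string(T), 30), " isbitstype = ", isbitstype(T))
end
```

Types that contain a `String` or a regular `Vector` hold references to heap memory managed by the garbage collector, which is why they fail the check.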

# Unsupported kernel operations

The most frequent issues with the GPU stack come from expected CPU functionality not being implemented or supported on the GPU. Several kinds of array operations lead to an error:

## Unsupported array operations


For example, array operations often fall back to a generic, iterating implementation, which triggers the following scalar indexing error:

In [2]:
using CUDA, LinearAlgebra

eigen(CUDA.rand(2,2))

│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore /home/mvanzulli/.julia/packages/GPUArraysCore/rSIl2/src/GPUArraysCore.jl:81


LoadError: ArgumentError: cannot take the CPU address of a CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

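In this case the generic fallback in `generic.jl` iterates over the array with `for` loops, which indexes the GPU array one element at a time and causes undesired overhead. Such scalar iteration is therefore not allowed; instead we should use the GPU libraries wrapped by CUDA.jl, such as CUBLAS (or CUSOLVER for dense factorizations).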
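
When no GPU implementation is at hand, one possible workaround is an explicit round trip through host memory. The sketch below assumes the matrix is small enough for the copy to be acceptable; it also shows `CUDA.allowscalar(false)` and `CUDA.@allowscalar`, which make scalar indexing an explicit choice rather than an accident.

```julia
using CUDA, LinearAlgebra

# Turn accidental scalar indexing into an error instead of a warning.
CUDA.allowscalar(false)

a = CUDA.rand(2, 2)

# Explicitly copy to the host and let the CPU LAPACK routine do the work.
F = eigen(Array(a))
println(F.values)

# A deliberate scalar access has to be marked explicitly.
first_entry = CUDA.@allowscalar a[1, 1]
println(first_entry)
```

The copy with `Array(a)` is slow for large matrices, so this is meant for prototyping or for small problems only.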

With array operations, it's also easy to get an error when passing unsupported data to GPU kernels. Essentially, every value passed to a kernel needs to be an isbits type. An easy way to violate this is to pass a CPU array to a GPU array operation:

In [3]:
a = CUDA.rand(2,2)
b = rand(2)
broadcast(a, b) do x, y
    x + y
end

LoadError: GPU compilation of kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, var"#1#2", Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, var"#1#2", Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}}} which is not isbits.
    .2 is of type Base.Broadcast.Extruded{Vector{Float64}, Tuple{Bool}, Tuple{Int64}} which is not isbits.
      .x is of type Vector{Float64} which is not isbits.
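
A minimal sketch of one way to avoid this error: upload the CPU vector before broadcasting, so that both arguments are GPU arrays. The name `b_gpu` is introduced here only for illustration.

```julia
using CUDA

a = CUDA.rand(2, 2)          # Float32 matrix on the GPU
b = rand(2)                  # Float64 vector on the CPU

b_gpu = cu(b)                # upload; `cu` also converts Float64 to Float32
c = broadcast(a, b_gpu) do x, y
    x + y
end
```

At launch time CUDA.jl converts each `CuArray` into its device-side counterpart (the `CuDeviceMatrix` visible in the error message above), which is an isbits type; a plain `Vector` has no such conversion.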



## Unsupported kernel operations


In device code, i.e. code that actually runs on the GPU as opposed to a CPU method (like eigen) that's implemented using GPU kernels, the story is a little more complicated. Essentially, not all of the Julia language is supported. If you use unsupported functionality, you will see a compilation error:

In [4]:
a = CUDA.rand(1)
broadcast(a) do x
    print(x)
end

LoadError: InvalidIRError: compiling kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{Nothing, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, var"#3#4", Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_subtype)
Stacktrace:
 [1] print
   @ ./coreio.jl:3
 [2] #3
   @ ./In[4]:3
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:670
 [4] _broadcast_getindex
   @ ./broadcast.jl:643
 [5] getindex
   @ ./broadcast.jl:597
 [6] broadcast_kernel
   @ ~/.julia/packages/GPUArrays/gok9K/src/host/broadcast.jl:67
Reason: unsupported dynamic function invocation (call to print)
Stacktrace:
 [1] print
   @ ./coreio.jl:3
 [2] #3
   @ ./In[4]:3
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:670
 [4] _broadcast_getindex
   @ ./broadcast.jl:643
 [5] getindex
   @ ./broadcast.jl:597
 [6] broadcast_kernel
   @ ~/.julia/packages/GPUArrays/gok9K/src/host/broadcast.jl:67
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code

To help with debugging, several stack traces are displayed: one for each unsupported operation in the GPU kernel, followed by a host stack trace pointing to the CPU code that launched the kernel. Often, that's sufficient to find and resolve the issue.
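
For the `print` case specifically, one option is the device-side printing macro `@cuprintln` that CUDA.jl provides. The sketch below uses a hypothetical kernel name, `print_kernel`, and only prints simple scalar values.

```julia
using CUDA

function print_kernel(a)
    i = threadIdx().x
    x = a[i]
    @cuprintln("a[$i] = $x")   # device-side printing of simple values
    return
end

a = CUDA.rand(4)
@cuda threads=length(a) print_kernel(a)
CUDA.synchronize()   # wait for the kernel so the device output is flushed
```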

Sometimes though, the issue is more subtle:

In [5]:
function bad_kernel(a::AbstractArray)
    a[threadId().x] = 0
    return
end

@cuda bad_kernel(CuArray([1]))

LoadError: InvalidIRError: compiling kernel #bad_kernel(CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported use of an undefined name (use of 'threadId')
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Reason: unsupported dynamic function invocation (call to getproperty)
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code

To inspect the kernel we can use `@device_code_warntype` or `@device_code_llvm`:

In [6]:
@device_code_warntype @cuda bad_kernel(CuArray([1]))

PTX CompilerJob of kernel #bad_kernel(CuDeviceVector{Int64, 1}) for sm_80

MethodInstance for bad_kernel(::CuDeviceVector{Int64, 1})
  from bad_kernel(a::AbstractArray) in Main at In[5]:1
Arguments
  #self#::Core.Const(bad_kernel)
  a::CuDeviceVector{Int64, 1}
Body::Nothing
1 ─ %1 = Main.threadId()::Any
│   %2 = Base.getproperty(%1, :x)::Any
│        Base.setindex!(a, 0, %2)
└──      return nothing



LoadError: InvalidIRError: compiling kernel #bad_kernel(CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported use of an undefined name (use of 'threadId')
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Reason: unsupported dynamic function invocation (call to getproperty)
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] bad_kernel
   @ ./In[5]:2
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code

From this view, it's much clearer that `threadId` was a typo and should be `threadIdx` instead. Note that if your issue lies in a function that's called by the kernel and isn't directly visible in the `@device_code_warntype` output, you can use Cthulhu.jl to explore the code interactively with `@device_code_warntype interactive=true` ... (the equivalent of Cthulhu's `@descend_code_warntype`).

In [7]:
function good_kernel(a)
    @inbounds a[threadIdx().x] = 0
    return
end

@device_code_llvm @cuda good_kernel(CuArray([1]))

; PTX CompilerJob of kernel #good_kernel(CuDeviceVector{Int64, 1}) for sm_80
;  @ In[7]:1 within `good_kernel`
define ptx_kernel void @_Z22julia_good_kernel_657013CuDeviceArrayI5Int64Li1ELi1EE([1 x i64] %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr #1 {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
;  @ In[7]:2 within `good_kernel`
; ┌ @ /home/mvanzulli/.julia/packages/CUDA/tTK8Y/src/device/intrinsics/i

We can dig even deeper and inspect the generated assembly code (PTX and SASS):

In [8]:
CUDA.code_ptx(good_kernel, Tuple{CuDeviceVector{Int64, 1}})
CUDA.code_sass(good_kernel, Tuple{CuDeviceVector{Int64, 1}})


//
// Generated by LLVM NVPTX Back-End
//

.version 7.0
.target sm_80
.address_size 64

	// .globl	julia_good_kernel_6690  // -- Begin function julia_good_kernel_6690
                                        // @julia_good_kernel_6690
.visible .func julia_good_kernel_6690(
	.param .b64 julia_good_kernel_6690_param_0
)
{
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<6>;

// %bb.0:                               // %top
	ld.param.u64 	%rd1, [julia_good_kernel_6690_param_0];
	mov.u32 	%r1, %tid.

## Avoiding race condition problems 

It's also easy to run into issues related to parallel programming: threads need to be synchronized, memory needs to be initialized, etc. Mistakes are easy to make. For example, let's look at this kernel that transposes a matrix:

In [9]:
const tile_dim = 16

function gpu_transpose(input::CuMatrix)
    function kernel(input::AbstractMatrix{T}, output::AbstractMatrix{T}) where {T}
        # shared memory buffer so that operations to global memory are linear and can be coalesced
        block = @cuStaticSharedMem(T, (tile_dim, tile_dim))

        # read
        x = tile_dim * (blockIdx().x - 1) + threadIdx().x
        y = tile_dim * (blockIdx().y - 1) + threadIdx().y
        if x <= size(input, 1) && y <= size(input, 2)
            block[threadIdx().y, threadIdx().x] = input[x, y]
        end
        
        # write
        x = tile_dim * (blockIdx().y - 1) + threadIdx().x
        y = tile_dim * (blockIdx().x - 1) + threadIdx().y
        if x <= size(output, 1) && y <= size(output, 2)
            output[x, y] = block[threadIdx().x, threadIdx().y]
        end
        return
    end
    
    output = similar(input, reverse(size(input)))
    @cuda threads=(tile_dim, tile_dim) blocks=(cld(size(input, 1), tile_dim), cld(size(input, 2), tile_dim)) kernel(input, output)
    output
end

a = CuArray(reshape(1:1024, 32, 32))
b = gpu_transpose(a)
display(b)
println("Valid transpose: ", Array(b) == Array(a)')

32×32 CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}:
   1    2    3    4    5    6    7  …     0     0     0     0    31    32
  33   34   35   36   37   38   39        0     0     0     0    63    64
  65   66   67   68   69   70   71        0     0     0     0    95    96
  97   98   99  100    0    0    0        0     0     0     0   127   128
 129  130  131  132  133  134  135        0     0     0     0   159   160
 161  162  163  164  165  166  167  …     0     0     0     0   191   192
 193  194  195  196  197  198  199        0     0     0     0   223   224
 225  226  227  228  229  230  231        0     0     0     0   255   256
 257  258  259  260  261  262  263        0     0     0     0   287   288
 289  290  291  292  293  294  295        0     0     0     0   319   320
 321  322  323  324  325  326  327  …   347   348   349   350   351   352
 353  354  355  356  357  358  359      379   380   381   382   383   384
 385  386  387  388  389  390  391      411   412   413   414   

Valid transpose: false


If we execute this kernel a couple of times, we see some strange values in the matrix. This may be caused by a race condition, which NVIDIA provides tools to detect.

With the compute sanitizer (https://docs.nvidia.com/cuda/compute-sanitizer/index.html), one can run Julia under several sanitizer tools that detect issues like memory errors, race conditions, or missing synchronization. The sanitizer is shipped as part of CUDA.jl, but cannot be launched from within a session:

In [10]:
# Obtain the path to the compute-sanitizer binary
CUDA.compute_sanitizer()

"/home/mvanzulli/.julia/artifacts/913584335ab836f9781a0325178d0949c193f50b/bin/compute-sanitizer"

In [11]:
# Save the following as a new script (together with the gpu_transpose definition above)
using CUDA
a = CUDA.ones(1024,1024)
b = gpu_transpose(a)
@assert Array(b) == Array(a)

LoadError: AssertionError: Array(b) == Array(a)

and we can execute:
```bash
/home/mvanzulli/.julia/artifacts/913584335ab836f9781a0325178d0949c193f50b/bin/compute-sanitizer --launch-timeout=0 --report-api-errors=no --tool=racecheck julia /home/mvanzulli/Repositories/gitHub/Julia_by_Notebooks/cuda/transpose_race_conflict.jl
```

the result will be 

```bash
========= COMPUTE-SANITIZER
32×32 CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}:
========= Error: Race reported between Write access at 0x410 in /home/mvanzulli/.julia/packages/LLVM/WjSQG/src/interop/base.jl:45:julia_kernel_1680(CuDeviceArray<Int64, (int)2, (int)1>, CuDeviceArray<Int64, (int)2, (int)1>)
=========     and Read access at 0x750 in int.jl:87:julia_kernel_1680(CuDeviceArray<Int64, (int)2, (int)1>, CuDeviceArray<Int64, (int)2, (int)1>) [7680 hazards]
=========
```

So we have a race condition problem. Note the `--launch-timeout=0` argument, to give Julia some time to perform the first CUDA API call, and `--report-api-errors=no` to hide probably unrelated API errors (that can come from a variety of places, including NVIDIA's own libraries). 

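To fix this, we just add a `sync_threads()` call between the read and write operations, so that every thread in the block has finished filling the shared-memory tile before any thread reads from it.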
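
A sketch of the corrected transpose is given below. The names `TILE`, `transpose_kernel` and `gpu_transpose_synced` are chosen here to avoid clashing with the definitions above, and `CuStaticSharedArray` is used as the function-style replacement for `@cuStaticSharedMem` in more recent CUDA.jl releases; on the version used in this notebook the macro should work equally well.

```julia
using CUDA

const TILE = 16   # hypothetical name, same role as tile_dim above

function transpose_kernel(input::AbstractMatrix{T}, output::AbstractMatrix{T}) where {T}
    # shared-memory tile so that global memory accesses can be coalesced
    block = CuStaticSharedArray(T, (TILE, TILE))

    # read: each thread copies one element into the shared tile
    x = TILE * (blockIdx().x - 1) + threadIdx().x
    y = TILE * (blockIdx().y - 1) + threadIdx().y
    if x <= size(input, 1) && y <= size(input, 2)
        block[threadIdx().y, threadIdx().x] = input[x, y]
    end

    # barrier: placed outside the `if` so every thread of the block reaches it
    sync_threads()

    # write: each thread reads an element stored by a different thread
    x = TILE * (blockIdx().y - 1) + threadIdx().x
    y = TILE * (blockIdx().x - 1) + threadIdx().y
    if x <= size(output, 1) && y <= size(output, 2)
        output[x, y] = block[threadIdx().x, threadIdx().y]
    end
    return
end

function gpu_transpose_synced(input::CuMatrix)
    output = similar(input, reverse(size(input)))
    blocks = (cld(size(input, 1), TILE), cld(size(input, 2), TILE))
    @cuda threads=(TILE, TILE) blocks=blocks transpose_kernel(input, output)
    return output
end

a = CuArray(reshape(1:1024, 32, 32))
b = gpu_transpose_synced(a)
println("Valid transpose: ", Array(b) == Array(a)')
```

The barrier sits outside the bounds check on purpose: a `sync_threads()` that only some threads of a block reach can deadlock or produce undefined results.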

### Define an array inside a kernel for each thread 

Since allocating a regular `Array` inside a kernel is not supported, there are a few workarounds:
- Use StaticArrays
- Use Shared memory 
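
As a minimal sketch of the StaticArrays option, the kernel below builds a fixed-size `SVector` per thread. The kernel name `small_vector_kernel` and the output array `out` are hypothetical and only serve this illustration.

```julia
using CUDA, StaticArrays

function small_vector_kernel(out)
    i = threadIdx().x                      # Int32 thread index
    one32 = Int32(1)
    # fixed-size, stack-allocated vector: no heap allocation on the device
    v = SVector{3, Int32}(i, i + one32, i + one32 + one32)
    out[i] = v[1] + v[2] + v[3]
    return
end

out = CUDA.zeros(Int32, 4)
@cuda threads=length(out) small_vector_kernel(out)
println(Array(out))
```

Because the length is part of the type, the compiler can keep the vector in registers or local memory. A plain tuple such as `(i, i + 1, i + 2)` would work in a similar way when no array interface is needed.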

In [52]:
using CUDA, StaticArrays
const mat_indexes_rows = 1
const mat_indexes_cols = 3
function bug_kernel(a::AbstractArray)
    
    b = threadIdx().x
    c = threadIdx().y
    d = threadIdx().z
    @cuprintln("(b, c, d) is ($b, $c, $d)")
        # define a vector
    f = [b+1, c+1, d+1]
    f = b+1, c+1, d+1
    @cuprintln("(f) is ($f)")

    return nothing
end

# Define box
N = 3
box = (N, N, N)
# Define grid
dim_block = (2, 2, 2)
# Define num threads
num_threads = cld.(box, dim_block)

random_array_for_thread = MMatrix{3, 3, Int32}(undef)

@device_code_typed CUDA.@cuda(
    threads = num_threads,
    blocks = dim_block,
    bug_kernel(CuArray(random_array_for_thread))
)

random_array_for_thread

3×3 MMatrix{3, 3, Int32, 9} with indices SOneTo(3)×SOneTo(3):
 1           0  2
 0  -566635184  0
 1       32556  1

(b, c, d) is (1, 1, 1)
(b, c, d) is (2, 1, 1)
(b, c, d) is (1, 2, 1)
(b, c, d) is (2, 2, 1)
(b, c, d) is (1, 1, 2)
(b, c, d) is (2, 1, 2)
(b, c, d) is (1, 2, 2)
(b, c, d) is (2, 2, 2)
(b, c, d) is (1, 1, 1)
(b, c, d) is (2, 1, 1)
(b, c, d) is (1, 2, 1)
(b, c, d) is (2, 2, 1)
(b, c, d) is (1, 1, 2)
(b, c, d) is (2, 1, 2)
(b, c, d) is (1, 2, 2)
(b, c, d) is (2, 2, 2)
(b, c, d) is (1, 1, 1)
(b, c, d) is (2, 1, 1)
(b, c, d) is (1, 2, 1)
(b, c, d) is (2, 2, 1)
(b, c, d) is (1, 1, 2)
(b, c, d) is (2, 1, 2)
(b, c, d) is (1, 2, 2)
(b, c, d) is (2, 2, 2)
(b, c, d) is (1, 1, 1)
(b, c, d) is (2, 1, 1)
(b, c, d) is (1, 2, 1)
(b, c, d) is (2, 2, 1)
(b, c, d) is (1, 1, 2)
(b, c, d) is (2, 1, 2)
(b, c, d) is (1, 2, 2)
(b, c, d) is (2, 2, 2)
(b, c, d) is (1, 1, 1)
(b, c, d) is (2, 1, 1)
(b, c, d) is (1, 2, 1)
(b, c, d) is (2, 2, 1)
(b, c, d) is (1, 1, 2)
(b, c, d) is (2, 1, 2)
(b, c, d) is (1, 2, 2)
(b, c, d) is (2, 2, 2)
(b, c, d) is (1, 1, 1)
(b, c, d) is (2, 1, 1)
(b, c, d) is (1, 2, 1)
(b, c, d) i