In [2]:
# Setting up a custom stylesheet in IJulia
styl = read("./../style.css", String) # Read the .css file from the parent folder of this notebook
HTML(styl) # Output as HTML

## CUDA.jl (based on the [CUDA.jl docs](https://cuda.juliagpu.org/stable/))

<h2>In this notebook</h2>

- [Set up](#Set-up)


# Set up

CUDA.jl works on top of the NVIDIA driver. We don't need to install the entire CUDA toolkit ourselves: it is set up automatically when we add the CUDA package:

In [None]:
# Install the pkg 
using Pkg; 
Pkg.add("CUDA")

# get the tool version 
using CUDA 
CUDA.versioninfo()

# test pkg
Pkg.test("CUDA")

# Really simple example

In this first simple example we look at the main differences between a CPU and a GPU implementation of an elementwise multiply operation. Let's start with the CPU implementation and create a test problem with large arrays:


In [None]:
N = 2^22 # size of both vec
x = fill(1.0f0, N) # x vec
y = fill(2.0f0, N) # y vec

# result and test
r = x.*y

using Test 
@test (all(r.==x[1]*y[1]))

Let's now implement a serial and a parallel CPU version with the functions `serial_cpu_multiply` and `parallel_cpu_multiply`:


In [None]:
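# The number of CPU threads is fixed when Julia starts: set the JULIA_NUM_THREADS
# environment variable (e.g. to 6) or pass `--threads 6` before launching Julia.
# Assigning a variable named JULIA_NUM_THREADS inside the session has no effect.
println("CPU threads available: ", Threads.nthreads())

# Declare the serial CPU function
function serial_cpu_multiply(x,y)
    r = similar(x) # allocate the output instead of overwriting the global `r`
    for i in eachindex(x,y)
        @inbounds r[i] = x[i]*y[i]
    end
    return r
end


# Declare the parallel CPU function
function parallel_cpu_multiply(x,y)
    r = similar(x) # each call gets its own output array
    Threads.@threads for i in eachindex(x,y)
        @inbounds r[i] = x[i]*y[i]
    end
    return r
end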

# Execute function for y and x vec
r_serial_cpu = serial_cpu_multiply(x,y)
r_parallel_cpu = parallel_cpu_multiply(x,y)

# Run test 

@test (all(r_parallel_cpu.==(x[1]*y[1])) && all(r_serial_cpu.==(x[1]*y[1])))


Let's now measure the execution time with the `BenchmarkTools` package:

In [None]:
# Install and load the package
Pkg.add("BenchmarkTools")
using BenchmarkTools

# Serial and parallel CPU multiply
# (`$` interpolates the global variables so that only the function call is timed)
@btime serial_cpu_multiply($x,$y)
@btime parallel_cpu_multiply($x,$y)

Let's now implement it using GPU:

In [None]:
# Load CUDA
using CUDA

# Define the vectors on the GPU
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)
r_d = CUDA.fill(0.0f0, N)


# Define the function
function multiply_gpu(x,y)
    # No `return` inside the block, so the synchronization runs before the function exits
    CUDA.@sync begin
        x.*y
    end
end

# Execute and measure the time
@btime multiply_gpu($x_d, $y_d)

The `CUDA.@sync` macro is the interesting part here. It makes the CPU wait until the GPU has finished its work before continuing. Most of the time, however, you don't need to synchronize explicitly: many operations, like copying memory from the GPU to the CPU, implicitly synchronize execution.
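The implicit synchronization mentioned above can be sketched as follows. This is a minimal illustration that assumes a working CUDA device; the names `a_d`, `b_d` and `b_h` are made up for the example.

```julia
using CUDA

# Small vector on the GPU (names are illustrative only)
a_d = CUDA.fill(1.0f0, 1024)

# The broadcast is queued on the GPU; the CPU does not have to wait for it here
b_d = a_d .* 2.0f0

# Copying the result to a CPU Array waits for the GPU work to finish first,
# so no explicit CUDA.@sync is needed before reading the values
b_h = Array(b_d)

println(b_h[1])
```

Explicit synchronization mainly matters when timing GPU code, where nothing else forces the CPU to wait.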

This high-level approach is fine for simple computations, but to control what happens under the hood we need to write our own kernel. Let's implement one for this task:

In [None]:
# Create the kernel (it loops over all elements in a single GPU thread)
function multiply_gpu_kernel!(x, y, r)
    for i in 1:length(x)
        r[i] = x[i]*y[i]
    end
    return nothing
end


# Execute the kernel; without `threads`/`blocks` arguments, @cuda uses one thread and one block
@cuda multiply_gpu_kernel!(x_d, y_d, r_d)

# Let's create a benchmark function to test it
function bench_multiply_gpu(x_d, y_d, r_d)
    CUDA.@sync begin
        @cuda multiply_gpu_kernel!(x_d, y_d, r_d)
    end
end

@btime bench_multiply_gpu($x_d, $y_d, $r_d)

With the `CuArray`s `x_d`, `y_d` and `r_d` in place, we can launch our kernel via `@cuda`. On first use, `@cuda` compiles the kernel (`multiply_gpu_kernel!`) for execution on the GPU. Once compiled, future invocations are fast. You can read the documentation of `@cuda` by typing `?@cuda` at the Julia prompt:

```
cuda
  @cuda [kwargs...] func(args...)

  High-level interface for executing code on a GPU. The @cuda macro should
  prefix a call, with func a callable function or object that should return
  nothing. It will be compiled to a CUDA function upon first use, and to a
  certain extent arguments will be converted and managed automatically using
  cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel
  launch on the current CUDA context.

  Several keyword arguments are supported that influence the behavior of
  @cuda.

    •  launch: whether to launch this kernel, defaults to true. If
       false the returned kernel object should be launched by calling
       it and passing arguments again.

    •  dynamic: use dynamic parallelism to launch device-side kernels,
       defaults to false.

    •  arguments that influence kernel compilation: see cufunction and
       dynamic_cufunction

    •  arguments that influence kernel launch: see CUDA.HostKernel and
       CUDA.DeviceKernel
```
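The kernel above does all its work in a single GPU thread, since `@cuda` launches one thread and one block unless told otherwise. A minimal sketch of a parallel variant is shown here, assuming a working CUDA device. The kernel name `multiply_gpu_kernel_parallel!` and the choice of 256 threads per block are assumptions made for illustration, not taken from the notebook.

```julia
using CUDA

# Hypothetical parallel variant: each thread starts at its own global index
# and strides over the array, so all elements are covered by the whole grid
function multiply_gpu_kernel_parallel!(x, y, r)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(r)
        @inbounds r[i] = x[i] * y[i]
    end
    return nothing
end

N = 2^22
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)
r_d = CUDA.zeros(Float32, N)

nthreads = 256              # threads per block (assumed, a common choice)
nblocks  = cld(N, nthreads) # enough blocks to cover all N elements

CUDA.@sync begin
    @cuda threads=nthreads blocks=nblocks multiply_gpu_kernel_parallel!(x_d, y_d, r_d)
end

# Copy the result back to the CPU to inspect it
r_h = Array(r_d)
println(r_h[1])
```

The strided loop keeps the kernel correct even when fewer threads than elements are launched, so `nthreads` and `nblocks` can be tuned without touching the kernel body.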


# Profiling 

It is often important to profile a GPU program, for example to check for uncoalesced memory accesses, race conditions, or memory management problems. For that, we can use NVIDIA's `nvprof` tool. On a Unix system we run:

In [None]:
$ nvprof --profile-from-start off /path/to/julia

`/path/to/julia` is the path to the Julia binary. Note that the profiler is not started immediately (`--profile-from-start off`); instead, we start it from within Julia by wrapping the code to profile in the `CUDA.@profile` macro:

In [1]:
CUDA.@profile bench_multiply_gpu(x_d, y_d, r_d)


However, `nvprof` is no longer used for GPUs with compute capabilities newer than 7.0; instead we need `nsys` (Nsight Systems). To start Julia under `nsys`, we run:

In [None]:
$ nsys launch julia