In [None]:
# Setting up a custom stylesheet in IJulia
styl = read("./../style.css", String) # Read the .css file located in the parent folder of this notebook
HTML("$styl") # Output as HTML

## CUDA.jl (based on the [CUDA.jl docs](https://cuda.juliagpu.org/stable/))

<h2>In this notebook</h2>

- [Set up](#Set-up)
- [Really simple example](#Really-simple-example)
- [Profiling](#Profiling)


# Set up

CUDA.jl needs a working NVIDIA driver, but we don't have to install the entire CUDA toolkit ourselves: it is downloaded automatically as an artifact when we add the CUDA package:

In [None]:
# Start Julia from the shell with multithreading enabled
```bash
julia --threads auto 
```

# Install the package
import Pkg
Pkg.add("CUDA")

# Print the toolkit, driver, library and device information
import CUDA
CUDA.versioninfo()

# Run the package test suite (this can take a long time)
Pkg.test("CUDA")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`


CUDA toolkit 11.7, artifact installation
NVIDIA driver 515.48.7, for CUDA 11.7
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+515.48.7
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.3
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce RTX 3060 Laptop GPU (sm_86, 5.687 GiB / 6.000 GiB available)


[32m[1m     Testing[22m[39m CUDA
[32m[1m      Status[22m[39m `/tmp/jl_RTiu1V/Project.toml`
 [90m [79e6a3ab] [39mAdapt v3.3.3
 [90m [ab4f0b2a] [39mBFloat16s v0.2.0
 [90m [052768ef] [39mCUDA v3.11.0
 [90m [864edb3b] [39mDataStructures v0.18.13
 [90m [7a1cc6ca] [39mFFTW v1.4.6
 [90m [0c68f7d7] [39mGPUArrays v8.3.2
 [90m [a98d9a8b] [39mInterpolations v0.13.6
 [90m [872c559c] [39mNNlib v0.8.8
 [90m [276daf66] [39mSpecialFunctions v2.1.6
 [90m [a759f4b9] [39mTimerOutputs v0.5.20
 [90m [ade2ca70] [39mDates `@stdlib/Dates`
 [90m [8ba89e20] [39mDistributed `@stdlib/Distributed`
 [90m [37e2e46d] [39mLinearAlgebra `@stdlib/LinearAlgebra`
 [90m [de0858da] [39mPrintf `@stdlib/Printf`
 [90m [3fa0cd96] [39mREPL `@stdlib/REPL`
 [90m [9a3f8284] [39mRandom `@stdlib/Random`
 [90m [2f01184e] [39mSparseArrays `@stdlib/SparseArrays`
 [90m [10745b16] [39mStatistics `@stdlib/Statistics`
 [90m [8dfed614] [39mTest `@stdlib/Test`
[32m[1m      Status[22m[39m `/tmp

                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                                (2) |     5.98 |   0.00 |  0.0 |       0.00 |   137.00 |   0.10 |  1.7 |     508.95 |   897.57 |
gpuarrays/indexing scalar                     (2) |    28.52 |   0.00 |  0.0 |       0.01 |   149.00 |   1.57 |  5.5 |    4711.30 |   897.57 |
gpuarrays/reductions/reducedim!               (2) |   102.31 |   0.00 |  0.0 |       1.03 |   151.0
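
Running the whole test suite can take a long time. As a quicker sanity check, the following minimal sketch asks CUDA.jl whether it found a usable GPU and runs one tiny array operation on it. It assumes only that the CUDA package is installed; the printed messages are illustrative.

```julia
using CUDA

# CUDA.functional() returns true when a usable driver and GPU were found
if CUDA.functional()
    dev = CUDA.device() # the device currently in use
    println("GPU in use: ", CUDA.name(dev),
            " (compute capability ", CUDA.capability(dev), ")")

    # Tiny smoke test: add 1 to a vector of ones on the GPU, copy the result back
    a = CUDA.ones(Float32, 4)
    @assert Array(a .+ 1f0) == fill(2.0f0, 4)
else
    println("CUDA.jl is installed, but no usable GPU or driver was found")
end
```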

# Really simple example

In this first simple example we compare CPU and GPU implementations of an element-wise multiplication. Let's start with the CPU implementation and create a test problem with large arrays:


In [None]:
N = 2^22 # length of both vectors
x = fill(1.0f0, N) # x vector
y = fill(2.0f0, N) # y vector

# result and test
r = x.*y

using Test 
@test (all(r.==x[1]*y[1]))

```julia
Test Passed
```

Let's now implement a serial version, `serial_cpu_multiply`, and a multithreaded version, `parallel_cpu_multiply`:


In [None]:
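# The number of CPU threads is fixed when Julia starts (`julia --threads auto` or the
# JULIA_NUM_THREADS environment variable); assigning a variable here would have no effect
Threads.nthreads() # number of threads available in this session

# Declare serial function cpu (it writes into the global `r` created in the previous cell)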
function serial_cpu_multiply(x,y)
    for i in eachindex(x,y)
        @inbounds r[i] = x[i]*y[i]
    end
    return r 
end


# Declare parallel function cpu (it also writes into the global `r`)
function parallel_cpu_multiply(x,y)
    Threads.@threads for i in eachindex(x,y)
        @inbounds r[i] = x[i]*y[i]
    end
    return r 
end

# Execute both functions for the x and y vectors (both calls return the same global `r`)
r_serial_cpu = serial_cpu_multiply(x,y)
r_parallel_cpu = parallel_cpu_multiply(x,y)

# Run test 

@test (all(r_parallel_cpu.==(x[1]*y[1])) && all(r_serial_cpu.==(x[1]*y[1])))


```julia
Test Passed
```

Let's now measure the execution time with the `BenchmarkTools` package:

In [None]:
# Install the package if it is not available yet
#Pkg.add("BenchmarkTools")

# Let's use it 
using BenchmarkTools

# Serial and parallel cpu multiply
@btime serial_cpu_multiply(x,y)
@btime parallel_cpu_multiply(x,y)

```julia
68.364 ms (8388097 allocations: 127.99 MiB)
```
and 
```julia
26.242 ms (8388138 allocations: 128.00 MiB)
```
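
The allocation counts above (roughly two per element) most likely come from the loops writing into the non-constant global `r`, which Julia cannot type-infer inside the functions. A minimal sketch of an alternative, assuming hypothetical in-place variants `serial_cpu_multiply!` and `parallel_cpu_multiply!` that are not part of the original cells, passes the output vector as an argument instead:

```julia
# Hypothetical in-place variants: the output vector is an argument, not a global,
# so the compiler knows its type inside the loop.
function serial_cpu_multiply!(r, x, y)
    for i in eachindex(r, x, y)
        @inbounds r[i] = x[i] * y[i]
    end
    return r
end

function parallel_cpu_multiply!(r, x, y)
    # The iterations are split across the threads Julia was started with
    Threads.@threads for i in eachindex(r, x, y)
        @inbounds r[i] = x[i] * y[i]
    end
    return r
end

N = 2^22
x = fill(1.0f0, N)
y = fill(2.0f0, N)

# Separate output buffers, so the two results do not alias each other
r_serial = similar(x)
r_parallel = similar(x)

serial_cpu_multiply!(r_serial, x, y)
parallel_cpu_multiply!(r_parallel, x, y)

@assert all(r_serial .== 2.0f0)
@assert r_serial == r_parallel
```

When benchmarking these variants, BenchmarkTools recommends interpolating global variables with `$`, for example `@btime serial_cpu_multiply!($r_serial, $x, $y)`, so that access to the globals is not part of the measurement.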

Let's now implement it using the GPU:

In [None]:
# load CUDA 
using CUDA 

# define the vectors on the GPU 
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)
r_d = CUDA.fill(0.0f0, N)


# define the function 
# NOTE: the `return` leaves the function before `CUDA.@sync` gets to synchronize
function multiply_gpu(x,y)
    CUDA.@sync begin 
        return x.*y
    end
end

# execute and measure time
@btime multiply_gpu(x_d, y_d)

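At first sight we obtain a much faster implementation:
```julia
3.427 μs (40 allocations: 2.83 KiB)
```

This figure is most likely too optimistic, though. The `CUDA.@sync` macro forces the CPU to wait until the GPU has finished its work, but it can only synchronize after the wrapped expression has been evaluated, and the `return` inside the block leaves `multiply_gpu` before that happens. The benchmark therefore mostly measures the time needed to launch the broadcast, not the time needed to execute it. Writing `CUDA.@sync x .* y` as the last expression of the function, without `return`, times the whole computation. Outside of benchmarks you rarely need to synchronize explicitly: many operations, like copying memory from the GPU to the CPU, implicitly synchronize execution. 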
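
As a minimal sketch of how the synchronization can be made to take effect, the two hypothetical helpers below (`multiply_gpu_synced` and `multiply_gpu_synced!`, names introduced here for illustration) keep `return` out of the `CUDA.@sync` block. The second one also writes into a preallocated output vector, so no new GPU array is allocated on each call. A CUDA-capable GPU is assumed.

```julia
using CUDA

# Hypothetical variant of `multiply_gpu`: the block is the last expression,
# so CUDA.@sync synchronizes first and then hands back the result.
function multiply_gpu_synced(x, y)
    CUDA.@sync begin
        x .* y
    end
end

# Hypothetical in-place variant: the result is written into `r`
function multiply_gpu_synced!(r, x, y)
    CUDA.@sync begin
        r .= x .* y
    end
    return r
end

N = 2^22
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)
r_d = CUDA.fill(0.0f0, N)

multiply_gpu_synced!(r_d, x_d, y_d)

# Copy the results back to the CPU to check them
@assert all(Array(r_d) .== 2.0f0)
@assert Array(multiply_gpu_synced(x_d, y_d)) == Array(r_d)
```

Either helper can be timed as before, for example with `@btime multiply_gpu_synced!($r_d, $x_d, $y_d)`; the result then includes the time the GPU needs to finish the multiplication.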

High-level array operations like this are convenient, but for more specialized tasks we need to go a level lower and write our own kernels. Let's implement a kernel for the same multiplication:

In [None]:
# create my kernel
function multiply_gpu_kernel!(x, y, r)
    for i in 1:length(x)
         r[i] = x[i]*y[i]
    end
    return nothing
end


# execute the kernel; without the `threads` and `blocks` keywords, `@cuda` launches a single thread in a single block
@cuda multiply_gpu_kernel!(x_d, y_d, r_d)

# let's create a benchmark function to test it 
function bench_multiply_gpu(x_d, y_d, r_d)
    CUDA.@sync begin
        @cuda multiply_gpu_kernel!(x_d, y_d, r_d)
    end
end

@btime bench_multiply_gpu(x_d, y_d, r_d)

```julia
435.169 ms (61 allocations: 3.89 KiB)
```
That is much slower than the other implementation. What happened?

The kernel was launched with the default configuration of a single thread in a single block, so the whole loop runs sequentially on one GPU thread.

Once the data lives in the `CuArray`s `x_d`, `y_d` and `r_d`, we launch our kernel via `@cuda`. On first use, the `@cuda` macro compiles the kernel (`multiply_gpu_kernel!`) for execution on the GPU. Once compiled, future invocations are fast. You can read the documentation of `@cuda` using `?@cuda` from the Julia prompt.

```
cuda
  @cuda [kwargs...] func(args...)

  High-level interface for executing code on a GPU. The @cuda macro should
  prefix a call, with func a callable function or object that should return
  nothing. It will be compiled to a CUDA function upon first use, and to a
  certain extent arguments will be converted and managed automatically using
  cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel
  launch on the current CUDA context.

  Several keyword arguments are supported that influence the behavior of
  @cuda.

    •  launch: whether to launch this kernel, defaults to true. If
       false the returned kernel object should be launched by calling
       it and passing arguments again.

    •  dynamic: use dynamic parallelism to launch device-side kernels,
       defaults to false.

    •  arguments that influence kernel compilation: see cufunction and
       dynamic_cufunction

    •  arguments that influence kernel launch: see CUDA.HostKernel and
       CUDA.DeviceKernel
```
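
To actually use the GPU in parallel, the kernel has to split the work across threads and blocks. The following is a minimal sketch using a grid-stride loop, assuming a CUDA-capable GPU. The kernel name `multiply_gpu_kernel_parallel!` and the choice of 256 threads per block are illustrative assumptions; `threadIdx`, `blockIdx`, `blockDim` and `gridDim` are the CUDA.jl intrinsics that tell each thread where it sits in the launch grid.

```julia
using CUDA

# Hypothetical parallel version of the kernel: each thread starts at its own
# global index and advances by the total number of launched threads.
function multiply_gpu_kernel_parallel!(x, y, r)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(x)
        @inbounds r[i] = x[i] * y[i]
    end
    return nothing
end

N = 2^22
x_d = CUDA.fill(1.0f0, N)
y_d = CUDA.fill(2.0f0, N)
r_d = CUDA.fill(0.0f0, N)

threads = 256             # threads per block (an assumed, commonly used value)
blocks = cld(N, threads)  # enough blocks to cover all N elements

# Launch with an explicit configuration and wait for the GPU to finish
CUDA.@sync @cuda threads=threads blocks=blocks multiply_gpu_kernel_parallel!(x_d, y_d, r_d)

# Copy the result back to the CPU and check it
@assert all(Array(r_d) .== 2.0f0)
```

The loop with a stride keeps the kernel correct even when fewer threads than elements are launched, which makes it easy to experiment with different values of `threads` and `blocks`.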


# Profiling 

It is often important to profile a GPU program, to check for coalesced memory access, race conditions, memory management problems, etc. For that, we can use the `nvprof` tool from NVIDIA. On a Unix system we execute:

In [None]:
$ nvprof --profile-from-start off /path/to/julia

Here `/path/to/julia` is the path to the Julia binary. Because of `--profile-from-start off`, the profiler does not start collecting immediately; instead, we start it from Julia by wrapping the code of interest in the `CUDA.@profile` macro:

In [None]:
CUDA.@profile bench_multiply_gpu(x_d, y_d, r_d)

However, nvprof is no longer used for GPUs with compute capabilities newer than 7.0; instead we need `nsys` (Nsight Systems). To run Julia under `nsys` we execute:

```bash
nsys launch julia --threads auto  
```
Then we can execute the previous code, adding the `CUDA.@profile` macro in front of the call we want to profile:
```julia 
julia> using CUDA

julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a);

julia> CUDA.@profile sin.(a);
```

Then we open Nsight Systems using nsys-cli.